Winboard Forum

by **Tord Romstad** » 10 Feb 2006, 21:28

Hi all,

My first SMP version of Glaurung is now mostly finished (it will need a few weeks of polishing before it is ready for a public release, though). The code is still quite simple and not very general, and will not work with more than 2 CPUs. After much consideration I ended up using threads rather than processes. It just seemed simpler and cleaner to me this way.

Writing the SMP version was a much easier and less time-consuming task than I had imagined. I spent a couple of hours last weekend eliminating global variables and adding new parameters to a few functions, and another couple of hours last night writing and debugging the parallell search itself (a simple implementation of YBWC). The parallell search speedup seems reasonably good, in the few test positions I have tried so far I get an average speedup somewhere around 1.7-2.0. How well it works in games remains to be seen. I am sure there are lots of unexpected bugs and problems waiting to be found.

A few questions to the more experienced SMP programmers:

I currently don't split at the root. Adding this wouldn't be technically difficult, but it would make my code more complicated and messy. How much can I expect to gain by splitting at the root? Is it worth doing?
If I should split at the root, how am I supposed to use the "info currmove" command in the UCI interface? How do I tell the GUI that CPU 1 is searching the move 1. e4, while CPU 2 is searching the move 1. d4?
The main transposition table should obviously be shared by all threads. But what do you do about the other hash tables (pawn hash table, evaluation hash table, etc.)? Should they be shared, or is it better to have a separate hash table for each thread? Currently I have chosen separate tables for each thread, in order to avoid too much locking and unlocking.

Tord

by **mridul** » 11 Feb 2006, 08:29

Hi Tord,

Tord Romstad wrote:Hi all,

My first SMP version of Glaurung is now mostly finished (it will need a few weeks of polishing before it is ready for a public release, though). The code is still quite simple and not very general, and will not work with more than 2 CPUs. After much consideration I ended up using threads rather than processes. It just seemed simpler and cleaner to me this way.

Congrats !

Writing the SMP version was a much easier and less time-consuming task than I had imagined. I spent a couple of hours last weekend eliminating global variables and adding new parameters to a few functions, and another couple of hours last night writing and debugging the parallell search itself (a simple implementation of YBWC). The parallell search speedup seems reasonably good, in the few test positions I have tried so far I get an average speedup somewhere around 1.7-2.0. How well it works in games remains to be seen. I am sure there are lots of unexpected bugs and problems waiting to be found.

A few questions to the more experienced SMP programmers:

Not really experienced , but I can give it a shot...

I currently don't split at the root. Adding this wouldn't be technically difficult, but it would make my code more complicated and messy. How much can I expect to gain by splitting at the root? Is it worth doing?

I dont split at the root too ... YMMV ofcourse.

If I should split at the root, how am I supposed to use the "info currmove" command in the UCI interface? How do I tell the GUI that CPU 1 is searching the move 1. e4, while CPU 2 is searching the move 1. d4?

Just to get an idea of what you are doing , consider this case :
* Processor 1 is searching at ply 2
* It decides that other processors can search this ply's subtree.
* Processor 2 starts searching a move from this subtree.
* Before processor 2 completes , all other moves are exhausted by processor 1 at that ply.

What do you do in this case ?
I guess the answer to your question will depend on this ...

The main transposition table should obviously be shared by all threads. But what do you do about the other hash tables (pawn hash table, evaluation hash table, etc.)? Should they be shared, or is it better to have a separate hash table for each thread? Currently I have chosen separate tables for each thread, in order to avoid too much locking and unlocking.
Tord

There is a lockless way to access transposition table by Tim Mann (I think) - really useful :-)

I dont share any other table - so each process has its own killer/eval cache/king eval cache/qsearch hash/etc.

Not quite easy to get the others lockless (locking would be too expensive in the long run) and the benifits of sharing these are not that high ...

Hope this helps,
Regards,
Mridul

by **Tord Romstad** » 11 Feb 2006, 14:11

Hi Mridul,

Thanks for your reply!

mridul wrote:Hi Tord,

Tord Romstad wrote:Hi all,

My first SMP version of Glaurung is now mostly finished (it will need a few weeks of polishing before it is ready for a public release, though). The code is still quite simple and not very general, and will not work with more than 2 CPUs. After much consideration I ended up using threads rather than processes. It just seemed simpler and cleaner to me this way.

Congrats !

A bit too early to congratulate, I'm afraid. There are still a few bugs. I have seen the program crash a couple of times, and I have seen at least one mate announcement I don't quite believe in (posted at the CCC). The parallel search seems to be 99% correct, but I fear that getting it 100% correct will take a lot of time. I also suspect that going from 2 to n CPUs will prove far more difficult than going from 1 to 2 CPUs.

By the way, are there any good ways to simulate more than 2 CPUs on a dual CPU computer?

A few questions to the more experienced SMP programmers:

Not really experienced , but I can give it a shot...
I currently don't split at the root. Adding this wouldn't be technically difficult, but it would make my code more complicated and messy. How much can I expect to gain by splitting at the root? Is it worth doing?

I dont split at the root too ... YMMV ofcourse.

OK, thanks.

If I should split at the root, how am I supposed to use the "info currmove" command in the UCI interface? How do I tell the GUI that CPU 1 is searching the move 1. e4, while CPU 2 is searching the move 1. d4?

Just to get an idea of what you are doing , consider this case :
* Processor 1 is searching at ply 2
* It decides that other processors can search this ply's subtree.
* Processor 2 starts searching a move from this subtree.
* Before processor 2 completes , all other moves are exhausted by processor 1 at that ply.

What do you do in this case ?

Processor 1 enters an idle loop, and waits for processor 2 to complete its work. While processor 1 is idle, processor 2 can give it work to do inside the sub-tree it is searching. The idea is called the "helpful master" concept in the YBW papers, I think.

I guess the answer to your question will depend on this ...

Why? Perhaps I am dense, but I cannot see how it has any relevance to my question.

The main transposition table should obviously be shared by all threads. But what do you do about the other hash tables (pawn hash table, evaluation hash table, etc.)? Should they be shared, or is it better to have a separate hash table for each thread? Currently I have chosen separate tables for each thread, in order to avoid too much locking and unlocking.
Tord

There is a lockless way to access transposition table by Tim Mann (I think) - really useful

Yes, I have seen that (but I thought it was by Bob Hyatt), and intend to use it for my main transposition table.

I dont share any other table - so each process has its own killer/eval cache/king eval cache/qsearch hash/etc.

Just what I do, then.

Not quite easy to get the others lockless (locking would be too expensive in the long run) and the benifits of sharing these are not that high ...

In the case of the pawn hash table, you are probably right. My evaluation hash table is much less effective when I don't share it between threads, though.

Hope this helps,

It does. Thanks for the input!

Tord

by **mridul** » 11 Feb 2006, 14:42

Hi Tord,

Tord Romstad wrote:Hi Mridul,

Thanks for your reply!
mridul wrote:Hi Tord,

Tord Romstad wrote:Hi all,

My first SMP version of Glaurung is now mostly finished (it will need a few weeks of polishing before it is ready for a public release, though). The code is still quite simple and not very general, and will not work with more than 2 CPUs. After much consideration I ended up using threads rather than processes. It just seemed simpler and cleaner to me this way.

Congrats !

A bit too early to congratulate, I'm afraid. There are still a few bugs. I have seen the program crash a couple of times, and I have seen at least one mate announcement I don't quite believe in (posted at the CCC). The parallel search seems to be 99% correct, but I fear that getting it 100% correct will take a lot of time. I also suspect that going from 2 to n CPUs will prove far more difficult than going from 1 to 2 CPUs.

Until I got it 100% correct (I think it is) I had all sorts of weird bugs - shouldn't be too tough to notice.
Like , your main pv/score will be different between single proc and parallel version : theoretically , this is possible to happen ... but in practise should happen extremely rarely.
Even now , I just need to recompile with -DMP_DEBUG to test that parallel search is not broken.
All preconditions and postconditions where multiple processes interact (be it in locks or sharing data , etc) are tested for the assumption made while entry and exit time.
An example of a class of bugs which I found through this which was totally unintutive (atleast to me) was not setting a variable as 'volatile' when it should be...

By the way, are there any good ways to simulate more than 2 CPUs on a dual CPU computer?

Well , the major problem with simulating MP search on a single processor is that the amount of concurrency you can simulate is dependent on the granuality of the OS context switch (we usually dont block on IO in chess engines) : and since this granuality is in order of milliseconds , we almost never notice any concurrency problems.

Even on a dual , this is markedly changed : you may not identify all the concurrency bugs , but the unpredictability of a context switch is much higher.
Before I tested on a quad , I just ran 4 process search on my dual and did find a bug (a minor one , thankfully

) ... did not catch any on the quad.
Ofcourse , having more number of processors to test on is always helpful :-D

When you do this , the nps will suck , the scaling will suck - but the parallel will get validated to some extent until you find a quad or higher (I use spinlock type busy wait while searching for a split point in the children).

A few questions to the more experienced SMP programmers:

Not really experienced , but I can give it a shot...
I currently don't split at the root. Adding this wouldn't be technically difficult, but it would make my code more complicated and messy. How much can I expect to gain by splitting at the root? Is it worth doing?

I dont split at the root too ... YMMV ofcourse.

OK, thanks.
If I should split at the root, how am I supposed to use the "info currmove" command in the UCI interface? How do I tell the GUI that CPU 1 is searching the move 1. e4, while CPU 2 is searching the move 1. d4?

Just to get an idea of what you are doing , consider this case :
* Processor 1 is searching at ply 2
* It decides that other processors can search this ply's subtree.
* Processor 2 starts searching a move from this subtree.
* Before processor 2 completes , all other moves are exhausted by processor 1 at that ply.

What do you do in this case ?

Processor 1 enters an idle loop, and waits for processor 2 to complete its work. While processor 1 is idle, processor 2 can give it work to do inside the sub-tree it is searching. The idea is called the "helpful master" concept in the YBW papers, I think.
I guess the answer to your question will depend on this ...

Why? Perhaps I am dense, but I cannot see how it has any relevance to my question.

The reason why I asked is to find out how you search , so that I will not say some stupid suggestion :-)

If you dont split at root , and the initial thread which started search continues to be the owner of the search , then cant you not always return the currmove of that thread ? (and the pvline until it splits under a child).

The main transposition table should obviously be shared by all threads. But what do you do about the other hash tables (pawn hash table, evaluation hash table, etc.)? Should they be shared, or is it better to have a separate hash table for each thread? Currently I have chosen separate tables for each thread, in order to avoid too much locking and unlocking.
Tord

There is a lockless way to access transposition table by Tim Mann (I think) - really useful

Yes, I have seen that (but I thought it was by Bob Hyatt), and intend to use it for my main transposition table.
I dont share any other table - so each process has its own killer/eval cache/king eval cache/qsearch hash/etc.

Just what I do, then.

Not quite easy to get the others lockless (locking would be too expensive in the long run) and the benifits of sharing these are not that high ...

In the case of the pawn hash table, you are probably right. My evaluation hash table is much less effective when I don't share it between threads, though.

I dont know how you store things in eval cache , but I could not get it lockless ... mine is an extremely inefficient datastructure.
I found that there is like a 4% or so cache hit reduction when I did not share it , but the penality of sharing and locking it would be much worse I guess.
More than eval cache , I was interested in sharing killers ... they effect the move ordering and unpredictable orders are sometimes witnessed when different processes have different move ordering 'cos of their individual processes killer info (this can be damn frustrating when you are debugging the search tree dump to get to that crazy bug ...).
Thankfully , I dont use history :-P

In practise , it is ok to overwrite the values , mess it up , etc since we just use killers and history as hints ... but I kind of prefer more predictability.
Also , if you run on NUMA box , it makes more sense to try to maintain data used by a thread/process in its local memory (I did not want to design something which will have problems in the future 'when' I run on a NUMA box

).

Until you get parallel search working well , you will have endless hours of irritation (fun ?

) debugging your code/output , moments of "duh ! that was so stupid of me" when you find the bug and really really high amount of satisfaction when you see your program blaze away !

Have fun !
Hope this helps.
Mridul

Hope this helps,

It does. Thanks for the input!

Tord

by **Tord Romstad** » 12 Feb 2006, 11:25

mridul wrote:Hi Tord,

Tord Romstad wrote:Hi Mridul,

Thanks for your reply!
mridul wrote:Hi Tord,

Tord Romstad wrote:Hi all,

My first SMP version of Glaurung is now mostly finished (it will need a few weeks of polishing before it is ready for a public release, though). The code is still quite simple and not very general, and will not work with more than 2 CPUs. After much consideration I ended up using threads rather than processes. It just seemed simpler and cleaner to me this way.

Congrats !

A bit too early to congratulate, I'm afraid. There are still a few bugs. I have seen the program crash a couple of times, and I have seen at least one mate announcement I don't quite believe in (posted at the CCC). The parallel search seems to be 99% correct, but I fear that getting it 100% correct will take a lot of time. I also suspect that going from 2 to n CPUs will prove far more difficult than going from 1 to 2 CPUs.

Until I got it 100% correct (I think it is) I had all sorts of weird bugs - shouldn't be too tough to notice. Like , your main pv/score will be different between single proc and parallel version : theoretically , this is possible to happen ... but in practise should happen extremely rarely.

It happens rather often to me, but this is not unexpected. Because of various theoretically unsound tricks I do in my search, my PVs and scores often change slightly even when I just change the hash table size in single CPU mode.

Even now , I just need to recompile with -DMP_DEBUG to test that parallel search is not broken.
All preconditions and postconditions where multiple processes interact (be it in locks or sharing data , etc) are tested for the assumption made while entry and exit time.
An example of a class of bugs which I found through this which was totally unintutive (atleast to me) was not setting a variable as 'volatile' when it should be...

I also had a few of these. In fact, I didn't even know what the 'volatile' keyword did until I started writing my parallel search on Thursday night. I have never written a multi-threaded C program before.

By the way, are there any good ways to simulate more than 2 CPUs on a dual CPU computer?

Well , the major problem with simulating MP search on a single processor is that the amount of concurrency you can simulate is dependent on the granuality of the OS context switch (we usually dont block on IO in chess engines) : and since this granuality is in order of milliseconds , we almost never notice any concurrency problems.

Even on a dual , this is markedly changed : you may not identify all the concurrency bugs , but the unpredictability of a context switch is much higher.
Before I tested on a quad , I just ran 4 process search on my dual and did find a bug (a minor one , thankfully ) ... did not catch any on the quad.
Ofcourse , having more number of processors to test on is always helpful
When you do this , the nps will suck , the scaling will suck - but the parallel will get validated to some extent until you find a quad or higher (I use spinlock type busy wait while searching for a split point in the children).

Thanks for the advice. I'll also just try running numerous threads on my dual, then.

A few questions to the more experienced SMP programmers:

Not really experienced , but I can give it a shot...
I currently don't split at the root. Adding this wouldn't be technically difficult, but it would make my code more complicated and messy. How much can I expect to gain by splitting at the root? Is it worth doing?

I dont split at the root too ... YMMV ofcourse.

OK, thanks.
If I should split at the root, how am I supposed to use the "info currmove" command in the UCI interface? How do I tell the GUI that CPU 1 is searching the move 1. e4, while CPU 2 is searching the move 1. d4?

Just to get an idea of what you are doing , consider this case :
* Processor 1 is searching at ply 2
* It decides that other processors can search this ply's subtree.
* Processor 2 starts searching a move from this subtree.
* Before processor 2 completes , all other moves are exhausted by processor 1 at that ply.

What do you do in this case ?

Processor 1 enters an idle loop, and waits for processor 2 to complete its work. While processor 1 is idle, processor 2 can give it work to do inside the sub-tree it is searching. The idea is called the "helpful master" concept in the YBW papers, I think.
I guess the answer to your question will depend on this ...

Why? Perhaps I am dense, but I cannot see how it has any relevance to my question.

The reason why I asked is to find out how you search , so that I will not say some stupid suggestion
If you dont split at root , and the initial thread which started search continues to be the owner of the search , then cant you not always return the currmove of that thread ? (and the pvline until it splits under a child).

You are right, at the moment there is no problem. Because I don't split at the root, both CPUs will always search the same root move. If you read my question more carefully, you will see that I asked how I am supposed to handle this problem if I should start splitting at the root. At the moment, it seems that this is not worth doing, so there is no problem.

The main transposition table should obviously be shared by all threads. But what do you do about the other hash tables (pawn hash table, evaluation hash table, etc.)? Should they be shared, or is it better to have a separate hash table for each thread? Currently I have chosen separate tables for each thread, in order to avoid too much locking and unlocking.
Tord

There is a lockless way to access transposition table by Tim Mann (I think) - really useful

Yes, I have seen that (but I thought it was by Bob Hyatt), and intend to use it for my main transposition table.
I dont share any other table - so each process has its own killer/eval cache/king eval cache/qsearch hash/etc.

Just what I do, then.

Not quite easy to get the others lockless (locking would be too expensive in the long run) and the benifits of sharing these are not that high ...

In the case of the pawn hash table, you are probably right. My evaluation hash table is much less effective when I don't share it between threads, though.

I dont know how you store things in eval cache , but I could not get it lockless ... mine is an extremely inefficient datastructure.
I found that there is like a 4% or so cache hit reduction when I did not share it , but the penality of sharing and locking it would be much worse I guess.

The main benefit of the evaluation hash table for me is to re-use the computations from the previous iteration. I evaluate all internal nodes. At iteration N+1, most of the internal nodes will already have been evaluated at iteration N, and as long as the eval hash table is not saturated I get it all for free. With a separate eval hash table for each thread, it is much less effective.

It's no big deal, though. The eval hash table gives only a small speedup in any case.

More than eval cache , I was interested in sharing killers ... they effect the move ordering and unpredictable orders are sometimes witnessed when different processes have different move ordering 'cos of their individual processes killer info (this can be damn frustrating when you are debugging the search tree dump to get to that crazy bug ...).
Thankfully , I dont use history :-P

I haven't even started thinking about sharing killers.

In practise , it is ok to overwrite the values , mess it up , etc since we just use killers and history as hints ... but I kind of prefer more predictability.

Predictability is a lost case for me in any case, I fear.

Also , if you run on NUMA box , it makes more sense to try to maintain data used by a thread/process in its local memory (I did not want to design something which will have problems in the future 'when' I run on a NUMA box ).

No comments about this. I don't even know what NUMA is.

Until you get parallel search working well , you will have endless hours of irritation (fun ? ) debugging your code/output , moments of "duh ! that was so stupid of me" when you find the bug and really really high amount of satisfaction when you see your program blaze away !

In fact, it already seems to work very well. Since yesterday afternoon, I haven't found any new bugs. The program seems to be stable, and scores extremely well at test suites. Of course it is still possible that I will experience all sorts of problems when I start playing games, but so far I am optimistic.

Have fun !
Hope this helps.

It helps a lot, and I have great fun.

Tord

by **Daniel Shawul** » 15 Feb 2006, 07:36

Hi all

There is a lockless way to access transposition table by Tim Mann (I think) - really useful
I dont share any other table - so each process has its own killer/eval cache/king eval cache/qsearch hash/etc.

Not quite easy to get the others lockless (locking would be too expensive in the long run) and the benifits of sharing these are not that high ...

I dont lock tt table, i just check if the move i got is a valid move.
The is_this_a_valid_move function should check all kinds of things that a
move could be invalid. I dont share killers like you but i share the history table loackless. I do share pawn transposition table and eval cache, but i don't lock it either... Do you think this is a major problem?
Not sharing them is ofcourse one solution that i should try.

By the way, are there any good ways to simulate more than 2 CPUs on a dual CPU computer?

No. But you can work on the dual and get a 98% working parllel search.
What i did was write everything in to a log file, split info, current move being searched etc. Then i made Dann run some tests for me, which helped me catch some bugs related to concurrency issues. I think you have to do this once to be sure.

Quote:
Also , if you run on NUMA box , it makes more sense to try to maintain data used by a thread/process in its local memory (I did not want to design something which will have problems in the future 'when' I run on a NUMA box ).

No comments about this. I don't even know what NUMA is.

NUMA - non uniform memroy access. Each processor has its own local memroy where it can access it fast , i guess. In this case it may be good not to share many things. I think that the hash table is the only important thing that gives significant performance gain.

Daniel

by **Tord Romstad** » 15 Feb 2006, 13:20

Daniel Shawul wrote:Hi all

There is a lockless way to access transposition table by Tim Mann (I think) - really useful
I dont share any other table - so each process has its own killer/eval cache/king eval cache/qsearch hash/etc.

Not quite easy to get the others lockless (locking would be too expensive in the long run) and the benifits of sharing these are not that high ...

I dont lock tt table, i just check if the move i got is a valid move.
The is_this_a_valid_move function should check all kinds of things that a
move could be invalid. I dont share killers like you but i share the history table loackless. I do share pawn transposition table and eval cache, but i don't lock it either...

All of this is exactly like I currently do it, except for the pawn hash table, where I use a separate table for each thread. So far, I haven't seen any problems caused by not locking the eval hash table and the main transposition table.

Do you think this is a major problem?

I don't have enough experience to give a good answer, but in my program sharing the pawn hash table without locking wouldn't work at all. It would cause the program to crash.

By the way, are there any good ways to simulate more than 2 CPUs on a dual CPU computer?

No. But you can work on the dual and get a 98% working parllel search.
What i did was write everything in to a log file, split info, current move being searched etc. Then i made Dann run some tests for me, which helped me catch some bugs related to concurrency issues. I think you have to do this once to be sure.

Thanks for the info. I think I will just postpone the problem of more than 2 threads until later. So far, there are very few users with more than 2 CPUs anyway.

NUMA - non uniform memroy access. Each processor has its own local memroy where it can access it fast , i guess. In this case it may be good not to share many things. I think that the hash table is the only important thing that gives significant performance gain.

Thanks for the explanation! Are computers using NUMA common? I really don't want to make several slightly different versions of my program optimised for different types of multi-CPU computers.

One more question: The SMP version of Glaurung works fine when it is the only CPU-intensive program on the computer, but in engine vs engine matches there is a problem: Even with pondering disabled, Glaurung hogs one of the two CPUs 100% when it's not thinking. The reason is that the helper thread is always running in an idle loop, waiting for something to search. The loop looks like this:

Code: Select all: void idle_loop(int thread_id, split_point_t *wait_sp) { Threads[thread_id].running = true; while(1) { if(AllThreadsShouldExit && thread_id != 0) break; if(Threads[thread_id].work_is_waiting) { Threads[thread_id].work_is_waiting = false; smp_search(Threads[thread_id].split_point, thread_id); Threads[thread_id].idle = true; } if(wait_sp != NULL && wait_sp->cpus == 0) return; if(wait_sp != NULL && RootSearchInfo.thinking_status == ABORTED) return; } Threads[thread_id].running = false; }

The helper thread enters this loop when the program starts up, and only leaves it when it is given something to search. Is there a simple way to reduce the priority of the helper thread when the program is not thinking?

Tord

by **Alessandro Scotti** » 15 Feb 2006, 14:07

Tord Romstad wrote:The helper thread enters this loop when the program starts up, and only leaves it when it is given something to search. Is there a simple way to reduce the priority of the helper thread when the program is not thinking?

You could block it on a condition when not thinking. Code would look like this (warning: untested and written online):

Code: Select all: pthread_mutex_t WaitLock = PTHREAD_MUTEX_INITIALIZER; pthread_cond_t WaitCond = PTHREAD_COND_INITIALIZER; ...in the helper thread... while( state == SIT_IDLE ) { pthread_mutex_lock( &WaitLock ); if( state == SIT_IDLE ) { pthread_cond_wait( &WaitCond, &WaitLock ); } pthread_mutex_unlock( &WaitLock ); } ...somewhere else the state is set... pthread_mutex_lock( &WaitLock ); state = DO_SOMETHING; // Was: SIT_IDLE pthread_cond_signal( &WaitCond ); pthread_mutex_unlock( &WaitLock );

Will cost you a few extra cycles but only when switching from idle to thinking.

by **Tord Romstad** » 15 Feb 2006, 14:29

Alessandro Scotti wrote:
Tord Romstad wrote:The helper thread enters this loop when the program starts up, and only leaves it when it is given something to search. Is there a simple way to reduce the priority of the helper thread when the program is not thinking?

You could block it on a condition when not thinking. Code would look like this (warning: untested and written online):

Code: Select all
pthread_mutex_t WaitLock = PTHREAD_MUTEX_INITIALIZER; pthread_cond_t WaitCond = PTHREAD_COND_INITIALIZER; ...in the helper thread... while( state == SIT_IDLE ) { pthread_mutex_lock( &WaitLock ); if( state == SIT_IDLE ) { pthread_cond_wait( &WaitCond, &WaitLock ); } pthread_mutex_unlock( &WaitLock ); } ...somewhere else the state is set... pthread_mutex_lock( &WaitLock ); state = DO_SOMETHING; // Was: SIT_IDLE pthread_cond_signal( &WaitCond ); pthread_mutex_unlock( &WaitLock );

Will cost you a few extra cycles but only when switching from idle to thinking.

Thanks!

pthread_cond_t was what I was looking for. I'll give your code a try as soon as my ongoing test match is finished (phenomenal results so far; I hope it's not just because Glaurung is cheating by stealing CPU time while the opponent is thinking).

How do I do something equivalent with Win32 threads? More generally, is there some kind of Win32 threads <-> POSIX threads dictionary to be found somewhere on the Web?

I apologise for all the elementary questions. This is my first multithreaded C program, and everything is new to me.

Tord

by **Alessandro Scotti** » 15 Feb 2006, 15:04

Tord Romstad wrote:pthread_cond_t was what I was looking for. I'll give your code a try as soon as my ongoing test match is finished (phenomenal results so far; I hope it's not just because Glaurung is cheating by stealing CPU time while the opponent is thinking).

Great! According to some estimates on CCC a good dual search will give 50-70 elo!!!

Tord Romstad wrote:How do I do something equivalent with Win32 threads? More generally, is there some kind of Win32 threads <-> POSIX threads dictionary to be found somewhere on the Web?

Threads are very easy to use in Win32. Here's some quotes adapted from Kiwi's system library:

Code: Select all: // Thread function DWORD WINAPI ReadInputThread( LPVOID param ); // Win32 void * ReadInputThread( void * param ); // POSIX // Create thread (Win32) DWORD dwId; HANDLE hInputThread = CreateThread( 0, 0, ReadInputThread, param, 0, &dwId ); if( hInputThread == INVALID_HANDLE_VALUE ) ...got an error // Condition code HANDLE SitIdleEvent = CreateEvent( 0, FALSE, FALSE, 0 ); // ...wait for SIT_IDLE WaitForSingleObject( SitIdleEvent, INFINITE ); //...or timeout in ms // ...signal event SetEvent( SitIdleEvent );

An event is similar to a condition but can only hold a TRUE/FALSE value (signaled/non-signaled in Windows lingo), which in this case I have associated to the "state == SIT_IDLE" condition.

by **Tord Romstad** » 15 Feb 2006, 15:19

Alessandro Scotti wrote:
Tord Romstad wrote:pthread_cond_t was what I was looking for. I'll give your code a try as soon as my ongoing test match is finished (phenomenal results so far; I hope it's not just because Glaurung is cheating by stealing CPU time while the opponent is thinking).

Great! According to some estimates on CCC a good dual search will give 50-70 elo!!!

My own estimate so far is an improvement of about 140 Elo compared to Glaurung 1.0.2, which is of course completely ridiculous. Except for the SMP code, there is no difference worth mentioning between the two versions. Because a speed improvement of 70% or so cannot possibly cause such a big Elo jump, I am almost completely sure something is horribly wrong with my test setup. The only other possible explanation is that I have accidentally fixed some catastrophic search bug while adding the SMP code.

Realistically, I'll be happy if my first dual version turns out to be 30-40 points stronger than Glaurung 1.0.2.

Tord Romstad wrote:How do I do something equivalent with Win32 threads? More generally, is there some kind of Win32 threads <-> POSIX threads dictionary to be found somewhere on the Web?

Threads are very easy to use in Win32. Here's some quotes adapted from Kiwi's system library:

(Code snipped)

Thanks! Looks reasonably simple.

Tord

by **Daniel Shawul** » 15 Feb 2006, 17:34

The helper thread enters this loop when the program starts up, and only leaves it when it is given something to search. Is there a simple way to reduce the priority of the helper thread when the program is not thinking?

I think you have to either kill the thread or wait on a condition. What i do is wait for a condition every 500 millesecods ,this steals very little processor time. There is a Sleep(x) command in Windows which i use both for spinlocks and blocking threads. I dont know if there is an equivalent command in linux for this.

Daniel

by **Miguel A. Ballicora** » 16 Feb 2006, 03:28

Tord Romstad wrote:
Alessandro Scotti wrote:
Tord Romstad wrote:pthread_cond_t was what I was looking for. I'll give your code a try as soon as my ongoing test match is finished (phenomenal results so far; I hope it's not just because Glaurung is cheating by stealing CPU time while the opponent is thinking).

Great! According to some estimates on CCC a good dual search will give 50-70 elo!!!

My own estimate so far is an improvement of about 140 Elo compared to Glaurung 1.0.2, which is of course completely ridiculous. Except for the SMP code, there is no difference worth mentioning between the two versions. Because a speed improvement of 70% or so cannot possibly cause such a big Elo jump, I am almost completely sure something is horribly wrong with my test setup. The only other possible explanation is that I have accidentally fixed some catastrophic search bug while adding the SMP code.

with SMP, the search is different. In theory, IMHO, if your ordering is suboptimal, a parallel search will correct it (you go to the second move in the list way much quicker). We have a huge heated discussion about this in CCC ~3 years ago. Maybe 140 elo points is too much, but...

Miguel

by **Scott Gasch** » 08 Mar 2006, 05:18

Tord,

Congrats on getting your parallel search working. I don't know anything about splitting at the root; I don't do it for the same reasons you mentioned.

As for shared hash tables, I have tried both dual locks in the pawn hash table and one pawn hash table per searcher thread. I see no speed difference between the two on a dual proc machine. I think this is becasue I'm only running 2 worker threads and even if they are evaluating a position with the same pawn hash key the second one will use a different hash table slot. The reason I experimented with one hash table per thread was really to make sure that I didn't run into a bottleneck on machines with more than 2 cpus. Since I don't have one yet I left the dual probing pawn hash table in.

I don't do eval term hashing.

I have heard a lot of people with clever ways to share the main hash table without locks... but I never really understood how it worked. The way I do it is by locking "partitions" of the main hash table. Currently I think I lock 1/64th of the main hash table at a time. If the two threads are unfortunate enough to be operating on hash entrys that are in the same partition (i.e. thread A is on entry N and thread B is on entry N +/- M*64) then the second one in waits. I experimented with the size of the partition and once I got to 64 it made no impact on search speed because collisions happen so infrequently.

I read somewhere that all Opteron-based machines with more than one processor are technically NUMA though I don't know if this is accurate.

Your helper idle loop question is tricky. Here's what I think of it:

1. if you busy loop then you will eat up all spare cpu cycles and peg one cpu at 100% used but as soon as another thread needs your help you will respond and start helping.

2. if you give up your thread quantum (via Sleep(1) or usleep(1) or whatever) then you will not peg the cpu as much... your system will be much more responsive. But the disadvantage is that you may miss a "wakeup and come help" call from the other thread. The worst case scenario here is: Thread A calls Sleep(1) to give up a quantum, Thread B immediately says "Come help me search", Thread A does not get the message until the next time it is scheduled... depending on other ready threads on the system this is probably at least ~10ms later.

3. Another alternative is to have the worker threads wait on an object like a condition variable, win32 EVENT, win32 CriticalSection (which is really just a SEMAPHORE wrapped in a struct). The advantage to these is (at least for win32) as soon as one thread signals the object the waiters are woken up and begin to work. So... if the "come and help search" was one of these objects the helper threads can wait on them and not consume cpu cycles... while at the same time getting to help instantly when they are asked. The only drawbacks here is that these things may be non-portable.

4. Another options might be to have a spin loop that, after looping N times, sleeps to give up its quantum.

Anyway, I'm sure you'll think of something good. As for me, I use method 1 -- just busy loop and eat CPU cycles.

FWIW, my code is online at http://wannabe.guru.org/svn/typhoon/trunk --My SMP stuff is in search.c and split.c.

Scott

by **Tord Romstad** » 08 Mar 2006, 10:19

Hi Scott,

Thanks for your reply!

Scott Gasch wrote:Congrats on getting your parallel search working. I don't know anything about splitting at the root; I don't do it for the same reasons you mentioned.

I've actually tried splitting at the root, but it doesn't seem to help at all. The trees are no smaller than when I only split at internal nodes. This is counter-intuitive to me. The root looks like a perfect node to split, because the remaining depth is bigger than anywhere else in the tree, and because we know that we will always search all moves.

As for shared hash tables, I have tried both dual locks in the pawn hash table and one pawn hash table per searcher thread. I see no speed difference between the two on a dual proc machine. I think this is becasue I'm only running 2 worker threads and even if they are evaluating a position with the same pawn hash key the second one will use a different hash table slot. The reason I experimented with one hash table per thread was really to make sure that I didn't run into a bottleneck on machines with more than 2 cpus. Since I don't have one yet I left the dual probing pawn hash table in.

Sounds like we do the same. I also have a separate pawn hash table for each thread.

I don't do eval term hashing.

I do, currently with a single, lockless table. The risk seems to be fairly small. So far I haven't seen any problems caused by this.

I have heard a lot of people with clever ways to share the main hash table without locks... but I never really understood how it worked.

Bob describes a fairly simple trick to handle this on his web site.

The way I do it is by locking "partitions" of the main hash table. Currently I think I lock 1/64th of the main hash table at a time. If the two threads are unfortunate enough to be operating on hash entrys that are in the same partition (i.e. thread A is on entry N and thread B is on entry N +/- M*64) then the second one in waits. I experimented with the size of the partition and once I got to 64 it made no impact on search speed because collisions happen so infrequently.

I just do the same thing here as with my eval hash table: I have a single, lockless transposition table, and assume that collisions are so rare that they will never cause problems. It seems to be safe enough.

Your helper idle loop question is tricky.

Actually, it wasn't. Alessandro's suggestions earlier in this thread solved all problems.

FWIW, my code is online at http://wannabe.guru.org/svn/typhoon/trunk --My SMP stuff is in search.c and split.c.

Cool! I'll have a look at it. As you have probably seen, my code is also released, and is found at http://www.glaurungchess.com. The SMP stuff is found in search.cpp and smp.cpp.

Tord

by **Gian-Carlo Pascutto** » 08 Mar 2006, 13:45

Tord Romstad wrote:The root looks like a perfect node to split, because the remaining depth is bigger than anywhere else in the tree, and because we know that we will always search all moves.
Tord

But it's a PV node, meaning, that if the PV changes, you have inefficiency. And it's going to be quite a big inefficiency because the trees are so big.

My experience with trying to find better splitting algorithms always turned out the same: make sure all CPUs are busy at all times, and mostly forget about everything else

by **Tord Romstad** » 08 Mar 2006, 14:22

Gian-Carlo Pascutto wrote:
Tord Romstad wrote:The root looks like a perfect node to split, because the remaining depth is bigger than anywhere else in the tree, and because we know that we will always search all moves.
Tord

But it's a PV node, meaning, that if the PV changes, you have inefficiency. And it's going to be quite a big inefficiency because the trees are so big.

Hm... Perhaps it is worth a try to avoid splitting at internal PV nodes, too? I'll experiment with this and see what I find.

My experience with trying to find better splitting algorithms always turned out the same: make sure all CPUs are busy at all times, and mostly forget about everything else

It's not quite that simple, I think. If it were, we would all use ABDADA or one of the other boring shared hash table techniques.

Tord

by **Alessandro Scotti** » 08 Mar 2006, 16:27

Tord Romstad wrote:It's not quite that simple, I think. If it were, we would all use ABDADA or one of the other boring shared hash table techniques.

Hi Tord,
do you know where I can find a description of ABDADA?

by **Tord Romstad** » 08 Mar 2006, 16:39

Alessandro Scotti wrote:Hi Tord,
do you know where I can find a description of ABDADA?

Sure!
http://dx.doi.org/10.1145/228329.228345
Opinions differ about how good it really is. I have seen people claim that ABDADA isn't very efficient when combined with recursive null move search (which makes it rather pointless for most chess programs) and that it doesn't scale well beyond 2 CPUs, but I haven't tried it myself.

YBWC is much more fun, however. And contrary to popular belief, it isn't really more difficult to implement than shared hash table algorithms.

Tord

by **Alessandro Scotti** » 08 Mar 2006, 17:06

Hi Tord,
thanks for the link! Unfortunately ACM wants me to buy the article, which suddenly makes the algorithm a lot less attractive! ;-)

Tord Romstad wrote:YBWC is much more fun, however. And contrary to popular belief, it isn't really more difficult to implement than shared hash table algorithms.

Well I'm about to discover I guess! :-)

Difficult or not, just making a "mental picture" of a splitpoint and how the master thread at a split point can cooperate with other threads at other splitpoints had my head spinning for a good week or so! :shock:

This is definitely something I want to carefully layout on paper before starting to write even a single line of code...

Winboard Forum

Assorted SMP questions

Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Re: Assorted SMP questions

Who is online