A Few General Questions on Parallel Search

Programming Topics (Computer Chess) and technical aspects such as test techniques, book building, program tuning, etc.

Moderator: Andres Valverde

A Few General Questions on Parallel Search

Postby Pradu » 31 Aug 2007, 02:02

It would take a lot of effort to reinvent techniques and to make (and learn from) mistakes that others have already painstakingly worked through; so, I have a few questions (that I could not find answers to easily on the internet) for people who already have a lot of experience with parallel search.
  • How would you handle move ordering differently for the parallel search? For example, would you share killers for the next ply at the split point? What would be the best way to handle hash tables (transposition table, pawn hash, etc.)? How would you make the history heuristic work best? Would you want to spend time copying all move ordering data (like history, hash) when splitting?
  • For Buzz, I only split at alpha+1==beta so I don't have to check for bounds updating; however, I still have to check for cutoffs. For this, I loop through the entire search stack at every new node and poll the split points for a cutoff flag. Is there a more efficient/elegant way to do this?
  • When creating a split point, I use malloc; when destroying it, I use free. Does it matter whether I use dynamic or static memory allocation for performance?
  • I have seen some postings here about "buffering 64 bytes between data structures or something". What is this? Why would you do it?
  • Are there other coding tips you would give for better scaling?

Thanks in advance.

Re: A Few General Questions on Parallel Search

Postby bob » 31 Aug 2007, 15:53

Pradu wrote:It would take a lot of effort to reinvent techniques and to make (and learn from) mistakes that others have already painstakingly worked through; so, I have a few questions (that I could not find answers to easily on the internet) for people who already have a lot of experience with parallel search.
  • How would you handle move ordering differently for the parallel search? For example, would you share killers for the next ply at the split point? What would be the best way to handle hash tables (transposition table, pawn hash, etc.)? How would you make the history heuristic work best? Would you want to spend time copying all move ordering data (like history, hash) when splitting?


I would not change anything. If it works for the sequential search, it should work just fine for the parallel search. How you share killers and such will take some testing. I share them myself, but once I split, each processor has its own local copy of them that it updates as it searches.


  • For Buzz, I only split at alpha+1==beta so I don't have to check for bounds updating; however, I still have to check for cutoffs. For this, I loop through the entire search stack at every new node and poll the split points for a cutoff flag. Is there a more efficient/elegant way to do this?


When a processor gets a cutoff, simply have it inform others at that split point that there is a better bound. It doesn't happen often, so continually checking for this is a waste.

  • When creating a split point, I use malloc; when destroying it, I use free. Does it matter whether I use dynamic or static memory allocation for performance?


Lose the malloc() stuff. It makes the cost of doing a split far higher than necessary, and could even cause some paging. Make the split blocks static and define enough of them that you _never_ run out on the hardware you are using.
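
For illustration only, here is a minimal sketch of the kind of preallocated split-block pool described above. All names and sizes (SplitBlock, MAX_SPLIT_BLOCKS, and so on) are made up, not taken from any particular engine:

Code:
/* A fixed pool of split blocks allocated once at startup, so a split never
   calls malloc().  Everything here is illustrative. */
#include <pthread.h>

#define MAX_SPLIT_BLOCKS 64            /* enough that we never run out */

typedef struct {
    volatile int in_use;               /* 0 = free, 1 = owned by a split point */
    /* ... moves, bounds, and thread bookkeeping would go here ... */
} SplitBlock;

static SplitBlock split_pool[MAX_SPLIT_BLOCKS];
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

SplitBlock *alloc_split_block(void) {
    SplitBlock *sb = NULL;
    pthread_mutex_lock(&pool_lock);
    for (int i = 0; i < MAX_SPLIT_BLOCKS; i++) {
        if (!split_pool[i].in_use) {
            split_pool[i].in_use = 1;
            sb = &split_pool[i];
            break;
        }
    }
    pthread_mutex_unlock(&pool_lock);
    return sb;                         /* NULL means the pool is exhausted */
}

void free_split_block(SplitBlock *sb) {
    sb->in_use = 0;                    /* a single store is enough for this sketch */
}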

  • I have seen some postings here about "buffering 64 bytes between data structures or something". What is this? Why would you do it?


The cache reads 64-byte chunks (cache lines) of data from main memory. You need to take advantage of that to reduce memory traffic.
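
One concrete way to exploit this, sketched below: pad and align per-thread data so that each thread's frequently written fields live in their own 64-byte cache line and never share a line with another thread's data (avoiding "false sharing"). The struct layout and names are illustrative, and alignas requires C11:

Code:
#include <stdalign.h>

/* One record per thread, padded and aligned so that each occupies its own
   64-byte cache line; two threads never write to the same line. */
typedef struct {
    alignas(64) unsigned long long nodes;   /* updated constantly by one thread */
    int stop;                                /* "abort search" flag polled by that thread */
    char pad[64 - sizeof(unsigned long long) - sizeof(int)];
} ThreadData;

static ThreadData thread_data[16];           /* one slot per thread */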

  • Are there other coding tips you would give for better scaling?
  • Thanks in advance.


    Watch how you use shared variables that more than one thread will modify. There is a danger of race conditions, as well as a danger of excessive cache overhead.
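
    As a minimal, hypothetical example of the kind of care this takes: any field of a split point that several threads may write (a shared bound, a move counter, a cutoff flag) should be updated under a lock, while hot per-thread counters are better kept local and merged later:

    Code:
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    int moves_searched;        /* shared: several threads pick moves from here */
    volatile int cutoff;       /* set once when any thread at this split fails high */
} SplitPoint;

void report_fail_high(SplitPoint *sp) {
    pthread_mutex_lock(&sp->lock);   /* without the lock this update is a race */
    sp->cutoff = 1;
    pthread_mutex_unlock(&sp->lock);
}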

    Re: A Few General Questions on Parallel Search

    Postby Zach Wegner » 02 Sep 2007, 00:29

    Pradu wrote:It would take a lot of effort to reinvent techniques and to make (and learn from) mistakes that others have already painstakingly worked through; so, I have a few questions (that I could not find answers to easily on the internet) for people who already have a lot of experience with parallel search.
    • How would you handle move ordering differently for the parallel search? For example, would you share killers for the next ply at the split point? What would be the best way to handle hash tables (transposition table, pawn hash, etc.)? How would you make the history heuristic work best? Would you want to spend time copying all move ordering data (like history, hash) when splitting?

    Since I use processes rather than threads, the only thing I share is the main hash table. Eval tables, pawn tables, killers, history, counters, etc. are all private. Since these don't make too much difference for move ordering, and since most children of split nodes will be cut nodes anyway, I don't bother sharing them.
    • For Buzz, I only split at alpha+1==beta so I don't have to check for bounds updating; however, I still have to check for cutoffs. For this, I loop through the entire search stack at every new node and poll the split points for a cutoff flag. Is there a more efficient/elegant way to do this?
    Personally, I use a message-passing system. Before, when I used a PVS-like approach to splitting and each processor could only be part of one split point in the whole search stack, I just set a flag in the split point that each processor polled. However, when I got to a stack approach where there are multiple split points, I needed to implement message passing. This also makes it more efficient on NUMA-like machines, where polling a value in distant memory can be very expensive; instead, each processor just polls local memory. I have several message types:
    Code:
    typedef enum { SMP_HELP = 1, SMP_UPDATE, SMP_UNSPLIT, SMP_STOP } SMP_MESSAGE;
    You also need a data element, as well as a unique split point identifier. This is to prevent conditions where a processor gets a cutoff right when another processor is leaving a node, and that processor doesn't get the STOP message until it has already left the split point.
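
    For what it is worth, a hypothetical sketch of such a message record, reusing the enum above and adding the data element and split point identifier mentioned; the field names are made up:

    Code:
typedef struct {
    SMP_MESSAGE type;   /* SMP_HELP, SMP_UPDATE, SMP_UNSPLIT or SMP_STOP */
    int data;           /* message-specific payload, e.g. a new bound */
    int split_id;       /* unique split point ID: a STOP for a split point the
                           receiver has already left can simply be ignored */
} SMP_MSG;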

    Zach

    Re: A Few General Questions on Parallel Search

    Postby bob » 02 Sep 2007, 17:23

    Zach Wegner wrote:
    Pradu wrote:It would take a lot of effort to reinvent techniques and to make (and learn from) mistakes that others have already painstakingly worked through; so, I have a few questions (that I could not find answers to easily on the internet) for people who already have a lot of experience with parallel search.
    • How would you handle move ordering differently for the parallel search? For example, would you share killers for the next ply at the split point? What would be the best way to handle hash tables (transposition table, pawn hash, etc.)? How would you make the history heuristic work best? Would you want to spend time copying all move ordering data (like history, hash) when splitting?

    Since I use processes rather than threads, the only thing I share is the main hash table. Eval tables, pawn tables, killers, history, counters, etc. are all private. Since these don't make too much difference for move ordering, and since most children of split nodes will be cut nodes anyway, I don't bother sharing them.


    You are probably missing a lot of easy cutoffs that way. Killer moves should definitely be shared between threads, at least when the split is done. Keeping them local after that makes better sense, since killer moves are somewhat local in nature. But at least inheriting what has worked before, when splitting, makes the search below that point more efficient. Ditto for things like pawn score hashing. Why have each thread evaluate each identical pawn position more than once? Once a position has been evaluated, it makes good sense to hash the results. Anything that is good for one thread is good for two or more. Children of split nodes are not always CUT nodes. At the next ply, yes. But you have more plies to search below that, and below a CUT node, there must be an ALL node.
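
    As a small illustration of the shared pawn hash idea (not code from any engine; names and sizes are invented), one table can serve every thread so that a given pawn structure is evaluated once and reused:

    Code:
#define PAWN_HASH_SIZE (1 << 16)            /* entry count; must be a power of two */

typedef struct {
    unsigned long long key;                  /* Zobrist key of the pawn structure */
    int score;                               /* cached pawn-structure evaluation */
} PawnHashEntry;

static PawnHashEntry pawn_hash[PAWN_HASH_SIZE];   /* one table shared by all threads */

int probe_pawn_hash(unsigned long long key, int *score) {
    PawnHashEntry *e = &pawn_hash[key & (PAWN_HASH_SIZE - 1)];
    if (e->key != key)
        return 0;                            /* not present (or overwritten) */
    *score = e->score;
    return 1;
}

void store_pawn_hash(unsigned long long key, int score) {
    PawnHashEntry *e = &pawn_hash[key & (PAWN_HASH_SIZE - 1)];
    e->key = key;                            /* unlocked in this sketch; a real engine
                                                may lock, verify, or tolerate rare races */
    e->score = score;
}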

    • For Buzz, I only split at alpha+1==beta so I don't have to check for bounds updating; however, I still have to check for cutoffs. For this, I loop through the entire search stack at every new node and poll the split points for a cutoff flag. Is there a more efficient/elegant way to do this?
    Personally, I use a message-passing system. Before, when I used a PVS-like approach to splitting and each processor could only be part of one split point in the whole search stack, I just set a flag in the split point that each processor polled. However, when I got to a stack approach where there are multiple split points, I needed to implement message passing. This also makes it more efficient on NUMA-like machines, where polling a value in distant memory can be very expensive; instead, each processor just polls local memory. I have several message types:
    Code:
    typedef enum { SMP_HELP = 1, SMP_UPDATE, SMP_UNSPLIT, SMP_STOP } SMP_MESSAGE;
    You also need a data element, as well as a unique split point identifier. This is to prevent conditions where a processor gets a cutoff right when another processor is leaving a node, and that processor doesn't get the STOP message until it has already left the split point.

    Zach

    Re: A Few General Questions on Parallel Search

    Postby Pradu » 02 Sep 2007, 17:45

    Zach Wegner wrote:Since I use processes rather than threads, the only thing I share is the main hash table. Eval tables, pawn tables, killers, history, counters, etc. are all private. Since these don't make too much difference for move ordering, and since most children of split nodes will be cut nodes anyway, I don't bother sharing them.


    By sharing killers, I meant for the children of the split node (the CUT node). For example, one thread searches the first move, and it fails low as expected; it also picked up a killer move at the CUT node below it. The question is: is it worth informing the other threads of this killer?

    Personally, I use a message-passing system. Before, when I used a PVS-like approach to splitting and each processor could only be part of one split point in the whole search stack, I just set a flag in the split point that each processor polled. However, when I got to a stack approach where there are multiple split points, I needed to implement message passing. This also makes it more efficient on NUMA-like machines, where polling a value in distant memory can be very expensive; instead, each processor just polls local memory. I have several message types:
    Zach
    Interesting. I guess I'll have to ditch my work-stealing approach for this to work well. How do you handle race conditions in the message passing? Say one processor sends a stop condition to terminate the search thread, but another sends a cutoff condition telling it just to go back into its spinning loop for more work. If the cutoff arrives after the stop message and overwrites your flag, it might be possible that the search is never terminated. Do you lock the flag until the message has been processed, or something like that?

    Re: A Few General Questions on Parallel Search

    Postby Pradu » 02 Sep 2007, 17:47

    bob wrote:When a processor gets a cutoff, simply have it inform others at that split point that there is a better bound. It doesn't happen often, so continually checking for this is a waste.
    What is the mechanism used to inform the others? Was it done in plain C, or is it something to do with pthread signals (you mentioned signals in your paper on DTS, but I haven't read up on how pthread signals work yet)? I guess my question is: what is the most efficient way to implement the informing system?
    The HELP command is the primary signaling mechanism within DTS. Whenever a processor is "out of work" it sends the HELP command to what is effectively the entire group of active processors.


    Oh, and one more thing: in Cray Blitz, how did you keep track of the best node to split at? How would you inform the thread that is already searching that best split point node that it had become a split point, and how does that thread handle it?

    Re: A Few General Questions on Parallel Search

    Postby bob » 03 Sep 2007, 15:52

    Pradu wrote:
    bob wrote:When a processor gets a cutoff, simply have it inform others at that split point that there is a better bound. It doesn't happen often, so continually checking for this is a waste.
    What is the mechanism used to inform the others? Was it done in plain C, or is it something to do with pthread signals (you mentioned signals in your paper on DTS, but I haven't read up on how pthread signals work yet)? I guess my question is: what is the most efficient way to implement the informing system?
    The HELP command is the primary signaling mechanism within DTS. Whenever a processor is "out of work" it sends the HELP command to what is effectively the entire group of active processors.


    Two answers. In Cray Blitz it was easy to do, since all processors used an iterated (non-recursive) search with all of the search state visible to everyone. There I just set a flag for anyone working at a common split point that said "copy the new alpha/beta bounds, as they have narrowed..."

    In Crafty I don't bother, because I still use a recursive search: the alpha/beta values are passed by value through a series of recursive calls, and updating them up the call chain cannot be done in a way that is portable across platforms and compilers.


    Oh, and one more thing: in Cray Blitz, how did you keep track of the best node to split at? How would you inform the thread that is already searching that best split point node that it had become a split point, and how does that thread handle it?


    Since all data in CB is global, with respect to the tree stuff and split blocks, it was possible for A to be searching, and for B to split with A without A doing anything at all. Whenever A would "back up" to a split point, it just suddenly found it there, as though it had been there all along. Since B could see A's data, it could link things in properly. To choose the "best node" I just looked at the active trees for all active processors and examined them node by node to find the best candidate, the one most likely to remain an ALL node.

    In a simple case, you could do a "global YBW" algorithm, where you simply look at the current state of all active processors, and find the list of nodes where at least one branch has already been completely searched. You can also weight the depth of each of these nodes and the number of moves already searched. Pick shallower depths (deeper remaining depth) to reduce the number of splits you do. Pick a split point with several moves already searched as you can be pretty confident that the rest need to be searched. And you can look at the alpha/beta bounds at all nodes below the candidate split point to make sure nothing unusual is going on (we might be searching something that will make us fail high at the candidate split point.) There are several things you can do to improve your chances of picking a real ALL node and avoid any extra search overhead. But in a recursive implementation like mine, it just isn't possible. And I give up the efficiency that offers.
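
    For illustration, a hypothetical scoring function along those lines (the weights and names are invented, not from Cray Blitz): prefer candidates that satisfy the YBW condition, have plenty of remaining depth, and already have several moves searched:

    Code:
typedef struct {
    int remaining_depth;    /* plies still to be searched below this node */
    int moves_searched;     /* >= 1 means the first ("eldest brother") move is done */
} Candidate;

/* Higher score = better split point; -1 = not eligible. */
int split_score(const Candidate *c) {
    if (c->moves_searched < 1)      /* young brothers wait: never split before
                                       the first move has been searched */
        return -1;
    /* Weight remaining depth heavily, plus a bonus per move already searched,
       since those make it more likely this is a genuine ALL node. */
    return 4 * c->remaining_depth + c->moves_searched;
}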

    Re: A Few General Questions on Parallel Search

    Postby Zach Wegner » 04 Sep 2007, 17:02

    Pradu wrote:By sharing killers, I meant for the children of the split node (the CUT node). For example, one thread searches the first move, and it fails low as expected; it also picked up a killer move at the CUT node below it. The question is: is it worth informing the other threads of this killer?
    I am really not sure. It's just that in my approach, the only feasible option would be to copy the current killers whenever a split occurs. Updating killers while searching would become too expensive. My hope is that my split point selection code will become efficient enough that the nodes on the ply after the split node will be easy cut nodes, and won't need killers.
    Pradu wrote:Interesting. I guess I'll have to ditch my work-stealing approach for this to work well. How do you handle race conditions in the message passing? Say one processor sends a stop condition to terminate the search thread, but another sends a cutoff condition telling it just to go back into its spinning loop for more work. If the cutoff arrives after the stop message and overwrites your flag, it might be possible that the search is never terminated. Do you lock the flag until the message has been processed, or something like that?
    Well, I actually have two types of signals. One is synchronous, i.e. the sending processor waits until the receiving processor has processed the message before continuing. This is done by the sending process setting a lock in the other process's local memory. The other messages, the kind I was talking about earlier, are asynchronous. For these, I had to implement a stack of messages. Each time there is a cutoff, the sending processor takes a lock, increments the message count, and writes the message into the appropriate stack element. Then the lock is released. Each element contains the message type and a data element specifying the split point ID. When the receiving processor notices that there are messages waiting, it takes the lock and processes all messages in FIFO order.
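
    A minimal sketch of that scheme, assuming a pthread mutex for the lock; the structure, names and sizes are illustrative, not Zach's actual code:

    Code:
#include <pthread.h>

#define MSG_MAX 32

typedef struct {
    pthread_mutex_t lock;
    int count;
    struct { int type; int split_id; } msg[MSG_MAX];
} MessageQueue;                    /* lives in each processor's local memory */

void post_message(MessageQueue *q, int type, int split_id) {
    pthread_mutex_lock(&q->lock);
    if (q->count < MSG_MAX) {      /* sender: take the lock, append, release */
        q->msg[q->count].type = type;
        q->msg[q->count].split_id = split_id;
        q->count++;
    }
    pthread_mutex_unlock(&q->lock);
}

void process_messages(MessageQueue *q) {
    pthread_mutex_lock(&q->lock);
    for (int i = 0; i < q->count; i++) {
        /* handle q->msg[i] here, oldest first (FIFO order) */
    }
    q->count = 0;
    pthread_mutex_unlock(&q->lock);
}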

    Re: A Few General Questions on Parallel Search

    Postby Tord Romstad » 09 Sep 2007, 17:20

    Hello Pradu,

    These are my answers, but please take them with a grain of salt. Bob is the real expert, of course.

    Pradu wrote:How would you handle move ordering differently for the parallel search?

    In my case, exactly like for my old non-parallel search.

    For example, would you share killers for the next ply at the split point?

    No, I don't. I'm not sure how much it would help.

    What would be the best way to handle hash tables (transposition table, pawn hash, etc.)?

    For the transposition table, I do nothing at all. All threads share the same transposition table, without locking. I always check hash moves for legality, though. For pawn and material hash tables, I have a separate table for each thread.
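
    A tiny sketch of that legality check, with hypothetical types and a stand-in legality test (a real engine supplies its own):

    Code:
typedef struct { unsigned long long key; int move; int score; int depth; } HashEntry;
typedef struct { int side_to_move; /* ... full board state ... */ } Position;

/* Stand-in for the engine's real pseudo-legality test. */
static int move_is_pseudo_legal(const Position *pos, int move) {
    (void)pos; (void)move;
    return 1;
}

/* Never trust a move from the unlocked shared table blindly: a concurrent
   write may have produced a corrupted entry. */
static int hash_move_or_zero(const Position *pos, const HashEntry *e) {
    if (e->move && !move_is_pseudo_legal(pos, e->move))
        return 0;                  /* corrupted or mismatched entry: ignore the move */
    return e->move;
}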

    How would you make history heuristic work best? Would you want to spend time copying all move ordering data (like history, hash) when splitting?

    I once tried giving each thread its own history table, but it didn't work well. I now let all threads share the same table (again, without locking), hence no copying is necessary.

    For Buzz, I only split at alpha+1==beta

    I think this is a bad idea. I did the same thing in the first (non-public) parallel version of Glaurung 2, and it turned out that the efficiency of the parallel search improved enormously when I allowed splitting at nodes where alpha + 1 < beta.

    so I don't have to check for bounds updating; however, I still have to check for cutoffs. For this, I loop through the entire search stack at every new node and poll the split points for a cutoff flag. Is there a more efficient/elegant way to do this?

    I do something similar to what you do. Each split point contains a pointer to its "parent split point", i.e. the closest split point along the path back to the root of the search tree (this pointer is NULL if there is no split point along the path to the root). When testing whether a thread should stop searching, I look at its parent split point, its grandparent split point, and so on all the way to the root. If a beta cutoff has occurred at any of these split points, the thread stops its current search, and returns to its idle loop.
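
    A minimal sketch of that test, with illustrative names (not Glaurung's actual code):

    Code:
typedef struct SplitPointTag {
    struct SplitPointTag *parent;  /* closest split point toward the root, or NULL */
    volatile int cutoff;           /* set when a beta cutoff occurs at this split point */
} SplitPoint;

/* Walk parent, grandparent, ... toward the root; stop searching if any
   enclosing split point has already failed high. */
int thread_should_stop(const SplitPoint *sp) {
    for (; sp != NULL; sp = sp->parent)
        if (sp->cutoff)
            return 1;
    return 0;
}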

    When creating a split point, I use malloc; when destroying it, I use free. Does it matter whether I use dynamic or static memory allocation for performance?

    Yes, probably. I think it is much better to preallocate a pool of split point objects during program initialization.

    Tord

    Re: A Few General Questions on Parallel Search

    Postby bob » 09 Sep 2007, 18:26

    Just a note. I think I am about to remove the "lockless xor" stuff in Crafty. After the collision tests Cozzie and I ran, it became pretty obvious that an error here and there, one or two an hour, will have zero effect on the search. And since I already check hash moves for legality (I can't recover from a bad move if I make it), I'm going to eliminate the xor overhead completely and gain back a little speed on the hash probes/stores...
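
    For readers who have not seen it, a sketch of the "lockless xor" idea being discussed (the general technique, not Crafty's exact layout): the key is stored xor-ed with the data word, so a torn, half-updated entry simply fails the signature test instead of silently returning garbage:

    Code:
typedef struct {
    unsigned long long key_xor_data;   /* position key XOR data word */
    unsigned long long data;           /* move, score, depth, bound packed together */
} HashBucket;

void store_entry(HashBucket *b, unsigned long long key, unsigned long long data) {
    b->key_xor_data = key ^ data;
    b->data = data;
}

int probe_entry(const HashBucket *b, unsigned long long key, unsigned long long *data) {
    unsigned long long d = b->data;
    if ((b->key_xor_data ^ d) != key)
        return 0;                      /* wrong position, or a torn concurrent write */
    *data = d;
    return 1;
}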

    As far as sharing history/killers, I think it requires caution. Killer moves are very local in nature. When I am at ply=15 searching a sub-tree, the killers below that point are really primarily useful for this particular area of the tree. If another processor is splitting at a shallow ply, I'd prefer to not have it changing the killers I use way over here. I do "seed" the split block killer moves when I split, by copying from the parent to the child, but each child has its own set of killers to update from that point on in the search.

    The only minor problematic point is that when you clean up a split point, the killers at the previous ply probably don't get copied back (I don't do this, but I might test it). So once a split point is finished, the killers are effectively restored to exactly what they were at the instant the split was done.
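
    A short sketch of the seeding described above (names and layout are illustrative, not Crafty's split blocks): the parent's killers are copied into the child's private table at split time, and the child updates only its own copy afterwards:

    Code:
#define MAX_PLY 64

typedef struct {
    int killer1[MAX_PLY];
    int killer2[MAX_PLY];
} KillerTable;

/* Called once when a thread joins a split point at split_ply. */
void seed_killers(KillerTable *child, const KillerTable *parent, int split_ply) {
    for (int ply = split_ply; ply < MAX_PLY; ply++) {
        child->killer1[ply] = parent->killer1[ply];
        child->killer2[ply] = parent->killer2[ply];
    }
}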

    I'm not using any sort of history counters at all at present. My feeling is that shared history counters are a bit too "noisy" as well, for the reason given above...

