Tuning parallel algorithms; Some questions

Programming Topics (Computer Chess) and technical aspects such as test techniques, book building, program tuning, etc.

Moderator: Andres Valverde

Tuning parallel algorithms; Some questions

Postby Volker Böhm » 28 Sep 2007, 10:32

Hi,

Currently I am working on tuning the parallel algorithm of Spike using a quad-core (Opteron) computer - no, I haven't got one of the new 4-core Opteron chips.

I am currently not able to achieve a speedup > 3. As far as I remember, Bob Hyatt wrote somewhere that he gained a speedup of about 3.8.

I do think that a strong move reduction algorithm will cut down the speedup, as only a few moves have large node counts.

Measurement of speedup:
For speedup measurement I use 100 test positions and calculate all of them to a fixed depth. I do that with the single-CPU version and compare the runtime to the 4-CPU version. I currently use depth 14; that gives me about 10 seconds per position.

I wonder which speedup you get and what are your "triggers" for speedup.
In Spike I have the following results:

Average node factor: ca. 1.19
Average nps factor: ca. 3.5
Average speedup: ca. 2.9

The results vary between 2.7 and 3.1; the 2.9 is an average of 30 searches over all 100 test positions.
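For anyone reproducing this kind of measurement: the three figures are linked by speedup = NPS factor / node factor, and 3.5 / 1.19 is indeed about 2.9. A minimal sketch of the bookkeeping in C, with invented names (RunTotals, report), not Spike's code:

Code: Select all
#include <stdio.h>

/* Totals accumulated over the whole 100-position suite at fixed depth. */
typedef struct {
    double nodes;    /* total nodes searched   */
    double seconds;  /* total wall-clock time  */
} RunTotals;

void report(RunTotals serial, RunTotals parallel) {
    double node_factor = parallel.nodes / serial.nodes;      /* search overhead */
    double nps_factor  = (parallel.nodes / parallel.seconds)
                       / (serial.nodes  / serial.seconds);   /* NPS scaling     */
    double speedup     = serial.seconds / parallel.seconds;  /* what matters    */
    printf("node factor %.2f  nps factor %.2f  speedup %.2f\n",
           node_factor, nps_factor, speedup);
}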

The NPS factor gets better with longer searches, pushing the speedup to a little more than 3.

I do lose time waiting for a suitable split point. There are ways to increase the NPS factor by using bad split points, but this has a trade-off in the node factor.

positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near horizont

negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).

Debugging is hard. Lately I found a bug in terminating helper threads on beta cutoffs at split points. So maybe there are bugs left.

About Spike MP Algorithm:
Spike uses one thread for every CPU and one thread for input handling. Threads without work scan for work offers from the threads that have work. If there are multiple work offers, they try to pick the best one.
A thread that has finished its work is able to help its helpers. Alpha >= beta terminates every thread in this branch: helper, helper's helper, ...
I currently don't terminate calculations in PV nodes on an alpha increase to restart them with smaller windows; my last implementation of that didn't gain a speedup.
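The "terminate every thread in this branch" step can be pictured as a recursive walk over the helper tree. A sketch under invented names (Thread, stop_subtree), not Spike's actual code:

Code: Select all
#define MAX_HELPERS 4

typedef struct Thread {
    volatile int stop;                    /* polled inside the search    */
    struct Thread *helpers[MAX_HELPERS];  /* threads splitting below us  */
    int nhelpers;
} Thread;

void stop_subtree(Thread *t) {
    t->stop = 1;                          /* stop this helper...         */
    for (int i = 0; i < t->nhelpers; i++)
        stop_subtree(t->helpers[i]);      /* ...and its helpers' helpers */
}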

What are your experiences with mp speedups?

Greetings Volker
Volker Böhm, Spike Team
Volker Böhm
 
Posts: 66
Joined: 27 Sep 2004, 18:52
Location: Germany

Re: Tuning parallel algorithms; Some questions

Postby bob » 28 Sep 2007, 17:02

Volker Böhm wrote:Hi,

Currently I am working on tuning the parallel algorithm of Spike using a quad-core (Opteron) computer - no, I haven't got one of the new 4-core Opteron chips.

I am currently not able to achieve a speedup > 3. As far as I remember, Bob Hyatt wrote somewhere that he gained a speedup of about 3.8.


Wasn't me. My speedup roughly follows this:

speedup = 1 + 0.7 * (NCPUS - 1)

For 4 processors that comes to 3.1 or so. Tests posted here a couple of years ago put it closer to 3.3, but the formula is not far off.
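(For other counts, the same formula gives 1 + 0.7*1 = 1.7 on 2 CPUs and 1 + 0.7*7 = 5.9 on 8 CPUs.)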




I do think that a strong move reduction algorithm will cut down the speedup, as only a few moves have large node counts.

Measurement of speedup:
For speedup measurement I use 100 test positions and calculate all of them to a fixed depth. I do that with the single-CPU version and compare the runtime to the 4-CPU version. I currently use depth 14; that gives me about 10 seconds per position.

I wonder which speedup you get and what are your "triggers" for speedup.
In Spike I have the following results:

Average node factor: ca. 1.19
Average nps factor: ca. 3.5
Average speedup: ca. 2.9

The results vary between 2.7 and 3.1; the 2.9 is an average of 30 searches over all 100 test positions.

The NPS factor gets better with longer searches, pushing the speedup to a little more than 3.


Sounds like you are doing pretty well. The first thing to do is to get your NPS up to 4X the single-CPU NPS. That can be done. Otherwise you are losing half of one CPU before you get started, and you have already capped yourself at a max speedup of 3.5... This is called "NPS scaling", and 100% is desirable. Usually it is a cache-based issue that you just have to work on, avoiding unintended interactions between processors.



I do lose time waiting for a suitable split point. There are ways to increase the NPS factor by using bad split points, but this has a trade-off in the node factor.

positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near horizont


The second two above are not good ideas. You _never_ want to split at a CUT node. Never. Just because you have winning captures left is also not a good reason to split. It only takes one winning capture to produce a cutoff, so why search N of them in parallel and waste the time?

In Crafty, the last point is tunable. There is a split cost that has to be absorbed, and splitting too near the tips increases the number of splits, and hence the cost that has to be absorbed...
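As an illustration of that depth threshold, the split test might look roughly like this; MIN_SPLIT_DEPTH and the field names are invented, not Crafty's actual code:

Code: Select all
#define MIN_SPLIT_DEPTH 4   /* don't split with fewer plies remaining */

typedef struct {
    int remaining_depth;    /* plies left below this node   */
    int moves_searched;     /* moves already completed here */
} Node;

int split_allowed(const Node *n, int idle_threads) {
    /* Too near the tips: the subtree is too small to absorb
       the cost of setting up a split point. */
    if (n->remaining_depth < MIN_SPLIT_DEPTH) return 0;
    /* Young Brothers Wait: the first move is searched serially. */
    if (n->moves_searched == 0) return 0;
    return idle_threads > 0;
}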


negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).

Debugging is hard. Lately I found a bug in terminating helper threads on beta cutoffs at split points. So maybe there are bugs left.

About Spike MP Algorithm:
Spike uses one thread for every CPU and one thread for input handling. Threads without work scan for work offers from the threads that have work. If there are multiple work offers, they try to pick the best one.
A thread that has finished its work is able to help its helpers. Alpha >= beta terminates every thread in this branch: helper, helper's helper, ...
I currently don't terminate calculations in PV nodes on an alpha increase to restart them with smaller windows; my last implementation of that didn't gain a speedup.

What are your experiences with mp speedups?

Greetings Volker
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Volker Böhm » 29 Sep 2007, 20:29

Hi Bob,

thanks for your hints. But I think I have some other problems with Spike. I've done more time analysis (using AMD CodeAnalyst). When starting a search I have an NPS factor of 3 after 1 second; it rises to 3.5 after 10 seconds and to 3.7 after 1 minute. Longer searches don't get better.

In 1-minute searches I lose 1 - 3.7/4 = 7.5% of total CPU time compared to the optimal factor of 4. These 7.5% break down into the following "problems":
1%: Time not used by Spike (kernel, ...)
1%: Hash set and get, compared to the single-CPU version
1%: Move generator (Spike has a lazy move generator that is not suitable for MP split positions; thus at potential split positions I generate all moves)
3-4%: Time to find and compare split positions.
The rest: don't know.

In a one-second search, the time to find and compare split positions increases to 20% of total CPU time.

Thus my problem is that I use too much time finding split positions. There are many cases where there are no split positions, so I think a large amount of time is spent looping until a split position is available. Maybe I am doing something non-optimal here?

I do not split:
1. At the root
2. In the last two plies if futility pruning is possible
3. At a PV node if not at least one move is finished
4. At an ALL node if not at least one move is finished
5. At a CUT node if there is any other split possibility, or if not at least one move is finished

Not using CUT nodes at all increases my problem of threads doing nothing.
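Spelled out as code, rules 1-5 amount to a predicate roughly like this sketch (all names invented, not Spike's implementation):

Code: Select all
enum NodeType { PV_NODE, CUT_NODE, ALL_NODE };

typedef struct {
    enum NodeType type;
    int ply;                /* distance from the root    */
    int remaining_depth;    /* plies left to the horizon */
    int moves_finished;     /* moves fully searched here */
    int futility_possible;
} SplitCandidate;

int may_split(const SplitCandidate *n, int better_split_exists) {
    if (n->ply == 0) return 0;                            /* rule 1    */
    if (n->remaining_depth <= 2 && n->futility_possible)
        return 0;                                         /* rule 2    */
    if (n->moves_finished < 1) return 0;                  /* rules 3-5 */
    if (n->type == CUT_NODE && better_split_exists)
        return 0;                                         /* rule 5    */
    return 1;
}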

How does Crafty behave on this issue - how much time is spent doing nothing in a one-second search? (2*2 Opteron, 2 GHz)

Greetings Volker
Volker Böhm, Spike Team
Volker Böhm
 
Posts: 66
Joined: 27 Sep 2004, 18:52
Location: Germany

Re: Tuning parallel algorithms; Some questions

Postby bob » 30 Sep 2007, 03:44

Volker Böhm wrote:Hi Bob,

thanks for your hints. But I think I have some other problems with Spike. I've done more time analysis (using AMD CodeAnalyst). When starting a search I have an NPS factor of 3 after 1 second; it rises to 3.5 after 10 seconds and to 3.7 after 1 minute. Longer searches don't get better.


OK. That sounds like you have excessive waiting at shallow iterations, which slowly drops as you get to deeper iterations. I generally split until remaining depth is 3-4 plies or less (the number varies depending on the processors and speed, but 3-4 is typical). If I make that more conservative, I get fewer wasted splits, and lower NPS as well, and eventually I reach a point where I get a longer time to solution. My test is always to run several different kinds of test positions and choose the split criteria that give me the lowest time to fixed depth for the entire set.


In 1-minute searches I lose 1 - 3.7/4 = 7.5% of total CPU time compared to the optimal factor of 4. These 7.5% break down into the following "problems":
1%: Time not used by Spike (kernel, ...)



What is that? Crafty generally gets 399% out of a 4 cpu system. I see no reason for 1%+ kernel overhead in a chess engine that ought to be computing chess rather than making lots of system calls.


1%: Hash set and get, compared to the single-CPU version


What is that? I don't have any multi-cpu hash overhead so I am not sure what you mean there...

1%: Move generator (Spike has a lazy move generator that is not suitable for MP split positions; thus at potential split positions I generate all moves)


Why won't that work on split nodes? I generate captures and then non-captures, even at split points...

3-4%: Time to find and compare split positions.
The rest: don't know.

In a one-second search, the time to find and compare split positions increases to 20% of total CPU time.

Thus my problem is that I use too much time finding split positions. There are many cases where there are no split positions, so I think a large amount of time is spent looping until a split position is available. Maybe I am doing something non-optimal here?

I do not split:
1. At the root


Unfortunately, that is the _best_ place to split of all. It is the only guaranteed ALL node you can find in the entire tree, in fact. I've been splitting there since the beginning...


2. In the last two plies if futility pruning is possible
3. At a PV node if not at least one move is finished
4. At an ALL node if not at least one move is finished
5. At a CUT node if there is any other split possibility, or if not at least one move is finished

Not using CUT nodes at all increases my problem of threads doing nothing.


The main issue there is how accurate your ALL-node identification code is. If it is good, then splitting at a CUT node is worthless, since you should be right there as well. And splitting at a real CUT node is no good, for obvious reasons...
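For readers following along, the usual static prediction of node types is the Knuth-Moore recurrence. A sketch (neither engine necessarily classifies nodes exactly this way):

Code: Select all
enum NodeType { PV_NODE, CUT_NODE, ALL_NODE };

/* Expected type of a child, before it is searched: the first child of a
   PV node is PV and later ones are expected CUT; children of CUT nodes
   are expected ALL, and children of ALL nodes expected CUT. Predictions
   get revised when a node misbehaves, e.g. a CUT node whose first move
   fails low. */
enum NodeType child_type(enum NodeType parent, int first_child) {
    switch (parent) {
    case PV_NODE:  return first_child ? PV_NODE : CUT_NODE;
    case CUT_NODE: return ALL_NODE;
    default:       return CUT_NODE;   /* parent is an ALL node */
    }
}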


How does Crafty behave on this issue - how much time is spent doing nothing in a one-second search? (2*2 Opteron, 2 GHz)



I no longer measure this because measuring CPU time is difficult across a wide variety of platforms. But in years gone by, the number was _very_ low. I used to see 399% cpu utilization in blitz games played on ICC. I don't currently have code to measure the amount of time spent "idle".


Greetings Volker
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Volker Böhm » 01 Oct 2007, 08:48

Hi,

with "doing nothing" I didn´t mean having idle time, but waiting for split points in an endless loop consuming cpu power. I do have a little idle problem too but compared to the "no split position problem".

Maybe splitting at the root solves many of my problems. I will try that next. Some thoughts:
When the first move is searched at the root, I get a "guaranteed" ALL node at every first answer move (the leftmost branch of the search tree). Thus there are nearly always suitable split points while searching the first root move. (Not really a "guaranteed ALL node" for Spike, as I already use a shrunk search window at the root.) Thus the problem should exist only at every root move but the first. The answer move to a "non-first" root move is typically a cut move - thus again no split point.
Maybe my waiting for split points will be drastically reduced by splitting at the root. Maybe even some of the idle time will be lower.
I didn't split at the root before because I would have to handle the "parallel output problem": I am no longer able to tell the GUI which node is currently being searched, as there may be several.

On the issue "not splitting at cut nodes". I am not sure if this is allways right. Imagine the following situation:

Most cut nodes I split at have a hash move (at least 3-4 plies away from the horizon, thus enough opportunities to store a move in the hash).
I do not split before the first move is fully searched.

Thus in effect I split at a node that should be a cut node, but whose hash move does not lead to a cut. Maybe it is still a cut node, but I have to search some more moves to find out.
Here I have the following possibilities:
1. The second move leads to a cutoff (bad idea to split).
2. Some later move leads to a cutoff (not such a bad idea to split: with strong late move reductions most nodes search only a few early moves, thus not such a big difference from an ALL node).
3. It is indeed an ALL node. Then it is a very good idea to split here.

Analyzing split positions, I found that probable ALL nodes which turn out to be cut nodes often search all moves, or at least a large number of moves. The reason is that the move with a value >= beta needs such a large number of nodes that all the other moves are searched in parallel (late moves are searched up to two plies less deep, having far fewer nodes).
Thus maybe it is a good idea to find those nodes that start as cut nodes but turn out to be ALL nodes very quickly, before searching other moves at the split position below.
Volker Böhm, Spike Team
Volker Böhm
 
Posts: 66
Joined: 27 Sep 2004, 18:52
Location: Germany

Re: Tuning parallel algorithms; Some questions

Postby bob » 01 Oct 2007, 17:18

Volker Böhm wrote:Hi,

with "doing nothing" I didn´t mean having idle time, but waiting for split points in an endless loop consuming cpu power. I do have a little idle problem too but compared to the "no split position problem".

Maybe splitting at the root solves many of my problems. I will try that next. Some thoughts:
When the first move is searched at the root, I get a "guaranteed" ALL node at every first answer move (the leftmost branch of the search tree). Thus there are nearly always suitable split points while searching the first root move. (Not really a "guaranteed ALL node" for Spike, as I already use a shrunk search window at the root.) Thus the problem should exist only at every root move but the first. The answer move to a "non-first" root move is typically a cut move - thus again no split point.



A couple of points. First, ply=2 is not guaranteed to be an ALL node. In fact, except for the first root move, those nodes are hopefully all going to be CUT nodes.

However, even for the first root move, you don't want to split at ply=2 initially, as you would be searching those moves with the wrong bounds until the first ply=2 move is completed. The main advantage of splitting at the root is that all of your search overhead then comes from the first move only; the remaining root moves are guaranteed to have to be searched. The only issue is when one of them fails high, as you need to stop all active searches and get all processors together on this new best move to resolve the score as soon as possible.


Maybe my waiting for split points will be drastically reduced by splitting at the root. Maybe even some of the idle time will be lower.
I didn't split at the root before because I would have to handle the "parallel output problem": I am no longer able to tell the GUI which node is currently being searched, as there may be several.


There are issues. Displaying PVs is but one. Multiple simultaneous fail-highs are another. And making sure that before you "time out" you at least complete every search currently in progress at the root, to avoid missing a new best move.
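The last point needs only a little bookkeeping per root move. A sketch under assumed names (RootMove, safe_to_stop), not Crafty's actual code:

Code: Select all
#define MAX_ROOT_MOVES 256

typedef struct {
    int move;
    volatile int searching;   /* some thread is on this move right now */
} RootMove;

/* Honor a timeout only when no root move is half-searched; stopping
   in the middle of one could miss a new best move. */
int safe_to_stop(const RootMove *list, int n) {
    for (int i = 0; i < n; i++)
        if (list[i].searching)
            return 0;
    return 1;
}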


On the issue "not splitting at cut nodes". I am not sure if this is allways right. Imagine the following situation:

Most cut nodes I split at have a hash move (at least 3-4 plies away from the horizon, thus enough opportunities to store a move in the hash).
I do not split before the first move is fully searched.


OK. We usually call those "weak CUT nodes". The first move didn't cause a fail-high, but some move will, because move ordering is not optimal. The risk of parallel searching there is still high, but not nearly as high as at a real CUT node. That is what the YBW idea addresses: for most programs, over 90% of the time, if a node is a CUT node it fails high on the first branch. Which means that "weak CUT nodes" might well turn into ALL nodes; at the least, they are going to need more than one move searched.
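That 90% figure is easy to measure in any engine; a sketch with invented counter names:

Code: Select all
static long cutoffs_total, cutoffs_on_first_move;

void record_cutoff(int move_index) {      /* call at every fail-high  */
    cutoffs_total++;
    if (move_index == 0)
        cutoffs_on_first_move++;
}

double first_move_fh_rate(void) {         /* ~0.9+ with good ordering */
    return cutoffs_total
         ? (double)cutoffs_on_first_move / cutoffs_total : 0.0;
}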


Thus in effect I split at a node that should be a cut node, but whose hash move does not lead to a cut. Maybe it is still a cut node, but I have to search some more moves to find out.
Here I have the following possibilities:
1. The second move leads to a cutoff (bad idea to split).
2. Some later move leads to a cutoff (not such a bad idea to split: with strong late move reductions most nodes search only a few early moves, thus not such a big difference from an ALL node).
3. It is indeed an ALL node. Then it is a very good idea to split here.

Analyzing split positions, I found that probable ALL nodes which turn out to be cut nodes often search all moves, or at least a large number of moves. The reason is that the move with a value >= beta needs such a large number of nodes that all the other moves are searched in parallel (late moves are searched up to two plies less deep, having far fewer nodes).


Correct. But that extra work hurts. If you could do this perfectly, you would have all processors searching together on the one move that is going to fail high. Of course that is impossible to do, but it would be nice.

Thus maybe it is a good idea to find those nodes that start as cut nodes but turn out to be ALL nodes very quickly, before searching other moves at the split position below.
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Tord Romstad » 03 Oct 2007, 21:39

bob wrote:
Volker Böhm wrote:I do not split:
1. At the root


Unfortunately, that is the _best_ place to split of all. It is the only guaranteed ALL node you can find in the entire tree, in fact. I've been splitting there since the beginning...

Intuitively, I would also expect the root to be the best place to split. In practice, every time I have tried splitting at the root (I must have tried it at least a dozen times, with slightly different implementations each time), my search has turned out to be slightly, but measurably, weaker than when only splitting at internal nodes. I am not sure whether this is because splitting at the root screws up my time management, or whether there is some other explanation.

What experiences do the rest of you have with splitting at the root?

Tord
Tord Romstad
 
Posts: 639
Joined: 09 Oct 2004, 12:49
Location: Oslo, Norway

Re: Tuning parallel algorithms; Some questions

Postby Zach Wegner » 04 Oct 2007, 23:43

My experience is that splitting at the root would be an implementation nightmare, because I have an iterative search. I would need a completely different set of functions for splitting at the root node, as it is outside the internal stack. I've heard mixed things about splitting at the root, so I thought it would be fine to just leave it out.
Zach Wegner
 
Posts: 182
Joined: 26 Sep 2004, 22:02
Location: Austin, Texas, USA

Re: Tuning parallel algorithms; Some questions

Postby bob » 05 Oct 2007, 03:04

Tord Romstad wrote:
bob wrote:
Volker Böhm wrote:I do not split:
1. At the root


Unfortunately, that is the _best_ place to split of all. It is the only guaranteed ALL node you can find in the entire tree, in fact. I've been splitting there since the beginning...

Intuitively, I would also expect the root to be the best place to split. In practice, every time I have tried splitting at the root (I must have tried it at least a dozen times, with slightly different implementations each time), my search has turned out to be slightly, but measurably, weaker than when only splitting at internal nodes. I am not sure whether this is because splitting at the root screws up my time management, or whether there is some other explanation.

What experiences do the rest of you have with splitting at the root?

Tord


Here's the two effects that can be easily measured.

(1) Splitting at the root introduces the lowest possible level of search overhead. Trees searched to a fixed depth will always be smaller (with the possible exception of some very rare, pathological positions, of course) than trees where each root move is searched one at a time using all processors.

(2) Searching each root move one at a time offers a potentially quicker way to change to a new best move, assuming an iteration can be interrupted in the middle when time runs out. Splitting at the root has the risk of getting one CPU onto the move that would become the best, and it could get stuck there. If you don't take special care (I do) to make sure that once you start searching any single root move you guarantee that you will finish it before giving up on time, then you can miss a best move. But in doing this, you will have significantly higher search overhead...

In Crafty this can be turned on and off if you want to see the difference on some test cases: "smproot=0" stops splitting at the root, "smproot=1" (the default) splits at the root. I have run several matches to test this, and splitting at the root, the way I have it implemented, is better in terms of results against several programs... YMMV.
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Pradu » 05 Oct 2007, 03:23

Zach Wegner wrote:My experience is that splitting at the root would be an implementation nightmare, because I have an iterative search. I would need a completely different set of functions for splitting at the root node, as it is outside the internal stack. I've heard mixed things about splitting at the root, so I thought it would be fine to just leave it out.
Do you still implement your iterative search with switches? I used computed gotos to make mine look identical to the recursive search (I just made defines like RETURN() and SEARCH()), but apparently the only compilers that support them are the GCC-compatible ones. Do you think there's a less tedious way to implement an iterative search without switches, making it look just like a recursive search? With switches, you have to create a unique label in your code and put the label in a switch table, which can get tedious. Is there a "cross-compiler" way to do it without having to manually create labels or update switch tables?
Pradu
 
Posts: 343
Joined: 12 Jan 2005, 19:17
Location: Chandler, Arizona, USA

Re: Tuning parallel algorithms; Some questions

Postby Zach Wegner » 05 Oct 2007, 18:44

Yes, I use switches. What I do is have a switch at the beginning of a while loop that just specifies a goto label to jump to. I also have SEARCH and RETURN macros, so it looks pretty close to a recursive search. I posted a very simplified version of my iterative search a while back, so you can see what I mean:
http://wbforum.vpittlik.org/viewtopic.p ... 902&t=6273

It is rather tedious to have to create a new label, a new enum element, and a new string equivalent to the enum. However, I don't often add new search calls, so I don't mind.
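To make the technique concrete, here is a compressed sketch of such a switch-driven search. Everything in it (Frame, L_AFTER_CHILD, the fixed three moves per node) is invented for illustration; a real frame also carries move lists, hash keys and so on:

Code: Select all
enum Label { L_START, L_AFTER_CHILD };

typedef struct { int depth, alpha, beta, move; enum Label ret; } Frame;

/* "Call": save our resume label, push a child frame, re-dispatch. */
#define SEARCH(A, B, D, LBL)                                        \
    sp->ret = (LBL); sp++;                                          \
    sp->alpha = (A); sp->beta = (B); sp->depth = (D);               \
    sp->ret = L_START; continue;                                    \
    case LBL:

/* "Return": leave the value in score, pop, re-dispatch the caller. */
#define RETURN(V) { score = (V); if (sp == stack) return score;     \
                    sp--; continue; }

static int evaluate(void) { return 0; }   /* stand-in leaf evaluation */

int search(int root_depth) {
    Frame stack[64], *sp = stack;
    int score = 0;
    sp->depth = root_depth; sp->alpha = -32000; sp->beta = 32000;
    sp->ret = L_START;

    for (;;) switch (sp->ret) {
    case L_START:
        if (sp->depth == 0) RETURN(evaluate());
        for (sp->move = 0; sp->move < 3; sp->move++) {  /* toy: 3 moves */
            SEARCH(-sp->beta, -sp->alpha, sp->depth - 1, L_AFTER_CHILD);
            /* execution resumes here when the child RETURNs */
            if (-score > sp->alpha) sp->alpha = -score;
            if (sp->alpha >= sp->beta) break;           /* beta cutoff */
        }
        RETURN(sp->alpha);
    }
}

The case label planted inside the move loop is what lets the switch resume a frame in mid-loop; it is also why every new search call needs its own label and enum element, which is exactly the tedium described above.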
Zach Wegner
 
Posts: 182
Joined: 26 Sep 2004, 22:02
Location: Austin, Texas, USA

Re: Tuning parallel algorithms; Some questions

Postby Daniel Shawul » 06 Oct 2007, 11:44

My speedup is about 1.65 for two processors. It is not that good, I guess, mainly because of selective extensions and pruning.
positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near horizont

negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).

I tried splitting at ALL nodes even before a single move is searched, and it was not much better. Maybe because the classification of nodes cannot be done unless the node is really searched. So right now I just split after the first move is searched.
I have a question regarding result interpretation. Are positions like those which give a super-linear speedup, or where the single-CPU result is faster, or where the single CPU searches more nodes (I can't fully explain this), considered in the calculation of the average speedup?
Here is a result of a Scorpio run on single and dual CPUs. The test was done with sd 14.

Can you give me those 100 positions you use for the test?

Code: Select all
processors [1]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
58283899    93.63   622524        0        0
27433866    44.08   622393        0        0
28851711    47.52   607199        0        0
249761880   315.53   791560        0        0
24464960    37.25   656777        0        0
40290193    65.91   611328        0        0
60405632   100.69   599934        0        0
28857209    50.52   571248        0        0
22668171    33.16   683682        0        0
178305924   236.08   755283        0        0
329401859   444.95   740307        0        0
227672095   362.27   628468        0        0
30761111    51.03   602792        0        0
82495233   143.56   574627        0        0
processors [2]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
44606997    42.30  1054613     1251       53
26761655    25.34  1055936     1033       46
27040064    26.67  1013799     1656      114
241241788   191.00  1263046     2180      213
22801847    21.61  1055201     1341      111
50013215    48.48  1031540     2398      186
25312763    25.67   986006     2089      140
40773098    42.16   967195     2007      169
19681101    17.81  1104872     1763      119
241089275   199.66  1207517     1408      148
363072863   306.84  1183249     2417      257
99259641    92.39  1074343     2518      259
26021375    26.02  1000206     1794       99
90086426    93.28   965742     2959      255
Daniel Shawul
 
Posts: 366
Joined: 28 Sep 2004, 09:33
Location: Ethiopia

Re: Tuning parallel algorithms; Some questions

Postby bob » 06 Oct 2007, 22:29

Daniel Shawul wrote:My speedup is about 1.65 for two processors. It is not that good, I guess, mainly because of selective extensions and pruning.
positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near horizont

negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).

I tried splitting at ALL nodes even before a single move is searched, and it was not much better. Maybe because the classification of nodes cannot be done unless the node is really searched. So right now I just split after the first move is searched.
I have a question regarding result interpretation. Are positions like those which give a super-linear speedup, or where the single-CPU result is faster, or where the single CPU searches more nodes (I can't fully explain this), considered in the calculation of the average speedup?
Here is a result of a Scorpio run on single and dual CPUs. The test was done with sd 14.

Can you give me those 100 positions you use for the test?

Code: Select all
processors [1]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
58283899    93.63   622524        0        0
27433866    44.08   622393        0        0
28851711    47.52   607199        0        0
249761880   315.53   791560        0        0
24464960    37.25   656777        0        0
40290193    65.91   611328        0        0
60405632   100.69   599934        0        0
28857209    50.52   571248        0        0
22668171    33.16   683682        0        0
178305924   236.08   755283        0        0
329401859   444.95   740307        0        0
227672095   362.27   628468        0        0
30761111    51.03   602792        0        0
82495233   143.56   574627        0        0
processors [2]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
44606997    42.30  1054613     1251       53
26761655    25.34  1055936     1033       46
27040064    26.67  1013799     1656      114
241241788   191.00  1263046     2180      213
22801847    21.61  1055201     1341      111
50013215    48.48  1031540     2398      186
25312763    25.67   986006     2089      140
40773098    42.16   967195     2007      169
19681101    17.81  1104872     1763      119
241089275   199.66  1207517     1408      148
363072863   306.84  1183249     2417      257
99259641    92.39  1074343     2518      259
26021375    26.02  1000206     1794       99
90086426    93.28   965742     2959      255


I first choose a set of positions, then I run and average them all. You need to include the good, the bad and the ugly, or your speedup value won't match what you get in real games...
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Volker Böhm » 09 Oct 2007, 12:31

Hi,

I use 100 well-known positions. As I calculate all of them to a fixed depth, I overweight positions with a large number of nodes, since I compare the total time to calculate all 100 positions.

Greetings Volker
Volker Böhm, Spike Team
Volker Böhm
 
Posts: 66
Joined: 27 Sep 2004, 18:52
Location: Germany

Re: Tuning parallel algorithms; Some questions

Postby bob » 09 Oct 2007, 15:43

bob wrote:
Daniel Shawul wrote:My speedup is about 1.65 for two processors. It is not that good, I guess, mainly because of selective extensions and pruning.
positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near horizont

negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).

I tried splitting at ALL nodes even before a single move is searched, and it was not much better. Maybe because the classification of nodes cannot be done unless the node is really searched. So right now I just split after the first move is searched.
I have a question regarding result interpretation. Are positions like those which give a super-linear speedup, or where the single-CPU result is faster, or where the single CPU searches more nodes (I can't fully explain this), considered in the calculation of the average speedup?
Here is a result of a Scorpio run on single and dual CPUs. The test was done with sd 14.

Can you give me those 100 positions you use for the test?

Code: Select all
processors [1]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
58283899    93.63   622524        0        0
27433866    44.08   622393        0        0
28851711    47.52   607199        0        0
249761880   315.53   791560        0        0
24464960    37.25   656777        0        0
40290193    65.91   611328        0        0
60405632   100.69   599934        0        0
28857209    50.52   571248        0        0
22668171    33.16   683682        0        0
178305924   236.08   755283        0        0
329401859   444.95   740307        0        0
227672095   362.27   628468        0        0
30761111    51.03   602792        0        0
82495233   143.56   574627        0        0
processors [2]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
44606997    42.30  1054613     1251       53
26761655    25.34  1055936     1033       46
27040064    26.67  1013799     1656      114
241241788   191.00  1263046     2180      213
22801847    21.61  1055201     1341      111
50013215    48.48  1031540     2398      186
25312763    25.67   986006     2089      140
40773098    42.16   967195     2007      169
19681101    17.81  1104872     1763      119
241089275   199.66  1207517     1408      148
363072863   306.84  1183249     2417      257
99259641    92.39  1074343     2518      259
26021375    26.02  1000206     1794       99
90086426    93.28   965742     2959      255


I first choose a set of positions, then I run and average them all. You need to include the good, the bad and the ugly, or your speedup value won't match what you get in real games...


The above looks strange. Any time you do "stops", you add search overhead for whatever was done before you stopped the thread. Yet in many cases your parallel search is searching _fewer_ total nodes than the serial search. That's unusual... It might suggest a bug where you are failing to search some of the parallel branches, or something similar.
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Daniel Shawul » 10 Oct 2007, 07:32

I may have a bug, but I don't think Scorpio would have survived in the tournaments if it skipped searching part of the tree. At lower depth (sd=10) it also gives me results where some of the positions take longer to search on the dual CPUs. I cannot quite explain this. You mentioned something like this could happen in your DTS paper, but I don't think that is the case here. Here is the result at sd = 10.
Code: Select all
  Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
  698881     1.13   621227        0        0
 1098954     1.64   669685        0        0
 2811444     4.41   638094        0        0
 9713942    12.03   807342        0        0
 2121500     3.30   643463        0        0
 1798382     2.89   622277        0        0
 2508329     4.08   615088        0        0
 1956868     3.23   605092        0        0
 2650588     3.88   684022        0        0
 4807810     6.47   743207        0        0
 8522757    12.06   706579        0        0
 8352038    12.77   654240        0        0
 1057393     1.64   644358        0        0
 6230894    11.30   551552        0        0
processors [2]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
  789580     0.81   972389      218        8
 1121777     1.08  1040609      192        8
 2826711     2.78  1016071      458       30
 8604605     6.86  1254315      238       15
 2361708     2.33  1014479      349       36
 1782910     1.83   975333      367       37
 4032714     4.06   992790      507       42
 2084108     2.16   966654      470       49
 2323604     2.17  1069799      520       54
 9168502     7.75  1183032      385       60
 6263702     5.58  1122929      474       60
10273661    10.14  1013081      618       81
 1272470     1.25  1017976      463       21
 5880992     6.45   911357      501       82

Ratios for NPS, time, and total nodes searched. The last row gives the average values.
Code: Select all
Node ratio   Time ratio   NPS ratio (%)
1.129777459   1.395061728   78.26358159
1.02076793   1.518518519   77.6939158
1.005430306   1.586330935   79.61765821
0.885799503   1.753644315   77.68176312
1.113225548   1.416309013   78.82962968
0.991396711   1.579234973   78.36807402
1.607729289   1.004926108   80.70308639
1.06502227   1.49537037   79.8766138
0.876637184   1.788018433   78.19916611
1.907001733   0.83483871   79.58967017
0.734938471   2.161290323   79.46238142
1.230078335   1.259368836   77.42426327
1.203403087   1.312   78.99149231
0.943844013   1.751937984   82.61750479
      
1.122503703   1.489775018   79.09420005

There are 5 cases where the dual CPUs search fewer nodes than the single CPU, and those are the ones that give good speedups :shock:
I am also worried about the NPS scaling. In this thread you mentioned that the NPS scaling should preferably be close to 100%, but mine is just 80%.
Daniel Shawul
 
Posts: 366
Joined: 28 Sep 2004, 09:33
Location: Ethiopia

Re: Tuning parallel algorithms; Some questions

Postby bob » 10 Oct 2007, 21:39

Reducing the size of the tree in a parallel search happens, but it should not be very common. If it is, you should be seeing speedups > N for N processors, because in addition to doing N times as much work per unit of time, you are doing _less_ work as well.

The first thing that jumps out is that your NPS is not scaling very well... Here are two test runs from my Core 2 Duo laptop (2.0 GHz):

The first position is Kopec 23, searched to depth=16, a fairly typical middlegame position, nothing unusual going on. NPS scales at almost 2.0X, as it should. The speedup is OK.

log.001: time=50.00 mat=0 n=102189338 fh=91% nps=2.0M
log.002: time=31.89 mat=0 n=124583865 fh=91% nps=3.9M

The next is Kopec 22, also to depth 16. An easier position, as the best move is easy to find and much better than the rest. NPS scaling is perfect at 2.0X, and it shows a super-linear speedup, which is common on this particular position although it doesn't happen every time.

log.001: time=23.94 mat=0 n=49646102 fh=93% nps=2.1M
log.002: time=8.74 mat=0 n=36307528 fh=93% nps=4.2M

Here are four more 2-cpu runs for comparison, all on the Kopec 22 position, so I won't re-run the single-cpu test since it would not change.

log.001: time=9.01 mat=0 n=37422282 fh=93% nps=4.2M
log.002: time=9.29 mat=0 n=38404624 fh=93% nps=4.1M
log.003: time=8.91 mat=0 n=36990603 fh=93% nps=4.2M
log.004: time=8.96 mat=0 n=37177598 fh=93% nps=4.1M

You should probably first get your NPS to scale as close to 2.0 as possible before worrying about anything else, as that gives you an instant upper bound on overall parallel search speed before you even look at how your parallel search itself performs...
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Daniel Shawul » 16 Oct 2007, 15:27

I was able to increase the NPS scaling to 90%. After I read what you wrote in this thread, that the NPS issue is related to cache-coherency problems, I remembered that I have many volatile variables that I use for counting!

Code: Select all
VOLATILE UBMP32 poll_nodes;
VOLATILE UBMP32 nodes;
VOLATILE UBMP32 qnodes;
VOLATILE UBMP32 time_check;
VOLATILE UBMP32 full_evals;
VOLATILE UBMP32 lazy_evals;
VOLATILE UBMP32 splits;
VOLATILE UBMP32 bad_splits;
VOLATILE UBMP32 egbb_probes;

After making these non-volatile and instead adding up the counts from the helper threads, the NPS really increased. The fact that it is still 90% means there is still some work to do, but now I know where to look :D
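The standard remedy is to give each thread its own counters, padded out to a cache line so no two threads ever share one, and to sum them only when the search finishes. A sketch with invented names, not Scorpio's actual code:

Code: Select all
#define MAX_THREADS 16
#define CACHE_LINE  64

typedef struct {
    unsigned int nodes, qnodes, splits, bad_splits;
    /* pad to a full cache line so threads never write to the same one */
    char pad[CACHE_LINE - 4 * sizeof(unsigned int)];
} PerThreadCounters;

static PerThreadCounters counters[MAX_THREADS];

unsigned int total_nodes(int nthreads) {
    unsigned int sum = 0;
    for (int i = 0; i < nthreads; i++)
        sum += counters[i].nodes;   /* aggregate once, at the end */
    return sum;
}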

One other thing I do is that the updating of bounds, killer moves and the PV to the master thread is done whenever a helper finds a better move, not only when it runs out of work. I think in Crafty you do the update only once, at the end; this, I think, is to reduce synchronization overhead.
I wrongly thought that communication only needed to be minimal on message-passing systems; I now realize that cache misses can cause significant overhead on SMP systems as well. Getting this done will, I think, put in the final nail.

A possible explanation for the frequent phenomenon that the dual-CPU search tends to search fewer nodes is the frequent bound updates mentioned above: a smaller tree to search, but at a lower NPS. This is an optimization trade-off, I guess. Which one do you think is better?

A result of the newer version on the Bratko-Kopec positions, which you use as a benchmark in Crafty, with sd = 16, gives the following:
Code: Select all
Nodes   Time   NPS   splits   bad   overhead   speedup   nps scale
=====   ====   ===   ======   ===   ========   =======   =========
32504569   38.42   845988   0   0      1.049852838   1.552950687   1.630784361
119754852   183.92   651117   0   0      1.086837175   1.606428509   1.746022604
3842825   3.61   1064494   0   0      1.033329387   1.271126761   1.311645721
114497531   178.64   640936   0   0      0.831446889   2.06305578   1.715272664
83860114   139.63   600609   0   0      0.792151439   2.191305712   1.735811485
108824852   183.63   592647   0   0      1.200338228   1.505410723   1.80689348
processors   [2]                     
                  0.992125314   1.783830282   1.72695
Nodes   Time   NPS   splits   bad            
=====   ====   ===   ======   ===            
34125014   24.74   1379624   2318   172            
130154025   114.49   1136865   6990   470            
3970904   2.84   1396239   1228   110
95198616   86.59   1099380   2968   275
66429910   63.72   1042544   3208   177
130626630   121.98   1070850   3403   270


The average values at the right are calculated ignoring the 3rd position, which took only a few seconds to search to depth 16. The NPS scale is approximately 1.73/2 = 86% (but I had it at 92% in other positions, so no problem) and the speedup is 1.78. You can see that I get super-linear speedups in positions where fewer nodes are searched by the dual CPUs. As in this result, I think the NPS scaling (1.73) will not be an upper bound on the speedup (1.78) if a lot of communication is done between the processors.
Daniel Shawul
 
Posts: 366
Joined: 28 Sep 2004, 09:33
Location: Ethiopia

