Tuning parallel algorithms; Some questions

Programming Topics (Computer Chess) and technical aspects such as test techniques, book building, program tuning, etc.

Moderator: Andres Valverde

Tuning parallel algorithms; Some questions

Postby Volker Böhm » 28 Sep 2007, 10:32

Hi,

Currently I am working on tuning the parallel algorithm of Spike using a quad-core (Opteron) computer - no, I haven't got one of the new 4-core Opteron chips.

I am currently not able to achieve a speedup > 3. As far as I remember, Bob Hyatt wrote somewhere that he gained a speedup of about 3.8.

I do think that a strong move reduction algorithm will cut down the speedup, as only a few moves have large node counts.

Measurement of speedup:
For speedup measurement I use 100 test positions and calculate all of them to a fixed depth. I do that with the single-CPU version and compare the runtime to the 4-CPU version. I currently use depth 14; that gives me about 10 seconds per position.

I wonder which speedup you get and what are your "triggers" for speedup.
In Spike I have the following results:

Average node factor: ca. 1.19
Average nps factor: ca. 3.5
Average speedup: ca. 2.9

The results vary between 2.7 and 3.1; the 2.9 is an average of 30 searches over all 100 test positions.
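For anyone reproducing this kind of measurement: the three figures are linked by speedup = NPS factor / node factor, and 3.5 / 1.19 is indeed about 2.9. A minimal sketch of the bookkeeping in C, with invented names (RunTotals, report), not Spike's code:

Code: Select all
#include <stdio.h>

/* Totals accumulated over the whole 100-position suite at fixed depth. */
typedef struct {
    double nodes;    /* total nodes searched   */
    double seconds;  /* total wall-clock time  */
} RunTotals;

void report(RunTotals serial, RunTotals parallel) {
    double node_factor = parallel.nodes / serial.nodes;      /* search overhead */
    double nps_factor  = (parallel.nodes / parallel.seconds)
                       / (serial.nodes  / serial.seconds);   /* NPS scaling     */
    double speedup     = serial.seconds / parallel.seconds;  /* what matters    */
    printf("node factor %.2f  nps factor %.2f  speedup %.2f\n",
           node_factor, nps_factor, speedup);
}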

The NPS factor gets better with longer searches, pushing the speedup to a little more than 3.

I do lose time waiting for a suitable split point. There are ways to increase the NPS factor by using bad split points, but this has a trade-off in the node factor.

positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near horizont

negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).

Debugging is hard. Lately I found a bug in terminating helper threads on beta cutoffs at split points. So maybe there are bugs left.

About Spike MP Algorithm:
Spike uses one thread for every CPU and one thread for input handling. Threads without work scan for work offers from the threads that have work. If there are multiple work offers, they try to pick the best one.
A thread that has finished its work is able to help its helpers. Alpha >= beta terminates every thread in this branch: helper, helper's helper, ...
I currently don't terminate calculations in PV nodes on an alpha increase to restart them with smaller windows; my last implementation of that didn't gain a speedup.
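The "terminate every thread in this branch" step can be pictured as a recursive walk over the helper tree. A sketch under invented names (Thread, stop_subtree), not Spike's actual code:

Code: Select all
#define MAX_HELPERS 4

typedef struct Thread {
    volatile int stop;                    /* polled inside the search    */
    struct Thread *helpers[MAX_HELPERS];  /* threads splitting below us  */
    int nhelpers;
} Thread;

void stop_subtree(Thread *t) {
    t->stop = 1;                          /* stop this helper...         */
    for (int i = 0; i < t->nhelpers; i++)
        stop_subtree(t->helpers[i]);      /* ...and its helpers' helpers */
}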

What are your experiences with mp speedups?

Greetings Volker
Volker Böhm, Spike Team
Volker Böhm
 
Posts: 66
Joined: 27 Sep 2004, 18:52
Location: Germany

Re: Tuning parallel algorithms; Some questions

Postby bob » 28 Sep 2007, 17:02

Volker Böhm wrote:Hi,

Currently I am working on tuning the parallel algorithm of Spike using a quad-core (Opteron) computer - no, I haven't got one of the new 4-core Opteron chips.

I am currently not able to achieve a speedup > 3. As far as I remember, Bob Hyatt wrote somewhere that he gained a speedup of about 3.8.


Wasn't me. My speedup roughly follows this:

speedup = 1 + 0.7 * (NCPUS - 1)

For 4 processors that comes to 3.1 or so. Tests posted here a couple of years ago put it closer to 3.3, but the formula is not far off.
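(For other counts, the same formula gives 1 + 0.7*1 = 1.7 on 2 CPUs and 1 + 0.7*7 = 5.9 on 8 CPUs.)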




I do think that a strong move reduction algorithm will cut down the speedup, as only a few moves have large node counts.

Measurement of speedup:
For speedup measurement I use 100 test positions and calculate all of them to a fixed depth. I do that with the single-CPU version and compare the runtime to the 4-CPU version. I currently use depth 14; that gives me about 10 seconds per position.

I wonder which speedup you get and what are your "triggers" for speedup.
In Spike I have the following results:

Average node factor: ca. 1.19
Average nps factor: ca. 3.5
Average speedup: ca. 2.9

The results vary between 2.7 and 3.1; the 2.9 is an average of 30 searches over all 100 test positions.

The NPS factor gets better with longer searches, pushing the speedup to a little more than 3.


Sounds like you are doing pretty well. The first thing to do is to get your NPS up to 4X the single-CPU NPS. That can be done. Otherwise you are losing half of one CPU before you get started, and you have already capped yourself at a max speedup of 3.5... This is called "NPS scaling", and 100% is desirable. Usually it is a cache-based issue that you just have to work on, avoiding unintended interactions between processors.



I do lose time waiting for a suitable split point. There are ways to increase the NPS factor by using bad split points, but this has a trade-off in the node factor.

positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near horizont


The second two above are not good ideas. You _never_ want to split at a CUT node. Never. Just because you have winning captures left is also not a good reason to split. It only takes one winning capture to produce a cutoff, so why search N of them in parallel and waste the time?

In Crafty, the last point is tunable. There is a split cost that has to be absorbed, and splitting too near the tips increases the number of splits, and hence the cost that has to be absorbed...
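As an illustration of that depth threshold, the split test might look roughly like this; MIN_SPLIT_DEPTH and the field names are invented, not Crafty's actual code:

Code: Select all
#define MIN_SPLIT_DEPTH 4   /* don't split with fewer plies remaining */

typedef struct {
    int remaining_depth;    /* plies left below this node   */
    int moves_searched;     /* moves already completed here */
} Node;

int split_allowed(const Node *n, int idle_threads) {
    /* Too near the tips: the subtree is too small to absorb
       the cost of setting up a split point. */
    if (n->remaining_depth < MIN_SPLIT_DEPTH) return 0;
    /* Young Brothers Wait: the first move is searched serially. */
    if (n->moves_searched == 0) return 0;
    return idle_threads > 0;
}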


negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).

Debugging is hard. Lately I found a bug in terminating helper threads on beta cutoffs at split points. So maybe there are bugs left.

About Spike MP Algorithm:
Spike uses one thread for every CPU and one thread for input handling. Threads without work scan for work offers from the threads that have work. If there are multiple work offers, they try to pick the best one.
A thread that has finished its work is able to help its helpers. Alpha >= beta terminates every thread in this branch: helper, helper's helper, ...
I currently don't terminate calculations in PV nodes on an alpha increase to restart them with smaller windows; my last implementation of that didn't gain a speedup.

What are your experiences with mp speedups?

Greetings Volker
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Volker Böhm » 29 Sep 2007, 20:29

Hi Bob,

thanks for your hints. But I think I have some other problems with Spike. I've done more time analysis (using AMD CodeAnalyst). When starting a search I have an NPS factor of 3 after 1 second; it rises to 3.5 after 10 seconds and to 3.7 after 1 minute. Longer searches don't get better.

In 1-minute searches I lose 1 - 3.7/4 = 7.5% of total CPU time compared to the optimal factor of 4. These 7.5% break down into the following "problems":
1%: Time not used by Spike (kernel, ...)
1%: Hash set and get, compared to the single-CPU version
1%: Move generator (Spike has a lazy move generator that is not suitable for MP split positions; thus at potential split positions I generate all moves)
3-4%: Time to find and compare split positions.
The rest: don't know.

In a one-second search, the time to find and compare split positions increases to 20% of total CPU time.

Thus my problem is that I use too much time finding split positions. There are many cases where there are no split positions, so I think a large amount of time is spent looping until a split position is available. Maybe I am doing something non-optimal here?

I do not split:
1. At the root
2. In the last two plies if futility pruning is possible
3. At a PV node if not at least one move is finished
4. At an ALL node if not at least one move is finished
5. At a CUT node if there is any other split possibility, or if not at least one move is finished

Not using CUT nodes at all increases my problem of threads doing nothing.
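Spelled out as code, rules 1-5 amount to a predicate roughly like this sketch (all names invented, not Spike's implementation):

Code: Select all
enum NodeType { PV_NODE, CUT_NODE, ALL_NODE };

typedef struct {
    enum NodeType type;
    int ply;                /* distance from the root    */
    int remaining_depth;    /* plies left to the horizon */
    int moves_finished;     /* moves fully searched here */
    int futility_possible;
} SplitCandidate;

int may_split(const SplitCandidate *n, int better_split_exists) {
    if (n->ply == 0) return 0;                            /* rule 1    */
    if (n->remaining_depth <= 2 && n->futility_possible)
        return 0;                                         /* rule 2    */
    if (n->moves_finished < 1) return 0;                  /* rules 3-5 */
    if (n->type == CUT_NODE && better_split_exists)
        return 0;                                         /* rule 5    */
    return 1;
}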

How does Crafty behave on this issue - how much time is spent doing nothing in a one-second search? (2*2 Opteron, 2 GHz)

Greetings Volker
Volker Böhm, Spike Team
Volker Böhm
 
Posts: 66
Joined: 27 Sep 2004, 18:52
Location: Germany

Re: Tuning parallel algorithms; Some questions

Postby bob » 30 Sep 2007, 03:44

Volker Böhm wrote:Hi Bob,

thanks for your hints. But I think I have some other problems with Spike. I've done more time analysis (using AMD CodeAnalyst). When starting a search I have an NPS factor of 3 after 1 second; it rises to 3.5 after 10 seconds and to 3.7 after 1 minute. Longer searches don't get better.


OK. That sounds like you have excessive waiting at shallow iterations, which slowly drops as you get to deeper iterations. I generally split until remaining depth is 3-4 plies or less (the number varies depending on the processors and speed, but 3-4 is typical). If I make that more conservative, I get fewer wasted splits, and lower NPS as well, and eventually I reach a point where I get a longer time to solution. My test is always to run several different kinds of test positions and choose the split criteria that give me the lowest time to fixed depth for the entire set.


In 1-minute searches I lose 1 - 3.7/4 = 7.5% of total CPU time compared to the optimal factor of 4. These 7.5% break down into the following "problems":
1%: Time not used by Spike (kernel, ...)



What is that? Crafty generally gets 399% out of a 4 cpu system. I see no reason for 1%+ kernel overhead in a chess engine that ought to be computing chess rather than making lots of system calls.


1%: Hash set and get, compared to the single-CPU version


What is that? I don't have any multi-cpu hash overhead so I am not sure what you mean there...

1%: Move generator (Spike has a lazy move generator that is not suitable for MP split positions; thus at potential split positions I generate all moves)


Why won't that work on split nodes? I generate captures and then non-captures, even at split points...

3-4%: Time to find and compare split positions.
The rest: don't know.

In a one-second search, the time to find and compare split positions increases to 20% of total CPU time.

Thus my problem is that I use too much time finding split positions. There are many cases where there are no split positions, so I think a large amount of time is spent looping until a split position is available. Maybe I am doing something non-optimal here?

I do not split:
1. At the root


Unfortunately, that is the _best_ place to split of all. It is the only guaranteed ALL node you can find in the entire tree, in fact. I've been splitting there since the beginning...


2. In the last two plies if futility pruning is possible
3. At a PV node if not at least one move is finished
4. At an ALL node if not at least one move is finished
5. At a CUT node if there is any other split possibility, or if not at least one move is finished

Not using CUT nodes at all increases my problem of threads doing nothing.


The main issue there is how accurate your ALL-node identification code is. If it is good, then splitting at a CUT node is worthless, since you should be right there as well. And splitting at a real CUT node is no good, for obvious reasons...
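For readers following along, the usual static prediction of node types is the Knuth-Moore recurrence. A sketch (neither engine necessarily classifies nodes exactly this way):

Code: Select all
enum NodeType { PV_NODE, CUT_NODE, ALL_NODE };

/* Expected type of a child, before it is searched: the first child of a
   PV node is PV and later ones are expected CUT; children of CUT nodes
   are expected ALL, and children of ALL nodes expected CUT. Predictions
   get revised when a node misbehaves, e.g. a CUT node whose first move
   fails low. */
enum NodeType child_type(enum NodeType parent, int first_child) {
    switch (parent) {
    case PV_NODE:  return first_child ? PV_NODE : CUT_NODE;
    case CUT_NODE: return ALL_NODE;
    default:       return CUT_NODE;   /* parent is an ALL node */
    }
}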


How does Crafty behave on this issue - how much time is spent doing nothing in a one-second search? (2*2 Opteron, 2 GHz)



I no longer measure this because measuring CPU time is difficult across a wide variety of platforms. But in years gone by, the number was _very_ low. I used to see 399% cpu utilization in blitz games played on ICC. I don't currently have code to measure the amount of time spent "idle".


Greetings Volker
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Volker Böhm » 01 Oct 2007, 08:48

Hi,

with "doing nothing" I didn´t mean having idle time, but waiting for split points in an endless loop consuming cpu power. I do have a little idle problem too but compared to the "no split position problem".

Maybe splitting at the root solves many of my problems. I will try that next. Some thoughts:
When the first move is searched at the root, I get a "guaranteed" ALL node at every first answer move (the leftmost branch of the search tree). Thus there are nearly always suitable split points while searching the first root move. (Not really a "guaranteed ALL node" for Spike, as I already use a shrunk search window at the root.) Thus the problem should exist only at every root move but the first. The answer move to a "non-first" root move is typically a cut move - thus again no split point.
Maybe my waiting for split points will be drastically reduced by splitting at the root. Maybe even some of the idle time will be lower.
I didn't split at the root before because I would have to handle the "parallel output problem": I am no longer able to tell the GUI which node is currently being searched, as there may be several.

On the issue "not splitting at cut nodes". I am not sure if this is allways right. Imagine the following situation:

Most cut nodes I split at have a hash move (at least 3-4 plies away from the horizon, thus enough opportunities to store a move in the hash).
I do not split before the first move is fully searched.

Thus in effect I split at a node that should be a cut node, but whose hash move does not lead to a cut. Maybe it is still a cut node, but I have to search some more moves to find out.
Here I have the following possibilities:
1. The second move leads to a cutoff (bad idea to split).
2. Some later move leads to a cutoff (not such a bad idea to split: with strong late move reductions most nodes search only a few early moves, thus not such a big difference from an ALL node).
3. It is indeed an ALL node. Then it is a very good idea to split here.

Analyzing split positions, I found that probable ALL nodes which turn out to be cut nodes often search all moves, or at least a large number of moves. The reason is that the move with a value >= beta needs such a large number of nodes that all the other moves are searched in parallel (late moves are searched up to two plies less deep, having far fewer nodes).
Thus maybe it is a good idea to find those nodes that start as cut nodes but turn out to be ALL nodes very quickly, before searching other moves at the split position below.
Volker Böhm, Spike Team
Volker Böhm
 
Posts: 66
Joined: 27 Sep 2004, 18:52
Location: Germany

Re: Tuning parallel algorithms; Some questions

Postby bob » 01 Oct 2007, 17:18

Volker Böhm wrote:Hi,

with "doing nothing" I didn´t mean having idle time, but waiting for split points in an endless loop consuming cpu power. I do have a little idle problem too but compared to the "no split position problem".

Maybe splitting at the root solves many of my problems. I will try that next. Some thoughts:
When the first move is searched at the root, I get a "guaranteed" ALL node at every first answer move (the leftmost branch of the search tree). Thus there are nearly always suitable split points while searching the first root move. (Not really a "guaranteed ALL node" for Spike, as I already use a shrunk search window at the root.) Thus the problem should exist only at every root move but the first. The answer move to a "non-first" root move is typically a cut move - thus again no split point.



A couple of points. First, ply=2 is not guaranteed to be an ALL node. In fact, except for the first root move, those nodes are hopefully all going to be CUT nodes.

However, even for the first root move, you don't want to split at ply=2 initially, as you would be searching those moves with the wrong bounds until the first ply=2 move is completed. The main advantage of splitting at the root is that all of your search overhead then comes from the first move only; the remaining root moves are guaranteed to have to be searched. The only issue is when one of them fails high, as you need to stop all active searches and get all processors together on this new best move to resolve the score as soon as possible.


Maybe my waiting for split points will be drastically reduced by splitting at the root. Maybe even some of the idle time will be lower.
I didn't split at the root before because I would have to handle the "parallel output problem": I am no longer able to tell the GUI which node is currently being searched, as there may be several.


There are issues. Displaying PVs is but one. Multiple simultaneous fail-highs are another. And making sure that before you "time out" you at least complete every search currently in progress at the root, to avoid missing a new best move.
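The last point needs only a little bookkeeping per root move. A sketch under assumed names (RootMove, safe_to_stop), not Crafty's actual code:

Code: Select all
#define MAX_ROOT_MOVES 256

typedef struct {
    int move;
    volatile int searching;   /* some thread is on this move right now */
} RootMove;

/* Honor a timeout only when no root move is half-searched; stopping
   in the middle of one could miss a new best move. */
int safe_to_stop(const RootMove *list, int n) {
    for (int i = 0; i < n; i++)
        if (list[i].searching)
            return 0;
    return 1;
}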


On the issue "not splitting at cut nodes". I am not sure if this is allways right. Imagine the following situation:

Most cut nodes I split at have a hash move (at least 3-4 plies away from the horizon, thus enough opportunities to store a move in the hash).
I do not split before the first move is fully searched.


OK. We usually call those "weak CUT nodes". The first move didn't cause a fail-high, but some move will, because move ordering is not optimal. The risk of parallel searching there is still high, but not nearly as high as at a real CUT node. That is what the YBW idea addresses: for most programs, over 90% of the time, if a node is a CUT node it fails high on the first branch. Which means that "weak CUT nodes" might well turn into ALL nodes; at the least, they are going to need more than one move searched.
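That 90% figure is easy to measure in any engine; a sketch with invented counter names:

Code: Select all
static long cutoffs_total, cutoffs_on_first_move;

void record_cutoff(int move_index) {      /* call at every fail-high  */
    cutoffs_total++;
    if (move_index == 0)
        cutoffs_on_first_move++;
}

double first_move_fh_rate(void) {         /* ~0.9+ with good ordering */
    return cutoffs_total
         ? (double)cutoffs_on_first_move / cutoffs_total : 0.0;
}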


Thus in effect I split at a node that should be a cut node, but whose hash move does not lead to a cut. Maybe it is still a cut node, but I have to search some more moves to find out.
Here I have the following possibilities:
1. The second move leads to a cutoff (bad idea to split).
2. Some later move leads to a cutoff (not such a bad idea to split: with strong late move reductions most nodes search only a few early moves, thus not such a big difference from an ALL node).
3. It is indeed an ALL node. Then it is a very good idea to split here.

Analyzing split positions, I found that probable ALL nodes which turn out to be cut nodes often search all moves, or at least a large number of moves. The reason is that the move with a value >= beta needs such a large number of nodes that all the other moves are searched in parallel (late moves are searched up to two plies less deep, having far fewer nodes).


Correct. But that extra work hurts. If you could do this perfectly, you would have all processors searching together on the one move that is going to fail high. Of course that is impossible to do, but it would be nice.

Thus maybe it is a good idea to find those nodes that start as cut nodes but turn out to be ALL nodes very quickly, before searching other moves at the split position below.
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Tord Romstad » 03 Oct 2007, 21:39

bob wrote:
Volker Böhm wrote:I do not split:
1. At the root


Unfortunately, that is the _best_ place to split of all. It is the only guaranteed ALL node you can find in the entire tree, in fact. I've been splitting there since the beginning...

Intuitively, I would also expect the root to be the best place to split. In practice, every time I have tried splitting at the root (I must have tried it at least a dozen times, with slightly different implementations each time), my search has turned out to be slightly, but measurably, weaker than when only splitting at internal nodes. I am not sure whether this is because splitting at the root screws up my time management, or whether there is some other explanation.

What experiences do the rest of you have with splitting at the root?

Tord
Tord Romstad
 
Posts: 639
Joined: 09 Oct 2004, 12:49
Location: Oslo, Norway

Re: Tuning parallel algorithms; Some questions

Postby Zach Wegner » 04 Oct 2007, 23:43

My experience is that splitting at the root would be an implementation nightmare, because I have an iterative search. I would need a completely different set of functions for splitting at the root node, as it is outside the internal stack. I've heard mixed things about splitting at the root, so I thought it would be fine to just leave it out.
Zach Wegner
 
Posts: 182
Joined: 26 Sep 2004, 22:02
Location: Austin, Texas, USA

Re: Tuning parallel algorithms; Some questions

Postby bob » 05 Oct 2007, 03:04

Tord Romstad wrote:
bob wrote:
Volker Böhm wrote:I do not split:
1. At the root


Unfortunately, that is the _best_ place to split of all. It is the only guaranteed ALL node you can find in the entire tree, in fact. I've been splitting there since the beginning...

Intuitively, I would also expect the root to be the best place to split. In practice, every time I have tried splitting at the root (I must have tried it at least a dozen times, with slightly different implementations each time), my search has turned out to be slightly, but measurably, weaker than when only splitting at internal nodes. I am not sure whether this is because splitting at the root screws up my time management, or whether there is some other explanation.

What experiences do the rest of you have with splitting at the root?

Tord


Here's the two effects that can be easily measured.

(1) Splitting at the root introduces the lowest possible level of search overhead. Trees searched to a fixed depth will always be smaller (with the possible exception of some very rare, pathological positions, of course) than trees where each root move is searched one at a time using all processors.

(2) Searching each root move one at a time offers a potentially quicker way to change to a new best move, assuming an iteration can be interrupted in the middle when time runs out. Splitting at the root has the risk of getting one CPU onto the move that would become the best, and it could get stuck there. If you don't take special care (I do) to make sure that once you start searching any single root move you guarantee that you will finish it before giving up on time, then you can miss a best move. But in doing this, you will have significantly higher search overhead...

In Crafty this can be turned on and off if you want to see the difference on some test cases: "smproot=0" stops splitting at the root, "smproot=1" (the default) splits at the root. I have run several matches to test this, and splitting at the root, the way I have it implemented, is better in terms of results against several programs... YMMV.
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Pradu » 05 Oct 2007, 03:23

Zach Wegner wrote:My experience is that splitting at the root would be an implementation nightmare, because I have an iterative search. I would need a completely different set of functions for splitting at the root node, as it is outside the internal stack. I've heard mixed things about splitting at the root, so I thought it would be fine to just leave it out.
Do you still implement your iterative search with switches? I used computed gotos to make mine look identical to the recursive search (I just made defines like RETURN() and SEARCH()), but apparently the only compilers that support them are the GCC-compatible ones. Do you think there's a less tedious way to implement an iterative search without switches, making it look just like a recursive search? With switches, you have to create a unique label in your code and put the label in a switch table, which can get tedious. Is there a "cross-compiler" way to do it without having to manually create labels or update switch tables?
Pradu
 
Posts: 343
Joined: 12 Jan 2005, 19:17
Location: Chandler, Arizona, USA

Re: Tuning parallel algorithms; Some questions

Postby Zach Wegner » 05 Oct 2007, 18:44

Yes, I use switches. What I do is have a switch at the beginning of a while loop that just specifies a goto label to jump to. I also have SEARCH and RETURN macros, so it looks pretty close to a recursive search. I posted a very simplified version of my iterative search a while back, so you can see what I mean:
http://wbforum.vpittlik.org/viewtopic.p ... 902&t=6273

It is rather tedious to have to create a new label, a new enum element, and a new string equivalent to the enum. However, I don't often add new search calls, so I don't mind.
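To make the technique concrete, here is a compressed sketch of such a switch-driven search. Everything in it (Frame, L_AFTER_CHILD, the fixed three moves per node) is invented for illustration; a real frame also carries move lists, hash keys and so on:

Code: Select all
enum Label { L_START, L_AFTER_CHILD };

typedef struct { int depth, alpha, beta, move; enum Label ret; } Frame;

/* "Call": save our resume label, push a child frame, re-dispatch. */
#define SEARCH(A, B, D, LBL)                                        \
    sp->ret = (LBL); sp++;                                          \
    sp->alpha = (A); sp->beta = (B); sp->depth = (D);               \
    sp->ret = L_START; continue;                                    \
    case LBL:

/* "Return": leave the value in score, pop, re-dispatch the caller. */
#define RETURN(V) { score = (V); if (sp == stack) return score;     \
                    sp--; continue; }

static int evaluate(void) { return 0; }   /* stand-in leaf evaluation */

int search(int root_depth) {
    Frame stack[64], *sp = stack;
    int score = 0;
    sp->depth = root_depth; sp->alpha = -32000; sp->beta = 32000;
    sp->ret = L_START;

    for (;;) switch (sp->ret) {
    case L_START:
        if (sp->depth == 0) RETURN(evaluate());
        for (sp->move = 0; sp->move < 3; sp->move++) {  /* toy: 3 moves */
            SEARCH(-sp->beta, -sp->alpha, sp->depth - 1, L_AFTER_CHILD);
            /* execution resumes here when the child RETURNs */
            if (-score > sp->alpha) sp->alpha = -score;
            if (sp->alpha >= sp->beta) break;           /* beta cutoff */
        }
        RETURN(sp->alpha);
    }
}

The case label planted inside the move loop is what lets the switch resume a frame in mid-loop; it is also why every new search call needs its own label and enum element, which is exactly the tedium described above.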
Zach Wegner
 
Posts: 182
Joined: 26 Sep 2004, 22:02
Location: Austin, Texas, USA

Re: Tuning parallel algorithms; Some questions

Postby Daniel Shawul » 06 Oct 2007, 11:44

My speedup is about 1.65 for two processors. It is not that good, I guess, mainly because of selective extensions and pruning.
positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near horizont

negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).

I tried splitting at ALL nodes even before a single move is searched, and it was not much better. Maybe because the classification of nodes cannot be done unless the node is really searched. So right now I just split after the first move is searched.
I have a question regarding result interpretation. Are positions like those which give a super-linear speedup, or where the single-CPU result is faster, or where the single CPU searches more nodes (I can't fully explain this), considered in the calculation of the average speedup?
Here is a result of a Scorpio run on single and dual CPUs. The test was done with sd 14.

Can you give me those 100 positions you use for the test?

Code: Select all
processors [1]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
58283899    93.63   622524        0        0
27433866    44.08   622393        0        0
28851711    47.52   607199        0        0
249761880   315.53   791560        0        0
24464960    37.25   656777        0        0
40290193    65.91   611328        0        0
60405632   100.69   599934        0        0
28857209    50.52   571248        0        0
22668171    33.16   683682        0        0
178305924   236.08   755283        0        0
329401859   444.95   740307        0        0
227672095   362.27   628468        0        0
30761111    51.03   602792        0        0
82495233   143.56   574627        0        0
processors [2]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
44606997    42.30  1054613     1251       53
26761655    25.34  1055936     1033       46
27040064    26.67  1013799     1656      114
241241788   191.00  1263046     2180      213
22801847    21.61  1055201     1341      111
50013215    48.48  1031540     2398      186
25312763    25.67   986006     2089      140
40773098    42.16   967195     2007      169
19681101    17.81  1104872     1763      119
241089275   199.66  1207517     1408      148
363072863   306.84  1183249     2417      257
99259641    92.39  1074343     2518      259
26021375    26.02  1000206     1794       99
90086426    93.28   965742     2959      255
Daniel Shawul
 
Posts: 366
Joined: 28 Sep 2004, 09:33
Location: Ethiopia

Re: Tuning parallel algorithms; Some questions

Postby bob » 06 Oct 2007, 22:29

Daniel Shawul wrote:My speedup is about 1.65 for two processors. It is not that good, I guess, mainly because of selective extensions and pruning.
positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near horizont

negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).

I tried splitting at ALL nodes even before a single move is searched, and it was not much better. Maybe because the classification of nodes cannot be done unless the node is really searched. So right now I just split after the first move is searched.
I have a question regarding result interpretation. Are positions like those which give a super-linear speedup, or where the single-CPU result is faster, or where the single CPU searches more nodes (I can't fully explain this), considered in the calculation of the average speedup?
Here is a result of a Scorpio run on single and dual CPUs. The test was done with sd 14.

Can you give me those 100 positions you use for the test?

Code: Select all
processors [1]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
58283899    93.63   622524        0        0
27433866    44.08   622393        0        0
28851711    47.52   607199        0        0
249761880   315.53   791560        0        0
24464960    37.25   656777        0        0
40290193    65.91   611328        0        0
60405632   100.69   599934        0        0
28857209    50.52   571248        0        0
22668171    33.16   683682        0        0
178305924   236.08   755283        0        0
329401859   444.95   740307        0        0
227672095   362.27   628468        0        0
30761111    51.03   602792        0        0
82495233   143.56   574627        0        0
processors [2]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
44606997    42.30  1054613     1251       53
26761655    25.34  1055936     1033       46
27040064    26.67  1013799     1656      114
241241788   191.00  1263046     2180      213
22801847    21.61  1055201     1341      111
50013215    48.48  1031540     2398      186
25312763    25.67   986006     2089      140
40773098    42.16   967195     2007      169
19681101    17.81  1104872     1763      119
241089275   199.66  1207517     1408      148
363072863   306.84  1183249     2417      257
99259641    92.39  1074343     2518      259
26021375    26.02  1000206     1794       99
90086426    93.28   965742     2959      255


I first choose a set of positions, then I run and average them all. You need to include the good, the bad and the ugly, or your speedup value won't match what you get in real games...
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Volker Böhm » 09 Oct 2007, 12:31

Hi,

I use 100 well-known positions. As I calculate all of them to a fixed depth, I overweight positions with a large number of nodes, since I compare the total time to calculate all 100 positions.

Greetings Volker
Volker Böhm, Spike Team
Volker Böhm
 
Posts: 66
Joined: 27 Sep 2004, 18:52
Location: Germany

Re: Tuning parallel algorithms; Some questions

Postby bob » 09 Oct 2007, 15:43

bob wrote:
Daniel Shawul wrote:My speedup is about 1.65 for two processors. It is not that good, I guess, mainly because of selective extensions and pruning.
positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near horizont

negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).

I tried splitting at ALL nodes even before a single move is searched, and it was not much better. Maybe because the classification of nodes cannot be done unless the node is really searched. So right now I just split after the first move is searched.
I have a question regarding result interpretation. Are positions like those which give a super-linear speedup, or where the single-CPU result is faster, or where the single CPU searches more nodes (I can't fully explain this), considered in the calculation of the average speedup?
Here is a result of a Scorpio run on single and dual CPUs. The test was done with sd 14.

Can you give me those 100 positions you use for the test?

Code: Select all
processors [1]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
58283899    93.63   622524        0        0
27433866    44.08   622393        0        0
28851711    47.52   607199        0        0
249761880   315.53   791560        0        0
24464960    37.25   656777        0        0
40290193    65.91   611328        0        0
60405632   100.69   599934        0        0
28857209    50.52   571248        0        0
22668171    33.16   683682        0        0
178305924   236.08   755283        0        0
329401859   444.95   740307        0        0
227672095   362.27   628468        0        0
30761111    51.03   602792        0        0
82495233   143.56   574627        0        0
processors [2]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
44606997    42.30  1054613     1251       53
26761655    25.34  1055936     1033       46
27040064    26.67  1013799     1656      114
241241788   191.00  1263046     2180      213
22801847    21.61  1055201     1341      111
50013215    48.48  1031540     2398      186
25312763    25.67   986006     2089      140
40773098    42.16   967195     2007      169
19681101    17.81  1104872     1763      119
241089275   199.66  1207517     1408      148
363072863   306.84  1183249     2417      257
99259641    92.39  1074343     2518      259
26021375    26.02  1000206     1794       99
90086426    93.28   965742     2959      255


I first choose a set of positions, then I run and average them all. You need to include the good, the bad and the ugly, or your speedup value won't match what you get in real games...


The above looks strange. Any time you do "stops", you add search overhead for whatever was done before you stopped the thread. Yet in many cases your parallel search is searching _fewer_ total nodes than the serial search. That's unusual... It might suggest a bug where you are failing to search some of the parallel branches, or something similar.
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Daniel Shawul » 10 Oct 2007, 07:32

I may have a bug, but I don't think Scorpio would have survived in the tournaments if it skipped searching part of the tree. At lower depth (sd=10) it also gives me results where some of the positions take longer to search on the dual CPUs. I cannot quite explain this. You mentioned something like this could happen in your DTS paper, but I don't think that is the case here. Here is the result at sd = 10.
Code: Select all
  Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
  698881     1.13   621227        0        0
 1098954     1.64   669685        0        0
 2811444     4.41   638094        0        0
 9713942    12.03   807342        0        0
 2121500     3.30   643463        0        0
 1798382     2.89   622277        0        0
 2508329     4.08   615088        0        0
 1956868     3.23   605092        0        0
 2650588     3.88   684022        0        0
 4807810     6.47   743207        0        0
 8522757    12.06   706579        0        0
 8352038    12.77   654240        0        0
 1057393     1.64   644358        0        0
 6230894    11.30   551552        0        0
processors [2]

   Nodes     Time    NPS        splits   bad
   =====     ====    ===        ======   ===
  789580     0.81   972389      218        8
 1121777     1.08  1040609      192        8
 2826711     2.78  1016071      458       30
 8604605     6.86  1254315      238       15
 2361708     2.33  1014479      349       36
 1782910     1.83   975333      367       37
 4032714     4.06   992790      507       42
 2084108     2.16   966654      470       49
 2323604     2.17  1069799      520       54
 9168502     7.75  1183032      385       60
 6263702     5.58  1122929      474       60
10273661    10.14  1013081      618       81
 1272470     1.25  1017976      463       21
 5880992     6.45   911357      501       82

Ratios for NPS, time, and total nodes searched. The last row gives the average values.
Code: Select all
Node ratio   Time ratio   NPS ratio (%)
1.129777459   1.395061728   78.26358159
1.02076793   1.518518519   77.6939158
1.005430306   1.586330935   79.61765821
0.885799503   1.753644315   77.68176312
1.113225548   1.416309013   78.82962968
0.991396711   1.579234973   78.36807402
1.607729289   1.004926108   80.70308639
1.06502227   1.49537037   79.8766138
0.876637184   1.788018433   78.19916611
1.907001733   0.83483871   79.58967017
0.734938471   2.161290323   79.46238142
1.230078335   1.259368836   77.42426327
1.203403087   1.312   78.99149231
0.943844013   1.751937984   82.61750479
      
1.122503703   1.489775018   79.09420005

There are 5 cases where the dual CPUs search fewer nodes than the single CPU, and those are the ones that give good speedups :shock:
I am also worried about the NPS scaling. In this thread you mentioned that the NPS scaling should preferably be close to 100%, but mine is just 80%.
Daniel Shawul
 
Posts: 366
Joined: 28 Sep 2004, 09:33
Location: Ethiopia

Re: Tuning parallel algorithms; Some questions

Postby bob » 10 Oct 2007, 21:39

Reducing the size of the tree in a parallel search happens, but it should not be very common. If it is, you should be seeing speedups > N for N processors, because in addition to doing N times as much work per unit of time, you are doing _less_ work as well.

The first thing that jumps out is that your NPS is not scaling very well... Here are two test runs from my Core 2 Duo laptop (2.0 GHz):

The first position is Kopec 23, searched to depth=16, a fairly typical middlegame position, nothing unusual going on. NPS scales at almost 2.0X, as it should. The speedup is OK.

log.001: time=50.00 mat=0 n=102189338 fh=91% nps=2.0M
log.002: time=31.89 mat=0 n=124583865 fh=91% nps=3.9M

The next is Kopec 22, also to depth 16. An easier position, as the best move is easy to find and much better than the rest. NPS scaling is perfect at 2.0X, and it shows a super-linear speedup, which is common on this particular position although it doesn't happen every time.

log.001: time=23.94 mat=0 n=49646102 fh=93% nps=2.1M
log.002: time=8.74 mat=0 n=36307528 fh=93% nps=4.2M

Here are four more 2-cpu runs for comparison, all on the Kopec 22 position, so I won't re-run the single-cpu test since it would not change.

log.001: time=9.01 mat=0 n=37422282 fh=93% nps=4.2M
log.002: time=9.29 mat=0 n=38404624 fh=93% nps=4.1M
log.003: time=8.91 mat=0 n=36990603 fh=93% nps=4.2M
log.004: time=8.96 mat=0 n=37177598 fh=93% nps=4.1M

You should probably first get your NPS to scale as close to 2.0 as possible before worrying about anything else, as that gives you an instant upper bound on overall parallel search speed before you even look at how your parallel search itself performs...
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Tuning parallel algorithms; Some questions

Postby Daniel Shawul » 16 Oct 2007, 15:27

I was able to increase the NPS scaling to 90%. After I read what you wrote in this thread, that the NPS issue is related to cache-coherency problems, I remembered that I have many volatile variables that I use for counting!

Code: Select all
VOLATILE UBMP32 poll_nodes;
VOLATILE UBMP32 nodes;
VOLATILE UBMP32 qnodes;
VOLATILE UBMP32 time_check;
VOLATILE UBMP32 full_evals;
VOLATILE UBMP32 lazy_evals;
VOLATILE UBMP32 splits;
VOLATILE UBMP32 bad_splits;
VOLATILE UBMP32 egbb_probes;

After making these non-volatile and instead adding up the counts from the helper threads, the NPS really increased. The fact that it is still 90% means there is still some work to do, but now I know where to look :D
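The standard remedy is to give each thread its own counters, padded out to a cache line so no two threads ever share one, and to sum them only when the search finishes. A sketch with invented names, not Scorpio's actual code:

Code: Select all
#define MAX_THREADS 16
#define CACHE_LINE  64

typedef struct {
    unsigned int nodes, qnodes, splits, bad_splits;
    /* pad to a full cache line so threads never write to the same one */
    char pad[CACHE_LINE - 4 * sizeof(unsigned int)];
} PerThreadCounters;

static PerThreadCounters counters[MAX_THREADS];

unsigned int total_nodes(int nthreads) {
    unsigned int sum = 0;
    for (int i = 0; i < nthreads; i++)
        sum += counters[i].nodes;   /* aggregate once, at the end */
    return sum;
}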

One other thing I do is that the updating of bounds, killer moves and the PV to the master thread is done whenever a helper finds a better move, not only when it runs out of work. I think in Crafty you do the update only once, at the end; this, I think, is to reduce synchronization overhead.
I wrongly thought that communication only needed to be minimal on message-passing systems; I now realize that cache misses can cause significant overhead on SMP systems as well. Getting this done will, I think, put in the final nail.

A possible explanation for the frequent phenomenon that the dual-CPU search tends to search fewer nodes is the frequent bound updates mentioned above: a smaller tree to search, but at a lower NPS. This is an optimization trade-off, I guess. Which one do you think is better?

A result of the newer version on the Bratko-Kopec positions, which you use as a benchmark in Crafty, with sd = 16, gives the following:
Code: Select all
Nodes   Time   NPS   splits   bad   overhead   speedup   nps scale
=====   ====   ===   ======   ===   ========   =======   =========
32504569   38.42   845988   0   0      1.049852838   1.552950687   1.630784361
119754852   183.92   651117   0   0      1.086837175   1.606428509   1.746022604
3842825   3.61   1064494   0   0      1.033329387   1.271126761   1.311645721
114497531   178.64   640936   0   0      0.831446889   2.06305578   1.715272664
83860114   139.63   600609   0   0      0.792151439   2.191305712   1.735811485
108824852   183.63   592647   0   0      1.200338228   1.505410723   1.80689348
processors   [2]                     
                  0.992125314   1.783830282   1.72695
Nodes   Time   NPS   splits   bad            
=====   ====   ===   ======   ===            
34125014   24.74   1379624   2318   172            
130154025   114.49   1136865   6990   470            
3970904   2.84   1396239   1228   110
95198616   86.59   1099380   2968   275
66429910   63.72   1042544   3208   177
130626630   121.98   1070850   3403   270


The average values at the right are calculated ignoring the 3rd position, which took only a few seconds to search to depth 16. The NPS scale is approximately 1.73/2 = 86% (but I had it at 92% in other positions, so no problem) and the speedup is 1.78. You can see that I get super-linear speedups in positions where fewer nodes are searched by the dual CPUs. As in this result, I think the NPS scaling (1.73) will not be an upper bound on the speedup (1.78) if a lot of communication is done between the processors.
Daniel Shawul
 
Posts: 366
Joined: 28 Sep 2004, 09:33
Location: Ethiopia

