Volker Böhm wrote:Hi,
currently I am working on tuning the parallel algorithm of Spike using a quad-core (Opteron) computer - no, I haven't got one of the new 4-core Opteron chips.
I am currently not able to achieve a speedup > 3. As far as I remember, Bob Hyatt wrote somewhere that he gained a speedup of about 3.8.
I do think that a strong move reduction algorithm will cut down the speedup, as only few moves have large node counts.
Measurement of speedup:
For speedup measurement I use the 100 test positions and calculate all of them to a fixed depth. I do that with the single-CPU version and compare the runtime to the 4-CPU version. I currently use depth 14, which gives me about 10 seconds per position.
I wonder what speedup you get and what your "triggers" for speedup are.
In Spike I have the following results:
Average node factor: approx. 1.19
Average NPS factor: approx. 3.5
Average speedup: approx. 2.9
The results vary between 2.7 and 3.1, so the 2.9 is an average over 30 runs of all 100 test positions.
The NPS factor gets better with longer searches, pushing the speedup to a little more than 3.
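A rough sketch of how the three figures above can be computed per position and averaged (my own illustration with made-up numbers, not Spike's code):
Code:
/* Run each position to a fixed depth once with 1 CPU and once with 4 CPUs:
 *   node factor = nodes(4 cpu) / nodes(1 cpu)     (search overhead)
 *   NPS factor  = nps(4 cpu)   / nps(1 cpu)       (hardware scaling)
 *   speedup     = time(1 cpu)  / time(4 cpu)      (time to depth)
 * Per position, speedup == NPS factor / node factor. */

#include <stdio.h>

typedef struct { double nodes, seconds; } Run;

int main(void)
{
    /* two example positions, 1-CPU run vs. 4-CPU run (made-up numbers) */
    Run cpu1[] = { { 9.7e6, 12.0 }, { 6.2e6, 11.3 } };
    Run cpu4[] = { { 11.2e6, 4.1 }, { 7.4e6,  3.9 } };
    const int n = 2;

    double node_f = 0, nps_f = 0, speedup = 0;
    for (int i = 0; i < n; i++) {
        double nps1 = cpu1[i].nodes / cpu1[i].seconds;
        double nps4 = cpu4[i].nodes / cpu4[i].seconds;
        node_f  += cpu4[i].nodes / cpu1[i].nodes;      /* search overhead  */
        nps_f   += nps4 / nps1;                        /* hardware scaling */
        speedup += cpu1[i].seconds / cpu4[i].seconds;  /* time to depth    */
    }
    printf("node factor %.2f  NPS factor %.2f  speedup %.2f\n",
           node_f / n, nps_f / n, speedup / n);
    return 0;
}
Since speedup equals NPS factor divided by node factor for each position, the reported averages are consistent: 3.5 / 1.19 lands near the measured 2.9.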
I do lose time waiting for a suitable split point. There are ways to increase the NPS factor by using bad split points, but this has a tradeoff in the node factor.
positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near the horizon
negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).
Debugging is hard. Lately I found a bug in terminating helper threads on beta cutoffs at split points, so maybe there are bugs left.
About Spike MP Algorithm:
Spike uses one thread for every CPU and one thread for input handling. Threads without work scan for work offers from the threads that have work. If there are multiple work offers, they try to pick the best one.
A thread that has finished its work is able to help its own helpers. Alpha >= beta terminates every thread in this branch: helpers, helpers' helpers, ...
I currently don't terminate calculations in PV nodes on an alpha increase and restart them with smaller windows. My last implementation of that didn't gain any speedup.
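A minimal sketch of how such a scheme can look (my own reading of the description above, not Spike's actual code; the split-point fields and the "deepest offer wins" rule are assumptions):
Code:
#include <stdbool.h>

#define MAX_SPLITS 64

/* One work offer published by a busy thread. */
typedef struct SplitPoint {
    volatile bool active;          /* owner is offering work here              */
    volatile bool stop;            /* set on alpha >= beta: abort this subtree */
    int depth;                     /* used to rank competing offers            */
    int moves_left;                /* untried moves at this node               */
    struct SplitPoint *parent;     /* split point this helper works under      */
} SplitPoint;

SplitPoint split_pool[MAX_SPLITS];

/* Idle thread: scan the published offers and pick the "best" one
 * (here simply the one with the most remaining depth). */
SplitPoint *find_work(void)
{
    SplitPoint *best = 0;
    for (int i = 0; i < MAX_SPLITS; i++) {
        SplitPoint *sp = &split_pool[i];
        if (sp->active && !sp->stop && sp->moves_left > 0)
            if (!best || sp->depth > best->depth)
                best = sp;
    }
    return best;
}

/* True if 'sp' lies in the subtree below 'ancestor'
 * (covers helpers, helpers' helpers, ...). */
bool in_subtree(const SplitPoint *sp, const SplitPoint *ancestor)
{
    for (const SplitPoint *p = sp; p; p = p->parent)
        if (p == ancestor) return true;
    return false;
}

/* Beta cutoff at a split point: stop every thread working below it. */
void abort_subtree(SplitPoint *cut)
{
    for (int i = 0; i < MAX_SPLITS; i++)
        if (in_subtree(&split_pool[i], cut))
            split_pool[i].stop = true;
}
In a real engine the scan, the attach, and the counters would need locks or atomics; the sketch only shows the shape of the idle loop and how a beta cutoff is propagated down the helper chain.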
What are your experiences with mp speedups?
Greetings Volker
Volker Böhm wrote:Hi Bob,
thanks for your hints. But I think I have some other problems with Spike. I've done more time analysis (using AMD CodeAnalyst). When starting a search I have an NPS factor of 3 after 1 second; it gets to 3.5 after 10 seconds and to 3.7 after 1 minute. Longer searches don't improve it further.
In 1-minute searches I lose 1 - 3.7/4 = 7.5% of total CPU time compared to the optimal factor of 4. These 7.5% split into the following "problems":
1%: Time not used for spike (kernel, ...)
1%: Hash set and get compared to the single version
1%: Move generator (Spike has a lazy move generator that is not suitable for MP split positions; thus at potential split positions I generate all moves)
3-4%: time to find and compare split positions
The rest: don't know
In a one-second search the time to find and compare split positions increases to 20% of total CPU time.
Thus my problem is that I use too much time to find split positions. There are many cases where there are no split positions available, so I think a large amount of time is spent looping until a split position becomes present. Maybe I am doing something suboptimal here?
I do not split:
1. At the root
2. In the last two plies if futility pruning is possible
3. At PV nodes if not at least one move is finished
4. At All nodes if not at least one move is finished
5. At Cut nodes if there is any other split possibility or if there is not at least one move finished
Not using Cut nodes at all increases my problem of threads doing nothing. (A rough sketch of these rules as a single test follows below.)
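Here is the sketch: the five rules above written as one eligibility test (my illustration, not Spike's code; the node-type and futility flags are assumed to be available at the node):
Code:
#include <stdbool.h>

enum node_type { PV_NODE, CUT_NODE, ALL_NODE };

typedef struct {
    enum node_type type;
    int ply;                        /* 0 = root                             */
    int depth;                      /* remaining depth in plies             */
    int moves_searched;             /* moves already finished at this node  */
    bool futility_possible;         /* futility pruning may apply here      */
    bool other_split_available;     /* a non-CUT split point is on offer    */
} Node;

bool may_split(const Node *n)
{
    if (n->ply == 0)
        return false;                              /* rule 1: not at the root      */
    if (n->depth <= 2 && n->futility_possible)
        return false;                              /* rule 2: last two plies       */
    if (n->moves_searched < 1)
        return false;                              /* rules 3-5: one finished move */
    if (n->type == CUT_NODE && n->other_split_available)
        return false;                              /* rule 5: CUT only as fallback */
    return true;
}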
How does Crafty behave on this issue - how much time is spent doing nothing in a one-second search? (2x2 Opteron, 2 GHz)
Greetings Volker
Volker Böhm wrote:Hi,
with "doing nothing" I didn´t mean having idle time, but waiting for split points in an endless loop consuming cpu power. I do have a little idle problem too but compared to the "no split position problem".
Maybe splitting at root solves much of my problems. Will try that next. Some thoughts:
When the first move is searched at root I get a "guaranteed" all node at every first answer move (the leftmost branch at the search tree). Thus there are nearly allways suitable split points when searching the first node at root. (Not really "guaranteed all node" for spike as I use a shrunk search window allready at root). Thus the problem should be only at every but the first root move. The answer move of a "non-first" root move is typically a cut move - thus again no split point.
Maybe my wait for split point will be drastically reduced with splitting at root. Maybe even some of the idle time will be lower.
I didn't split at the root before because I would have to handle the "parallel output problem": I am no longer really able to tell the GUI which root move is currently being searched, as there may be several at once.
On the issue of "not splitting at cut nodes": I am not sure this is always right. Imagine the following situation:
Most cut nodes I split at have a hash move (they are at least 3-4 plies away from the horizon, so there are enough opportunities to store a move in the hash).
I do not split before the first move is fully searched.
Thus in effect I split at a node that should be a cut node, but whose hash move does not lead to a cut. Maybe there is still a cutoff move, but I have to search some more moves to find it.
Here I have the following possibilities:
1. The second move leads to a cutoff (bad idea to split).
2. Some later move leads to a cutoff (not such a bad idea to split: with strong late move reductions most of the nodes are spent on a few early moves, so there is not such a big difference from an All node).
3. It is indeed an All node. Then it is a very good idea to split here.
Analyzing split positions, I found that probable All nodes which turn out to be cut nodes often search all moves, or at least a large fraction of them. The reason is that the move with a value >= beta needs so many nodes that all the other moves are searched in parallel meanwhile (late moves are searched up to two plies shallower and thus need far fewer nodes).
So maybe it is a good idea to detect very quickly those nodes that start as cut nodes but turn out to be All nodes, before searching other moves below the split position.
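One way to express that last idea (my own sketch, not something any engine in this thread is confirmed to do): only offer a predicted cut node as a split point once its hash move has been searched and failed to cut, i.e. re-classify it on the fly:
Code:
enum node_type { PV_NODE, CUT_NODE, ALL_NODE };

typedef struct {
    enum node_type predicted;   /* prediction before searching any move */
    int moves_searched;         /* moves already completed here         */
    int best_value;             /* best score found so far              */
} NodeState;

/* Returns 1 if an expected CUT node should now be treated like an
 * ALL node for splitting purposes: the hash move has been searched
 * and did not produce a cutoff. */
int reclassify_as_all(const NodeState *n, int beta)
{
    return n->predicted == CUT_NODE
        && n->moves_searched >= 1
        && n->best_value < beta;
}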
bob wrote:Volker Böhm wrote:I do not split:
1. At root
Unfortunately, that is the _best_ place of all to split. That is the only guaranteed ALL node you can find in the entire tree, in fact. I've been splitting there since the beginning...
Tord Romstad wrote:bob wrote:Volker Böhm wrote:I do not split:
1. At root
Unfortunately, that is the _best_ place of all to split. That is the only guaranteed ALL node you can find in the entire tree, in fact. I've been splitting there since the beginning...
Intuitively, I would also expect the root to be the best place to split. In practice, every time I have tried splitting at the root (I must have tried it at least a dozen times, with slightly different implementations each time), my search has always turned out to be slightly, but measurably weaker than when only splitting at internal nodes. I am not sure whether this is because splitting at the root screws up my time management, or if there is some other explanation.
What experiences do the rest of you have with splitting at the root?
Tord
Zach Wegner wrote:My experience is that splitting at the root would be an implementational nightmare because I have an iterative search. I would need a completely different set of functions for splitting at the root node, as it is outside the internal stack. I've heard mixed things about splitting at the root, so I thought it would be fine to just leave it out.
Do you still implement your iterative search with switches? I used computed gotos to make mine look identical to the recursive search (I just made defines like RETURN() and SEARCH()), but apparently the only compilers that support them are the GCC-compatible ones. Do you think there's a less tedious way to implement an iterative search without switches, so that it looks just like a recursive search? With switches you have to create a unique label in your code and put the label in a switch table, which can get tedious. Is there a "cross-compiler" way to do it without having to manually create labels or update switch tables?
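For reference, a minimal sketch of the switch-based state machine being discussed (not Zach's code; the move interface is a stand-in). The resume labels play the role of the return addresses a recursive search gets for free, which is exactly the bookkeeping computed gotos hide:
Code:
#include <stdio.h>

#define MAX_PLY 64

/* Where to continue when control comes back to a frame. */
enum resume_point { START, AFTER_CHILD };

typedef struct {
    int alpha, beta, depth;
    int best;
    int move_index;               /* which move we are on at this ply */
    enum resume_point resume;     /* "return address" for this frame  */
} Frame;

/* Stand-in move interface, just enough to make the sketch run. */
static int move_count(int depth) { return depth > 0 ? 3 : 0; }
static int evaluate(void)        { return 0; }

static int search_iterative(int alpha, int beta, int depth)
{
    Frame stack[MAX_PLY];
    int ply = 0, value = 0;

    stack[0] = (Frame){ alpha, beta, depth, -32000, 0, START };

    for (;;) {
        Frame *f = &stack[ply];

        switch (f->resume) {
        case START:
            if (f->depth == 0) {              /* leaf: RETURN(evaluate()) */
                value = evaluate();
                goto do_return;
            }
            /* fall through into the move loop */
        case AFTER_CHILD:
            if (f->resume == AFTER_CHILD) {   /* just came back from a child */
                if (-value > f->best) f->best = -value;
                if (f->best > f->alpha) f->alpha = f->best;
                if (f->alpha >= f->beta) {    /* beta cutoff */
                    value = f->best;
                    goto do_return;
                }
                f->move_index++;
            }
            if (f->move_index >= move_count(f->depth)) {
                value = f->best;              /* all moves done: RETURN(best) */
                goto do_return;
            }
            /* SEARCH(-beta, -alpha, depth - 1): push a child frame and
             * remember to resume at AFTER_CHILD when it is done. */
            f->resume = AFTER_CHILD;
            ply++;
            stack[ply] = (Frame){ -f->beta, -f->alpha, f->depth - 1,
                                  -32000, 0, START };
            continue;
        }

    do_return:
        if (ply == 0)
            return value;
        ply--;                                /* pop back to the parent frame */
    }
}

int main(void)
{
    printf("%d\n", search_iterative(-32000, 32000, 4));
    return 0;
}
A computed-goto version replaces the enum and the switch with label pointers (GCC's &&label and goto *p), which is why it only works on GCC-compatible compilers.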
positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near horizont
negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).
processors [1]
Nodes Time NPS splits bad
===== ==== === ====== ===
58283899 93.63 622524 0 0
27433866 44.08 622393 0 0
28851711 47.52 607199 0 0
249761880 315.53 791560 0 0
24464960 37.25 656777 0 0
40290193 65.91 611328 0 0
60405632 100.69 599934 0 0
28857209 50.52 571248 0 0
22668171 33.16 683682 0 0
178305924 236.08 755283 0 0
329401859 444.95 740307 0 0
227672095 362.27 628468 0 0
30761111 51.03 602792 0 0
82495233 143.56 574627 0 0
processors [2]
Nodes Time NPS splits bad
===== ==== === ====== ===
44606997 42.30 1054613 1251 53
26761655 25.34 1055936 1033 46
27040064 26.67 1013799 1656 114
241241788 191.00 1263046 2180 213
22801847 21.61 1055201 1341 111
50013215 48.48 1031540 2398 186
25312763 25.67 986006 2089 140
40773098 42.16 967195 2007 169
19681101 17.81 1104872 1763 119
241089275 199.66 1207517 1408 148
363072863 306.84 1183249 2417 257
99259641 92.39 1074343 2518 259
26021375 26.02 1000206 1794 99
90086426 93.28 965742 2959 255
Daniel Shawul wrote:My speedup is about 1.65 for two processors. It is not that good, I guess, mainly because of selective extensions and pruning.
positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near the horizon
negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).
I tried splitting at ALL nodes even before a single move is searched, and it was not much better. Maybe that is because the classification of nodes cannot be done until the node is really searched. So right now I just split after the first move is searched.
I have a question regarding result interpretation. Are positions like those which give super-linear speedup, or where the single-CPU result is faster, or where the single CPU searches more nodes (I can't fully explain this), considered in the calculation of the average speedup?
Here is a result of a Scorpio run on single and dual CPUs. The test is done with sd 14.
Can you give me those 100 positions you use for the test?
Code:
processors [1]
Nodes Time NPS splits bad
===== ==== === ====== ===
58283899 93.63 622524 0 0
27433866 44.08 622393 0 0
28851711 47.52 607199 0 0
249761880 315.53 791560 0 0
24464960 37.25 656777 0 0
40290193 65.91 611328 0 0
60405632 100.69 599934 0 0
28857209 50.52 571248 0 0
22668171 33.16 683682 0 0
178305924 236.08 755283 0 0
329401859 444.95 740307 0 0
227672095 362.27 628468 0 0
30761111 51.03 602792 0 0
82495233 143.56 574627 0 0
processors [2]
Nodes Time NPS splits bad
===== ==== === ====== ===
44606997 42.30 1054613 1251 53
26761655 25.34 1055936 1033 46
27040064 26.67 1013799 1656 114
241241788 191.00 1263046 2180 213
22801847 21.61 1055201 1341 111
50013215 48.48 1031540 2398 186
25312763 25.67 986006 2089 140
40773098 42.16 967195 2007 169
19681101 17.81 1104872 1763 119
241089275 199.66 1207517 1408 148
363072863 306.84 1183249 2417 257
99259641 92.39 1074343 2518 259
26021375 26.02 1000206 1794 99
90086426 93.28 965742 2959 255
bob wrote:Daniel Shawul wrote:My speedup is about 1.65 for two processors. It is not that good, I guess, mainly because of selective extensions and pruning.
positive "tradeoffs" (i.e. nps increase is worth more than the tradeoff in node-factor):
*Splitting at cut nodes if the node has probed at least one move.
*Splitting at nodes with winning captures left.
*Splitting near the horizon
negative "tradeoffs"
*Splitting at nodes where no move has been played (even at all nodes).
I tried splitting at ALL nodes even before a single move is searched, and it was not much better. Maybe that is because the classification of nodes cannot be done until the node is really searched. So right now I just split after the first move is searched.
I have a question regarding result interpretation. Are positions like those which give super-linear speedup, or where the single-CPU result is faster, or where the single CPU searches more nodes (I can't fully explain this), considered in the calculation of the average speedup?
Here is a result of a Scorpio run on single and dual CPUs. The test is done with sd 14.
Can you give me those 100 positions you use for the test?
Code:
processors [1]
Nodes Time NPS splits bad
===== ==== === ====== ===
58283899 93.63 622524 0 0
27433866 44.08 622393 0 0
28851711 47.52 607199 0 0
249761880 315.53 791560 0 0
24464960 37.25 656777 0 0
40290193 65.91 611328 0 0
60405632 100.69 599934 0 0
28857209 50.52 571248 0 0
22668171 33.16 683682 0 0
178305924 236.08 755283 0 0
329401859 444.95 740307 0 0
227672095 362.27 628468 0 0
30761111 51.03 602792 0 0
82495233 143.56 574627 0 0
processors [2]
Nodes Time NPS splits bad
===== ==== === ====== ===
44606997 42.30 1054613 1251 53
26761655 25.34 1055936 1033 46
27040064 26.67 1013799 1656 114
241241788 191.00 1263046 2180 213
22801847 21.61 1055201 1341 111
50013215 48.48 1031540 2398 186
25312763 25.67 986006 2089 140
40773098 42.16 967195 2007 169
19681101 17.81 1104872 1763 119
241089275 199.66 1207517 1408 148
363072863 306.84 1183249 2417 257
99259641 92.39 1074343 2518 259
26021375 26.02 1000206 1794 99
90086426 93.28 965742 2959 255
I first choose a set of positions, then I run and average them all. You need to include the good, the bad and the ugly, or your speedup value won't match what you get in real games...
processors [1]
Nodes Time NPS splits bad
===== ==== === ====== ===
698881 1.13 621227 0 0
1098954 1.64 669685 0 0
2811444 4.41 638094 0 0
9713942 12.03 807342 0 0
2121500 3.30 643463 0 0
1798382 2.89 622277 0 0
2508329 4.08 615088 0 0
1956868 3.23 605092 0 0
2650588 3.88 684022 0 0
4807810 6.47 743207 0 0
8522757 12.06 706579 0 0
8352038 12.77 654240 0 0
1057393 1.64 644358 0 0
6230894 11.30 551552 0 0
processors [2]
Nodes Time NPS splits bad
===== ==== === ====== ===
789580 0.81 972389 218 8
1121777 1.08 1040609 192 8
2826711 2.78 1016071 458 30
8604605 6.86 1254315 238 15
2361708 2.33 1014479 349 36
1782910 1.83 975333 367 37
4032714 4.06 992790 507 42
2084108 2.16 966654 470 49
2323604 2.17 1069799 520 54
9168502 7.75 1183032 385 60
6263702 5.58 1122929 474 60
10273661 10.14 1013081 618 81
1272470 1.25 1017976 463 21
5880992 6.45 911357 501 82
Node ratio      Time ratio      NPS ratio (% of ideal 2x)
1.129777459 1.395061728 78.26358159
1.02076793 1.518518519 77.6939158
1.005430306 1.586330935 79.61765821
0.885799503 1.753644315 77.68176312
1.113225548 1.416309013 78.82962968
0.991396711 1.579234973 78.36807402
1.607729289 1.004926108 80.70308639
1.06502227 1.49537037 79.8766138
0.876637184 1.788018433 78.19916611
1.907001733 0.83483871 79.58967017
0.734938471 2.161290323 79.46238142
1.230078335 1.259368836 77.42426327
1.203403087 1.312 78.99149231
0.943844013 1.751937984 82.61750479
average:  1.122503703 1.489775018 79.09420005
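/* Shared search counters (presumably Scorpio's globals): declared VOLATILE
   because they are updated concurrently by the search threads. */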
VOLATILE UBMP32 poll_nodes;
VOLATILE UBMP32 nodes;
VOLATILE UBMP32 qnodes;
VOLATILE UBMP32 time_check;
VOLATILE UBMP32 full_evals;
VOLATILE UBMP32 lazy_evals;
VOLATILE UBMP32 splits;
VOLATILE UBMP32 bad_splits;
VOLATILE UBMP32 egbb_probes;
Nodes Time NPS splits bad overhead speedup NPS scale
===== ==== === ====== === ======== ======= =========
32504569 38.42 845988 0 0 1.049852838 1.552950687 1.630784361
119754852 183.92 651117 0 0 1.086837175 1.606428509 1.746022604
3842825 3.61 1064494 0 0 1.033329387 1.271126761 1.311645721
114497531 178.64 640936 0 0 0.831446889 2.06305578 1.715272664
83860114 139.63 600609 0 0 0.792151439 2.191305712 1.735811485
108824852 183.63 592647 0 0 1.200338228 1.505410723 1.80689348
processors [2]
0.992125314 1.783830282 1.72695
Nodes Time NPS splits bad
===== ==== === ====== ===
34125014 24.74 1379624 2318 172
130154025 114.49 1136865 6990 470
3970904 2.84 1396239 1228 110
95198616 86.59 1099380 2968 275
66429910 63.72 1042544 3208 177
130626630 121.98 1070850 3403 270