Tord Romstad wrote:
Hi all,
My first SMP version of Glaurung is now mostly finished (it will need a few weeks of polishing before it is ready for a public release, though). The code is still quite simple and not very general, and will not work with more than 2 CPUs. After much consideration I ended up using threads rather than processes. It just seemed simpler and cleaner to me this way.
Writing the SMP version was a much easier and less time-consuming task than I had imagined. I spent a couple of hours last weekend eliminating global variables and adding new parameters to a few functions, and another couple of hours last night writing and debugging the parallel search itself (a simple implementation of YBWC). The parallel search speedup seems reasonably good; in the few test positions I have tried so far I get an average speedup somewhere around 1.7-2.0. How well it works in games remains to be seen. I am sure there are lots of unexpected bugs and problems waiting to be found.
A few questions to the more experienced SMP programmers:
- I currently don't split at the root. Adding this wouldn't be technically difficult, but it would make my code more complicated and messy. How much can I expect to gain by splitting at the root? Is it worth doing?
- If I should split at the root, how am I supposed to use the "info currmove" command in the UCI interface? How do I tell the GUI that CPU 1 is searching the move 1. e4, while CPU 2 is searching the move 1. d4?
- The main transposition table should obviously be shared by all threads. But what do you do about the other hash tables (pawn hash table, evaluation hash table, etc.)? Should they be shared, or is it better to have a separate hash table for each thread? Currently I have chosen separate tables for each thread, in order to avoid too much locking and unlocking.
Tord
Hi,
Sorry for the late reply. I hardly check this forum.
Feel free to email me parallel search questions. Not many people here are experts.
If a single split is very expensive, then splitting at the root makes sense. I vaguely remember Bob has to copy 44KB or so, and in Cray Blitz he had to copy even 64KB for a single split. Something like that. In all other cases it doesn't make sense to split at the root.
In Diep I copy something like 200 bytes for a split, and the CPUs get put to work instantly, even before the CPU that splits the others in has finished attaching them all.
CPUs can directly start their own search when CPU X splits them in. Up to 32 CPUs can share 1 split point.
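Roughly, a split point is just a small shared struct that the slave CPUs attach to. A sketch (field names and sizes invented for illustration; this is not Diep's actual layout):

    /* Sketch of a small split point; names and sizes are invented
       for illustration, not Diep's actual layout. */
    #include <stdint.h>

    #define MAX_CPUS 32                /* up to 32 CPUs per split point */

    typedef struct {
        volatile int stop;             /* set on a fail high so slaves abort */
        volatile int nslaves;          /* CPUs currently attached            */
        int slaves[MAX_CPUS];          /* ids of the attached CPUs           */
        int alpha, beta, depth, ply;   /* search bounds at the split         */
        uint8_t position[192];         /* the ~200 bytes copied at the split */
        /* plus a lock and a move list for handing out the next move */
    } SplitPoint;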
The reason why splitting at the root is a bad idea is trivial: you could for example get a fail high at the root, which means you'll have to abort all the other CPUs. Several other authors who tried it concluded, like me, that splitting at the root is bad for the speedup.
Of course you'll have to test it on 200+ positions or so and average the results over those to figure this out. Testing just 4 positions or so sucks.
I'm using this approach for the hash tables:
each process has its own eval hash table and its own pawn table.
Suppose we have a 400MB transposition table in total and 4 processes; then each process allocates its 100MB part locally at its own CPU.
That's what we call NUMA.
As you know, it's very hard to get a multithreaded program to scale well under Windows. Multiprocessing is easier there, though it has its obstacles too.
Doing it this way ensures everything works properly.
After that, each process attaches to the other processes' transposition table parts.
I also store qsearch results in the hash table, by the way. A shared hash table like this gives real NUMA performance, as each part is allocated and, more importantly, cleared at the right memory controller, so the requests and stores get divided nicely over the memory controllers.
If you allocate the hash table as 1 big block from a single process, then obviously you don't have a NUMA hash table; 1 controller gets all the requests and all the stores, which is a very bad idea.
All this is rather simple to make.
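To make that concrete, here is a minimal first-touch sketch (using threads to keep the example short, while Diep actually uses processes attaching to shared memory; the sizes and the count of 4 are just the example numbers from above):

    /* Minimal first-touch sketch, not Diep's actual code: each thread
       clears its own slice of the shared table, so on a NUMA box the
       pages land at that thread's local memory controller.
       In practice you would also pin each thread to its CPU. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define NPROCS   4
    #define TT_BYTES (400u * 1024 * 1024)   /* 400MB in total */

    static uint8_t *tt;                     /* shared transposition table */

    static void *init_slice(void *arg) {
        size_t id = (size_t)(uintptr_t)arg;
        size_t slice = TT_BYTES / NPROCS;   /* 100MB per CPU */
        /* the first write decides which memory controller owns the pages */
        memset(tt + id * slice, 0, slice);
        return NULL;
    }

    int main(void) {
        pthread_t th[NPROCS];
        tt = malloc(TT_BYTES);
        for (size_t i = 0; i < NPROCS; i++)
            pthread_create(&th[i], NULL, init_slice, (void *)(uintptr_t)i);
        for (size_t i = 0; i < NPROCS; i++)
            pthread_join(th[i], NULL);
        /* each thread additionally keeps its own private pawn
           and eval hash tables, with no locking at all */
        free(tt);
        return 0;
    }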
Avoiding locking of the hash table is rather easy.
First of all, if you use a 400MB hash table or so, just don't lock at all in your current program. The odds of a failure are very small.
I've tested the odds of a failure at 200 CPUs, with quite a lot of stores per second. The odds that this gives a write error are really small.
So the first idea is to skip locking.
The second idea is adding a small XOR trick. I'm sure Bob will describe it (I didn't read his post, but I can guess what he wrote). He has an article about this Tim Mann trick.
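The trick, roughly (a minimal sketch with an assumed entry layout of two 64-bit words; see Bob's article for the real thing):

    /* Lockless hashing sketch: store the key XORed with the data word.
       If another CPU's store tears our entry, the XOR check fails on
       probe and the entry simply counts as a miss instead of garbage. */
    #include <stdint.h>

    typedef struct {
        uint64_t key_xor_data;   /* hash key ^ data word */
        uint64_t data;           /* move, score, depth, flags packed */
    } HashEntry;

    void tt_store(HashEntry *e, uint64_t key, uint64_t data) {
        e->key_xor_data = key ^ data;   /* two independent 8-byte stores */
        e->data = data;
    }

    int tt_probe(const HashEntry *e, uint64_t key, uint64_t *data) {
        uint64_t d = e->data;
        if ((e->key_xor_data ^ d) != key)
            return 0;                   /* torn or unrelated entry */
        *data = d;
        return 1;
    }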
There are other ways to avoid problems. One is simply staying within a cache line of the processor. Each processor has a write buffer.
If you take care that you stay within the write buffer, then you don't even need the XOR, as you have a 100% guarantee that the write buffer, IF it gets written, gets written at once.
Only on some big high-end supercomputer processors can this go wrong, if some special interrupt gets shipped. Like a control-C.
However, if my program gets a control-C (not sure about yours), mine might already get killed by then, so I don't worry too much about this problem.
The odds of a write error were so tiny that I couldn't even express them in words. Only with a shitload of zeros could I add a 1.
That's when *not* writing within the same cache line, of course.
As Diep is not guaranteed to write within the same cache line, I'm doing a simple XOR.
Note that I don't store the entire 64 bits of the key in the hash table.
I could look it up, but it's something like 51 bits that I write to the hash table.
1 entry = 16 bytes.
You'll have to take care, however, that your entry uses unsigned 64-bit integers (unsigned __int64); that'll save a lot of trouble.
The problems grow bigger on high-end processors (not to be confused with x86 processors) if your 16-byte entry gets split into, for example, 4 writes of 4 bytes each.
The big difference between a lot of high-end supercomputer processors and x86 processors, to make the example simplistic, is that between the memory and the processor there is a hub. That thing doesn't only connect the memory to the processor, but also the network to the memory and the processor.
SGI, for example, calls this a 'hub' in their Origin 3800 series.
In their Altix series it is called a 'shub'.
On paper it has 2 times the bandwidth (paper supports everything), but the working is the same.
So in practice, if you keep within 16 bytes and do 2 writes of 8-byte datatypes each, then you can afford to take a lot of risk here.
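In code, the layout boils down to something like this (a sketch; keeping the 16-byte entry 16-byte aligned means it can never straddle a cache line, since line sizes are power-of-two multiples of 16):

    /* Sketch of the 16-byte entry layout: two uint64_t members compile
       to two single 8-byte stores, and 16-byte alignment keeps every
       entry inside one cache line. posix_memalign is one way to get
       the alignment on POSIX systems. */
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint64_t word1;   /* the key bits (XORed with word2, see below) */
        uint64_t word2;   /* packed move/score/depth/flags              */
    } Entry;              /* sizeof(Entry) == 16                        */

    static Entry *alloc_entries(size_t n) {
        void *p = NULL;
        if (posix_memalign(&p, 16, n * sizeof(Entry)) != 0)
            return NULL;
        return (Entry *)p;
    }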
Ah, I see now. I put 51 bits of the key in datatype 1 and 0 bits in the other. XOR the two and write the XORed word together with the 64-bit datatype 2 into the hash table.
That's a pretty useless XOR instruction most of the time.
But as I'm of the "0% risk" type, I'm doing it anyway.
If I were Dieter Buerssner, I would not consider doing that XOR.
Vincent