Tord Romstad wrote:
Hi all,
My first SMP version of Glaurung is now mostly finished (it will need a few weeks of polishing before it is ready for a public release, though). The code is still quite simple and not very general, and will not work with more than 2 CPUs. After much consideration I ended up using threads rather than processes. It just seemed simpler and cleaner to me this way.
Writing the SMP version was a much easier and less time-consuming task than I had imagined. I spent a couple of hours last weekend eliminating global variables and adding new parameters to a few functions, and another couple of hours last night writing and debugging the parallel search itself (a simple implementation of YBWC). The parallel search speedup seems reasonably good; in the few test positions I have tried so far I get an average speedup somewhere around 1.7-2.0. How well it works in games remains to be seen. I am sure there are lots of unexpected bugs and problems waiting to be found.
A few questions to the more experienced SMP programmers:
- I currently don't split at the root. Adding this wouldn't be technically difficult, but it would make my code more complicated and messy. How much can I expect to gain by splitting at the root? Is it worth doing?
- If I should split at the root, how am I supposed to use the "info currmove" command in the UCI interface? How do I tell the GUI that CPU 1 is searching the move 1. e4, while CPU 2 is searching the move 1. d4?
- The main transposition table should obviously be shared by all threads. But what do you do about the other hash tables (pawn hash table, evaluation hash table, etc.)? Should they be shared, or is it better to have a separate hash table for each thread? Currently I have chosen separate tables for each thread, in order to avoid too much locking and unlocking.
Tord
Hi,
Sorry for the late reply. I hardly check this forum.
Feel free to email me parallel search questions. Not many people here are experts.
If a single split is very expensive, then splitting at the root makes sense. I vaguely remember Bob has to copy 44KB or so, and in Cray Blitz he had to copy even 64KB for a single split. Something like that. In all other cases it doesn't make sense to split at the root.
In Diep I copy something like 200 bytes for a split, and the CPUs get put to work instantly, even before the CPU that splits the others in has finished attaching them all.
CPUs can directly start their own search when CPU X splits them in. Up to 32 CPUs can share 1 split point.
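Roughly, a split point is just a small shared struct that the slave CPUs attach to. A sketch (field names and sizes invented for illustration; this is not Diep's actual layout):

    /* Sketch of a small split point; names and sizes are invented
       for illustration, not Diep's actual layout. */
    #include <stdint.h>

    #define MAX_CPUS 32                /* up to 32 CPUs per split point */

    typedef struct {
        volatile int stop;             /* set on a fail high so slaves abort */
        volatile int nslaves;          /* CPUs currently attached            */
        int slaves[MAX_CPUS];          /* ids of the attached CPUs           */
        int alpha, beta, depth, ply;   /* search bounds at the split         */
        uint8_t position[192];         /* the ~200 bytes copied at the split */
        /* plus a lock and a move list for handing out the next move */
    } SplitPoint;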
The reason why splitting at the root is a bad idea is trivial: you could for example get a fail high at the root, which means you'll have to abort all the other CPUs. Several other authors who tried it concluded, like me, that splitting at the root is bad for the speedup.
Of course you'll have to test it on 200+ positions or so and average the results over those to figure this out. Testing just 4 positions or so sucks.
I'm using this approach for the hash tables:
each process has its own eval hash table and its own pawn table.
Suppose we have a 400MB transposition table in total and 4 processes; then each process allocates its 100MB part locally at its own CPU.
That's what we call NUMA.
As you know, it's very hard to get a multithreaded program to scale well under Windows. Multiprocessing is easier there, though it has its obstacles too.
Doing it this way ensures everything works properly.
After that, each process attaches to the other processes' transposition table parts.
I also store qsearch results in the hash table, by the way. A shared hash table like this gives real NUMA performance, as each part is allocated and, more importantly, cleared at the right memory controller, so the requests and stores get divided nicely over the memory controllers.
If you allocate the hash table as 1 big block from a single process, then obviously you don't have a NUMA hash table; 1 controller gets all the requests and all the stores, which is a very bad idea.
All this is rather simple to make.
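To make that concrete, here is a minimal first-touch sketch (using threads to keep the example short, while Diep actually uses processes attaching to shared memory; the sizes and the count of 4 are just the example numbers from above):

    /* Minimal first-touch sketch, not Diep's actual code: each thread
       clears its own slice of the shared table, so on a NUMA box the
       pages land at that thread's local memory controller.
       In practice you would also pin each thread to its CPU. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define NPROCS   4
    #define TT_BYTES (400u * 1024 * 1024)   /* 400MB in total */

    static uint8_t *tt;                     /* shared transposition table */

    static void *init_slice(void *arg) {
        size_t id = (size_t)(uintptr_t)arg;
        size_t slice = TT_BYTES / NPROCS;   /* 100MB per CPU */
        /* the first write decides which memory controller owns the pages */
        memset(tt + id * slice, 0, slice);
        return NULL;
    }

    int main(void) {
        pthread_t th[NPROCS];
        tt = malloc(TT_BYTES);
        for (size_t i = 0; i < NPROCS; i++)
            pthread_create(&th[i], NULL, init_slice, (void *)(uintptr_t)i);
        for (size_t i = 0; i < NPROCS; i++)
            pthread_join(th[i], NULL);
        /* each thread additionally keeps its own private pawn
           and eval hash tables, with no locking at all */
        free(tt);
        return 0;
    }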
Avoiding locking of the hash table is rather easy.
First of all, if you use a 400MB hash table or so, just don't lock at all in your current program. The odds of a failure are very small.
I've tested the odds of a failure at 200 CPUs, with quite a lot of stores per second. The odds that this gives a write error are really small.
So the first idea is to skip locking.
The second idea is adding a small XOR trick. I'm sure Bob will describe it (I didn't read his post, but I can guess what he wrote). He has an article about this Tim Mann trick.
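The trick, roughly (a minimal sketch with an assumed entry layout of two 64-bit words; see Bob's article for the real thing):

    /* Lockless hashing sketch: store the key XORed with the data word.
       If another CPU's store tears our entry, the XOR check fails on
       probe and the entry simply counts as a miss instead of garbage. */
    #include <stdint.h>

    typedef struct {
        uint64_t key_xor_data;   /* hash key ^ data word */
        uint64_t data;           /* move, score, depth, flags packed */
    } HashEntry;

    void tt_store(HashEntry *e, uint64_t key, uint64_t data) {
        e->key_xor_data = key ^ data;   /* two independent 8-byte stores */
        e->data = data;
    }

    int tt_probe(const HashEntry *e, uint64_t key, uint64_t *data) {
        uint64_t d = e->data;
        if ((e->key_xor_data ^ d) != key)
            return 0;                   /* torn or unrelated entry */
        *data = d;
        return 1;
    }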
There are other ways to avoid problems. One is simply staying within a cache line of the processor. Each processor has a write buffer.
If you take care that you stay within the write buffer, then you don't even need the XOR, as you have a 100% guarantee that the write buffer, IF it gets written, gets written at once.
Only on some big high-end supercomputer processors can this go wrong, if some special interrupt gets shipped. Like a control-C.
However, if my program gets a control-C (not sure about yours), mine might already get killed by then, so I don't worry too much about this problem.
The odds of a write error were so tiny that I couldn't even express them in words. Only with a shitload of zeros could I add a 1.
That's when *not* writing within the same cache line, of course.
As Diep is not guaranteed to write within the same cache line, I'm doing a simple XOR.
Note that I don't store the entire 64 bits of the key in the hash table.
I could look it up, but it's something like 51 bits that I write to the hash table.
1 entry = 16 bytes.
You'll have to take care, however, that your entry uses unsigned 64-bit integers (unsigned __int64); that'll save a lot of trouble.
The problems grow bigger on high-end processors (not to be confused with x86 processors) if your 16-byte entry gets split into, for example, 4 writes of 4 bytes each.
The big difference between a lot of high-end supercomputer processors and x86 processors, to make the example simplistic, is that between the memory and the processor there is a hub. That thing doesn't only connect the memory to the processor, but also the network to the memory and the processor.
SGI, for example, calls this a 'hub' in their Origin 3800 series.
In their Altix series it is called a 'shub'.
On paper it has 2 times the bandwidth (paper supports everything), but the working is the same.
So in practice, if you keep within 16 bytes and do 2 writes of 8-byte datatypes each, then you can afford to take a lot of risk here.
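In code, the layout boils down to something like this (a sketch; keeping the 16-byte entry 16-byte aligned means it can never straddle a cache line, since line sizes are power-of-two multiples of 16):

    /* Sketch of the 16-byte entry layout: two uint64_t members compile
       to two single 8-byte stores, and 16-byte alignment keeps every
       entry inside one cache line. posix_memalign is one way to get
       the alignment on POSIX systems. */
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint64_t word1;   /* the key bits (XORed with word2, see below) */
        uint64_t word2;   /* packed move/score/depth/flags              */
    } Entry;              /* sizeof(Entry) == 16                        */

    static Entry *alloc_entries(size_t n) {
        void *p = NULL;
        if (posix_memalign(&p, 16, n * sizeof(Entry)) != 0)
            return NULL;
        return (Entry *)p;
    }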
Ah, I see now. I put 51 bits of the key in datatype 1 and 0 bits in the other. XOR the two and write the XORed word together with the 64-bit datatype 2 into the hash table.
That's a pretty useless XOR instruction most of the time.
But as I'm of the "0% risk" type, I'm doing it anyway.
If I were Dieter Buerssner, I would not consider doing that XOR.
Vincent