Hi Tord,
Tord Romstad wrote:Hi Mridul,
Thanks for your reply!
mridul wrote:Hi Tord,
Tord Romstad wrote:Hi all,
My first SMP version of Glaurung is now mostly finished (it will need a few weeks of polishing before it is ready for a public release, though). The code is still quite simple and not very general, and will not work with more than 2 CPUs. After much consideration I ended up using threads rather than processes. It just seemed simpler and cleaner to me this way.
Congrats !
A bit too early to congratulate, I'm afraid. There are still a few bugs. I have seen the program crash a couple of times, and I have seen at least one mate announcement I don't quite believe in (posted at the CCC). The parallel search seems to be 99% correct, but I fear that getting it 100% correct will take a lot of time. I also suspect that going from 2 to n CPUs will prove far more difficult than going from 1 to 2 CPUs.
Until I got it 100% correct (I think it is now), I had all sorts of weird bugs - they shouldn't be too tough to notice.
For example, your main PV/score will differ between the single-processor and the parallel version: theoretically this can happen... but in practice it should happen extremely rarely.
Even now, I just need to recompile with -DMP_DEBUG to test that the parallel search is not broken.
All preconditions and postconditions where multiple processes interact (be it locks, shared data, etc.) are checked against the assumptions made at entry and at exit.
An example of a class of bugs I found this way, which was totally unintuitive (at least to me), was not declaring a variable 'volatile' when it should have been...
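To make the 'volatile' point concrete, here is a tiny sketch of the kind of code that bites you (the names are made up for illustration, they are not from my engine):

    struct StopFlag {
        volatile bool stop;   // written by one thread, read by another
    };

    void wait_for_stop(StopFlag *f) {
        // If 'stop' were a plain bool, an optimizing compiler could read it
        // once, keep it in a register, and spin here forever even after the
        // other thread sets it.  Declaring it volatile forces a real load
        // on every iteration.
        while (!f->stop)
            ;   // spin
    }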
By the way, are there any good ways to simulate more than 2 CPUs on a dual CPU computer?
Well, the major problem with simulating an MP search on a single processor is that the amount of concurrency you can simulate depends on the granularity of the OS context switch (we usually don't block on I/O in chess engines): since that granularity is on the order of milliseconds, we almost never notice any concurrency problems.
Even on a dual this changes markedly: you may not identify all the concurrency bugs, but the unpredictability of a context switch is much higher.
Before I tested on a quad, I just ran a 4-process search on my dual and did find a bug (a minor one, thankfully)... I did not catch any on the quad.
Of course, having more processors to test on is always helpful.
When you do this, the NPS will suck and the scaling will suck - but the parallel search will get validated to some extent, until you can find a quad or better (I use a spinlock-type busy wait while searching for a split point in the children).
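For what it's worth, by a spinlock-type busy wait I mean something roughly like this (only a sketch using GCC's __sync builtins, not my actual code):

    struct SpinLock {
        volatile int locked;   // 0 = free, 1 = held
    };

    void spin_lock(SpinLock *l) {
        // Atomically set 'locked' to 1; if it was already 1 someone else holds
        // the lock, so spin on the plain read (cheap, stays in cache) and then
        // retry the atomic exchange.
        while (__sync_lock_test_and_set(&l->locked, 1))
            while (l->locked)
                ;   // spin
    }

    void spin_unlock(SpinLock *l) {
        __sync_lock_release(&l->locked);
    }

That is also exactly why NPS and scaling suffer when you run 4 of these on 2 CPUs: the spinning threads burn cycles that the searching threads could have used.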
A few questions to the more experienced SMP programmers:
Not really experienced, but I can give it a shot...
- I currently don't split at the root. Adding this wouldn't be technically difficult, but it would make my code more complicated and messy. How much can I expect to gain by splitting at the root? Is it worth doing?
I don't split at the root either... YMMV of course.
OK, thanks.
- If I should split at the root, how am I supposed to use the "info currmove" command in the UCI interface? How do I tell the GUI that CPU 1 is searching the move 1. e4, while CPU 2 is searching the move 1. d4?
Just to get an idea of what you are doing, consider this case:
* Processor 1 is searching at ply 2
* It decides that other processors can search this ply's subtree.
* Processor 2 starts searching a move from this subtree.
* Before processor 2 completes, all other moves at that ply are exhausted by processor 1.
What do you do in this case ?
Processor 1 enters an idle loop, and waits for processor 2 to complete its work. While processor 1 is idle, processor 2 can give it work to do inside the sub-tree it is searching. The idea is called the "helpful master" concept in the YBW papers, I think.
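In pseudo-C++ the idea looks roughly like this (just a sketch of the concept as I understand it from the papers, not my actual implementation):

    struct SplitPoint {
        volatile int slaves_remaining;   // slaves still searching below this point
    };

    struct Thread {
        volatile bool work_available;    // set by a slave when it hands us a node
        SplitPoint   *assigned_sp;       // where that assigned work lives
    };

    void search_below(SplitPoint *sp);   // the normal recursive search, defined elsewhere

    void helpful_master_loop(Thread *self, SplitPoint *waiting_for) {
        // The master has run out of moves at 'waiting_for', but its slaves are
        // not finished.  Instead of sleeping, it spins here and picks up any
        // node a slave hands it from inside the sub-tree still being searched.
        while (waiting_for->slaves_remaining > 0) {
            if (self->work_available) {
                search_below(self->assigned_sp);
                self->work_available = false;
            }
        }
    }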
I guess the answer to your question will depend on this ...
Why? Perhaps I am dense, but I cannot see how it has any relevance to my question.
The reason I asked was to find out how you search, so that I wouldn't make some stupid suggestion.
If you don't split at the root, and the initial thread which started the search remains the owner of the search, can't you always return the currmove of that thread (and the PV line, until it splits under a child)?
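That way the GUI only ever sees one stream of currmove lines, e.g. something like this (just illustrating the UCI syntax, not what Glaurung actually sends):

    info depth 12 currmove e2e4 currmovenumber 1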
- The main transposition table should obviously be shared by all threads. But what do you do about the other hash tables (pawn hash table, evaluation hash table, etc.)? Should they be shared, or is it better to have a separate hash table for each thread? Currently I have chosen separate tables for each thread, in order to avoid too much locking and unlocking.
Tord
There is a lockless way to access the transposition table, by Tim Mann (I think) - really useful.
Yes, I have seen that (but I thought it was by Bob Hyatt), and intend to use it for my main transposition table.
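For anyone who hasn't seen the paper, the trick is roughly this (a sketch of the technique as described by Hyatt and Mann, not my actual table layout):

    #include <stdint.h>

    struct TTEntry {
        uint64_t key_xor_data;   // the position key XORed with the data word
        uint64_t data;           // move, score, depth and bound packed together
    };

    void tt_store(TTEntry *e, uint64_t key, uint64_t data) {
        e->key_xor_data = key ^ data;
        e->data = data;
    }

    bool tt_probe(const TTEntry *e, uint64_t key, uint64_t *data_out) {
        uint64_t data = e->data;
        // If another thread overwrote the entry between the two reads, the XOR
        // check fails and we simply treat it as a miss - no lock needed.
        if ((e->key_xor_data ^ data) != key)
            return false;
        *data_out = data;
        return true;
    }

The worst case is a rejected probe, never a corrupted score, which is why the main table needs no locking at all.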
I don't share any other table - so each process has its own killers/eval cache/king-eval cache/qsearch hash/etc.
Just what I do, then.
It is not quite as easy to get the others lockless (locking would be too expensive in the long run), and the benefits of sharing them are not that high...
In the case of the pawn hash table, you are probably right. My evaluation hash table is much less effective when I don't share it between threads, though.
I don't know how you store things in your eval cache, but I could not get mine lockless... it is an extremely inefficient data structure.
I found the cache-hit rate dropped by about 4% when I did not share it, but the penalty of sharing and locking it would be much worse, I guess.
More than the eval cache, I was interested in sharing killers... they affect the move ordering, and you sometimes see unpredictable orderings when different processes order moves differently because of their individual killer info (this can be damn frustrating when you are debugging a search-tree dump to get to that crazy bug...).
Thankfully, I don't use history :-P
In practice it is OK to overwrite the values, mess them up, etc., since we only use killers and history as hints... but I kind of prefer more predictability.
Also, if you run on a NUMA box, it makes more sense to try to keep the data a thread/process uses in its local memory (I did not want to design something which will have problems in the future 'when' I run on a NUMA box).
Until you get the parallel search working well, you will have endless hours of irritation (fun?) debugging your code/output, moments of "duh! that was so stupid of me" when you find the bug, and a really, really high amount of satisfaction when you see your program blaze away!
Have fun!
Hope this helps.
Mridul
Hope this helps,
It does. Thanks for the input!
Tord