Scott Gasch wrote: Hi,
I recently got a dual-proc, dual-core machine and set out to get my engine (which already worked on a dual-proc machine) working on this new quad beast.
The scheme I use is to keep a pool of "helper threads" and a pool of "split structs". When any thread reaches a point in the tree that looks like an ALL node, it:
1. checks to see if there are idle helpers
2. populates a split struct
3. and assigns the helpers to help it search in parallel.
A thread _can_ resplit the tree if it is already under a split and threads are idle. This helps to minimize the idle time of helper threads.
On a dual-proc machine this works reasonably well. Raw NPS increases in proportion to the number of searcher threads, all helper threads are kept busy 95%+ of the time, and overall speedup is about 1.5x. There's room for improvement here, but it's not too bad.
Searching with 4 threads works... but the raw speedup is less than I had hoped. The overall speedup with 4 threads is only about 2.25x. The raw speedup with 3 threads is 1.85x. Of course the raw NPS scales roughly linearly in both cases. Again, all helper threads are busy ~95% of the time.
I've checked things like lock contention and the speed with which a helper thread can abort the search if a split above it is terminated early. I have convinced myself that these are both reasonable.
I'm keeping track of the number of splits that terminate early and, across a big test suite like ECM, it averages around 6%.
One thing I have found is that if I drastically reduce the size of the hash table there is a reasonably large speed increase. I am operating under the theory that part of the problem is memory contention.
I'm wondering if anyone else has played much with parallel search (especially on a 4-proc box) and has results to share or advice about how to minimize parallel overhead.
Thx,
Scott
Hi Scott,
Nice to see you're busy going parallel.
You mention 'idle threads'. What do you mean by an idle thread?
a) a thread that is busy polling; not really idle, it just has no job to do
b) a truly idle thread that needs to be put back into what on Linux we would call the 'runqueue'.
I'd note that I'm using (a). On Linux, (b) costs roughly 10 ms of wakeup latency that (a) doesn't have.
c) How much work needs to get done to do a split? In short: how many bytes do you ship, and how many locks and other funky Windows calls do you use to split? I'd say get rid of the Windows function calls.
In Diep, doing a split basically only requires shipping a few moves to a processor that's idle (in short, not a truly idle processor but a polling one, which polls in its own local memory).
d) You mention you work with threads. Everyone working with threads reports scaling problems on Windows; everyone working directly with processes does great. I don't say threads are not going to work, I just want to note that for most programmers it's a lot easier to get multiprocessing to scale than multithreading. Besides that, multithreading loses you a register (which nowadays, with all those rename registers, might not be as relevant in 64-bit as it used to be). Also, every now and then you have to restart all your processes because the shared memory somehow gets slow in the kernel; you know more than I do about why. A reboot of the machine helps a lot. On Linux there is no problem at all. Multiprocessing simply lets your program scale better. Perhaps it's also harder to fix things with threads than with processes?
e) With regard to the speedup: the usual way to split is to split only according to the YBW (Young Brothers Wait) principle. In Diep, one of the minimum conditions for a split at 16 processors or fewer is that I have already searched the null move, done IID (if that gets used at all), and searched the first move in the move list.
Only THEN is a split sometimes possible (there are other conditions, btw, which are less relevant).
This YBW principle works great.
P.S. On an AMD box it's all about which memory node you allocate the memory on.