Parallel search efficiency
Posted: 17 Aug 2006, 19:41
Hi,
I recently got a dual proc dual core machine and set off to get my engine (which already worked on a dual proc machine) to work on this new quad beast.
The scheme I use is to keep a pool of "helper threads" and a pool of "split structs". When any thread reaches a point in the tree which looks like an all node it:
1. checks to see if there are idle helpers
2. populates a split struct
3. and assigns the helpers to help it search in parallel.
A thread _can_ resplit the tree if it is already under a split and threads are idle. This helps to minimize the idle time of helper threads.
On a dual proc machine this works reasonably well. Raw NPS increases proportionally to the number of searcher threads, all helper threads are kept busy 95%+ of the time, and overall speedup is about 1.5x. There's room for improvement here but its not too bad.
Searching with 4 threads works... but the raw speedup is less than I had hoped. The overall speedup with 4 threads is only about 2.25x. The raw speedup with 3 threads is 1.85x. Of course the raw NPS scales roughly linearly in both cases. Again, all helper threads are busy ~95% of the time.
I've checked things like lock contention and the speed with which a helper thread can abort the search if a split above it is terminated early. I have convinced myself that these are both reasonable.
I'm keeping track of the number of splits that terminate early and, across a big test suite like ecm, it averages around 6%.
One thing I have found is that if I drastically reduce the size of the hash table there is a reasonably large speed increase. I am operating under the theory that part of the problem is memory contention.
I'm wondering if anyone else has played much with parallel search (esp on a 4 proc box) and has results to share or advice about how to minimize parallel overhead.
Thx,
Scott
I recently got a dual proc dual core machine and set off to get my engine (which already worked on a dual proc machine) to work on this new quad beast.
The scheme I use is to keep a pool of "helper threads" and a pool of "split structs". When any thread reaches a point in the tree which looks like an all node it:
1. checks to see if there are idle helpers
2. populates a split struct
3. and assigns the helpers to help it search in parallel.
A thread _can_ resplit the tree if it is already under a split and threads are idle. This helps to minimize the idle time of helper threads.
On a dual proc machine this works reasonably well. Raw NPS increases proportionally to the number of searcher threads, all helper threads are kept busy 95%+ of the time, and overall speedup is about 1.5x. There's room for improvement here but its not too bad.
Searching with 4 threads works... but the raw speedup is less than I had hoped. The overall speedup with 4 threads is only about 2.25x. The raw speedup with 3 threads is 1.85x. Of course the raw NPS scales roughly linearly in both cases. Again, all helper threads are busy ~95% of the time.
I've checked things like lock contention and the speed with which a helper thread can abort the search if a split above it is terminated early. I have convinced myself that these are both reasonable.
I'm keeping track of the number of splits that terminate early and, across a big test suite like ecm, it averages around 6%.
One thing I have found is that if I drastically reduce the size of the hash table there is a reasonably large speed increase. I am operating under the theory that part of the problem is memory contention.
I'm wondering if anyone else has played much with parallel search (esp on a 4 proc box) and has results to share or advice about how to minimize parallel overhead.
Thx,
Scott