void SEARCHER::stop_workers() {
    // Recursively flag every helper attached to this searcher: each
    // worker gets its stop flag set and then stops its own workers,
    // so the whole tree of helpers below this split point unwinds.
    for(int i = 0; i < n_processors; i++) {
        if(workers[i]) {
            l_lock(workers[i]->lock);
            workers[i]->stop_searcher = 1;
            workers[i]->stop_workers();
            l_unlock(workers[i]->lock);
        }
    }
}
Now, when a beta cutoff occurs:
if(score >= pstack->beta) {
    l_lock(lock_smp);
    l_lock(master->lock);
    master->stop_workers();   // flag every helper at this split point
    update_master();
    l_unlock(master->lock);
    l_unlock(lock_smp);
    break;                    // abandon the move loop at this node
}
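Note the lock order in this snippet: the global split lock lock_smp is taken before the split point's local lock master->lock, and both are released in reverse order. Presumably keeping one fixed acquisition order everywhere is what prevents two masters, each trying to stop helpers at a different split point, from deadlocking; the break then abandons the move loop at the split node itself.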
Fritz Reul wrote:
I am comparing YBWC and Shared Hashtables with each other. The Shared Hashtables perform quite well on a 2-core system. The YBWC is not much faster, but more complicated.
So far LOOP is ready to use Shared Hashtables or YBWC, but I am currently not sure which technology is really better. Shared Hashtables are simpler, and therefore the engine is much easier to develop and tune.
My biggest problem is the YBWC with >=4 threads. I don't really know how to stop the Master and Slave threads efficiently when a BetaCut occurs at the local SplitNode.
If each thread can launch e.g. 8 SplitNodes and we use 4 threads, then we have to manage up to 4x8=32 recursive SplitNodes. Every thread is a Master for its own SplitNode. But what happens if a CutOff occurs in a SplitNode near the root? How is it possible to stop all the Slave threads of this SplitNode where the CutOff occurred?
Example:
We have 4 threads T[0..3] and 8 SplitNodes per thread, N[t][0..7].
T[0] is the master of its first SplitNode N[0][0].
The Slave threads T[1-3] are idle.
T[0] launches a new parallel search at N[0][0] with the Slaves T[1-3].
T[1] and T[3] become idle again, and T[2] starts its own SplitNode N[2][0] with T[1] and T[3] as slaves.
And so on...
What happens when T[0] finds a BetaCut at its SplitNode N[0][0]? How is it possible to stop all the threads which are connected to this SplitNode? How do the threads T[1-3] recognize which SplitNode sent the Stop information?
Fritz
bob wrote: You really don't want to "stop the threads". You just want to tell 'em to "stop searching and wait for something new to work on." The overhead for creating threads is non-trivial if you want to generalize it for multiple architectures that include NUMA. I have a big structure that is used for a thread that is searching something, and there is a "stop" variable that is usually set to zero. If some thread working at this split point wants to stop all threads that are busy here, he can look in his own split block to see what other threads are working here, then go set their individual "stop" flags and then clean up. They will stop within one node when they check the flag at the top of Search() or Quiesce()...
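A minimal sketch of what bob describes, assuming one structure per split point; the names here (SPLIT_BLOCK, MAX_THREADS, stop_helpers, must_stop) are illustrative, not Crafty's actual code:

#include <atomic>

const int MAX_THREADS = 8;

struct SPLIT_BLOCK {
    std::atomic<int> stop[MAX_THREADS];  // per-thread "stop searching" flags
    bool             busy[MAX_THREADS];  // which threads joined this split point
};

// The thread that found the cutoff flags everyone else busy here...
void stop_helpers(SPLIT_BLOCK* sb, int self) {
    for (int t = 0; t < MAX_THREADS; t++)
        if (sb->busy[t] && t != self)
            sb->stop[t].store(1, std::memory_order_relaxed);
}

// ...and each thread polls its own flag at the top of Search()/Quiesce(),
// so a flagged thread unwinds within one node and waits for new work.
bool must_stop(const SPLIT_BLOCK* sb, int self) {
    return sb->stop[self].load(std::memory_order_relaxed) != 0;
}

The threads themselves stay alive for the whole game; only the work assignment changes, which avoids the thread-creation overhead bob mentions.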
bob wrote: First, forget the shared hash table stuff. It will not work for >2 processors and deliver any kind of performance advantage at all. Any form of YBW produces far better results. Yes, the code is way more complicated. But you don't get something good for nothing. Anywhere.
Richard Pijl wrote: bob wrote: First, forget the shared hash table stuff. It will not work for >2 processors and deliver any kind of performance advantage at all. Any form of YBW produces far better results. Yes, the code is way more complicated. But you don't get something good for nothing. Anywhere.
Sorry, but I do not agree here.
The Baron has run successfully with a simple shared hashtable in all the recent tournaments, on a machine with 4 or even 8 processors. Agreed, the speedup of YBW is better on machines with more than 2 CPUs, but the shared hashtable stuff does work here too and the performance increase is significant. At least for the Baron it does.
Last time I measured, 4 CPU cores gave a speedup of 2.5, 8 CPU cores a speedup of 4.
Richard.
Fritz Reul wrote: I tested YBWC on my 2-core system and it isn't stronger/faster/deeper than the Shared Hashtables.
Here are my latest results from the SharedHashtables on Quad:
LOOP A0 T2 (2 Threads) vs. LOOP A0 T4 (4 Threads):
12 Games
15+10
256 MB per Engine
Intel Woodcrest 4x3000 MHz
LOOP A0 T2: 3987 Knps / 18.95 Plies
LOOP A0 T4: 7761 Knps / 19.53 Plies
Result:
LOOP A0 T4 reaches ~95% higher Node Speed and searches 0.5-0.6 Plies deeper.
This SharedHashtable System is not tuned!
What do you think about these results?
Average Branching factor of LOOP A0 is ~2.5.
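A rough back-of-envelope on those numbers, assuming the extra depth converts to time via the effective branching factor: with b ≈ 2.5 and 0.5-0.6 extra plies, the effective speedup from doubling the thread count is about 2.5^0.55 ≈ 1.66, i.e. somewhere around 1.6-1.7, even though the raw node speed nearly doubles.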
Peter Fendrich wrote: What is a Shared Hashtable?
/Peter
bob wrote: Richard Pijl wrote: bob wrote: First, forget the shared hash table stuff. It will not work for >2 processors and deliver any kind of performance advantage at all. Any form of YBW produces far better results. Yes, the code is way more complicated. But you don't get something good for nothing. Anywhere.
Sorry, but I do not agree here.
The Baron has run successfully with a simple shared hashtable in all the recent tournaments, on a machine with 4 or even 8 processors. Agreed, the speedup of YBW is better on machines with more than 2 CPUs, but the shared hashtable stuff does work here too and the performance increase is significant. At least for the Baron it does.
Last time I measured, 4 CPU cores gave a speedup of 2.5, 8 CPU cores a speedup of 4.
Richard.
That curve is already looking bad. 2.5/4, 4/8, next is something like 6/16, which is really lousy. I will say 2.5/4 is not bad, but I'd like to see data for a bunch of test positions, as that is a better result than anyone else has produced, and I am assuming you are talking about "ABDADA" or whatever the acronym for this is???
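Spelled out as parallel efficiency, that curve is 2.5/4 = 62.5% at four cores and 4/8 = 50% at eight; extrapolating the same rate of decay gives about 6/16 ≈ 37.5% at sixteen, which is where the "really lousy" comes from.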
Gerd Isenberg wrote: Peter Fendrich wrote: What is a Shared Hashtable?
/Peter
Hi Peter,
two or more threads/processes share a hashtable, but otherwise search the same position from the root independently. They "randomly" gain from hash entries stored by other threads. Likely the different threads will use slightly different move ordering to somehow improve the parallel speedup by searching mostly disjoint subtrees. See also:
J.-C. Weill. The ABDADA Distributed Minimax-Search Algorithm. ICCA Journal, 19(1):3–14, 1996.
Gerd
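A minimal sketch of the scheme Gerd describes: every thread runs an ordinary search from the root, and the only shared state is the transposition table; the thread id perturbs move ordering so the threads explore mostly disjoint subtrees. All names below (TTEntry, tt_probe, parallel_root, ...) are illustrative, not LOOP's or the Baron's actual code, and a real engine would also guard against racy table writes (e.g. with a lockless XOR scheme):

#include <cstdint>
#include <thread>
#include <vector>

struct TTEntry { std::uint64_t key; std::int16_t score; std::int8_t depth; std::int8_t flags; };

std::vector<TTEntry> tt(1 << 20);              // shared table, 2^20 entries

TTEntry* tt_probe(std::uint64_t key) { return &tt[key & (tt.size() - 1)]; }

int search(std::uint64_t pos_key, int depth, int alpha, int beta, int tid) {
    TTEntry* e = tt_probe(pos_key);            // "randomly" gain from entries
    if (e->key == pos_key && e->depth >= depth) {  // stored by other threads
        // ...use the stored score/bound exactly as a serial search would...
    }
    // ...generate moves; rotate the first few by tid so each thread starts
    //    on a different subtree, recurse, then store the result back...
    return alpha;                              // placeholder for the sketch
}

void parallel_root(std::uint64_t root_key, int depth, int n_threads) {
    std::vector<std::thread> pool;
    for (int tid = 0; tid < n_threads; tid++)
        pool.emplace_back(search, root_key, depth, -32000, 32000, tid);
    for (auto& t : pool)
        t.join();                              // all threads searched the same
}                                              // root and fed the same table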