Winboard Forum

by **Fritz Reul** » 21 Feb 2007, 08:39

I am comparing YBWC and Shared Hashtables with each other. The Shared Hashtables perform quite good on 2 core system. The YBWC are not much faster but more complicated.

So far LOOP is ready to use Shared Hashtables or YBWC. But I am currently not shure which technology is really better. Shared Hashtables are simpler and therefore the engine is much easier to develop and tune.

My biggest problem is the YBWC with >=4 Threads. I dont really know how to stop the Master and Slave Threads efficiently when at the local SplitNode a BetaCut occurs.

If it is possible for each Thread to launch i.e. 8 SplitNodes and we use 4 Threads then we have to manage up to 4x8=32 recursive SplitNodes. Every Threads is a Master for its own SplitNode. But what happens, if a CutOff occurs in a SplitNode near the root? How is it possible to stop all the SlaveThreads of this SplitNode where the CutOff occured?

Example:
We have 4 Threads T[4] (0-3) and 8 SplitNodes N[4][8] (0-7) for every Thread.
T[0] is the master of its first SplitNode N[0][0].
The Slave Threads T[1-3] are idle.
T[0] launches a new parallel search at a N[0][0] with the Slaves T[1-3].
T[1] and T[3] are getting idle and T[2] starts its own SplitNode N[1][0] with T[1] and T[3] as slaves.
And so on...

What happens when T[0] finds a BetaCut a its SplitNode N[0][0]. How is it possible to stop all the threads which a connected to this SplitNode? How do the Threads T[1-3] recognize which SplitNode sends the Stop information?

Fritz

by **Daniel Shawul** » 21 Feb 2007, 10:19

Why don't you just call the stop_thread function recursively.
What i do in my program is lock the master and the general smp lock used for assigning work to idle threads, and then quit whatever workers it has.

Code: Select all: void SEARCHER::stop_workers() { for(int i = 0; i < n_processors; i++) { if(workers[i]) { l_lock(workers[i]->lock); workers[i]->stop_searcher = 1; workers[i]->stop_workers(); l_unlock(workers[i]->lock); } } } Now when a beta cut off occurs if(score >= pstack->beta) { l_lock(lock_smp); l_lock(master->lock); master->stop_workers(); update_master(); l_unlock(master->lock); l_unlock(lock_smp); break; }

Daniel

by **bob** » 23 Feb 2007, 06:11

Fritz Reul wrote:
I am comparing YBWC and Shared Hashtables with each other. The Shared Hashtables perform quite good on 2 core system. The YBWC are not much faster but more complicated.

So far LOOP is ready to use Shared Hashtables or YBWC. But I am currently not shure which technology is really better. Shared Hashtables are simpler and therefore the engine is much easier to develop and tune.

First, forget the shared hash table stuff. It will not work for > 2 processors and deliver any kind of performance advantage at all. Any form of YBW produces far better results. yes the code is way more complicated. But you don't get something good for nothing. Anywhere.

My biggest problem is the YBWC with >=4 Threads. I dont really know how to stop the Master and Slave Threads efficiently when at the local SplitNode a BetaCut occurs.

You really don't want to "stop the threads". You just want to tell 'em to "stop searching and wait for something new to work on." The overhead for creating threads is non-trivial if you want to generalize it for multiple architectures that include NUMA. I have a big structure that is used for a thread that is searching something, and there is a "stop" variable that is usually set to zero. If some thread working at this split point wants to stop all threads that are busy here, he can look in his own split block to see what other threads are working here, then go set their individual "stop" flags and then clean up. They will stop within one node when they check the flag at the top of Search() or Quiesce()...

If it is possible for each Thread to launch i.e. 8 SplitNodes and we use 4 Threads then we have to manage up to 4x8=32 recursive SplitNodes. Every Threads is a Master for its own SplitNode. But what happens, if a CutOff occurs in a SplitNode near the root? How is it possible to stop all the SlaveThreads of this SplitNode where the CutOff occured?

Example:
We have 4 Threads T[4] (0-3) and 8 SplitNodes N[4][8] (0-7) for every Thread.
T[0] is the master of its first SplitNode N[0][0].
The Slave Threads T[1-3] are idle.
T[0] launches a new parallel search at a N[0][0] with the Slaves T[1-3].
T[1] and T[3] are getting idle and T[2] starts its own SplitNode N[1][0] with T[1] and T[3] as slaves.
And so on...

What happens when T[0] finds a BetaCut a its SplitNode N[0][0]. How is it possible to stop all the threads which a connected to this SplitNode? How do the Threads T[1-3] recognize which SplitNode sends the Stop information?

Fritz

I create a "tree" of split blocks. One for the root position. If I split among N processors somewhere deeper in the tree, I allocate N new split blocks and thread them together (a threaded list). If any of those sub-trees are re-split later on, the same is done recursively. So it is possible for me to find any processors that are working at this split node, as well as any that are working further down in the tree represented by this split node, and I can stop 'em all easily by setting their stop searching flags...

by **Pradu** » 23 Feb 2007, 13:52

bob wrote:You really don't want to "stop the threads". You just want to tell 'em to "stop searching and wait for something new to work on." The overhead for creating threads is non-trivial if you want to generalize it for multiple architectures that include NUMA. I have a big structure that is used for a thread that is searching something, and there is a "stop" variable that is usually set to zero. If some thread working at this split point wants to stop all threads that are busy here, he can look in his own split block to see what other threads are working here, then go set their individual "stop" flags and then clean up. They will stop within one node when they check the flag at the top of Search() or Quiesce()...

So your saying that the cost of creating a thread is not negligible when taking into account the all the work the thread does? I've always planned to do it this way because it seems easy to implement instead of having it unroll back to an infinite loop but I'm not sure if it'd hurt speedup a lot...

by **Richard Pijl** » 23 Feb 2007, 14:31

bob wrote:First, forget the shared hash table stuff. It will not work for > 2 processors and deliver any kind of performance advantage at all. Any form of YBW produces far better results. yes the code is way more complicated. But you don't get something good for nothing. Anywhere.

Sorry, but I do not agree here.
The Baron has run successfully with a simple shared hashtable in all the recent tournaments, on a machine with 4 or even 8 processors. Agreed, the speedup of YBW is better on >2 machines, but the shared hashtable stuff does work here too and the performance increase is significant. At least for the Baron it does.
Last time I measured, 4CPU cores gave a speedup of 2.5, 8 CPU cores a speedup of 4.
Richard.

by **Fritz Reul** » 23 Feb 2007, 18:48

I tested YBWC on my 2 Core System and it isnt stronger/faster/deeper than the Shares Hashtables.

Here are my latest results from the SharedHashtables on Quad:

LOOP A0 T2 (2 Threads) vs. LOOP A0 T4 (4 Threads):
12 Games
15+10
256 MB per Engine
Intel Woodcrest 4x3000 MHz

LOOP A0 T2: 3987 Knps / 18.95 Plies
LOOP A0 T4: 7761 Knps / 19.53 Plies

Result:
LOOP A0 T4 reaches ~95% higher Node Speed and searches 0.5-0.6 Plies deeper.

This SharedHashtable System is not tuned!

What do you think about these results?

Average Branching factor of LOOP A0 is ~2.5.

by **Richard Pijl** » 23 Feb 2007, 22:23

I would say your numbers concerning the speedup are consistent with mine. I just wish I got the same average depth !
Richard.

by **bob** » 24 Feb 2007, 05:30

Richard Pijl wrote:
bob wrote:First, forget the shared hash table stuff. It will not work for > 2 processors and deliver any kind of performance advantage at all. Any form of YBW produces far better results. yes the code is way more complicated. But you don't get something good for nothing. Anywhere.

Sorry, but I do not agree here.
The Baron has run successfully with a simple shared hashtable in all the recent tournaments, on a machine with 4 or even 8 processors. Agreed, the speedup of YBW is better on >2 machines, but the shared hashtable stuff does work here too and the performance increase is significant. At least for the Baron it does.
Last time I measured, 4CPU cores gave a speedup of 2.5, 8 CPU cores a speedup of 4.
Richard.

That curve is already looking bad. 2.5/4, 4/8, next is something like 6/16 which is really lousy. I will say 2.5/4 is not bad, but I'd like to see data for a bunch of test positions as that is better results than anyone else has produced, and I am assuming you are talking about the "ABDABA" or whatever the acronym for this is???

by **bob** » 24 Feb 2007, 05:34

Fritz Reul wrote:I tested YBWC on my 2 Core System and it isnt stronger/faster/deeper than the Shares Hashtables.

Here are my latest results from the SharedHashtables on Quad:

LOOP A0 T2 (2 Threads) vs. LOOP A0 T4 (4 Threads):
12 Games
15+10
256 MB per Engine
Intel Woodcrest 4x3000 MHz

LOOP A0 T2: 3987 Knps / 18.95 Plies
LOOP A0 T4: 7761 Knps / 19.53 Plies

Result:
LOOP A0 T4 reaches ~95% higher Node Speed and searches 0.5-0.6 Plies deeper.

This SharedHashtable System is not tuned!

What do you think about these results?

Average Branching factor of LOOP A0 is ~2.5.

If that is true, I would suspect either a flaw in the testing, or else a bug in YBW. YBW is simply superior. On paper or on a computer. I've posted dozens of log files from my program in recent years. 4 cpu tests produce speedups of 3.1-3.3. The shared hash table approach won't do that well. For 8 the difference is even more pronounced...

I played with it a few years ago, since it is trivial to implement. But, as always, "you get what you pay for" and in the case of the shared hash table approach, that was never more true...

I'd expect the NPS to double as it did in your test. But that is meaningless in terms of search efficiency. My branching factor seems to hover around 2 nowadays. going from 1-2 processors gains a ply for me:

log.001: time=39.30 mat=0 n=79797592 fh=92% nps=2.0M
log.002: time=20.76 mat=0 n=76461479 fh=92% nps=3.7M

That's the same test position searched to a fixed depth... first with 1 cpu, then with two.

Or searched to a fixed time of 40 seconds, I get this:
17-> 39.35 -0.75 1. ... Bxe4 2. Bxe4 Qxc4 3. Qxc4 Rxc4
4. Nxb6 Rxe4 5. Nxd7 Nxd7 6. Bd4 d5
7. a4 Rd8 8. Rac1 Re2 <HT>

18-> 39.61 -0.71 1. ... Bxe4 2. Bxe4 Qxc4 3. Qxc4 Rxc4
4. Nxb6 Rxe4 5. Nxd7 Nxd7 6. Bc3 Ra4
7. Nd4 Bf6 8. Ne2 Re4 9. Ng3 Bxc3 10.
Nxe4 Bxa1 11. Rxa1

A whole ply.

My point here is that the shared hash table approach will _not_ produce those kinds of speedups. For 2 processors even. And the gap widens as the number of processors goes up...

It's not the way to design a program for long-term parallel search as hardware continues to evolve with respect to cores on a chip.

by **Peter Fendrich** » 24 Feb 2007, 11:25

What is Shared Hashtable?

/Peter

by **Gerd Isenberg** » 24 Feb 2007, 15:54

Peter Fendrich wrote:What is Shared Hashtable?

/Peter

Hi Peter,

two or more threads/processes share a hashtable, but are otherwise searching the same position from root independently. They "randomly" gain from hash-entries stored by other threads. Likely the different treads will use slightly different move sorting to somehow improve parallel speedup by searching most disjoint subtrees. See also:

J.-C. Weill. The ABDADA Distributed Minimax-Search Algorithm. ICCA Journal, 19(1):3–14, (1996)

Gerd

by **bob** » 24 Feb 2007, 16:49

Peter Fendrich wrote:What is Shared Hashtable?

/Peter

It's a "poor man's parallel search algorithm". Two threads start searching the same tree and depend on each other to provide information through the hash table that prevents them from searching _identical_ trees. In theory the speedup could be 1.0 for any number of processors as they would all search exactly the same tree at the same time. In reality it does somewhat better than that.

YBW on the other hand explicitly prevents this from happening, only suffering from the issue of doing unnecessary work when we split the tree at a point where we do not need to search all the moves. ABDADA will also do that just as frequently, but more importantly multiple threads will be searching the _same_ tree which greatly reduces performance.

ABDADA is pretty trivial to implement. But then again so is a bubble sort.

There are known better ways, but they are way more complex and require much more involved testing and debugging to get 'em right. But the reward is great.

by **Richard Pijl** » 24 Feb 2007, 17:29

bob wrote:
Richard Pijl wrote:
bob wrote:First, forget the shared hash table stuff. It will not work for > 2 processors and deliver any kind of performance advantage at all. Any form of YBW produces far better results. yes the code is way more complicated. But you don't get something good for nothing. Anywhere.

Sorry, but I do not agree here.
The Baron has run successfully with a simple shared hashtable in all the recent tournaments, on a machine with 4 or even 8 processors. Agreed, the speedup of YBW is better on >2 machines, but the shared hashtable stuff does work here too and the performance increase is significant. At least for the Baron it does.
Last time I measured, 4CPU cores gave a speedup of 2.5, 8 CPU cores a speedup of 4.
Richard.

That curve is already looking bad. 2.5/4, 4/8, next is something like 6/16 which is really lousy. I will say 2.5/4 is not bad, but I'd like to see data for a bunch of test positions as that is better results than anyone else has produced, and I am assuming you are talking about the "ABDABA" or whatever the acronym for this is???

I'm not contesting that YBW is better. It's just whether the cheap MP method would be a good alternative. In my opinion it is as it is relatively easy to implement without getting many of the problems that implementing YBW would give you.

My implementation is very simple:
- 4 (or 2, or 8 or any power of 2) processes
- Hashtable in shared memory
- separate (small) pawn hashtables
- Main process uses PVS, 2 of the child processes too, 1 process uses MTD(f) to get some variation in the search paths. Two years ago I experimented with more MTD(f) processes, but that was at that time a little worse.
- When given the command to search a position, all processes will search the same position. No synchronisation whatsoever.
- Lockless, 2-stage hashtable.

The Baron has a built-in benchmark, basically searching 10 different and relatively quiet positions to a fixed depth. I've just rerun this test today with a 4 CPU-core version.

1-CPU core:
201.80 seconds for 'bench 4'
460.08 seconds for 'bench 5' (one ply deeper for all positions)

4-CPU cores:
64.59 seconds for 'bench 4'
151.15 seconds for 'bench 5'

I did this test only once with the MP version, no average, best or whatever. Just the first.

Test machine:
dual Opteron 270 with 4Gb RAM. 1.5Gb used for hashtables with both versions.

This gives a speedup for these 10 position of on average 3.0-3.1.
I know, not a test with a large enough amount of test positions, but it surely is an indication of the potential of this method.

I do not have an 8-core machine or I would have rerun the test with that one as well.
Richard.

by **Richard Pijl** » 24 Feb 2007, 17:33

Gerd Isenberg wrote:
Peter Fendrich wrote:What is Shared Hashtable?

/Peter

Hi Peter,

two or more threads/processes share a hashtable, but are otherwise searching the same position from root independently. They "randomly" gain from hash-entries stored by other threads. Likely the different treads will use slightly different move sorting to somehow improve parallel speedup by searching most disjoint subtrees. See also:

J.-C. Weill. The ABDADA Distributed Minimax-Search Algorithm. ICCA Journal, 19(1):3–14, (1996)

Gerd

The method in this paper used the shared hashtable, but by synchronizing via the hashtable tries to improve the speedup. You can do without that synchronization and still have a nice speedup.

I'm quite surprised myself that is works that well for the Baron. otoh it doesn't seem to work well for everybody.
Richard.

Winboard Forum

Stopping Threads

Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Re: Stopping Threads

Who is online