Page 1 of 1

Weak or strong YBWC?

PostPosted: 19 Jul 2006, 21:33
by Tord Romstad
Hi all,

I learned everything I needed for implementing my parallel search from this paper by Valavan Manohararajah. Like most programmers (Anthony is the most notable exception), I chose to base my parallel search on the "Young Brothers Wait Concept", which seemed to offer the best tradeoff between efficiency and ease of implementation.

Valavan distinguishes between two different forms of YBWC: Weak YBWC, where all moves after the first are searched in parallel, and strong YBWC, where parallel search is postponed until all "promising" moves have been searched in a class of nodes he calls "Y-CUT nodes". In order to keep things simple, I have only tried weak WBWC so far.

Does strong YBWC perform noticably better than weak YBWC on a small number of CPUs (8 or less)? If the answer is yes, what is the best definition of "promising moves" for chess? Winning captures and killer moves, perhaps?

Tord

Re: Weak or strong YBWC?

PostPosted: 20 Jul 2006, 05:38
by Daniel Shawul
Hi Tord
I tried both and another version of it (what feldmann called xYBWC). Currently i use xYBWC, so basically I start parallel search immediately at ALL nodes. At CUT nodes, parallel search is started after hash table move, winning captures and killers are searched.
Honestly i don't see a difference in speed up on a dual pc's , all of them give me 1.7 average. Feldmann himself does not get significant improvement from this, YBWC gave him 137 from 256 processors, and YBWC 140 from xYBWC.
I am wondering why my quad cpu speed up very much sucks. I get a speed up of 2.2/4 which is far from what i expected! Currently i am trying to use a work stealing techinque, ie let the processors search for split blocks and select the better one, and join there.
I really appreciate it if some one with experience explain this to me.
regards,
Daniel

Re: Weak or strong YBWC?

PostPosted: 20 Jul 2006, 07:18
by bob
note that DTS _is_ based on YBW. The main point to DTS is that there is more flexibility, since the usual YBW is implemented as a simple test at any node, only done after one move has been completely searched. DTS just extends that to the ultimate level to say "OK, I now have an idle processor, this processor can jump in and help at _any_ node where the "YBW" condition (one move has been completely searched) has been satisfied, if such a node exists. If not, DTS will pick a likely split candidate and split anyway, where a traditional YBW program will have idle processors for brief periods of time...

But DTS is not a radical departure from YBW, which is just a re-statement of the original PVS (principal variation split) many of us used in the late 70's and early 80's for parallel search...

Re: Weak or strong YBWC?

PostPosted: 20 Jul 2006, 08:19
by Scott Gasch
Tord Romstad wrote:Hi all,
Does strong YBWC perform noticably better than weak YBWC on a small number of CPUs (8 or less)? If the answer is yes, what is the best definition of "promising moves" for chess? Winning captures and killer moves, perhaps?


I do not know the answer to the first part of your question. But as to the second part, I use this condition:

Code: Select all
        //
        // Can we search the remaining moves in parallel?
        //
        if ((((uLegalMoves >= 1) && ((iAlpha + 1) != iBeta)) ||
             ((uLegalMoves >= 3) && (iAlpha == iInitialAlpha))) &&
            (0 != g_uNumHelpersAvailable) &&
            (uDepth >= TWO_PLY))
        {


The results of this are pretty good, IMHO. Here's a sample taken from the first few positions in last night's ecm run:

Code: Select all
Split 2142 times total (~ 107.05x/sec).
  ...97 ( 4.53 percent) terminated early.
Helper thread 0: 98.00 percent busy.

Split 3491 times total (~ 173.66x/sec).
  ...153 ( 4.38 percent) terminated early.
Helper thread 0: 98.68 percent busy.

Split 2306 times total (~ 114.92x/sec).
  ...90 ( 3.90 percent) terminated early.
Helper thread 0: 98.86 percent busy.

Split 2377 times total (~ 117.98x/sec).
  ...53 ( 2.23 percent) terminated early.
Helper thread 0: 98.73 percent busy.

Split 1716 times total (~  85.08x/sec).
  ...94 ( 5.48 percent) terminated early.
Helper thread 0: 98.48 percent busy.

Split 2848 times total (~ 142.16x/sec).
  ...30 ( 1.05 percent) terminated early.
Helper thread 0: 98.35 percent busy.


Note that I'm running on a dual proc machine with one main thread and one helper thread. A "split" terminating early basically means one of the threads working on it failed high and I had to abort the other thread's search. This is reasonably rare -- you can improve it by messing with the condition above but I am happy with this balance.

As an aside, I've messed around with trying to use the same fail high "history" counters people are using for LMR to help pick where it is good to split the tree but never got this working as well as the simple conditions above. With good move ordering, though, you're reasonably sure that if you will fail high it will happen in the first few moves... therefore if you don't fail high in the first few moves you're probably not going to.

Also note that g_uNumHelpersAvailable is volatile and, on 4cpu machines, by the time this thread gets around to splitting the tree it may have to go it alone (if all available helpers got tied up working on another split). I've not done a lot of testing on 4cpu machines but this seems to work reasonably well.

Scott

Re: Weak or strong YBWC?

PostPosted: 20 Jul 2006, 16:12
by bob
I would also add that your "2 ply" limit is not the best solution. I have such a parameter, but it is tunable, and the best (optimal) value varies depending on the hardware. For example, on NUMA boxes, a larger number works better. I currently have a "percentage" value rather than an absolute depth remaining, because in fine 70 you do _not_ want to split 2 plies from the tips, the tree is _very_ narrow and that will produce way too many splits and the associated overhead. I believe I am currently using 70% (I will check later) which says "if less than 30% of the total depth is left, do not split..."

Re: Weak or strong YBWC?

PostPosted: 21 Jul 2006, 04:52
by Scott Gasch
Yes, that makes sense, Bob. Thanks for the tip. I'll run some tests on various settings and post my results back here.

Re: Weak or strong YBWC?

PostPosted: 21 Jul 2006, 12:30
by Tord Romstad
Hi all,

Thanks for the feedback!

Daniel Shawul wrote:I am wondering why my quad cpu speed up very much sucks. I get a speed up of 2.2/4 which is far from what i expected!


That's not too bad compared to most other amateurs, I think. IIRC, I get around 2.4 on a quad Opteron, and something like 2.8 on a quad G5. It's difficult to optimise for these monster machines unless you have one to experiment with yourself.

Tord

Re: Weak or strong YBWC?

PostPosted: 21 Jul 2006, 12:32
by Tord Romstad
bob wrote:note that DTS _is_ based on YBW. The main point to DTS is that there is more flexibility, since the usual YBW is implemented as a simple test at any node, only done after one move has been completely searched. DTS just extends that to the ultimate level to say "OK, I now have an idle processor, this processor can jump in and help at _any_ node where the "YBW" condition (one move has been completely searched) has been satisfied, if such a node exists. If not, DTS will pick a likely split candidate and split anyway, where a traditional YBW program will have idle processors for brief periods of time...

But DTS is not a radical departure from YBW, which is just a re-statement of the original PVS (principal variation split) many of us used in the late 70's and early 80's for parallel search...


I agree. It's just a further refinement, and not a radical departure. But DTS looks considerably messier to implement, which is why I chose not to use it for my first implementation.

Tord

Re: Weak or strong YBWC?

PostPosted: 21 Jul 2006, 14:26
by bob
that's also why I don't use it. I didn't want to dump the recursive search, which is necessary to implement DTS without a _lot_ of extra hassles... DTS came about due to the 16 and 32 CPU Cray machines we used in the 80's and early 90's, and I suppose I might have to bite the bullet one day since we are expecting quad quad-cores and 8-way quad cores for next year...