Winboard Forum

by **Tord Romstad** » 27 Jul 2006, 00:20

Hi all,

Today, I had the chance to play for a few hours with the quad Opteron I will use in Mainz next month (thanks, Ray!). I ran several positions (positional, tactical, and endgame) with 1, 2, 3 and 4 threads, and compared the results. To my big surprise, it turns out that the number of nodes needed to complete a given depth usually decreases when I increase the number of threads. This is, of course, exactly the opposite of what one would expect.

Unfortunately, I doubt that this result is caused by some revolutionary breakthrough in parallel search from my side. My parallel search is nothing more than an unoptimised, quick and dirty implementation of YBWC, without any unusual tricks. I therefore suspect that there must be some nasty bug or inefficiency waiting to be discovered in my search. Something that hurts a lot with just a single thread, but not so much with a bigger number of threads.

Does anybody have any idea what the problem could be? Awful move ordering, perhaps?

Tord

by **bob** » 27 Jul 2006, 03:53

This is actually something you should see here and there, just not frequently. If it is consistent, then it is time to start chasing bugs...

No real idea as to what would cause this without intimately knowing the internals of your program, which would be impossible in any reasonable amount of time... For parallel algorithms, the only good debug tool is between your ears, unfortunately. Nothing else works very well.

by **mjlef** » 27 Jul 2006, 08:17

My guess is that LMR runs very well with multiple processors, and having a global has table improves move ordering, and your speedups might come from there.

Suggestions:

Compare search speed with best move from has table disabled between single and multiple processor searches. You might be able to see if that is the source of the extra speedup. You could try the same with LMR turned off. We do not know what Rybka, SHredder, etc are doing, so your search algorithms might be responsible.

If you figure this out, let me know. Maybe I should SMP my program if it holds to be true.

Mark

by **bob** » 27 Jul 2006, 19:45

the problem is, it _can't_ hold up.

By logical induction, two processors can run twice as fast. If you get more, then your one-processor algorithm sucks. Simple theory says that if you get more than 2x the speedup with 2 cpus, then there is a better single-cpu algorithm.

For example, pretend you have two processors. Search one node with one thread, then one node with the other, then one with the first, one with the second, and you should get a faster algorithm even with no extra processor. If so, that just means your single-cpu algorithm (original) is just not very good...

by **Tord Romstad** » 28 Jul 2006, 14:49

bob wrote:This is actually something you should see here and there, just not frequently. If it is consistent, then it is time to start chasing bugs...

Exactly. I haven't tried a lot of positions (something like 15-20, I think) and it is still possible that the strange results are just a weird statistical fluctuation, but it does look suspicious.

I didn't count very carefully, but I think the number of nodes with 4 CPUs were lower than with 1 CPU about 60% of the time.

No real idea as to what would cause this without intimately knowing the internals of your program, which would be impossible in any reasonable amount of time... For parallel algorithms, the only good debug tool is between your ears, unfortunately. Nothing else works very well.

In this case, I suspect that the bug is not in the parallel search. If it was a parallel search bug, it should have some negative impact on the performance of my program on 2 or 4 CPUs compared to on 1 CPU. The rating difference between 1 and 2 CPUs should be somewhat smaller than you would expect when only considering the nodes/second speedup. If anything, the opposite seems to be true. The results I have seen so far indicates that my program gains more from 2 CPUs than it should.

I therefore suspect that I have a general search bug (as opposed to a parallel search bug), and that using more than one thread somehow alleviates the effect of the bug.

Tord

by **Tord Romstad** » 28 Jul 2006, 14:53

mjlef wrote:My guess is that LMR runs very well with multiple processors, and having a global has table improves move ordering, and your speedups might come from there.

Perhaps, but in that case somebody else should have observed the phenomenon before.

Suggestions:

Compare search speed with best move from has table disabled between single and multiple processor searches. You might be able to see if that is the source of the extra speedup. You could try the same with LMR turned off. We do not know what Rybka, SHredder, etc are doing, so your search algorithms might be responsible.

Yes, I guess I should try that. Too much else to test right now, though ...

If you figure this out, let me know.

Of course!

Maybe I should SMP my program if it holds to be true.

You should do that, even if it does not hold to be true. Parallel search is great fun, and worth doing for that reason alone.

Tord

by **Tord Romstad** » 28 Jul 2006, 15:00

bob wrote:the problem is, it _can't_ hold up.

In theory, I think it can - for a sufficiently inefficient search. Assume, as an extreme (and silly) example a program where the second move at a cut node never fails high, but that the third move always produces an extremely fast beta cutoff. Assume also that the sub-tree below the second move is always extremely big. In this case, I think going from 1 to 2 threads should speed up the search by a factor bigger than 2.

On the other hand, it of course shouldn't happen with a efficient and bug-free search with decent move ordering.

Tord

by **Rolf Sua** » 28 Jul 2006, 17:19

Hi there, first post so be gentle...

I guess that Tord's description of the really inefficient search applies to heavily pruned/extended alpha-beta in a way. And of course bob is right when he says that this could be improved.

Suppose you are at a cut node, so you want to prove that this position won't be played into by finding a cut-move. Usual alpha-beta starts with a 'best' guess which move might work (e.g. based on hash or iid) and works on that move until either it reached its aim or proved that the move is not actually a cut-move. Note that it never bails out if the situation gets too complicated.

If an algorithm searches two (or more) potential cut moves interlaced (single thread) or in parallel and they differ vastly in sub-tree size, then this could be a speed-up compared to a sequential search (e.g. if on average the sub-tree size differs by a factor of 4 and the choice which move to search first in the sequential case is random, then the sequential search needs to search on average 5/2*smaller tree size, whereas an interlazed search would only need 2*smaller tree size - provided of course that both are cut-moves). The higher the relative difference in tree size, the more these savings can happen.

I believe that a lot of alpha-beta searchers have fairly symmetric trees and thus usually no super-linear speed-up can be observed, but if Glaurung has a very asymmetric search-tree (maybe Tord can comment on this), then that could explain the super-linear speed ups.
Maybe it's worth to code an abandon threshold at cut nodes:
try move x to get a cutoff but if it isn't obtained with a certain effort, then don't complete searching x but rather start with the next best guess y and maybe come back to x later if y,z,... don't produce easy cut-offs either.
Note that humans search like this as well when trying to prove that a node is in fact a cut-node.

I'll look into Glaurung's code and see how difficult that would be to implement.
Maybe Tord can give us some of the positions on which he sees the super-linear speed-ups.

by **Robert Allgeuer** » 28 Jul 2006, 19:24

Rolf Sua wrote:Hi there, first post so be gentle...

Welcome

Rolf Sua wrote:...
I believe that a lot of alpha-beta searchers have fairly symmetric trees ...

Why should this be the case?
I would assume that the size of sub-trees - due to extensions and reductions firing - follows a statistical distribution around an average tree size. The harder pruning and extending a program is, the wider I would expect the standard deviation of this distribution. It would follow that for hard pruners - like Glaurung - the effect you describe would be more pronounced than for e.g. programs that hardly extend/prune.

Robert

by **bob** » 29 Jul 2006, 03:43

Tord Romstad wrote:
bob wrote:the problem is, it _can't_ hold up.

In theory, I think it can - for a sufficiently inefficient search. Assume, as an extreme (and silly) example a program where the second move at a cut node never fails high, but that the third move always produces an extremely fast beta cutoff. Assume also that the sub-tree below the second move is always extremely big. In this case, I think going from 1 to 2 threads should speed up the search by a factor bigger than 2.

That was the reason for my "exception". But if it _does_ hold up, you should just do a pseudo-parallel search on a single cpu and get the _same_ smaller tree. And now your super-linear speedup disappears as it should...

If you search worst-first, a parallel search will almost always produce a super-linear speedup...

On the other hand, it of course shouldn't happen with a efficient and bug-free search with decent move ordering.

Tord

Righto...

by **Fritz Grau** » 31 Jul 2006, 09:48

Rolf Sua's idea of asymmetric search trees seems very plausible to me. Before I give another, I should warn the reader that I have no experience with parallel search and but very little with chess programming.

My explanation is be based on the way the hash tables are filled (if Glaurung uses a common hash table for all search threads, which I assume). Again, as I have not read Glaurung's code, I guess you use an implementation, where an entry corresponds to a statement like "position X has value >= y". The true statement in alpha-beta-search would be similar to "position X has value >= y provided that its value lies within the current alpha-beta-window." As (1) it would consume time to check the additional assumption, (2) the simplified statement is probably true in most cases, (3) problems with false mate values can be avoided, it seems reasonable to be sloppy and believe in the simplified (stronger!) statement, resulting in pruning.

Now in single-thread search, provided that the move ordering is good and best moves are searched first, the alpha-beta-windows of most recursive search calls narrow around the calculated value of the current position, so they are near to each other.
In parallel search, I expect more alpha-beta-windows to be rather far from the value of the position, as they should narrow around the values of the different positions that succeed the current one. It seems feasible to collect statistics on the distribution of alpha-beta windows to check this.

If the guess above is true, the hash table of a parallel search contains more amplified statements, leading to stronger pruning of potentially relevant moves. If so, there should be cases where the evaluation at a fixed depth is different in parallel search. Is that so?

by **bob** » 04 Aug 2006, 04:19

that is correct, yes.

Another common problem is "tree grafting". Normally you search position A and then position B, and in a sequential search, since B is searched last, it can't be used while searching A, if a transposition occurs. But in a parallel search, it is possible to complete B before A has been finished, and the result from B can be used to influence the score for A, changing the best move, or the score, or the PV, or anything.

Non-deterministic behavior is the name of the game.

by **diepeveen** » 09 Aug 2006, 00:48

Hi Tord,

Quite important is how you implemented things for your reductions. Like your reductions in some incarnations are path dependant. Perhaps not all processors have path dependant information and therefore conclude they don't need to undo reductions?

Which positions did you try and what search depths do we speak about?

For example if you take 7 ply and somehow 7 ply gets searched sequential instead of in parallel, then obviously you see a lineair speedup when not measuring component time.

For tactical trick positions where score each iteration goes up and up, using less nodes is not so surprising.

Let me email you a set of 213 normal positions taken from a bunch of games.

Try it there.

Note that besides the number of positions you look at, quite overwhelming interesting is the time needed too.

The number of nodes needed is of course interesting data.

Note that in Diep better scaling speedup in general doesn't mean a near to lineair scaling. One of the earlier parallel implementations in diep of the parallellism as i used in 1999-2002, was having a bad scaling, but had also a near to lineair speedup in terms of number of nodes a ply needed.

When a program scales better, the other processors simply split less efficient or already start searching before the hashtable entries have been updated.

That trivially means in either case a bit bigger overhead.

Current parallel implementation in Diep i'm just doing another test to determine the parallel speedup with the latest improvements in search i've done.

There is no other forward pruning other than nullmove there. Some types of forward pruning are very bad for speedup, others are very good for speedup.

Nullmove obviously being very bad, reductions being sometimes very good, because in a given position P where a flip happens, the odds are bigger that you find a quicker path (in TIME) to get a cutoff to a node.

This is all different from program to program of course.

Note that nothing about superlineair speedups is new. some people claim that superlineair speedups are possible in search even for big number of processors.

by **diepeveen** » 09 Aug 2006, 00:56

Tord when i think more about it, it's of course the case that your reductions have a hashtable bug, which potentially gives cutoffs where it shouldn't do. When searching parallel you lay more stress upon the hashtable, meaning a statistical bigger odds that this bug happens, resulting in less nodes.

Usually the parallel search is a combination of many factors. LMR might play a very big role here.

Have a source code branch somewhere with the latest code?

Vincent

by **Scott Gasch** » 17 Aug 2006, 19:46

My first guess with this kind of thing would be a bug in how you apply extensions in the glue routine that is responsible for giving out the moves at a split in parallel and calling back into the search again. If the extensions are not consistent with what would have been done with no split (i.e. if they are not applied at all?) then I could definitely see a decrease in nodecount.

Scott

Winboard Forum

Superlinear speedup

Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Re: Superlinear speedup

Who is online