Winboard Forum

Posted: **17 Aug 2006, 19:41**

Hi,

I recently got a dual proc dual core machine and set off to get my engine (which already worked on a dual proc machine) to work on this new quad beast.

The scheme I use is to keep a pool of "helper threads" and a pool of "split structs". When any thread reaches a point in the tree which looks like an all node it:

1. checks to see if there are idle helpers
2. populates a split struct
3. and assigns the helpers to help it search in parallel.

A thread _can_ resplit the tree if it is already under a split and threads are idle. This helps to minimize the idle time of helper threads.

On a dual proc machine this works reasonably well. Raw NPS increases proportionally to the number of searcher threads, all helper threads are kept busy 95%+ of the time, and overall speedup is about 1.5x. There's room for improvement here but its not too bad.

Searching with 4 threads works... but the raw speedup is less than I had hoped. The overall speedup with 4 threads is only about 2.25x. The raw speedup with 3 threads is 1.85x. Of course the raw NPS scales roughly linearly in both cases. Again, all helper threads are busy ~95% of the time.

I've checked things like lock contention and the speed with which a helper thread can abort the search if a split above it is terminated early. I have convinced myself that these are both reasonable.

I'm keeping track of the number of splits that terminate early and, across a big test suite like ecm, it averages around 6%.

One thing I have found is that if I drastically reduce the size of the hash table there is a reasonably large speed increase. I am operating under the theory that part of the problem is memory contention.

I'm wondering if anyone else has played much with parallel search (esp on a 4 proc box) and has results to share or advice about how to minimize parallel overhead.

Thx,
Scott

Posted: **17 Aug 2006, 23:22**

Scott Gasch wrote:Hi,

I recently got a dual proc dual core machine and set off to get my engine (which already worked on a dual proc machine) to work on this new quad beast.

The scheme I use is to keep a pool of "helper threads" and a pool of "split structs". When any thread reaches a point in the tree which looks like an all node it:

1. checks to see if there are idle helpers
2. populates a split struct
3. and assigns the helpers to help it search in parallel.

A thread _can_ resplit the tree if it is already under a split and threads are idle. This helps to minimize the idle time of helper threads.

On a dual proc machine this works reasonably well. Raw NPS increases proportionally to the number of searcher threads, all helper threads are kept busy 95%+ of the time, and overall speedup is about 1.5x. There's room for improvement here but its not too bad.

Searching with 4 threads works... but the raw speedup is less than I had hoped. The overall speedup with 4 threads is only about 2.25x. The raw speedup with 3 threads is 1.85x. Of course the raw NPS scales roughly linearly in both cases. Again, all helper threads are busy ~95% of the time.

I've checked things like lock contention and the speed with which a helper thread can abort the search if a split above it is terminated early. I have convinced myself that these are both reasonable.

I'm keeping track of the number of splits that terminate early and, across a big test suite like ecm, it averages around 6%.

One thing I have found is that if I drastically reduce the size of the hash table there is a reasonably large speed increase. I am operating under the theory that part of the problem is memory contention.

I'm wondering if anyone else has played much with parallel search (esp on a 4 proc box) and has results to share or advice about how to minimize parallel overhead.

Thx,
Scott

Hi Scott,

Nice to see you're busy parallel.

You mention 'idle thread'. What do you mean with an idle thread.

a) a thread that is very busy polling; so not idle, it just has no job to do
b) a real idle thread that needs to get put in what in linux we would call the 'runqueue'.

i'd note that i'm using A. In linux that's a 10 ms latency in case of B which A doesn't have.

c) how much work needs to get done to do a split?
in short how much bytes do you ship and how many locks and other funky windows commands do you use to split?

Get rid of windows function calls i'd say.

In Diep when doing a split i basically only need to ship a few moves to a processor that's idle (in short not an idle processor but a polling processor which polls in its own local memory).

d) you mention you work with threads. Everyone working with threads reports scaling problems with windows. Everyone working with processes directly works great. I don't say threads are not going to work, i just want to note that it's a lot easier for most programmers to get multiprocessing to work above multithreading. Besides that multithreading loses you a register (which nowadays with all those rename registers might not be that relevant in 64 bits like it used to be in the past with processors). Just now and then you gotta restart all your processes as the shared memory gets somehow slow by the kernel. you know more there than i do why the memory gets slow. A reboot of the machine helps a lot. In linux there is not a problem at all. Multiprocessing easier lets your program scale better then. Perhaps it's harder to fix things with threads than with processes?

e) with regards to the speedup, the usual way to split is to only split using the YBW (young brother wait) principle. In diep one of the miminum conditions to split at under or equal to 16 processors is to split only when i have already searched nullmove, done IID (if that gets used anyway), and searched the first move in the movelist.

Only THEN it's possible to do a split sometimes (there is other conditions btw which are less relevant).

This YBW principle works great.

p.s. it's all about at which memory node you allocate the memory at an AMD box.

Posted: **18 Aug 2006, 01:17**

diepeveen wrote:You mention 'idle thread'. What do you mean with an idle thread.

a) a thread that is very busy polling; so not idle, it just has no job to do
b) a real idle thread that needs to get put in what in linux we would call the 'runqueue'.

I mean the former. The helper threads are spinning looking for work. As you observe, scheduler latency is too high to sleep on waitable objects.

c) how much work needs to get done to do a split?
in short how much bytes do you ship

A significant amount of work is done; I send all the moves from the root position to the split point in addition to the current board state. There is room for improvement here but I question how much of an overhead this is. I'm splitting maybe 200x per second. But it's certainly worth optimizing at some point.

and how many locks and other funky windows commands do you use to split? Get rid of windows function calls i'd say.

Zero windows function calls. Any locks involved are spin locks done in asm.

d) you mention you work with threads. Everyone working with threads reports scaling problems with windows. Everyone working with processes directly works great. I don't say threads are not going to work, i just want to note that it's a lot easier for most programmers to get multiprocessing to work above multithreading.

I'm using threads; native threads on Win32 and pthreads on linux. Doing separate processes doesn't make much sense to me especially on NT where a scheduler unit of allocation is a thread.

Besides that multithreading loses you a register (which nowadays with all those rename registers might not be that relevant in 64 bits like it used to be in the past with processors).

What register is this?

e) with regards to the speedup, the usual way to split is to only split using the YBW (young brother wait) principle. In diep one of the miminum conditions to split at under or equal to 16 processors is to split only when i have already searched nullmove, done IID (if that gets used anyway), and searched the first move in the movelist.

Only THEN it's possible to do a split sometimes (there is other conditions btw which are less relevant).

This YBW principle works great.

This is very similar to what I am doing. As I said in my first post I am seeing about 6% of my splits have to be terminated early on average. The worst case I have seen in an ECM/20 run is 15% terminate early. Are these results typical of what you are seeing?

p.s. it's all about at which memory node you allocate the memory at an AMD box.

This box is a dual dual-core woodcrest which, I believe, is shared memory bus(?) Are you allocating different structures out of per-node memory?

Thx,
Scott

Posted: **18 Aug 2006, 01:34**

Scott Gasch wrote:
diepeveen wrote:You mention 'idle thread'. What do you mean with an idle thread.

a) a thread that is very busy polling; so not idle, it just has no job to do
b) a real idle thread that needs to get put in what in linux we would call the 'runqueue'.

I mean the former. The helper threads are spinning looking for work. As you observe, scheduler latency is too high to sleep on waitable objects.

c) how much work needs to get done to do a split?
in short how much bytes do you ship

A significant amount of work is done; I send all the moves from the root position to the split point in addition to the current board state. There is room for improvement here but I question how much of an overhead this is. I'm splitting maybe 200x per second. But it's certainly worth optimizing at some point.

and how many locks and other funky windows commands do you use to split? Get rid of windows function calls i'd say.

Zero windows function calls. Any locks involved are spin locks done in asm.

d) you mention you work with threads. Everyone working with threads reports scaling problems with windows. Everyone working with processes directly works great. I don't say threads are not going to work, i just want to note that it's a lot easier for most programmers to get multiprocessing to work above multithreading.

I'm using threads; native threads on Win32 and pthreads on linux. Doing separate processes doesn't make much sense to me especially on NT where a scheduler unit of allocation is a thread.

Besides that multithreading loses you a register (which nowadays with all those rename registers might not be that relevant in 64 bits like it used to be in the past with processors).

What register is this?

e) with regards to the speedup, the usual way to split is to only split using the YBW (young brother wait) principle. In diep one of the miminum conditions to split at under or equal to 16 processors is to split only when i have already searched nullmove, done IID (if that gets used anyway), and searched the first move in the movelist.

Only THEN it's possible to do a split sometimes (there is other conditions btw which are less relevant).

This YBW principle works great.

This is very similar to what I am doing. As I said in my first post I am seeing about 6% of my splits have to be terminated early on average. The worst case I have seen in an ECM/20 run is 15% terminate early. Are these results typical of what you are seeing?

p.s. it's all about at which memory node you allocate the memory at an AMD box.

This box is a dual dual-core woodcrest which, I believe, is shared memory bus(?) Are you allocating different structures out of per-node memory?

Thx,
Scott

Hi Scott,

AFAIK woodcrest is not NUMA. Well anyway, diep scales 100% on woodcrest AFAIK, though i'd like to see this tested. Can i ship you a diep version to see its scaling at woodcrest?

What is your email?
mine still is diep @ xs4all.nl

Diep allocates for each process a block of shared memory and puts all its datastructure inside it and majority of that memory is hashtable.

Then in a n^2 attachment all processes attach to the shared memory of other processes. There is not so much pointers inside the shared memory nowadays, depending upon how i compile, so not for all versions there is a big need for being in the same adress space.

Basically each process runs in its own memory. Woodcrest is exchanging ideas between cores between L1 caches if i understand well. Superior chip.

So because each process is just busy with its own as much as possible it's only bad thing is that it is doing transpositiontable lookup at each node with of course in case of 4 cores quite a bit less than 75% chance it's not on this core.

If it isn't it needs a request from RAM. At AMD this matters a lot. At woodcrest i have honestely no idea.

I do not know how well diep scales on woodcrest.
If you could test it, you could theorize further perhaps.

I'm guessing 4.0 scaling by the way.

Thanks,
Vincent

Posted: **18 Aug 2006, 02:20**

I'd bet the biggest problem is the dual-core bottleneck of getting to memory through one path. That means that each processor only has 1/2 the memory bandwidth of a normal single-core processor. You have to minimize memory references to survive on dual and quad-core boxes.

Posted: **18 Aug 2006, 06:16**

bob wrote:I'd bet the biggest problem is the dual-core bottleneck of getting to memory through one path. That means that each processor only has 1/2 the memory bandwidth of a normal single-core processor. You have to minimize memory references to survive on dual and quad-core boxes.

Yes this makes sense and my experiments seems to confirm it. When I do not bother to store fail high data for moves that fail high with low remaining depth (thus eliminating some lock contention and some contention for the memory bus) the engine speeds up a little to around 2.6x faster than a single proc searcher.

It seems to me like the big contenders for the memory bus will be: fail high history table, full eval cache and main hash table. I guess I'll play with limiting these guys somehow and observe what happens to the effective engine speedup.

Thanks for the info, Bob.

Scott

Posted: **18 Aug 2006, 06:24**

AFAIK woodcrest is not NUMA. Well anyway, diep scales 100% on woodcrest AFAIK, though i'd like to see this tested. Can i ship you a diep version to see its scaling at woodcrest?

What is your email?
mine still is diep @ xs4all.nl

Yeah I'd be glad to test it out -- it would be nice to have another engine to compare numbers with too. The machine is a mac pro -- do you have a binary for OSX? I'm building my codebase currently with the gcc version from the apple "xcode" package. Someday soon I'll download a trial of icc and see how much better that is. I may also put vmware or something on here soon so, depending on whether it supports 4cpus, I may be able to run a win32 binary.

scott.gasch @ gmail.com

So because each process is just busy with its own as much as possible it's only bad thing is that it is doing transpositiontable lookup at each node with of course in case of 4 cores quite a bit less than 75% chance it's not on this core.

If it isn't it needs a request from RAM. At AMD this matters a lot. At woodcrest i have honestely no idea.

Well if I understand the architecture of this thing (which is debatable) I think that there is no "local" memory for a cpu/core. All memory should be uniform because it goes over the same bus. That said, I believe one of my problems is that I am saturating that bus -- see my response to Bob.

I do not know how well diep scales on woodcrest.
If you could test it, you could theorize further perhaps.

I'm guessing 4.0 scaling by the way.

That would be quite amazing... I'm a bit skeptical.

Scott

Posted: **18 Aug 2006, 12:24**

bob wrote:I'd bet the biggest problem is the dual-core bottleneck of getting to memory through one path. That means that each processor only has 1/2 the memory bandwidth of a normal single-core processor. You have to minimize memory references to survive on dual and quad-core boxes.

Look what i found:

Code: Select all: 4 core Woodcrest system with 4 sticks of 2 GB 667 MHz FB-dimms: Function Rate (MB/s) Avg time Min time Max time Copy: 4813.2590 0.2663 0.2659 0.2675 Scale: 4826.2181 0.2656 0.2652 0.2662 Add: 5244.0135 0.3664 0.3661 0.3669 Triad: 5244.3106 0.3665 0.3661 0.3671 This was a pathscale compiled stream executable with openmp using 4 threads.

Though not near as good as dual opteron dual core, it is quite workable.

Posted: **18 Aug 2006, 14:50**

Scott Gasch wrote:
AFAIK woodcrest is not NUMA. Well anyway, diep scales 100% on woodcrest AFAIK, though i'd like to see this tested. Can i ship you a diep version to see its scaling at woodcrest?

What is your email?
mine still is diep @ xs4all.nl

Yeah I'd be glad to test it out -- it would be nice to have another engine to compare numbers with too. The machine is a mac pro -- do you have a binary for OSX? I'm building my codebase currently with the gcc version from the apple "xcode" package. Someday soon I'll download a trial of icc and see how much better that is. I may also put vmware or something on here soon so, depending on whether it supports 4cpus, I may be able to run a win32 binary.

scott.gasch @ gmail.com

So because each process is just busy with its own as much as possible it's only bad thing is that it is doing transpositiontable lookup at each node with of course in case of 4 cores quite a bit less than 75% chance it's not on this core.

If it isn't it needs a request from RAM. At AMD this matters a lot. At woodcrest i have honestely no idea.

Well if I understand the architecture of this thing (which is debatable) I think that there is no "local" memory for a cpu/core. All memory should be uniform because it goes over the same bus. That said, I believe one of my problems is that I am saturating that bus -- see my response to Bob.

I do not know how well diep scales on woodcrest.
If you could test it, you could theorize further perhaps.

I'm guessing 4.0 scaling by the way.

That would be quite amazing... I'm a bit skeptical.

Scott

All wishlists fit in 4MB L2. Should be really 4.0 scaling for Diep.
Diep is win32 though.

You sure this VMWARE isn't killing your apps performance?

I do have testresults from diep at 4 cores woodcrest already for a year now or so.

Just not scaling numbers.

I was at the time just interested at how fast the chip was going to get.

Diep scales 4.0 at itanium2 with 6MB L2.
Diep scales 4.0 at R14000.
Diep scales 3.92 at dual opteron dual core
Diep scales 3.94 at quad opteron single core

So it's interesting to know woodcrest. I bet 4.0.
Just a guess. Wouldn't know else how to get a 20% higher IPC otherwise (4 cores versus 4 cores, so not single core compares).

What i do not know from woodcrest is whether you can do reads in parallel from the memory controller, nor do i know its single core performance.

Those highend machines can do reads in parallel usual. Not sure about woodcrest.

It's an amazing chip.

Posted: **19 Aug 2006, 05:36**

Scott Gasch wrote:
bob wrote:I'd bet the biggest problem is the dual-core bottleneck of getting to memory through one path. That means that each processor only has 1/2 the memory bandwidth of a normal single-core processor. You have to minimize memory references to survive on dual and quad-core boxes.

Yes this makes sense and my experiments seems to confirm it. When I do not bother to store fail high data for moves that fail high with low remaining depth (thus eliminating some lock contention and some contention for the memory bus) the engine speeds up a little to around 2.6x faster than a single proc searcher.

It seems to me like the big contenders for the memory bus will be: fail high history table, full eval cache and main hash table. I guess I'll play with limiting these guys somehow and observe what happens to the effective engine speedup.

Thanks for the info, Bob.

Scott

You get burned _badly_ on shared data of any kind. the MOESI cache coherency protocol can get swamped if multiple processors are modifying the same memory values, as might happen in a global/shared history table. So shared data has to be strictly controlled as to how frequently you modify it. Accessing shared data is fine as it can get cached, but once you modify it, a whirlwind of cache-to-cache traffic occurs to maintain coherency.

Large data structures are equally bad as if they can't fit into cache, they have to come from memory, and you are now using more of that memory bandwidth that is scarce enough already...

And finally you have to be careful and be sure that data you are using frequently is stored on the local node's memory...

Posted: **21 Aug 2006, 05:24**

Progress report: my 4 thread searcher efficiency is up over 3x now (barely). I'm still messing with the algorithm to try to squeeze out any extra performance.

One thing I need to turn my attention to now is how the raw nps scales. I said it was "linear" in my previous post. Well, it's linear but not exactly x=y. I'm seeing only about a 85% improvement in nps per searcher thread I add. I attribute some of this to searcher thread busyness (which averages around 95%) and some of this to lock contention. So I'm trying to improve both in the hope that doing so will also improve overall parallel efficiency now.

Thanks for the ideas, everyone.

Scott

Posted: **21 Aug 2006, 12:35**

Scott Gasch wrote:Progress report: my 4 thread searcher efficiency is up over 3x now (barely). I'm still messing with the algorithm to try to squeeze out any extra performance.

One thing I need to turn my attention to now is how the raw nps scales. I said it was "linear" in my previous post. Well, it's linear but not exactly x=y. I'm seeing only about a 85% improvement in nps per searcher thread I add. I attribute some of this to searcher thread busyness (which averages around 95%) and some of this to lock contention. So I'm trying to improve both in the hope that doing so will also improve overall parallel efficiency now.

Thanks for the ideas, everyone.

Scott

hi Scott,

the problem is not the memory latency.

Memory latency at dual opteron dual core is roughly 160-200 ns to get a byte or 8 @ random from 2 GB at 4 processes.

At dual woodcrest dual core the latency is 135 ns at 2.6Ghz woodcrest chips to get 8 bytes @ random with 4 processes.

Latency of the DDRII ram and that chipset is completely superior to DDR ram (though probably logical to you, with RDRAM in past this was not so logical).

So this is an ideal machine for computerchess.

The memory bandwidth is limited but the latency is superior.
Additional 4 MB of shared L2.

Vincent

Posted: **21 Aug 2006, 16:02**

Scott Gasch wrote:Progress report: my 4 thread searcher efficiency is up over 3x now (barely). I'm still messing with the algorithm to try to squeeze out any extra performance.

One thing I need to turn my attention to now is how the raw nps scales. I said it was "linear" in my previous post. Well, it's linear but not exactly x=y. I'm seeing only about a 85% improvement in nps per searcher thread I add. I attribute some of this to searcher thread busyness (which averages around 95%) and some of this to lock contention. So I'm trying to improve both in the hope that doing so will also improve overall parallel efficiency now.

Thanks for the ideas, everyone.

Scott

Here is the thing to watch out for:

Cache linesize varies. L1 linesize is 32 bytes. L2 can be 64 (AMD) or 128 (more recent Intels). The problem with this is the MOESI cache coherency protocol. If you have four threads, and each increments a counter for every node, and the 4 counters are in adjacent memory addresses, this will crush performance since even though each _value_ is not being shared, they live in a common cache line, and cache coherency is line-oriented, not word/byte oriented.

Make certain that any values that are close together in memory are not modified by different threads, when possible. This will greatly reduce the cache-to-cache communication and help NPS scaling significantly...

Winboard Forum

Parallel search efficiency

Parallel search efficiency

Re: Parallel search efficiency

Re: Parallel search efficiency

Re: Parallel search efficiency

Re: Parallel search efficiency

Re: Parallel search efficiency

Re: Parallel search efficiency

Re: Parallel search efficiency

Re: Parallel search efficiency

Re: Parallel search efficiency

Re: Parallel search efficiency

Re: Parallel search efficiency

Re: Parallel search efficiency