
Questions about scaling on multiple CPUs

PostPosted: 22 Aug 2007, 08:51
by Tord Romstad
I am quite happy with the performance of the parallel search in Glaurung 1.2.1 on 4 CPUs or less. I don't have any precise numbers concerning the exact speedup, but the rating improvements with 2 and 4 CPUs compared to 1 CPU on the CCRL are quite nice: 50 Elo points with 2 CPUs, and 90 Elo points with 4 CPUs. This is better than Hiarcs 11.1, Rybka 2.3.2a, Shredder 10 and Junior 10, so I can't complain.

Glaurung 2, however, seems to scale less well. On 2 CPUs there are no problems, but on 4 CPUs the speed is almost the same as on 2 CPUs, according to my testers (I don't have a quad, so I can't test it myself). The N/s speedup when going from 2 to 4 CPUs is only about 10%. This is very strange, because the parallel search in Glaurung 1 and Glaurung 2 is almost exactly the same. It doesn't make sense to me that one performs so much better than the other.

Does anyone have any idea what the problem could be? I am quite sure it is not excessive locking/unlocking. General advice about low-level optimization (and profiling) of multi-threaded programs is also welcome. My high-level algorithms seem to be efficient and reasonably bug-free, at least compared to other programs of similar strength.

A mostly unrelated question about scaling: A few days ago, I upgraded from a Core Duo 2 GHz to a Core 2 Duo 2.8 GHz. On the Core Duo, the N/s speedup with two threads was about 1.73, but on the Core 2 Duo it is about 1.92. Is this normal?

Tord

Re: Questions about scaling on multiple CPUs

PostPosted: 22 Aug 2007, 20:31
by Dann Corbit
The sure way to find out what is wrong is to profile it on a 2 and 4 CPU machine and examine the profile results.

Probably, someone who has a 4 CPU machine can profile it for you and send you the results or post it here.

I imagine that a profile will easily show the bottleneck and the problem will be resolved.

The Intel profiler has a really nice feature to show bottlenecks (there is a big, fat, red line along the path that is slowing the program the most).

Re: Questions about scaling on multiple CPUs

PostPosted: 23 Aug 2007, 02:28
by bob
It can sometimes be something very simple. For example, newer processors use the MOESI cache-coherency protocol. If you have a single cache line with multiple values in it, where each can be modified by a different thread, you will get a great degradation of search speed, because that cache line is bounced back and forth between the caches every time it is modified, and that is wildly expensive.

I used to have an array of pointers for work to do, an array of node counts (one per thread), and so forth. Both were bad, because multiple threads were modifying adjacent values (no locks needed, since any single value only gets modified by one thread), so the cache traffic was overwhelming.

There are many such issues to watch out for. Anything that gets modified needs to be at least 64 bytes away from any value that might be modified by another thread. Which means that any variables of the form xxx[thread_id] are bad unless a single element of XXX is at least 64 bytes long...
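
For example, something like this (just a rough sketch, not code from any particular engine; the names are made up and it assumes a 64-byte cache line):
Code: Select all
#include <stdint.h>

#define MAX_THREADS 8
#define CACHE_LINE_SIZE 64   // assumed; some chips use 128

// Bad: eight 8-byte counters share one 64-byte cache line, so every
// increment by one thread invalidates the line for all the others.
volatile uint64_t sharedNodeCount[MAX_THREADS];

// Better: a full cache line of unused padding after each counter means
// the counters of two different threads can never land in the same
// 64-byte line, whatever the array's alignment happens to be.
struct PaddedCounter {
  volatile uint64_t nodes;
  char pad[CACHE_LINE_SIZE];
};

PaddedCounter paddedNodeCount[MAX_THREADS];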

And yes, that scaling is not that unusual. The Core Duo is a dog of a chip using a faked-up 64-bit architecture. The Core 2 is the way it ought to be.

Re: Questions about scaling on multiple CPUs

PostPosted: 23 Aug 2007, 02:29
by bob
Dann Corbit wrote:The sure way to find out what is wrong is to profile it on a 2 and 4 CPU machine and examine the profile results.

Probably, someone who has a 4 CPU machine can profile it for you and send you the results or post it here.

I imagine that a profile will easily show the bottleneck and the problem will be resolved.

The Intel profiler has a really nice feature to show bottlenecks (there is a big, fat, red line along the path that is slowing the program the most).


Actually, it most likely won't help at all. The inefficiency is most likely in the handling of shared memory and cache, which the profiler will show as evenly distributed across all threads, so the profile won't point to anything useful.

Re: Questions about scaling on multiple CPUs

PostPosted: 23 Aug 2007, 07:46
by Tord Romstad
Hello Bob,

Thanks for the advice!

bob wrote:There are many such issues to watch out for. Anything that gets modified needs to be at least 64 bytes away from any value that might be modified by another thread. Which means that any variables of the form xxx[thread_id] are bad unless a single element of XXX is at least 64 bytes long...


That could indeed be the problem. I have the following definitions:
Code: Select all
struct Thread {
  SplitPoint *splitPoint;
  volatile uint64 nodes;
  volatile bool stop;
  volatile bool running;
  volatile bool idle;
  volatile bool workIsWaiting;
  volatile bool printCurrentLine;
};

Thread Threads[THREAD_MAX];

The 'Thread' struct is less than 64 bytes big, and the Threads[] array is written to at every node (Threads[threadID].nodes is incremented). I'll try to introduce lots of unused pad bytes at the end of the struct and see if it helps.
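
Something like this, I suppose (just a sketch of what I have in mind; the 64 pad bytes assume a 64-byte cache line):
Code: Select all
struct Thread {
  SplitPoint *splitPoint;
  volatile uint64 nodes;
  volatile bool stop;
  volatile bool running;
  volatile bool idle;
  volatile bool workIsWaiting;
  volatile bool printCurrentLine;
  unsigned char pad[64];  // never read or written; keeps this thread's data a
                          // full cache line away from the next Thread's data
};

Thread Threads[THREAD_MAX];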

Are there any good books or online resources about this subject? I haven't been able to find anything interesting.

Tord

Re: Questions about scaling on multiple CPUs

PostPosted: 23 Aug 2007, 07:47
by Tord Romstad
bob wrote:
Dann Corbit wrote:The sure way to find out what is wrong is to profile it on a 2 and 4 CPU machine and examine the profile results.

Probably, someone who has a 4 CPU machine can profile it for you and send you the results or post it here.

I imagine that a profile will easily show the bottleneck and the problem will be resolved.

The Intel profiler has a really nice feature to show bottlenecks (there is a big, fat, red line along the path that is slowing the program the most).


Actually, it most likely won't help at all. The inefficiency is most likely in the handling of shared memory and cache, which the profiler will show as evenly distributed across all threads, so the profile won't point to anything useful.

Yes, that's precisely the problem. The profiler output with multiple threads is almost exactly the same as with a single thread.

Tord

Re: Questions about scaling on multiple CPUs

PostPosted: 23 Aug 2007, 19:37
by bob
Tord Romstad wrote:Hello Bob,

Thanks for the advice!

bob wrote:There are many such issues to watch out for. Anything that gets modified needs to be at least 64 bytes away from any value that might be modified by another thread. Which means that any variables of the form xxx[thread_id] are bad unless a single element of XXX is at least 64 bytes long...


That could indeed be the problem. I have the following definitions:
Code: Select all
struct Thread {
  SplitPoint *splitPoint;
  volatile uint64 nodes;
  volatile bool stop;
  volatile bool running;
  volatile bool idle;
  volatile bool workIsWaiting;
  volatile bool printCurrentLine;
};

Thread Threads[THREAD_MAX];

The 'Thread' struct is less than 64 bytes big, and the Threads[] array is written to at every node (Threads[threadID].nodes is incremented). I'll try to introduce lots of unused pad bytes at the end of the struct and see if it helps.

Are there any good books or online resources about this subject? I haven't been able to find anything interesting.

Tord


I have not found anything that touches on this level of programming.

The obvious thing for your case is to simply pad that struct out to 64 bytes, so that your array of them is now 64 bytes per element.

just be _very_ careful of any declarations of the type:

[decltype] name[MAX_THREADS];

as that implies that each thread will be accessing a different element of that thing. Each element needs to be far enough away from the adjacent elements that there is no chance they end up on the same cache line.

Note that the compiler is clueless, so it is probably safest to be more conservative and pad on the front and back, to make sure that a single element doesn't span two cache lines. If it does, you have a 3-way contention: between A and B for the first line (which has part of A's data and part of B's), and between B and C for the second line. That would be ugly. I have used the "address diddle" trick: malloc a large block of memory, round the address up to the next multiple of 64, and then use blocks of 64 bytes, knowing that they now sit on cache line boundaries...

Obviously locks are bad, and it is far better to have more locks rather than fewer, so that a lock protects one specific thing rather than a group of things, to avoid too much contention. Again, the locks can fly around the L1/L2 caches, killing performance. I usually embed each lock in the data structure it protects, which makes sure that lock is only accessed when that data structure is being used, to reduce contention.
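
Something along these lines (a sketch only, not my actual code; the table and the names are made up, and the mutex is initialized once with pthread_mutex_init elsewhere):
Code: Select all
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

// The lock sits inside the structure it protects, so grabbing it touches
// memory the thread is about to use anyway, and unrelated locks never end
// up competing for the same cache line.
struct SharedTable {
  pthread_mutex_t lock;     // protects only the fields of this table
  uint64_t *entries;
  size_t size;
};

void table_store(SharedTable *t, size_t index, uint64_t value) {
  pthread_mutex_lock(&t->lock);
  t->entries[index % t->size] = value;
  pthread_mutex_unlock(&t->lock);
}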

Re: Questions about scaling on multiple CPUs

PostPosted: 24 Aug 2007, 06:29
by Scott Gasch
I have had some luck at improving parallel search efficiency by limiting the amount of memory I/O near the leaf nodes. My hypothesis is that with four threads searching I am saturating the memory bus with stuff like:

1. requests for dynamic move ordering data
2. requests for hash table data
3. requests for eval hash data
4. requests for pawn hash data

When I scaled back just a little bit on this stuff, the efficiency improved. I do not know whether it is because my hypothesis was correct or because it changed the cache behavior of the engine, etc...

Good luck,
Scott

Re: Questions about scaling on multiple CPUs

PostPosted: 24 Aug 2007, 15:46
by Tord Romstad
Hello Bob,

I have some hope that following your advice will really help: I tried adding 64 pad bytes to my 'Thread' struct, and my program runs faster than before. The difference in speed on my dual-core machine is tiny, but considering that my scaling on 2 CPUs was already very good, nothing more than a tiny improvement was possible. If I am lucky, the difference will turn out to be much bigger on quads.

bob wrote:The obvious thing for your case is to simply pad that struct out to 64 bytes, so that your array of them is now 64 bytes per element.

64 bytes per struct wouldn't be enough in general, would it? If the struct is only 64 bytes, the highest used byte in one struct will be less than 64 bytes away from the lowest used byte in the next struct, and there is still a risk that two threads will write to memory addresses with a distance below 64 bytes.

As far as I can see, I need to add at least 64 pad bytes in addition to the data already present in the struct.

just be _very_ careful of any declarations of the type:

[decltype] name[MAX_THREADS];

as that implies that each thread will be accessing a different element of that thing.

Fortunately, I don't have a lot of such arrays, apart from the previously mentioned Threads[] array. The only other such arrays in my current code are these:
Code: Select all
PawnInfoTable* PawnTable[THREAD_MAX];
MaterialInfoTable* MaterialTable[THREAD_MAX];

But in this case, I don't write to the pointers themselves during the search, but only to the (big) tables they point to, so I suppose there shouldn't be a problem.


Each element needs to be far enough away from the adjacent elements that there is no chance they end up on the same cache line.

Note that the compiler is clueless, so it is probably safest to be more conservative and pad on the front and back, to make sure that a single element doesn't span two cache lines. If it does, you have a 3-way contention: between A and B for the first line (which has part of A's data and part of B's), and between B and C for the second line. That would be ugly. I have used the "address diddle" trick: malloc a large block of memory, round the address up to the next multiple of 64, and then use blocks of 64 bytes, knowing that they now sit on cache line boundaries...

Obviously locks are bad, and it is far better to have more locks rather than fewer, so that a lock protects one specific thing rather than a group of things, to avoid too much contention. Again, the locks can fly around the L1/L2 caches, killing performance. I usually embed each lock in the data structure it protects, which makes sure that lock is only accessed when that data structure is being used, to reduce contention.

I am fairly sure locking is not a problem for me, because I can't measure any difference in speed between spinlocks and pthread mutexes. No locking or unlocking takes place anywhere near the leaves.
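
Switching between the two is easy to make toggleable with something like this (just a sketch, not my actual code; the macro names are made up, and pthread spinlocks are POSIX and not available on every platform, so take that branch as illustrative):
Code: Select all
#include <pthread.h>

// Hide the lock type behind a few macros so that the spinlock/mutex
// comparison becomes a one-line compile-time switch.
#if defined(USE_SPINLOCKS)
  typedef pthread_spinlock_t Lock;
  #define lock_init(l)    pthread_spin_init((l), PTHREAD_PROCESS_PRIVATE)
  #define lock_grab(l)    pthread_spin_lock(l)
  #define lock_release(l) pthread_spin_unlock(l)
#else
  typedef pthread_mutex_t Lock;
  #define lock_init(l)    pthread_mutex_init((l), NULL)
  #define lock_grab(l)    pthread_mutex_lock(l)
  #define lock_release(l) pthread_mutex_unlock(l)
#endif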

Tord

Re: Questions about scaling on multiple CPUs

PostPosted: 24 Aug 2007, 15:48
by Tord Romstad
Scott Gasch wrote:I have had some luck at improving parallel search efficiency by limiting the amount of memory I/O near the leaf nodes. My hypothesis is that with four threads searching I am saturating the memory bus with stuff like:

1. requests for dynamic move ordering data
2. requests for hash table data
3. requests for eval hash data
4. requests for pawn hash data

When I scaled back just a little bit on this stuff, the efficiency improved. I do not know whether it is because my hypothesis was correct or because it changed the cache behavior of the engine, etc...

I've tried all of these in the past, but none of them improved the performance on a dual-core CPU. Perhaps the situation would be different on a quad -- I will have to test some day.

Thanks for the advice!

Tord

Re: Questions about scaling on multiple CPUs

PostPosted: 24 Aug 2007, 17:23
by bob
Scott Gasch wrote:I have had some luck at improving parallel search efficiency by limiting the amount of memory I/O near the leaf nodes. My hypothesis is that with four threads searching I am saturating the memory bus with stuff like:

1. requests for dynamic move ordering data
2. requests for hash table data
3. requests for eval hash data
4. requests for pawn hash data

When I scaled back just a little bit on this stuff, the efficiency improved. I do not know whether it is because my hypothesis was correct or because it changed the cache behavior of the engine, etc...

Good luck,
Scott


I access everything normally in mine with no scaling problems. A hash/pawn-hash access is not going to be the deal-breaker as those tables are usually very large, making it unlikely that you will start a lot of cache-to-cache traffic invalidating/updating lines.

I don't hash in the q-search, but I do use the pawn hash there since I evaluate everywhere in the q-search.

The main thing is to carefully study any shared data that is actually modified by a thread, to uncover cache inefficiencies.

Re: Questions about scaling on multiple CPUs

PostPosted: 24 Aug 2007, 17:27
by bob
You are on the right track. Your only goal is to make sure that your < 64 byte struct fits into one cache line and then pad things so that each instance of the struct is guaranteed to be in a different cache line.

For < 64 bytes, I would malloc enough memory for all those structs I need, by rounding the size of one up to 64 and then adding 63 to the total size for the malloc() request. Then take the pointer returned, add 63 to it, and AND the rightmost 6 bits to zero. Now the first struct starts on an address divisible by 64, and each instance is exactly 64 bytes since you padded it, so things lay out perfectly with no overlap and every entry sits in a separate cache line. Note that some Intel chips have 128-byte cache lines, like the PIV, which changes this just a bit.
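
In code, the idea is roughly this (a sketch only; the function name is made up, and it assumes 64-byte lines):
Code: Select all
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   // 128 on some chips, e.g. the PIV

// Over-allocate by 63 bytes, round the returned address up to the next
// multiple of 64, and treat the result as an array of elements whose size
// has been rounded up to a multiple of 64, each starting on a line boundary.
void *alloc_cache_aligned(size_t count, size_t element_size, void **raw) {
  size_t padded = (element_size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
  *raw = malloc(count * padded + CACHE_LINE - 1);   // keep *raw for free()
  if (*raw == NULL)
    return NULL;
  uintptr_t aligned = ((uintptr_t)*raw + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
  return (void *)aligned;
}

The caller keeps the raw pointer around for free() and indexes the aligned pointer in steps of the rounded-up element size.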