I am quite happy with the performance of the parallel search in Glaurung 1.2.1 on 4 CPUs or fewer. I don't have any precise numbers for the exact speedup, but the rating improvements with 2 and 4 CPUs compared to 1 CPU on the CCRL are quite nice: 50 Elo points with 2 CPUs, and 90 Elo points with 4 CPUs. This is a bigger gain than Hiarcs 11.1, Rybka 2.3.2a, Shredder 10 and Junior 10 get from the extra CPUs, so I can't complain.
Glaurung 2, however, seems to scale worse. On 2 CPUs, there are no problems, but on 4 CPUs, the speed is almost the same as on 2 CPUs, according to my testers (I don't have a quad-core machine, so I can't test it myself). The speedup in nodes per second (N/s) when going from 2 to 4 CPUs is only about 10%. This is very strange, because the parallel search in Glaurung 1 and Glaurung 2 is almost exactly the same. It doesn't make sense to me that one performs much better than the other.
Does anyone have any idea what the problem could be? I am quite sure it is not excessive locking/unlocking. General advice about low-level optimization (and profiling) of multi-threaded programs is also welcome. My high-level algorithms seem to be efficient and reasonably bug-free, at least compared to other programs of similar strength.
A mostly unrelated question about scaling: A few days ago, I upgraded from a Core Duo 2 GHz to a Core 2 Duo 2.8 GHz. On the Core Duo, the N/s speedup with two threads was about 1.73, but on the Core 2 Duo it is about 1.92. Is this normal?
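To put those figures in perspective, here is a rough sketch of the arithmetic. The function names `nps_speedup` and `parallel_efficiency` are made up for this post and are not taken from Glaurung; they just divide the measured N/s by the baseline N/s, and the resulting speedup by the number of threads.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from Glaurung): speedup of a multi-threaded
// run over a baseline run, both measured in nodes per second.
double nps_speedup(double baseline_nps, double parallel_nps) {
    assert(baseline_nps > 0.0);
    return parallel_nps / baseline_nps;
}

// Hypothetical helper: parallel efficiency, i.e. the fraction of ideal
// linear scaling that was actually achieved (1.0 means perfect scaling).
double parallel_efficiency(double speedup, int threads) {
    assert(threads > 0);
    return speedup / threads;
}
```

With the numbers above, `parallel_efficiency(1.73, 2)` is about 0.865 for the Core Duo and `parallel_efficiency(1.92, 2)` is about 0.96 for the Core 2 Duo, so the second thread does noticeably more useful work on the newer chip.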
Tord