Winboard Forum

by **Tord Romstad** » 09 Jun 2006, 11:57

Hi all,

To my frustration, people have reported all sorts of problems compiling and running my recent development versions of Glaurung with support for more than 2 threads under Windows, although it works fine in Linux and Mac OS X. I have now found that the reason is probably a really silly bug in the Windows version of my thread initialisation code. I have tried to correct it, but as I don't run Windows myself I cannot test the code. Can someone with at least superficial knowledge about POSIX and Windows threads confirm whether the following POSIX and Windows code are correct "translations" of each other?

POSIX:

Code: Select all: void idle_loop(int thread_id, split_point_t *wait_sp); static void *init_thread(void *thread_id) { idle_loop(*(int *)thread_id, NULL); return NULL; } void init_threads(int n) { // Lots of code snipped here for(i = 1; i < n; i++) { pthread_create(pthread, NULL, init_thread, (void *)(&i)); // Wait until the thread has finished launching: while(!Threads[i].running); } // Lots of code snipped here }

Windows:

Code: Select all: void idle_loop(int thread_id, split_point_t *wait_sp); static DWORD WINAPI win_init_thread(LPVOID thread_id) { idle_loop(*(int *)thread_id, NULL); return 0; } void init_threads(int n) { // Lots of code snipped here for(i = 1; i < n; i++) { DWORD iID[1]; CreateThread(NULL, 0, win_init_thread, (LPVOID)(&i), 0, iID); // Wait until the thread has finished launching: while(!Threads[i].running); } // Lots of code snipped here }

It would also be nice if someone with the Microsoft compiler could try to compile my current development version and let me know whether it works.

Tord

by **Jean-Francois Romang** » 09 Jun 2006, 13:06

Maybe they could use this Windows POSIX thread library to avoid the translation ?

http://sourceware.org/pthreads-win32/

by **Tord Romstad** » 09 Jun 2006, 13:19

Jean-Francois Romang wrote:Maybe they could use this Windows POSIX thread library to avoid the translation ?

http://sourceware.org/pthreads-win32/

I've been told that there is a big performance penalty when using the POSIX thread compatibility library for Windows. I prefer using native Windows threads when it is possible.

Tord

by **bob** » 09 Jun 2006, 13:55

most likely candidate for such bugs is "volatile". Are you certain you have the volatile modifier on the declaration for every variable that one thread uses but another thread can change spontaneously?

In particular, watch out for the following:

volatile int *p (p can spontaneously change, so it can not be loaded and kept in a register. although the value it points to can be loaded once and used many times).

int * volatile p (p can spontaneously change but the value it points to can not.)

volatile int * volatile p (p itself can spontaneously change as can the value it points to, so no optimization can be done to eliminate the memory references for either value).

It is probably not the windows thread stuff that's killing you, it is the optimizations done by the compiler. If you can get someone to compile it with optimization disabled and it works, you know this is the problem.

by **Tord Romstad** » 09 Jun 2006, 14:04

bob wrote:most likely candidate for such bugs is "volatile". Are you certain you have the volatile modifier on the declaration for every variable that one thread uses but another thread can change spontaneously?

No, it was a much more elementary and stupid bug than that. In Windows, my idle_loop() function would be called with a thread_id of 1 for all helper threads, which is incorrect, and will probably even cause the program to hang during startup when compiled with more than one thread.

The bug is already found, and I have attempted to fix it. I was not asking for help to find a bug, but rather for help in verifying that my corrected code works. As I don't use Windows, I need help from some Windows user with the Microsoft compiler to test it for me.

Tord

by **Bryan Hofmann** » 09 Jun 2006, 14:19

Tord

The source does compile clean under MS 2005 however it hangs after the first move is made. No errors appear, just the second thread dies and the first thread holds the CPU at 100% doing nothing. This is with a compile with no optimizations. I wish I could help more but I know squat about SMP code.

Bryan

by **Tord Romstad** » 09 Jun 2006, 14:41

Hi Bryan!

Bryan Hofmann wrote:The source does compile clean under MS 2005 however it hangs after the first move is made. No errors appear, just the second thread dies and the first thread holds the CPU at 100% doing nothing. This is with a compile with no optimizations. I wish I could help more but I know squat about SMP code.

Actually, it helps a lot! If you or somebody else has some more time to look at this, I have some additional questions:

Try inserting the following lines at the beginning of the dle_loop() function (in smp.cpp):
Code: Select all
if(wait_sp == NULL) printf("idle loop for thread %d started\n", thread_id);

What does the program print during startup?
If you run Glaurung in a debugger and interrupt it while it is hanging, at what location in the code does it stop?
What happens if you set the MaxNumOfThreads constant (in glaurung.h) to 2 rather than 4 and recompile? Does the program still hang after the first move is made?

Tord

by **Bryan Hofmann** » 09 Jun 2006, 15:21

Tord

The program is starting 4 threads (the system I am using is a 2 CPU no-hyper threading). When I change the MaxNumOfThreads = 2 everything works. It would seem that it is not correctly detecting the number of CPUs in the system. Also changing the number of threads in the options does not reduce the number of threads.

Bryan

by **Tord Romstad** » 09 Jun 2006, 15:36

Bryan Hofmann wrote:The program is starting 4 threads (the system I am using is a 2 CPU no-hyper threading). When I change the MaxNumOfThreads = 2 everything works.

Nice to know. Thanks!

It would seem that it is not correctly detecting the number of CPUs in the system.

It doesn't even try to detect the number of CPUs. If compiled with MaxNumOfThreads = 4, it will always start 4 threads. However, as long as the "Number of threads" UCI parameter remains at 2 (which is the default setting), two of the threads (the ones with thread_id 2 and 3) will be constantly sleeping. At least this is how it is supposed to work, and it works correctly in Linux and OS X.

Tord

by **bob** » 09 Jun 2006, 15:53

Some performance notes:

(1) spinlocks only. Do _not_ rely on mutexes or any O/S call that potentially blocks. It will kill performance, particularly as the number of threads go up and the number of block/unblock events escalates.

(2) while blocking idle threads is ok on a busy system, I _much_ prefer a simple spinloop as it takes a lot of time to unblock a process, which limits how deeply in the tree you can split without introducing more overhead to unblock than useful work done in the search.

Feel free to "lift" the spinlock code from Crafty. I lifted it from the Linux kernel source. And there is also a PPC version if you want to run on that platform, making portability a bit easier..

BTW the posix stuff for windows works exactly as well as the windows native thread stuff, if you do not use mutex locks (posix). The locks are where the overhead is, and using spinlocks solves that (Crafty uses my spinlock code in windows or unix).

And I'll bet you get burned by volatile at least once more in the future. I've been burned many times. X86 hides many volatile problems since there are hardly any registers to keep temp values in. My first run on an alpha hung badly, and debugging tracked it to a volatile bug... I had

volatile TREE *work_to_do[i]

When a processor was idle, it would do something like this:

while (!work_to_do[thread_id]);

and just spin. A thread wanting this thread to help would eventually do

work_to_do[idle_thread] = pointer_to_work_to_do;

and on X86 it worked fine. On alpha it hung, because I really meant:

TREE * volatile work_to_do[i]

which says the work_to_do[] values can change spontaneously and loads can not be optimized away. I did enough other work in the loop above (I didn't show all the tests being made) that X86 couldn't keep that work_to_do[] value in a reg and he to re-load anyway. On alphas (and sparcs and nowadays even X86-64 processors) this is not the case and it broke.

by **Tord Romstad** » 09 Jun 2006, 16:42

bob wrote:Some performance notes:

(1) spinlocks only. Do _not_ rely on mutexes or any O/S call that potentially blocks. It will kill performance, particularly as the number of threads go up and the number of block/unblock events escalates.

(2) while blocking idle threads is ok on a busy system, I _much_ prefer a simple spinloop as it takes a lot of time to unblock a process, which limits how deeply in the tree you can split without introducing more overhead to unblock than useful work done in the search.

Hello Bob,

Thanks for this very valuable advice. I just tried on my system (an Intel Core Duo 2 GHz running Mac OS X), and spinlocks don't seem to be faster than POSIX mutex locks. I searched the initial position for 100 seconds with two versions of my program, one using spinlocks and one using mutex locks. The speed was almost exactly the same: 568 kN/s for the mutex version and 563 kN/s for the spinlock version.

Perhaps the difference between spinlocks and mutex locks doesn't matter with only two CPUs, or perhaps the rest of my code is so awfully slow that the expense of locking becomes completely irrelevant.

Feel free to "lift" the spinlock code from Crafty. I lifted it from the Linux kernel source. And there is also a PPC version if you want to run on that platform, making portability a bit easier..

I just used the OSSpinLockLock() and OSSpinLockUnlock() functions. Seems to work well enough, except that performance isn't improved.

BTW the posix stuff for windows works exactly as well as the windows native thread stuff, if you do not use mutex locks (posix). The locks are where the overhead is, and using spinlocks solves that (Crafty uses my spinlock code in windows or unix).

That's useful to know. That spinlocks enable me to use POSIX threads in Windows without any performance penalty is almost a good enough reason to use them instead of mutex locks. I detest cluttering my code with #define directives, and ditching the Windows thread code would make things a lot cleaner. The question is, of course, if the spinlocks will introduce a similar number of new #defines in order to make everything work on all platforms.

And I'll bet you get burned by volatile at least once more in the future.

Sure. I've been burned many times in the past, and it's bound to happen again many times. I don't think the current problems in getting more than 2 threads working on Windows is related to this, though.

Tord

by **bob** » 09 Jun 2006, 16:53

That is somewhat surprising.

Here's why.

Typically the "critical sections" are just a few X86 instructions long, which means you spend X cycles acquiring the lock, and Y cycles executing the critical section, and then Z cycles clearing the lock.

With spinlocks Z=0, you just store a zero. X is just a couple of instruction cycles for spinlocks. Which means that for almost any case, X+Z is <= Y, and all are very short. If you use mutex locks, this changes. X and Z become sizable, because you are stomping through the operating system to block/unblock. There are versions of mutex locks that spin for a brief period of time and then block, and that pretty much solves this.

The main issue is how frequently do you acquire locks? I do it to share the move list at a split point, which is not huge. When I originally locked the hash table, it was a killer for performance since that is done so often. Might be if you limit your split depth enough, mutexes will work fine.

by **Tord Romstad** » 09 Jun 2006, 20:34

My most important uses of locks are:

Preventing two threads from picking the same move from the move list. When a thread wants to pick a new move to search, it first grabs the lock, then scans for the move in the move list with the highest move ordering score, picks this move, marks it as searched, and releases the lock again.
When a thread is finished searching a move at a split point, it grabs a lock before comparing the move score to alpha and beta and (if necessary) updating alpha, beta, the PV, and telling the other threads to stop (in case of a beta cutoff). The lock is released when all these tasks are done.

In both cases, the lock is local to a single split point. When there are several active split points, they all have their own lock object.

While writing this post, I realised that using two different locks at each split points could be worth trying. Perhaps I could improve performance a little by using one pick_move_lock and one update_pv_lock, instead of a single lock for both purposes.

Tord

by **mjlef** » 16 Jun 2006, 07:28

I have a question about threads too. I recently converted my chess program NOW to UCI, and in the process, used a thread to keep the program from hanging on input from the UCU interface. But for some reason, it seems to fail on fast Pentium 4 machines (Windows XP, service pack 2). It works perfectly on AMD chips. I do not have a Pentium 4 so cannot test on that, but I want to fix it anyway. Any suggestions or things to look for on P4s?

Mark

by **Tord Romstad** » 16 Jun 2006, 10:39

mjlef wrote:I have a question about threads too. I recently converted my chess program NOW to UCI, and in the process, used a thread to keep the program from hanging on input from the UCU interface. But for some reason, it seems to fail on fast Pentium 4 machines (Windows XP, service pack 2). It works perfectly on AMD chips. I do not have a Pentium 4 so cannot test on that, but I want to fix it anyway. Any suggestions or things to look for on P4s?

Hello Mark,

This probably isn't the kind of reply you were hoping for, but have you considered to use polling instead of using a separate input thread? Just poll for input every 10000 nodes or so, and process the input if there is any. The only commands you must be prepared to receive during search are "stop", "ponderhit" and "quit".

Perhaps polling is a slightly less elegant solution than a separate thread for input, but it is very easy to code, saves you from a lot of trouble.

Tord

by **mjlef** » 16 Jun 2006, 11:13

Tord,

I started off trying to poll. But Pascal seems to be missing any kind of real feof() function. Basically, it lies and says there is always input available, at least the way I have tried doing it. I suppose I could write a small C function that behaves better and link to that, then use polling. NOW was written long ago when threads were nice clothes you wore!

Mark

by **Jim Ablett** » 20 Jun 2006, 10:33

Latest Glaurung development version works fine compiled with Cygwin.

For all you testers who are having trouble compiling a working Win32 binary.

http://www.verzend.be/v/1098236/glaurun ... 2.zip.html

Jim.

by **Rémi Coulom** » 20 Jun 2006, 22:57

Hi Bob,

I took a look at your code as a source of inspiration when I parallelized my Go program. I think we are talking about this:

Code: Select all: static void __inline__ LockX86(volatile int *lock) { int dummy; asm __volatile__("1: movl $1, %0" "\n\t" " xchgl (%1), %0" "\n\t" " testl %0, %0" "\n\t" " jz 3f" "\n\t" "2: pause" "\n\t" " movl (%1), %0" "\n\t" " testl %0, %0" "\n\t" " jnz 2b" "\n\t" " jmp 1b" "\n\t" "3:" "\n\t" :"=&q"(dummy) :"q" (lock) :"cc"); }

In one of my tests for this piece of code, I found that it is buggy. It is necessary to add

Code: Select all: ,"memory"

after

Code: Select all: :"cc"

to tell the compiler that the code may alter the memory. Recent versions of gcc 4 are clever at optimizing memory accesses, and may cause problems if you don't tell that memory may change. There may be a way to tell that only *lock may change, but I did not find it in the documentation.

I think I did not keep the example use of your spinlock that caused the bug. But if you are really curious to see, tell me and I'll try to find it again.

I am surprised that the linux kernel could have such a bug. It might be worth telling the kernel people about this problem if their code is actually the same as in Crafty. If you lookup spinlock code with Google, you'll find that plenty of them have "cc", "memory" in them.

R?mi

by **bob** » 21 Jun 2006, 04:50

came right out of the kernel source, but it was circa 1996, which means it might have changed in the past 10 years as gcc was changed. I'll take a peek...

I will try to check my notes, but I think the __volatile__ part of the inline code said something like "do not move this code, do not optimize _around_ it, etc, back when this was written by the kernel guys... I just checked and the gcc docs say this still works. use the __volatile__ modifier if memory is modified and the memory value in question is not in the clobber list... So I don't see why this should fail, and unless there was a bug in gcc (which has only happened a few zillion times in the past 10 years) __volatile__ and "memory" ought to be doing the same thing...

by **Rémi Coulom** » 21 Jun 2006, 09:04

bob wrote:came right out of the kernel source, but it was circa 1996, which means it might have changed in the past 10 years as gcc was changed. I'll take a peek...

I will try to check my notes, but I think the __volatile__ part of the inline code said something like "do not move this code, do not optimize _around_ it, etc, back when this was written by the kernel guys... I just checked and the gcc docs say this still works. use the __volatile__ modifier if memory is modified and the memory value in question is not in the clobber list... So I don't see why this should fail, and unless there was a bug in gcc (which has only happened a few zillion times in the past 10 years) __volatile__ and "memory" ought to be doing the same thing...

Sorry, you are right, it was my fault. I had the unlock wrong. It looked too incredible that there could be a bug here.

By the way, there is something strange in your unlock function:

Code: Select all: static void __inline__ UnlockX86(volatile int *lock) { int dummy; asm __volatile__("movl $0, (%1)":"=&q"(dummy) :"q" (lock)); }

The dummy variable looks completely useless. It seems that

Code: Select all: static void __inline__ UnlockX86(volatile int *lock) { asm __volatile__("movl $0, (%0)": :"q" (lock)); }

works the same (could it save a register ?).

R?mi

Winboard Forum

Windows thread question

Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Re: Windows thread question

Who is online