Winboard Forum

Posted: **10 Dec 2004, 06:11**

Hi
I used the well known techinique of comparison of node counts between
debug and release versions to check . But that doesn't help for
a parallel version of danchess because the node count varies on different executions of the same exe. Are there any good debuging methods for a parallel engine?
Debug version crashes on some positions of wac but the release
works fine in analyzing wac,ecm and other testsuites. How can i reproduce these crashes in the releae version?
thanks in advance
daniel

Posted: **10 Dec 2004, 07:06**

Compile with several different compilers on the highest warning levels.

Use PC-Lint on the code base.

Try BoundsChecker or ElectricFence or some similar tool set for run time error trapping.

When the debug version crashes, what do you see in the debugger?

For sure, something is very wrong.

Posted: **10 Dec 2004, 09:11**

Hi Dann

The debug version error seems because of some illegal move
being made some where. But i have disabled hashtable move,killer moves and other thigns which could produce illegal moves and still they appear. The release version doesn't not make illegal moves with all the above things on. I am sure my move gernerator doesn't generate illegal moves.

Tried hard for 2 days to solve the problem.
I have inserted assertions everywhere but i am now becoming more convinced it is not something like a logical error . For example if a crash
occurs at wac 23, then when i analyzed that position again it will work correctly. Next time i run it will crash at different position.
May be the release will crash at a different hardware

I will try to use the things you mentioned

thanks
daniel

Posted: **10 Dec 2004, 17:45**

First, allow me to say:

I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO I TOLD YOU SO

More seriously, now.

First, it sounds like you might have a memory corruption / unitialized error, so I recommend do as Dann said you run under Valgrind or some windows equivalent.

If that doesn't work, you have some sort of race condition with one thread overwriting another thread's memory (the reason Vincent likes multiprocess parallelism vs multithread parallelism: it does remove a lot of idiot bugs). For that the only thing I can suggest is making a big logfile that shows EVERYTHING the parallel code does. "I split here (ply=3 Nb4 Rc8 exd5)." CPU got to the split node now. CPU 1 searching X. CPU2 searching X.

Anyway, I'm sure you'll get things working eventually. Just don't give up!

anthony

Posted: **10 Dec 2004, 19:53**

Daniel Shawul wrote:Hi Dann

The debug version error seems because of some illegal move
being made some where. But i have disabled hashtable move,killer moves and other thigns which could produce illegal moves and still they appear. The release version doesn't not make illegal moves with all the above things on. I am sure my move gernerator doesn't generate illegal moves.

The illegal moves can come from many possibilities:
1. Memory over-write
2. Uninitialized variables
3. Other undefined behavior (possibilities are numerous)
4. Problem in makemove
5. Problem in unmakemove
6. Problem in parallel operation (two threads update the same object producing unexpected results)

Many other things.

Does the problem show up only when using multiple threads? If yes, that should be a big clue.

Posted: **11 Dec 2004, 06:11**

The error only occurs when using multiple threads.
One "important" clue i found is that if i wait for the thread to
finish searching before going to assign another helper thread
the problems disapear. I know this is not really searching in parallel
but,this makes me sure that the problem is due to two threads searching
in parallel. I don't think this is due to synchronization problem. Only the split information needs to be locked/unlocked.
I am new to the idea "one thread overwriting another threads memory" which may cause the problem. Is there any way to prevent this with out
shifting to processes.
I will update the logfiles with the split info and see what i get.
daniel

P.S: Anthony note this is not due to design problem. Mine is simple
as you know, problem is my limited programming skill in this area.
I think you are going to win anyway

Posted: **11 Dec 2004, 21:18**

How do you divide moves among the processors? That would seem like a logical place to look.

In Zappa there is one global lock on the split information, but there is another lock when each CPU grabs a move from the move list. Tell me you put a lock on this

anthony

Posted: **11 Dec 2004, 23:20**

Hi Antony , Daniel,
Zappa uses global lock ?
Interesting - I assumed you locked on demand the required process or subtree (as your impl might be).
I never tried that direction (global lock) when I tried MP.
I had a bit too fine grained locking even for my comfort ...... but what the hell it worked well ! (forget the maintainability part though

)
Maybe we should compare notes someday soon

Daniel : I would suggest the follow areas to hunt for bugs related to bad moves from my expierence (I was MP , but still might help you too) :

If the boundschecker and valgrind dont work (well after a point of time , they did not work for me too - all data access will be to valid memory - though logically invalid locations !).

1) Split point : the move migt be removed or already made or n number of other reasons.
2) You are trying to split at a point which has already failed high/failed low in the parent thread/proc
3) If using any global data : check and double check.
4) Make sure your lock and unlock are atomic - source of some bugs for me

5) Any and every global data access should be atomic - also make sure that you are not doing anything funny with hashtable - like multiple probes to return the various data : do a local into stack copy before comparision of hash and rest.
Would also help to make sure that move returned from hashtable is valid (assuming you have a is_move_valid(...) function).

Another thing , dont disregard any warning (in compilation phase and nagging doubts in your mind) however stupid they seem - they could the cause. I ran with max warnings with all pragma's disabled to maximise all inputs from compiler. Also try both VC and g++ (-Wall option) : both sometimes report different set of warnings.

If race condition , then logging like Antony says might not help , but else it should surely help you debug. (Note : add fflush(logfile) after every write - to identify exact flow of instructions in the logfile).

Also , dumb question : if MT , you aer using -D_REENTRANT right ?

Hope this helps,
Regards
Mridul

Posted: **12 Dec 2004, 06:00**

1st:
I only assign a single move to a processor (not a set of moves as
most of you does). This doesn't need lock/unlock.
I know this is very slow but good for a start i guess.

for each of the moves
{
if(first move ||
no idle processor)
search normally()
else
search_with_thread()
}

Wait for helper threads to finish();

2nd :
The only change i made in compiler options when going parallel is
-\MLd to \MTd for the debug
-\ML to \MT for the realease
Should i do something more?
3rd:
warning level is set to level 3. But if i change it to level 4
hundreds of warinings appear. Even the nalimov code has hundreds of warnings at that level.

Thanks both of you for sharing your experiences!

daniel

Posted: **12 Dec 2004, 17:48**

Ignore the -D_REENTRANT thing, he is talking about unix.

My question is this. You are at a node. You have N moves to search (say N=35). How do you divide them up?

In Zappa there is a list that some processor generates in shared memory somewhere, and each CPU that is working at that node locks the node's lock, removes a move from the list, unlocks, and goes about its merry way.

Are you saying that you spawn a separate thread for each move, i.e.:

forall moves {

CreateThread; Give Move;

}

Wait_for_threads_to_finish;

??

anthony

Posted: **13 Dec 2004, 05:33**

Yes that's exactly what i do,except that
i don't create/destroy threads each time.
All my helper threads are in idle loop until a new search
is assigned to them.
This is slow because i have to copy data each time i search a single
move which is not the case if i did it for a group of moves.
A processor X after searching a single move of node N could be searching
a different nodes move next time. This is more flexible i guess. And if i device a way to detect it is searching the same node's (N's) move next time it is searching, copying is not needed.

daniel

Posted: **13 Dec 2004, 05:52**

Note all moves are searched in parallel if only there
are enough processors available. If not
the parent thread continues searching them. The first
pseudo code i posted is correct.

for each of the moves
{
if(first move ||
no idle processor)
search normally() //the parent
else
search_with_thread()
}
Wait for helper threads to finish();

[/quote]

Posted: **13 Dec 2004, 12:23**

Hi all,

I tried doing a 9 ply search from start position. The DEBUG version searched fewer nodes than the RELEASE version.

I am currently trying to switch off opt flags but so far I cant seem to pinpoint the culprit.

Any ideas what this means?

using vc 2005 express beta

Thank you.

Posted: **13 Dec 2004, 17:23**

Hi Daniel , David,

Under normal circumstances where you have different search procs picking up moves in parallel and searching , the possibility that you will get same node count between release and debug , and even between diff runs of same version is going to be quiet tough.

That said , the algorithm that you mention (Daniel) should not suffer from this - and you _should_ get same node count (Assuming I am not missing something critical here !). This can be a very vital debugging tool ! At a glance you see if there is something amiss.

I like Antony mentioned , I find debugging a multiprocessor version infinitely more easier than a multithreaded version.
Other than other benifits , specifically in our case , the interactions are limited to the shared mem variables.
So you know exactly were to concentrate looking for bugs in case there are any.
You dont need to try to break your head about all the other possible globals that are interacting (and much as I like - it is impossible to efficiently eliminate all globals).
That being said (maybe too late for you guys to change impl , or it is a design decision with other considerations I am not aware of , personal preferance , etc !) , let us try to analyze possible solutions to our problems.

@Daniel

for each of the moves
{
if(first move ||
no idle processor)
search normally()
else
search_with_thread()
}

Wait for helper threads to finish();

1) Just wondering , does search_with_thread() return only after search has already started in the new thread ?
Is this an active push to idle threads or do idle threads pull possible split points from some queue/list ?

2) How do you handle a fail high from the child thread ?

3) Hope you use lockless hashing/some locked hashing strategy !

4) Your move generation stack is consistent across processors right ?
(You need to have a per thread move generation stack or something like that)

5) Hope you are validating all cached/hashed moves before making them.

6) Since you have ruled possibility of race condition (there are n number of possible race conditions even in your algo , but we will come to that later) , then what Antony suggested is actually ideal for you.
Dump everything into a file , make a GUI based viewer of logfile and drill down - it is as simple as that.
The GUI might take some time writing - but trust me , you will always be happy with the 1 week you invested in it for the hundreds of bugfixing weeks it saved you ! (and please using some tree like UI , so that you can selectively go into or step out of subtree's , very intutive and easy to use and understand when you analyse logfile along with source code).

7) It helps in debugging if you support multiple platforms (assuming you dont use platform specific api , assembly , etc). Even when things run smooth in one platform (read platform here as triplet of HW/OS/compiler combination) , you will see probs in other or errors/warnings on other will make more sense , etc. So you might want to look into that direction also ...

8) -D_REENTRENT was for *nix : I think there is similar flags for windows (multithreaded dll or something in vc) : dont remember.

Hope this helps !
I got some really quality feedback and help when I wrote my MP version and I was ever grateful for it and it saved tonnes of my time and patience (which was running pretty towards middle) - I know how frustrating it can get , but dont ever give up !
The feeling is just too amazing when you see your MP version do 3.8x effective speedup on a quad

@David

Are you using a parallel engine ? If yes , then the counts need not be same on general principles ...
If no , then you have a bug(s)

Like I mentioned to Daniel , a search dump and analysis would be simple way to start : you have got some suggestion I see at CC which were general heurestics I can suggest to track this down

More info on what you observe would be great too !

Hope this helps,
Regards
Mridul

Posted: **13 Dec 2004, 17:47**

I'm still not understanding how you divide moves.

It appears you give each thread only 1 move? Do you then search with >N threads where N is the number of CPUs on the system (say 20 threads on a dual?) I never considered doing that, but it might work (although I'm sure you lose quite a bit to overhead).

or do you do:

spawn thread with one move;
search rest of the moves with this thread;

Which seems like it won't load balance very well.

Basically everyone does 1 thread/process per CPU, with spinlocks doing synchronization and a movelist in shared memory somewhere. You are doing something else, so we are all a little confused

Also, Zappa & Crafty both do threads, so it is possible. But it is a little trickier to debug some of the time.

anthony

Posted: **13 Dec 2004, 18:20**

Hi,

I think what Daniel does is something like this - correct me if I am wrong Daniel !.

He is similutating parallel search for alpha beta.
So what he does is :

If num_moves > 0 && you have some idle thread in the list/queue.
He will assign this move to be searched to that thread.
He repeats this at a node until there are no more idle threads
Then the parent continue to search the remaining num_moves - num_threads moves.
After parent completes , it waits for the children to complete , the children move to idle queue and parents move one ply up.
So parent is always the initiator of the alphabeta search and children will always be slaves.
It is more like alphabeta being brute forced into a parallel form from its serial nature.
Not terribly efficient , but you will get better speedup (hopefully) than most of the other ideas i have read about.
Maybe once he is done with this , he will try to move this to something more complex ???

This is my understanding of his algo and my earlier post and questions were based on this understanding ....

Regards
Mridul

Posted: **14 Dec 2004, 07:25**

Hi All

What mridul said is exactly what i do,i coundn't say it better :!:

The only difference with what most of you do is
i give only a single move to a processor.This has "flexibility" because
you don't have to wait until all of the moves that are assigned to the processor are finished. It decreases the idle time of processors i guess.
Note processor X could search two or >2 moves of a given node .
These moves could be searched one after the other (i.e processor X searches second move of the node right after it finishes the first move).
In this case there is an overhead because the processor has all the information for the second move but it is duplicating it. This i can avoid by retaining the node count and avoiding copying if the node count matches.But i don't think the overhead is significant so i will not do it.
Without this overhead , it is similar to what most of you does. Correct me if i am wrong.

For Anthony:

Basically everyone does 1 thread/process per CPU, with spinlocks doing synchronization and a movelist in shared memory somewhere. You are doing something else, so we are all a little confused

No it is all the same except moves are assigned one by one to a processor. Not a block of moves.This avoids the need for a shared memory movelist

For a dual only one helper thread is created (one is the main thread).
I can generate 20,but i won't, because it is slows down the search.

I think i have spotted the bug

But i still don't understand why

It has something to do with the wait_to_finish() function,nothing
related to the helper thread's search because it crashes with
this being disabled

I will report it when i completely understood what is happening.
But now let me answer some of mridul's questions.

1) Just wondering , does search_with_thread() return only after search has already started in the new thread ?
Is this an active push to idle threads or do idle threads pull possible split points from some queue/list ?

It is an active push. I wait until the search starts in the helper thread before i go on assigning others. But i don't really understand the need because once i trigger the search, the processor will start the search when it gets its own time slice. And clears a flag in the split information when it finishes.The main parent thread will wait until all the threads started and finished searching.

2) How do you handle a fail high from the child thread ?

I stop all the helper threads and return the score.At the moment i do that
only when the main thread is acting as a slave. I can do that when it occurs in the helper threads but it is more complicated for my case so i ignore it there.

3) Hope you use lockless hashing/some locked hashing strategy !

4) Your move generation stack is consistent across processors right ?
(You need to have a per thread move generation stack or something like that)

5) Hope you are validating all cached/hashed moves before making them.

I don't use lockless hashing, I validate the moves before i go on using them.
Threads have their own move stack.
To avoid the above problems i took the move stack inside the search,
and generate all the moves at the beginning of search.

Thanks very much for all your suggestions!
daniel

Posted: **14 Dec 2004, 15:13**

OK, let me see if I understand this. Suppose we are searching the initial position.

the Master processor searching e4
He notices he has 2 cpus in the idle queue
He assigns d4 to one, and c4 to the other
He begins searching Nf3, and the Nc3
He finishes, and notices he one again has ide CPUs
So he assigns b4 to one and f4 to the other.
etc.

anthony

Posted: **14 Dec 2004, 17:09**

After 4 days and 5 hours exactly the bug is
completely solved :!:

It was just simply deleting a 5 letter word "static".
Yes this was the head ache.
I defined a temporary SQUARE tempSQ variable in my board class
which i use to exchange squares during make/unmake.
When two threads are making a move at similarly the same time
tempSQ is corrupted. But this happens very rarely and that's why
it was so hard for me to spot it.
No legal moves were generated , everything was fine.

I don't know how boundscheker or anyother debugging tool could have
helped me here. Because in this bug all the moves were legal and >95% of the time make move is right. And bugs are not reproducible.

Any bug hunting story which beats this one? I bet not

greetings from very very happy
Daniel

Posted: **14 Dec 2004, 17:19**

Ofcourse the discussion was more than finding one bug.
Being a beginner i need to dig in to the concept of parallel
search in general.
In CCC archieves, I couldn't find parallel search discussions
which are important for beginners.
So i will continue asking questions , not necessarily bugs :wink:

Daniel

Winboard Forum

Debugging techniques

Debugging techniques

Some useful things

Re: Debugging techniques

Re: Debugging techniques

Re: Debugging techniques

Re: Debugging techniques

Re: Debugging techniques

Re: Debugging techniques

Re: Debugging techniques

Re: Debugging techniques

Re: Debugging techniques

Re: Debugging techniques

Q. about wel known technique

Re: Debugging techniques

Re: Debugging techniques

Re: Debugging techniques

Re: Debugging techniques

Re: Debugging techniques

The bug of the century is solved!

Re: Debugging techniques