Winboard Forum

by **Chan Rasjid** » 22 Sep 2005, 23:57

From what I have last seen , the important points will be (I am not doing an exhaustive analysis here) :
1) Data organisation and use - to maximise cache coherence.
2) Simple and tight loops - reduce branch mispredictions.
3) Nothing agressive - no dubious ideas that I can find offhand.
4) Essentially bugfree to a large extent (dont underestimate this fact).
5) Well tuned params (just from a cursory look) - nothing which were conflicting each other (this can be a major problem).
6) I did not see anything speculative in eval , but then I did not find it sufficiently sophisticated.
Does not need to be - the search takes it deep enough to play positionally sound chess : so lot of things I always toy around with are essentially absent.
7) this space for rent :-TAKEN - Fabien from what is said is sort of a (good) professional programmer so he knows how to write fast codes at his finger tips.

I think your 6-1/2 points almost completely explains all of fruit, there is no secret as he himself mentioned. Maybe just a little extra which others
do not yet understand (again just my guess).

I have 2 basic questions and very probably you can help me:-

1) branch misprediction - I just read a little about this and like to know
only from a programming point of view if my understanding is basically
ok.

The penalty is proportional to the clock cycles of the conditional expression :-

if (expr){
//code A
}else{

}..... cascading etc

it applies also to switch(expr) ?

If expr is heavy we have to be alert.
if expr is a simple logical test of (x) where x involves no memory fetch (say in register), then the penalty is minimal, even if block A is large.

2) Hash TT records - I read something like shifting _int64 has some penalties as our CPU register is not true 64 bit. Hashing need to be efficient as there may be potential cache thrashing or page swapping often.

[edited]
So if my hash record is 16 bytes, is it not better to use 1 _int64 hashkey + 2 int for others though we may have some unused bits here. ie data retrieval is through int shifts, not 64 bit shift.

Best regards
Rasjid

by **Uri Blass** » 23 Sep 2005, 00:43

Dann Corbit wrote:
Uri Blass wrote:{snip}
My division of the code to files is not good and I think that how to divide code to files is the first thing that I can learn from fruit(I may work slightly more about movei of today because I started to work on a function to generate only checks and did not finish it but basically I plan to start a new project when I may copy and paste part from the old project but the organization of files will be different) .

Uri

Your code is ugly, but for a person learning to program it is really quite amazing.

The division of files is not so important. The division of tasks into functions is pretty important and you do well enough there.

It might be worthwhile to start from scratch, and move your good movei ideas from the old program to the new one, one at a time.

You can look at really well written things (like Fruit) to get an outline of how to do it.

Just a thought.

But it will be really dissapointing for two or three months.

The main problem with Movei of today is that if I try to changes things in the search I often get bugs when I need a long time to find what is the problem.

Thanks to lack of debugging code like the code of fruit there were cases in the past when I changed something and it cause the program to crash and it took me a long time to discover what is the problem.

Of course I cannot make a fast progress in these conditions and the frustrating experience does not encourage me to continue to work on the code.

Uri

by **diepeveen** » 23 Sep 2005, 02:29

Uri Blass wrote:
mridul wrote:There is a difference between writing a program from scratch and a program which has already matured and optimised.

When progress becomes tougher , when you have eliminated all bugs you can find and cant find any new ones , when adding a new "feature" becomes more expensive than the benifits ... you can keep expirimenting with new ideas , but anything which gives you 1% improvement will be very carefully pursued.
This is common not just to chess but to most other s/w & h/w projects.

And dont take anything and everything stated in context of Fruit and Fabien.

I think that working a week for 1% improvement is not good idea even for Fruit WCCC

If you think that working a week for 1% improvement is the best thing that you can do and your program is not significantly better than Fruit WCCC then I think that it is probably not written well and it may be a better idea to start from scrartch.

Uri

[U-logics on]

I disagree.

For example when the chance your first move is the best move improves by 1%, then you search 1 ply deeper.

1 ply deeper for 1 week work is not bad.

[U-logics off]

by **Uri Blass** » 23 Sep 2005, 08:49

Vincent,
It is clear that the discussion was about speed improvement

If you look at the previous posts you can see that speed improvement was mentioned so improvement f 1% in order of moves is not relevant.

Uri Blass

by **mridul** » 24 Sep 2005, 10:51

Chan Rasjid wrote:
From what I have last seen , the important points will be (I am not doing an exhaustive analysis here) :
1) Data organisation and use - to maximise cache coherence.
2) Simple and tight loops - reduce branch mispredictions.
3) Nothing agressive - no dubious ideas that I can find offhand.
4) Essentially bugfree to a large extent (dont underestimate this fact).
5) Well tuned params (just from a cursory look) - nothing which were conflicting each other (this can be a major problem).
6) I did not see anything speculative in eval , but then I did not find it sufficiently sophisticated.
Does not need to be - the search takes it deep enough to play positionally sound chess : so lot of things I always toy around with are essentially absent.
7) this space for rent :-TAKEN - Fabien from what is said is sort of a (good) professional programmer so he knows how to write fast codes at his finger tips.

I think your 6-1/2 points almost completely explains all of fruit, there is no secret as he himself mentioned. Maybe just a little extra which others
do not yet understand (again just my guess).

I have 2 basic questions and very probably you can help me:-

1) branch misprediction - I just read a little about this and like to know
only from a programming point of view if my understanding is basically
ok.

The penalty is proportional to the clock cycles of the conditional expression :-

if (expr){
//code A
}else{

}..... cascading etc

it applies also to switch(expr) ?

If expr is heavy we have to be alert.
if expr is a simple logical test of (x) where x involves no memory fetch (say in register), then the penalty is minimal, even if block A is large.

I am not really an assembler expert like some here (Gerd , GCP , Vincent , etc) but will give my thoughts on this topic - they might be able to correct or add more to what I say.

if (cond){ ...} else if (cond1){ ...} else ...

This sort of structure is usually plain bad.
You have tonnes of branch mispredictions possible , lot of jmp's (if short jump's , then maybe the penality might not be that bad ? - but still pretty high penality) , screwing up instruction pipeline , etc.
If it is unavoidable - try to make sure that the first branch is the most commonly taken branch.
If the block size if small , it might become a jump within the pipeline - which is not that bad I think : I am a bit outdated on this ....

Now , a structure of the form

int var , a = SOMEA() , b = SOMEB();
if (cond){
var = a;
}
else{
var = b;
}

This could become optimised as a cmov (and variants) - so you might not "see" too much of a penalty here.
Compilers usually do this automatically (use appropriate arch flags).
Since the penality is much lower here , as much as possible try to reduce to this form.
The extra computation to calculate a , b might be usually lower than branch mispredictions.

Bottomline is - bpt's are very valuable resource in the processor , conserve them
But please dont go overboard unless you have tested your changes properly

Modern compilers can do miracles to unoptimised code - if you really want to tweak code to get the final few cycles out , you will have to roll up your sleeves and do it yourself.
Also , if possible , avoid assembly and stick to C (from my expierence) - it becomes a nightmare to keep it optimised as processors change : or you have to be Gerd

Chan Rasjid wrote:2) Hash TT records - I read something like shifting _int64 has some penalties as our CPU register is not true 64 bit. Hashing need to be efficient as there may be potential cache thrashing or page swapping often.
[edited]
So if my hash record is 16 bytes, is it not better to use 1 _int64 hashkey + 2 int for others though we may have some unused bits here. ie data retrieval is through int shifts, not 64 bit shift.

Best regards
Rasjid

Shifting has penalities in general on some cpu's (2 cycle op) and shifting of 64 bit entities : which are usually just emulated by the compiler) is much more expensive. (get msb's shifted from lower to upper 32 bits , etc (for << , inverse for >>) - yuck.

What I do is this :

typedef struct uint64_tag{
unsigned int bits[2];
}uint64;

(I am on a 32 bit box).
And just use the lsb32 and msb32 bits seperately.
When I encode info for hashtable (for example) - i make sure that the encoding is such that it never overlaps both.

I do have shift left and shift right operations defined but never use it. (Just there for completness sake and for debugging).

Another thing to note with hashtable is that you can make use of multiple probes with almost no extra cost.
Example - support you ask for hash_table[1234].
And assume each "entry" in your hash_table is 16 bytes long (8 bytes hash , 8 bytes data : I have it as unsigned int * , but consider a 16 byte entitity as a "block").
Now if your cache line is 64 bytes long , this will mean that a probe will return 64 bytes - so other than the penality for opening the memory at 1234 , you will have no subsequent penalities for probing 1235 , 1236 , 1237 .... only hit L2 and not memory.

This is the reason why some (lot ?) people use multiple probes instead of two different probes into same (or different) tables.

If you actually profile code , you will see that most people lose a lot - real lot to cache misses.

This post became a bit too longer than I intended

Regards,
Mridul

by **Gerd Isenberg** » 24 Sep 2005, 17:30

mridul wrote:...
Bottomline is - bpt's are very valuable resource in the processor , conserve them
But please dont go overboard unless you have tested your changes properly
Modern compilers can do miracles to unoptimised code - if you really want to tweak code to get the final few cycles out , you will have to roll up your sleeves and do it yourself.
Also , if possible , avoid assembly and stick to C (from my expierence) - it becomes a nightmare to keep it optimised as processors change : or you have to be Gerd
...

Hi Mridul,

thanks for the compliment, but _the_ assembler guru in chess programming is Frans Morsch, because his fritz/Quest-enginge was/is completely in assembly language.

I also had experience with a pure x86 assembler (TASM) chess engine, my former Dos-IsiChess (94-99) with a nice C++ gui. It was very hard to maintain, specially the evaluation. At that time compilers were not that smart in optimization, and it was much easier to outperform them with assembly.

In my recent 32-bit program i had only some inline assembly for bitscan (meanwhile replaced by x &-x bit-extraction and folded De Bruijn multiplication in C) and MMX/XMM-fillstuff.

The bit64*byte64 SSE2-dot-product i use is already C with assembly like intrinsics - 64-bit msc in mind, where inline assembly is not longer supported. And yes, inline assembly, specially small inlined functions have a relative huge overhead and may confuse compilers optimization abilities. IMHO inline assembly with gp-instructions doesn't make so much sense any longer, for x86-64 special instructions like bsf/bsr are supported by intrinsics.

While 32-bit compiler are really smart nowadays, i think compiler targeting x86-64 have still a long way to go, despite PGO.

I don't like the new x86-64 calling convention so much, where the caller has to reserve stackspace to save four 64-registers in the callee, even if not needed by a leaf-function. May be i don't understand the advantages or requirements of that calling convention so far, IMHO it thwarts the default fastcall convention a bit. Also, after some first experience with msc2005 i have the feeling that register usage (R08-R15) is rather conserative.

So may be for a while the theoritical gap between X86-64 assembler and C/C++ becomes a bit wider in favour to masm64. Practically you have to be Frans.

On the thread topics here - some redundant data is fine to get consistent data structures and most local memory access according to algorithms.

Branches - yes, we have enough in a chess program - but are the conditional branches really so random that todays (AMD64) very sophisticated branch prediction heuristics can not deal with them?

In abs/max/min a jxx is really faster than the bitfrickling variants, if predicted correctly most of the time. Profiler may help to find hotspots here and there and to use cmovxx or SSE2/3 integer instructions like pmax, pmin etc. And of course the more branches you have the more the branch target buffer becomes possibly trashed. So it might be a good idea not to inline often used routines with "random" branches.

I think Fabien's way to handle eval avoiding discontinuous conditional terms like

Code: Select all: if (someEvalFeature) eval += someBonus;

is probably a good idea - both avoiding branches, and based on the complexity and defectiveness (too general) of the condition, avoiding "wrong" discontinuous evaluation and search instabilities.

Your 64-bit struct is fine, but changing it some day to a scalar 64-bit type requires some work, specially if you are bitboard ;-)

I aggree with your hash-access paradigm. As we learned some time ago in the huge CCC latency-thread, with Vincent and Bob as it's best, accessing huge tables don't only suffer from memory latency but also from mapping virtual to physical memory.

This post became a bit longer too than intended ;-)

Cheers,
Gerd

by **Chan Rasjid** » 25 Sep 2005, 04:21

Thanks Mridul,Gerd for the further clarifications and ideas.

I'll see how my new nps will be when the basic changes are completed.

Best Regards
Rasjid

Winboard Forum

double memory of data in fruit

Re: double memory of data in fruit

Re: double memory of data in fruit

Re: double memory of data in fruit

Re: double memory of data in fruit

Re: double memory of data in fruit

Re: double memory of data in fruit

Re: double memory of data in fruit

Who is online