Chan Rasjid wrote:From what I have last seen, the important points would be (I am not doing an exhaustive analysis here):
1) Data organisation and use - to maximise cache coherence.
2) Simple and tight loops - to reduce branch mispredictions.
3) Nothing aggressive - no dubious ideas that I can find offhand.
4) Essentially bug-free to a large extent (don't underestimate this fact).
5) Well-tuned params (just from a cursory look) - none that conflict with each other (this can be a major problem).
6) I did not see anything speculative in eval, but then I did not find it sufficiently sophisticated.
It does not need to be - the search takes it deep enough to play positionally sound chess, so a lot of the things I always toy around with are essentially absent.
7) this space for rent :-TAKEN - Fabien, from what is said, is a (good) professional programmer, so he knows how to write fast code at his fingertips.
I think your 6-1/2 points almost completely explain all of Fruit; there is no secret, as he himself mentioned. Maybe just a little extra which others do not yet understand (again, just my guess).
I have 2 basic questions, and very probably you can help me:
1) branch misprediction - I just read a little about this and would like to know, only from a programming point of view, if my understanding is basically OK.
My understanding: the penalty is proportional to the clock cycles of the conditional expression:
if (expr){
    //code A
}else{
    ...
} ..... cascading etc.
Does it also apply to switch(expr)?
If expr is heavy we have to be alert.
If expr is a simple logical test of (x) where x involves no memory fetch (say x is in a register), then the penalty is minimal, even if block A is large.
I am not really an assembler expert like some here (Gerd, GCP, Vincent, etc.) but will give my thoughts on this topic - they might be able to correct or add to what I say.
if (cond){ ...} else if (cond1){ ...} else ...
This sort of structure is usually plain bad.
You can get tons of branch mispredictions, and lots of jmp's (if they are short jumps, the penalty might not be that bad? - but it is still pretty high), screwing up the instruction pipeline, etc.
If it is unavoidable, try to make sure that the first branch is the most commonly taken one.
If the block size is small, it might become a jump within the pipeline - which is not that bad, I think: I am a bit outdated on this....
Now, a structure of the form

int var, a = SOMEA(), b = SOMEB();
if (cond){
    var = a;
}else{
    var = b;
}
could be optimised into a cmov (and variants) - so you might not "see" much of a penalty here.
Compilers usually do this automatically (use the appropriate arch flags).
Since the penalty is much lower here, try to reduce branches to this form as much as possible.
The extra computation of calculating both a and b is usually cheaper than the branch mispredictions it avoids.
Bottom line: branch-prediction-table (bpt) entries are a very valuable resource in the processor - conserve them.
But please don't go overboard unless you have tested your changes properly.
Modern compilers can do miracles with unoptimised code - if you really want to tweak code to get the final few cycles out, you will have to roll up your sleeves and do it yourself.
Also, if possible, avoid assembly and stick to C (from my experience) - it becomes a nightmare to keep assembly optimised as processors change: or you have to be Gerd.
Chan Rasjid wrote:2) Hash TT records - I read that shifting an _int64 has some penalties, as our CPU registers are not true 64-bit. Hashing needs to be efficient as there may be potential cache thrashing or frequent page swapping.
[edited]
So if my hash record is 16 bytes, is it not better to use 1 _int64 hashkey + 2 ints for the rest, even though we may have some unused bits? I.e. data retrieval is through int shifts, not 64-bit shifts.
Best regards
Rasjid
Shifting has penalties in general on some CPUs (a 2-cycle op), and shifting 64-bit entities (which is usually just emulated by the compiler on a 32-bit target) is much more expensive: the msb's have to be carried from the lower to the upper 32 bits, etc. (for <<, the inverse for >>) - yuck.
What I do is this :
typedef struct uint64_tag{
    unsigned int bits[2];
} uint64;
(I am on a 32 bit box).
And I just use the lsb32 and msb32 halves separately.
When I encode info for the hashtable (for example), I make sure that the encoding never overlaps both halves.
I do have shift-left and shift-right operations defined but never use them (they are just there for completeness' sake and for debugging).
Another thing to note with hashtables is that you can make use of multiple probes at almost no extra cost.
Example - suppose you ask for hash_table[1234].
And assume each "entry" in your hash_table is 16 bytes long (8 bytes hash, 8 bytes data: I have it as unsigned int *, but consider a 16-byte entity as one "block").
Now if your cache line is 64 bytes long, a single probe will pull in 64 bytes - so other than the penalty for opening the memory at 1234, you will have no subsequent penalties for probing 1235, 1236, 1237... you only hit L2, not memory.
This is the reason why some (a lot of?) people use multiple probes instead of two separate probes into the same (or different) tables.
If you actually profile code, you will see that most people lose a lot - a real lot - to cache misses.
This post became a bit longer than I intended.
Regards,
Mridul