cost of memory lookup


Postby Daniel Shawul » 25 Mar 2007, 08:03

I like using hash tables here and there in my engine:
the main transposition table (not probed in qsearch), an eval cache, a pawn cache, a king-pawn cache, etc. However, aside from the main TT, I am not sure the others pay off. Since none of them cuts off any part of the tree, the lookup has to cost less than the calculation it replaces. How many clock cycles does a one-byte lookup from memory cost?
The only reason I keep some of those tables is that they might become beneficial as more knowledge (calculation) is added. Currently they neither hurt nor help much.

Daniel

Re: cost of memory lookup

Postby Pedro Castro » 25 Mar 2007, 17:26

Hi Daniel,

For a long time the eval cache did not pay off in my program: probing it cost more than just running the evaluation. Back then my evaluation was about 20k; it has since roughly doubled to about 40k. Probing the eval cache now, after lazy eval, I gain about 6% in speed.

I also had trouble getting the pawn cache to pay off. H.G. Muller suggested using a 256K table so that it would fit in the system's memory cache. My pawn-structure evaluation is very simple, and in the end I gained about 1-2% in speed. That is not much, but as you say, it would allow adding knowledge, and then I would get a better result.
Best wishes,

Pedro Castro

Re: cost of memory lookup

Postby H.G.Muller » 26 Mar 2007, 10:02

Memory is accessed by the CPU in chunks of 64 bytes (a 'cache line'). Even if you only access a single byte, it fetches the entire cache line (and writes back the cache line that had to be thrown out of the cache to make room for it, if it was changed).

On my machine (a 1.3GHz Celeron M) a cache-line fill takes 286 CPU clocks. The faster your CPU, the more clocks it will take, as the memory does not care about CPU speed and remains just as slow.

On AMD machines you might expect it to be a bit faster, as the DRAM there is connected directly to the CPU, rather than through the North Bridge of the chip set.

Re: cost of memory lookup

Postby Daniel Shawul » 27 Mar 2007, 09:33

My eval cache is worth it; it speeds things up by about 20%, I think. The pawn cache is only slightly beneficial, while the king cache actually slows things down.
From H.G.'s figure I guess the cached computation needs to cost more than 286 clocks before I see any benefit from a table.
I am using a two-slot depth/always-replace transposition table (a sketch of such a store follows below), but I guess I could use 4 slots, as the TT entry size is 16 bytes, so 4 entries fit in one cache line.
How about 8 or 16 probes? Has anyone compared multiple probes with other replacement schemes?
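
(For concreteness, a hedged sketch of a two-slot depth/always-replace store as described above; Entry, table and INDEX_MASK are illustrative names, not any engine's actual code:)
Code: Select all
void tt_store(uint64_t key, Entry e)
{
    Entry *b = &table[2 * (key & INDEX_MASK)];  /* bucket = 2 adjacent slots */

    if (e.depth >= b[0].depth)
        b[0] = e;   /* slot 0: depth-preferred, keeps the deepest entry */
    else
        b[1] = e;   /* slot 1: always-replace */
}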

regards,
Daniel

Re: cost of memory lookup

Postby H.G.Muller » 27 Mar 2007, 14:15

I have systematically tried many replacement schemes and observed their performance as a function of hash-table size and tree size. I did this on perft trees, though, because for search the results tend to be much noisier: in search, hash hits can affect what you want to search, so that an extra-deep hit with a score worse than what a search at the nominal depth would have returned can make you switch to another PV, and then later, when all branches have been searched to this depth, switch back to the original one. The hash hit then actually makes the search longer than it would have been without it. On perft trees you don't have such problems: every hash hit prunes that branch of the tree, and that's it.

I am still working on compiling the collected data in a useful form, to publish them on my website. (But I am working on so many things at once... :( )

The preliminary conclusion, however, was that most schemes perform within 20% of each other. The only ones significantly worse (needing about twice the hash size for the same performance) were pure always-replace and pure depth-preferred. Shallowest-of-four was slightly better than shallowest-of-two. Shallowest-of-seven was significantly worse than either, despite the fact that all seven entries were in the same cache line! (I used 9-byte entries, cache-aligned.)

Equi-distributed draft seemed to perform best ( http://www.talkchess.com/forum/viewtopi ... 44&start=2 ). To adapt it to the 7-entry bucket size, I use 1 always-replace entry per 6 depth-preferred ones, and for any position try only entries n and n+1 plus the always-replace entry (where n = 0-4 is derived from the hash key).
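
(A hedged sketch of that probing pattern; Entry, table, INDEX_MASK and the way n is derived are assumptions, not Joker's actual code:)
Code: Select all
/* Probe a 7-entry bucket: slots 0-5 depth-preferred, slot 6 always-replace. */
Entry *tt_probe(uint64_t key, uint32_t lock)
{
    Entry *b = &table[7 * (key & INDEX_MASK)];
    int n = (int)((key >> 32) % 5);          /* n = 0-4, derived from the key */
    Entry *probe[3] = { &b[n], &b[n+1], &b[6] };

    for (int i = 0; i < 3; i++)
        if (probe[i]->lock == lock)
            return probe[i];                 /* probe hit */
    return NULL;                             /* miss: replace among these three */
}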

Re: cost of memory lookup

Postby Daniel Shawul » 27 Mar 2007, 14:52

I tried four slots and it performed worse in games. In positions like Fine #70, where you really don't want to lose important information, it performed better.
The replacement scheme was: replace the shallowest entry, where the age (the number of moves made before this move) is also added to the search depth. With this scheme it reached much higher depth with a very good score at Fine #70, but it didn't perform that well in games. I made sure that the alignment was 16 bytes when I did that test. Do I have to make sure that the address returned by malloc is divisible by 16, so that all 4 slots are in the same cache line? One more question: if I access a byte that is not necessarily the start of the TT entry, will the cache line start at that byte? Excuse my ignorance on this matter.
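
(Taken literally, that victim-selection rule might look like this hedged sketch; Entry and its age field are illustrative names:)
Code: Select all
/* Among the 4 slots, pick the 'shallowest' entry as the victim, where
   age (moves made before the entry was stored) is added to the draft. */
int pick_victim(const Entry slots[4])
{
    int v = 0;
    for (int i = 1; i < 4; i++)
        if (slots[i].depth + slots[i].age < slots[v].depth + slots[v].age)
            v = i;
    return v;
}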

My opinion is that saving one more important position may matter more than worrying about latency in long searches, where the hash table is full and replacements are a must.

How come your entries are only 9 bytes?
Please do update your page with the results you collected.

Daniel

Re: cost of memory lookup

Postby H.G.Muller » 27 Mar 2007, 17:00

You will have to make sure the address is divisible by 64, not 16!

Cache lines always start at a 64-byte boundary. If you access a byte in the middle, the hardware first fetches the word that byte is in (to minimize latency), but is then committed to fetching the other 7 words before it can do any other memory access. If your program accesses other bytes in the same cache line, they thus add some more latency. IIRC the order in which the CPU fetches the memory words is as if you XOR the initial address with a binary counter that counts from 0 to 7.

You might adapt the memory layout of your hash bucket to the fetch order: make sure the first two words fetched contain the hash locks, so that on a miss the program does not have to wait for the remaining 6 words. (This does not save too much, as the largest part of the time is used for setting up the transfer; once it is running, the words are transferred at a rate of one per bus clock. On a 1.6GHz Celeron M with a 400MHz FSB that is one word per 4 CPU clocks, so you could save 24 clocks.)

My hash entry contains:
Code: Select all
int Lock;        // 32-bits of the hash key not used for index
char From;       // From square, + 2 bits to indicate bound type
char To;         // To Square + 2 bits to indicate special moves
short int Score; // 16 bit
char Depth;      // Draft of the entry

Each of those is an array of 7 elements in a hash bucket. The 64th byte in the bucket holds flags (1 bit per entry) that are set when the entry gives a probe hit or is used for replacement. After a search I purge all entries whose flag is not set from the table.
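
(Laid out explicitly, such a 64-byte bucket might look like the following hedged sketch, with field sizes taken from the post; the struct name is illustrative:)
Code: Select all
struct Bucket {            /* 64 bytes = one cache line            */
    int   Lock[7];         /* 28 bytes: 32-bit hash locks          */
    char  From[7];         /*  7 bytes: from-square + bound bits   */
    char  To[7];           /*  7 bytes: to-square + special moves  */
    short Score[7];        /* 14 bytes: 16-bit scores              */
    char  Depth[7];        /*  7 bytes: drafts                     */
    char  Flags;           /*  1 byte: 1 bit per entry             */
};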

(I am not happy yet about this purging overhead; in the future I want to integrate the purging with the probing, by letting the flag indicate whether the entry belongs to the current search, alternately clearing or setting it. The early iterations of the search would then mark all entries still in use as belonging to the current search, until (say) half of the entries in the table are so marked (during this time the normal depth-preferred replacement would be used irrespective of the flag setting), while the remaining part of the search would then consider all unmarked entries as empty, and preferably use those for replacements.)

Re: cost of memory lookup

Postby Daniel Shawul » 27 Mar 2007, 18:33

Okay, I will try 64-byte alignment (I made a typo in my previous post) and do the test again.

This is my hashtable entry.

struct HASH_ENTRY {
    uint64 lock;
    int32  move;
    int16  score;
    uint8  depth;
    uint8  flag;
};

When probing, the first field to be accessed is the lock, so no problem, I guess. But when storing, the depth is checked first, so (score, depth and flag) would be in cache but not lock and move. Will this require an additional memory lookup?

Looking at your TT entries, I see that you store only 32 bits for the lock. I don't remember why I store the whole 64 bits :o Maybe because that is what I got from Bruce Moreland's tutorial. In your case I guess you use the upper or lower half as the key, right?

I think you should have an 'age' counter in your entry. Or otherwise do you have to clear the transposition table after a move?
Testing on a set of test positions really doesn't measure the performance of a replacement scheme, IMO. However, I do not think that every transposition-table entry from an old search is unimportant. In the most common hash table, the depth-preferred entry is replaced when the new depth is greater, or when the stored position is from a previous search. I don't totally agree with the latter, because the entry might be from the immediately preceding search and might have a large depth.

Re: cost of memory lookup

Postby H.G.Muller » 27 Mar 2007, 23:02

If you access one byte, all 64 are read from memory in a burst transfer. As long as the line stays in cache, you won't need any new memory accesses for this entry or any of the neighboring entries in the same cache line.

For this reason it is better to probe n and n^1 than n and n+1, as n+1 might be in the next cache line, while n^1 is always in the same one (if your table start is 64-byte aligned).
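
(A hedged sketch of that pairing, assuming 16-byte entries in a 64-byte-aligned table[]; names are illustrative:)
Code: Select all
Entry *probe_pair(uint64_t key, uint32_t lock)
{
    unsigned n = (unsigned)(key & (NUM_ENTRIES - 1));

    if (table[n].lock == lock)     return &table[n];
    if (table[n ^ 1].lock == lock) return &table[n ^ 1];  /* same cache line */
    return NULL;                                          /* probe miss */
}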

The part of the key that you use to derive the index from doesn't have to be stored, as it always matches. (If it doesn't match, you do not get to this entry.) Joker's index is only 22 bits, though (4M buckets of 7 entries), and a few other bits determine where within the bucket I probe. But some bits of my 64-bit key are unused, so effectively I have only a 56-bit key. So I do have a key collision every few hours of search. But these almost never give a score that propagates to the root. (It must be on or very close to the PV to do that, and only a negligible fraction of the tree nodes is.) So I figure that I benefit more from having extra entries than I suffer from the occasional collision.

Well, the 1-bit flag serves as an age counter. I could reduce the depth to 7 bits, and make it a two-bit counter. (Depths above 128 ply do not seem very useful.) You must realize that my replacement already overwrites depth-preferred entries with lower depths during the same search, if there are too many of that depth (draft). So there is no great need to clear the TT; most entries will disappear soon enough. Only those very close to the root, which have very high draft, will never become so plentiful that they are replaced by lower drafts. To prevent the obsolete ones amongst those from accumulating until they fill up to 20% of the table (when the usual mechanism would start to overwrite them), I do the marking for deletion. If an entry close to the root is not visited in a search (in any iteration), I think it is safe to assume that it really has become unreachable (because of an irreversible move).

Re: cost of memory lookup

Postby Daniel Shawul » 28 Mar 2007, 07:44

If you access one byte, all 64 are read from memory in a burst transfer. As long as the line stays in cache, you won't need any new memory accesses for this entry or any of the neighboring entries in the same cache line.


Okay, I understand now. I had assumed a word was 32 bytes; now I understand that you meant the cache-line word (64 bytes).
So in summary, as long as the TT entries start at a location divisible by 64, all 4 entries will be read. Thanks for the explanation.

You must realize that my replacement already overwrites depth-preferred entries with lower depths during the same search, if there are too many of that depth (draft).


For tests with other replacement schemes you might need more bits, though. I currently use 5 bits for age, so it takes at least 16 moves to deplete it. The idea of clearing older entries when an irreversible move is made looks interesting.

I think that using different replacement schemes for deeper and shallower searches may be a good idea: it is generally good to hang on to higher drafts near the root, with an always-replace scheme for shallower drafts, say <= 5. I have seen many cases where my engine gets a position in which the pawns are rammed against each other; it then slowly loses to the opponent, which I think is because of the replacement scheme. In those kinds of positions depth-preferred is good (Fine #70 is a good example), but for normal middlegame positions leaning towards the always-replace scheme might be a good idea.
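
(As a hedged sketch of that split rule; Entry, the field names and the threshold are illustrative, not tested code:)
Code: Select all
/* Hybrid rule: entries with shallow draft are always replaceable;
   deeper entries are kept unless the new draft is at least as deep. */
int should_replace(const Entry *old, const Entry *fresh)
{
    if (old->depth <= 5) return 1;        /* shallow: always replace */
    return fresh->depth >= old->depth;    /* deep: depth-preferred   */
}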

Re: cost of memory lookup

Postby H.G.Muller » 28 Mar 2007, 09:19

Yes, sorry for the confusion. I was using the term 'word' here in the memory sense (the group of bits simultaneously fetched, i.e. the data-bus width), which is 64 bits on any Pentium. This is different from the Intel terminology for CPUs, where 'word' even means 16 bits (legacy from the 8086 days), and they talk about double words and quadwords for 32 and 64-bit data types.

Always-replace suffers from the opposite problem to depth-preferred: the table gets completely flooded by depth=0 entries (or whatever the lowest draft you store in the table is). Theoretical considerations show that it is optimal to have an equal number of entries for every draft, and to keep the most recently stored ones of each draft (within one search). This is because a tree walk visits the first few instances of a certain position quickly (by transposing the moves just before it was reached), but takes a very long time to reach the later transpositions (these require transpositions of moves near the root). So the hit rate on older entries drops rapidly with time, and it is better to throw them out to accommodate fresher entries, even if those are less deep.
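
(One way to realize that, as a hedged sketch; the per-draft histogram count[] and the Entry type are assumptions, not a description of any engine's actual code:)
Code: Select all
/* Evict from the draft class that is currently over-represented,
   keeping the per-draft populations roughly equal. */
int equi_victim(const Entry *bucket, int nslots, const int count[])
{
    int v = 0;
    for (int i = 1; i < nslots; i++)
        if (count[bucket[i].depth] > count[bucket[v].depth])
            v = i;
    return v;   /* overwrite this slot with the fresh entry */
}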

I have only done tests for a single search, in the limit where the tree is much larger than the table. (If this is not the case, replacement is hardly an issue.) As my conclusion was that it is better to quickly replace even deep entries (except for a few very deep ones, which occupy a negligible part of the table), I don't really have to worry about aging. Most entries do not survive the search that created them, so there really is nothing to age.

Re: cost of memory lookup

Postby Robert Pope » 29 Mar 2007, 17:18

What do you do to make sure your hash table is aligned properly in memory? Is there a compiler switch, or is it done in the program?

Re: cost of memory lookup

Postby H.G.Muller » 29 Mar 2007, 22:06

I allocate the hash table dynamically now, through malloc(). If I need SIZE bytes, I ask for 64 extra and then correct the pointer by hand:
Code: Select all
#include <stdint.h>   /* for uintptr_t; a plain int would truncate on 64-bit */

uintptr_t a = (uintptr_t) malloc(SIZE+64);
a = (a + 63) & ~(uintptr_t) 63;              /* round up to a 64-byte boundary */
HashTable = (struct _HashEntry *) a;
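
(For what it's worth, a hedged alternative that skips the manual rounding, on systems that provide an aligned allocator:)
Code: Select all
#include <stdlib.h>

void *mem;
if (posix_memalign(&mem, 64, SIZE) == 0)   /* POSIX; MSVC offers _aligned_malloc */
    HashTable = (struct _HashEntry *) mem;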

If you really want HashTable not to be a pointer but a global array, you have to resort to even dirtier tricks:
Code: Select all
char Table[SIZE*sizeof(struct _HashEntry)+64];
#define HashTable ((struct _HashEntry *) (Table+CONSTANT))
...
    if(((uintptr_t) HashTable) & 63)
    {   printf("error, hash start = %p\n", (void *) HashTable);
        exit(1);
    }

where I adapted CONSTANT by hand and recompiled until the test during startup no longer aborts the program.

Re: cost of memory lookup

Postby Onno Garms » 28 May 2007, 16:13

H.G.Muller wrote:On my machine (a 1.3GHz Celeron M) a cache-line fill takes 286 CPU clocks.


How did you determine this value? Can you also tell how to determine the cost of L1 misses?

Re: cost of memory lookup

Postby H.G.Muller » 28 May 2007, 17:28

I just made a small program that declares an array of char bigger than L2, and then times how long it takes to do a billion accesses to it in strides of 64 bytes (the length of the cache line), wrapping around. Be sure to write something that the optimizer cannot optimize away. E.g.:

Code: Select all
#include <stdio.h>
#include <time.h>

#define SIZE (1<<24) /* 16MB */

char mem[SIZE];
int sum;

int main(void)
{
    clock_t t;
    int i, p = 0;

    t = clock();
    for(i=1000000000; i>0; i--)
    {
        sum += mem[p&(SIZE-1)];
        mem[p&(SIZE-1)] = i; /* make dirty; delete for clean timing */
        p += 64;
    }
    t = clock() - t;
    printf("time = %.2f s, sum = %d\n", (double)t/CLOCKS_PER_SEC, sum);
    return 0;
}


You can experiment with different array sizes and strides. If you take a size that fits in L2 but not in L1, you will get the L2 access time. You cannot measure the L1 access time this way, as it is so short that the execution of the loop itself becomes the limiting factor.

Re: cost of memory lookup

Postby Onno Garms » 28 May 2007, 19:39

Thank you.

Without the "make dirty" line, according to "top" (and according to run times), my compiler (or OS?) seems to optimize the memory allocation away, even with optimization disabled. So I added some writing before starting the timing.

Also, running times vary quite a lot; e.g. in two runs with SIZE 19, I got 39s and 48s for 10^10 iterations. There are no other time-consuming processes running. Any idea why this happens?

Other results are also strange. I got the following:

Code: Select all
SIZE   meaning      read+write   read     (SIZE = 1<<n bytes; times in seconds)
13     CPU+L1          4.9        1.9
19     above+L2        5.5        3.8
28     above+mem      34.2       15.7


This gives 0.6 sec for 10^9 read+write accesses in L2, but 1.9 sec just for reading. Am I getting something wrong?

The above gives me, for 10^9 accesses and a 2.8 GHz CPU (i.e. 10^9 cycles in 0.36 sec):
L2 r+w: 0.6 sec = 1-2 cycles
L2 r: 1.9 sec = 5-6 cycles
Mem r+w: 29.3 sec = 82 cycles
Mem r: 13.8 sec = 39 cycles

I think something must be wrong, but I cannot tell what.

Re: cost of memory lookup

Postby H.G.Muller » 03 Jun 2007, 14:32

I don't understand your numbers; what you mention in the text doesn't seem to correspond to your table.

If you would time 2.8e9 writes, seconds would directly correspond to cycles, which would make life easier.

Printing sum at the end might help to silence smart-ass optimizers. It might be a good idea to pre-fill the memory array with something, before starting the timing test. Especially Linux sometimes behaves funny when reading uninitialized data (it is all aliased to the same physical 4K RAM page).

Your RAM access seems very fast. I guess we are timing fast page-mode access. Perhaps it would be more representative to take strides of 4096+64, rather than 64 bytes, so that every access falls into a new DRAM page.
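
(In the loop above that is a one-line change; the extra 64 bytes also staggers the cache-line offset within each page:)
Code: Select all
p += 4096 + 64;   /* stride crosses into a new 4KB DRAM page on every access */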

Re: cost of memory lookup

Postby Onno Garms » 03 Jun 2007, 16:50

H.G.Muller wrote:I don't understand your numbers; what you mention in the text doesn't seem to correspond to your table.


In the text I tried to compute RAM access times by subtracting the program's runtime without L2/RAM accesses (SIZE=13) from the runtime with L2/RAM accesses (SIZE=19 / SIZE=28). Is that the correct calculation to do?

If you would time 2.8e9 writes, seconds would directly correspond to cycles, which would make life easier.


Converting seconds into cycles is not a big deal: seconds * 2.8 = cycles.

Printing sum at the end might help to silence smart-ass optimizers.


Sometimes. It does not help against optimizing the memory allocation away if there are no writes.

Here is the full test program that I used for the table above.

Code: Select all
#include <iostream>
#include <ctime>

#define SIZE (1<<28)

char mem[SIZE];
int sum;

int main()
{
    int p;

    for(p=0; p<SIZE; p+=64)   /* pre-fill so the pages really exist */
      mem[p] = p;

    clock_t t = clock();

    for(long long int i=1000000000LL; i>0; i--)
    {
        sum += mem[p&(SIZE-1)];
//        mem[p&(SIZE-1)] = i;   /* the 'make dirty' write */
        p += 64;
    }

    t = clock() - t;
    std::cout << "Time: " << (double)t/CLOCKS_PER_SEC << std::endl;
    std::cout << "sum: " << sum << std::endl;
}


Commenting out the initial write reduces the runtime from 15s to 4s, makes the program print out "sum: 0", and reduces its memory usage according to "top" from 25% (i.e. 256 MB) to almost nothing.

It might be a good idea to pre-fill the memory array with something, before starting the timing test. Especially Linux sometimes behaves funny when reading uninitialized data (it is all aliased to the same physical 4K RAM page).


That's what I did. Thank you for the explanation (which I didn't know before).

Your RAM access seems very fast. I guess we are timing fast page-mode access. Perhaps it would be more representative to take strides of 4096+64, rather than 64 bytes, so that every access falls into a new DRAM page.


Much more believable (i.e. worse) and more reproducible results. For 2.8e9 iterations:

Code: Select all
SIZE   r      r+w
13     5.36   13.6
18     7.39   15.5
19    31.6    34.6
28   114.9   179.9



The only thing I still don't understand is why SIZE=19 is so much slower than SIZE=18. I have a megabyte of L2.

Your postings have been a big help for me. Thank you.

Re: cost of memory lookup

Postby H.G.Muller » 03 Jun 2007, 20:28

You should not correct for the time of the loop (or L1 access) itself, as computation and waiting for memory access proceed in parallel. So as long as the loop with cache hits is faster than the measured time, the measured time is all memory access, while in parallel the CPU executes the instructions for a few accesses ahead and is stalled the rest of the time.

The difference between 256KB and 512KB might be caused by TLB misses. I don't know which CPU you have, or how many TLB entries that machine has for 4KB pages. Although with strides of 64 bytes this would be an unlikely explanation, as you do many accesses to the same page, and all except possibly the first should be TLB hits.

Re: cost of memory lookup

Postby Dann Corbit » 04 Jun 2007, 20:08

Daniel Shawul wrote:Looking at your TT entries, I see that you store only 32 bits for the lock. I don't remember why I store the whole 64 bits. Maybe because that is what I got from Bruce Moreland's tutorial. In your case I guess you use the upper or lower half as the key, right?


There is no reason to store the entire hash key as a cross-check.
The bottom bits are already matched by definition if you use a modulus or AND-mask to derive the table position. Therefore, the only bits we do not know about are the bits above the top of the index itself.

Suppose that you use the top 32 bits as a cross-check to be sure that your board position is identical to the one in the hash table.
Suppose further that you have 2 million hash entries, so your table index is 21 bits wide. That gives 21 bits + 32 bits = 53 bits of confirmation. Now it is possible that some of the other 11 bits do not match, but it is very unlikely. It is so improbable that the damage caused by a faulty hash lookup is too rare to affect the search in practical terms. So hash tables that store the full 64 bits are very rare. I have seen some chess engines that dispose of the cross-check altogether, but that seems a bad idea if your table index is only (for instance) 21 bits wide; in such a case I am sure there will definitely be real problems in long searches.
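
(A hedged sketch of that index/verification split; hash_key(), pos, table and the sizes are illustrative assumptions:)
Code: Select all
#include <stdint.h>

#define TABLE_BITS 21
#define TABLE_SIZE (1u << TABLE_BITS)                 /* 2M entries */

uint64_t key   = hash_key(pos);                       /* 64-bit Zobrist key (assumed) */
uint32_t index = (uint32_t)(key & (TABLE_SIZE - 1));  /* low 21 bits: implied by the slot */
uint32_t lock  = (uint32_t)(key >> 32);               /* top 32 bits: stored and compared */

if (table[index].lock == lock) {
    /* 21 index bits + 32 lock bits = 53 bits of confirmation;
       only the 11 bits in between go unchecked */
}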
