Winboard Forum

by **Alvaro Begue** » 09 Sep 2005, 00:36

Hi,

I am planning on adding parallel search capabilities to my engine (Ruy-L?pez). I have read articles about several algorithms and I feel confident I could program pretty much any of them (YBWC, APHID...).

Is there an algorithm that is the obvious choice I should start with?

What are other people using?

Thanks,
?lvaro.

by **Gian-Carlo Pascutto** » 09 Sep 2005, 00:58

DTS

If that's too complicated, use YBWC.

by **Dann Corbit** » 09 Sep 2005, 01:15

Here is a description of the DTS algorithm:
http://www.cis.uab.edu/info/faculty/hyatt/search.html

by **Alvaro Begue** » 09 Sep 2005, 04:20

Thank you very much! I'll start with DTS, then.

by **Daniel Shawul** » 09 Sep 2005, 13:11

wrong choice IMO.
I think you should start with simple PVsplitting.
daniel

by **Alvaro Begue** » 09 Sep 2005, 13:24

Daniel Shawul wrote:wrong choice IMO.
I think you should start with simple PVsplitting.
daniel

Actually, after reading Hyatt's article on DTS, I had decided to start with PV splitting to have a baseline for comparison.

Thanks!

by **Daniel Mehrmann** » 10 Sep 2005, 07:48

Gian-Carlo Pascutto wrote:DTS

If that's too complicated, use YBWC.

Where can i found more information about YBWC ?

Best,
Daniel

by **Steffan Westcott** » 10 Sep 2005, 21:03

Daniel Mehrmann wrote:
Where can i found more information about YBWC ?

Best,
Daniel

Young Brothers Wait Concept is described by Feldmann in his thesis, available here:
http://wwwcs.upb.de/php/monien/mypub.php?paperid=372

Cheers,
Steffan

by **Gian-Carlo Pascutto** » 13 Sep 2005, 13:22

Crafty actually uses YWBC, IIRC.

PV-splitting is a waste of time. If you can get that working DTS is not much harder, YBWC not at all, and you will actually get a speedup.

by **Daniel Mehrmann** » 13 Sep 2005, 13:54

Gian-Carlo Pascutto wrote:Crafty actually uses YWBC, IIRC.

PV-splitting is a waste of time. If you can get that working DTS is not much harder, YBWC not at all, and you will actually get a speedup.

Ops, I thought Crafty using DTS ? :shock:

There are two variants of YBWC, you mean PV-Split isn't good as the other variant ?

Best,
Daniel

by **Daniel Shawul** » 13 Sep 2005, 17:39

DTS needs non-recursive search. I remeber Hyatt said it is possible to do "DTS" using recursive search. But I think this requires a lot more work.

PV-splitting is a waste of time. If you can get that working DTS is not much harder, YBWC not at all, and you will actually get a speedup.

I also did not get much with PVsplit. It was much less than 1.8 (mentioned in many papers for a dual). How much was yours?
Have you implemented full DTS in DeepSjeng?
Eventhough i have a general notion of how DTS works, i don't think i will be able to implement it. IMO it is much much harder than the others.
daniel

by **Daniel Mehrmann** » 13 Sep 2005, 18:40

btw: I found a very good PDF document about the different parallell search algorithms and technics.

You can find it on my homepage: http://www.homer-chess.com under Downloads.

Best
Daniel

by **Gian-Carlo Pascutto** » 14 Sep 2005, 09:08

PV-Split gave me something like 1.2. It's worse than just sharing your hashtable and running 2 processes.

Perhaps that's the thing to try first: just run both processes on the same hashtable. It will give decent speeups, on the order of 1.4-1.5. I know some programmers that never bothered to do more than that

ABDADA might be worth a try, but I don't know anyone who got good results. It's easy if you have the above working, though.

Full DTS is hard because you need a nonrecursive search first. But the DTS papers gives a very good general overview how a parallel algorithm should work. YBWC is actually very similar to it, the main difference is that it can't make splitpoints upwards in the tree (but that also makes it a lot easier).

Deep Sjeng uses a different algorithm. I just combined ideas from DTS and YBWC, and some of my own.

When looking at these numbers, it's important to remember that "tricks" like heavy forward pruning usually worsen the speedup. The ideal situation for parallel searching is perfectly ordered same size trees. The further you go from this, the more problems.

by **Alvaro Begue** » 14 Sep 2005, 13:06

Thanks everyone for the very informative posts. I have another, related, question. I only have access to two-processor machines, but I may get access to plenty of them (EDIT: I also have access to a few 24 CPU machines, but they are much slower). So, are there any algorithms that would work well on a cluster (higher latency, no sharing of hash tables) of, say, 64 nodes?

I remember in conversation with Vincent Diepeveen in 2003 he said he thought such a thing just didn't exist, but to this day I find it hard to believe.

APHID is the only algorithm that I know of that could possibly fit in that type of system, and I plan on toying with it, but I would love to have some alternatives to compare.

by **mathmoi** » 14 Sep 2005, 13:59

Alvaro Begue wrote:Thanks everyone for the very informative posts. I have another, related, question. I only have access to two-processor machines, but I may get access to plenty of them (EDIT: I also have access to a few 24 CPU machines, but they are much slower). So, are there any algorithms that would work well on a cluster (higher latency, no sharing of hash tables) of, say, 64 nodes?

I remember in conversation with Vincent Diepeveen in 2003 he said he thought such a thing just didn't exist, but to this day I find it hard to believe.

APHID is the only algorithm that I know of that could possibly fit in that type of system, and I plan on toying with it, but I would love to have some alternatives to compare.

Hi Alvaro,

Look at this link : http://www.cs.ualberta.ca/~jonathan/Gra ... 02kishi.ps

The algorithm described there actually _use_ a transposition table in a distributed engine. In fact, all the algorithm is centred on the use of a distributed hash-table.

However I think it require a fast network.

by **diepeveen** » 14 Sep 2005, 14:42

Alvaro Begue wrote:Thanks everyone for the very informative posts. I have another, related, question. I only have access to two-processor machines, but I may get access to plenty of them (EDIT: I also have access to a few 24 CPU machines, but they are much slower). So, are there any algorithms that would work well on a cluster (higher latency, no sharing of hash tables) of, say, 64 nodes?

I remember in conversation with Vincent Diepeveen in 2003 he said he thought such a thing just didn't exist, but to this day I find it hard to believe.

APHID is the only algorithm that I know of that could possibly fit in that type of system, and I plan on toying with it, but I would love to have some alternatives to compare.

Hi,

What is the network in the cluster?
myrinet?
quadrics?
dolphin?

In case of myrinet then which myri card is it (the cheaper ones are dead slow compared to the more expensive $1500 ones)?

100 mbit/gigabit really is too slow to be of any realtime usage.

Please note that the supercomputer where diep ran at (512 cpu's),
has similar latencies to the fastest network cards (quadrics). The latest editions of those cards are real good when you have pci-x 133Mhz bus on each mainboard.

However those networks are pretty expensive. A switch for 8 nodes is already like 3500 dollar. Each card has a price of like 999 dollar (quadrics QM500) and so on. Then you need a bunch of cables. An entire 8 node set is about 13095 dollar (quadrics).

Then you have something, that will effectively give a good latency.

The Myri latencies at their homepage are not real latencies, they are measured without software overhead, they are *raw* latencies.

Effectively they are one way pingpong like 8 us or so with MPI.

But still you need to make the software in that case that works for such networks. No way to escape MPI for most networks. Quadrics provides a SHMEM interface that's way faster for computerchess.

The big supercomputers use the same network. For example the big 8000+ itanium2 'nuclear' supercomputer in France is using this quadrics network (with 2 network cards in each node).

No way to beat that in latency at 8000 cpu's. Quadrics at huge number of nodes ( > 64 nodes) is actually far superior in latency to any other network.

So the only thing that matters is what network the cluster has. Whether it's a cluster or 'supercomputer' is not real relevant in that sense.

Yet a 100mbit network you can 'cluster', but a supercomputer never has a 100mbit network

Based upon the type of network, you limit your choices on what parallel algorithm.

But let's be honest.

Even if you have a 64 cpu P4 2Ghz cluster,
how to ever beat a quad opteron dual core with it?

The raw speed of 1 cpu is the problem.

The problem is that the speedup you need to get out of a cluster is so big, that you lose it on the raw speed of 1 opteron cpu.

Diep's speedup (with forwardpruning) was 7.02 out of 8 cpu's at the quad opteron dual core.

Diep's speedup out of a big 'cluster' is 14% to 24%.

14% worst case, 30% best.

let's use 20% speedup on average.

20% from 64 cpu's == 12.8x out of 64

However, a 2Ghz P4 is like a 1Ghz opteron.

6.4Ghz opteron therefore you have.

A quad dual core 2.2Ghz opteron == 8 * 2.2 = 17.2Ghz

See the real problem of clusters?

You need hundreds of cpu's at a fast network to be competative with a quad.

Alternative is you need good nodes. A good node is for example a dual opteron, or even better a dual core dual opteron.

The problem is, most clusters use slow P4 cpu's single core.

there is no dual core P4 Xeons yet...

Please note. cluster or not. YBW is unbeatable.

Vincent

by **Alvaro Begue** » 14 Sep 2005, 15:48

I have good nodes (dual opteron 2.6GHz, plenty of fast RAM) but poor network (Gigabit Ethernet).

by **diepeveen** » 14 Sep 2005, 16:03

Alvaro Begue wrote:I have good nodes (dual opteron 2.6GHz, plenty of fast RAM) but poor network (Gigabit Ethernet).

How many nodes?

If you have just 2, all you need is 2 cards QM500 + cable.
Price perhaps about 2000 euro.

Note that most networks don't work without switch.

Gigabit ethernet has a poor latency. With MPI library and optimized linux kernel for it, count at 100 us to get 8 bytes.

compare with the one way pingpong latency i get here with 2 old quadrics cards (first revision of qm500) between 2 dual k7's with just 66Mhz pci (which is a lot slower than pci-x 133Mhz) which delivers already under 3-4 us for 8 bytes.

So we speak of factor 25 slower that your cards are for chess.

That's pretty much.

Please note the big supercomputer at which i run in 2003 was 5.8 us for 8 bytes *on average*. That's 3-4 us one way pingpong on average.

the highend network cards of today are real real genius in this respect.

Vincent

by **Alvaro Begue** » 14 Sep 2005, 16:20

diepeveen wrote:
Alvaro Begue wrote:I have good nodes (dual opteron 2.6GHz, plenty of fast RAM) but poor network (Gigabit Ethernet).

How many nodes?

If you have just 2, all you need is 2 cards QM500 + cable.
Price perhaps about 2000 euro.

The computers belong to my company, and they have hundreds of them. I think they would let me occasionally use at least 32 of them for my hobby, maybe even 64. Installing any additional hardware is out of the question, since they won't let me jeopardize the stability of the system. Our sysadmins feel very strongly about this, and I think it's with good reason.

Using MPI is not a problem. I did some tests a few days ago and I found it pretty easy to use.

by **diepeveen** » 14 Sep 2005, 16:28

Alvaro Begue wrote:
diepeveen wrote:
Alvaro Begue wrote:I have good nodes (dual opteron 2.6GHz, plenty of fast RAM) but poor network (Gigabit Ethernet).

How many nodes?

If you have just 2, all you need is 2 cards QM500 + cable.
Price perhaps about 2000 euro.

The computers belong to my company, and they have hundreds of them. I think they would let me occasionally use at least 32 of them for my hobby, maybe even 64. Installing any additional hardware is out of the question, since they won't let me jeopardize the stability of the system. Our sysadmins feel very strongly about this, and I think it's with good reason.

Using MPI is not a problem. I did some tests a few days ago and I found it pretty easy to use.

Hi, i have some thoughts on this what you could do in your specific case,
but it won't work for big clusters. Just small ones.

talk to me online otherwise.
icq: 40882246
aim: diepchess
msn: diepchess@hotmail.com (don't email there)
skype: diepchess

Vincent

Winboard Forum

Popular parallel search algorithms?

Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Parallel search algorithms

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Re: Popular parallel search algorithms?

Who is online