Alvaro Begue wrote: Thanks everyone for the very informative posts. I have another, related, question. I only have access to two-processor machines, but I may get access to plenty of them (EDIT: I also have access to a few 24 CPU machines, but they are much slower). So, are there any algorithms that would work well on a cluster (higher latency, no sharing of hash tables) of, say, 64 nodes?
I remember in conversation with Vincent Diepeveen in 2003 he said he thought such a thing just didn't exist, but to this day I find it hard to believe.
APHID is the only algorithm that I know of that could possibly fit in that type of system, and I plan on toying with it, but I would love to have some alternatives to compare.
Hi,
What is the network in the cluster?
myrinet?
quadrics?
dolphin?
If it is Myrinet, which Myri card is it (the cheaper ones are dead slow compared to the more expensive $1500 ones)?
100 Mbit or gigabit Ethernet is really too slow for any real-time use.
Please note that the supercomputer Diep ran on (512 CPUs) has latencies similar to those of the fastest network cards (Quadrics). The latest editions of those cards are really good when you have a 133 MHz PCI-X bus on each mainboard.
However, those networks are pretty expensive. A switch for 8 nodes is already about 3500 dollars. Each card costs about 999 dollars (Quadrics QM500), and so on. Then you need a bunch of cables. An entire 8-node set is about 13095 dollars (Quadrics).
Then you have something that will actually give good latency.
The Myri latencies on their homepage are not real latencies; they are measured without software overhead, so they are *raw* latencies.
In practice a one-way ping-pong with MPI takes about 8 us or so.
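To make the ping-pong figure concrete, here is a minimal sketch of how such a number is usually obtained: bounce a small message back and forth many times and take half of the average round trip. This is a stand-in under loud assumptions, not an MPI benchmark: it uses a local Python `socket.socketpair()` and a thread instead of two cluster nodes, and the names `recv_exact`, `echo_peer` and `one_way_latency_us` are made up for illustration. The value it prints says nothing about Myrinet or Quadrics; the point is only that a measured ping-pong includes all the software overhead that a raw figure leaves out.

```python
import socket
import threading
import time


def recv_exact(conn, size):
    """Read exactly `size` bytes from a stream socket."""
    buf = b""
    while len(buf) < size:
        chunk = conn.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def echo_peer(conn, rounds, size):
    """The 'pong' side: send every message straight back."""
    for _ in range(rounds):
        conn.sendall(recv_exact(conn, size))


def one_way_latency_us(rounds=2000, size=8):
    """Half of the average round-trip time, in microseconds."""
    ping, pong = socket.socketpair()
    peer = threading.Thread(target=echo_peer, args=(pong, rounds, size))
    peer.start()
    msg = b"x" * size
    start = time.perf_counter()
    for _ in range(rounds):
        ping.sendall(msg)          # ping
        recv_exact(ping, size)     # wait for the pong
    elapsed = time.perf_counter() - start
    peer.join()
    ping.close()
    pong.close()
    return elapsed / rounds / 2 * 1e6


if __name__ == "__main__":
    print(f"one-way latency: {one_way_latency_us():.1f} us")
```

On a real cluster the same loop would run between two nodes through the message-passing library, and that is the number to compare against the vendor's raw figure.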
But you still need to write software that works on such networks. There is no way to escape MPI on most networks. Quadrics provides a SHMEM interface that's way faster for computer chess.
The big supercomputers use the same network. For example, the big 8000+ Itanium 2 'nuclear' supercomputer in France uses this Quadrics network (with 2 network cards in each node).
There is no way to beat that in latency at 8000 CPUs. At a huge number of nodes (> 64 nodes), Quadrics is actually far superior in latency to any other network.
So the only thing that matters is what network the cluster has. Whether it's a cluster or a 'supercomputer' is not really relevant in that sense.
You can build a 'cluster' on a 100 Mbit network, but a supercomputer never has a 100 Mbit network.
The type of network limits your choice of parallel algorithm.
But let's be honest.
Even if you have a 64 CPU P4 2 GHz cluster, how will you ever beat a quad dual-core Opteron with it?
The raw speed of 1 CPU is the problem.
The speedup you need to get out of a cluster is so big that you lose it on the raw speed of 1 Opteron CPU.
Diep's speedup (with forward pruning) was 7.02 out of 8 CPUs on the quad dual-core Opteron.
Diep's speedup on a big 'cluster' is 14% to 24% of the CPU count.
14% worst case, 30% best.
Let's use a 20% speedup on average.
20% of 64 CPUs == 12.8x out of 64.
However, a 2 GHz P4 is like a 1 GHz Opteron.
So you effectively have 12.8 x 1 GHz = 12.8 GHz of Opteron (about 6.4 Opteron CPUs at 2 GHz).
A quad dual-core 2.2 GHz Opteron == 8 * 2.2 = 17.6 GHz.
See the real problem of clusters?
You need hundreds of CPUs on a fast network to be competitive with a quad.
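The back-of-the-envelope comparison above can be written down as a small Python sketch. The function names are hypothetical, "efficiency" here means speedup divided by CPU count, and counting a 2 GHz P4 as 1 GHz of Opteron is the rule of thumb from this post, not a measurement.

```python
import math


def effective_ghz(cpus, efficiency, ghz_per_cpu):
    """Useful clock rate: CPU count x parallel efficiency x per-CPU GHz."""
    return cpus * efficiency * ghz_per_cpu


def cpus_to_match(target_ghz, efficiency, ghz_per_cpu):
    """Smallest CPU count whose effective GHz reaches target_ghz."""
    per_cpu = efficiency * ghz_per_cpu
    # The tiny epsilon guards against floating-point noise before rounding up.
    return math.ceil(target_ghz / per_cpu - 1e-9)


# Assumption from the post: a 2 GHz P4 counts as 1 GHz of Opteron.
P4_AS_OPTERON_GHZ = 1.0

cluster_ghz = effective_ghz(64, 0.20, P4_AS_OPTERON_GHZ)   # 64 P4s at 20%
quad_raw_ghz = 8 * 2.2                                     # 8 cores, no parallel loss
quad_measured_ghz = effective_ghz(8, 7.02 / 8, 2.2)        # using the 7.02 speedup

if __name__ == "__main__":
    print(f"cluster:         {cluster_ghz:.1f} GHz of Opteron")
    print(f"quad (raw):      {quad_raw_ghz:.1f} GHz")
    print(f"quad (measured): {quad_measured_ghz:.1f} GHz")
    for eff in (0.14, 0.20):
        n = cpus_to_match(quad_raw_ghz, eff, P4_AS_OPTERON_GHZ)
        print(f"P4s needed at {eff:.0%} efficiency: {n}")
```

With these inputs the 64-CPU cluster stays below the quad, and just matching the quad's raw 17.6 GHz takes roughly 90 to 130 P4s depending on which efficiency you plug in. Efficiency usually drops further as the node count grows, which this simple model does not capture.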
The alternative is good nodes. A good node is, for example, a dual Opteron, or even better a dual-core dual Opteron.
The problem is that most clusters use slow single-core P4 CPUs.
There are no dual-core P4 Xeons yet...
Please note: cluster or not, YBW is unbeatable.
Vincent
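For readers who have not met YBW (Young Brothers Wait), the core rule is that the first ("eldest") move at a node is searched completely before the remaining moves may be searched in parallel, so the parallel searches start with a useful bound. The following is only a toy sketch of that rule under stated assumptions: the tree is a nested Python list with integer leaves scored for the side to move, the split happens at the root only, and `negamax` and `ybw_root` are illustrative names, not code from Diep or any real engine. Real implementations split at many nodes, abort siblings on a cutoff and share hash tables; none of that is shown, and Python threads will not give an actual speedup here.

```python
from concurrent.futures import ThreadPoolExecutor

INF = float("inf")


def negamax(node, alpha, beta):
    """Sequential fail-soft alpha-beta on a nested-list tree."""
    if isinstance(node, int):
        return node  # leaf score from the side to move's point of view
    best = -INF
    for child in node:
        best = max(best, -negamax(child, -beta, -alpha))
        alpha = max(alpha, best)
        if alpha >= beta:
            break  # cutoff: the remaining brothers are not needed
    return best


def ybw_root(root, workers=4):
    """Young Brothers Wait at the root: eldest first, then the rest in parallel."""
    if isinstance(root, int):
        return root
    eldest, younger = root[0], root[1:]
    # 1. The eldest brother is searched alone with a full window.
    best = -negamax(eldest, -INF, INF)
    # 2. Only now are the younger brothers handed out, each one
    #    searched with the bound established by the eldest.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(negamax, child, -INF, -best) for child in younger]
        for future in futures:
            best = max(best, -future.result())
    return best


if __name__ == "__main__":
    tree = [[3, 5], [6, [9, 1]], [1, 2]]
    print(ybw_root(tree), negamax(tree, -INF, INF))
```

Both calls return the same root value; the parallel version merely searches the younger brothers with the window fixed after the first move instead of tightening it move by move.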