• @dragontamer · 9 months ago

    I don’t like these in most cases. Before y’all yell at me, lemme explain.

    1. Node-to-Node communication is a massively important problem. The easiest way to solve node-to-node communication is to have all the devices on the same silicon die, i.e., buy a 64-core EPYC. (Note: internally, AMD actually solved die-to-die communications with their Infinity Fabric, and that’s the key to their high-speed core-to-core communications despite having so many cores on one package.)

    2. Node-to-Node communication is a massively important problem. Once you maximize the size of a single package, like a 64-core EPYC, the next step is chip-to-chip communications, such as dual-socket systems (two CPUs on one motherboard). In practice, this is an extension of AMD’s Infinity Fabric. Note that Intel has an Ultra Path Interconnect that works differently but has similar specs (8-way CPU-to-CPU communications, NUMA awareness, etc.).

    3. Node-to-Node communication is a massively important problem. Once you’ve maximized the speed possible on a singular motherboard, your next step is to have a high-speed motherboard-to-motherboard connection. NVidia’s NVLink is perhaps the best example of this, with GPU-to-GPU communications measured on the order of TBs/second.

    4. Node-to-Node communication is a massively important problem. Once you’ve maximized NVidia’s NVLink, you use NVidia’s NVSwitch to expand communication to more GPUs.

    5. Node-to-Node communication is a massively important problem. Once you’ve maximized a cluster with dual-socket EPYCs and NVLink + NVSwitch GPUs, you then need to build out longer-scale communication networks. 10Gbit Ethernet can be used, but 400Gbit InfiniBand is popular amongst nation-state supercomputers for a reason. I think I’ve read papers suggesting that 100Gbit or 40Gbit fiber optics is a good in-between and yields acceptable results (not as fast as InfiniBand, but still much faster than your standard RJ-45-based consumer Ethernet). 10Gbit Ethernet was used on some projects IIRC, so if you’re trying to save money on the interconnect, it’s still doable.


    So when I see “someone builds a clustered computer out of 1Mbit (aka 0.000001 Tbit/sec) I2C communications,” it’s hard for me to get excited, lol. The entire problem space is, in practice, defined by the gross difficulty of computer-to-computer communications… and I2C is just not designed for this space. Surprisingly, modern supercomputers are “just servers,” so anyone who has experience with the $20,000-class set of Xeons or EPYCs has experience with real supercomputer hardware today. (Which y’all can rent today for just a few dozen dollars per day from Amazon Web Services, btw, if you really wanted to. Cloud computing has made high-performance computing accessible in practice to even the cheapest hobbyist.)

    Now… when am I excited about “cheap clusters”? Well, the #1 problem with the approach I listed in 1-5 above is that such a beastly computer costs $10 million or more. Even entire nation-states can struggle to find the budget for this, let alone smaller corporations or hobbyists. But the “skills” needed to program a $10-million supercomputer are still required, so we need to think about how to train the next generation of programmers to use these expensive supercomputers. (Unlike a rented AWS instance, a Rasp. Pi cluster has to be taken care of by the administrator, building real administration skills.)

    There was a project that clustered Rasp. Pis over Ethernet and used MPI (Message Passing Interface), which handles this latter case. By using Rasp. Pis + standard Ethernet switches as the basis of the cluster, it would only cost thousands of dollars, not millions, to build a large cluster of hundreds of Rasp. Pis. MPI is one of the real APIs used on the big-boy nation-state-level supercomputers as well. Rasp. Pis are not NUMA-aware, nor do they have a good GPU-programming interface, however, so it’s not a perfect emulation of the issues. But it’s good enough to teach students. A Rasp. Pi supercomputer will never be “useful” in the real world outside of training, but student training is a good enough reason to be excited.
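
    To give a flavor of what that teaches, here’s a minimal MPI sketch in C showing the standard rank/size/reduce pattern (the file names and the work split are just illustrative examples, not anything from the project above):

    ```c
    /* Minimal MPI sketch: every node reports in, then rank 0 collects a sum.
     * Build:  mpicc mpi_hello.c -o mpi_hello
     * Run:    mpirun -hostfile pis.txt ./mpi_hello
     * (mpi_hello and pis.txt are just example names for this sketch.) */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which node/process am I?  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes total? */

        char host[MPI_MAX_PROCESSOR_NAME];
        int len;
        MPI_Get_processor_name(host, &len);
        printf("Rank %d of %d on %s\n", rank, size, host);

        /* Toy work split: each rank sums a strided slice of 0..999999,
         * then MPI_Reduce combines the partial sums on rank 0. */
        long long local = 0;
        for (long long i = rank; i < 1000000; i += size)
            local += i;

        long long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("Sum of 0..999999 = %lld\n", total);

        MPI_Finalize();
        return 0;
    }
    ```

    The same program, unchanged, scales from two Pis on a desk to thousands of nodes on an InfiniBand cluster, which is exactly why MPI is worth teaching.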

    I look at this Rasp. Pi Pico Cluster and… it’s not clear to me how this teaches people about the “big boy” supercomputers, or how it’d be more useful than standard multi-core programming. I2C is not the language of high-performance computers. And the Rasp. Pi Pico cannot run Linux or other OSes that’d teach practical administration skills, either.


    For the embedded hobbyist, I’d suggest grabbing a Xilinx Zynq FPGA+ARM chip and experimenting with the high-performance compute available to you from custom FPGAs. That’s how the satellite and military folks get large amounts of computational power into small power envelopes, which is likely why you’d be “interested” in a Rasp. Pi Pico in the first place. (Power constraints from small satellites or military weight restrictions prevent them from using real-world supercomputers on radars or whatever.) You can reach absurdly powerful levels of compute with incredibly low power usage with this pattern. (And I presume anyone in the “Microcontrollers” discussion is here because we have an interest in power-constrained computing.)

    If power is not a constraint for you… then you can study up on GPU programming / Xeon servers / EPYC servers / etc. for the big stuff. I am the moderator at https://lemmy.world/c/gpu_programming btw, so we can talk more over there if you’re interested in the lower-level GPU-programming details that’d build up to a real supercomputer. The absurd amount of compute available for just $500 or so today cannot be overstated. An NVidia 4080 or an AMD 7900 XTX has more compute power than entire Cray supercomputing clusters of the early 2000s. Learning how to unlock this power is what GPU programming (CUDA, DirectX, HIP/ROCm, OpenCL) is all about. I’m no expert in how to hook these clusters together with MPI / InfiniBand / etc., but I can at least help you on this subject.
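
    To make “unlocking this power” concrete, here’s a minimal GPU sketch in plain C using OpenCL (chosen because it runs on both the NVidia and AMD cards mentioned above). It runs the classic SAXPY kernel (y = a*x + y) with one GPU work-item per array element; error checking is stripped out, so treat it as a sketch rather than production code:

    ```c
    /* Minimal OpenCL SAXPY sketch (y = a*x + y) in plain C.
     * Error checking omitted for brevity.  Build with e.g. `gcc saxpy.c -lOpenCL`. */
    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <stdio.h>

    #define N 1048576

    static const char *kernel_src =
        "__kernel void saxpy(__global const float *x,              \n"
        "                    __global float *y, const float a) {   \n"
        "    int i = get_global_id(0);                             \n"
        "    y[i] = a * x[i] + y[i];                               \n"
        "}                                                         \n";

    int main(void)
    {
        static float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* Pick the first GPU the OpenCL runtime can find. */
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

        /* Compile the kernel at runtime and copy the data to the GPU. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "saxpy", NULL);

        cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(x), x, NULL);
        cl_mem dy = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   sizeof(y), y, NULL);

        float a = 3.0f;
        clSetKernelArg(k, 0, sizeof(cl_mem), &dx);
        clSetKernelArg(k, 1, sizeof(cl_mem), &dy);
        clSetKernelArg(k, 2, sizeof(float), &a);

        /* One work-item per element: a million threads, scheduled by the GPU. */
        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dy, CL_TRUE, 0, sizeof(y), y, 0, NULL, NULL);

        printf("y[0] = %f (expected 5.0)\n", y[0]);  /* 3*1 + 2 */

        clReleaseMemObject(dx); clReleaseMemObject(dy);
        clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
        return 0;
    }
    ```

    CUDA and HIP express the same idea with nicer syntax but are vendor-specific; the structure (copy data over, launch a grid of threads, copy results back) is the same everywhere.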

    • @[email protected] · 9 months ago

      Every time.

      Every time someone posts their homebrew cluster in any public forum, someone has to point out how useless it is compared to state-of-the-art HPC. It doesn’t matter if it’s a pile of dumpster’d thin clients or a PCB full of CH32V003s; it’s always somehow misguided because there’s no RDMA, or high-radix switch fabric, or enough horsepower to run weather models.

      You act like the project’s creator doesn’t know this is a toy and thinks it’s a TOP500 contender, and make some wild suggestions that they trash everything and buy FPGAs, GPUs, EPYCs, and a bottomless pit of cloud cycles. Good for you, you also realized that Cray isn’t going to base Slingshot 2 on I2C. Pat yourself on the back, you earned it.

      Since we’re slinging unsolicited advice, here’s a bit more: if someone shares their accomplishment, regardless of how fundamentally flawed it is, it costs you nothing and is far more helpful to say “Hey, that’s awesome! I like how you did $FEATURE. Great job!” and stop right there than to be condescending and nitpicky.

      • @dragontamer · 9 months ago

        > Since we’re slinging unsolicited advice, here’s a bit more: if someone shares their accomplishment, regardless of how fundamentally flawed it is, it costs you nothing and is far more helpful to say “Hey, that’s awesome! I like how you did $FEATURE. Great job!” and stop right there than to be condescending and nitpicky.

        Sure. You first. Please tell me what $FEATURE of this cluster you like. The best I’ve got is “it looks like a Cray from the 80s,” but I’ve usually cared more about software or hardware features than just looks.


        I don’t necessarily think that clusters need to be built out of the latest-and-greatest parts. I really do think that a Rasp. Pi cluster with MPI is more than enough for many students and hobbyists. I also think there are other parts you could use to do that (e.g., maybe a TI Sitara or something), and you’d actually get something respectable from a software perspective.

        And BTW: the Zynq FPGA is low-end and relatively basic. It’s, again, the software that matters (or in this case, the VHDL or Verilog design you program into the FPGA). Everyone in this hobby can afford a Zynq, with some dev boards in the $150 range. That’s why I pushed it in my earlier post. If that’s still too much for you, there are cheaper FPGAs, but the Zynq is a good one to start with since it’s an ARM core + FPGA combo, which is very useful in practice.

        • @[email protected] · 9 months ago

          Sure, no prob.

          Hey, that’s awesome! I like the random holdoff for address assignment, the simple workload distribution mechanism, the awareness in the writeup that I2C was a low effort communications mechanism, not letting perfection get in the way of getting the job done, and the cushions on the enclosure with colors that are a nod to the canonical X-MP skins. Great job, Derek!