- cross-posted to:
- [email protected]
- [email protected]
The article says nothing really relevant about the architecture and implementation. "Unified memory" says nothing. Is it using a GPU-like math coprocessor or just extra cores? If it is just the CPU, it will have the same cache bus width limitations. If it is a split workload, it will be limited to tools that can split the math. The article compares this to ancient standards of compute like 32 GB of system memory when AI has been in the public space for nearly 2 years now. For an AI setup, 64 GB to 128 GB is pretty standard. It also talks about running models that are impossible to fit in their full form on even a dual-unit setup. You generally need roughly two bytes of memory per parameter (16-bit weights) to load the full version. You need the full version for any kind of training but not for general use. Even two of these systems with 256 GB of memory between them are not going to load a 405B model at full precision. Sure, it can run a quantized version, and that will be like having your own ChatGPT 3.5, but that is not training or some developer use case. Best case scenario, this would load a 70B at full precision. My own machine, with an Intel 12th gen 20 logical core CPU, 64 GB of DDR5 at max spec speed, and a 16 GB 3080Ti GPU, loads a 70B with Q4L quantization and streams slightly slower than my natural reading pace. There is no chance that it could be used in anything like an agent or for more complex tasks that require interaction. My entire reason for clicking on the article was to see the potential architecture difference that might enable larger models beyond the extremely limited CPU cache bus width problem present in all CPU architectures. The real world design life cycle of hardware is 10 years. So real AI-specific hardware is still at least 8 years away. Anything in the present is marketing wank and hackery. Hackery is interesting, but it is not in this text.
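For anyone who wants to check the memory math above, here is a rough sketch. It only counts the weights (parameters times bytes per parameter) and ignores the KV cache, activations, and OS overhead, so real requirements are higher. The bytes-per-parameter table is an assumption for illustration; real 4-bit formats use a bit more than 0.5 bytes because they also store scaling factors.

```python
# Rough weight-only memory estimate. Assumed bytes per parameter;
# real quantized formats add a little overhead for scales.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weights_gb(params_billion, precision):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion, precision, memory_gb):
    """True if the weights alone fit in the given memory."""
    return weights_gb(params_billion, precision) <= memory_gb


if __name__ == "__main__":
    for size in (70, 405):
        for precision in ("fp16", "q4"):
            gb = weights_gb(size, precision)
            print(f"{size}B @ {precision}: ~{gb:.0f} GB, "
                  f"fits in 128 GB: {fits(size, precision, 128)}, "
                  f"fits in 256 GB: {fits(size, precision, 256)}")
```

By this estimate a 405B model at 16 bits needs about 810 GB for weights alone, while a 4-bit version lands a little over 200 GB.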
NLBDR - No line breaks didn’t read
Can you tell us more about the CPU cache width problem?
Chips and Cheese is primarily where I read about it. Basically, as I understand it, super fast single thread CPU speeds are the reason why the cache bus width has remained small, and also must remain small. Parallel data has too many CLR (Capacitive/Inductive/Resistive) tradeoffs. The CPU speeds are well into radio frequencies where any stray capacitance or inductance is unacceptable, and signals no longer need real physical connections in many cases where capacitive coupling can be used to make connections.
The primary bottleneck with AI workloads is getting the tensor math into and out of the ALU (arithmetic logic unit). The x86 instruction set already has an extension for loading large tensors. It is called AVX. There are several versions of the AVX instruction set. Generally AVX-512 and newer are what you really want. AVX-512 loads a 512-bit wide word in one instruction. Unfortunately, most newer and more advanced AVX instructions only exist in the full P-cores of Intel hardware. They are hidden in consumer hardware because the E-cores do not have these instructions. The CPU scheduler of desktop operating systems would need a rewrite to handle asymmetrical threading to make these instructions available. Speculatively, these AVX instructions are likely at the heart of Intel’s actual problems with 13th gen to present systems. 12th gen started with the P-core advanced AVX instructions simply missing from the manifest of instructions in the microcode. A motherboard manufacturer discovered this and made a provision to add the server P-core AVX microcode while turning off the E-cores to avoid CPU threading issues. This enabled running the much more efficient instructions. Intel couldn’t handle this kind of creative licence and a real world performance improvement they did not exploit directly, so they started trying to fuse off the AVX instructions midway through 12th gen production. Then came the problems. They didn’t design the instructions to be fused off; they went backwards, found a vulnerable spot, and hacked in a way to fuse them off… but I digress…
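If you want to see which AVX variants your own CPU exposes, a minimal sketch for Linux follows. It assumes the usual `/proc/cpuinfo` layout with a `flags` line per logical CPU; the helper names `avx_flags` and `lanes` are made up for this example.

```python
def avx_flags(cpuinfo_text):
    """Return the sorted AVX-related flags found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line looks like: "flags : fpu vme ... avx avx2 avx512f ..."
            _, _, values = line.partition(":")
            found.update(f for f in values.split() if f.startswith("avx"))
    return sorted(found)


def lanes(vector_bits, element_bits):
    """How many elements of a given width fit in one vector register."""
    return vector_bits // element_bits


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print("AVX flags:", avx_flags(fh.read()) or "none reported")
    except OSError:
        print("No /proc/cpuinfo here (not Linux?)")
    print("fp32 lanes in a 512-bit register:", lanes(512, 32))
    print("fp32 lanes in a 256-bit register:", lanes(256, 32))
```

On a hybrid consumer chip you would expect to see `avx` and `avx2` but no `avx512*` entries, which matches the complaint above.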
Even with advanced AVX, you’re still limited by how many tensors can get in and out of the cache. The big choke point is between L2 and L1. L3 is further away and shared with all cores. The CPU architecture is simply designed to optimize for serial throughput of instructions, largely because consumers on average focus only on "CPU number bigger is more betterer" type purchases.
AI can run MUCH faster if the layers and block sizes are matched to the architecture limitations of the hardware they are running on. You may be loading all but some last few bits of data that are slightly larger than your total cache throughput, flushing, and reloading in ways that greatly increase the inference run time. The math and operations are around or beyond the total capabilities of the system design.
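To make the block size point concrete, here is a toy sketch of cache-blocked (tiled) matrix multiplication in plain Python. It is not how llama.cpp or any real kernel does it; it just shows the idea of working on tiles small enough that three of them (one each from A, B, and C) fit in a cache of a given size. The 48 KiB L1 figure is an assumed example value, not a spec for any particular chip.

```python
import math


def tile_for_cache(cache_bytes, elem_bytes=4):
    """Largest square tile edge so that three tiles fit in the cache."""
    return math.isqrt(cache_bytes // (3 * elem_bytes))


def matmul_naive(a, b):
    """Reference triple loop: C = A x B."""
    k, m = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(m)]
            for row in a]


def matmul_tiled(a, b, tile=4):
    """Same result, but visits the matrices one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                # Work stays inside one tile of A, B, and C here.
                for i in range(i0, min(i0 + tile, n)):
                    row_c = c[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        row_b = b[p]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


if __name__ == "__main__":
    print("tile edge for an assumed 48 KiB L1:", tile_for_cache(48 * 1024))
    a = [[i + j for j in range(5)] for i in range(6)]
    b = [[i * j + 1 for j in range(7)] for i in range(5)]
    print("tiled matches naive:", matmul_tiled(a, b, tile=2) == matmul_naive(a, b))
```

If the tile is slightly too big for the cache, every pass evicts data it is about to need again, which is the flush-and-reload pattern described above.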
AI workloads can also create issues with adjacent cores throttling back because the CPU was never envisioned to encounter a situation where all cores would have the same type of load and all loads are at maximum throughput. The only hardware really made for this is some server hardware. From what I have seen, the server/workstation hardware new enough to have the more advanced AVX instructions supported by llama.cpp is still too expensive to justify trying to see how it compares with a GPU/CPU hybrid.
I don’t think the GPU or present CPU have a future. Within the next 8 years compute will go in a whole new direction because running data centers with two types of processors for different workloads is untenable nonsense. Anyone that can make a good-enough architecture capable of all workloads will own the future. This will require lower single thread speeds to enable and accomplish far more parallelism. For AI, the processor speed is not as relevant as throughput.
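A back-of-the-envelope way to see why throughput matters more than clock speed: for a dense model generating one token at a time, every weight has to be read from memory once per token, so memory bandwidth divided by model size gives a ceiling on tokens per second. The sketch below uses that common rule of thumb. The bandwidth figures are made-up placeholders for illustration, not measurements of any real system.

```python
def max_tokens_per_second(bandwidth_gb_s, weights_gb):
    """Upper bound on tokens/s if every weight is read once per token."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    weights_gb = 35.0  # e.g. a 70B model at roughly 4 bits per weight
    # Hypothetical bandwidths in GB/s, for illustration only.
    hypothetical = {
        "system RAM (placeholder)": 80.0,
        "unified memory (placeholder)": 270.0,
        "GPU VRAM (placeholder)": 900.0,
    }
    for name, bandwidth in hypothetical.items():
        limit = max_tokens_per_second(bandwidth, weights_gb)
        print(f"{name}: at most ~{limit:.1f} tokens/s")
```

Real numbers come in below this ceiling, but it explains why a model split between system RAM and a GPU ends up streaming at about reading pace.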
So I’m looking for how someone adds more processors, or enables AVX-like instructions with a plan, or how they Frankenstein together a brute force approach with older hardware.
Another key aspect is that the GPU does not have the same type of memory management as the much older CPU architecture. GPU memory is tied directly to the computational hardware. System memory, on the other hand, is accessed very slowly because it is like a sliding window. The CPU can only access a very small slice of the total system memory at any given point. All of the GPU memory is available at once. So when someone says "unified memory", the memory management architecture becomes a curiosity and a dubious space to watch.
I’m no expert. I am curious. This is the best I understand it, in the abstract. I am likely incorrect in parts and pieces, but I’m always eager to learn.
You make a lot of good points in here but I think you are slightly off on a couple of key points.
These are ARM, not x64, so they use SVE2, which can technically scale to 2048-bit vectors rather than the 512 bits of AVX-512. Did they scale it to that? I’m unsure; existing Grace products are 4x128, so possibly not.
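For a sense of scale on those vector widths, a quick sketch counting how many 32-bit floats each configuration can hold across its vector units. The single 512-bit AVX-512 unit is an assumption for comparison (the number of units per core varies by chip), and the 2048-bit row is the SVE architectural maximum, not a claim about any shipping part.

```python
def fp32_lanes(units, bits_per_unit, element_bits=32):
    """Total elements held across all vector units of a core."""
    return units * bits_per_unit // element_bits


if __name__ == "__main__":
    configs = {
        "4 x 128-bit SVE2 (Grace, as described above)": (4, 128),
        "1 x 512-bit AVX-512 unit (assumed)": (1, 512),
        "1 x 2048-bit SVE2 (architectural maximum)": (1, 2048),
    }
    for name, (units, bits) in configs.items():
        print(f"{name}: {fp32_lanes(units, bits)} fp32 lanes")
```

So 4x128 adds up to the same 512 bits in flight as a single AVX-512 unit, just split across four narrower pipes.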
Second, this isn’t meant to be a performant device, it is meant to be a capable device. You can’t easily just make a computer that can handle the compute complexity that this device is able to take on for local AI iteration. You wouldn’t deploy with this as the backend; it’s a dev box.
Third, the CXL and CHI specs have coverage for memory scoped out of the bounds of the host cache width. That memory might not be accessible to the CPU but there are a few ways they could optimize that. The fact that they have an all-in-a-box custom solution means they can hack in some workarounds to execute the complex workloads.
I’d want to see how this performs versus an i9 + 5090 workstation, but even that is already going to go beyond the price point for this device. Currently a 4090 is able to handle ~20B params, which is an order of magnitude smaller than what this can handle.
Wow, that was way more thorough than I expected, thanks for taking the time!
It’s not a real problem for a system like this. The system uses CXL. Their rant is just because they didn’t take the time to click down into what the specs are.
The system uses CXL/AMBA CHI specs under NVLink-C2C. This means the memory is linked both to the GPU directly as well as to the CPU.
All of their complaints are pretty unfounded in that case and they would have to rewrite any concerns taking into account those specs.
Check https://www.nvidia.com/en-us/project-digits/ which is where I did my next level dive on this.
EDIT: This is all me assuming they are talking about the bandwidth requirements of allocating all memory as being CPU allocation rather than enabling concepts like LikelyShared vs Unique.
tl (and too few lbs) dr, anyone? 5 word summary?
Article lacks relevant architectural details.
Wall of text when it says it’s a Grace Blackwell, which is a CPU/GPU combo with ultra high speed interconnects.
Can it run modded Fallout 4 without lag?
Obsolete by the time it’s delivered.