• @j4k3
    24 days ago
    Chips and Cheese is primarily where I read about it. Basically, as I understand it, super fast single-thread CPU speeds are the reason the cache bus width has remained small, and also must remain small. Parallel data has too many RLC (resistive/inductive/capacitive) tradeoffs. CPU clocks are well into radio frequencies, where any stray capacitance or inductance is unacceptable, and in many cases signals no longer need a real physical connection because capacitive coupling can serve as the connection.

    The primary bottleneck with AI workloads is getting the tensor math into and out of the ALU (arithmetic logic unit). x86 already has an instruction set extension for loading large tensors: AVX. There are several versions of AVX; generally AVX-512 and newer are what you really want, since AVX-512 loads a 512-bit-wide vector in a single instruction. Unfortunately, most of the newer and more advanced AVX instructions only exist in the full P-cores of Intel hardware. They are hidden on consumer parts because the E-cores lack them, and the CPU scheduler of a desktop operating system would need a rewrite to handle that kind of asymmetrical threading before the instructions could be exposed.

    Speculatively, these AVX instructions are likely at the heart of Intel's actual problems from 13th gen to present. 12th gen started with the P-cores' advanced AVX instructions simply missing from the manifest of instructions in the microcode. A motherboard manufacturer discovered this and added a provision to load the server P-core AVX microcode while turning off the E-cores to avoid the threading issues, which enabled running the much more efficient instructions. Intel couldn't stomach that kind of creative licence, a real-world performance improvement they weren't exploiting directly, so midway through 12th-gen production they started trying to fuse off the AVX instructions. Then came the problems. The instructions were never designed to be fused off; they went backwards, found a vulnerable spot, and hacked in a way to fuse them off… but I digress…
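
    To make the AVX-512 point concrete, here is a minimal sketch of a dot product using 512-bit loads and fused multiply-adds. Assumptions on my part: an x86-64 CPU with AVX-512F, a compiler flag like -mavx512f, and n being a multiple of 16 so the tail handling stays out of the example; dot_avx512 is just an illustrative name.

    ```c
    #include <immintrin.h>
    #include <stddef.h>

    /* Each iteration loads 16 floats (512 bits) per array with one instruction,
     * then performs 16 multiply-adds at once with a single FMA. */
    float dot_avx512(const float *a, const float *b, size_t n) {
        __m512 acc = _mm512_setzero_ps();
        for (size_t i = 0; i < n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);   /* 64-byte unaligned load */
            __m512 vb = _mm512_loadu_ps(b + i);
            acc = _mm512_fmadd_ps(va, vb, acc);   /* acc += va * vb, 16 lanes */
        }
        return _mm512_reduce_add_ps(acc);         /* horizontal sum of the 16 lanes */
    }
    ```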

    Even with advanced AVX, you're still limited by how many tensors can get in and out of the cache. The big choke point is between L2 and L1; L3 is further away and shared by all cores. The CPU architecture is simply designed to optimize for serial instruction throughput, largely because the average consumer shops on a "bigger CPU number is more betterer" basis.
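
    A crude way to see that choke point yourself is a working-set sweep. This is a minimal sketch, assuming Linux with clock_gettime and nothing fancier than gcc -O2; the reported GB/s steps down as the buffer spills out of L1, then L2, then L3, and finally into DRAM.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        for (size_t kb = 16; kb <= 32 * 1024; kb *= 2) {   /* 16 KiB .. 32 MiB */
            size_t n = kb * 1024 / sizeof(long);
            long *buf = malloc(n * sizeof(long));
            for (size_t i = 0; i < n; i++) buf[i] = i;

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            long sum = 0;
            for (int pass = 0; pass < 64; pass++)          /* stream the buffer repeatedly */
                for (size_t i = 0; i < n; i++) sum += buf[i];
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            double gb   = 64.0 * n * sizeof(long) / 1e9;   /* total bytes streamed */
            printf("%6zu KiB: %6.1f GB/s (sum %ld)\n", kb, gb / secs, sum);
            free(buf);
        }
        return 0;
    }
    ```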

    AI can run MUCH faster if the layers and block sizes are matched to the architectural limits of the hardware they run on. If the working set is even slightly larger than what the caches can hold, you may be loading nearly all of it, flushing, and reloading in ways that greatly increase the inference run time. The math and the data movement sit right at, or beyond, the total capability of the system design.
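
    This is the reason kernels get blocked/tiled. A minimal sketch of the idea, assuming square N x N row-major float matrices, C zeroed by the caller, and an illustrative tile size BS you would tune to the actual L1/L2 sizes; no intrinsics, just the reuse pattern that avoids the flush-and-reload churn:

    ```c
    #include <stddef.h>

    #define BS 64  /* illustrative tile size; tune so three BS x BS tiles fit in cache */

    /* C += A * B, computed tile by tile so each tile of A and B is reused
     * many times while it is still resident in L1/L2. C must be zeroed first. */
    void matmul_blocked(const float *A, const float *B, float *C, size_t N) {
        for (size_t ii = 0; ii < N; ii += BS)
            for (size_t kk = 0; kk < N; kk += BS)
                for (size_t jj = 0; jj < N; jj += BS)
                    for (size_t i = ii; i < ii + BS && i < N; i++)
                        for (size_t k = kk; k < kk + BS && k < N; k++) {
                            float a = A[i * N + k];
                            for (size_t j = jj; j < jj + BS && j < N; j++)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }
    ```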

    AI workloads can also cause adjacent cores to throttle back, because the CPU was never designed for a situation where every core carries the same type of load and every load runs at maximum throughput. The only hardware really made for this is certain servers. From what I have seen, the server/workstation hardware new enough to have the more advanced AVX instructions supported by llama.cpp is still too expensive to justify finding out how it compares with a GPU/CPU hybrid.

    I don’t think the GPU or the present CPU has a future. Within the next 8 years compute will go in a whole new direction, because data centers running two types of processor for different workloads is untenable nonsense. Anyone who can make a good-enough architecture capable of all workloads will own the future. That will mean accepting lower single-thread speeds in exchange for far more parallelism. For AI, processor speed is not as relevant as throughput.

    So I’m looking for how someone adds more processors, or enables AVX-like instructions with a plan, or how they Frankenstein together a brute-force approach with older hardware.

    Another key aspect is that the GPU does not have the same type of memory management as the much older CPU architecture. GPU memory is tied directly to the computational hardware. System memory, on the other hand, is accessed slowly and works like a sliding window: the CPU can only touch a small slice of total system memory at any given moment, whereas all of the GPU memory is effectively available at once. So when someone says "unified memory," the memory management architecture becomes a curious and dubious space to watch.

    I’m no expert. I am curious. This is the best I understand it in the abstract. I am likely incorrect in parts and pieces, but I’m always eager to learn.

    • @[email protected]
      34 days ago

      You make a lot of good points here, but I think you are slightly off on a couple of key points.

      These are ARM, not x64, so they use SVE2, which can technically scale to 2048-bit vectors rather than AVX’s 512. Did they scale it that far? I’m unsure; existing Grace products use 4x128-bit units, so possibly not.
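
      For what it’s worth, SVE/SVE2 code is written vector-length agnostic, so the same binary scales from a 128-bit implementation up to a 2048-bit one. A minimal sketch, assuming a compiler with SVE support (e.g. -march=armv8-a+sve2); the hardware’s vector width is only discovered at run time via svcntw(), and vec_add_sve is just an illustrative name:

      ```c
      #include <arm_sve.h>
      #include <stddef.h>
      #include <stdint.h>

      /* dst[i] = a[i] + b[i]; the loop step is whatever the hardware's vector
       * width happens to be, from 4 floats (128-bit) up to 64 floats (2048-bit). */
      void vec_add_sve(float *dst, const float *a, const float *b, size_t n) {
          for (size_t i = 0; i < n; i += svcntw()) {                 /* floats per vector */
              svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n); /* predicate masks the tail */
              svfloat32_t va = svld1_f32(pg, a + i);
              svfloat32_t vb = svld1_f32(pg, b + i);
              svst1_f32(pg, dst + i, svadd_f32_m(pg, va, vb));
          }
      }
      ```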

      Second, this isn’t meant to be a performant device; it is meant to be a capable device. You can’t easily just build a computer that can handle the compute complexity this device is able to take on for local AI iteration. You wouldn’t deploy with this as the backend; it’s a dev box.

      Third, the CXL and CHI specs have coverage for memory scoped outside the bounds of the host cache width. That memory might not be directly accessible to the CPU, but there are a few ways they could optimize around that. The fact that this is an all-in-a-box custom solution means they can hack in workarounds to execute the complex workloads.

      I’d want to see how this performs versus an i9 + 5090 workstation, but even that already goes beyond this device’s price point. Currently a 4090 can handle ~20b params, which is an order of magnitude smaller than what this can handle.