So maybe you’ve seen something like this:

Well, here are some screenshots from a YT video I found interesting on the subject (video linked at the end). In a nutshell, AVX-512 is a set of x86 SIMD instruction extensions that operate on 512-bit-wide vector registers, so a single instruction can process sixteen 32-bit values at once. This rabbit hole may help you understand why CPUs are so much slower than GPUs at this kind of work, or at least, if you already know the basics of computer architecture, it will show where the real bottleneck is located.

[screenshots from the video]

https://www.youtube.com/watch?v=bskEGP0r3hE
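
To make the 512-bit width concrete, here is a minimal sketch using the AVX-512F intrinsics from immintrin.h. The array contents are made up for the example, and it needs to be compiled with something like gcc -mavx512f on a CPU that actually supports AVX-512F. One _mm512_add_ps performs sixteen float additions in a single instruction:

```c
#include <immintrin.h>  /* AVX-512 intrinsics */
#include <stdio.h>

/* Add two float arrays 16 elements at a time using 512-bit registers.
   Assumes n is a multiple of 16 to keep the sketch short. */
static void add_f32_avx512(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);  /* load 16 floats (unaligned) */
        __m512 vb = _mm512_loadu_ps(b + i);
        __m512 vc = _mm512_add_ps(va, vb);   /* one instruction, 16 additions */
        _mm512_storeu_ps(out + i, vc);
    }
}

int main(void)
{
    float a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
    add_f32_avx512(a, b, c, 16);
    printf("c[15] = %.1f\n", c[15]);  /* 15 + 30 = 45.0 */
    return 0;
}
```

As I understand it, llama.cpp’s AVX-512 paths are this same pattern applied to dot products over quantized weight blocks rather than a plain add.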

I’m really curious how much the various AVX-512 implementations actually improve CPU inference performance. Clearly it was worth implementing a lot of these instructions in llama.cpp, so someone felt it was important. In my experience there is no replacement for running a large model; I use a Llama 2 70B daily. I can’t add more system memory beyond the 64GB I have, and I still need to look into using a swap partition for even larger models, but I haven’t tried that yet.

Looking at accessible hardware for AI, the lowest-cost path to even larger models seems to be a second-hand server/workstation with 256-512GB of system memory, as many cores as possible, and the best AVX-512 implementation available at a good price, then add a consumer GPU with 24GB of VRAM. That could still come in under $3K and on paper might run a 180B model, while being cheaper than a single enterprise GPU with 48GB of VRAM. Maybe someone here has actual experience with this and how various chipsets handle the load in practice. It’s just a curiosity I’ve been thinking about.
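
For anyone wondering where the 256-512GB ballpark for a 180B model comes from, here is a rough back-of-envelope sketch. The bits-per-weight values (FP16, INT8, and ~4.5 bits for a Q4-style quant) and the ~10% allowance for KV cache and buffers are my assumptions, not measurements:

```c
#include <stdio.h>

/* Rough weight-memory estimate for an N-billion-parameter model at a given
   bits-per-weight, plus an assumed ~10% overhead for KV cache and buffers. */
static double model_gb(double params_billion, double bits_per_weight)
{
    double weight_gb = params_billion * bits_per_weight / 8.0;  /* billions * bits/8 = GB (the 1e9s cancel) */
    return weight_gb * 1.10;                                    /* assumed overhead factor */
}

int main(void)
{
    double sizes[]  = { 70.0, 180.0 };      /* billions of parameters         */
    double quants[] = { 16.0, 8.0, 4.5 };   /* FP16, INT8, Q4-ish bits/weight */

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++)
            printf("%4.0fB at %4.1f bits/weight: ~%4.0f GB\n",
                   sizes[i], quants[j], model_gb(sizes[i], quants[j]));
    return 0;
}
```

By that math a Q4-style 180B model needs roughly 110GB, which fits in 256GB of system RAM with room to spare, while FP16 (close to 400GB) would not.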

  • @BlaedM · 1 year ago

    I have come to believe Moore’s law is finite, and we’re starting to see the end of its exponential run. This leads me to believe (or want to believe) that other breakthroughs in compute, optimization, and/or hardware are on the horizon. That, or crazy powerful GPUs are about to become a common household investment.

    I keep thinking about what George Hotz is doing in this space. He explained on the Lex Fridman podcast that there is still a lot to explore in optimization, both quantization on the software side and acceleration on the hardware side.
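
    As a toy illustration of the quantization side (my own example, not anything from the podcast): packing FP32 weights into INT8 with a shared scale cuts memory 4x while keeping the values close to the originals.

    ```c
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Symmetric per-tensor INT8 quantization: scale = max|w| / 127,
       so every weight is stored as a small integer plus one shared scale. */
    static float quantize_int8(const float *w, int8_t *q, int n)
    {
        float max_abs = 0.0f;
        for (int i = 0; i < n; i++)
            if (fabsf(w[i]) > max_abs) max_abs = fabsf(w[i]);

        float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        for (int i = 0; i < n; i++)
            q[i] = (int8_t)lrintf(w[i] / scale);
        return scale;  /* keep the scale around: w is approximately q * scale */
    }

    int main(void)
    {
        float w[4] = { 0.02f, -0.50f, 0.37f, -0.11f };
        int8_t q[4];
        float scale = quantize_int8(w, q, 4);
        for (int i = 0; i < 4; i++)
            printf("%+.2f -> %4d -> %+.4f\n", w[i], q[i], q[i] * scale);
        return 0;
    }
    ```

    llama.cpp’s quantized formats apply the same idea per small block of weights, with a scale per block and some extra tricks, but the core trade-off is the same.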

    His idea of ‘commoditize the petaflop’ is really cool. I think it’s worth bringing up here, especially since one of his biggest goals right now appears to be solving the at-home compute problem, in a way that would let you actually run something like a 180B model in-house, no problem.

    George Hotz’ tinybox

    ($15,000)

    • 738 FP16 TFLOPS
    • 144 GB GPU RAM
    • 5.76 TB/s RAM bandwidth
    • 30 GB/s model load bandwidth (big llama loads in around 4 seconds)
    • AMD EPYC CPU
    • 1600W (one 120V outlet)
    • Runs 65B FP16 LLaMA out of the box (using tinygrad, subject to software development risks)
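
    Those specs hang together if you do the quick arithmetic (assuming 2 bytes per FP16 weight):

    ```c
    #include <stdio.h>

    /* Quick check of the tinybox claims: 65B FP16 weights vs 144 GB of
       GPU RAM, and load time at the quoted 30 GB/s. */
    int main(void)
    {
        double params = 65e9;                 /* 65B parameters        */
        double gb     = params * 2.0 / 1e9;   /* FP16 = 2 bytes/weight */

        printf("weights:   ~%.0f GB (vs 144 GB GPU RAM)\n", gb);
        printf("load time: ~%.1f s at 30 GB/s\n", gb / 30.0);
        return 0;
    }
    ```

    So the ‘loads in around 4 seconds’ figure is just the weight size divided by the load bandwidth.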

    You can pre-order one now. You have $15k lying around, right? Lol.

    It’s definitely not easy (or cheap) now, but I think it’s going to get significantly easier to build and deploy large models for all kinds of personal use cases in our near and distant futures.

    If you’re serving/hosting models, it’s also worth checking out vLLM if you haven’t already: https://github.com/vllm-project/vllm

    • @j4k3 (OP) · 1 year ago

      Hardware moves notoriously slowly, so I imagine we still have several years before a good solution materializes on the market.

      Somebody needs to build a good Asimov character roleplay and coax the secret for the positronic brain out of him. I’d like to buy the new AMD R-Daneel Olivaw 5000 please. Hell, I’ll settle for a RB-34 Herbie model right now.