Do I need industry-grade GPUs, or can I scrape by and get decent tps with a consumer-level GPU?

  • hendrik
    15 days ago

    I’d say you’re looking for something like an 80GB VRAM GPU. That’d be industry grade (an Nvidia A100, for example).

    And to squeeze it into 80GB the model would need to be quantized to 4 or 5 bits. There are some LLM VRAM calculators available where you can put in your numbers, like this one.

    Another option would be to rent these things by the hour in some datacenter (at about $2 to $3 per hour). Or do inference on a CPU with a wide memory interface. Like an Apple M3 processor or an AMD Epyc. But these are pricey, too. And you’d need to buy them alongside an equal amount of (fast) RAM.

  • @[email protected]
    15 days ago

    If you’re running a consumer level GPU, you’ll be operating with 24GB of VRAM max (RTX 4090, RTX 3090, or Radeon 7900XTX).

    90B model = 90GB at 8-bit quantization (plus some extra based on your context size and general overhead, but as a ballpark estimate, just going by the model size is good enough). You would need to drop down to 2-bit quantization to have any hope of fitting it in a single consumer GPU. At that point you’d probably be better off using a smaller model with less aggressive quantization, like a 32B model at 4-bit quantization.
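    The ballpark above is just parameters times bits per weight, divided by 8 to get bytes. A minimal sketch of that arithmetic, assuming weights only (no context or runtime overhead) and 1 GB = 1e9 bytes; the function name and the 24 GB constant are illustrative, not taken from any tool:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Largest consumer cards mentioned above (RTX 4090, RTX 3090, Radeon 7900XTX).
CONSUMER_VRAM_GB = 24

for params, bits in [(90, 8), (90, 4), (90, 2), (32, 4)]:
    size = weight_memory_gb(params, bits)
    verdict = "fits" if size <= CONSUMER_VRAM_GB else "does not fit"
    print(f"{params}B at {bits}-bit: ~{size:.1f} GB of weights, "
          f"{verdict} in {CONSUMER_VRAM_GB} GB")
```

    Context and overhead come on top of these numbers, so a result close to 24 GB is tighter in practice than it looks here.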

    So forget about consumer GPUs for that size of model. Instead, you can look at systems with integrated memory, like a Mac with 96-128GB of memory, or something similar. HP has announced a mini PC that might be good, and Nvidia has announced a dedicated AI box as well. Neither of those are available for purchase yet, though.

    You could also consider using multiple consumer GPUs. You might be able to get multiple RTX 3090s for cheaper than a Mac with the same amount of memory. But then you’ll be using several times more power to run it, so keep that in mind.

    • Possibly linux
      14 days ago

      You can use system RAM when the GPU memory fills up. Alternatively, you can run multiple GPUs.

  • @Sylovik
    15 days ago

    For LLMs, you should look at AirLLM. I don’t think there are any convenient integrations with local chat tools yet, but an issue has already been opened at Ollama.

  • @tpWinthropeIII
    15 days ago

    The new $3000 Nvidia Digits reportedly has 128 GB of fast RAM in an Apple-M4-like unified-memory configuration. Nvidia claims it is twice as fast as an Apple stack, at least at inference. Four of these stacked can run a 405B model, again according to Nvidia.

    In my case I want the graphics power of a GPU and VRAM for other purposes as well. So I’d rather buy a graphics card. But regarding a 90B model, I do wonder if it is possible with two A6000 at 64 GB and a 3-bit quant.

    • Huh, so basically sidestepping the GPU issue entirely and essentially just using some other special piece of silicon with fast (but conventional) RAM. I still don’t understand why you can’t distribute a large LLM over many different processors, each holding a section of the parameters in memory.

      • @tpWinthropeIII
        15 days ago

        Not exactly. Digits still uses a Blackwell GPU, only it uses unified RAM as virtual VRAM instead of actual VRAM. The GPU is probably a downclocked Blackwell. Speculation I’ve seen is that these are defective and repurposed Blackwells; good for us. By defective I mean they can’t run at full speed or are projected to have the cracking die problem, etc.

      • @breakingcups
        13 days ago

        I still don’t understand why you can’t distribute a large LLM over many different processors, each holding a section of the parameters in memory.

        Because each weight in a layer influences each weight in the next layer, which means the bandwidth requirements are enormous and regular networking solutions are insufficient for that.

  • @[email protected]
    15 days ago

    All you need is two 3090s, a good CPU, and a lot of RAM. Then you should be able to get around 10 tps.

  • @[email protected]
    15 days ago

    The biggest issue will be your VRAM. If you don’t have enough of it (which is very likely; even the 8B models I use need ~10GB), you’ll have to use a GGUF model, which offloads the parts that don’t fit in VRAM to your system RAM and CPU. That will slow it down heavily.
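    To get a feel for how much of a model ends up outside VRAM, here is a rough sketch of the split. It assumes every layer is the same size and reserves a fixed amount of VRAM for context and overhead; the function name, the 2 GB reserve, and the 80-layer, 45 GB example model are all made-up illustration values, not measurements:

```python
def split_layers(n_layers: int, model_gb: float, vram_gb: float,
                 reserve_gb: float = 2.0) -> tuple[int, int]:
    """Return (layers_on_gpu, layers_on_cpu), assuming equally sized layers."""
    per_layer_gb = model_gb / n_layers
    # VRAM left for weights after setting aside room for context and overhead.
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(n_layers, int(usable_gb // per_layer_gb))
    return on_gpu, n_layers - on_gpu


# Hypothetical example: a 45 GB quantized model with 80 layers on a 24 GB card.
gpu_layers, cpu_layers = split_layers(n_layers=80, model_gb=45.0, vram_gb=24.0)
print(f"{gpu_layers} layers on the GPU, "
      f"{cpu_layers} layers left to system RAM and CPU")
```

    In this made-up case roughly half the layers run from system RAM, which is where the slowdown comes from. Runtimes such as llama.cpp let you choose how many layers go to the GPU, so an estimate like this is only a starting point for that setting.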

  • ffhein
    15 days ago

    You have to specify which quantization you find acceptable, and which context size you require. I think the most affordable option to run large models locally is still getting multiple RTX 3090 cards, and I guess you probably need 3 or 4 of those depending on quantization and context.
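    As a rough way to turn "depending on quantization and context" into a card count, the sketch below adds an allowance for context to the weight size and rounds up to whole 24 GB cards. The 8 GB context allowance is a placeholder assumption, not a measured value, and the function name is illustrative; real usage depends on the model architecture and the runtime:

```python
import math


def cards_needed(params_billion: float, bits_per_weight: float,
                 context_overhead_gb: float,
                 vram_per_card_gb: float = 24.0) -> int:
    """Whole cards needed for the weights plus a context/overhead allowance."""
    weights_gb = params_billion * bits_per_weight / 8
    return math.ceil((weights_gb + context_overhead_gb) / vram_per_card_gb)


# 90B model on 24 GB cards, with an assumed 8 GB set aside for context.
for bits in (4, 5, 6, 8):
    n = cards_needed(90, bits, context_overhead_gb=8.0)
    print(f"{bits}-bit: {n} x 24 GB cards")
```

    This ignores any per-card overhead and assumes the model splits evenly across cards, so treat the result as a lower bound.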