• Scott
    29 months ago

    It’s not about their frontend. They are running custom LPUs that can process LLM tokens at 500 per second, which is insanely impressive.

    For reference, with a max size of 2k tokens, my dual Xeon Silver 4114 processors take 2-3 minutes.
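
To put those two figures side by side, here is a rough back-of-the-envelope sketch. It assumes "2k tokens" means 2,000 tokens and that the 2-3 minutes covers processing all of them; neither is stated exactly in the comment, so treat the results as ballpark only.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average throughput for a run of `tokens` tokens taking `seconds`."""
    return tokens / seconds


def seconds_for(tokens: int, rate: float) -> float:
    """Time needed to process `tokens` tokens at `rate` tokens per second."""
    return tokens / rate


TOKENS = 2000      # assumption: "2k tokens" read as 2,000
LPU_RATE = 500     # tokens/sec, the figure quoted in the comment

# CPU throughput implied by the 2-3 minute range (120 s and 180 s).
cpu_fast = tokens_per_second(TOKENS, 120)
cpu_slow = tokens_per_second(TOKENS, 180)

# Time the same 2,000 tokens would take at the quoted LPU rate.
lpu_seconds = seconds_for(TOKENS, LPU_RATE)

# Speedup range of the LPU over the CPU setup.
speedup_low = 120 / lpu_seconds
speedup_high = 180 / lpu_seconds

print(f"CPU: {cpu_slow:.1f} to {cpu_fast:.1f} tokens/sec")
print(f"LPU: {lpu_seconds:.0f} s for {TOKENS} tokens")
print(f"Speedup: {speedup_low:.0f}x to {speedup_high:.0f}x")
```

Under those assumptions the dual-Xeon setup works out to roughly 11 to 17 tokens/sec, versus about 4 seconds for the whole run at 500 tokens/sec.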

    • @Finadil
      19 months ago

      Is that with an fp16 model? Don’t be scared to try even a 4-bit quantization; you’d be surprised at how little is lost and how much quicker it is.

    • @[email protected]
      19 months ago

      Aren’t those the ones that cost $2,000 per 250 MB of memory? Meaning you’d need about 350 of them to load any half-decent model.

      • Scott
        29 months ago

        Not sure how they are doing it, but it was actually $20k, not $2k, for 250 MB of memory on the card. I suspect the models are probably cached in system memory.
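
For a sense of scale, the card count can be estimated from the figures in this thread (250 MB and $20k per card). The 70B-parameter model below is purely a hypothetical example, and the estimate counts weights only, ignoring activations, KV cache, and any overhead, so it is a rough sketch rather than a real sizing.

```python
import math

CARD_MEMORY_MB = 250      # per-card memory, as quoted in the thread
CARD_PRICE_USD = 20_000   # per-card price, as quoted in the thread


def model_size_mb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only model size in decimal megabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e6


def cards_needed(size_mb: float) -> int:
    """Number of cards required to hold `size_mb` of weights on-card."""
    return math.ceil(size_mb / CARD_MEMORY_MB)


# Hypothetical 70B-parameter model at fp16 and at 4-bit quantization.
for bits in (16, 4):
    size = model_size_mb(70, bits)
    n = cards_needed(size)
    cost = n * CARD_PRICE_USD
    print(f"{bits}-bit: {size / 1000:.0f} GB -> {n} cards, ${cost:,}")

# The "about 350 cards" guess above corresponds to this much total memory.
total_gb_350_cards = 350 * CARD_MEMORY_MB / 1000
print(f"350 cards hold {total_gb_350_cards} GB")
```

With these numbers, 350 cards hold 87.5 GB in total, and quantizing the hypothetical model from fp16 to 4-bit cuts the card count from 560 to 140.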