I’d like to self-host a large language model (LLM).

I don’t mind if I need a GPU and all that. At least it will be running on my own hardware, and it will probably even be cheaper than the $20 per month everyone is charging.

What LLMs are you self hosting? And what are you using to do it?

  • chiisana
    522 days ago

    Using Ollama to try a couple of models right now for an idea. I’ve tried running Llama 3.2 and Qwen 2.5 3b, both of which fit in my 3050’s 6 GB of VRAM. For fun, I’ve also tried Qwen 2.5 32b, which fits in my RAM (I’ve got 128 GB), but it could only generate a couple of tokens per second, making it very much a non-interactive experience. I’ll need to explore the response time piece a bit further to see whether there are ways I can lean on larger models and just accept longer delays.
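
To put a number on "a couple of tokens per second", here is a minimal sketch that queries a local Ollama server and computes generation speed from the timing fields in its reply. It assumes Ollama is listening on its default port (11434) and that the model tag `qwen2.5:3b` has already been pulled; both the tag and the prompt are placeholders, not taken from the comment above.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming generation request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def main() -> None:
    # Placeholder model tag and prompt; swap in whatever you have pulled.
    result = generate("qwen2.5:3b", "Why is the sky blue?")
    print(result["response"])
    print(f"{tokens_per_second(result):.1f} tokens/s")
```

Calling `main()` with the server running prints the answer followed by the measured speed, which makes it easy to compare the same prompt across models or quantisations.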

    • Smorty [she/her]
      116 days ago

      Please try the 4-bit quantisations of the models. They run a bunch faster while eating less RAM.

      Generally you want to stick to 7B or 8B models on the CPU, since anything larger will be hellishly sluggish.
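
The RAM savings from quantisation can be estimated with simple arithmetic: parameter count times bits per weight. The sketch below is a rough lower bound under stated assumptions. It counts only the weights, uses nominal parameter counts (3B, 8B, 32B), and treats "4-bit" as exactly 4 bits per weight, whereas real quantised files typically need somewhat more for scaling data, and the runtime needs extra memory for context on top.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    # billions of parameters * bits each / 8 bits per byte = billions of bytes
    return params_billion * bits_per_weight / 8


# Nominal model sizes, chosen for illustration only.
for params in (3, 8, 32):
    fp16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4)
    print(f"{params}B: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

Fewer bytes per weight also tends to help speed on the CPU, since token generation there is usually limited by how fast the weights can be read from memory.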