Self-hosting LLMs

@[email protected] · 22 days ago

chiisana · 22 days ago

I'm using Ollama to try out a couple of models for an idea. So far I've run Llama 3.2 and Qwen 2.5 3B, both of which fit in my 3050's 6 GB of VRAM. For fun I also tried Qwen 2.5 32B, which fits in my RAM (I've got 128 GB), but it only managed a couple of tokens per second, which makes it very much a non-interactive experience. I'll need to look into response times further to see whether I can still lean on larger models if I accept longer delays.
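For anyone wanting to put a number on "a couple of tokens per second", here is a rough sketch that asks a local Ollama server for one completion and works out the generation speed. It assumes Ollama is listening on its default port (11434) and uses the `eval_count` and `eval_duration` fields from a non-streaming `/api/generate` response; the `benchmark` helper and the model tag in the comment are just illustrative choices, not anything from the post above.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation speed from a non-streaming /api/generate response.

    eval_count is the number of generated tokens and
    eval_duration is the generation time in nanoseconds.
    """
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def benchmark(model: str, prompt: str) -> float:
    """Hypothetical helper: run one prompt and return tokens per second."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return tokens_per_second(json.load(resp))


# Example call (needs a running server and a pulled model):
# print(benchmark("qwen2.5:32b", "Explain quantisation in one paragraph."))
```

Running the same prompt against each model gives a like-for-like comparison, though the first call to a model also includes load time, which this figure does not cover.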

Smorty [she/her] · 16 days ago

Please try the 4-bit quantisations of the models. They run a bunch faster while eating less RAM.

Generally you want to use 7B or 8B models on the CPU, since anything larger will be hellishly sluggish.
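As a back-of-the-envelope check on why quantisation helps, the weights alone take roughly (parameter count × bits per weight ÷ 8) bytes. The sketch below just does that arithmetic; `weight_memory_gb` is a made-up helper name, and it ignores the KV cache and runtime overhead. Real 4-bit formats also tend to use a bit more than 4 bits per weight on average, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    params_billions * 1e9 weights * bits / 8 bytes, divided by 1e9,
    simplifies to params_billions * bits / 8.
    """
    return params_billions * bits_per_weight / 8


# Compare 16-bit and 4-bit weights for a few model sizes.
for name, size in [("3B", 3), ("8B", 8), ("32B", 32)]:
    fp16 = weight_memory_gb(size, 16)
    q4 = weight_memory_gb(size, 4)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

By this estimate a 32B model drops from about 64 GB to about 16 GB of weights at 4 bits, which still will not fit in 6 GB of VRAM, while an 8B model comes in around 4 GB.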