Guide to Self Hosting LLMs Faster/Better than Ollama

@brucethemoose · edited 4 months ago

@[email protected] · 4 months ago

I tried llama.cpp with llama-server and Qwen2.5 Coder 1.5B. Models with more parameters just output garbage, and I can see an OutOfMemory error in the logs. With the 1.5B model, the output just stops partway through the answer, mid-sentence or in the middle of a class. Is my hardware not performant enough, or is it something I can tweak with some parameters?

@brucethemoose · edited 4 months ago

You can only allocate so much memory to the Metal backend, and if you are on (say) an 8GB Mac, there won’t be much RAM left for the LLM itself.

But still, use a tighter quantization (like an IQ4 or IQ3 variant) of Qwen Coder 7B, and close as many background programs as you can. It should be small enough to fit.

@[email protected] · 4 months ago

I have a MacBook Pro M1 Pro with 16GB RAM. I closed a lot of things and managed to get 10GB free, but that still doesn’t seem to be enough to run the 7B model. As for the truncated answers, that seems to be a frontend issue. I tried open-webui connected to llama-server and it seems to be working great, thank you!

@brucethemoose · 4 months ago

Try reducing the context size, and make sure flash attention with a Q8/Q8 (K/V) cache is enabled with flags.
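For anyone following along, here is a minimal sketch of what those flags might look like when launching llama-server. The helper name `build_llama_server_args`, the 8192 context size, and the port are assumptions for illustration. The `-c`, `-fa`, `--cache-type-k`, and `--cache-type-v` options exist in llama.cpp, but their exact spelling can vary by build (newer builds may expect `--flash-attn on`), so check `llama-server --help` first.

```python
import shlex


def build_llama_server_args(model_path, ctx_size=8192, kv_cache_type="q8_0", port=8080):
    """Build an argv list for llama-server (hypothetical helper, not part of llama.cpp)."""
    return [
        "llama-server",
        "-m", model_path,                   # path to the GGUF file
        "-c", str(ctx_size),                # smaller context means a smaller KV cache
        "-fa",                              # flash attention; newer builds may want "--flash-attn on"
        "--cache-type-k", kv_cache_type,    # quantize the K cache
        "--cache-type-v", kv_cache_type,    # quantize the V cache
        "--port", str(port),
    ]


args = build_llama_server_args("Qwen2.5-Coder-7B-Instruct-IQ4_XS.gguf")
print(shlex.join(args))  # paste the printed command into a terminal
```

If it still runs out of memory, halving `ctx_size` is usually the first thing to try.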

I’d link a specific GGUF quantization, but huggingface seems to be down for me!

@brucethemoose · 4 months ago

Try this one at least, it should still leave plenty of RAM free: https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/blob/main/Qwen2.5-Coder-7B-Instruct-IQ4_XS.gguf

@[email protected] · 4 months ago

Indeed, this model is working on my machine. Can you explain how it differs from the one I tried before?

@brucethemoose · edited 4 months ago

It’s probably much smaller than whatever other GGUF you got, aka more tightly quantized.

Look at the file size; that’s basically how much RAM it takes.
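As a rough sketch of that rule of thumb: treat the GGUF file size as the weight footprint and leave some headroom for the context cache and buffers. The function names and the 1.5 GB headroom are assumptions for illustration, not measured values.

```python
import os


def gguf_size_gb(path):
    """Size of a GGUF file on disk in GB (roughly what the weights need in RAM)."""
    return os.path.getsize(path) / 1e9


def fits_in_ram(file_size_gb, free_ram_gb, headroom_gb=1.5):
    """Rough check: weights plus an assumed headroom for KV cache and buffers."""
    return file_size_gb + headroom_gb <= free_ram_gb


# Example values: a ~4.2 GB quant against 10 GB of free RAM
print(fits_in_ram(4.22, 10.0))
```

The headroom grows with context size, so a file that passes this check can still fail at a very long context.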

@[email protected] · 4 months ago

Well, this is what I don’t quite understand: I was trying to run the Q3_K_M, which is 3.81GB, and it was failing with an OutOfMemory error. The IQ4_XS you provided is 4.22GB and is working fine.

@brucethemoose · edited 4 months ago

Shrug. Did you grab an older Qwen GGUF? The series goes pretty far back, and it’s possible you grabbed one that doesn’t support GQA or something like that.
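To illustrate why GQA could matter here, this is a back-of-the-envelope sketch of KV cache size with and without it. The layer count, head counts, head dimension, and context length below are assumed values for a 7B-class model, not numbers read from either GGUF, and the formula ignores runtime overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_elem=2.0):
    """Approximate KV cache size: K and V each hold n_kv_heads * head_dim values per token per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_elem


GIB = 2**30

# Assumed shapes: 28 layers, head_dim 128, 32768-token context, f16 cache (2 bytes per element)
with_gqa = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, ctx_size=32768)
without_gqa = kv_cache_bytes(n_layers=28, n_kv_heads=28, head_dim=128, ctx_size=32768)

print(f"with GQA:    {with_gqa / GIB:.2f} GiB")
print(f"without GQA: {without_gqa / GIB:.2f} GiB")
```

Under these assumptions the cache is seven times larger without GQA, which would easily outweigh a few hundred MB of difference in file size.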

Doesn’t really matter though, as long as it works!
