Guide to Self Hosting LLMs Faster/Better than Ollama

@brucethemoose · edit-2 3 months ago

Guide to Self Hosting LLMs Faster/Better than Ollama

DarkThoughts · 3 months ago

I just can’t get ROCm / gpu generation to work on Bazzite, like at all. It seems completely cursed. I tried koboldcpp through a Fedora distrobox and it didn’t even show any hardware options. Tried through an Arch AUR package through distrobox and the ROCm option is there but ends with a CUDA error. lol The Vulkan option works but seems to still use the CPU more than the GPU and is consequently still kinda slow and I struggle to find a good model for my 8GB card. Fimbulvetr-10.7B-v1-Q5_K_M for example was still too slow to be practical.

Tried LM Studio directly in Bazzite and it also just uses the CPU. It also is very obtuse on how to connect to it with SillyTavern, as it asks for an API key? I managed it once in the past but I can’t remember how but it also ended up stopping generating anything after a few replies.

Krita’s diffusion also only runs on the CPU, which is abysmally slow, but I’m not sure if they expect Krita to be build directly on the system for ROCm support to work.

I’m not even trying to get SDXL or something to run at this point, since that seems to be still complicated enough even on a regular distro.

@brucethemoose · edit-2 3 months ago

I don’t like Fedora because its CUDA support is third party, and AFAIK they dont natively package ROCm. And its too complex to use through something like distrobox… I don’t want to tell you to switch OSes, but you’d have a much better time with CachyOS, which is also optimized for Steam gaming.

Alternatively you could try installing rocm images through docker, but you have to make sure GPU passthrough is working).

It also depends on your GPU. If you are on an RX 580, you can basically kiss rocm support goodbye, and might want to investigate mlc-llm’s vulkan backend.

Fimbulvetr is ancient now, your go to models are Qwen 2.5 14B at short context or llama 3.1 8B/Qwen 2.5 7B at longer context.

DarkThoughts · 3 months ago

I distrohopped so much after each previous distro eventually broke and me clearly not being smart enough to recover. I’m honestly kinda sick of it, even if the immutable nature also annoys the shit out of me.

My GPU is a 6650 XT, which should in principle work with ROCm.

Which model specifically are you recommending? Llama-3.1-8B-Lexi-Uncensored-V2-GGUF? Because the original meta-llama ones are censored to all hell and Huggingface is not particularly easy to navigate, on top of figuring out the right model size & quantization being extremely confusing.

@brucethemoose · edit-2 3 months ago

Depends what you mean by censored. I never have a problem with Qwen or llama as long as I give them the right prompt and system prompt. Its not like an API model, they have to continue whatever response you give them.

And… For what? If you are just looking for like ERP, check out drummer’s finetunes. Otherwise I tend to avoid “uncensored” finetunes as they dumb the model down a bit, but take your pick: https://huggingface.co/models?sort=modified&search=14B

But you are going to struggle if you can’t get rocm working beyond very small context, as that means no flash attention anywhere.

Also, assuming you end up using kobold.cpp-rocm instead, I would use a IQ3_M or IQ3_XS GGUF quantization of a 14B model.

DarkThoughts · 3 months ago

Well, anything remotely raunchy gets a “I cannot participate in explicit content” default reply.

I am using the rocm install of koboldcpp but as said, the ROCm option errors out with a CUDA error for some reason.

@brucethemoose · edit-2 3 months ago

Oh, and again, for raunchy, there are explicit “RP” finetunes, like: https://huggingface.co/TheDrummer

But you just need to set a good system prompt or start a reply with “Sure,” and plain qwen or llama will write out unspeakable things.

@brucethemoose · 3 months ago

What’s the error? Did you manually override your architecture as an environment variable?

https://old.reddit.com/r/ROCm/comments/18z29l6/comment/kgeuguq/

https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU?tab=readme-ov-file#additional-information--installation-tips

You are gfx1032

DarkThoughts · 3 months ago

ggml_cuda_compute_forward: ADD failed
CUDA error: shared object initialization failed
  current device: 0, in function ggml_cuda_compute_forward at ggml/src/ggml-cuda.cu:2365
  err
ggml/src/ggml-cuda.cu:107: CUDA error

I didn’t do anything past using yay to install the AUR koboldcpp-hipblas package, and customtkinter, since the UI wouldn’t work otherwise. The koboldcpp-rocm page very specifically does not mention any other steps in the Arch section and the AUR page only mentions the UI issue.

@brucethemoose · edit-2 3 months ago

mmmm I would not use the AUR version, especially on Fedora. It probably relies on a bunch of arch system packages, among other things.

Try installing the rocm fork directly, with its script: https://github.com/YellowRoseCx/koboldcpp-rocm?tab=readme-ov-file#linux

EDIT: There does seem to be a specific quirk related to Fedora.

DarkThoughts · 3 months ago

I’m not using Fedora, I’m using Bazzite, which is immutable based on SilverBlue. I use an Arch distrobox for this since I can’t really install anything directly into the system. The script is what I tried originally in a Fedora distrobox which did not work at all.