So you want to set up a locally hosted AI instance quickly and easily? In short, you need three things:
- A base model - I prefer MythoMax 13B, which you can find by searching for TheBloke on HuggingFace
- KoboldCpp - available via GitHub, and sometimes in your distro's software repository
- (Optional) SillyTavern
I’m putting this together as a basic outline; I will update it with more specific information over time as questions get asked or anyone contributes additional information.
Choosing a model
If you have a large amount of VRAM in your GPU, say 16GB+, you will want to choose a GPU-only model for maximum speed; these end with GPTQ. The more VRAM the better. If you don’t have that much VRAM, or you’re happy for your model to be slow, you will want to choose a GGUF model. What size model and what quantization you need varies significantly from model to model and machine to machine. With 21GB RAM and 8GB of VRAM, I can comfortably run MythoMax 13B GGUF with 0.75 RoPE (more on this later), which should give you a fairly decent starting point; if you have less VRAM/RAM, you might want a smaller model, such as OpenOrca 7B GGUF. If you have multiple screens, that will also eat into your VRAM.
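As a rough sanity check before downloading, you can estimate a quantized model's weight footprint from its parameter count. A minimal sketch; the bits-per-weight figures are my own approximations (quant formats carry overhead), so treat the actual file sizes listed on HuggingFace as the authority:

```python
# Rough sketch: estimate a quantized GGUF model's weight footprint.
# Bits-per-weight values are approximations, not official figures.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def rough_model_gb(params_billion: float, quant: str) -> float:
    """Weights only; the KV cache and runtime overhead add more on top."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

print(f"MythoMax 13B Q4_K_M: ~{rough_model_gb(13, 'Q4_K_M'):.1f} GB")  # ~7.8 GB
print(f"OpenOrca 7B Q4_K_M:  ~{rough_model_gb(7, 'Q4_K_M'):.1f} GB")   # ~4.2 GB
```

Whatever doesn't fit in VRAM spills into RAM (see the next section), so compare these numbers against your hardware.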
GPTQ vs GGUF
GGUF with KoboldCpp allows you to use your RAM in place of VRAM, though it appears to cost roughly twice as much RAM as the equivalent VRAM; your mileage may vary. GPTQ, as far as I’m aware, is limited to your VRAM and does not fall back on RAM well at all. GPTQ is, however, much, much faster.
KoboldCpp
KoboldCpp takes some time to install, but the instructions are reasonably clear. You will need some additional software to get it working properly, and I’ll update this guide as and when I’ve confirmed exactly which packages are needed. All of the software is free; it just takes some research to track down the ones you need.
You select the model you want to use when launching KoboldCpp, which is why you have to download the model yourself. KoboldCpp works best with GGUF models. It also comes with its own web interface, which is why SillyTavern (despite being highly recommended) is optional.
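If you'd rather script against KoboldCpp than use the web interface, it also exposes a KoboldAI-compatible HTTP API. A minimal sketch, assuming KoboldCpp is running on its default port 5001; the prompt and sampler values are purely illustrative:

```python
import requests  # pip install requests

# Minimal call to KoboldCpp's KoboldAI-compatible generate endpoint.
payload = {
    "prompt": "Once upon a time,",
    "max_length": 80,             # tokens to generate
    "max_context_length": 4096,   # should not exceed what you launched with
    "temperature": 0.7,           # illustrative sampler setting
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```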
Once installed, if your VRAM is large enough, you can offload all layers to your GPU; if it isn’t, only offload what the GPU can handle. This varies, so you’ll need monitoring software like RadeonTop to keep track. Aim for around 80-90% of your VRAM/GTT, which leaves your system enough headroom for other activities; maxing out your GPU can stop layers offloading properly and cause additional slowdown.
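If you want a starting guess for the layer count before you start watching RadeonTop, here is a very rough heuristic. The helper function is my own sketch, not something KoboldCpp provides, and it assumes layers are roughly equal in size:

```python
# Hypothetical helper: rough starting point for a GPU layer count,
# targeting ~85% of free VRAM as suggested above.
def suggest_gpulayers(model_file_gb: float, n_layers: int,
                      free_vram_gb: float, target: float = 0.85) -> int:
    per_layer_gb = model_file_gb / n_layers  # assumes equal-sized layers
    return min(n_layers, int(free_vram_gb * target / per_layer_gb))

# e.g. a ~7.9 GB 13B GGUF (Llama-2 13B has 40 layers) with 8 GB of VRAM:
print(suggest_gpulayers(7.9, 40, 8.0))  # -> 34; tune up or down from there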
RoPE: if you have enough RAM, you can use RoPE scaling to increase the amount of context available. It basically extends a model’s maximum context size beyond what it would normally allow. With MythoMax I set the RoPE config to 0.75 and it gives me 6144 max context (ish).
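As a rule of thumb, linear RoPE scaling stretches the context window by roughly 1/scale. This back-of-the-envelope sketch is only approximate; the usable context you actually get (like my ~6144 figure above) can differ a bit:

```python
# Linear RoPE scaling rule of thumb: usable context = base context / scale.
def extended_context(base_ctx: int, freq_scale: float) -> int:
    return int(base_ctx / freq_scale)

# MythoMax 13B is a Llama-2 model with a 4096 base context:
print(extended_context(4096, 0.75))  # ~5461 by this rule; ~6144 in practice
```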
SillyTavern
Once KoboldCpp is installed, it’s time to install SillyTavern. Again, this is a fairly simple affair; what is not necessarily simple is linking it to KoboldCpp.
Once installed, head to the plug icon and select “Text Completion”, then enter KoboldCpp’s API URL (http://localhost:5001 by default). I started with Chat Completion, which was a big mistake on my part.
Padding: if you don’t have padding set properly, the model can generate junk once it has reached the maximum context. I set my padding to 100 and it appears to work great (most of the time).
Disclaimer: I am running GNU/Linux (Manjaro) with an AMD RX 590 GPU (8GB VRAM) and ~21GB of RAM.