What should I use: big model-small quant or small model-no quant?

Smorty [she/her] · edit-2 2 months ago

What should I use: big model-small quant or small model-no quant?

ffhein · 2 months ago

Article is written in a bit confusing way, but you’ll most likely want to turn off Nvidia’s automatic VRAM swapping if you’re on Windows, so it doesn’t happen by accident. Partial offloading with llama.cpp is much faster AFAIK if you want to split the model between GPU and CPU, and it’s easier to find how many layers you can offload if it fails to load instead when you set it too high.

Also if you want to experiment partial offload, maybe a 12B around Q4 would be more interesting than the same 7B model with higher precision? I haven’t checked if anything new has come out the last couple of months, but Mistral Nemo is fairly good IMO, though you might need to limit context to 4k or something.