In case anyone isn’t familiar with llama.cpp and GGUF: basically, they let you load part of the model into regular RAM when it won’t all fit in VRAM, and then split the inference work between the CPU and GPU. It is of course significantly slower than running a model entirely on the GPU, but depending on your use case it might be acceptable if you want to run larger models locally.
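For anyone who wants to try it, here’s roughly what partial offload looks like through the llama-cpp-python bindings. This is just a sketch: the model path and layer count are placeholders, and parameter names can differ a bit between binding versions.

```python
# Partial offload sketch with llama-cpp-python (pip install llama-cpp-python).
# The path and layer count below are placeholders for illustration only.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-70b.Q4_K_S.gguf",  # placeholder GGUF file
    n_gpu_layers=40,  # layers kept in VRAM; the rest run from system RAM on the CPU
    n_ctx=4096,       # context window, which also determines KV cache size
)

out = llm("Explain GGUF offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The equivalent knob in the plain llama.cpp CLI is the -ngl / --n-gpu-layers flag.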
However, since the usual “pick the largest quantization that fits in memory” logic no longer applies, there are more choices to make when deciding which file to download. For example, I have 24GB of VRAM, so if I want to run a 70B model I could either use a Q4_K_S quant and perhaps fit 40 of its 80 layers in VRAM, or a Q3_K_S quant and maybe fit 60 layers instead, but how does that trade-off affect speed and text quality? Then there are of course the IQ quants, which are supposedly higher quality than a similarly sized Q quant, but possibly a little slower.
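To give a sense of the trade-off, this is the back-of-envelope arithmetic I’m doing in my head. Every number here (file sizes, the reserved overhead) is a rough assumption, not a measurement.

```python
# Rough estimate of how many layers of a GGUF fit in VRAM for a given quant.
# All sizes are illustrative assumptions, not measured values.
def layers_that_fit(file_size_gb: float, n_layers: int, vram_gb: float,
                    reserved_gb: float = 2.0) -> int:
    """reserved_gb approximates KV cache + compute buffers kept on the GPU."""
    per_layer_gb = file_size_gb / n_layers
    return min(n_layers, int((vram_gb - reserved_gb) / per_layer_gb))

# 70B model, 80 layers: Q4_K_S is roughly ~40 GB on disk, Q3_K_S roughly ~31 GB
print(layers_that_fit(40.0, 80, 24.0))  # -> ~44 layers for Q4_K_S
print(layers_that_fit(31.0, 80, 24.0))  # -> ~56 layers for Q3_K_S
```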
In addition to the quantization choice, there are extra flags that affect memory usage. For example, I could opt not to offload the KQV cache, which would slow down inference, but perhaps it’s a net gain if I can offload more model layers instead? I could also save some RAM/VRAM by using a quantized KV cache, probably with some quality loss, but then I could spend the savings on a larger quant, which might offset it.
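For reference, here is where those knobs live in llama-cpp-python as far as I understand the bindings (the offload_kqv / type_k parameter names and the GGML_TYPE_Q8_0 constant are assumptions that may differ between versions); in the llama.cpp CLI they map to --no-kv-offload and --cache-type-k / --cache-type-v.

```python
# Sketch of the memory-related knobs in llama-cpp-python; parameter names here
# are assumptions and may vary between binding versions.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-70b.Q3_K_S.gguf",  # placeholder GGUF file
    n_gpu_layers=60,
    n_ctx=4096,
    offload_kqv=False,                # keep the KV cache in system RAM, freeing VRAM for more layers
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # 8-bit K cache instead of f16 (the V cache can be
                                      # quantized too via type_v, but in recent builds that
                                      # usually requires flash attention to be enabled)
)
```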
Was just wondering if someone has already done experiments/benchmarks in this area; I didn’t find any direct comparisons via search engines. I’m planning to do some benchmarks myself, but I’m not sure when I’ll have time.
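In case it helps anyone who wants to run the same comparison, this is the kind of quick-and-dirty tokens/sec measurement I had in mind (paths and settings are placeholders; llama.cpp also ships a proper llama-bench tool, which is the more rigorous way to do this):

```python
# Quick-and-dirty tokens/sec comparison between two offload configurations.
# Paths and settings are placeholders for illustration only.
import time
from llama_cpp import Llama

def bench(model_path: str, **kwargs) -> float:
    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False, **kwargs)
    start = time.perf_counter()
    out = llm("Write a short story about a robot learning to paint.", max_tokens=128)
    elapsed = time.perf_counter() - start
    return out["usage"]["completion_tokens"] / elapsed  # generated tokens per second

# Bigger quant with fewer GPU layers vs. smaller quant with more GPU layers
print(bench("models/llama-70b.Q4_K_S.gguf", n_gpu_layers=40))
print(bench("models/llama-70b.Q3_K_S.gguf", n_gpu_layers=60))
```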
This site comes up with some suggestions along those lines as part of its results: https://canirunthisllm.com