For about half a year I stuck with using 7B models and got a strong 4 bit quantisation on them, because I had very bad experiences with an old qwen 0.5B model.

But recently I tried running a smaller model like llama3.2 3B with 8bit quant and qwen2.5-1.5B-coder on full 16bit floating point quants, and those performed super good aswell on my 6GB VRAM gpu (gtx1060).

So now I am wondering: Should I pull strong quants of big models, or low quants/raw 16bit fp versions of smaller models?

What are your experiences with strong quants? I saw a video by that technovangelist guy on youtube and he said that sometimes even 2bit quants can be perfectly fine.

UPDATE: Woah I just tried llama3.1 8B Q4 on ollama again, and what a WORLD of difference to a llama3.2 3B 16fp!

The difference is super massive. The 3B and 1B llama3.2 models seem to be mostly good at summarizing text and maybe generating some JSON based on previous input. But the bigger 3.1 8B model can actually be used in a chat environment! It has a good response length (about 3 lines per message) and it doesn’t stretch out its answer. It seems like a really good model and I will now use it for more complex tasks.

  • Smorty [she/her]OP
    link
    fedilink
    English
    23 months ago

    Hmm, so what you’re saying is that for creative generations one should use big parameter models with strong quants but when good structure is required, like with coding and JSON output, we want to use a large quant of a model which actually fits into our VRAM?

    I’m currently testing JSON output, so I guess a small Qwen model it is! (they advertised good JSON generations)

    Does the difference between fp8 and fp16 influence the structure strongly, or are fp8 models fine for structured content?

      • Smorty [she/her]OP
        link
        fedilink
        English
        23 months ago

        Ollama does indeed have the ability to share the memory between VRAM and RAM, but I always assumed it wouldn’t make sense, since it would massively slow down the generation.

        I think ollama already uses GGUF, since that is how you import the model from HF to ollama anyway, you gotta use the *.GGUF file.

        As someone who has experience with shader development in glsl, I know very well that communication between the GPU and CPU is super slow, and sending data from the GPU to the CPU is a pretty heavy task. So I just assumed it wouldn’t make any sense. I will try a full 7B model (fp16) model now using my 32GB of normal RAM to check out the speed. I’ll edit this comment once I’m done and share results

          • Smorty [she/her]OP
            link
            fedilink
            English
            13 months ago

            oooh a windows only feature, now I see why I haven’t heard of this yet. Well, too bad I guess. It’s time to switch to AMD for me anyway…

            • ffhein
              link
              English
              23 months ago

              Article is written in a bit confusing way, but you’ll most likely want to turn off Nvidia’s automatic VRAM swapping if you’re on Windows, so it doesn’t happen by accident. Partial offloading with llama.cpp is much faster AFAIK if you want to split the model between GPU and CPU, and it’s easier to find how many layers you can offload if it fails to load instead when you set it too high.

              Also if you want to experiment partial offload, maybe a 12B around Q4 would be more interesting than the same 7B model with higher precision? I haven’t checked if anything new has come out the last couple of months, but Mistral Nemo is fairly good IMO, though you might need to limit context to 4k or something.