For about half a year I stuck with 7B models at a strong 4-bit quantisation, because I had very bad experiences with an old Qwen 0.5B model.

But recently I tried running smaller models, like llama3.2 3B with an 8-bit quant and qwen2.5-1.5B-coder at full 16-bit floating point, and those performed really well too on my 6 GB VRAM GPU (GTX 1060).

So now I am wondering: should I pull strong quants of big models, or lighter quants/raw 16-bit FP versions of smaller models?

What are your experiences with strong quants? I saw a video by that Technovangelist guy on YouTube and he said that sometimes even 2-bit quants can be perfectly fine.
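
For a rough sense of what fits in a given amount of VRAM, the weights alone take roughly parameters × bits per weight / 8 bytes, before counting the KV cache and other runtime overhead. A minimal Python sketch of that back-of-envelope estimate (the model/quant combinations are just illustrative, not benchmarks):

    # Rough size of the model weights alone: params * bits-per-weight / 8 bytes.
    # Ignores KV cache and runtime overhead, so treat these as lower bounds.
    def approx_size_gib(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

    for name, params, bpw in [
        ("7B @ ~4.5 bpw (Q4_K_M)", 7.0, 4.5),
        ("3B @ 8 bpw (Q8_0)", 3.0, 8.0),
        ("1.5B @ 16 bpw (FP16)", 1.5, 16.0),
    ]:
        print(f"{name}: ~{approx_size_gib(params, bpw):.1f} GiB")

All three land in the 2.5-4 GiB range, which is why they all fit on a 6 GB card with room left for context.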

  • ffhein
    117 hours ago

    Mixtral in particular runs great with partial offloading; I used a Q4_K_M quant while only having 12 GB of VRAM.
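
    Partial offloading here just means splitting the layers between GPU and CPU. As a minimal sketch (not my exact setup), with llama-cpp-python you choose how many layers to push to the GPU via n_gpu_layers; the GGUF path and layer count below are placeholders:

        # Partial offloading sketch with llama-cpp-python: n_gpu_layers layers go to
        # the GPU, the remaining layers run from system RAM on the CPU.
        from llama_cpp import Llama

        llm = Llama(
            model_path="./mixtral-8x7b-instruct.Q4_K_M.gguf",  # hypothetical local file
            n_gpu_layers=18,   # raise until VRAM is nearly full, lower if you hit OOM
            n_ctx=4096,
        )

        out = llm("Explain partial offloading in one sentence.", max_tokens=64)
        print(out["choices"][0]["text"])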

    To answer your original question, I think it depends on the model and use case. Complex logic such as programming seems to suffer the most from quantization, while RP/chat can take much heavier quantization and stay coherent. Most people seem to think quantization around 4-5 bpw gives the best value, and you get diminishing returns over 6 bpw, so I know few who think it's worth using 8 bpw.

    Personally I always use the largest models I can. With Q2 quantization the 70B models I've used occasionally give bad results, but often they feel smarter than a 35B at Q4. Though it's of course difficult to compare models from completely different families, e.g. command-r vs llama, and there are not that many options in the 30B range. I'd take a 35B Q4 over a 12B Q8 any day though, and a 12B Q4 over a 7B Q8, etc. In the end I think you'll have to test yourself and see which model and quant combination gives the best results at the inference speed you consider usable.

    • Smorty [she/her] (OP)
      114 hours ago

      Pulled a 7B Q4 model just now and woah, yeah, they really are a lot better. I guess the smaller models really are just for devices with less than 1 GB of RAM to spare… Like my phone, which runs Llama3.2 3B just fine…