What should I use: big model-small quant or small model-no quant?

Smorty [she/her] · edit-2 4 months ago

What should I use: big model-small quant or small model-no quant?

@j4k3 · 4 months ago

I prefer a middle ground. My favorite model is still Mixtral 8x7B, specifically the flat/dolphin/maid uncensored model. Llama 3 can be better in some areas, but its alignment is garbage in many others.

Smorty [she/her] · 4 months ago

Yeaaa those models are just too large for most people… You gotta have close to 50 GB of VRAM to run an 8-bit quant, and most people don’t have even a quarter of that.
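For a rough sense of the numbers, weights-only memory is about parameter count times bits per weight, divided by 8. The sketch below assumes Mixtral 8x7B has roughly 46.7B parameters in total (the experts share the attention layers, so it is less than 8 x 7 = 56B), that an 8-bit GGUF quant stores about 8.5 bits per weight once block scales are included, and that Q4_K_M lands near 4.8 bits per weight. It ignores the KV cache and runtime overhead, and `weight_memory_gb` is a hypothetical helper, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumed total parameter count for Mixtral 8x7B, in billions.
MIXTRAL_PARAMS_B = 46.7

# Assumed effective bits per weight for each quant level (approximate).
for label, bpw in [("8-bit", 8.5), ("Q4_K_M", 4.8)]:
    size = weight_memory_gb(MIXTRAL_PARAMS_B, bpw)
    print(f"{label}: ~{size:.1f} GB for weights alone")
```

Under these assumptions the 8-bit quant needs a little under 50 GB before any context is allocated, which is why it is out of reach for most consumer GPUs.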

Also, what specifically do you mean by alignment? Are you talking about finetuning or instruction alignment?

ffhein · 4 months ago

Mixtral in particular runs great with partial offloading. I used a Q4_K_M quant with only 12 GB of VRAM.
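Partial offloading means keeping only some of the model's layers on the GPU and running the rest on the CPU. A minimal sketch of how one might guess a starting split is below. The 26 GB file size and the 2 GB reserve for context and buffers are assumptions for illustration, `layers_on_gpu` is a hypothetical helper, and it treats all layers as equally large, which is only approximately true.

```python
def layers_on_gpu(model_gb: float, n_layers: int,
                  vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in the VRAM left after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))


# Mixtral 8x7B has 32 transformer layers; the 26 GB size is an assumed value.
n = layers_on_gpu(model_gb=26.0, n_layers=32, vram_gb=12.0)
print(f"Try offloading about {n} of 32 layers to the GPU")
```

In llama.cpp-based tools this number is typically what you would pass as the GPU layer count (`--n-gpu-layers`), then adjust up or down depending on whether you run out of memory.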

To answer your original question, I think it depends on the model and use case. Complex logic such as programming seems to suffer the most from quantization, while RP/chat can take much heavier quantization while staying coherent. I think most people find that quantization around 4-5 bpw gives the best value, and you really get diminishing returns above 6 bpw, so I know few people who think it’s worth using 8 bpw.

Personally, I always use the largest models I can. With Q2 quantization the 70B models I’ve used occasionally give bad results, but often they feel smarter than 35B Q4. Though it’s ofc. difficult to compare models from completely different families, e.g. command-r vs llama, and there are not that many options in the 30B range. I’d take a 35B Q4 over a 12B Q8 any day though, and 12B Q4 over 7B Q8 etc. In the end I think you’ll have to test yourself, and see which model and quant combination you think gives the best results at the inference speed you consider usable.
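The pairs compared above are roughly equal-memory matchups, which a quick calculation makes visible. This sketch assumes nominal bit widths (Q2 = 2, Q4 = 4, Q8 = 8 bits per weight); real quant formats store somewhat more than that, so actual files are larger. `footprint_gb` is a hypothetical helper.

```python
# Nominal bits per weight; real quant formats carry extra overhead (assumption).
NOMINAL_BITS = {"Q2": 2.0, "Q4": 4.0, "Q8": 8.0}


def footprint_gb(params_billion: float, quant: str) -> float:
    """Weights-only size in GB for a model of the given size and quant level."""
    return params_billion * NOMINAL_BITS[quant] / 8


# The model and quant combinations mentioned in the comment above.
combos = [(70, "Q2"), (35, "Q4"), (12, "Q8"), (12, "Q4"), (7, "Q8")]
for params, quant in combos:
    print(f"{params}B {quant}: ~{footprint_gb(params, quant):.1f} GB")
```

By this estimate 70B Q2 and 35B Q4 take the same space, and 12B Q4 is slightly smaller than 7B Q8, so the question really is which combination gives better output for the same memory.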

Smorty [she/her] · 4 months ago

Pulled a 7B Q4 model just now and woah, yeah, they really are a lot better. I guess the smaller models really are just for devices with less than 1 GB of RAM to spare… Like ma phone, which runs Llama3.2 3B just fine…

@j4k3 · edit-2 4 months ago

deleted by creator

Smorty [she/her] · 4 months ago

Another user @[email protected] commented about there being a way to split it between GPU and CPU. Are you talking about that NVIDIA-only, Windows-only thingy, which only works with the proprietary driver? If so, I’m really not gonna use that…

Have you tried some of the abliterated models? They work really nicely even for the spiciest of topics. They literally can’t refuse your instruction, so they just go ahead and do what you want. But maybe even these models are too narrow for your specific application…

@j4k3 · 4 months ago

deleted by creator