Don't skimp on the quant when using MoE

troed@fedia.io · 8 days ago

brockhold · 4 days ago

Ryzen 5950X with a slight underclock
2x32 GB + 2x16 GB DDR4-2666 (96 GB total)
Radeon 7900 XTX 24 GB with a 250 W power limit
15 tokens/s is entirely usable as an agent, but I can’t go above 131k ctx and must set --parallel 1 to fit in available RAM+VRAM. It still OOMs periodically, since even with nothing else running there are only a couple of GB of headroom. If that happens with the context mostly filled, you’re looking at ten minutes of prompt processing before the next token.
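
For a rough sense of scale, here is a back-of-the-envelope sketch using only the figures quoted above. It assumes the context is close to the full 131k tokens when the reprocess happens ("mostly filled" means the real count is somewhat lower) and treats "ten minutes" as exactly 600 seconds. The function names are illustrative, and nothing here is a measurement.

```python
# Figures quoted in the comment above; all values are rough.
RAM_GB = 2 * 32 + 2 * 16      # 2x32 GB + 2x16 GB system RAM
VRAM_GB = 24                  # Radeon 7900 XTX
CTX_TOKENS = 131_000          # "131k ctx", assumed nearly full (upper bound)
REPROCESS_SECONDS = 10 * 60   # "ten minutes of prompt processing"


def total_memory_gb(ram_gb: int, vram_gb: int) -> int:
    """Combined RAM + VRAM pool the model and its context must fit into."""
    return ram_gb + vram_gb


def implied_prompt_tps(ctx_tokens: int, seconds: int) -> float:
    """Prompt-processing rate implied by reprocessing the context in `seconds`."""
    return ctx_tokens / seconds


if __name__ == "__main__":
    print(f"combined memory pool: {total_memory_gb(RAM_GB, VRAM_GB)} GB")
    rate = implied_prompt_tps(CTX_TOKENS, REPROCESS_SECONDS)
    print(f"implied prompt processing: ~{rate:.0f} tokens/s")
```

Under those assumptions the whole setup has about 120 GB to work with, and a full reprocess after an OOM runs at a couple of hundred tokens per second at best.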

I’m not about to buy another 64GB pair to replace the 16s, but… I wish I had done that years ago.

Qwen3.6 - How to Run Locally | Unsloth Documentation