Maybe it was just me, but in case others have done the same this post might help someone else too.
I have a workstation with plenty of CPU and system RAM, but I’m “GPU poor” in that I only have a 5060 Ti with 16GB of VRAM. I also need the GPU for regular desktop use, which leaves around 14GB of VRAM for the LLM.
I’m exclusively using this setup for development and system management tasks, and I’ve found Qwen 3.6 35B A3B to excel compared to other models. I don’t have the VRAM to run the 27B dense model, so I’ve spent time getting the most out of the MoE.
Or so I thought. Since “everyone” says to use Unsloth UD-Q4_K_XL, that’s the quant I’ve been using, and I’ve gone back and forth on MTP vs. no MTP, a larger ubatch size, and loading the mmproj, since I’ve also started using a browser MCP.
Today I took another look at their quant chart and thought that since it’s MoE maybe I could run Q5_K_S which would be a step up?
Well. Now I’m using Q6_K, because it turns out I can run it with the exact same settings I had tuned for Q4_K_XL. That means there are no drawbacks - just a better-performing model. I’ve already noticed it can get itself out of loops, where before I sometimes had to interrupt it.
This is my setup. I get >1000 t/s prefill and >20 t/s inference. I’m not chasing faster inference since I actively read the thought process when working with the LLM - but I’ve increased ub to get faster prefill, since that’s just waiting time otherwise.
./llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K \
-c 160000 \
-n 32768 \
-fa on \
-ub 2048 \
-ctk q8_0 \
-ctv q8_0 \
--no-mmap \
--mlock \
--no-warmup \
--chat-template-kwargs '{"preserve_thinking": true}' \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--host 0.0.0.0
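To put the prefill number in perspective, here is a rough sketch of how long a prompt takes to process at a given prefill speed. It is only back-of-the-envelope arithmetic: the helper name is made up, the 250 t/s comparison value is hypothetical, and real prefill speed usually changes with how full the context is.

```shell
#!/usr/bin/env bash
# Rough prefill wait estimate: prompt tokens divided by prefill speed.
# prefill_wait_seconds is a made-up helper name, not part of llama.cpp.
prefill_wait_seconds() {
  local tokens=$1   # prompt tokens to process
  local tps=$2      # prefill speed in tokens per second
  echo $(( tokens / tps ))   # integer division, rounds down
}

# The full 160000-token context at the 1000 t/s lower bound quoted above
prefill_wait_seconds 160000 1000

# The same prompt at a hypothetical 250 t/s, for comparison
prefill_wait_seconds 160000 250
```

At the speeds I see, even a completely cold full context is a few minutes of waiting at worst, which is why I spend the tuning effort on ub rather than on generation speed.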
I also use Opencode with the DCP and Superpowers plugins, which make a tremendous difference to both context handling and planning. I have no need for a larger context - I even compact early quite often, since the tasks get done before reaching the limit.



Ryzen 5950X with a slight underclock
2x32+2x16 DDR4 2666
Radeon 7900XTX 24GB with a 250W power limit
15tps is entirely usable as an agent, but I can’t go above 131k ctx and must set --parallel 1 to fit in available RAM+VRAM. It actually still OOMs periodically, as even with nothing else running it only has a couple GB to work with. If that happens with context mostly filled then you’re looking at ten minutes of prompt processing before next token.
I’m not about to buy another 64GB pair to replace the 16s, but… I wish I had done that years ago.
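For a sense of scale, that ten-minute wait can be turned into an approximate prompt-processing rate. This is only a rough sketch: it assumes "131k" means 131072 tokens, that the whole context has to be reprocessed after the crash, and the helper name is made up.

```shell
#!/usr/bin/env bash
# implied_prefill_tps is a hypothetical helper: tokens divided by wait time.
implied_prefill_tps() {
  local tokens=$1    # tokens that need reprocessing
  local seconds=$2   # observed wait in seconds
  echo $(( tokens / seconds ))   # integer division, rounds down
}

# Assumes "131k" = 131072 tokens and "ten minutes" = 600 seconds
implied_prefill_tps 131072 600
```

Under those assumptions the result lands in the low hundreds of tokens per second, so an OOM near a full context costs far more than one at the start of a session.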