Exactly, I’m in the same situation now, and the 8 GB on those cheaper cards doesn’t even let you run a 13B model. I’m trying to figure out whether I can run a 13B one on a 3060 with 12 GB.
I’m running deepseek-r1:14b on a 12 GB RX 6700. It just about fits in memory and is pretty fast.
You can. I’m running a 14B deepseek model on mine. It achieves 28 t/s.
You need a pretty large context window to fit all the reasoning tokens; Ollama defaults to 2048 unless you override it, and a larger context uses more memory.
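For reference, one way to raise it (assuming a recent Ollama; num_ctx is the relevant parameter, and 8192 is just an example value, not something from this thread): either type /set parameter num_ctx 8192 inside an interactive ollama run session, or bake it into a derived model with a small Modelfile like this:

    # Modelfile: same weights, larger default context (uses more VRAM)
    FROM deepseek-r1:14b
    PARAMETER num_ctx 8192

and then build it with ollama create deepseek-r1-14b-8k -f Modelfile.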
Oh nice, that’s faster than I imagined.
I also have a 3060. Can you detail which framework (SGLang, Ollama, etc.) you are using and how you got that speed? I’m having trouble reaching that level of performance. Thanks!
Ollama, latest version. I have it set up with Open-WebUI (though that shouldn’t matter). The 14B is around 9 GB, which easily fits in the 12 GB.
I’m repeating the 28 t/s from memory, but even if I’m wrong it’s easily above 20.
Specifically, I’m running this model: https://ollama.com/library/deepseek-r1:14b-qwen-distill-q4_K_M
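If you want to check the tokens/second yourself, here’s a minimal sketch against Ollama’s local HTTP API. It assumes the default localhost:11434 endpoint and that the tag above is already pulled; the prompt and the 8192 context value are just placeholders:

    import requests

    # Ask the local Ollama server for a completion and compute tokens/second
    # from the timing fields returned in the (non-streaming) response.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-r1:14b-qwen-distill-q4_K_M",
            "prompt": "Explain why the sky is blue in one paragraph.",
            "stream": False,
            "options": {"num_ctx": 8192},  # larger context uses more VRAM
        },
        timeout=600,
    )
    data = resp.json()

    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    tps = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"{data['eval_count']} tokens at {tps:.1f} t/s")

If I remember right, ollama run <model> --verbose also prints an eval rate line with the same number after each response.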