LLM ASICs on USB sticks?

@[email protected] · 8 months ago

LLM ASICs on USB sticks?

JackGreenEarth · 8 months ago

I only need ~4 GB of RAM/VRAM for a 7B model; my GPU only has 6 GB of VRAM anyway. 7B models are smaller than you think, or you have a very inefficient setup.

@[email protected] · 8 months ago

That’s weird, maybe I actually am doing something wrong. Could it be because I’m using GGUF models?

@[email protected] · 8 months ago

Llama 2 GGUF with 2-bit quantisation only needs ~5 GB of VRAM. 8-bit needs >9 GB. Anything in between is possible. There are even 1.5-bit and 1-bit options (not GGUF, AFAIK). Generally, fewer bits means worse results though.
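For a rough sense of how bit width scales memory, here is a back-of-the-envelope sketch. It only counts the weights, and the function name `weight_size_gb` is made up for illustration. Real usage lands higher than these figures because the runtime also needs room for the context (KV cache) and working buffers, and because real quant formats typically store a bit more than their nominal bits per weight.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at exactly `bits_per_weight` bits.
    Actual quantised files also carry scales and metadata, so they are larger.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical example: a model with 7 billion parameters at several bit widths.
for bits in (2, 4, 6, 8):
    print(f"7B at {bits}-bit: ~{weight_size_gb(7e9, bits):.2f} GB of weights")
```

So going from 8-bit down to 4-bit roughly halves the weight footprint, which is why the same 7B model can fit in very different amounts of VRAM depending on the quant.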

@[email protected] · 7 months ago

Yeah, I usually take the 6-bit quants; I didn’t know the difference was that big. That’s probably why, though. Unfortunately, almost all Llama 3 models are either 8B or 70B, so there isn’t really anything in between. I find Llama 3 models noticeably better than Llama 2 models, though; otherwise I would have tried bigger models with lower quants.