Welcome to the Llama-2 FOSAI & LLM Roundup!

(Summer 2023 Edition)

Hello everyone!

The wave of innovation I mentioned in our Llama-2 announcement is already on its way. The first tsunami of base models and configurations is being released as you read this post.

That being said, I’d like to take a moment to shout out TheBloke, who is rapidly converting many of these models for the greater good of FOSS & FOSAI.

You can support TheBloke here.

Below you will find all of the latest Llama-2 models that are FOSAI friendly. This means they are licensed for commercial use, ready to run, and open for development. I will be continuing this series exclusively for Llama models. I have a feeling it will continue being a popular choice for quite some time. I will consider giving other foundational models a similar series if they garner enough support and interest. For now, enjoy this new herd of Llamas!

All that you need to get started is capable hardware and a few moments setting up your inference platform (selected from any of your preferred software choices in the Lemmy Crash Course for Free Open-Source AI or FOSAI Nexus resource, which is also shared at the bottom of this post).

Keep reading to learn more about the exciting new models coming out of Llama-2!

8-bit System Requirements

| Model | VRAM Used | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|---|---|---|---|---|
| LLaMA-7B | 9.2 GB | 10 GB | 3060 12GB, 3080 10GB | 24 GB |
| LLaMA-13B | 16.3 GB | 20 GB | 3090, 3090 Ti, 4090 | 32 GB |
| LLaMA-30B | 36 GB | 40 GB | A6000 48GB, A100 40GB | 64 GB |
| LLaMA-65B | 74 GB | 80 GB | A100 80GB | 128 GB |

4-bit System Requirements

| Model | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|---|---|---|---|
| LLaMA-7B | 6 GB | GTX 1660, 2060, AMD 5700 XT, RTX 3050, 3060 | 6 GB |
| LLaMA-13B | 10 GB | AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, A2000 | 12 GB |
| LLaMA-30B | 20 GB | RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, Tesla V100 | 32 GB |
| LLaMA-65B | 40 GB | A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000 | 64 GB |

*System RAM (not VRAM) is used to initially load a model. You can use swap space if you do not have enough RAM to support your LLM.
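
To put the tables above into practice, here is a minimal sketch of loading a Llama-2 checkpoint at 8-bit or 4-bit precision with Hugging Face transformers and bitsandbytes. The model ID is a placeholder for whichever checkpoint you actually have access to, and exact memory use will vary with your hardware and library versions:

```python
# Minimal sketch: 8-bit / 4-bit loading with transformers + bitsandbytes
# (pip install transformers accelerate bitsandbytes).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; use whichever Llama-2 repo you have access to

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 8-bit roughly matches the first table (~10 GB total VRAM for 7B);
# swap load_in_8bit for load_in_4bit=True to target the second table instead.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",  # lets accelerate place layers across available GPU(s)/CPU
)

prompt = "Explain in one sentence what quantization does to a language model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```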


TheBloke

One of the most popular and consistent developers releasing consumer-friendly versions of LLMs. These active conversions of trending models allow many of us to run GPTQ or GGML variants at home on our own PCs and hardware (a minimal example of running a GGML file locally follows the links below).

70B

13B

7B
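
If you grab one of the GGML conversions linked above, a minimal sketch of running it locally with llama-cpp-python looks something like this; the file path is a placeholder for whatever quantized file you download, and GGML files run fine on CPU alone:

```python
# Minimal sketch: running a GGML conversion with llama-cpp-python
# (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.ggmlv3.q4_0.bin",  # placeholder local path
    n_ctx=2048,    # context window (tokens of prompt + chat history)
    n_threads=6,   # CPU threads; no GPU required for 7B/13B on most machines
)

result = llm("Q: What is a llama? A:", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"])
```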

LLongMA

LLongMA-2 is a suite of Llama-2 models trained at an 8k context length using linear positional interpolation scaling (a rough sketch of the idea follows the links below).

13B

7B

Also available from TheBloke in GPTQ and GGML formats:

7B
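
For the curious, here is a rough sketch of what linear positional interpolation does (not LLongMA-2's actual code, just the idea): the rotary-embedding positions are scaled down so that an 8k-token sequence maps back into the 4k position range the base model was pretrained on. Names, dimensions, and the 4096 → 8192 scale factor below are illustrative assumptions.

```python
# Illustrative sketch of linear positional interpolation for rotary embeddings (RoPE).
import numpy as np

def rope_angles(positions, dim=128, base=10000.0, scale=4096 / 8192):
    # scale < 1 compresses the new, longer positions into the original range
    scaled = positions * scale
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(scaled, inv_freq)  # (seq_len, dim/2) rotation angles

angles = rope_angles(np.arange(8192))      # positions for an 8k context
cos, sin = np.cos(angles), np.sin(angles)  # applied to query/key pairs as usual
```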

Puffin

The first language model released by Nous Research that is licensed for commercial use! Available as a 13B-parameter model.

13B

Also available from TheBloke in GPTQ and GGML formats:

13B

Other Models

Leaving a section here for ‘other’ LLMs or fine-tunes derived from Llama-2 models.

7B


Getting Started w/ FOSAI!

Have no idea where to begin with AI/LLMs? Try starting here with UnderstandGPT to learn the basics of LLMs before visiting our Lemmy Crash Course for Free Open-Source AI.

If you’re looking to explore more resources, see our FOSAI Nexus for a list of all the major FOSS/FOSAI in the space.

If you’re looking to jump right in, visit some of the links below and stick to models that are 13B parameters or smaller (unless you have the power and hardware to spare).

FOSAI Resources

Fediverse / FOSAI

LLM Leaderboards

LLM Search Tools

GL, HF!

If you found anything about this post interesting - consider subscribing to [email protected] where I do my best to keep you in the know about the most important updates in free open-source artificial intelligence.

I will try to continue doing this series season by season, making this a living post for the rest of this summer. If I have missed a noteworthy model, don’t hesitate to let me know in the comments so I can keep this resource up-to-date.

Thank you for reading! I hope you find what you’re looking for. Be sure to subscribe and bookmark this page if you want a quick one-stop shop for all of the new Llama-2 models that will be emerging the rest of this summer!

  • @j4k3

    You should add that the GGML versions exist and will run on a CPU without any VRAM for most of the stuff 13B and below.

    I just started playing with Stable Diffusion, then Llama-2 7B on 16 GB of VRAM, and had plenty of room to spare on a 3080 Ti. The Llama-2 7B is a nice little chat bot, but it is a dirty liar most of the time, okay like half of the time. So it is technically useless but emotionally interesting. The Llama-2 13B is much more technically useful, with around 20% hallucinations when I intentionally ask super obscure questions.

    Here’s the thing I wish I knew before: the Oobabooga web UI used for GPT-style models is compatible with Stable Diffusion. I can barely run a 7B and a smaller ~2 GB Stable Diffusion model on a 16 GB GPU and have enough space left for context tokens (chat history, basically). However, after noticing the models on Hugging Face that had GGML at the end, and the notes about them working with CPUs and/or GPUs, I started messing with them. On a 12th-gen i7 laptop I can run a 7B nearly as fast as I can on the graphics card. It’s only like 1.5 vs 2.5 seconds per token. It is slow, but no big deal. I just tried a 13B with only the CPU and it works too. It only takes like 2.5 times longer, and it is much more interesting to talk with. I just downloaded a 30B and it is not usable on just the CPU with 6 threads at 4.7 GHz. Just a simple “Hi” from me and a reply of “Hello! How can I assist you today?” took 49.89 seconds and felt much longer. It is annoying-level slow.

    If you want to be super antisocial and just talk with bots, the 7B on a CPU like mine is totally doable. If the 7B is on the CPU and a small 2-4 GB model for Stable Diffusion is running on the GPU, it is quite fun to play with.

    • @Blaed (OP)

      Hey, thanks for commenting and sharing your experience. I’ll be adding a terminology table soon, and I’ll be sure to include this there!

      I have been bouncing between GPTQ and GGML models since TheBloke first started releasing them - I have yet to come to a definitive conclusion on which I prefer the most, but if you need the extra overhead, I can see why you’d choose GGML.

      I don’t use text-gen and Stable Diffusion in tandem often, but this will be good to note when I do!