Welcome to the Llama-2 FOSAI & LLM Roundup!

(Summer 2023 Edition)

Hello everyone!

The wave of innovation I mentioned in our Llama-2 announcement is already on its way. The first tsunami of base models and configurations is being released as you read this post.

That being said, I’d like to take a moment to shout out TheBloke, who is rapidly converting many of these models for the greater good of FOSS & FOSAI.

You can support TheBloke here.

Below you will find all of the latest Llama-2 models that are FOSAI friendly. This means they are licensed for commercial use, ready to run, and open for development. I will be continuing this series exclusively for Llama models. I have a feeling it will continue being a popular choice for quite some time. I will consider giving other foundational models a similar series if they garner enough support and interest. For now, enjoy this new herd of Llamas!

All that you need to get started is capable hardware and a few moments setting up your inference platform (selected from any of your preferred software choices in the Lemmy Crash Course for Free Open-Source AI or FOSAI Nexus resource, which is also shared at the bottom of this post).

Keep reading to learn more about the exciting new models coming out of Llama-2!

8-bit System Requirements

| Model | VRAM Used | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|---|---|---|---|---|
| LLaMA-7B | 9.2 GB | 10 GB | 3060 12GB, 3080 10GB | 24 GB |
| LLaMA-13B | 16.3 GB | 20 GB | 3090, 3090 Ti, 4090 | 32 GB |
| LLaMA-30B | 36 GB | 40 GB | A6000 48GB, A100 40GB | 64 GB |
| LLaMA-65B | 74 GB | 80 GB | A100 80GB | 128 GB |

4-bit System Requirements

| Model | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|---|---|---|---|
| LLaMA-7B | 6 GB | GTX 1660, 2060, AMD 5700 XT, RTX 3050, 3060 | 6 GB |
| LLaMA-13B | 10 GB | AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, A2000 | 12 GB |
| LLaMA-30B | 20 GB | RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, Tesla V100 | 32 GB |
| LLaMA-65B | 40 GB | A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000 | 64 GB |

*System RAM (not VRAM) is used to initially load a model. You can use swap space if you do not have enough RAM to support your LLM.
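
To put the tables above into practice, here is a minimal sketch of loading a Llama-2 checkpoint at 8-bit or 4-bit precision with Hugging Face transformers and bitsandbytes. The model ID is a placeholder for whichever checkpoint you actually have access to, and exact memory use will vary with your hardware and library versions:

```python
# Minimal sketch: 8-bit / 4-bit loading with transformers + bitsandbytes
# (pip install transformers accelerate bitsandbytes).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; use whichever Llama-2 repo you have access to

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 8-bit roughly matches the first table (~10 GB total VRAM for 7B);
# swap load_in_8bit for load_in_4bit=True to target the second table instead.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",  # lets accelerate place layers across available GPU(s)/CPU
)

prompt = "Explain in one sentence what quantization does to a language model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```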


TheBloke

One of the most popular and consistent developers releasing consumer-friendly versions of LLMs. These active conversions of trending models allow many of us to run GPTQ or GGML variants at home on our own PCs and hardware (a minimal example of running a GGML file locally follows the links below).

70B

13B

7B
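
If you grab one of the GGML conversions linked above, a minimal sketch of running it locally with llama-cpp-python looks something like this; the file path is a placeholder for whatever quantized file you download, and GGML files run fine on CPU alone:

```python
# Minimal sketch: running a GGML conversion with llama-cpp-python
# (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.ggmlv3.q4_0.bin",  # placeholder local path
    n_ctx=2048,    # context window (tokens of prompt + chat history)
    n_threads=6,   # CPU threads; no GPU required for 7B/13B on most machines
)

result = llm("Q: What is a llama? A:", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"])
```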

LLongMA

LLongMA-2 is a suite of Llama-2 models trained at an 8k context length using linear positional interpolation scaling (a rough sketch of the idea follows the links below).

13B

7B

Also available from TheBloke in GPTQ and GGML formats:

7B
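
For the curious, here is a rough sketch of what linear positional interpolation does (not LLongMA-2's actual code, just the idea): the rotary-embedding positions are scaled down so that an 8k-token sequence maps back into the 4k position range the base model was pretrained on. Names, dimensions, and the 4096 → 8192 scale factor below are illustrative assumptions.

```python
# Illustrative sketch of linear positional interpolation for rotary embeddings (RoPE).
import numpy as np

def rope_angles(positions, dim=128, base=10000.0, scale=4096 / 8192):
    # scale < 1 compresses the new, longer positions into the original range
    scaled = positions * scale
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(scaled, inv_freq)  # (seq_len, dim/2) rotation angles

angles = rope_angles(np.arange(8192))      # positions for an 8k context
cos, sin = np.cos(angles), np.sin(angles)  # applied to query/key pairs as usual
```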

Puffin

The first language model released by Nous Research that is licensed for commercial use! Available as a 13B-parameter model.

13B

Also available from TheBloke in GPTQ and GGML formats:

13B

Other Models

Leaving a section here for ‘other’ LLMs or fine-tunes derived from Llama-2 models.

7B


Getting Started w/ FOSAI!

Have no idea where to begin with AI/LLMs? Try starting here with UnderstandGPT to learn the basics of LLMs before visiting our Lemmy Crash Course for Free Open-Source AI.

If you’re looking to explore more resources, see our FOSAI Nexus for a list of all the major FOSS/FOSAI in the space.

If you’re looking to jump right in, visit some of the links below and stick to models that are 13B parameters or smaller (unless you have the power and hardware to spare).

FOSAI Resources

Fediverse / FOSAI

LLM Leaderboards

LLM Search Tools

GL, HF!

If you found anything about this post interesting - consider subscribing to [email protected] where I do my best to keep you in the know about the most important updates in free open-source artificial intelligence.

I will try to continue doing this series season by season, making this a living post for the rest of this summer. If I have missed a noteworthy model, don’t hesitate to let me know in the comments so I can keep this resource up-to-date.

Thank you for reading! I hope you find what you’re looking for. Be sure to subscribe and bookmark this page if you want a quick one-stop shop for all of the new Llama-2 models that will be emerging the rest of this summer!

  • @j4k3

    You should add that the GGML versions exist and will run on a CPU without any VRAM for most of the stuff 13B and below.

    I just started playing with Stable Diffusion, then Llama-2 7B on 16 GB of VRAM, and had plenty of room to spare on a 3080 Ti. The Llama-2 7B is a nice little chat bot, but it is a dirty liar most of the time, okay like half of the time. So it is technically useless but emotionally interesting. The Llama-2 13B is much more technically useful, with around 20% hallucinations when I intentionally ask super obscure questions.

    Here’s the thing I wish I knew before: the Oobabooga web UI used for GPT-style models is compatible with Stable Diffusion. I can barely run a 7B and a smaller ~2 GB Stable Diffusion model on a 16 GB GPU and have enough space left for context tokens (chat history, basically). However, after noticing the models on Hugging Face that had GGML at the end, and the notes about them working with CPUs and/or GPUs, I started messing with them. On a 12th-gen i7 laptop I can run a 7B nearly as fast as I can on the graphics card. It’s only like 1.5 vs 2.5 seconds per token. It is slow, but no big deal. I just tried a 13B with only the CPU and it works too. It only takes like 2.5 times longer, and it is much more interesting to talk with. I just downloaded a 30B and it is not usable on just the CPU with 6 threads at 4.7 GHz. Just a simple “Hi” from me and a reply of “Hello! How can I assist you today?” took 49.89 seconds and felt much longer. It is annoying-level slow.

    If you want to be super antisocial and just talk with bots, the 7B on a CPU like mine is totally doable. If the 7B is on the CPU and a small 2-4 GB model for Stable Diffusion is running on the GPU, it is quite fun to play with.

    • @Blaed (OP)

      Hey, thanks for commenting and sharing your experience. I’ll be adding a terminology table soon, and I’ll be sure to include this there!

      I have been bouncing between GPTQ and GGML models since TheBloke first started releasing them - I have yet to come to a definitive conclusion on which I prefer the most, but if you need the extra overhead, I can see why you’d choose GGML.

      I don’t use text-gen and Stable Diffusion in tandem often, but this will be good to note when I do!