Apple is reportedly planning a big AI-focused M4 Mac upgrade

@[email protected] · 11 months ago

@QuadratureSurfer · 11 months ago

Depends on your work, what you’re trying to do, and how you use it.

As a developer I run my own local copy of Dolphin Mixtral 8x7B (an LLM) and it’s great for boosting my productivity. I’m not asking it to do everything all at once. Usually I just give it small snippets here and there to see if there’s a better or more efficient way to write them.

I, for one, am looking forward to hardware improvements that can help us run larger models, so news like this is very welcome.

But you are correct: a large number of companies misunderstand how to use this technology. They should really be treating it like someone at an intern level.

It’s great to give it small and simple (especially repetitive) tasks, but you’ll still need to verify everything.

grogman · 11 months ago

Hey, I might give Dolphin Mixtral a try. Do you know where I can get it?
Also, are you a web dev?

@QuadratureSurfer · 11 months ago

Well that’s a loaded question.

There are probably some websites that let you try out the model while they run it on their own equipment (or on hardware rented from Amazon, etc.). But the biggest advantage of these models is being able to run them locally if you have the hardware to handle it (a beefy GPU for quicker responses and a lot of RAM).

To quickly answer your question, you can download the model from here:
https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF
I would recommend Q5_K_M.

But you’ll also need some software to run it.

A large number of users are using “Text-Generation-WebUI” https://github.com/oobabooga/text-generation-webui
There’s also “LM Studio” https://lmstudio.ai/
And “Ollama” https://github.com/ollama/ollama
And more.

I know that LM Studio supports both NVIDIA and AMD GPUs.
Text-Generation-WebUI can support AMD GPUs as well, but it requires some additional setup to get it working.

Some things to keep in mind…
Hardware requirements:
- RAM is the biggest limiting factor for which model you can run, while your GPU/CPU will decide how quickly the LLM can respond.
- If you can fit the entire model inside your GPU’s VRAM you’ll get the most speed. In that case I would suggest using a GPTQ model instead of GGUF https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GPTQ
- Even the highest-end consumer-grade GPUs only have 24GB of VRAM right now (RTX 4090, RTX 3090, and RX 7900 XTX). And the next generation of consumer GPUs looks like it will be capped at 24GB of VRAM as well, unless AMD decides this is their way of competing with NVIDIA.
GGUF models let you compensate for VRAM limitations: as much of the model as fits is loaded into VRAM first, and whatever is left over gets loaded into system RAM.
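
To make that VRAM/RAM split a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It assumes every layer of the model is about the same size, which is only approximately true, and `split_layers` is a hypothetical helper written for this post, not part of any of the tools above. The 2 GB reserve for context and other overhead is also just a guess.

```python
def split_layers(model_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Estimate how a layered model splits between VRAM and system RAM.

    Assumes all layers are the same size (a simplification).
    Returns (layers_on_gpu, vram_used_gb, ram_used_gb).
    """
    if model_gb <= 0 or n_layers <= 0:
        raise ValueError("model_gb and n_layers must be positive")

    per_layer_gb = model_gb / n_layers
    # Hold some VRAM back for the context and other overhead (assumed amount).
    usable_vram_gb = max(vram_gb - reserve_gb, 0.0)
    # Only whole layers can be placed on the GPU.
    layers_on_gpu = min(n_layers, int(usable_vram_gb // per_layer_gb))
    vram_used_gb = layers_on_gpu * per_layer_gb
    ram_used_gb = model_gb - vram_used_gb
    return layers_on_gpu, vram_used_gb, ram_used_gb


if __name__ == "__main__":
    # Illustrative numbers only: a ~32 GB model file with 32 layers on a 24 GB card.
    print(split_layers(32.0, 32, 24.0))
```

With those made-up numbers, about 22 of the 32 layers would sit in VRAM and roughly 10 GB would spill into system RAM. Check the real file size and layer count of whatever model you download. Software that runs GGUF files typically has a setting for how many layers to offload to the GPU, and an estimate like this is only a starting point for tuning it.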

Context Length: Think of an LLM as something that only has a fixed amount of short-term memory. The bigger you set the context length, the more short-term memory you give it. The maximum you can set depends on the model you’re using, and setting it to the maximum also requires more RAM. Mixtral 8x7B models have a maximum context length of 32k tokens.
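
The short-term memory idea can be sketched in a few lines of Python: once a conversation outgrows the context length, the oldest messages have to be dropped. This is only an illustration. `trim_history` and `rough_token_count` are hypothetical helpers, and the "about four characters per token" rule is a crude assumption; a real setup would count tokens with the model's own tokenizer.

```python
from collections import deque


def rough_token_count(text):
    """Very rough token estimate: about 4 characters per token (an assumption)."""
    return len(text) // 4 + 1


def trim_history(messages, max_tokens, count_tokens=rough_token_count):
    """Keep the most recent messages that fit inside max_tokens.

    Walks backwards from the newest message and stops at the first one
    that would push the total over the budget.
    """
    kept = deque()
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.appendleft(message)  # preserve the original order
        used += cost
    return list(kept)


if __name__ == "__main__":
    history = ["first question " * 10, "a long answer " * 10, "latest question"]
    # With a tiny budget, only the newest message survives.
    print(trim_history(history, max_tokens=20))
```

A bigger context length is just a bigger `max_tokens` budget here: more of the earlier conversation stays visible to the model, at the cost of more memory.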