Is it just memory bandwidth? Or is it that AMD isn’t supported well enough by PyTorch for most products? Or some combination of those?
The memory bandwidth stinks compared to a discrete GPU. That’s the reason. It’s still possible.
The question is, though, would it be better than just a CPU with lots of RAM?
Yes, it seems so according to this person’s testing: https://youtu.be/HPO7fu7Vyw4
Ultimately, it is all about data throughput to the CPU caches because tensors are so large. The M2 claims a 128-bit bus. The instruction support for ARM built into llama.cpp is weak compared to x86. If you want to run big models that require lots of memory without spending five figures, find an Intel chip that supports AVX-512 and 96+ GB of RAM. AVX-512 and its related extensions are directly supported in llama.cpp, and that gets you 512-bit instructions. Apple can’t match that.
If you want a laptop, get something with a 3080 Ti. It needs to specifically be the Ti version. This has 16 GB of VRAM and came in several 2022 models.
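For anyone who wants to put rough numbers on the bandwidth argument, here is a minimal sketch. It assumes the usual rule of thumb that generating one token reads every model weight once, so bandwidth divided by model size gives a loose upper bound on tokens per second. The function names, the LPDDR5-6400 transfer rate paired with the 128-bit bus, and the 40 GB model size are illustrative assumptions, not measurements.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, mega_transfers_per_sec: float) -> float:
    """Theoretical peak memory bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_sec * 1e6 / 1e9


def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Loose upper bound, assuming each token reads all weights exactly once."""
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Assumption: a 128-bit bus running at 6400 MT/s (LPDDR5-6400).
    bandwidth = peak_bandwidth_gb_s(128, 6400)
    # Assumption: a quantized model occupying about 40 GB of memory.
    bound = max_tokens_per_sec(bandwidth, 40.0)
    print(f"peak bandwidth: {bandwidth:.1f} GB/s")
    print(f"upper bound: {bound:.2f} tokens/s")
```

Real throughput lands below this bound, since it ignores compute limits, cache behavior, and anything else sharing the memory bus.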
Run Fedora with it. They have Nvidia support including a slick script that builds the GPU driver from source with every kernel update automatically, and keeps secure boot working all the time.
The instruction support for ARM built into llama.cpp is weak compared to x86.
I don’t know about you but my M1 Pro is a hellovalot faster than my 5800x in llama.cpp.
These CPUs benchmark similarly across a wide range of other tasks.
No consumer AMD hardware is on that list.
*No consumer Intel hardware is on that list.
The only widely available consumer hardware with AVX512 support is AMD’s Zen4 (7000 series).
I think just about the only Apple computer that supports AVX512 is the 2019 Mac Pro.
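If you are unsure whether your own chip has AVX-512, a quick check on Linux is to look for the `avx512f` (AVX-512 Foundation) flag in `/proc/cpuinfo`. This is only a sketch: the helper names are made up for illustration, it only understands the x86 Linux `flags` line, and it says nothing about which AVX-512 code paths llama.cpp actually uses on a given build.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect feature flags from /proc/cpuinfo-style text (x86 'flags' lines)."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_avx512(cpuinfo_text: str) -> bool:
    """True if the AVX-512 Foundation flag (avx512f) is listed."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print("AVX-512F supported:", has_avx512(path.read_text()))
    else:
        print("No /proc/cpuinfo here; this check only works on Linux.")
```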
I run exllama on a 24GB GPU right now and am just seeing what’s feasible for larger models. So an Intel CPU with lots of RAM would in theory outperform an AMD iGPU with the same amount of RAM allocated as VRAM? (I’m looking at APUs/iGPUs solely because you can configure the amount of VRAM allocated to them.)
I’m pretty sure it is not super relevant. The VRAM on a GPU is different from the system memory attached to a CPU. The system memory with x86 is mostly virtual bits. I haven’t played in this space in a while, so my memory is rusty. The system memory is not directly accessible by an address bus. It creates a major bottleneck when you need to access a lot of information at once. It is more of a large storage system that is made to move chunks of data that are limited in size. If you want more info, read about address buses and physical/virtual buses: https://en.m.wikipedia.org/wiki/Physical_Address_Extension
In a GPU, the goal is to move data in parallel where most of the memory is available at the same time. This doesn’t have the extra overhead of complicated memory management systems. Each small processor is directly addressing the memory it needs. With a GPU, more memory usually means more physical compute hardware.
If you ever feel motivated to build vintage computing hardware like Ben Eater’s 8-bit breadboard computer project on YouTube, or his 6502 stuff, you’ll see a lot of this first hand. In the early 8-bit era, the memory bus and address space were major design aspects, and they are much easier to understand because they are manually configured in hardware external to the processor.
As per the link (YouTube) in the other thread, it seems like iGPU + increased allocation of VRAM is better than using the CPU, though it also seems APUs max out at 16GB. Maybe something AMD can improve in the future then…
What’s the memory bandwidth on the AMD platform?
I’ve gotten LLaMA running locally using CLBlast on an AMD GPU while using the CPU simultaneously (basically the APU execution pathway).
AMD is seriously slacking when it comes to machine learning. The hardware is uber powerful, but just like everyone complains, the software isn’t there.
ROCm doesn’t even work on Windows, FFS.
You can run models on almost anything, but token generation is extremely slow. You might be waiting upwards of 5 minutes for a response at something like 0.2-0.6 tokens per second, which is abysmal when a coherent answer needs a minimum of 100 tokens.
Isn’t Windows for gaming and weird proprietary applications like Photoshop?
If you’re using llama.cpp, some ROCm stuff recently got merged in. It works pretty well, at least on my 6600. I believe there were instructions for getting it working on Windows in the pull request.
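To make the wait concrete, here is a small sketch that turns a generation rate into a response time. The function name is made up for illustration, and it only counts generation, not prompt processing or model load time.

```python
def response_time_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate num_tokens at a steady generation rate."""
    return num_tokens / tokens_per_sec


if __name__ == "__main__":
    # The 100-token reply and the 0.2-0.6 tokens/s range come from the post above.
    for rate in (0.2, 0.6):
        minutes = response_time_seconds(100, rate) / 60
        print(f"{rate} tokens/s -> {minutes:.1f} min for 100 tokens")
```

At those rates a 100-token reply takes roughly 3 to 8 minutes, which lines up with the "upwards of 5 minutes" figure.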
Thank you so much! I’ll be sure to check that out / get it updated