Self hosting an LLM for research

@Maroon · edit-2 7 months ago

lemmyvore · 9 months ago

Can they not get a TPU on USB, like the Coral Accelerator or something?

Terrasque · 9 months ago

It’s less about the calculations and more about memory bandwidth. To generate a token you need to read through all of the model’s weights, and that’s usually many gigabytes. So the time it takes to read the model from memory is usually longer than the compute time. GPUs have gigabytes of RAM that is many times faster than the CPU’s RAM, which is the main reason they’re faster for LLMs.

Most TPUs don’t have much RAM, especially the cheap ones.
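
To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. It assumes every generated token requires reading all the weights once and ignores compute, caching, and batching, so it only gives an upper bound. The function name and the bandwidth figures are illustrative assumptions, not measurements of any particular hardware.

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on generation speed when each token needs one full pass over the weights."""
    return bandwidth_gb_per_s / model_size_gb


# A 7-billion-parameter model at 2 bytes per weight is about 14 GB.
model_size_gb = 7e9 * 2 / 1e9

# Assumed round-number bandwidths, for illustration only.
assumed_bandwidths_gb_per_s = {
    "system RAM (assumed 50 GB/s)": 50.0,
    "GPU memory (assumed 500 GB/s)": 500.0,
}

for name, bandwidth in assumed_bandwidths_gb_per_s.items():
    bound = max_tokens_per_second(model_size_gb, bandwidth)
    print(f"{name}: at most {bound:.1f} tokens/s")
```

Whatever the exact figures are, the ratio is the point: ten times the memory bandwidth gives roughly ten times the ceiling on tokens per second, and none of it helps if the model doesn’t fit in the device’s memory in the first place.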