Selfhosted LLM (ChatGPT)

@autopilot · edit-2 2 years ago


@CeeBee · 2 years ago

There’s a rough correlation between a model’s parameter count, the precision it runs at, and the memory it needs (e.g. 7B parameters at f16 precision). Using optimized 8-bit or even 4-bit execution will reduce memory usage and usually speed up execution.
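As a back-of-the-envelope sketch of that relationship, assuming memory for the weights is roughly parameter count times bytes per parameter: the function name and the precision table here are made up for illustration, and real usage will be higher once you add context/KV cache and framework overhead.

```python
# Rough estimate of the memory needed just to hold a model's weights.
# Assumption: memory ~= parameter count x bytes per parameter. Real usage is
# higher (KV cache, activations, framework overhead), so treat this as a floor.

# Hypothetical lookup table: bytes used per parameter at each precision.
BYTES_PER_PARAM = {"f32": 4.0, "f16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(n_params: float, precision: str) -> float:
    """Return the approximate weight size in GiB for the given precision."""
    return n_params * BYTES_PER_PARAM[precision] / 1024**3


if __name__ == "__main__":
    # Compare a 7B-parameter model across precisions.
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gib(7e9, precision)
        print(f"7B @ {precision}: ~{size:.1f} GiB")
```

So a 7B model at f16 lands at about 13 GiB for the weights alone, and 4-bit cuts that to roughly a quarter. Whatever the model's repo says it needs should win over this estimate.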

It’s entirely dependent on the model, the framework, and the hardware (CPU vs GPU).

Generally there should be some indication somewhere in the model’s repo that states what you need.