- cross-posted to:
- [email protected]
Hi, I’ve been playing with some AI models on my machine in the GPT4All software, and it has this thing called “LocalDocs”.
It looks like it’s just RAG for the AI, and a simple structured text document is more than enough for casual usage. A document like this was enough for Llama 3.x to be aware of dates and the current state of my tasks:
```json
{
  "Today": "12-Dec-2024",
  "Tomorrow": "13-Dec-2024",
  "Deadline": "23-Dec-2024",
  "Tasks Left": [
    {"Task Name": "Get groceries"},
    {"Task Name": "Buy presents"}
  ]
}
```
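Since the dates in a static file go stale, one could regenerate the document on a schedule. Here’s a minimal Python sketch; the file path and field names just mirror the example above, they aren’t anything GPT4All requires:

```python
# Hypothetical helper: rewrite the LocalDocs JSON each day so the model
# always sees the current date. Field names match the example document.
import json
from datetime import date, timedelta

today = date.today()
doc = {
    "Today": today.strftime("%d-%b-%Y"),
    "Tomorrow": (today + timedelta(days=1)).strftime("%d-%b-%Y"),
    "Deadline": "23-Dec-2024",
    "Tasks Left": [
        {"Task Name": "Get groceries"},
        {"Task Name": "Buy presents"},
    ],
}

# Example path; point this at whatever folder LocalDocs is indexing.
with open("localdocs/status.json", "w") as f:
    json.dump(doc, f, indent=2)
```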
Can Perchance have something like that built in, or is it up to each generator’s creator to set up RAG?
In my experience, text AIs tend to ignore or forget information very quickly. Before setting up RAG I was constantly correcting the AI about everything, but after adding RAG it worked flawlessly.
Also, RAG doesn’t seem to increase context size. It looks like the AI just uses the retrieved document during generation and then forgets it, so the context grows only by the AI’s reply.
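For anyone wondering what that looks like mechanically, here’s a minimal sketch of per-turn injection. The `retrieve()` helper and prompt layout are hypothetical stand-ins, not GPT4All’s actual internals:

```python
# Minimal sketch of per-turn RAG injection. retrieve() stands in for
# whatever similarity search the app uses over the indexed documents.
def chat_turn(history: list[str], user_msg: str, llm, retrieve) -> str:
    snippet = retrieve(user_msg)  # e.g. the JSON document above

    # The snippet is included in this one prompt only; it is never
    # appended to the conversation history.
    prompt = (
        f"Relevant document:\n{snippet}\n\n"
        + "\n".join(history)
        + f"\nUser: {user_msg}\nAssistant:"
    )
    reply = llm(prompt)

    # Context grows only by the exchange itself, not the injected document.
    history.append(f"User: {user_msg}")
    history.append(f"Assistant: {reply}")
    return reply
```

Because the document is re-injected fresh each turn, it never accumulates in the context window, which matches the behavior described above.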
To sum up, I’ve found this thing very useful. It would be super helpful for all text generators, especially ones where the AI must be aware of some persistent context like world rules, story characters, etc. Here are a couple of examples:
- https://perchance.org/ai-character-chat
- https://perchance.org/ai-story-generator
- https://perchance.org/ai-generated-hierarchical-world
But all generator authors and users would benefit from this.
Note: If you decide to implement this, please don’t make it a file upload. That’s how RAG is implemented in LM Studio, and it’s really annoying to delete the previous document and upload a new one. A live editor is significantly better.
If you don’t have the RAM to load a Mixtral 8×7B Q4, look into setting up DeepSpeed. Once the model is actually loaded, it runs about like a 13B, but with nearly the attention of a 70B. I run either an 8×7B Q4K or a 70B Q4L on a 16GB GPU and a 12th-gen i7 with 64GB of system memory; that doesn’t require DeepSpeed to load. The 70B is only marginally better, but it’s a little slower than my fastest reading pace. The Mixtral model is much faster, and it’s a large enough model to stay coherent. Your softmax (sampler) settings per model are very important too.
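For reference, here’s roughly what a DeepSpeed-Inference setup looks like in Python. This is a sketch under assumptions (an fp16 Hugging Face checkpoint, an example model id), not the commenter’s exact config; note that DeepSpeed loads fp16/bf16 checkpoints rather than llama.cpp Q4 GGUF files, and the ZeRO offload config that helps when RAM is tight is a separate piece not shown here:

```python
# Rough sketch: wrapping a Hugging Face checkpoint with DeepSpeed-Inference.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # example model id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

# Replace supported layers with DeepSpeed's fused inference kernels.
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tok("Deadline is 23-Dec-2024. What's left to do?", return_tensors="pt")
inputs = {k: v.to(engine.module.device) for k, v in inputs.items()}
out = engine.module.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```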
Thanks for this tip. I don’t have a lot of VRAM, just 64GB of regular RAM, but I don’t mind waiting for output :)
But anyway, all non-Llama models weren’t as good when using RAG in this plug-and-play mode. I probably should’ve spent more time on the system prompt and Jinja template, as well as on RAG curation, to squeeze out all the juice, but I wanted something quick and easy to set up, and for that Llama 3.2 8B Instruct was the best. I used the default setup and the same system prompt for all models.
Also, the new Qwen reasoning model was good, and it was faster in my setup, but it was too “independent”, I guess: it tended to ignore instructions from the system prompt and other settings, while Llama was more “obedient”.