- cross-posted to:
- [email protected]
Hi, I’ve been playing with some AI models on my machine in the GPT4All software, and it has this thing called “LocalDocs”.
It looks like it’s just RAG for the AI, and a simple structured text document is more than enough for casual usage. A document like this was enough for Llama 3.x to be aware of dates and the current state of my tasks:
```json
{
  "Today": "12-Dec-2024",
  "Tomorrow": "13-Dec-2024",
  "Deadline": "23-Dec-2024",
  "Tasks Left": [
    {"Task Name": "Get groceries"},
    {"Task Name": "Buy presents"}
  ]
}
```
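Since the dates in a static file go stale, one could regenerate the document on a schedule. Here’s a minimal Python sketch; the file path and field names just mirror the example above, they aren’t anything GPT4All requires:

```python
# Hypothetical helper: rewrite the LocalDocs JSON each day so the model
# always sees the current date. Field names match the example document.
import json
from datetime import date, timedelta

today = date.today()
doc = {
    "Today": today.strftime("%d-%b-%Y"),
    "Tomorrow": (today + timedelta(days=1)).strftime("%d-%b-%Y"),
    "Deadline": "23-Dec-2024",
    "Tasks Left": [
        {"Task Name": "Get groceries"},
        {"Task Name": "Buy presents"},
    ],
}

# Example path; point this at whatever folder LocalDocs is indexing.
with open("localdocs/status.json", "w") as f:
    json.dump(doc, f, indent=2)
```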
Can Perchance have something like that built in, or is it up to each generator’s creator to set up RAG?
In my experience, text AIs tend to ignore or forget information very quickly. Before setting up RAG I was constantly correcting the AI about everything, but after adding RAG it worked flawlessly.
Also, RAG doesn’t seem to increase context size. It looks like the AI just uses the retrieved document during generation and then forgets it, so the context grows only by the AI’s reply.
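For anyone wondering what that looks like mechanically, here’s a minimal sketch of per-turn injection. The `retrieve()` helper and prompt layout are hypothetical stand-ins, not GPT4All’s actual internals:

```python
# Minimal sketch of per-turn RAG injection. retrieve() stands in for
# whatever similarity search the app uses over the indexed documents.
def chat_turn(history: list[str], user_msg: str, llm, retrieve) -> str:
    snippet = retrieve(user_msg)  # e.g. the JSON document above

    # The snippet is included in this one prompt only; it is never
    # appended to the conversation history.
    prompt = (
        f"Relevant document:\n{snippet}\n\n"
        + "\n".join(history)
        + f"\nUser: {user_msg}\nAssistant:"
    )
    reply = llm(prompt)

    # Context grows only by the exchange itself, not the injected document.
    history.append(f"User: {user_msg}")
    history.append(f"Assistant: {reply}")
    return reply
```

Because the document is re-injected fresh each turn, it never accumulates in the context window, which matches the behavior described above.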
To sum up, I’ve found this thing very useful. It would be super helpful for all text generators, especially ones where the AI must be aware of some persistent context like world rules, story characters, etc. Here are a couple of examples:
- https://perchance.org/ai-character-chat
- https://perchance.org/ai-story-generator
- https://perchance.org/ai-generated-hierarchical-world
But all generator authors and users would benefit from this.
Note: If you decide to implement this, please don’t make it a file upload. That’s how RAG is implemented in LM Studio, and it’s really annoying to delete the previous document and upload a new one. A live editor is significantly better.
If you don’t have the RAM to load a Mixtral 8×7B Q4, look into setting up DeepSpeed. Once the model is actually loaded, it runs about like a 13B, but with nearly the attention of a 70B. I run either an 8×7B Q4K or a 70B Q4L on a 16GB GPU and a 12th-gen i7 with 64GB of system memory; that doesn’t require DeepSpeed to load. The 70B is only marginally better, but it’s a little slower than my fastest reading pace. The Mixtral model is much faster, and it’s a large enough model to stay coherent. Your softmax (sampler) settings per model are very important too.
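For reference, here’s roughly what a DeepSpeed-Inference setup looks like in Python. This is a sketch under assumptions (an fp16 Hugging Face checkpoint, an example model id), not the commenter’s exact config; note that DeepSpeed loads fp16/bf16 checkpoints rather than llama.cpp Q4 GGUF files, and the ZeRO offload config that helps when RAM is tight is a separate piece not shown here:

```python
# Rough sketch: wrapping a Hugging Face checkpoint with DeepSpeed-Inference.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # example model id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

# Replace supported layers with DeepSpeed's fused inference kernels.
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tok("Deadline is 23-Dec-2024. What's left to do?", return_tensors="pt")
inputs = {k: v.to(engine.module.device) for k, v in inputs.items()}
out = engine.module.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```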
Thanks for this tip. I don’t have a lot of VRAM, just 64GB of regular RAM, but I don’t mind waiting for output :)
But anyway, all non-Llama models weren’t as good when using RAG in this plug-and-play mode. I probably should’ve spent more time on the system prompt and Jinja template, as well as on RAG curation, to squeeze out all the juice, but I wanted something quick and easy to set up, and for that Llama 3.2 8B Instruct was the best. I used the default setup and the same system prompt for all models.
Also, the new Qwen reasoning model was good, and it was faster in my setup, but it was too “independent”, I guess: it tended to ignore instructions from the system prompt and other settings, while Llama was more “obedient”.