The main reason I ask is that my current favorite model is a Llama 2 70B Q4_1 GGML model quantized by The Bloke. Here’s the thing though: it was labeled as “Instruct”, but it defaults to chat in the settings in Oobabooga/Textgen. Every other model I have tried to use for technical help and Python/bash snippets has failed to meet my expectations for (skeptically acceptable) accuracy. This 70B is powerful enough that I can prompt it to generate code snippets, and if the code raises an error, pasting the error into the prompt almost always gets a solution in a single correction. Other models I have tried this paste-error technique on often crash, ‘dig in their heels’ insisting they are correct, or fail in several other ways, such as overfitting that forces me to reset the context tokens.
For whatever reason, the specific 70B model I am using has far exceeded my expectations, but I must use it under very specific conditions in Oobabooga/Textgen. It must be set to: chat, llama.cpp, the “divine intellect” parameter preset, and the character profile set to the default of “None.”
For whatever reason, deviation from these settings ruins the accuracy of code snippets. Speculatively/intuitively, if I try to use the instruct prompt, or a new persistent character profile, it seems like there is an issue in the way the previous context is handled. In a single session the context seems to drift. In any case, code seems to always have errors and paste corrections fail.
I can’t contextualize this issue with such large models. I have had the same issues with smaller models regardless of settings I have tried. I have written or modified a dozen scripts between bash and python using this 70B in chat mode. It is a bit of a pain because the prompt input/output is not proper markdown for code so I have to correct for whitespace scope and have a reasonable understanding of the code syntax, but for the most part, I don’t need to make corrections to specific lines of output. Is this rare, an issue/quirk with: the model quantization, llama.cpp, Textgen, other? Has anyone else experienced something like this? Am I just super lucky to have found a chance combination that works really well at snippets combined with my prompting/coding skill level? I haven’t had much success with the code specific LLMs either. I’m not sure why this model is doing so well for me.
I haven’t got any experience with the 70B version specifically, but based on my experience with LLaMa 2 13B (still annoyed that there’s no 30B version of v2…) it is more sensitive to prompting variations than other models, as it isn’t specifically trained for “chat”, “instruct”, or “completion” style interactions. It is capable of all three, but without a clear prompt and template it can be somewhat random as to what kind of response you will get.
For example, using
as the prompt will get results varying from a written article to “The user’s response to an article about [subject] is” to “My response to this request is to ask the user about [clarifying questions]” to “One possible counterargument to an article about [subject] is” to literally the text “Generating response, please wait… [random URL]”. Whereas most conversationally-fine-tuned models will understand and follow this template or other similar templates and play their side of the conversation even if it doesn’t match exactly what they were trained on.
I would recommend using llama.cpp (or the Python binding) directly for more awareness of and control over the exact prompt text as seen by the model. Or using text-generation-webui in “notebook” mode (which just gives you a blank text box that both you and the LLM will type into and it’s up to you to provide the prompt format). This will also avoid any formatting issues with the chat view in text-generation-webui (again I don’t have any specific experience with LLaMa 2 70B but I have encountered times when models don’t output the markdown code block tags and text-generation-webui will mess up the formatting).
Note that, for some reason, the `chat`, `instruct`, and `chat-instruct` modes in text-generation-webui are confusingly named. `instruct` mode does not include an “instruction” (e.g. “Continue the conversation”) before the conversation unless you include one in the conversation template (the conversation template is referred to as “Instruction template” in the UI). `chat-instruct` mode includes an instruction such as “Continue the conversation by writing a single response for Assistant” before the conversation, followed by the conversation template. `chat` and `chat-instruct` modes also include text that describes the character that the model will speak as (mostly used for roleplay, but the default “None” character describes a generic AI assistant character). It is possible that the inclusion of this text is what is helping LLaMa 2 stay on track in your case and understand that it is participating in a conversation. I’m not sure what conversation template `chat` mode uses, but afaik it is not the same turn template as set in `instruct` and `chat-instruct` modes, and I don’t see an option to configure it anywhere.
Thanks for the insight. I’m slowly working my way into the codebase for Textgen, and will hopefully get to the point where I can directly use the command line for prompting.
I’ve tried to figure out how the various GUI front ends are able to generate consistent characters and keep the role play consistent. I’ve played with chat, chat-instruct, instruct, and notebook, but I didn’t have a reliable baseline model or fundamental understanding to effectively assess them. Do you happen to know if the prompt processing differences in Textgen, and others like Kobold, are all arbitrary processing done before llama.cpp is called (or some similar code), or is there some other API level that more complex character prompts are tapping into? I’m also super curious how text completion is done in practice. I’m aware I’m blindly walking into this space with my arms out trying to find the walls; aware, but unworried about giant potential holes in the floor.
Typically, using Textgen/chat with the 70B, I just need to start my first question with `In python, how do I....`
Starting the prompt like this is critical. If a following prompt introduces a new python module and very different code, I will add the same `In python,..` beginning. However, if the questioning and code base are somewhat related, I do not continue to add the language context token to my prompt. I am looking for drift, repetition, or any other oddities as a sign that I need to reset the context tokens. Like if the reply changes writing perspective context arbitrarily, I need to recall the last question, alter it, and regenerate. If the reply is the same, I know the context tokens are ruined. I have no idea what this is called. I just know based on intuitive pattern matching that this is a sign things are about to go wonky.
I’m half writing this for any insight anyone can help with, but also to share what I have gotten to work well enough for repeatable and useful results. Some of these elements can be the subtle difference between first time success and failure for someone new. Thanks again for the info.
The llama.cpp Python API is super simple to use and you don’t need to dig into the text-generation-webui codebase at all. Literally just:
    # use the llama_cpp_cuda version for support for running GGML models on the GPU
    import llama_cpp_cuda as llama_cpp

    # use whatever settings here that you would set in text-generation-webui when loading the model;
    # make sure to include n_gqa=8 when using the LLaMa v2 70B model
    model = llama_cpp.Llama(model_path="", seed=-1, n_ctx=2048, n_gpu_layers=28, low_vram=True)

    # now you can either do things with the "all-in-one" API...
    # you pass your temperature, top_p, top_k, etc. settings here; these are the same as the settings in
    # text-generation-webui. You don't need to pass all the parameters, e.g. you can leave out the
    # mirostat parameters if you aren't using mirostat mode
    completion = model.create_completion(prompt, max_tokens=200, temperature=0.8, top_p=0.95, top_k=40, repeat_penalty=1.1, frequency_penalty=0.0, presence_penalty=0.0, tfs_z=1.0, mirostat_mode=0, mirostat_tau=5.0, mirostat_eta=0.1)
    text = completion["choices"][0]["text"]  # create_completion() returns an OpenAI-style dict, not a string

    # ...or the "manual" way
    prompt_tokens = model.tokenize(prompt.encode('utf-8'))
    model.reset()
    model.eval(prompt_tokens)
    generated_tokens = []
    while True:
        next_token = model.sample(temp=0.8, top_p=0.95, top_k=40, repeat_penalty=1.1, frequency_penalty=0.0, presence_penalty=0.0, tfs_z=1.0, mirostat_mode=0, mirostat_tau=5.0, mirostat_eta=0.1)
        if next_token != model.token_eos():
            generated_tokens.append(next_token)
            model.eval([next_token])
        else:
            break
    text = model.detokenize(generated_tokens).decode('utf-8')  # detokenize() takes a flat list of token ids
See the documentation here for more information: https://llama-cpp-python.readthedocs.io/en/latest/api-reference/ You only really need to pay attention to `__init__()`, `tokenize()`, `detokenize()`, `reset()`, `eval()`, `sample()`, and `generate()`. `create_completion()` provides an “all-in-one” wrapper around eval/sample/generate that is intended to be (loosely) compatible as a drop-in replacement for the OpenAI Python library. `create_chat_completion()` is likewise intended to be a replacement for OpenAI, but if you want direct control over the prompt format then ignore it entirely (it’s not even documented exactly how the prompt is formatted when using this function…).
They are not doing anything special with the model (no fancy API or anything). All they are doing is including some extra text before your input that describes the characters, scene etc. and possibly a direct instruction to roleplay as that character, and then sending that combined assembled prompt to the model/backend API as you would with any other text. Unfortunately the documentation isn’t particularly transparent about how the extra text is included (with regards to the exact formatting used, what order things appear in, etc.) and neither do the logs produced by e.g. text-generation-webui include the actual raw prompt as seen by the model.
The key point to understand here is that all current LLMs (this may change in the future) work only with raw text. They take in some text and then generate other text that goes after it. Any more complex applications such as conversation are just layers built on top of this. The conversation is turned into a plain-text transcript that is sent to the model. The model generates the next part of the conversation transcript, which is then parsed back out and appended to the list of conversation messages. From the model’s perspective, it’s all just one continuous stream of raw text. You can always achieve exactly the same results by manually constructing the same prompt yourself and passing it directly to the model.
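To make that round trip concrete, here is a minimal sketch of flattening a conversation into one transcript string and cutting the model's continuation back out. The helper names (`build_transcript`, `extract_reply`) and the `### User:` / `### Response:` headers are illustrative assumptions, not anything text-generation-webui or llama.cpp actually defines.

```python
from typing import List, Tuple

# Assumed turn headers for illustration; use whatever format your model expects.
USER_HEADER = "### User:"
BOT_HEADER = "### Response:"


def build_transcript(messages: List[Tuple[str, str]]) -> str:
    """Flatten (role, text) pairs into one prompt that ends with an open bot turn."""
    parts = []
    for role, text in messages:
        header = USER_HEADER if role == "user" else BOT_HEADER
        parts.append(f"{header}\n{text}\n\n")
    # Leave the final bot header open so the model writes the next reply.
    parts.append(f"{BOT_HEADER}\n")
    return "".join(parts)


def extract_reply(generated: str) -> str:
    """Keep only the text the model wrote before it started a new user turn."""
    return generated.split(USER_HEADER, 1)[0].strip()


history = [("user", "In python, how do I read a file?")]
prompt = build_transcript(history)
```

The string in `prompt` is all the model would ever see. Whatever it generates after it is passed through `extract_reply()` and appended to `history` as the next message.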
For example, if I pass the following string as the prompt into `model.create_completion()` from above
`"### User:\nPlease can you write a program in Python that will split a file into 19200-byte blocks and calculate the SHA256 hash of each block.\n\n### Response:\n"`
I will get exactly the same result as if I used `instruct` mode in text-generation-webui with `### User:` as the user string, `### Response:` as the bot string, and `<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n` as the turn template, and then sent the message “Please can you write a program in Python that will split a file into 19200-byte blocks and calculate the SHA256 hash of each block.” in the chat box.
(Although imo doing it the manual way is less error-prone and guaranteed to give me exactly the prompt that I think I should be getting, noting that text-generation-webui doesn’t give me any way at all to actually verify that the prompt seen by the model is actually the way I intended it to be and it’s not as though I haven’t encountered UI bugs before where the produced formatting doesn’t match what I entered…)
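As a rough sketch of that equivalence, the turn template can be expanded by plain string substitution. This is an assumption about what the UI effectively does, not its actual code, and `render_open_turn` is a made-up name.

```python
TURN_TEMPLATE = "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"


def render_open_turn(user_message: str,
                     user_string: str = "### User:",
                     bot_string: str = "### Response:",
                     turn_template: str = TURN_TEMPLATE) -> str:
    """Expand the template up to the point where the bot's message would start."""
    # Drop the bot-message placeholder and everything after it: the model fills that in.
    open_turn = turn_template.split("<|bot-message|>", 1)[0]
    # Substitute the user's message last so placeholders inside it are left alone.
    return (open_turn
            .replace("<|user|>", user_string)
            .replace("<|bot|>", bot_string)
            .replace("<|user-message|>", user_message))


message = ("Please can you write a program in Python that will split a file into "
           "19200-byte blocks and calculate the SHA256 hash of each block.")
prompt = render_open_turn(message)
```

Printing `repr(prompt)` shows exactly what would be sent to the model, which is the kind of check the UI does not offer.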
You don’t necessarily need to alter your question in that case, often just regenerating is enough to “fix” this. This is, as I have said, particularly an issue with the LLaMa 2 non-chat models as they aren’t specifically trained to follow a conversation, so sometimes they will arbitrarily decide to provide a commentary or reaction to the conversation or they see the conversation as part of a webpage and try to generate a heading for the next part of an article or some other such seemingly-“random” behavior instead of continuing the conversation itself. If that happens just regenerate the response until the RNG works out in your favor and the model starts writing in the correct role. Once it starts writing a particular “type” of output it will generally keep writing in the same role until it has finished.
Sometimes it is also helpful to write the first part of the response yourself. For example, you could write “Sure! Here is a program that does <summary>” (try to copy the particular style used by a particular model) and then let the model continue from there (there’s an option in text-generation-webui labeled “Start reply with” that does this, or if you’re constructing the prompt yourself then this is trivial to accomplish - make sure to not include a space or newline after the part that you’ve written). This will make it more likely to write a program for you instead of providing a commentary like “The user has asked the assistant to write a program. It is possible that someone may respond to such a request by …”.
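If you are building the prompt yourself, starting the reply for the model is just string concatenation. A minimal sketch, with hypothetical helper names (`start_reply_with`, `join_reply`) and an assumed `### User:` / `### Response:` format:

```python
def start_reply_with(prompt: str, reply_start: str) -> str:
    """Append the hand-written opening of the reply to the prompt.

    Trailing spaces and newlines are stripped so the model continues the
    sentence rather than seeing a finished line.
    """
    return prompt + reply_start.rstrip()


def join_reply(reply_start: str, continuation: str) -> str:
    """Rebuild the full reply from the hand-written start plus the model's continuation."""
    return reply_start.rstrip() + continuation


prompt = "### User:\nPlease write a program that prints the date.\n\n### Response:\n"
primed = start_reply_with(prompt, "Sure! Here is a program that ")
```

`primed` is what gets sent to the model. The text it returns is only the continuation, so `join_reply()` puts the two halves back together before the reply is shown or stored.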
This seems to be (sort of) a known issue with LLaMa 2 specifically, where it will keep regenerating the previous response even though you continue the conversation. It’s not exactly clear what causes this, it’s not a software bug in the traditional sense. The model is receiving your follow-up message but it’s just deciding to repeat whatever it said last time instead of saying something different. This is believed to possibly be an issue with how the training data was formatted.
This might make more sense if you think of this in terms of what the model is seeing. The model is seeing something such as the following:
    ### User:
    Please can you write a program in Python that will split a file into 19200-byte blocks and calculate the SHA256 hash of each block. The hash should be written to a file with the name ".blockhashes." (index is padded to 5 digits).

    ### Response:
    Certainly! Here's an example program that does what you described:

    [33-line code snippet removed]

    This program takes two arguments: the input file and the output directory. It first calculates the number of blocks needed to store the entire file, and then loops over each block, reading it from the input file and calculating its SHA256 hash. The hash is written to a separate file with the format `.blockhashes.`.

    I hope this helps! Let me know if you have any questions or need further clarification.

    ### User:
    Please can you fix the following two issues with your program:
    * The output filename must have the block index padded to 5 digits.
    * The output file must contain only the SHA256 hash in hex form and no other text/contents.
    Please write out only the parts of the program that you have changed.

    ### Response:
At this point, the model sees the heading `### Response:`. For some reason, the LLaMa 2 models have an excessive tendency to refer back in the text and see that last time the text `### Response:` was followed by the text `Certainly! Here's an example program that does what you described:` and so they will then repeat that exact same text again, because the model has concluded that `### Response:` should now always be followed by `Certainly! Here's an example program that does what you described:` instead of seeing the higher-level view where `### User:` and `### Response:` are taking turns in a conversation.
If this happens, you don’t always need to clear/reset the conversation. Often, you can just regenerate it a few times and once the model starts writing a different response it will continue into something else other than repeating the same text as before. As with the previous point, it can also help if you write the first part of the response yourself to force it to say something different.
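The "regenerate until it says something different" routine can be automated when you are driving the model from a script. This is only a sketch: `generate` stands in for whatever function returns a completion for a prompt (an assumption, not a llama.cpp API), and comparing the first 40 characters is an arbitrary heuristic.

```python
from typing import Callable, Optional


def regenerate_until_new(generate: Callable[[str], str],
                         prompt: str,
                         previous_reply: str,
                         max_attempts: int = 5,
                         probe_chars: int = 40) -> Optional[str]:
    """Call generate() until the reply stops opening with the previous reply's text.

    Returns the first reply that starts differently, or None if every attempt
    repeated the old opening (a hint that the context needs resetting).
    """
    previous_start = previous_reply.strip()[:probe_chars]
    for _ in range(max_attempts):
        reply = generate(prompt)
        if not previous_start or not reply.strip().startswith(previous_start):
            return reply
    return None
```

A `None` result is the scripted equivalent of noticing that regenerating keeps producing the same text. At that point, writing the start of the reply yourself or clearing the conversation is the next thing to try.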
Many thanks. I got it mostly working today. At least, I got it working with llama_cpp. I haven’t gotten llama_cpp_cuda working. I used the same conda environment and distrobox as Oobabooga/Textgen. I tried (re)installing with pip and conda, but there is always some weird missing dependency (not at comp now where I can say exactly which). I tried searching for the libraries that were called out in the error but got no results, internet search had no good results, and even old trusty 70B had no helpful advice. On anaconda.org it had no results, but said something about not showing private libraries without logging in, whatever that means. With pip list there is an entry for something like llama_python_cpp_cuda. It is probably what I need but I’m not sure yet.
The results I got weren’t great, but it is a starting point. I was surprised how most models are around the same speed as Textgen. I can see the potential, but at the same time, for my needs thus far, just having Textgen and the 70B in chat mode all the time is handy and fast, even if I must filter through the poor formatting.
I should have said before, I was exploring the source code of extensions and overall, looking at how stuff works. Your explanation and guidance were too good of an opportunity to pass up. Getting this working on the command line was definitely on my list to try. I spent most of the day reading about the API and asking the 70B to explain stuff. I still have lots to figure out. Like setting up a simple script with terminal input got some oddball responses several times. The correct answer was in the output but so were an extra half dozen random sentences, most of them completely unrelated. I think it had to do with a lack of prompt structure. I’m not sure how to format that in the terminal itself. I can do it in the script. It is the obvious solution, I just didn’t get around to it today.
I only used the part above `# ...or the "manual" way`. I’m not clear on what the second part is doing exactly or the I/O. What I tried just seemed to loop with no output. I’m sure I’m doing something stupid that is causing the issue.

    prompt_tokens = model.tokenize(prompt.encode('utf-8'))
    model.reset()
    model.eval(prompt_tokens)
    generated_tokens = []
    while True:
        next_token = model.sample(temp=0.8, top_p=0.95, top_k=40, repeat_penalty=1.1, frequency_penalty=0.0, presence_penalty=0.0, tfs_z=1.0, mirostat_mode=0, mirostat_tau=5.0, mirostat_eta=0.1)
        if next_token != model.token_eos():
            generated_tokens.append(next_token)
            model.eval([next_token])
        else:
            break
    text = model.detokenize([generated_tokens]).decode('utf-8')
Thanks again.