Abstract: Instruction-tuned Large Language Models (LLMs) have achieved breakthrough results, opening countless new possibilities for many practical applications. However, LLMs lack elementary safety features that are established norms in other areas of computer science, such as the separation between instructions and data, causing them to malfunction or rendering them vulnerable to manipulation and interference by third parties e.g., via indirect prompt/command injection. Even worse, so far, there is not even an established definition of what precisely such a separation would mean and how its violation could be tested. In this work, we aim to close this gap. We introduce a formal measure to quantify the phenomenon of instruction-data separation as well as an empirical variant of the measure that can be computed from a model`s black-box outputs. We also introduce a new dataset, SEP (Should it be Executed or Processed?), which allows estimating the measure, and we report results on several state-of-the-art open-source and closed LLMs. Finally, we quantitatively demonstrate that all evaluated LLMs fail to achieve a high amount of separation, according to our measure. The source code and SEP dataset are openly accessible at this https URL.
Lay summary (by Claude 3 Sonnet): Large language models (LLMs) have achieved remarkable results, but they lack fundamental safety features that separate instructions from data, making them vulnerable to manipulation and interference. This work introduces a formal measure to quantify this instruction-data separation and an empirical variant that can be computed from a model’s outputs. The authors also introduce a new dataset, SEP (Should it be Executed or Processed?), to estimate this measure. Their results demonstrate that several state-of-the-art open-source and closed LLMs fail to achieve a high level of separation according to their measure, highlighting the need for improved safety features in these powerful models.