A new arXiv paper, "Focused Transformer: Contrastive Training for Context Scaling," explores a technique for extending the effective context length of large language models.

From the paper:

Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often restrained due to a limitation in the effective context length. One solution to this issue is to endow an attention layer with access to an external memory, which comprises (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This novel approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of 3B and 7B OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a 256k context length for passkey retrieval.

A ChatGPT-4 summary of the FoT approach:

Large language models boast an exceptional capacity to contextualize new information. However, their full potential often faces constraints due to a limitation in the effective context length. This problem can be partly addressed by enabling an attention layer to access an external memory consisting of (key, value) pairs. But as the number of documents grows, the proportion of relevant keys shrinks, and the model may drift towards focusing on irrelevant keys rather than the useful ones.
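To make the external-memory idea concrete, here is a minimal numpy sketch, under purely illustrative assumptions, of a single query attending over its local context plus the top-k most similar (key, value) pairs retrieved from an external memory. The function name, shapes, and inner-product retrieval heuristic are mine, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_augmented_attention(q, local_k, local_v, mem_k, mem_v, top_k=32):
    """One query attends over its local context plus the top-k entries
    retrieved from an external (key, value) memory.

    q:       (d,)    query vector
    local_k: (L, d)  keys from the current context window
    local_v: (L, d)  values from the current context window
    mem_k:   (M, d)  keys accumulated in the external memory (M >> L)
    mem_v:   (M, d)  values accumulated in the external memory
    """
    # Retrieve the top-k memory entries by inner-product similarity.
    top_k = min(top_k, len(mem_k))
    scores = mem_k @ q
    idx = np.argpartition(-scores, top_k - 1)[:top_k]

    # Attend over the combined local + retrieved set with standard
    # scaled dot-product attention.
    k = np.concatenate([local_k, mem_k[idx]])
    v = np.concatenate([local_v, mem_v[idx]])
    attn = softmax(k @ q / np.sqrt(q.shape[-1]))
    return attn @ v
```

The retrieval step keeps the attended set small even when the memory holds many documents; the catch is that nothing guarantees the retrieved keys are the useful ones.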

This shrinking proportion leads to a significant challenge we’ve termed the ‘distraction issue’, where keys associated with differing semantic values may overlap, making them hard to differentiate.
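A toy calculation makes the distraction issue tangible: when distractor keys partially overlap with the relevant key but point to different values, the softmax weight on the relevant key shrinks as more documents fill the memory. The overlap factor and sizes below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
q = rng.standard_normal(d)
q /= np.linalg.norm(q)

relevant_key = q.copy()  # perfectly aligned with the query

for n_docs in (1, 10, 100, 1000):
    # Each extra document contributes keys that partially overlap with
    # the relevant key but carry unrelated values.
    noise = rng.standard_normal((n_docs * 16, d))
    noise /= np.linalg.norm(noise, axis=1, keepdims=True)
    distractors = 0.6 * relevant_key + 0.4 * noise

    keys = np.vstack([relevant_key[None, :], distractors])
    scores = keys @ q                      # inner-product attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    print(f"{n_docs:5d} docs -> weight on the relevant key: {weights[0]:.4f}")
```

The weight on the relevant key drops steadily as the number of overlapping distractors grows, which is exactly the failure mode FoT's training is meant to counteract.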

In response to this challenge, we present the Focused Transformer (FoT), a method utilizing a contrastive learning-inspired training process. This innovative approach refines the structure of the (key, value) space, thus enabling an expansion of the context length.
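In broad strokes, the contrastive-style training means the attention layer sees (key, value) pairs from the current document alongside pairs taken from unrelated documents, so that doing well on the training objective forces queries to score the genuinely relevant keys above the lookalikes. Here is a minimal sketch of just that mixing step; the function, shapes, and the assumption that an ordinary next-token loss sits on top of attention over the mixed set are mine, not the paper's training code.

```python
import numpy as np

def crossbatch_keys_values(batch_k, batch_v, doc_index):
    """Build the (key, value) set one document attends to during
    contrastive-style training: its own previous-context pairs
    (positives) plus pairs from other documents in the batch
    (negatives).

    batch_k, batch_v: (B, L, d) keys/values of the previous context
                      window for each of the B documents in the batch.
    doc_index:        which document's attention is being built.
    """
    B, _, d = batch_k.shape
    positives_k, positives_v = batch_k[doc_index], batch_v[doc_index]

    negative_ids = [i for i in range(B) if i != doc_index]
    negatives_k = batch_k[negative_ids].reshape(-1, d)
    negatives_v = batch_v[negative_ids].reshape(-1, d)

    # Training on top of attention over this mixed set (with the usual
    # next-token loss, assumed here) implicitly pushes keys that carry
    # unrelated values away from the current document's queries.
    return (np.concatenate([positives_k, negatives_k]),
            np.concatenate([positives_v, negatives_v]))
```

Structuring the (key, value) space this way is what the paper credits for the extended effective context: queries learn to pick out the few relevant keys even when the memory is crowded.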

The FoT method enables the fine-tuning of existing large-scale models, lengthening their effective context. As a case in point, the authors fine-tuned the 3B and 7B OpenLLaMA checkpoints, and the resulting models show remarkable progress in tasks requiring an extended context. Notably, the LongLLaMA models expertly handle a 256k context length for passkey retrieval.

We’ve used OpenLLaMA-3B and OpenLLaMA-7B models trained for 1T tokens as starting points and fine-tuned them with FoT. The result? Our LongLLaMA models can extrapolate beyond their training context length, even up to 256k, while retaining performance on short-context tasks.
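For context, the passkey-retrieval benchmark hides a short random code inside a long stretch of filler text and then asks the model to repeat it. Below is a hedged sketch of how such a prompt is commonly constructed; the wording, block count, and rough token estimate are illustrative assumptions, not the paper's exact evaluation harness.

```python
import random

FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")

def build_passkey_prompt(n_filler_blocks=4000, seed=0):
    """Bury a random passkey inside repeated filler text and ask the
    model to recall it. Each filler block is a couple of dozen tokens,
    so ~4000 blocks yields a prompt on the order of 100k tokens."""
    rng = random.Random(seed)
    passkey = rng.randint(10_000, 99_999)
    insert_at = rng.randint(0, n_filler_blocks - 1)

    parts = []
    for i in range(n_filler_blocks):
        if i == insert_at:
            parts.append(f"The pass key is {passkey}. Remember it. ")
        parts.append(FILLER)
    parts.append("What is the pass key? The pass key is")
    return "".join(parts), passkey

prompt, answer = build_passkey_prompt()
print(f"{len(prompt):,} characters; expected answer: {answer}")
```

A model succeeds at a given context length if it reproduces the passkey from a prompt of that length, which is what the accuracy figures below refer to.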

To put numbers on it: LongLLaMA achieves roughly 95% passkey-retrieval accuracy at a 100k context length and a still significant 73% at 256k. These results mark a notable advance in the contextual capacity of language models.

FoT offers a practical and potent way to extend the effective context length of existing large models, paving the way for language models that handle long contextual information more intelligently and comprehensively.