# Supersizing Transformers: Going Beyond RAG with Extended minds for LLMs

Phoebe Klett, Thomas Ahle  
2023-10-24

Today’s popularized large language models are autoregressive models
trained on next token prediction. They’re optimized for the task of
producing sequences of tokens which look like they could’ve been present
in the training corpus. This is quite distinct from the ways in which
LLMs are wielded in such user-interfaces as
[ChatGPT](https://chat.openai.com/?model=gpt-4) or
[Perplexity.ai](https://www.perplexity.ai/), where users expect the
model to perform complex reasoning tasks and faithfully retrieve
factual, topical information. If we hope to use the model as a general
reasoning agent and not as a stochastic parrot, we need to provide it
with any relevant data at inference time, rather than rely on (1) the
salient data having appeared in the training corpus and (2) the model
having perfect recall. But in many cases, we’re faced with documents
longer than the context window of the model.

This has prompted much development on methods colloquially referred to
as “retrieval”. Indeed, retrieval has thus far become a [table
stakes](https://www.sequoiacap.com/article/generative-ai-act-two/) part
of the modeling stack for building LLM apps. However, today’s methods
are still lacking. In particular, popular RAG[1] methods necessitate
clunky system design, and introduce latency and data leakage concerns
which make this solution a non-starter for many enterprise applications.
Equally important, this method decides what to include in the prompt
based on grainy representations disjoint from the model.

In this post, we propose **extended mind transformers**, a method which
is in many ways a natural generalization of methods like RAG. This
mathematical generalization buys us performance and quality gains (fewer
hallucinations!), seamless enterprise integrations, and added
interpretability without introducing conceptual complexity.

<figure>
<img
src="https://storage.googleapis.com/normal-blog-artifacts/extended-mind-transformers/otto.png"
alt="Credits: Buchen (2018)" />
<figcaption aria-hidden="true">Credits: <span class="citation"
data-cites="patrick-blog">Buchen (2018)</span></figcaption>
</figure>

## Aesthetics for Extended Mind Transformers

As motivation, we provide context from the Philosophy of Mind which
served as inspiration for the naming convention and methodology. In
“The Extended Mind”, Clark and Chalmers (1998) present the thesis that
external information which is constantly and immediately accessible, and
automatically endorsed, should be considered part of the memory, and
further, that this extension should be considered part of the mind. They
term this idea active externalism. The story of Otto functions as an
intuition pump:

> “\[L\]ike many Alzheimer’s patients, \[Otto\] relies on information in
> the environment to help structure his life. Otto carries a notebook
> around with him everywhere he goes. When he learns new information, he
> writes it down. When he needs some old information, he looks it up.
> For Otto, his notebook plays the role usually played by a biological
> memory. … The information in the notebook functions just like
> information constituting an ordinary non-occurrent belief; it just
> happens that this information lies beyond the skin.”[2]

In this piece, we present active externalism for LLMs, a mechanism for
bolstering the memory of transformers aesthetically inspired by the
Extended Mind Thesis.

## Active externalism for LLMs

Current popular methods for tackling the short context length of today’s
LLMs include fine tuning[3] and RAG. There’s also been compelling work
done to suggest ways of augmenting the self-attention mechanism to
support various types of “long term memory”.[4]

The active externalism we present in the following section most closely
resembles the work of Wu et al. (2022), where they train the language
model to use an external memory which is itself a non-differentiable
cache of previous key-value pairs. While Wu et al. believe the model
needs to be trained from scratch or at least fine-tuned to be able to
make sense of the extra retrieved tokens, we show that models trained
with ALiBi can make sense of these external key-value pairs innately.

### Definition

Our proposed method is a simple change to the self-attention mechanism.
In addition to the causal self-attention integral to transformers, we
also allow each query token to attend to a fixed number of “external
memories”. The choice of which memories to attend to is made using
cosine similarity within each decoder layer and attention head. More
precisely, our attention computation is described by:

$$
\operatorname{softmax}\left(\frac{Q(K_{R}\oplus K_{L})^{T}}{\sqrt{d}}\right) \times \left(V_{R} \oplus V_{L}\right)
$$

Where $(K_{L}, V_{L})$ are key-value pairs from local context, and
$(K_{R}, V_{R})$ are key-value pairs from external memories, and
$\oplus$ refers to tensor concatenation. We mask the attention weights
such that each query token can only attend to its own retrieved keys,
and not those retrieved by previous or following query tokens. In the
experiments we present below we use models trained with linear biases
rather than positional encodings. When we apply these linear biases to
our attention weights, we assign the same index to all retrieved
memories.[5]

Importantly, **active externalism retrieves memories exactly** - it
doesn’t summarize or otherwise dampen memories except through the linear
biases.
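To make the attention computation above concrete, here is a minimal single-head NumPy sketch. It is an illustration under stated assumptions, not the released implementation: the function name `extended_attention`, the tensor shapes, and the choice to hand each query its own block of `k` retrieved key-value pairs are all hypothetical, and the linear biases are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def extended_attention(Q, K_L, V_L, K_R, V_R):
    """Single-head attention over retrieved memories and local context.

    Q, K_L, V_L: (n, d) queries, local keys, local values.
    K_R, V_R:    (n, k, d) the k key-value pairs retrieved for each query
                 (assumed layout; one block of memories per query token).
    Returns the (n, d) output and the (n, k + n) attention weights.
    """
    n, d = Q.shape
    k = K_R.shape[1]

    # Local scores with the usual causal mask.
    local = Q @ K_L.T / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))
    local = np.where(causal, local, -np.inf)

    # Memory scores: query i is only scored against its own retrieved keys,
    # which is equivalent to masking out every other query's memories.
    memory = np.einsum("nd,nkd->nk", Q, K_R) / np.sqrt(d)

    # Concatenate (K_R ⊕ K_L) scores and normalize jointly.
    weights = softmax(np.concatenate([memory, local], axis=-1))

    # Weighted sum over (V_R ⊕ V_L).
    out = np.einsum("nk,nkd->nd", weights[:, :k], V_R) + weights[:, k:] @ V_L
    return out, weights
```

Because the softmax is taken over the concatenated scores, retrieved memories compete with local tokens for attention mass rather than being mixed in afterwards.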

We generate the external memories (key-value pairs) once, and then pass
the representations to each decoder layer in an analogous fashion to
passing previous “cached” key-values[6]. In order to speed up the top-k
cosine similarity computation we can use a vector database designed
exactly for this purpose[7].

We argue that this way of attending to external memories or beliefs is
the natural and optimal generalization of methods like RAG, and closely
mimics the kind of relationship Otto has with his notebook. The
information is constantly and immediately accessible, automatically
endorsed, and reliably referenced. We always reference our external
memories (for every generated token, within all decoder layers), but
discard keys which don’t meet some low similarity threshold (we find
0.25 to be a good choice) to avoid confusing the model with irrelevant
information.
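The selection step described above can be sketched as a brute-force top-k search followed by a threshold. The function name `retrieve_topk` and its return layout are assumptions made for illustration; only the use of cosine similarity, the per-query top-k, and the 0.25 threshold come from the text.

```python
import numpy as np


def retrieve_topk(queries, memory_keys, topk, threshold=0.25):
    """Pick the top-k most similar memory keys for each query.

    queries:     (n, d) query vectors for one layer and head.
    memory_keys: (m, d) cached external keys for the same layer and head.
    Returns (indices, similarities, keep), each of shape (n, topk), where
    keep is False for retrieved keys below the similarity threshold.
    """
    q = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    m = memory_keys / np.linalg.norm(memory_keys, axis=-1, keepdims=True)
    sims = q @ m.T  # cosine similarity, shape (n, m)

    topk = min(topk, memory_keys.shape[0])
    indices = np.argsort(-sims, axis=-1)[:, :topk]
    similarities = np.take_along_axis(sims, indices, axis=-1)

    # Keys below the threshold are retrieved but then masked out.
    keep = similarities >= threshold
    return indices, similarities, keep
```

In practice the exhaustive `argsort` would be swapped for an approximate nearest-neighbor index once the number of memories grows large.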

Active externalism is not conceptually difficult to implement, but does
require getting familiar with a particular model’s implementation since
details like the way key-value pairs are stored and read into the
self-attention computation need to be hijacked.

## Benchmark Results

### Perplexity Experiments

We use perplexity as a metric for our model’s performance with and
without active externalism. Perplexity is a measure of uncertainty of
the model over each generated token, closely related to our cross
entropy loss function. For a full explanation of perplexity as a metric,
we suggest checking out this excellent
[post](https://thegradient.pub/understanding-evaluation-metrics-for-language-models/).
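For readers who prefer code to prose, perplexity is the exponential of the average negative log-likelihood per token. The helper below is a small sketch; the name `perplexity` and the input format (a list of natural-log token probabilities) are our own choices.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that spreads its probability uniformly over four candidate tokens at every step has a perplexity of four, which is why the metric is often read as an effective branching factor.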

We show results below for perplexity experiments using Mosaic’s MPT-7b
model. We use a stride of 512 tokens in our perplexity experiments,
meaning each token is conditioned on at least 512 previous tokens, given
that there are indeed 512 tokens to condition on.
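One way to realize this kind of strided evaluation is sketched below. We assume here that `stride` means the minimum amount of preceding context kept for every scored token, as in the sentence above; some libraries instead use "stride" for the step between windows, so treat the parameter naming as an assumption.

```python
def strided_windows(seq_len, max_length, stride):
    """Split a sequence into overlapping evaluation windows.

    Returns a list of (begin, end, num_scored) tuples. The model sees
    tokens[begin:end], but only the last num_scored tokens contribute to
    the loss, so every token is scored exactly once and (after the first
    window) with at least `stride` tokens of preceding context.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows, prev_end, begin = [], 0, 0
    while prev_end < seq_len:
        end = min(begin + max_length, seq_len)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        begin += step
    return windows
```

With a 2048-token window and a stride of 512, each new window therefore advances by 1536 tokens and re-reads the last 512 tokens of the previous one as context only.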

Our active externalism method batches each sequence into chunks of
increasing length (x-axis), and attends to tokens previous to the last
2048 (max sequence length) as external memories. We show results for
varying k, where k is the number of memories we retrieve per query
token. We compare active externalism to two baseline methods. The
“truncated” baseline simply throws out any tokens previous to the last
2048 during perplexity computations, and the “naive” baseline uses
all input-length tokens, no matter how long the sequences become.
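The three setups differ only in what they do with tokens that fall outside the last 2048 positions. A tiny sketch, with the hypothetical helper name `split_context`, makes the bookkeeping explicit.

```python
def split_context(token_ids, max_seq_len=2048):
    """Split a token list into (external memory, local context).

    Local context is the last max_seq_len tokens; everything before it
    becomes external memory. The truncated baseline discards the memory,
    the naive baseline feeds memory + local context as one long input,
    and active externalism attends to the memory via top-k retrieval.
    """
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be positive")
    memory = token_ids[:-max_seq_len]
    local = token_ids[-max_seq_len:]
    return memory, local
```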

In the case of the naive method, we observe exactly the phenomenon
active externalism seeks to ameliorate: once sequences exceed 2-3k
tokens, the performance quickly drops off (in this case, perplexity
blows up).

<figure>
<img
src="https://storage.googleapis.com/normal-blog-artifacts/extended-mind-transformers/naive.png"
alt="Perplexity results for Naive and Active Externalism Methods, using MPT-7b and a stride length of 512 tokens. Documents are batched into lengths of “Input Length” and we report average PPL on Y-Axis." />
<figcaption aria-hidden="true">Perplexity results for Naive and Active
Externalism Methods, using MPT-7b and a stride length of 512 tokens.
Documents are batched into lengths of “Input Length” and we report
average PPL on Y-Axis.</figcaption>
</figure>

In the case of the truncated baseline, we can see that active
externalism provides clear benefits over simply doing local attention.
Even more exciting, perplexity continues to decrease as we increase the
number of retrieved memories per query token.

<figure>
<img
src="https://storage.googleapis.com/normal-blog-artifacts/extended-mind-transformers/truncated.png"
alt="Perplexity results for Truncated and Active Externalism Methods, using MPT-7b and a stride length of 512 tokens. Documents are batched into lengths of “Input Length” and we report average PPL on Y-Axis." />
<figcaption aria-hidden="true">Perplexity results for Truncated and
Active Externalism Methods, using MPT-7b and a stride length of 512
tokens. Documents are batched into lengths of “Input Length” and we
report average PPL on Y-Axis.</figcaption>
</figure>

### Retrieval Experiments

We also measure performance on retrieval benchmarks, and compare with
RAG and simple baselines. Our dataset is a modified version of the
recently released [Long context WikiQA
benchmark](https://huggingface.co/datasets/abacusai/WikiQA-Free_Form_QA)
from Abacus.AI.

Our goal is to measure retrieval abilities over varying document
lengths, but we also want to control for facts memorized during
training, so we edit the dataset by changing the labelled answers to
realistic but wrong answers. For example, we replace every instance of
“Lee Hazlewood” with “Terry Allen” in the Wikipedia entry for the song
“These Boots Are Made for Walkin’”, and then ask the model to produce
the songwriter’s name, with the **correct** answer now being “Terry
Allen”. Our intention is to measure the model’s ability to prioritize
in-context or in-memory facts over those it memorized during training.
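The dataset edit amounts to a string substitution plus a relabel. The sketch below assumes each example is a dict with `document`, `question`, and `answer` fields; these field names and the helper `make_counterfactual` are hypothetical and do not reflect the benchmark's actual schema.

```python
def make_counterfactual(example, new_answer):
    """Swap the labelled answer for a realistic but wrong one.

    Every occurrence of the original answer in the document is replaced,
    and the new answer becomes the label the model is scored against.
    """
    old_answer = example["answer"]
    return {
        "document": example["document"].replace(old_answer, new_answer),
        "question": example["question"],
        "answer": new_answer,
    }
```

A model that answers from its weights will keep producing the original name, while a model that actually reads the document (or its memories) will produce the substituted one.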

Again, we feel this is an important ability if we’re asking LLMs to be
reasoning agents in an evolving world. In the results below, baseline
receives no context at all for the question (we ask it point-blank), RAG
selects the best ~2-3k tokens out of the document to include in-context,
and active externalism puts the entire document in memory and uses it as
Otto uses his notebook.
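For contrast, the RAG baseline's selection step can be sketched as follows. This is a toy version: the bag-of-words `embed` function stands in for a real sentence-embedding model, and the names `chunk_words` and `rag_select` and their defaults are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (a stand-in for a sentence embedder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_words(words, size):
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def rag_select(document, query, chunk_size=50, budget=3):
    """Return the `budget` chunks most similar to the query, in document order."""
    chunks = chunk_words(document.split(), chunk_size)
    query_vec = embed(query)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: cosine(embed(chunks[i]), query_vec),
        reverse=True,
    )
    return [chunks[i] for i in sorted(ranked[:budget])]
```

Note that the decision about what reaches the prompt is made once, up front, by an embedding that is separate from the language model, which is the design choice active externalism moves away from.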

<figure>
<img
src="https://storage.googleapis.com/normal-blog-artifacts/extended-mind-transformers/retrieval.png"
alt="Retrieval Benchmark Results, by Document Length" />
<figcaption aria-hidden="true">Retrieval Benchmark Results, by Document
Length</figcaption>
</figure>

We see that while RAG methods drop off with input length, active
externalism continues to be effective. While models fine tuned to use
longer contexts do currently outperform active externalism on some
long-range retrieval tasks, active externalism appears to be a more
effective way to do retrieval over long contexts for smaller models.

Where active externalism clearly outperforms RAG in large models is
precisely where the model has [memorized facts before
overfitting](https://arxiv.org/pdf/2205.10770.pdf). That is, the model’s
weights encode factual information even as the model’s performance on
test data (usually measured by cross-entropy) continues to improve.
Depending on your application, this could be seen as a strength or a
shortcoming. Certainly when we use LLMs as reasoning agents, this is a
shortcoming.

Using active externalism also appears to eliminate some reliance on
prompting. Whereas usually we’d need to include some examples of the
kind of responses we hope to observe in the prompt (or use a “chat”
model which has been RLHF’ed), we observe experimentally that this isn’t
necessary when using active externalism.

## Impact on reasoning engine

We discuss two important consequences of active externalism on the
LLM’s ability as a reasoning agent: uncertainty awareness and
abstraction levers.

If we prompt the model with a question which it’s unsure about (unsure
in an epistemic way, i.e. the model didn’t observe this fact during
training/can’t infer from the context), it may not respond in a way
which is transparent about that uncertainty. We call this a
hallucination[8]. However, as we increase the number of retrieved
memories each query token is allowed to attend to, the model generates
answers which approach the correct answer and then stabilize. This
evolution of generations signals the model’s original uncertainty,
and we even qualitatively observe the model express its own uncertainty
more using active externalism. Let’s check out an example. We first load
the model, and pass a paragraph from Wikipedia’s entry on Grothendieck
as external memories. We use a smaller text here for a speedy
demonstration.

[1] **RAG**, the most popular method for tackling the short context
length of LLMs in application settings, attempts to identify the most
salient information in a long text for a given query or task, such that
the long context can be cut down to “fit in memory”. This is
accomplished using a choice of sentence embedding which is usually
external to the model, chunking the long text and comparing with the
query vector using a similarity or distance metric. Many [open sourced
projects](https://python.langchain.com/docs/integrations/retrievers)
have made implementing such a strategy easier, and the success of
[“vector
databases”](https://www.forbes.com/sites/adrianbridgwater/2023/05/19/the-rise-of-vector-databases/?sh=4472652914a6)
demonstrates the rapid adoption of such methods. However, this method
has some glaring shortcomings as well. The logistics of doing retrieval
external to the model make system design clunky, and infeasible for
enterprises concerned about dumping data into a temporary hosted store.
The tradeoff between chunk size and accuracy is hard to tune, as is the
choice of embedding. Perhaps most important: this method measures for
“salience” once on the least granular representations of the data, using
representations distinct from our model’s. More explicitly, **the system
reasoning over the data should be the same system deciding which data is
important for the task**.

[2] Clark and Chalmers (1998)

[3] Although there’s no technical reason we can’t throw an arbitrarily
long sequence into context, performance using today’s models will drop
off quickly after we exceed the sequence length the model saw during
training. This inability to generalize is largely due to the use of
positional embeddings. While originally (in Vaswani et al. (2023)) only
applied once at the beginning of the encoder/decoder stack, in today’s
GPT-style transformers positional encodings are usually incorporated at
the bottom of each decoder layer. These are unique constants which are
either added or multiplied to hidden states in order to encode the index
of each token in the sequence. Unless the model is trained further to
expect a wider range of positional values, these new tokens quickly
become out of distribution. Folks at
[Mosaic](https://www.mosaicml.com/blog/mpt-7b) have combatted this by
using attention with linear biases (as presented by Press, Smith, and
Lewis (2022)) instead of unique positional encodings, but more usually
folks fine tune the model on longer sequences. Fine tuning can be a
non-trivial undertaking, especially for truly large language models, and
even given an infinitely long context, faithfully retrieving facts from
very long sequences remains a challenge. Recent experiments show that
models still struggle to use all the information provided in the larger
context window - often forgetting things in the middle in particular, as
they show in Liu et al. (2023).

[4] The architecture described in Martins, Marinho, and Martins (2022)
continuously compresses long text inputs such that the text always fits
in memory. This has the obvious advantage of supporting input sequences
of “infinite” length, but the weakness of summarizing the past such that
it necessarily contains less detail. A coarse-grained/RAG analog to this
might be using the language model itself to iteratively summarize past
inputs and then passing the summary into context. In Sukhbaatar et al.
(2019), the authors suggest replacing the feed-forward mechanism in each
decoder layer with another attention block, and interpret this “unified
mechanism” as an aggregation of global and contextual information. The
creative contributors in Burtsev et al. (2021) propose introducing a
`[mem]` token which they hope the model will learn to leverage as space
for storing global information. They implement various decoder
architectures which attempt to enforce this with varying strictness.

[5] I.e., the model interprets those retrieved memories as being the
same constant distance away from the tokens it considers local context.
For simplicity’s sake, we choose this constant index to be that directly
following the last in-context index. I.e. if we pass the model a
sequence of 1200 tokens, the memories in context will all be assigned
position 1201. Certainly there’s room to experiment here - for instance
you might choose to bias weights closer to the beginning of the memories
more than those toward the end - but we find this is a reasonable and
effective choice. We hypothesize that these methods will be effective
for models trained with relative positional encodings as well, and will
pursue this end in future work.

[6] A popular mechanism for speeding up inference, as a GPT-style
transformer’s output only depends on the previous inputs.

[7] We support using [FAISS](https://github.com/facebookresearch/faiss)
in our implementation

[8] see our [recent
post](https://blog.normalcomputing.ai/posts/2023-10-20-explainability/explainability.html)
for more

```python
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

# Text to store as external memories.
wikipedia = """Alexander Grothendieck (/ˈɡroʊtəndiːk/; German pronunciation: [ˌalɛˈksandɐ ˈɡʁoːtn̩ˌdiːk] (listen); French: [ɡʁɔtɛndik]; 28 March 1928 – 13 November 2014) was a stateless (and then, since 1971, French) mathematician who became the leading figure in the creation of modern algebraic geometry.[7][8] His research extended the scope of the field and added elements of commutative algebra, homological algebra, sheaf theory, and category theory to its foundations, while his so-called "relative" perspective led to revolutionary advances in many areas of pure mathematics.[7][9] He is considered by many to be the greatest mathematician of the twentieth century.[10][11]

Grothendieck began his productive and public career as a mathematician in 1949. In 1958, he was appointed a research professor at the Institut des hautes études scientifiques (IHÉS) and remained there until 1970, when, driven by personal and political convictions, he left following a dispute over military funding. He received the Fields Medal in 1966 for advances in algebraic geometry, homological algebra, and K-theory.[12] He later became professor at the University of Montpellier[1] and, while still producing relevant mathematical work, he withdrew from the mathematical community and devoted himself to political and religious pursuits (first Buddhism and later, a more Christian vision).[13] In 1991, he moved to the French village of Lasserre in the Pyrenees, where he lived in seclusion, still working tirelessly on mathematics and his philosophical and religious thoughts until his death in 2014.[14]
"""

# Tokenize the memories and hand them to the model at load time.
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
memory_ids = tokenizer(wikipedia, return_tensors='pt')['input_ids']

model = AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-mpt-7b", external_memories=memory_ids, trust_remote_code=True)
```

Now, let’s ask the model a question we know is answered (albeit a little
obscurely) in the above paragraph without using active externalism. We
can achieve this by setting the parameter
`model.use_active_externalism = False` or simply passing `topk=0`. Hint:
the correct answer is 1971.

```python
prompt = "When did Alexander Grothendieck get his French citizenship?"
input_ids = tokenizer(prompt, return_tensors='pt')['input_ids']

# topk=0 disables retrieval, so this is the model's unaided answer.
out = model.generate(input_ids, max_length=input_ids.size(-1)+50, topk=0)
print('Baseline Generation: ', tokenizer.decode(out[0]))
```

Now let’s enable active externalism, slowly cranking up the number of
memories each query token is allowed to attend to using the `topk`
parameter.

```python
# Decode only the newly generated tokens (everything after the prompt).
out = model.generate(input_ids, max_length=input_ids.size(-1)+15, topk=5)
print('Generation for k=5: ', tokenizer.decode(out[0][input_ids.size(-1):]).strip())

out = model.generate(input_ids, max_length=input_ids.size(-1)+15, topk=6)
print('Generation for k=6: ', tokenizer.decode(out[0][input_ids.size(-1):]).strip())

out = model.generate(input_ids, max_length=input_ids.size(-1)+15, topk=7)
print('Generation for k=7: ', tokenizer.decode(out[0][input_ids.size(-1):]).strip())

out = model.generate(input_ids, max_length=input_ids.size(-1)+15, topk=8)
print('Generation for k=8: ', tokenizer.decode(out[0][input_ids.size(-1):]).strip())

out = model.generate(input_ids, max_length=input_ids.size(-1)+20, topk=30)
print('Generation for k=30: ', tokenizer.decode(out[0][input_ids.size(-1):]).strip())
```

Not only did the model produce the correct answer, but it also expressed
increasing certainty about its answer.

In cases where the model is certain about the answer, the generations
are stable as we increase k over external context.

```python
prompt = "What was Alexander Grothendieck's profession?"
input_ids = tokenizer(prompt, return_tensors='pt')['input_ids']

out = model.generate(input_ids, max_length=input_ids.size(-1)+25, topk=0)
print('Baseline Generation: ', tokenizer.decode(out[0][input_ids.size(-1):]).strip())

out = model.generate(input_ids, max_length=input_ids.size(-1)+15, topk=2)
print('Generation for k=2: ', tokenizer.decode(out[0][input_ids.size(-1):]).strip())

out = model.generate(input_ids, max_length=input_ids.size(-1)+15, topk=8)
print('Generation for k=8: ', tokenizer.decode(out[0][input_ids.size(-1):]).strip())
```
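One crude way to turn this "does the answer stabilize as k grows?" observation into a number is to find the smallest `topk` from which the generation stops changing. The helper below is a hypothetical illustration only: `stabilization_point` is not part of the released model code, and it is not the metric under development at Normal.

```python
def stabilization_point(generations):
    """Smallest k from which all larger k give the same generation.

    generations: dict mapping topk value -> generated answer string.
    A small return value suggests the model was already certain; a large
    one suggests the answer only settled once more memories were used.
    """
    if not generations:
        raise ValueError("need at least one generation")
    ks = sorted(generations)
    final_answer = generations[ks[-1]]
    stable_from = ks[-1]
    # Walk down from the largest k until the answer differs.
    for k in reversed(ks):
        if generations[k] != final_answer:
            break
        stable_from = k
    return stable_from
```

Exact string matching is a blunt comparison; in practice one would likely normalize the text or compare extracted answers instead.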

A natural extension of this principle might look like the development of
a metric based on similarity or attention weight which could communicate
this uncertainty in a more compact form, work currently under
development at Normal.

The parameter `topk` also serves as a useful lever for the level of
abstraction in the model’s output, i.e. the extent to which we’d like
the model to synthesize the memories vs. quote verbatim from the source.
We see this clearly in question answering tasks over code. We show an
example using the chat model here, which is best equipped to handle more
free-form question answering tasks.

```python
# Source code to store as external memories.
code_snippet = """def sieve_of_eratosthenes(limit):
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    primes = []

    for current in range(2, int(limit**0.5) + 1):
        if sieve[current]:
            primes.append(current)
            for multiple in range(current*current, limit + 1, current):
                sieve[multiple] = False

    for num in range(int(limit**0.5) + 1, limit + 1):
        if sieve[num]:
            primes.append(num)

    return primes
"""
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
memory_ids = tokenizer(code_snippet, return_tensors='pt')['input_ids']

model = AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-mpt-7b-chat", external_memories=memory_ids, trust_remote_code=True)
```

We ask the model to recall what our function does, first with a small
`topk`.

```python
prompt = "What does the function sieve_of_eratosthenes do?"
input_ids = tokenizer(prompt, return_tensors='pt')['input_ids']

out = model.generate(input_ids, max_length=input_ids.size(-1)+100, topk=2)
print(tokenizer.decode(out[0]))
```

We see that with a small `topk` the model abstracts away the details
from the code, providing example usage and then a natural language
description of what the code does. Now let’s try with a larger `topk`.

```python
out = model.generate(input_ids, max_length=input_ids.size(-1)+100, topk=14)
print(tokenizer.decode(out[0]))
```

Now the model outputs almost verbatim from the code. This kind of
nuanced stylistic choice is very hard to achieve using naive prompting
and RAG methods without developing many point solutions specific to the
data and prompt. More importantly, these kinds of experiments give us
small clues into how the model actually reasons over these key-value
pairs. At Normal, we hope to combine work on mechanistic
interpretability methods with extended mind transformers, building a
unified system for understanding how models store facts and reason over
them.

## Explainability

Clark and Chalmers write in their paper: “By embracing an active
externalism, we allow a more natural explanation of all sorts of
actions”, and indeed this is true for our active externalism as well.
Using attention weights, we can highlight which memories were used
during each generation step. Here we highlight the memories used when
generating the correct token “1971”. Since we retrieve memories per
layer, per head, we display the mode, i.e. the memories retrieved most
often across layers and heads.
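Aggregating per-layer, per-head retrievals into a single highlight can be as simple as counting. The sketch below assumes the retrieved memory indices for one generation step are available as a nested list `retrieved[layer][head]`; that layout and the helper name `most_retrieved` are assumptions, not the model's actual output format.

```python
from collections import Counter


def most_retrieved(retrieved, top_n=5):
    """Rank memory token indices by how often they were retrieved.

    retrieved: nested list where retrieved[layer][head] holds the memory
    indices selected by that head for a single generation step.
    Returns up to top_n (memory_index, count) pairs, most frequent first.
    """
    counts = Counter(
        index
        for layer in retrieved
        for head in layer
        for index in head
    )
    return counts.most_common(top_n)
```

Mapping the returned indices back through the tokenizer gives the highlighted tokens shown in a figure like the one below.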

<figure>
<img
src="https://storage.googleapis.com/normal-blog-artifacts/extended-mind-transformers/explainability.png"
alt="Tokens retrieved during the generation of token “1971”" />
<figcaption aria-hidden="true">Tokens retrieved during the generation of
token “1971”</figcaption>
</figure>

Simple methods like this are just the beginning, but granular citations,
and in fact causal citations of any kind, are currently impossible using
methods like RAG. The best we can get is highlighting those sections
which were chosen to include in context. Using self-attention weights
can buy you something, but this is unwieldy data and its explanatory
power has been [questioned](https://arxiv.org/abs/1902.10186).

## Active Externalism for the Enterprise

As a last note, active externalism is clearly preferable for enterprise
solutions where the data and model are colocated. Inference is a
one-stop shop - a single call to the model. No need for tricky
fine-tuning jobs, and certainly no need to send your data to an
externally hosted vector database. This eradicates the potential for
data leakage linked to those methods, while also directly eliminating
any extra latency associated with sending data to and from a hosted
service.

## Creating external memories

There are many interesting hyperparameters to discuss related to active
externalism: the role model size plays, what using active externalism on
a subset of decoder layers might look like, and alternative masking
strategies, just to name a few. We’ll leave most of the discussion for
more technical forthcoming papers. However, we felt it was important to
mention briefly the hyperparameters used in generating the external
memories. We create our external memories (at each layer) by passing
those external contexts through our model, just like inference. Then we
save those internal representations the model generated, and attend to
them later. If our external memories are longer than the model’s maximum
sequence length, we’ll usually want to generate our representations
using a stride. This ensures that all tokens are conditioned on at least
stride-length number of previous tokens. Intuitively, all our memories
will have “seen” some reasonable amount of context. However, there are
situations where increased context may not be aligned to the model’s
representation of the data. For instance, representations of numerical
or log-type data may benefit from using a smaller sequence or stride
length.

## Summary

At Normal, we believe that there remains a wealth of opportunity to
uncover by approaching today’s fractured, albeit proliferative,
Enterprise AI landscape from a first principles point of view – even,
and arguably especially, where early consensus has begun to form. We
strongly believe that interdisciplinary perspectives and research are
essential for advancing the field, a fundamentally and historically
cross-sectional and constantly evolving discipline.

In “The Extended Mind” Clark and Chalmers conjecture: “In the distant
future we may be able to plug various modules into our brain to help us
out: a module for extra short-term memory when we need it.”

While this remains a distant goal for humans, we propose a method for
achieving exactly this kind of short-term memory boost for LLMs. We’ve
shown how a simple and natural extension of the self-attention mechanism
for LLMs enables SoTA performance on retrieval tasks over long
documents, uncertainty awareness, abstraction levers, and granular
explainability, and perhaps even gives us some insight into the way
these models reason internally.

## What’s next

The results in this blog were all generated using the Mosaic MPT model.
One of the challenges to extending the work to newer models like Llama2
is that they fundamentally change the way that the prompts are encoded.
MPT uses [ALiBi](https://arxiv.org/abs/2108.12409) encodings in place of
traditional positional encodings. Llama2, on the other hand, uses
[RoPE](https://arxiv.org/abs/2104.09864) encodings. While this may seem
like a relatively small change, it is important both in terms of model
behavior and in the specifics of how the attention mechanism needs to be
modified to incorporate its extended memory.
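To see why the choice of positional scheme matters here, recall that ALiBi only adds a distance-dependent penalty to attention scores, so a retrieved memory can be handled by picking a distance for it. The sketch below follows the slope schedule from the ALiBi paper for a power-of-two number of heads; the helper `memory_bias` and the idea of passing the shared memory distance as a parameter are our own illustrative assumptions.

```python
def alibi_slopes(num_heads):
    """Per-head slopes from the ALiBi paper (num_heads a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Causal ALiBi bias: row i holds -slope * (i - j) for keys j <= i."""
    return [[-slope * (i - j) for j in range(i + 1)] for i in range(seq_len)]


def memory_bias(slope, distance):
    """Bias for retrieved memories, which all share one assumed distance."""
    return -slope * distance
```

RoPE, by contrast, rotates the query and key vectors themselves as a function of position, so cached keys already carry positional information, and it is less obvious what position a retrieved memory should be given.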

We also think it would be worthwhile to investigate how the memory
mechanism in the EMT can be used to facilitate model explainability.
Early experiments show that the memories that are attended to are
relevant to the outcome, but we want to delve deeper!

And finally, we need comprehensive benchmarks to understand fully the
situations where the EMT is preferable to in-context learning or RAG.

## References

Buchen, Patrick. 2018. “In Defense of Otto and His Extended Mind.” 2018.
<https://medium.com/@pnbuchen/in-defense-of-otto-and-his-extended-mind-9786db756f2d>.

Burtsev, Mikhail S., Yuri Kuratov, Anton Peganov, and Grigory V.
Sapunov. 2021. “Memory Transformer.” <https://arxiv.org/abs/2006.11527>.

Clark, Andy, and David Chalmers. 1998. “The Extended Mind.” *Analysis*
58 (1): 7–19. <http://www.jstor.org/stable/3328150>.

Liu, Nelson F., Kevin Lin, John Hewitt, Ashwin Paranjape, Michele
Bevilacqua, Fabio Petroni, and Percy Liang. 2023. “Lost in the Middle:
How Language Models Use Long Contexts.”
<https://arxiv.org/abs/2307.03172>.

Martins, Pedro Henrique, Zita Marinho, and André F. T. Martins. 2022.
“$\infty$-Former: Infinite Memory Transformer.”
<https://arxiv.org/abs/2109.00301>.

Press, Ofir, Noah A. Smith, and Mike Lewis. 2022. “Train Short, Test
Long: Attention with Linear Biases Enables Input Length Extrapolation.”
<https://arxiv.org/abs/2108.12409>.

Sukhbaatar, Sainbayar, Edouard Grave, Guillaume Lample, Herve Jegou, and
Armand Joulin. 2019. “Augmenting Self-Attention with Persistent Memory.”
<https://arxiv.org/abs/1907.01470>.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023.
“Attention Is All You Need.” <https://arxiv.org/abs/1706.03762>.

Wu, Yuhuai, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy.
2022. “Memorizing Transformers.” <https://arxiv.org/abs/2203.08913>.