Explainable Language Models: Existing and Novel Approaches

We review key aspects of explainability for language models and introduce some Normal innovations.

Sam Duffield, Arun Patro, and Phoebe Klett


October 20, 2023

Explainability is a much discussed topic in machine learning, and even more so with the rise of large language models (LLMs) (since their primary use is to explain things). At Normal, we are enabling AI agentic systems that must not only produce coherent outputs for a task, but also explain themselves and their reasoning.

This is a challenging but important task, as we build the capabilities to solve higher-level and more complex end-to-end human tasks (“The Act Two” of GenAI). Solving these intricate problems depends critically on reliable and auditable reasoning. This has begun to enable product form factors to evolve from the individual to system-level productivity and from human-in-the-loop to (hybrid) execution-oriented agentic systems, as in our enterprise full-stack platform at Normal.

In this post, we will review and examine some explainability techniques for language models that we’ve found successful, including our own innovations at Normal (some unsurprisingly involving probabilistic machine learning), which we use to push the frontier of reliability and reasoning in AI. We will endeavor to explain intuition but also outline some times when intuition can be deceiving in practice.

A good place to start is to define what we even mean when we discuss explainability in the context of statistics and machine learning. Many different definitions have been proposed and used, sometimes with a distinction between “explainability” and “interpretability”. Here, we will not distinguish between the two and use both to mean the ability to present in understandable terms to a human, following Doshi-Velez and Kim in Towards A Rigorous Science of Interpretable Machine Learning.

On our tour of techniques we will highlight those that we are particularly excited about—including a novel supervised hallucination detection method, transparent reasoning through tree structured thought processes, and combining generation with efficient retrieval for large contexts (beyond RAG).

Some of these techniques will be further outlined in coming blog posts and demos, so keep your eyes peeled!

The Two Cultures

Over two decades ago Leo Breiman controversially examined the two cultures of classical statistics and machine learning (in his terms “data models” and “algorithmic models”). He viewed the two as fundamentally different in their approach to modelling, with the former focusing on bespoke models with interpretable parameters and the latter favouring predictive performance above all else. He heavily argued in favour of a more prominent uptake of the latter approach (machine learning).

“Data modeling” vs “Algorithmic Modeling” according to Breiman in 2001 — I think the population proportions might be a bit different now…
On the left is a simple classical model with a small number of parameters which all have an intuitive interpretation. On the right is a modern, large model with mostly opaque parameters, where the only focus is on how it behaves externally.

It’s fair to say Breiman got his wish, with prediction-focused models achieving ever more impressive feats, like throwing their own parties. This is no shame on classical statistical models. In settings where we understand the underlying process, or desire said understanding, they are still the natural go-to. In settings where we care (almost solely) about predictive performance (say, producing coherent natural language tokens), we can use larger and more flexible models.

The cost of focussing purely on prediction is that we lose model explainability, and at a fundamental level.

This doesn’t mean we cannot seek explainability in our modelling process; it just means we have to approach explainability from a different perspective. Denoising diffusion models (which have revolutionised generative image synthesis) achieve a level of explainability by changing the problem. By fully specifying a noising process, we can then use a machine learning model to learn its reversal. Although the model architecture may still lack explainability, we have a complete understanding of the noising process and therefore a better understanding of how to use the model (including for cool things like partial denoising and image interpolation).

In the context of language models, we can seek explainability as we might with a human [1]. In particular, we can ask the model to explain itself by

  1. Querying how confident it is in its output
  2. Detailing its reasoning
  3. Referencing its sources

The rest of this post will be dedicated to exploring some natural ideas to achieve these goals. We will also highlight some of the pitfalls and challenges that we have encountered along the way.

A collection of techniques that we will not touch upon here come under the umbrella of more trustable output. These include techniques like editing factual knowledge and guided generation (which, if you did not know, is blazingly fast in Outlines 〰️). These techniques are tangentially related to explainability through enforcing some guarantees on the model output. Here, though, we will focus on techniques that augment the output with more information relating to explainability.

If you want more on the separation of the two cultures, then Boaz Barak’s blog plus the stories and references within are a great read.

Hallucinations and Uncertainty 🤷

One of the most compelling motivations for better explainability in language models is to understand hallucinations. Language models behave quite differently to humans when they don’t know something. If you ask a (reasonable) human a difficult question they might well answer with “I don’t know” or “I don’t know but this is my best guess” or “I don’t know but here’s something I do know that is relevant”. Language models on the other hand will typically present an incorrect response (a hallucination) with a convincing argument.

Nice try ChatGPT but “whiskers” is not a 7 letter word and I won’t get started on the explanation…

This behaviour is primarily a result of the model’s training data containing very few examples of “I don’t know”s — for example, you don’t find many “I don’t know”s on Wikipedia. This could potentially be remedied at great expense by training or fine-tuning on heavily curated datasets.

Token-level probabilities represent the sequential distributions that the model samples from. Tokens are the model’s representation of characters and words; a token vocabulary includes all characters and the most common words. Here Llama-2-7b has tokens “Cris” and “Lion” but not “Cristiano” and “Lionel”, which would be constructed by combining two or more tokens.

A simpler idea is to look at the token-level probability distributions provided by the model. If the model is hallucinating, we might expect to see a large number of tokens generated from high-entropy [2] distributions. However, this is not necessarily the case. There are many very natural instances when the token generations might be uncertain without hallucinating, such as when multiple synonyms are possible, when the token begins a new sentence, or indeed for subjective completions (such as “The best footballer of all time is …”).
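To make the entropy measure concrete, here is a minimal sketch that computes the Shannon entropy of a single next-token distribution. The two distributions are made up for illustration (a four-token vocabulary); in practice they would come from the model's softmax output.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    # Zero-probability tokens contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-token distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]  # one token dominates
uncertain = [0.25, 0.25, 0.25, 0.25]  # e.g. four equally plausible synonyms

print(token_entropy(confident))
print(token_entropy(uncertain))
```

A uniform distribution gives the largest possible entropy for its vocabulary size, whether the spread reflects a hallucination or just several equally good synonyms. That ambiguity is exactly why entropy alone is an unreliable hallucination signal.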

There are many other measures of uncertainty that can be used at either token-level or sequence-level. A particularly compelling idea is that of semantic entropy, which measures the uncertainty of the generating distribution over meanings. Meaning is determined by a small helper model assessing an ensemble of generated sequences. Importantly, the helper model does not need to know all of the context of the generative model [3]; it only needs to be able to detect when two sequences have the same meaning. Calculating semantic entropy can be computationally expensive: it requires the generation of \(M\) sequences and then up to around \(M^2\) comparisons [4], although this can mostly be parallelised and is generally dominated by the calls to the generative model.
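The clustering step behind semantic entropy can be sketched as follows. This is a simplified version that estimates the entropy from empirical cluster frequencies rather than from sequence likelihoods. The `same_meaning` argument stands in for the helper model, and `toy_same_meaning` is a purely hypothetical string-matching substitute.

```python
import math

def semantic_entropy(sequences, same_meaning):
    """Group sequences by meaning, then take the entropy of the group frequencies."""
    clusters = []  # each cluster holds sequences judged to share one meaning
    for seq in sequences:
        for cluster in clusters:
            # Compare against a single representative of each existing cluster.
            if same_meaning(seq, cluster[0]):
                cluster.append(seq)
                break
        else:
            # No existing cluster matched, so this sequence starts a new meaning.
            clusters.append([seq])
    total = len(sequences)
    return -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)

def toy_same_meaning(a, b):
    # Hypothetical stand-in for the helper model: ignore case and a trailing full stop.
    return a.lower().rstrip(".") == b.lower().rstrip(".")

samples = ["Paris", "paris.", "Paris.", "Lyon"]
print(semantic_entropy(samples, toy_same_meaning))
```

Because each new sequence is only compared against one representative per cluster, the number of comparisons approaches the quadratic worst case only when every generation has a distinct meaning.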

Evidently, programmatically detecting when a model hallucinates is a tricky problem. An idea that applies to a single sequence generation is that of self-evaluation, where the model is itself asked to evaluate its own output. Interestingly, evidence suggests that this approach can work poorly when asked via natural language (context or prompts that induce a hallucination also confuse the model when acting as an evaluator). There is however potential in training a model with an additional confidence head corresponding to the probability of the output being correct. The hallucination detector can also be separated from the language model itself and trained independently, although it can still use internal model symptoms as features. Datasets for training hallucination detection classifiers are fairly easy to generate through simply inserting (and labelling) incorrect information into existing datasets.

Demo: Supervised Hallucination Detection 💑

Rather than training a bespoke model, we can also analyse uncertainty by comparing against alternative pre-trained language models — there are certainly a fair few out there!

Speculative decoding (which is coming soon to Outlines〰️) uses this idea to speed up sequence generation. In speculative decoding, a small draft model \(q\) is used for sequential generation whilst a bigger model \(p\) is used for fast parallel evaluation. Combining this with rejection sampling results in a nice speedup for sequence generation (from exactly the same distribution as would be generated by the bigger model).
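For a single draft token, the rejection sampling step can be sketched as below. It follows the accept-or-resample rule as it is commonly described for speculative decoding: accept with probability \(\min(1, p/q)\), otherwise resample from the normalised positive part of \(p - q\). The dictionary representation and the function name `speculative_step` are assumptions for illustration, not the Outlines API.

```python
import random

def speculative_step(token, p, q, rng=random):
    """Accept or replace one draft token.

    p, q: dicts mapping token -> probability under the bigger and draft
    models for the same context. `token` was sampled from q, so q[token] > 0.
    Returns (final_token, was_accepted).
    """
    accept_prob = min(1.0, p.get(token, 0.0) / q[token])
    if rng.random() < accept_prob:
        return token, True
    # On rejection, resample from the normalised residual max(0, p - q).
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    tokens = list(residual)
    weights = [residual[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False

# Hypothetical two-token example where the draft model is badly wrong.
p = {"a": 0.0, "b": 1.0}
q = {"a": 1.0, "b": 0.0}
print(speculative_step("a", p, q, random.Random(0)))
```

The resampling step is what keeps the final samples distributed according to the bigger model.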

At Normal, we adapted this concept for (supervised) hallucination detection. Instead of rejecting generations, we can report each token’s rejection probability [5] as a probability of hallucination. In this case, the sequence is an unaltered generation from the smaller model \(q\). This approach is particularly nice as it can be applied to any model and does not require any additional training. You might well ask: why wouldn’t we just apply speculative decoding instead and generate better sequences? Well, the two are compatible! The smaller model could well generate using speculative decoding; the fact still remains that an even larger model can be used to detect hallucinations, post-hoc and in parallel, at little extra cost in execution time. However, unlike semantic entropy, this supervised approach does require the larger model to have access to the full context of the smaller model.

In the above we compare token-level uncertainty for two prompts sent to Llama-2-7b. We consider token-level entropy and the aforementioned supervised procedure, where Llama-2-13b was executed in parallel and the rejection probabilities reported. In the first prompt we see that Llama-2-7b’s favourite Harry Potter book is The Prisoner of Azkaban. The entropy metric demonstrates high uncertainty over “The” at the start of the output, representing the many plausible completions but certainly not a hallucination. The supervised approach remedies this and correctly identifies the only element of uncertainty in the completion: the subjectivity [6] of the “Pr” token (after which the rest of the sequence is fully determined). In the second prompt we see a genuine hallucination rather than subjective uncertainty. The smaller model incorrectly thinks that 1 is a prime number. This is highlighted as a hallucination by the larger model, whereas the entropy metric fails to identify higher uncertainty in this token than in subsequent more plausible tokens.
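The per-token score from footnote 5 is simple to compute once both models have assigned a probability to each generated token. The sketch below assumes those probabilities are already available as two aligned lists; the numbers are invented for illustration and are not taken from the Llama-2 experiment.

```python
def rejection_probabilities(p_probs, q_probs):
    """Per-token 1 - min(1, p/q), read as a hallucination score.

    p_probs[i]: probability the larger model p assigns to generated token i.
    q_probs[i]: probability the smaller model q assigned when sampling it (> 0).
    """
    return [1.0 - min(1.0, p / q) for p, q in zip(p_probs, q_probs)]

# Hypothetical three-token generation: the larger model disagrees on token 2.
q_probs = [0.9, 0.6, 0.8]
p_probs = [0.95, 0.06, 0.8]
print(rejection_probabilities(p_probs, q_probs))
```

A token scores zero whenever the larger model finds it at least as likely as the smaller model did, so the score is driven by disagreement between the two models and not by how spread out either distribution is.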

Supervised hallucination detection represents a particularly interesting avenue for reliable uncertainty quantification of language model outputs, particularly as the supervisory model can be a mixture-of-experts or an ensemble of models, where in this setting the uncertainty is averaged over different models and potentially even different training datasets.

Ability to Reason 🤔

Can we get a language model to demonstrate logic or reasoning? If we can, this will help us understand the model output and perhaps even improve its performance in complex reasoning tasks (e.g. word games, arithmetic or code generation).

A natural idea is to train or fine-tune the model on datasets that contain lots of explanations. If the training data contains lots of explanations then so will the model output. This idea is very general but requires training expense and heavily curated datasets (e.g. with human annotations).

An approach that can be applied directly to pre-trained models is scratchpad or chain-of-thoughts style prompting. The idea here is simple: we can prompt the model to provide reasoning by including some reasoned examples within the prompt, or even by asking it to “think step by step”.

Zero-shot chain-of-thoughts prompting taken from Kojima et al.
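Both prompting styles amount to string construction. The sketch below builds a zero-shot prompt ending in the “think step by step” trigger and a few-shot prompt that prepends reasoned examples. The `Q:`/`A:` layout and the function names are illustrative assumptions, not a required format.

```python
def zero_shot_cot_prompt(question):
    """Append a reasoning trigger so the model writes out its steps."""
    return f"Q: {question}\nA: Let's think step by step."

def few_shot_cot_prompt(question, examples):
    """Prepend (question, worked reasoning) pairs before the new question."""
    shots = "".join(f"Q: {q}\nA: {reasoning}\n\n" for q, reasoning in examples)
    return f"{shots}Q: {question}\nA:"

print(zero_shot_cot_prompt("What is 17 * 3?"))
```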

This technique has proved popular and successful. It has spawned extensions that use iteratively refined prompts via fun geometries like trees or graphs. A key addition to iterative prompting is some notion of a state evaluator that assesses the quality of any given single chain of thoughts for the problem at hand. The thought generator and state evaluator can then be sent through a beam search (or perhaps its probabilistic cousin, sequential Monte Carlo) to explore a range of possibilities and output a high quality chain of thoughts.
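A beam search over chains of thoughts needs only two callables: a thought generator and a state evaluator. The sketch below uses toy stand-ins for both, where thoughts are integers and the evaluator rewards getting close to a target. In a real system both would be calls to a language model. All names here are hypothetical.

```python
def beam_search_thoughts(root, generate, evaluate, beam_width=2, depth=3):
    """Keep the `beam_width` best chains at each depth; return the best chain."""
    beam = [[root]]
    for _ in range(depth):
        # Extend every surviving chain with every proposed next thought.
        candidates = [chain + [t] for chain in beam for t in generate(chain)]
        if not candidates:
            break
        # Rank the extended chains with the state evaluator and prune.
        candidates.sort(key=evaluate, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

TARGET = 10

def toy_generate(chain):
    # Stand-in thought generator: three arithmetic moves from the last value.
    last = chain[-1]
    return [last + 1, last * 2, last + 3]

def toy_evaluate(chain):
    # Stand-in state evaluator: chains ending nearer the target score higher.
    return -abs(TARGET - chain[-1])

print(beam_search_thoughts(0, toy_generate, toy_evaluate))
```

Keeping the whole chain, not just its last thought, is what makes the final answer auditable: the returned list is the reasoning path itself.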

It is important to note that sequential prompting does not always provide perfectly coherent reasoning. Sometimes we even see incorrect intermediate steps that still lead to correct results. Even so, having the model detail its reasoning is useful in its own right: a more explainable thought process can point us towards a valid approach or help us understand why the model has struggled.

The concept of breaking a task down into multiple reasoning steps and potentially including verification steps can be generalised even further. This is the core idea behind language model cascades, where probabilistic programming is used to compose models in a unified framework. Modularity is even more advantageous when the model is allowed access to external tools; transparency on how and when the model accesses these tools will also provide explainability.

The goal of an autonomous agent is that it can plan and execute subtasks itself without the need for sophisticated iterative prompting. Useful (hybrid) agentic systems – those that involve a human-in-the-loop – clearly stand to benefit from explainability in tackling the difficult foremost challenges of reasoning, planning, and exploration. However, even without a human-in-the-loop, one might expect that an agent which can explain its decisions to itself can better perform self-reflection and improvement.

Demo: Tree of Thoughts 🌴

We like to draw parallels between the sequential thoughts formalism and notions of System 2 behaviour which represents slow and involved thinking. A tree style search incorporates concepts like multiple reasoning pathways and self evaluation of each thought. This deliberate slow process of exploring multiple samples not only explains the model’s rationale but can also improve the model’s problem solving ability.

We have developed a tool that transparently highlights reasoning within a tree-of-thoughts procedure. The visualisation allows us to see which reasoning pathways the model investigated, how they were ranked and pruned, and its final conclusion.

Tree of thoughts visualisation for a HumanEval code generation task. Our visualisation highlights multiple reasoning pathways; the highlighted node raised an AssertionError on one of the test cases. This error was summarised in natural language, as seen in the sidebar, before being fed back to the language model, which could then correct the solution.

The generation process combines tree-of-thoughts with Reflexion style feedback by externally executing the code and feeding back the error message to the model.
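The execute-and-feed-back loop can be sketched in a few lines. This is a simplified illustration and not the demo's code: the raw error string is passed back directly instead of a natural-language summary, `toy_propose` is a hypothetical stand-in for the language model, and `exec` is used without any sandboxing, which would be unsafe for untrusted generations.

```python
def run_tests(source, tests):
    """Execute candidate code and its tests; return None on success or an error string."""
    namespace = {}
    try:
        exec(source, namespace)  # define the candidate function(s)
        exec(tests, namespace)   # run assertions against them
    except Exception as err:
        return f"{type(err).__name__}: {err}"
    return None

def refine(propose, tests, max_rounds=3):
    """Ask for a solution, run it, and feed any error back until the tests pass."""
    feedback = None
    for _ in range(max_rounds):
        source = propose(feedback)
        feedback = run_tests(source, tests)
        if feedback is None:
            return source
    return None  # no passing solution within the round budget

# Hypothetical model: a buggy first attempt, then a fix once it sees feedback.
attempts = [
    "def add(a, b):\n    return a - b\n",
    "def add(a, b):\n    return a + b\n",
]

def toy_propose(feedback):
    return attempts[0] if feedback is None else attempts[1]

tests = "assert add(2, 3) == 5, 'add(2, 3) should be 5'"
print(run_tests(attempts[0], tests))
print(refine(toy_propose, tests))
```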

More details and a full release of the demo will be available in the near future!

Ability to Reference 📚

If the language model can accurately cite evidence, it helps us understand its output. A straightforward method is to feed it a large citation corpus and request references through chain-of-thought prompting. However, this is resource-intensive and not a strength of transformers. Even when a model is able to accept large contexts, its performance diminishes with context length. To create scalable models that can reference, we must pair them with tools for efficient information extraction or summarization.

Language models are only really good at extracting information at the start and end of long contexts, from Liu et al.

A nice idea is to enhance the model with a search engine and include the first few results within the prompt (plus links in the output). This idea can be extended by using reinforcement learning with human evaluation to encourage the model to provide plausible outputs that are supported by quote evidence that is accurate and relevant.

perplexity.ai includes the top results from a search engine in its context and output.

More sophisticated approaches to language model referencing augment the core language model with an additional knowledge retriever model that retrieves a relevant document from a corpus. The knowledge retriever can be trained independently to the language model or end-to-end as in popular approaches like REALM or RAG. Knowledge retrievers typically work by fetching the entries of a vector database which most closely match the query (where the vector database contains chunked document embeddings).
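The nearest-match lookup at the heart of a knowledge retriever can be sketched with plain cosine similarity. The two-dimensional embeddings below are hand-written toys. A real system would use a trained embedding model and an approximate nearest-neighbour index rather than a full sort.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, database, k=2):
    """Return the k chunks whose embeddings best match the query embedding."""
    ranked = sorted(database, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Hypothetical vector database of (chunk text, embedding) pairs.
database = [
    ("cats", [1.0, 0.0]),
    ("dogs", [0.9, 0.1]),
    ("tax law", [0.0, 1.0]),
]
print(retrieve([1.0, 0.0], database, k=2))
```

The retrieved chunks are then placed in the prompt, and because they are explicit text they can also be surfaced to the user as references.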

Demo: Extended Mind Transformer (EMT) 🧠

At Normal we’ve developed a method which generalizes the principle at play in RAG. Instead of selecting the relevant information once, using embeddings disjoint from our model, we allow the model itself to select memories from an external, non-differentiable cache using its own representations within each layer. We call this “active externalism”, and term models which use it extended mind transformers. We demonstrate that transformers can leverage external memories innately; active externalism does not require fine-tuning.

We showcase perplexity results below for Mosaic’s MPT-7b model with active externalism. Perplexity is a measure of uncertainty of the model over each generated token, closely related to our cross entropy loss function. As we increase the number of external memories each query token is allowed to attend to, we see this measure of uncertainty improve (smaller perplexity is better).
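For reference, perplexity is the exponential of the average negative log-likelihood of the observed tokens. The sketch below assumes we already have the probability the model assigned to each token in the sequence.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always spreads its mass evenly over 4 tokens has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

The mean negative log-probability inside the exponential is the cross entropy loss, which is why lower perplexity corresponds directly to a lower loss on the evaluated text.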

Our active externalism method batches each sequence into chunks of increasing length (x-axis), and uses tokens previous to the last 2048 as external memories. We show results for a varied k, where k is the number of memories we retrieve per query token. We compare active externalism to a “truncated” baseline, which simply throws out any tokens previous to the last 2048 during perplexity computations.

Perplexity results (smaller is better).

Because the model selects the most relevant memories at each generation step, this also enables granular citations. We see above an example of highlighting the external memories which were most consulted when generating a particular token.

Tokens selected to attend to when generating the correct answer, 1971.
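To give a feel for how per-token memory selection can double as a citation mechanism, here is a heavily simplified, single-head sketch. It is not Normal's implementation: the dot-product scoring, the function name `attend_with_memories`, and the toy vectors are all assumptions made for illustration. A query picks its top-k keys from an external cache, attends over them together with the local context, and returns which memories it used and how strongly.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend_with_memories(query, local_kv, memory_kv, k=2):
    """Attend over local (key, value) pairs plus the top-k external memories.

    Returns (output vector, indices of the retrieved memories, their weights).
    """
    # Select the k cached memories whose keys score highest against the query.
    order = sorted(range(len(memory_kv)),
                   key=lambda i: dot(query, memory_kv[i][0]), reverse=True)
    top = order[:k]
    kv = [memory_kv[i] for i in top] + list(local_kv)
    # Standard scaled dot-product attention over memories and local context.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key, _ in kv])
    dim = len(kv[0][1])
    output = [sum(w * value[d] for w, (_, value) in zip(weights, kv))
              for d in range(dim)]
    return output, top, weights[:len(top)]

# Hypothetical cache of three memories and one local token.
memory_kv = [([1.0, 0.0], [1.0, 0.0]),
             ([0.0, 1.0], [0.0, 1.0]),
             ([0.5, 0.0], [0.5, 0.5])]
local_kv = [([0.0, 1.0], [0.0, 0.0])]
output, used, used_weights = attend_with_memories([1.0, 0.0], local_kv, memory_kv, k=1)
print(used, used_weights)
```

The returned indices and weights are the raw material for a citation: they identify which cached tokens the query leaned on when producing its output.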

Look out for full details on this coming soon!!

Retrieval augmentation represents a vital subroutine for language models going forward as it not only enables explainable output supported by evidence but also allows the model to access live, up-to-date data and reduces the probability of leaking information from its training data.

What’s next?

Normal Computing builds enterprise AI decision-making agent systems. Explainability (in language models and beyond) represents a key component of our full-stack approach that includes hardware, software, algorithms and models. If you are as interested as we are in advancing the frontier of reasoning and reliability in AI then reach out to us at [email protected]!


  1. Keeping in mind that LLMs have been trained quite differently to humans, e.g. we don’t expect self-knowledge.

  2. Read high entropy, think high variance or high uncertainty.

  3. Such as any of the reasoning or referencing we will discuss later.

  4. In practice, \(M^2\) comparisons are only required if every generation is semantically distinct. Typically far fewer comparisons are required and the computation is dominated by the generation cost of the \(M\) sequences.

  5. \(1 - \min \left(1, \frac{p(\text{token}|\text{previous tokens})}{q(\text{token}|\text{previous tokens})} \right)\)

  6. PoA is surely the best film, but is it the best book?