
Reducing LLM epistemic slop

Abstract

This article is about how to use LLMs as an approximate joint probability distribution over tokens rather than as an expert system. I show how multinomial/ordinal queries with grammar constraints avoid errors related to greedy recursive generation, allow for uncertainty quantification via token log probabilities, and enable robust inference via invariant query reformulations which expose logical inconsistencies. For the binomial case, I also show how this additional information can be combined simply using the Beta distribution.

Introduction

LLMs: an approximation of the joint probability distribution over token sequences fitted to a big bag of tokens we find useful. Most popularly, this distribution is used for recursive completion wherein at every step we use \(P(x_t \mid x_{<t})\) to decide the next token \(x_t\) given a context \(x_{<t}\), and so generate continuations. It turns out that these continuations can perform surprisingly well even when deep knowledge or complex reasoning would be required in the human context.

Some complications – among many – are that the joint distribution is approximate, it is not equally applicable to every possible context, and the greedy process of recursive completion is sub-optimal by construction: there is no reason that the most likely full sequence should be the one built by recursively taking the locally most likely token. Many of these problems are exponentially worse for small LLMs, and all this contributes to epistemically unreliable output: epistemic slop.

This post describes some tricks and statistical treatment which I’ve found practically useful in trying to use LLMs more like a probability distribution and less like an expert system. I’ll use llama.cpp as the inference framework for the examples. The llm object used throughout is declared like this:

from llama_cpp import Llama

llm  = Llama.from_pretrained(
    repo_id  = "<repo>", filename = "<model>",  # placeholders for a GGUF model
    n_gpu_layers=-1,    # offload all layers to the GPU
    logits_all=True,    # keep per-token logits so log probabilities can be returned
    verbose=False)

Syntactic constraints

That is, constraints on the structure of output. If your prompt expects a specific output structure, then every correct answer lies within the space of continuations that have that structure. Hence, constraining output to that structure discards only incorrect continuations and so reduces noise.

Backus-Naur form (BNF) grammar

BNF is a formal notation for describing context-free grammars which are in turn used to specify valid output. llama.cpp supports GBNF out-of-the-box for constraining token generation. Given a context, the feature works by masking out all tokens which are not valid according to the grammar, and sampling from the remainder, thereby guaranteeing that the output will be syntactically correct. Here is an example, wherein output is constrained to just “Yes”, “No” or “Maybe”.

from llama_cpp import LlamaGrammar

GRAMMAR = LlamaGrammar.from_string(
  r'root ::= "Yes" | "No" | "Maybe"')

resp = llm.create_chat_completion(
  messages=[
    { "role": "system",
      "content": "Answer only Yes, No or Maybe.", },
    { "role": "user",
      "content": "Is Paris the capital of France?", },],
  temperature=0.0,
  grammar=GRAMMAR,)

print(resp["choices"][0]["message"]["content"])

>> Yes

Prompt divergence

Since every token is a possible continuation of every prompt (more or less), constraining the shape of the output offers no assurance of prompt correctness. For example, constraining the output to some complex JSON and asking the LLM "Do you like cats?" will still produce syntactically valid complex JSON. So, the correspondence between prompt and output is empirical, and must be tested.
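
To make the point concrete, here is a minimal sketch: a grammar for a small, unrelated JSON record still yields syntactically valid output when the prompt is about cats. The record fields are arbitrary illustrative choices.

# A grammar for a tiny "city record" JSON object, unrelated to the question.
G_JSON = LlamaGrammar.from_string(r'''
root   ::= "{" ws "\"city\"" ws ":" ws string ws "," ws "\"population\"" ws ":" ws number ws "}"
string ::= "\"" [a-zA-Z ]* "\""
number ::= [0-9]+
ws     ::= [ ]*
''')

resp = llm.create_chat_completion(
  messages=[
    { "role": "user", "content": "Do you like cats?", },],
  temperature=0.0,
  grammar=G_JSON,)

# The output parses as JSON of the constrained shape, regardless of the prompt.
print(resp["choices"][0]["message"]["content"])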

Where we have evals, those can be used to test correspondence, but LLMs are frequently used in an unsupervised (or weakly supervised) mode, in which the following tests are useful: (1) does the prompt without the grammar mostly return grammatically valid results? If it fails often, the grammar is masking a problem. And (2) do alternative grammars produce the same result? For example, forcing different syntax (Yes/No, 1/0, True/False, {"result": true/false}) should not change the semantics of the answer; if it does, the grammar is biasing results.
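
Below is a minimal sketch of test (2), reusing the llm object from above; the two grammars and the agreement check are illustrative choices rather than a fixed recipe.

# Two syntactically different grammars that should carry the same semantics.
G_YN = LlamaGrammar.from_string(r'root ::= "Yes" | "No"')
G_TF = LlamaGrammar.from_string(r'root ::= "true" | "false"')

def ask(question, grammar, system):
    # Single deterministic completion under a given grammar.
    resp = llm.create_chat_completion(
        messages=[
            { "role": "system", "content": system, },
            { "role": "user", "content": question, },],
        temperature=0.0, grammar=grammar,)
    return resp["choices"][0]["message"]["content"]

q  = "Is Paris the capital of France?"
a1 = ask(q, G_YN, "Answer only Yes or No.")
a2 = ask(q, G_TF, "Answer only true or false.")

# The two answers should agree semantically; if not, the grammar is biasing results.
print(a1, a2, "agree:", (a1 == "Yes") == (a2 == "true"))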

Semantic constraints

That is, constraints on how the value of an output is determined. Every true answer lies within the space of logically consistent answers. Hence, constraining value determination with logical consistency checks reduces noise.

Multinomial and ordinal queries

Inference is cleaner for queries which can be answered from some fixed enumeration mapped to a single token output, because a major source of error – generation by recursion – is entirely removed. It also means that we can use the token log probabilities to directly quantify output uncertainty. Without loss of generality, the following is a simple binomial example. Multinomial queries – those with more than two options – can be treated in the same way. The options can also be ordered (ordinal) without further issue.

import math
import numpy as np


GRAMMAR = LlamaGrammar.from_string(r'root ::= "Y" | "N"')
sys_p   = 'Answer only "Y" for Yes or "N" for No.'
q       = 'Is Paris the capital of France?'
resp    = llm.create_chat_completion(
  messages=[
    { "role": "system", "content": sys_p, },
    { "role": "user", "content": q, }, ],
  max_tokens=1, temperature=0.0,
  grammar=GRAMMAR,
  logprobs=True, top_logprobs=10,)

lps  = resp["choices"][0]["logprobs"]["content"][0]
lps  = {x["token"]: x["logprob"]
        for x in lps["top_logprobs"]}
y, n = lps["Y"], lps["N"]
p    = float(math.exp(y - np.logaddexp(y, n)))

print("P(Y | q) =", round(p,3))

>> P(Y | q) = 0.999

In the code above, output is restricted to one character which must be either Y or N. We use the create_chat_completion function to recover the candidate first tokens with the highest log probabilities, and extract the values for Y and N. Using a grammar here isn’t strictly necessary but it makes it more likely that our mapped tokens will be among the top N tokens returned. Since the two possible answers span the whole space of answers, we can use the fraction \(p(Y\mid q) / (p(Y\mid q)+p(N\mid q))\) – computed in log space above – as a probability measure.

The big advantage of multinomial/ordinal queries is that their logical implications can easily be tested and compared. For example, the model probability that Paris is the capital of France should be directly related to probabilities for the following queries:

  • Is Paris not the capital of France?

  • Is Paris the only capital of France?

  • Is Paris the capital of Germany?

All of the above should track either \(P(Y\mid q)\) or \(1 - P(Y \mid q)\); if they diverge, there is a logical inconsistency.
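
As a sketch of the first check: wrap the binomial query above in a helper and compare the original question with its negation. The prob_yes helper is my own illustrative wrapper (it reuses GRAMMAR and sys_p from the previous example) and assumes both option tokens appear among the returned top log probabilities.

def prob_yes(question):
    # P(Y | question) via the single-token binomial query from above.
    resp = llm.create_chat_completion(
        messages=[
            { "role": "system", "content": sys_p, },
            { "role": "user", "content": question, },],
        max_tokens=1, temperature=0.0,
        grammar=GRAMMAR,
        logprobs=True, top_logprobs=10,)
    lps = resp["choices"][0]["logprobs"]["content"][0]
    lps = {x["token"]: x["logprob"]
           for x in lps["top_logprobs"]}
    return float(math.exp(lps["Y"] - np.logaddexp(lps["Y"], lps["N"])))

p     = prob_yes("Is Paris the capital of France?")
p_neg = prob_yes("Is Paris not the capital of France?")

# Consistency: the negated query should sit close to 1 - p.
print("P(Y | q) =", round(p, 3), " P(Y | neg q) =", round(p_neg, 3))
print("inconsistency:", round(abs((1 - p) - p_neg), 3))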

Queries which imply more than two options can also be validated against binomial queries which combine options. For example, if \(X \in \{A,B,C\}\) we can cross-check \(P(X=A \mid q)\) against \(P(X \notin \{B,C\} \mid q)\). Similarly, ordinal queries can be validated against ranges. For example, if \(X \in \mathbb{Z}\) and \(X \in \{1,\ldots,5\}\), then \(P(X=1 \mid q) \le P(X < k \mid q), \quad k=2,\ldots,6\).
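
Here is a sketch of the first cross-check, under the same assumptions as before: a three-option query answered with a single letter, compared against a binomial query over the complementary options. The wording of both queries and the A/B/C mapping are illustrative.

G_ABC = LlamaGrammar.from_string(r'root ::= "A" | "B" | "C"')
resp  = llm.create_chat_completion(
  messages=[
    { "role": "system",
      "content": 'Answer only "A", "B" or "C".', },
    { "role": "user",
      "content": "What is the capital of France? A: Paris, B: Lyon, C: Marseille", },],
  max_tokens=1, temperature=0.0,
  grammar=G_ABC,
  logprobs=True, top_logprobs=10,)

lps  = resp["choices"][0]["logprobs"]["content"][0]
lps  = {x["token"]: x["logprob"]
        for x in lps["top_logprobs"]}
# Normalise over the three options in log space.
Z    = np.logaddexp.reduce([lps["A"], lps["B"], lps["C"]])
p_A  = float(math.exp(lps["A"] - Z))

# Cross-check: P(X = A | q) should be close to 1 - P(X in {B, C} | q).
p_BC = prob_yes("Is the capital of France either Lyon or Marseille?")
print("P(A | q) =", round(p_A, 3), " 1 - P(B or C | q) =", round(1 - p_BC, 3))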

Invariants

An invariant – in this case – is a perturbation of the input which is expected not to change the output. I’ll mention three useful classes of invariants worth testing for (a small sketch for collecting them follows the list):

  • Paraphrase – changing the input to something semantically equivalent but worded differently should not change the output; nor should changing capitalisation or the query language.

  • Label shuffling – the order in which options are presented to the LLM and the token to which they are mapped should not make a difference to the output.

  • Nuisance variation – adding irrelevant information or over-determining the query should not result in a different answer. For example, “If the sky is blue, is Paris the capital of France?”.
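
A minimal sketch for collecting such reformulations, again using the hypothetical prob_yes helper from earlier; the specific paraphrases are illustrative. The resulting vector plays the role of \(X\) in the next section (which, for reproducibility, uses a fixed example array instead).

invariants = [
    "Is Paris the capital of France?",                       # original
    "Is the capital of France Paris?",                       # paraphrase
    "is paris the capital of france?",                       # capitalisation
    "Paris est-elle la capitale de la France ?",             # query language
    "If the sky is blue, is Paris the capital of France?",   # nuisance variation
]

X = np.array([prob_yes(q_i) for q_i in invariants])
print(X.round(3))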

Integrating evidence

In the binomial case, if all our invariants are reformulations of the query whose answers should equal either \(P(Y\mid q)\) or \(1 - P(Y \mid q)\), then one way to integrate them into a single judgement is to treat them as a not-so-random sample from the distribution of equivalent queries and fit a Beta distribution. Let \(X = \{ P(Y \mid q_i) \}\) over the invariant reformulations \(q_i\), and model \(X \sim Beta(\alpha, \beta)\). More usefully, we can reparameterise in terms of location and concentration:

\[ \mu = \frac{\alpha}{\alpha + \beta}, \qquad \phi = \alpha + \beta \]

It is simple to fit \(\mu,\phi\) numerically, after which we can produce an expectation, lower bound estimates and confidence intervals among other standard artefacts. Here is an example:

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit
from scipy.stats import beta, norm


def nll(theta):
    mu, phi = expit(theta[0]), np.exp(theta[1])
    return -beta.logpdf(X, mu * phi, (1 - mu) * phi).sum()

X       = np.array([0.642, 0.731, 0.691, 0.694, 0.732,
                    0.604, 0.679, 0.717, 0.729, 0.688])
m, v    = X.mean(), X.var(ddof=1)
theta0  = [logit(m), np.log(m*(1-m)/v-1)]
res     = minimize(nll, theta0, method="L-BFGS-B")
mu_hat  = expit(res.x[0])
phi_hat = np.exp(res.x[1])
se_eta  = np.sqrt(res.hess_inv.todense()[0, 0])
se_mu   = mu_hat * (1 - mu_hat) * se_eta
ci      = mu_hat + norm.ppf([0.025, 0.975]) * se_mu

print("E[X]:", mu_hat)
print("95% CI:", ci)


>> E[X]: 0.6906235678272028
>> 95% CI: [0.66616466 0.71508248]

For an illustrative set of values \(X\), the code above fits the \(\mu,\phi\) parameterised Beta distribution using L-BFGS-B and reads off the expectation and the Wald confidence interval. More involved treatments are available for multinomial and ordinal outputs.
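
As a follow-up, the lower bound estimates mentioned earlier can be read off the same fit; the 5% level below is an arbitrary illustrative choice.

# One-sided 95% lower bound on the mean, from the Wald standard error.
lb_mean  = float(mu_hat + norm.ppf(0.05) * se_mu)

# 5th percentile of the fitted Beta: a lower bound on the probability an
# equivalent reformulation of the query would be expected to report.
lb_query = float(beta.ppf(0.05, mu_hat * phi_hat, (1 - mu_hat) * phi_hat))

print("lower bound (mean):", round(lb_mean, 3))
print("lower bound (query):", round(lb_query, 3))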

Conclusions

Most importantly:

  • Grammars can be used to constrain output, which makes sense when the right answer has a specific syntax.

  • Multinomial/ordinal queries are evaluated with fewer sources of error, enable us to quantify uncertainty more precisely, and are more easily subjected to logical consistency checks.

  • Invariant reformulations can be pooled to produce a probability distribution from which we can report expectation and confidence intervals.