
Bootstrapping ranking models with an LLM judge

SUMMARY

I use 500 Hacker News (HN) titles and an LLM to derive an article ranking model from a user-supplied preference description. The LLM supplies the labelled data, whilst cheap sentence-transformer embeddings provide the features for a Ridge regression surrogate. The surrogate reaches a 0.74 Spearman correlation with the LLM labels, which is remarkable given that the experiment is entirely unoptimised.

INTRODUCTION

Preference-based ordering is useful for lists because it makes it more likely that you’ll find what you’re looking for when browsing from the top down. It’s tricky, however, because it requires either that the user has preferences similar to some reference group, or that there is enough data about the user to estimate their preferences.

The advent of LLMs makes it possible to bootstrap a preference ranking from a user-provided description of what they want to see. In this article I’ll be learning to rank HN articles, and this is the preference description I’ll use:

I am primarily interested in titles about statistics,
classical machine learning, applied mathematics and
mathematical modeling. I am also interested in simplified
architectures for LLMs and deep learning models, and software
releases for machine learning. I am not interested in corporate
announcements of product releases, rants, Ask HN posts, or
general software engineering practices/frameworks. However, I
am interested in computer science related titles, especially
related to new limiting results or algorithms. I am interested
in breakthrough science or medical related posts, but not in
general summary or opinion based articles about science or
medicine.

Note the description isn’t contrived: it expresses a complex preference with lots of ambiguity and no explicit guidance on how to resolve it.

In this post, I’ll download some titles from HN, label them with an LLM and then use sentence transformers and Ridge regression to predict the labels. The labels are probabilities and the Ridge model serves as a pointwise ranker. The implementation is kept to the simplest possible baseline.

DATA COLLECTION

I use the top 500 HN stories, and extract just the titles.

import requests as req
import pandas as pd

url_prefix  = "https://hacker-news.firebaseio.com/v0/"
top_500_url = url_prefix + "topstories.json?print=pretty"
item_url    = url_prefix + "item/%d.json?print=pretty"

# Fetch the IDs of the top 500 stories, then the title of each item.
top_500     = req.get(top_500_url).json()
titles      = [req.get(item_url % i).json()["title"] for i in top_500]

# Shuffle the titles and write them to disk.
pd.DataFrame(dict(title=titles)).sample(frac=1).to_csv("titles.csv")

Here is the LLM prompt I use to collect the labels.

Make a list using the titles below, in the following
format "<title>|<probability>", which assigns a probability
that I will find the title interesting given the following
paragraph describing my preferences. Score each title
individually yourself without code assistance.

Preferences: <preference description>

List of titles: <titles>

The data obtained from the first two steps is available in the first two columns here.
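If you’d rather automate this step than paste the prompt into a chat window, something like the sketch below works. The OpenAI Python client, the gpt-4o-mini model name and the preferences.txt file are placeholder choices of mine, not part of the original pipeline.

from openai import OpenAI
import pandas as pd

# Build the prompt from the template above. preferences.txt holds the
# preference description; the file names here are illustrative.
titles = pd.read_csv("titles.csv")["title"].tolist()
preferences = open("preferences.txt").read()

prompt = (
    'Make a list using the titles below, in the following format '
    '"<title>|<probability>", which assigns a probability that I will '
    'find the title interesting given the following paragraph describing '
    'my preferences. Score each title individually yourself without code '
    'assistance.\n\n'
    "Preferences: " + preferences + "\n\n"
    "List of titles:\n" + "\n".join(titles)
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

# Parse the "<title>|<probability>" lines into the pipe-separated file
# expected by the training script below.
rows = []
for line in resp.choices[0].message.content.splitlines():
    if "|" in line:
        title, prob = line.rsplit("|", 1)
        rows.append((title.strip(), float(prob)))

pd.DataFrame(rows).to_csv("titles_scored.csv", sep="|", header=False, index=False)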

SURROGATE TRAINING

I train just about the simplest possible effective surrogate using sentence transformer embeddings (all-MPNet-base-v2) and Ridge regression. I use sentence transformers because the model runs quickly on CPU, and Ridge because it is the simplest modification to linear regression required to handle a fat matrix (#columns > #rows).

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from scipy.stats import spearmanr
import pandas as pd
import numpy as np


# Load the LLM-labelled titles and embed them.
S = pd.read_csv("titles_scored.csv", sep="|", header=None)
S.columns = ["title", "score"]

X = SentenceTransformer('all-MPNet-base-v2').encode(S.title.values)

# 20-fold cross-validation of the Ridge surrogate.
Cs, Dif = [], []
for TR, TE in KFold(n_splits=20, shuffle=True).split(X):
    M = Ridge(alpha=5).fit(X[TR], S.score.iloc[TR])
    P = M.predict(X[TE])
    corr, _ = spearmanr(P, S.score.iloc[TE])
    dif = np.median(np.abs(P - S.score.iloc[TE]))
    Cs.append(corr)
    Dif.append(dif)

print("Median correlation", np.median(Cs))
print("Median abs. diff.", np.median(Dif))

# Refit on all the data and write out the titles ranked by predicted score.
M = Ridge(alpha=1).fit(X, S.score)
S["score_"] = M.predict(X)
( S.sort_values("score_", ascending=False)
  .to_csv("titles_scored_approx.csv", index=False) )

RESULTS

Under 20-fold cross-validation, this simple model achieves a median Spearman rank correlation of 0.74 and a median absolute difference of 0.14, which is remarkable. The output of model training is available in the final column here. Given that there has been no optimisation, I am qualitatively happy with the implied ranking: I’d prefer the articles ranked this way over the HN default.
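As a quick usage sketch, the fitted surrogate ranks any fresh batch of titles pointwise; the snippet assumes M and the embedding setup from the training script are still in scope, and the example titles are made up.

# Rank new, unlabelled titles with the trained surrogate.
new_titles = [
    "A gentle introduction to Gaussian processes",        # made-up examples
    "Ask HN: What's your favourite terminal emulator?",
]
X_new = SentenceTransformer('all-MPNet-base-v2').encode(new_titles)
scores = M.predict(X_new)
for s, t in sorted(zip(scores, new_titles), reverse=True):
    print(round(float(s), 2), t)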

EXTENSIONS AND IMPROVEMENTS

  • More data. I think much better results could be achieved with about 10k labelled titles, because 500 is too few to represent the preference description.

  • Prompt engineering. I haven’t given the prompt template any thought. Spending more words to explain to the LLM what a preference ranking is and how to do it would likely improve results.

  • Better preference description. My description mentions many separate preferences but doesn’t describe how they should be treated relative to each other. More detail about the preference would likely improve the ranking.

  • A more complex model. Given more data, something like xgboost could be used to cheaply learn a more nuanced model (see the sketch below).
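Here is a minimal sketch of that last point, assuming X, S, TR and TE from the training script above and an installed xgboost; the hyperparameters are placeholders, not tuned values.

from xgboost import XGBRegressor

# Drop-in replacement for the Ridge model inside the cross-validation loop.
M = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
M.fit(X[TR], S.score.iloc[TR])
P = M.predict(X[TE])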