Implementation code is available here. It's just north of 200 lines of Python, so after getting the gist from this post, it's straightforward to read the code directly for details.
Introduction
Query understanding is a set of methods used to better interpret the intention of a search query prior to its execution: things like “query segmentation”, “entity recognition”, “rewriting”, “spelling correction”, and so on. Herein I’ll write a bit about “query segmentation”, otherwise known as “query tagging”, and in particular about how we can straightforwardly put together an unsupervised query tagger using concepts from evidence theory (Dempster-Shafer).
My proof-of-concept parses map queries (e.g. as in Google Maps). It uses OpenStreetMap (OSM) data for England compiled into gazetteers (lookup tables), and elementary evidence theory operations available in my package pyEvidence. A user query (input) might be something like “fish and chips restaurant near leicester square” and the tagging (output) might be: “fish and chips [FOOD] restaurant [CAT] near [PROX] leicester square [GEO]”.
I’ll concern myself with the following tags:
POI: Point of interest: business name or a chain.
GEO: Geography: roads, areas, postcodes.
CAT: Amenity category.
FOOD: Type of food or a cuisine.
PROX: Proximity qualifier (e.g. "nearest").
For simplicity, this demonstration omits basic processing steps like stopword removal, spelling correction, proper lemmatisation, and so on. Pre-processing is limited to the following functions:
import re
import unicodedata as ud
from nltk.stem import SnowballStemmer

stem = SnowballStemmer("english").stem

def normalise(q):
    # Lower-case, accent-decompose, strip punctuation, and tokenise.
    assert isinstance(q, str)
    q = ud.normalize('NFD', q.lower().strip())
    q = re.sub(r"['`]+", "", q)            # drop apostrophes outright
    q = re.sub(r"[^a-z0-9 \-]+", " ", q)   # everything else becomes a space
    return [x for x in re.split(" +", q) if x]

def lemmatise(q):
    # Crude lemmatisation via the Snowball stemmer.
    return [stem(x) for x in q]
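For example, the two functions compose as follows (outputs shown assuming NLTK’s English Snowball stemmer, as imported above):

>>> normalise("Fish & Chips, Leicester Square!")
['fish', 'chips', 'leicester', 'square']
>>> lemmatise(['fish', 'chips'])
['fish', 'chip']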
Gazetteers
The gazetteers are the central evidence-inducing artefact. According to Wikipedia, a gazetteer is a “geographical dictionary or directory used in conjunction with a map or atlas”: a special lookup table. In my case each \(t \in T\) has an associated gazetteer to represent it. The code for their construction is here: it’s short and easy to read. I extract the data from an OSM export for England (see the setup instructions here), count the occurrence of each term, and store each term with the logarithm of its count in the corresponding table.
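To make the tables concrete, here is a minimal sketch of the construction step, assuming the normalised OSM terms for each tag have already been extracted into lists (the real extraction lives in the linked code; build_gazetteer is a hypothetical name):

import math
from collections import Counter

def build_gazetteer(terms):
    # terms: normalised terms extracted from OSM for one tag. Stored as
    # log(1 + count) so a single occurrence still scores above zero;
    # the exact transform is an implementation detail.
    counts = Counter(terms)
    return {term: math.log1p(n) for term, n in counts.items()}

# Illustrative inputs only; the real data comes from the OSM export.
gazetteers = {
    "GEO": build_gazetteer(["high street", "high street", "london"]),
    "FOOD": build_gazetteer(["pizza", "fish and chips"]),
}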
The model
To my knowledge, the following model – at least in terms of its application to query understanding, and at the time of writing – is novel. It relies heavily on evidence theory: a generalisation of probability theory which allows credence to be assigned to sets rather than only to singletons. To the uninitiated, I’d suggest starting here and getting acquainted before continuing further.
Let \(Q = (w_1,...,w_m)\) be a query consisting of \(m\) words, \(T = \{poi,geo,cat,prox,food\}\) be the set of possible labels, \(H = \{(t_1,...,t_m) : t_i \in T\}\) be the set of all possible labellings, and \(G_n = (g_1, ..., g_{m-n+1})\) be the sequence of \(n\)-grams of \(Q\). Let \(\Phi_t: G_n \to \mathbb{R}^+, t \in T\) be the gazetteer function for tag \(t\), which given an \(n\)-gram returns a positive real value, monotonically increasing with the number of occurrences of the \(n\)-gram in that gazetteer.
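Extracting \(G_n\) is the usual sliding window; a minimal helper for reference (ngrams is a hypothetical name, not from query.py):

def ngrams(words, n):
    # All contiguous n-grams of a token sequence, as G_n above.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# ngrams(["pie", "and", "mash"], 2) -> [("pie", "and"), ("and", "mash")]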
For any given \(n\)-gram \(q\), let \(N(q) = \sum_{t \in T} \Phi_t(q)\) be the sum across all gazetteers. A simple probability measure of \(q\) occurring in gazetteer \(t\) is given by \(\Phi_t(q) / N(q)\). This measure is relative and insensitive to the total count of the sought term across the gazetteers, but I want probability assignments to be tempered by \(N(q)\): if the \(n\)-gram rarely occurs, I’d like to be proportionally less confident in any judgement about it. Any monotonically increasing function with range \([0,1]\) is admissible for the purpose, but I like the Poisson CDF \(\pi_n(x) = P(X \le x)\), where \(X \sim Poisson(\lambda_n)\), because it is easy to calibrate intuitively. Note that the parameter \(\lambda_n\) differs for each \(n\)-gram set, since \(1\)-grams are typically more common than \(3\)-grams.
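A minimal sketch of the tempering function, assuming SciPy is available; the \(\lambda_n\) values are illustrative rather than those used in query.py:

from scipy.stats import poisson

# Hypothetical calibration: lambda_n is roughly "the total count at
# which the gazetteers earn full trust". 1-grams are more common, so
# lambda_1 is set higher than lambda_3.
LAMBDA = {1: 20, 2: 8, 3: 4}

def confidence(n, total_count):
    # pi_n(N(q)): the share of mass committed to the tags; the
    # remainder stays on the vacuous hypothesis.
    return poisson.cdf(total_count, LAMBDA[n])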
I am now in a position to produce evidence. The purpose of evidence is to change our certainty with regard to subsets of the hypothesis space \(H\). I construct a separate mass assignment (evidence source) for every \(g_i \in G_n, n \in \{1,2,3\}\), as follows: \[m_n(H) = 1 - \pi_n(N(g_i))\] \[m_n(H(g_i, t)) = \pi_n(N(g_i)) \cdot \Phi_t(g_i) / N(g_i)\] where \(H(g_i, t) \subseteq H\) is the set of hypotheses which set the positions occupied by \(g_i\) to tag \(t\) but preserve all other possibilities. In words, the vacuous set is assigned the sample-size uncertainty \(1 - \pi_n(N(g_i))\), whilst the remainder is divided across the tags in direct proportion to their gazetteer counts.
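Putting the two formulas together, a sketch of one evidence source for a single \(n\)-gram, with focal elements abbreviated to bare tag names and "H" standing for the vacuous set (a simplification, not the pyEvidence representation):

def mass_assignment(n, counts, confidence):
    # counts: tag -> Phi_t(g) for one n-gram g; confidence: the
    # tempering function pi_n sketched above.
    N = sum(counts.values())
    pi = confidence(n, N)
    m = {t: pi * c / N for t, c in counts.items() if c > 0}
    m["H"] = 1.0 - pi  # sample-size uncertainty stays on the frame
    return m

# e.g. mass_assignment(2, {"FOOD": 5.4, "POI": 0.12}, confidence)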
Aside from the gazetteers, the only other source of evidence in this demonstration comes from the “priors”. I use two priors over 1-grams: (1) a weak prior which defaults to GEO for things which look like numbers or postcodes, and (2) a very weak prior which defaults to POI in all other cases. Have a look at the priors function in query.py for details.
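For flavour, such priors could look like the following sketch; the postcode pattern and the GEO mass are made up, though the 0.001/0.999 POI masses match the prior sources visible in the trace further down:

import re

# Hypothetical pattern for number- or postcode-like tokens.
NUMBERISH = re.compile(r"^\d|^[a-z]{1,2}\d")

def prior(word):
    # A weak single-word evidence source: a nudge towards GEO for
    # number/postcode-like tokens, otherwise a very weak nudge to POI.
    if NUMBERISH.match(word):
        return {"GEO": 0.05, "H": 0.95}   # weak GEO prior (illustrative mass)
    return {"POI": 0.001, "H": 0.999}     # very weak POI prior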
Let \(E\) be the set of all evidence sources compiled as above. For each \(g_i \in G_1\), I infer the labels for the 1-grams as follows:
\[\arg\max\limits_{t \in T}\bigg\{belief\big(H(g_i,t) \mid\bigoplus\limits_{e \in E} e\big)\bigg\}\] That is, for each \(g_i\) in turn, I independently pick the tag which implies the subset of hypotheses with the most belief, given the combination of all the \(n\)-gram evidence over all tags.
Note that although inference occurs per \(1\)-gram, it is still heavily affected by all evidence which overlaps it due to the nature of combination in evidence theory. For example, given a sequence \((A,B,C,D,E)\), in the setup above, the label of \(C\) would be potentially affected by the evidence for all of the following subsequences: \(\big\{(C), (B,C), (C,D), (A,B,C), (B,C,D), (C,D,E)\big\}\).
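For intuition, the belief of a set is the total mass committed to its subsets. A toy single-position sketch with frozenset focal elements (not the pyEvidence API), using masses like those of source No. 5 in the trace below:

def belief(m, A):
    # Belief of set A under mass function m: the total mass of the
    # focal elements contained in A.
    return sum(w for focal, w in m.items() if focal <= A)

FRAME = frozenset({"POI", "GEO", "CAT", "FOOD", "PROX"})
m = {frozenset({"FOOD"}): 0.3913, frozenset({"POI"}): 0.0087, FRAME: 0.6}
belief(m, frozenset({"FOOD"}))           # 0.3913
belief(m, frozenset({"FOOD", "POI"}))    # 0.4
belief(m, FRAME)                         # 1.0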
Results
Here are some tagging examples from the query.py output:
kdjalksjdalkjd [POI]
N17 [GEO]
tesco extra [POI]
boots [POI]
costa [POI] coffee [FOOD]
greggs [POI]
pizza hut [POI]
waitrose [POI]
sainsburys local [POI]
burger king [POI]
caffe nero [POI]
london [GEO]
mcdonalds [POI]
starbucks [POI]
restaurant [CAT]
near me [PROX]
16 high street oxford ox1 4ag [GEO]
32 deansgate manchester m3 4lq [GEO]
14 station road cambridge [GEO]
27 the shambles york yo1 7lz [GEO]
12 church lane leeds [GEO]
3 market place norwich nr2 1nd [GEO]
21 kings road brighton [GEO]
nearest [PROX] petrol station [CAT]
47 tavistock road london n1 [GEO]
sainsburys [POI] supermarket [CAT] in [PROX] brighton [GEO]
closest [PROX] tesco [POI] in [PROX] plymouth [GEO]
greggs [POI] bakery [CAT] near [PROX] newcastle [GEO]
starbucks [POI] coffee shop [FOOD] near me [PROX]
nearest [PROX] dentist [CAT] in [PROX] norwich [GEO]
boots [POI] pharmacy [CAT] near [PROX] leicester [GEO]
dominos [POI] pizza [FOOD] in [PROX] sheffield [GEO]
nearest [PROX] train station [CAT] in [PROX] liverpool [GEO]
primark [POI] store [CAT] in [PROX] nottingham [GEO]
aldi [POI] supermarket [CAT] near [PROX] derby [GEO]
morrisons [POI] supermarket [CAT] in [PROX] bolton [GEO]
closest [PROX] petrol station [CAT] near me [PROX]
nearest [PROX] pub [CAT] in [PROX] windsor [GEO]
thai [FOOD] food restaurant [CAT] in [PROX] reading [GEO]
closest [PROX] chinese [FOOD] in [PROX] leeds [GEO]
the shard [POI] in [PROX] london bridge [GEO]
closest [PROX] museum [CAT] in [PROX] bristol [GEO]
the ivy [POI] restaurant [CAT] in [PROX] cheltenham [GEO]
closest [PROX] fish and chips [FOOD] in [PROX] whitby [GEO]
cinema [CAT] in [PROX] leicester [GEO]
closest [PROX] hospital [CAT] near me [PROX]
waterstones [POI] in [PROX] cambridge [GEO]
lego [POI] store [CAT] in [PROX] london [GEO]
zoo [CAT] in [PROX] london [GEO]
mcdonalds [POI] in [PROX] stratford [GEO]
tube station [CAT] near [PROX] leicester square [GEO]
hotel [CAT] close to [PROX] brighton [GEO] peer [POI]
yo sushi [POI] restaurant [CAT] in [PROX] london [GEO]
burger [FOOD] restaurant [CAT] in [PROX] leicester [GEO]
italian [FOOD] restaurant [CAT] in [PROX] chester [GEO]
Here is an example of all the information available to explain – and understand – the tagging of a query. Below are the results for “pie and mash, tavistock square”, which is correctly tagged “pie and mash [FOOD] tavistock square [GEO]”. There are 14 applicable evidence sources, starting with the per-word POI priors (Nos. 0-4), followed by the \(n\)-gram sources.
>>> tag_query("pie and mash, tavistock square")
--------------------------------------------------------
len(evidence) = 14.
No. 0
{'POI'} * * * * -> 0.001
* * * * * -> 0.999
No. 1
* {'POI'} * * * -> 0.001
* * * * * -> 0.999
No. 2
* * {'POI'} * * -> 0.001
* * * * * -> 0.999
No. 3
* * * {'POI'} * -> 0.001
* * * * * -> 0.999
No. 4
* * * * {'POI'} -> 0.001
* * * * * -> 0.999
No. 5
{'POI'} {'POI'} * * * -> 0.0087
{'GEO'} {'GEO'} * * * -> 0.0
{'CAT'} {'CAT'} * * * -> 0.0
{'FOOD'} {'FOOD'} * * * -> 0.3913
{'PROX'} {'PROX'} * * * -> 0.0
* * * * * -> 0.6
No. 6
* * {'POI'} * * -> 0.0118
* * {'GEO'} * * -> 0.0
* * {'CAT'} * * -> 0.0
* * {'FOOD'} * * -> 0.1882
* * {'PROX'} * * -> 0.0
* * * * * -> 0.8
No. 7
* {'POI'} {'POI'} * * -> 0.0087
* {'GEO'} {'GEO'} * * -> 0.0
* {'CAT'} {'CAT'} * * -> 0.0
* {'FOOD'} {'FOOD'} * * -> 0.3913
* {'PROX'} {'PROX'} * * -> 0.0
* * * * * -> 0.6
No. 8
* * * {'POI'} {'POI'} -> 0.0392
* * * {'GEO'} {'GEO'} -> 0.1301
* * * {'CAT'} {'CAT'} -> 0.0
* * * {'FOOD'} {'FOOD'} -> 0.0
* * * {'PROX'} {'PROX'} -> 0.0
* * * * * -> 0.8307
No. 9
{'POI'} {'POI'} {'POI'} * * -> 0.0087
{'GEO'} {'GEO'} {'GEO'} * * -> 0.0
{'CAT'} {'CAT'} {'CAT'} * * -> 0.0
{'FOOD'} {'FOOD'} {'FOOD'} * * -> 0.3913
{'PROX'} {'PROX'} {'PROX'} * * -> 0.0
* * * * * -> 0.6
No. 10
* * * * {'POI'} -> 0.036
* * * * {'GEO'} -> 0.0958
* * * * {'CAT'} -> 0.0313
* * * * {'FOOD'} -> 0.0
* * * * {'PROX'} -> 0.0
* * * * * -> 0.8368
No. 11
* {'POI'} * * * -> 0.0002
* {'GEO'} * * * -> 0.0001
* {'CAT'} * * * -> 0.0002
* {'FOOD'} * * * -> 0.1994
* {'PROX'} * * * -> 0.0
* * * * * -> 0.8
No. 12
{'POI'} * * * * -> 0.0084
{'GEO'} * * * * -> 0.0042
{'CAT'} * * * * -> 0.0
{'FOOD'} * * * * -> 0.1874
{'PROX'} * * * * -> 0.0
* * * * * -> 0.8
No. 13
* * * {'POI'} * -> 0.0045
* * * {'GEO'} * -> 0.0337
* * * {'CAT'} * -> 0.0
* * * {'FOOD'} * -> 0.0
* * * {'PROX'} * -> 0.0
* * * * * -> 0.9618
--------------------------------------------------------
'pie and mash [FOOD] tavistock square [GEO]'
Implementation notes
Evidence combination rules mostly differ in how they deal with conflict between evidence sources. Herein I use the “Dubois-Prade” rule, which allocates conflicting mass to the disjunction of the disagreeing subsets. That is, if \(A\) and \(B\) disagree, the truth is assumed to lie in one or the other (assuming it lies in neither would be an alternative rule).
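For concreteness, a minimal sketch of the rule for two mass functions with frozenset focal elements (the textbook arithmetic, not the pyEvidence implementation):

def dubois_prade(m1, m2):
    # Combine two mass functions (dict: frozenset -> mass). Where focal
    # elements intersect, mass flows to the intersection; where they
    # conflict (empty intersection), it flows to their union.
    out = {}
    for a, wa in m1.items():
        for b, wb in m2.items():
            c = a & b
            key = c if c else a | b  # conflict -> disjunction
            out[key] = out.get(key, 0.0) + wa * wb
    return out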
Naive inference in evidence theory has exponential time complexity due to calculations performed over power sets. Part of the contribution of pyEvidence – used herein – is computationally efficient approximate representations and algorithms. query.py performs inference using both a coarsening algorithm and Monte Carlo, in order to illustrate both. Monte Carlo is about 10x slower than coarsening, but it can be made arbitrarily precise, limited only by the number of simulations.
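To give a flavour of the Monte Carlo idea, here is a sketch for Dempster’s rule (the Dubois-Prade variant would replace rejection with a union, and query.py’s actual algorithm may differ):

import random

def mc_belief(sources, A, trials=10_000):
    # Estimate belief(A): sample one focal element per source in
    # proportion to its mass, intersect the picks, discard conflicting
    # (empty) samples, and count intersections that land inside A.
    hits = kept = 0
    for _ in range(trials):
        focal = None
        for m in sources:
            elems = list(m)
            pick = random.choices(elems, weights=[m[e] for e in elems])[0]
            focal = pick if focal is None else focal & pick
        if focal:
            kept += 1
            hits += focal <= A
    return hits / kept if kept else 0.0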
Good results are limited to queries covered by the England dataset and not hamstrung by the lack of basic pre-processing, i.e. the missing stopword removal, spelling correction, and so on.