7 min read

Semi-supervised clustering with logic programming

Summary

I motivate clustering as a problem well suited to logic programming in the general case, and I volunteer a couple of artisanal clustering algorithms in Prolog, demonstrated on some mock data.

Note: the code herein is my own. If you see bugs, or are a Prolog mage and can write it even more concisely, I’d be grateful if you could let me know.

Introduction

There are many clustering algorithms born in specific circumstances such as k-means (via vector quantisation), biclustering (via gene expression analysis), DBSCAN (via spatial analysis) and so on, which went on to be mostly abused in the general setting. Taking k-means as an example, to make use of it you need an n-dimensional Euclidean feature space in which the dimensions are commensurate and appropriately scaled relative to each other. I asked ChatGPT to give me an example of typical k-means use cases and it said “marketing segmentation”, with example features being things such as age, gender, income, education level, marital status and occupation: good luck creating commensurate features in a Euclidean space out of those. I think that outside of specialist settings, this kind of abuse is par for the course, largely because there is a mismatch between the shape of the information available in business settings and what the algorithm is expecting.

I think that in the course of general commercial decision making, grouping customers/products/subjects/etc. in order to divide-and-conquer – or at least to compartmentalise concerns – is normal and intuitive. However, in my experience, nobody is constructing n-dimensional vector spaces in their heads or trying to learn distance metrics; they are just using logic to capture divisions they intuit from experience. For example, “a significant portion of ‘new parent’ expenditure is on their child. As a cohort, their needs for the first few years follow a predictable story arc on average. We’d like to plan the retail experience for this group separately”. Sounds familiar enough, no? How should we do it? Should we determine a suitable vector space and then fit an SVM to find the hyperplane which separates the “new parents” from the other shoppers? No – why would we, and in many circumstances how even could we? Let’s instead just write down some rules which qualify a shopper as a “new parent” – things like “has bought nappies more than once in the last month” – and then look at what those people tend to buy and expand the rules a bit: that’s how we will identify new parents, everyone will understand it, and we will repeat the process for other customer types.
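
To make that concrete, here is a minimal sketch of what such a qualifying rule might look like in Prolog. The purchase/3 facts, names and thresholds are all invented for illustration; the point is only that the rule is a few readable lines of logic rather than a feature-engineering exercise.

% purchase(Customer, Product, DaysAgo) - hypothetical transaction facts.
purchase(alice, nappies, 3).
purchase(alice, nappies, 20).
purchase(alice, formula_milk, 5).
purchase(bob, lawnmower, 12).

% A shopper qualifies as a "new parent" if they bought nappies more than once
% in the last month (the window and threshold are illustrative, not prescriptive).
new_parent(Customer) :-
    aggregate_all(count,
                  ( purchase(Customer, nappies, DaysAgo), DaysAgo =< 30 ),
                  N),
    N > 1.

Querying ?- new_parent(alice). succeeds while ?- new_parent(bob). fails, and the rule can grow clause by clause as the “look at what those people tend to buy” step reveals more signals.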

Logic programming starts to make sense as soon as a problem becomes more about rules and relationships than other parts of mathematics, because it makes it easy to express lots of rules – like the ones mentioned above – in a context where you can do something with them. In the clustering scenario, there are at least four ways domain knowledge and assumptions could be used to augment a clustering algorithm:

  1. Force linking – decide which items must appear in the same clusters.
  2. Force unlinking – decide which items must not appear in the same cluster.
  3. Logical linking criteria – potentially extensive rules regarding what must be true of items appearing in the same cluster.
  4. Distance measure – some metric which can be used to decide how similar two items are. In the new-parent example above that could be (for instance) a score measuring how much a customer’s purchases resemble those of a reference group of new parents.

If your algorithm of choice supports non-Euclidean distances, it may be possible to smuggle some of the above into the distance metric, but it will be difficult in the general case. However, it is straightforward with logic programming, as I’ll aim to show in the following section.

Clustering in (SWI) Prolog

Here is some simple – without loss of generality – mock data. It contains 9 subjects each assigned an arbitrary Quantity and Category. It also contains some stubs (must_link/2, must_not_link/2) which will later be used to force linking/unlinking.

% subject(Id,Quantity,Category)
subject(1,11,m).
subject(2,12,m).
subject(3,13,m).
subject(4,20,m).
subject(5,21,f).
subject(6,22,f).
subject(7,23,f).
subject(8,40,f).
subject(9,44,f).

must_link(none,none).
must_not_link(none,none).

I think it makes sense to think of clustering as a logical process. That is, let’s start by asking ourselves what we would like the final clusters to look like and what we would like to be true about them. In the example below, can_link/3 checks that \(X \ne Y\), that X and Y are in the same category, that their Quantity difference is less than 5, and that there are no directives which prevent them from being linked (must_not_link/2). The predicate also exposes a distance measure calculated as the absolute difference between the Quantities. The predicate can_link/2 is identical except that it does not expose the distance measure: it is used in the simpler clustering example below. Note: I’ll cover the inclusion of must_link/2 as part of the clustering algorithms.

% can_link(Distance, X, Y): X and Y may share a cluster if they are distinct,
% have the same Category, their Quantities differ by less than 5, and no
% directive forbids the pairing; Distance is the absolute Quantity difference.
can_link(D,X,Y) :-
    subject(X,Q1,C1),
    subject(Y,Q2,C2),
    X \= Y,
    C1 = C2,
    abs(Q1-Q2) < 5,
    \+ must_not_link(X,Y),
    \+ must_not_link(Y,X),
    D is abs(Q1-Q2).
% can_link(X, Y): as above, without exposing the distance.
can_link(X,Y) :-
    can_link(_,X,Y).

Say further that we wanted “complete linkage”, meaning everything in the same cluster links with everything else: a realistic desideratum. Incredibly, the above link predicate and forced constraints can be turned into a clustering algorithm in just 16 sparse lines of Prolog.

% cluster(+Remaining, +Current, +Acc, -Clusters)
cluster([],[],C,C).
cluster(Xs,E,A,C) :-            % X is forced to join the current cluster
    member(X,Xs),
    member(Y,E),
    (must_link(X,Y);must_link(Y,X)),
    select(X,Xs,T),
    cluster(T,[X|E],A,C),!.
cluster(Xs,E,A,C) :-            % complete linkage: X can link to every member
    member(X,Xs),
    maplist(can_link(X),E),
    select(X,Xs,T),
    cluster(T,[X|E],A,C),!.
cluster(X,E,A,C) :-             % no candidate left: close the cluster, start a new one
    cluster(X,[],[E|A],C).
cluster(X,C) :-                 % entry point
    cluster(X,[],[],C).

The code above starts with an empty cluster and repeatedly adds members to it, each time checking that the new member either must link to one of the existing members or that it can link to all existing members. Once no further links are possible, the cluster is added to the accumulator, a new cluster is started, and the process is repeated with the remaining members. It can be invoked as follows for the data above:

?- findall(X,subject(X,_,_),Xs),cluster(Xs,Clusters).

Xs = [1, 2, 3, 4, 5, 6, 7, 8, 9],
Clusters = [[9, 8], [7, 6, 5], [4], [3, 2, 1]].

We can use the forced linking/unlinking as follows:

...
must_link(1,7).
must_not_link(9,8).

?- findall(X,subject(X,_,_),Xs),cluster(Xs,Clusters).

Xs = [1, 2, 3, 4, 5, 6, 7, 8, 9],
Clusters = [[9], [8], [6, 5], [4], [3, 2], [7, 1]].

It is again noteworthy that this intuitive setup is easy to code yet offers a powerful semi-supervised clustering environment. We can complicate it further by taking the distance metric into account and making the cluster formation strategy “agglomerative”. That is, rather than building one cluster at a time, we can merge subjects into clusters bottom-up, nearest pairs first at every recursion. The algorithm below is initialised with all the subjects as singleton clusters. It recursively finds pairs of clusters which either (1) must link or (2) can fully link and are the nearest pair according to the distance exposed by can_link/3. The chosen pair is merged and the merged cluster is added back to the front of the list of clusters. This process is repeated until no more clusters can be merged. The full addition is still only around 40 sparse lines of Prolog and supports all four kinds of customisation listed above.

to_singles([],[]).
to_singles([X|Xs],[[X]|Ys]) :-
    to_singles(Xs,Ys).

% Two clusters must be merged if any cross-pair of their members is forced to link.
must_link_cluster(X,Y) :-
    member(X0,X),
    member(Y0,Y),
    (must_link(X0,Y0);must_link(Y0,X0)).

% Two clusters can be merged (complete linkage) if every cross-pair of their
% members can link; Dist is the largest pairwise distance between them.
can_link_cluster(Dist,X,Y) :-
    setof(
      D-X0-Y0,
      ( member(X0,X),
        member(Y0,Y),
        can_link(D,X0,Y0) ),
      Pos),
    length(Pos,L1),
    length(X,L2),
    length(Y,L3),
    L1 =:= L2*L3,
    last(Pos,Dist-_-_).

cluster_(Xs,C) :-
    (member(X,Xs),
     member(Y,Xs),
     must_link_cluster(X,Y);
     setof(
      D-[X0,Y0],
      X0^Y0^(
        member(X0,Xs),
        member(Y0,Xs),
        can_link_cluster(D,X0,Y0)),
      [_-[X,Y]|_])),
    select(X,Xs,T1),
    select(Y,T1,T0),
    append(X,Y,M),
    append([M],T0,T),
    cluster_(T,C),!.
cluster_(X,X).
cluster(X,C) :-
    to_singles(X,Xs),
    cluster_(Xs,C).

We can invoke it as before (with the forced linking/unlinking constraints still in place) as follows:

?- findall(X,subject(X,_,_),Xs),cluster(Xs,Clusters).

Xs = [1, 2, 3, 4, 5, 6, 7, 8, 9],
Clusters = [[5, 6], [2, 3], [1, 7], [4], [8], [9]].

The code above is inefficient because it will result in significant amounts of recalculation on every recursion. This can be greatly improved by using SWI-Prolog’s memoisation feature, known as “tabling”. The primary predicates we should table are the linking predicates. This can be done by placing the following directives prior to their definitions:

:- table can_link/3.
:- table can_link_cluster/3.
:- table must_link_cluster/2.
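
One caveat, assuming you experiment with the linking facts interactively: tabled predicates cache their answers, so if must_link/2 or must_not_link/2 change after a query has run, the cached results go stale. Declaring those facts dynamic (before their clauses in the file) and clearing the tables after a change is a blunt but reliable fix; SWI-Prolog’s incremental tabling is the finer-grained alternative.

:- dynamic must_link/2, must_not_link/2.

% After modifying the linking facts, discard all cached answers so the
% next query recomputes them from scratch.
?- assertz(must_not_link(2,3)), abolish_all_tables.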

For large datasets this should result in speed-ups of orders of magnitude.
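
If you want to check the effect on your own data, SWI-Prolog’s time/1 gives a quick before/after comparison:

% Run the same query with and without the table directives loaded and
% compare the inference counts and wall time reported by time/1.
?- findall(X,subject(X,_,_),Xs), time(cluster(Xs,Clusters)).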

Conclusions

In this post I’ve presented a case for logic programming as a natural fit for clustering scenarios in general commercial use cases due to the ease with which logically stated domain knowledge can be integrated with cluster building algorithms to achieve effective and explainable semi-supervised clustering. I’ve also shared two cluster building predicates which support many ways of adding domain knowledge to the clustering process whilst remaining very simple to express in Prolog.