Better data analysis with logic programming

INTRODUCTION

Gentle reader, permit me to try to convince you that data analysis is better with logic programming. In this post I’ll analyse a staple dataset – the ggplot2 diamond prices – using a symbolic approach which, I will demonstrate, can establish a robust model that would otherwise be difficult to recover.

DATA

I’ll use the diamond prices data which comes with the R ggplot2 package. It consists of information about 50k+ round-cut diamonds. I’ll focus particularly on price, carat, cut, colour and clarity. That is, the price and the “4Cs” of diamond pricing, which are the primary factors in determining the value of a diamond. The omitted factors complicate the presentation without sufficient benefit, so I skip them here without loss of generality.

Carat is a weight measure, whilst cut, colour and clarity are ordinal values indicating the perfection of the diamond. There are some nice visuals here which help explain the 4Cs in more detail.

You can download the original CSV dataset here.

OBJECTIVES

Have a look through some of the analyses in this Google search here. The typical objective is to find a model connecting known factors with price, so I’ll use the same objective to keep things comparable. It is also typical to do some “exploratory” analysis to get a sense of the data, and to fish for outliers: I’ll replace this with a more logically oriented counterpart of checking for consistency.

ANALYSIS

The analysis will be carried out in Prolog with some help from R. It consists of data preparation, the addition of domain knowledge, consistency and coverage checking, and price modelling. Each step has its own section below.

Data preparation

Preparation begins with a conversion of the diamond data to Prolog facts. Here is the R script:

require(dplyr)

X <- 
  read.csv("data/diamonds.csv") %>%
  filter(!is.na(price)) %>% # drop rows with missing prices.
  mutate(
    cut=factor(
      cut,
      levels=c("Fair","Good","Very Good","Premium","Ideal"),
      ordered=T),
    color=factor(
      color,
      levels=c("D","E","F","G","H","I","J"),
      ordered=T),
    clarity=factor(
      clarity, 
      levels=c("IF","VVS1","VVS2","VS1","VS2","SI1","SI2","I1"),
      ordered=T)) %>%
  mutate(
    carat=round(carat,2),
    cut=as.numeric(cut),
    color=-as.numeric(color),
    clarity=-as.numeric(clarity)
  ) %>%
  group_by(carat,color,cut,clarity) %>%
  summarise(price=round(median(price))) %>%
  ungroup() %>%
  arrange(price) %>%
  mutate(id=row_number())

writeLines(
  apply(
    X, 1,
    \(r) sprintf(
      "diamond(%s,%s,%s,%s,%s,%s).",
      r['id'], r['price'], r['carat'],r['color'],r['cut'],r['clarity'])),
  con="data/diamond.prolog")

I group the data by the 4Cs and take the median price in order to:

  • Remove duplicate rows (not useful in this analysis).
  • Output an average price per factor combination.

Grouping reduces the number of rows from 52k+ to <14k.

The ordinal factors are converted to numbers for convenience and some factors (colour and clarity) are graded on a reverse scale (hence the minus signs). writeLines emits facts in the following format:

% diamond(Id,Price,Carat,Colour,Cut,Clarity).
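
For example, an emitted fact might look like this (the values are made up for illustration; the real ones come from the CSV):

diamond(1,326,0.2,-5,2,-6). % hypothetical values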

You can download the symbols here.

These are some convenience predicates for accessing different bits of information for any given diamond:

price(Id,Price) :-
  diamond(Id,Price,_,_,_,_).

carat(Id,Carat) :-
  diamond(Id,_,Carat,_,_,_).

colour(Id,Colour) :-
  diamond(Id,_,_,Colour,_,_).

cut(Id,Cut) :-
  diamond(Id,_,_,_,Cut,_).

clarity(Id,Clarity) :-
  diamond(Id,_,_,_,_,Clarity).
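
These predicates compose naturally in queries. For example, the price per carat for an arbitrary diamond (the id 42 is purely illustrative):

>> price(42,P), carat(42,Ca), PPC is P/Ca.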

Adding domain knowledge

I think that it is important to note that the price of diamonds is determined by supply and demand factors which are not part of the analysis: all I have is some descriptive information, none of which “causes” pricing. I do however have some domain knowledge about the structure of pricing: the 4Cs are subject to a monotonic relationship with price. That is, ceteris paribus, a bigger/clearer/etc diamond is more expensive. Here are some Prolog predicates which express the relationship between diamonds and 4C combinations.

% Diamond 'Id1' dominates diamond 'Id2'.
dominates(Id1,Id2) :-
    diamond(Id1,_,Ca1,Co1,Cu1,Cl1),
    diamond(Id2,_,Ca2,Co2,Cu2,Cl2),
    Ca1>=Ca2,
    Co1>=Co2,
    Cu1>=Cu2,
    Cl1>=Cl2,
    \+ (Ca1=Ca2,Co1=Co2,Cu1=Cu2,Cl1=Cl2).

% Diamond 'Id' dominates a set of properties.
dominates(Id,[Ca,Co,Cu,Cl]) :-
    diamond(Id,_,Ca_,Co_,Cu_,Cl_),
    \+ (Ca=Ca_,Co=Co_,Cu=Cu_,Cl=Cl_),
    Ca_ >= Ca,
    Co_ >= Co,
    Cu_ >= Cu,
    Cl_ >= Cl.

% Diamond 'Id' is dominated by a set of properties.
dominated(Id,[Ca,Co,Cu,Cl]) :-
    diamond(Id,_,Ca_,Co_,Cu_,Cl_),
    \+ (Ca=Ca_,Co=Co_,Cu=Cu_,Cl=Cl_),
    Ca >= Ca_,
    Co >= Co_,
    Cu >= Cu_,
    Cl >= Cl_.

There is doubtless lots of other domain knowledge about how the factors relate to each other which could make the analysis more powerful. For example, knowing that some ordinal category combinations are equivalent would allow the domination relation above to establish tighter bounds. Further, additional domain information provides more opportunity to test the data for consistency.
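
As a hedged sketch of what such knowledge might look like – the grades below are hypothetical, not a claim about real diamond pricing – an equivalence relation over clarity grades could be stated as facts:

% Hypothetical equivalence: suppose these two adjacent clarity grades
% were known to price identically.
equivalent_clarity(Cl,Cl).
equivalent_clarity(-5,-6). % hypothetical
equivalent_clarity(-6,-5). % hypothetical

A domination check could then compare clarity up to equivalent_clarity/2 rather than with >=, tightening the bounds wherever grades collapse.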

Consistency and coverage

At this stage, I have some domain information and a description of the dataset from the source, which I can use to cross-check the data. I’ll start with some range checks.

It is given that:

price
    price in US dollars ($326–$18,823)

carat
    weight of the diamond (0.2–5.01)

The prices are within the expected range.

>> price(X,Price),(18823<Price;Price<326).
false.

However, there are many diamonds which are outside the specified carat range:

>> findall(X,(carat(X,Car),(5.01<Car;Car<0.2)),Xs),length(Xs,N).
Xs = [404, 771, 1165, 2052, 2109, 2371, 2377, 2420, 2817|...],
N = 153.

I will use this information to modify the filters in the ETL script.
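
Equivalently, the documented ranges can be kept on the Prolog side as a validity check – a minimal sketch using only the ranges quoted above:

% A diamond is valid if its price and carat are in the documented ranges.
valid(Id) :-
    price(Id,P), 326 =< P, P =< 18823,
    carat(Id,Ca), 0.2 =< Ca, Ca =< 5.01.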

Next, I check to see to what extent dominates/2 is consistent with prices. I need a few auxiliary predicates:

% Rough quantiles; assumes S is sorted ascending.
quantile(S,Q,V) :-
    length(S,N),
    Idx is round(Q*(N-1)),
    nth0(Idx,S,V).

% 'Y' breaches the domination order: X dominates Y, yet Y is priced
% more than 10 per cent higher.
breach_(X,Y) :-
    dominates(X,Y),
    price(X,Px),
    price(Y,Py),
    Px<Py,
    1.1<(Py/Px).
% N is the number of diamonds inconsistent with X.
breach(X,N) :-
    aggregate(count,Y^breach_(X,Y),N).

Note that I require that breaches have a price difference of more than 10 per cent. This is because I expect that there is both a natural variance in prices and many confounding effects from variables not considered in the analysis. I calculate the quantiles as follows:

>> findall(N,(diamond(X,_,_,_,_,_),breach(X,N)),Xs),
>> msort(Xs,S),
>> member(Q,[0.25,0.50,0.75,0.80,0.90,0.95,0.99]),
>> quantile(S,Q,V).
Q V
0.25 23
0.50 45
0.75 70
0.80 75
0.90 105
0.95 129
0.99 219

The 99th percentile value (219) is about 1.6% of the dataset (219/13408): a surprisingly consistent result given pricing variance and the likelihood that these diamonds were not priced simultaneously, in the same place, and under the same conditions. Note here that the smoothing and de-duplication effects of grouping the data on the 4Cs likely improve its consistency significantly by implicitly removing outliers.
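
To inspect the inconsistent tail by hand, a short helper can rank the diamonds by breach count – a sketch; note that sort/4 with @>= orders descending without removing duplicates:

% Top K diamonds by number of breaches, as Count-Id pairs.
worst_offenders(K,Pairs) :-
    findall(N-X,(diamond(X,_,_,_,_,_),breach(X,N)),Counts),
    sort(0,@>=,Counts,Sorted),
    length(Pairs,K),
    append(Pairs,_,Sorted).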

I’ll now check to see what the distribution of diamonds looks like across combinations of ordinal categories. Here are some auxiliary predicates:

unique(Pred,Values) :-
    findall(X,call(Pred,_,X),Xs),
    sort(Xs,Values).

% Count the diamonds in a colour/cut/clarity combination. Note that
% aggregate_all(count,...) always succeeds when N is unbound, so the
% catch-all second clause is defensive only.
count(Co,Cu,Cl,N) :-
    aggregate_all(count,diamond(_,_,_,Co,Cu,Cl),N),!.
count(_,_,_,0).

I first look for combinations of ordinal categories which do not have any diamonds:

>> unique(colour,Cos),
>> unique(clarity,Cls),
>> unique(cut,Cus),
>> member(Co,Cos),
>> member(Cu,Cus),
>> member(Cl,Cls),
>> count(Co,Cu,Cl,N),
>> N = 0.

It returns 5 results, so there are indeed empty combinations. This is not a problem for the pricing model herein, but it may be for other models (e.g. trees).
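
The same check can be collapsed into a single count, which should report the same five combinations:

>> unique(colour,Cos),unique(cut,Cus),unique(clarity,Cls),
>> aggregate_all(
>>   count,
>>   (member(Co,Cos),member(Cu,Cus),member(Cl,Cls),
>>    count(Co,Cu,Cl,N),N=:=0),
>>   Empty).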

Next, I look at the quantiles to see how the diamonds are distributed across categories:

>> unique(colour,Cos),
>> unique(clarity,Cls),
>> unique(cut,Cus),
>> findall(
>>   N,
>>   (member(Co,Cos),
>>    member(Cu,Cus),
>>    member(Cl,Cls),
>>    count(Co,Cu,Cl,N)),
>>   Ns),
>> msort(Ns,S),
>> member(Q,[0.05,0.10,0.25,0.50,0.75,1]),
>> quantile(S,Q,V).
Q V
0.05 3
0.10 7
0.25 15
0.50 37
0.75 81
1.00 143

There are 280 category combinations, of which 10% have 7 diamonds or fewer. Conversely, the top quartile has 81–143 diamonds per combination. This does not affect the model herein, but for many other kinds of model it would result in a bias towards the bigger categories, unless explicitly corrected for during training.

Finally, I look at the price distribution:

>> findall(Pr,price(_,Pr),Ps),
>> msort(Ps,S),
>> member(Q,[0.05,0.10,0.25,0.50,0.75,1]),
>> quantile(S,Q,V).
Q V
0.05 760
0.10 1171
0.25 2680
0.50 5666
0.75 10422
1.00 18806

I note that there is a considerable skew towards higher prices. This is again not a problem for the model herein, but it is for other types of model: regressions are often fitted with sum-of-squares or similar objective functions, which the higher prices dominate because they account for more of the total error.
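
The skew is easy to quantify in place by comparing the mean with the median price – a sketch reusing quantile/3:

% Mean vs median price; a mean well above the median indicates right skew.
price_skew(Mean,Median) :-
    findall(Pr,price(_,Pr),Ps),
    sum_list(Ps,Sum),
    length(Ps,N),
    Mean is Sum/N,
    msort(Ps,S),
    quantile(S,0.5,Median).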

A pricing model

Here is a first pricing model:

% Strategy 1: exact 4C match; bounds are the min/max observed prices.
price_check(Ca,Co,Cu,Cl,Type,Lower,Upper,Evidence) :-
    Type="Exact match: min/max.",
    findall(Id,diamond(Id,_,Ca,Co,Cu,Cl),Evidence),
    Evidence \= [],
    findall(Pr,(member(Id,Evidence),price(Id,Pr)),Ps),
    min_list(Ps,Lower),
    max_list(Ps,Upper).

Given a set of 4Cs, the model finds all diamonds which have exactly the same properties. It then returns the min/max price in the list as the lower/upper bound for the price of the diamond. If an exact match is available, this provides the best estimate. Note that the Type and Evidence variables provide a short description of the pricing strategy and the diamond IDs used in its calculation.

If an exact match is not available, the following clauses use the domination relationships to find the diamonds immediately below and above the specified 4Cs in the domination order, then take the 75th quantile of the prices below as the lower bound and the 25th quantile of the prices above as the upper bound:

% Strategy 2a: lower bound from the diamonds immediately below 'Cat'.
price_check(Ca,Co,Cu,Cl,Type,Lower,_,Evidence) :-
    Type="Domination: lower bound.",
    Cat=[Ca,Co,Cu,Cl],
    findall(
        Id,
        (dominated(Id,Cat),\+ (dominates(Z,Id),dominated(Z,Cat))),
        Evidence),
    findall(Pr,(member(Id,Evidence),price(Id,Pr)),Ps),
    msort(Ps,Ps_sort), % msort keeps duplicate prices
    quantile(Ps_sort,0.75,Lower).
% Strategy 2b: upper bound from the diamonds immediately above 'Cat'.
price_check(Ca,Co,Cu,Cl,Type,_,Upper,Evidence) :-
    Type="Domination: upper bound.",
    Cat=[Ca,Co,Cu,Cl],
    findall(
        Id,
        (dominates(Id,Cat),\+ (dominates(Id,Z),dominates(Z,Cat))),
        Evidence),
    findall(Pr,(member(Id,Evidence),price(Id,Pr)),Ps),
    msort(Ps,Ps_sort), % msort keeps duplicate prices
    quantile(Ps_sort,0.25,Upper).

The lower and upper bounds are calculated separately because, at the edge of the data, only one of them may be available. The 25th and 75th quantiles are used because the data is not entirely consistent (as shown in a previous section), and quantiles are less likely to produce spurious results than min/max.

Here is an example of usage:

>> price_check(0.5,-3,2,-3,Type,Lower,Upper,Evidence).
Type = "Exact match: min/max.",
Lower = Upper, Upper = 1865,
Evidence = [4015] ;

Type = "Domination: lower bound.",
Lower = 1740,
Evidence = [2726, 3423, 3776] ;

Type = "Domination: upper bound.",
Upper = 2068,
Evidence = [4269, 4375, 4562].

There is always at least one domination bound, but in this case an exact match and both domination bounds are available, so the query returns all of them. Here is another example, where an exact match is not available:

>> price_check(0.8,-3,4,-3,Type,Lower,Upper,Evidence).
Type = "Domination: lower bound.",
Lower = 3813,
Evidence = [5599, 6391, 6761, 6955] ;

Type = "Domination: upper bound.",
Upper = 4229,
Evidence = [7102, 7213, 7230].

WHY IS THIS BETTER?

Here are just some of the reasons I think that this approach works better for this dataset than any other analysis I’ve seen:

  • The model is just a query which combines domain knowledge with the available data; it is almost entirely empirical.

  • The model is not “trained”; it just reports on what is in the data and the domain knowledge. This avoids many problems highlighted throughout the text, such as missing category combinations, skews in the prices affecting the objective function, and bias due to under-represented categories. All of these issues cause potentially serious biases in other popular types of model (xgboost, random forests, NN, linear, etc), and must be addressed explicitly.

  • It is easy to explain: here is a database of prices; the computer calculates price bounds using (1) diamonds which have the same 4Cs as the one you’re looking for, and (2) diamonds with slightly better/worse 4Cs than the one you’re looking for.

  • All pricing strategies are presented with the evidence. This deep case-by-case explanation makes it possible for a human to decide whether to believe the computer through common-sense reasoning. Even models as simple as a linear regression or a kNN cannot provide this level of transparency.

  • It is easy to improve the model by: (1) making the data more recent and representative, (2) adding more domain knowledge (e.g. about equivalence classes), and (3) adding more pricing strategies as additional price_check/8 clauses – see the sketch below.
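
As a flavour of (3), here is a hedged sketch of one extra strategy – not part of the model above, and the 0.05 carat window is an arbitrary illustrative choice – matching colour, cut and clarity exactly whilst allowing nearby carats:

% Hypothetical extra strategy: exact colour/cut/clarity, carat within
% an arbitrary +/-0.05 window; bounds are the min/max observed prices.
price_check(Ca,Co,Cu,Cl,Type,Lower,Upper,Evidence) :-
    Type="Near-carat match: min/max.",
    findall(
        Id,
        (diamond(Id,_,Ca_,Co,Cu,Cl),abs(Ca_-Ca)=<0.05),
        Evidence),
    Evidence \= [],
    findall(Pr,(member(Id,Evidence),price(Id,Pr)),Ps),
    min_list(Ps,Lower),
    max_list(Ps,Upper).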

CONCLUSIONS

I’ve analysed the diamonds dataset primarily using symbolic reasoning. I’ve integrated the data with domain knowledge and produced a useful, empirical, and extensible pricing model in a few lines of Prolog. Hopefully I have gone some way towards convincing you – the reader – that data analysis is better with logic programming.