Summary
I describe a weak supervision paradigm called “data programming”, which uses maximum likelihood estimation to produce soft labels from heuristics. These soft labels can then be used to train other models, without true labels being required at any stage. I’ve included a simple example from first principles to show that the methods work. The original authors have a fully featured package called Snorkel, which provides sophisticated data programming and related features.
Introduction
There’s a nice paper from 2016 called “Data Programming: Creating Large Training Sets, Quickly” in which the authors lay out a simple paradigm for training binary classification models from a set of heuristic “labeling functions”. Given some data \(x\), a labeling function \(f_i(x) \in \{-1,0,1\}\) emits a label at an unknown rate \(\beta_i\) (abstaining, with output 0, the rest of the time), and when it does emit a label, it matches the true (but unknown) label (-1 or 1) at a rate \(\alpha_i\). In plain language, a labeling function outputs labels for some data samples – but not necessarily all – and they might be wrong. The objective is to use a set of labeling functions to guess the true labels, and then train a model on those guesses.
To get this off the ground, we must assume that our labeling functions are correlated with the true labels, even though we do not know how often they apply or how often they are correct. Crucially, the converse holds too: the true labels are correlated with the labeling functions’ outputs, which is what allows us to estimate the labels from those outputs.
We have to get a little less handwavy at this point. Conceptually we need to start by estimating \(P(\Omega(x) \mid Y=y)\), where \(\Omega(x) \in \{-1,0,1\}^m\) are our \(m\) labeling functions applied to data \(x \in X\), and \(y \in Y\) are the true but unknown labels. Once we’ve done that, using Bayes’ rule, we can calculate \(P(Y=y \mid \Omega(x)) \propto P(\Omega(x) \mid Y=y)P(Y=y)\), which gives us the soft labels we need to train some other model directly on \(X\).
Step 1: Estimating the likelihood function
We’ll assume that the labeling functions are independent, and that our prior on class probabilities is uniform. We then define a joint probability distribution over the labeling functions and true values:
\[\mu_{\alpha,\beta}(\Omega,y) = \frac{1}{2} \prod_{i=1}^m \big( \beta_i\alpha_i 1_{\Omega_i=y} + \beta_i(1-\alpha_i) 1_{\Omega_i=-y} + (1-\beta_i) 1_{\Omega_i=0} \big)\]
where \(\Omega \in \{-1,0,1\}^m\) and \(y \in \{-1,1\}\). This is just the prior probability of the true label (\(\frac{1}{2}\), by our uniform assumption) multiplied by, for each labeling function, the probability of the output it actually produced given that label. We can then write down the log likelihood function by marginalizing over the unknown label:
\[ \log \mathcal{L}(\alpha, \beta) = \sum_{x \in X} \log \left( \sum_{y \in \{-1, 1\}} \mu_{\alpha, \beta}(\Omega(x), y) \right) \]
To fit \(\alpha,\beta\), we solve \(\arg\max_{\alpha,\beta} \log \mathcal{L}(\alpha,\beta)\).
Note that the likelihood function isn’t identifiable. For example, we could flip every \(\alpha_i\) to \(1-\alpha_i\) (implicitly swapping the meaning of the two classes) and achieve the same likelihood. We have to add some information about the range of \(\alpha,\beta\) to break the symmetry if we want an interpretable fit. We also assume explicitly that our labeling functions are independent, which is unlikely in practice. The paper provides a more complicated approach to modeling the joint distribution using factor models which I will not cover here.
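For instance, here is a minimal sketch of one such constraint, assuming the fit is done with box constraints via optim’s L-BFGS-B (as in the example below): require every \(\alpha_i \geq 0.5\), i.e. assume each labeling function does better than a coin flip.
m <- 5 # Number of labeling functions.
# Box constraints for optim(..., method = "L-BFGS-B"). The first m
# parameters are the alphas and the last m are the betas; forcing
# alpha_i >= 0.5 rules out the mirrored solution.
lower <- c(rep(0.5, m), rep(0, m))
upper <- rep(1, 2 * m)
The example later in this post takes a softer route and simply initializes the \(\alpha_i\) above 0.5.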
Step 2: Estimate the soft labels
At this stage we have \(\alpha,\beta\) estimates and therefore a complete joint distribution, so we have everything we need to apply Bayes’ rule, calculate the posterior and obtain the soft labels:
\[P(Y = 1 \mid \Omega(x)) = \frac{\mu_{\alpha,\beta}(\Omega(x), 1)}{\mu_{\alpha,\beta}(\Omega(x), -1) + \mu_{\alpha,\beta}(\Omega(x), 1)}\] That’s all there is to it because it’s a binary classification task, so the posterior is just two probabilities that sum to one. It could be significantly more involved if the response were continuous.
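For intuition, take \(m=2\) with \(\Omega(x) = (1, 0)\): the first function votes 1 and the second abstains. The shared factors cancel and the posterior reduces to the first function’s accuracy:
\[P(Y = 1 \mid \Omega(x) = (1,0)) = \frac{\tfrac{1}{2}\beta_1\alpha_1(1-\beta_2)}{\tfrac{1}{2}\beta_1\alpha_1(1-\beta_2) + \tfrac{1}{2}\beta_1(1-\alpha_1)(1-\beta_2)} = \alpha_1\]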
Step 3: Training a model
The paper fits a logistic regression to the soft labels, but it’s simpler and more instructive to fit a linear probability model instead. That is, treat the soft label \(\eta(x) = P(Y = 1 \mid \Omega(x))\) as the regression target and fit a linear model with an \(L2\) penalty:
\[\min_w \sum_{x \in X} \left( w^\top f(x) - \eta(x) \right)^2 + \rho \|w\|^2\] where \(f(x)\) is the feature vector for input \(x\), and \(\rho\) is the regularization parameter. Setting the gradient to zero gives the normal equations \((F^\top F + \rho I) w = F^\top \eta\), so the solution has a closed form:
\[w^* = (F^\top F + \rho I)^{-1} F^\top \eta\]
where \(F\) is the matrix of feature vectors and \(\eta\) is the vector of soft labels. This gives an interpretable baseline model trained without true labels.
Note that the regularization term \(\rho \|w\|^2\) is important. The soft labels are inherently noisy, and a linear model might overfit, especially if some labeling functions are highly correlated or unreliable. The penalty term mitigates this by shrinking the model weights toward zero, which discourages fitting to spurious patterns.
Any other kind of model could be fitted at this point too; it just needs to accept the soft labels as a response variable directly, as in the sketch below.
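For example, here is a minimal sketch of fitting a logistic regression in R instead, assuming a hypothetical feature data frame feats and soft-label vector soft in \([0,1]\); the quasibinomial family tolerates a fractional response, so the soft labels can be used as the target directly.
# `feats` (feature data frame) and `soft` (soft labels in [0, 1]) are
# hypothetical stand-ins for the outputs of steps 1 and 2.
d <- data.frame(feats, soft = soft)
# quasibinomial() accepts a non-integer response, unlike binomial().
model <- glm(soft ~ ., data = d, family = quasibinomial())
p_hat <- predict(model, type = "response") # Predicted class probabilities.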
An example
Here is a proof of concept example in R using the methods described above. I’ll use the BreastCancer dataset from the mlbench package since it’s well suited to binary classification and has a bunch of features to draw heuristics from.
library(mlbench) # Provides the BreastCancer dataset.
library(dplyr)

data(BreastCancer)
# Keep complete rows, drop the Id column, recode the class to -1/1,
# and coerce the factor-coded features to numeric.
X <- BreastCancer[complete.cases(BreastCancer), ] %>%
  select(2:11) %>%
  mutate(Class = ifelse(Class == "benign", -1, 1)) %>%
  mutate(across(everything(), ~ as.numeric(.)))
y <- ifelse(X$Class == -1, FALSE, TRUE) # True labels, used only for evaluation.
X$Class <- NULL
I asked ChatGPT to produce five domain-inspired labeling functions, which I encode in the H variable. Note that every function has some degree of abstention.
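# Each labeling function votes 1 (malignant), -1 (benign) or 0 (abstain).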
H <- X %>% mutate(
lf1 = case_when(
Cl.thickness >= 6 ~ 1,
Cl.thickness <= 3 ~ -1,
TRUE ~ 0
),
lf2 = case_when(
Cell.size >= 5 ~ 1,
Cell.size <= 2 ~ -1,
TRUE ~ 0
),
lf3 = case_when(
Bare.nuclei >= 6 ~ 1,
Bare.nuclei <= 3 ~ -1,
TRUE ~ 0
),
lf4 = case_when(
Normal.nucleoli >= 7 ~ 1,
TRUE ~ 0
),
lf5 = case_when(
Mitoses >= 3 ~ 1,
Mitoses == 1 ~ -1,
TRUE ~ 0
)
) %>% select(lf1,lf2,lf3,lf4,lf5) %>% as.matrix()
I can now write down the joint distribution and the likelihood function, and then maximize the likelihood with respect to \(\alpha,\beta\) using the \(H\) matrix.
# Joint probability of one row h of labeling-function outputs and a
# candidate label y, given accuracies A (alpha) and coverages B (beta).
# The constant 1/2 prior is dropped: it cancels in the posterior and
# only shifts the log likelihood by a constant.
joint <- \(h, y, A, B) {
  prod((h == y) * B * A + (h == -y) * B * (1 - A) + (1 - B) * (h == 0))
}
fit <- function(H) {
  m <- ncol(H)
  # Negative log likelihood, marginalizing over the unknown label.
  # The 1e-12 guards against log(0).
  obj <- \(P) {
    A <- P[1:m]; B <- P[(m + 1):(2 * m)]
    -sum(apply(H, 1, \(h) log(1e-12 + joint(h, 1, A, B) + joint(h, -1, A, B))))
  }
  # Starting the alphas above 0.5 breaks the alpha <-> 1 - alpha symmetry.
  init <- runif(2 * m, 0.5, 1)
  lower <- rep(0, 2 * m)
  upper <- rep(1, 2 * m)
  R <- optim(init, obj, method = "L-BFGS-B", lower = lower, upper = upper)
  list(A = R$par[1:m], B = R$par[(m + 1):(2 * m)], value = -R$value)
}
res <- fit(H)
res
## $A
## [1] 0.9258932 0.9896234 0.9377637 0.9644975 0.8109640
##
## $B
## [1] 0.6969184 0.8682388 0.9282592 0.1668999 0.9487493
##
## $value
## [1] -1919.901
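As a quick sanity check (my addition, not part of the paper’s method), the fitted B values should roughly match each labeling function’s empirical coverage, since abstention is directly observed:
# Fraction of rows on which each labeling function does not abstain;
# this should sit close to the fitted B (beta) estimates above.
colMeans(H != 0)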
Next, I use the posterior distribution, calculated using Bayes’ rule, to determine the soft labels S, and then fit a ridge regression in closed form using \(\rho=1\). Finally, I apply the fitted weights W to the features to predict class probabilities. I clip the probabilities to [0,1] since a linear probability model can make out-of-range predictions.
# Posterior P(Y = 1 | Omega(x)) for every row: the soft labels.
S <- (\() {
  num <- apply(H, 1, \(h) joint(h, 1, res$A, res$B))
  den <- num + apply(H, 1, \(h) joint(h, -1, res$A, res$B))
  num / den
})()
# Closed-form ridge regression of the soft labels on the features
# with rho = 1: W = (F'F + rho I)^{-1} F'eta.
W <- (\() {
  X_ <- as.matrix(X)
  solve(t(X_) %*% X_ + 1 * diag(ncol(X_)), t(X_) %*% S)
})()
# Predicted probabilities, clipped to [0, 1].
p <- pmin(pmax(as.matrix(X) %*% W, 0), 1)
Finally, I use the confusionMatrix function from the caret package to produce a classification report.
library(caret) # For confusionMatrix.
confusionMatrix(factor(ifelse(p > 0.5, TRUE, FALSE)), factor(y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 435 23
## TRUE 9 216
##
## Accuracy : 0.9531
## 95% CI : (0.9345, 0.9677)
## No Information Rate : 0.6501
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8956
##
## Mcnemar's Test P-Value : 0.02156
##
## Sensitivity : 0.9797
## Specificity : 0.9038
## Pos Pred Value : 0.9498
## Neg Pred Value : 0.9600
## Prevalence : 0.6501
## Detection Rate : 0.6369
## Detection Prevalence : 0.6706
## Balanced Accuracy : 0.9417
##
## 'Positive' Class : FALSE
##
The results will be better or worse depending on the labeling functions’ coverage and precision, but I think the example demonstrates that the method can work well even with a simple baseline model.
Conclusion
I think the paradigm is inspired, because I often find myself without true labels but with enough domain knowledge to produce labeling functions. It would also be fairly easy to mix true labels into the likelihood function for a semi-supervised approach, as sketched below. The authors have a Python package called Snorkel which provides sophisticated data programming and related features.
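Here is a minimal sketch of that semi-supervised idea, reusing the joint function from the example above. The vector y_known is a hypothetical input holding -1/1 where a true label is available and NA elsewhere; labeled rows contribute the likelihood term for their observed label only, while unlabeled rows are marginalized as before.
# Semi-supervised negative log likelihood (sketch). `y_known` holds
# -1/1 where the true label is known and NA where it is not.
obj_semi <- \(P, H, y_known) {
  m <- ncol(H)
  A <- P[1:m]; B <- P[(m + 1):(2 * m)]
  -sum(sapply(seq_len(nrow(H)), \(i) {
    h <- H[i, ]
    if (is.na(y_known[i]))
      log(1e-12 + joint(h, 1, A, B) + joint(h, -1, A, B)) # Marginalize.
    else
      log(1e-12 + joint(h, y_known[i], A, B)) # Condition on the known label.
  }))
}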