Metric learning with linear methods

I read this paper a while ago, which sets out the problem of linear metric learning nicely but then proceeds to solve it in a way I personally thought unnecessarily indirect. It seemed to me that there was a neat analytical solution, so I thought I’d have a go. It turned out to be quite straightforward.

Say we have some feature vectors \(x_i \in \mathbb{R}^p\) and some responses \(y_i \in \mathbb{R}^k\), I want:

\[(Ax_i - Ax_j)^\top (Ax_i - Ax_j) \approx (y_i -y_j)^\top (y_i -y_j)\]

where \(A\) is a \(p \times p\) matrix. That is, let’s find a transformation \(A\) which makes the sum of squared difference between \(i\) and \(j\) similar, regardless of whether its calculated in terms of \(x\) or \(y\). Call this act of changing our feature vectors to better match our responses, “metric learning”. It is interesting because many algorithms implicitly rely on similarity between feature vectors (e.g. Kernel methods, kNN), so they can be improved by making the distance metric more relevant to the problem at hand.

Note that:

\[(Ax_i - Ax_j)^\top (Ax_i - Ax_j) = (x_i-x_j)^\top A^\top A(x_i-x_j)\]

We can set \(B = A^\top A\), and then use the Kronecker product to vectorise it:

\[(x_i-x_j)^\top B(x_i-x_j) = \vec{B}^\top\bigg[(x_i-x_j) \otimes (x_i-x_j)\bigg]\]

Let \(K\) be an \(m \times p^2\) matrix where each row the Kronecker product of a pair of vectors \((x_i-x_j) \otimes (x_i-x_j)\), and let \(z\) be the vector containing the corresponding \((y_i - y_j)^\top(y_i - y_j)\) value, then it turns out we can solve \(K \vec{B} \approx z\) for \(\vec{B}\) in the usual least squares way, and then reshape \(\vec{B}\) back to \(B\).

As a final step we need to approximate \(A\) from \(B\) using the identity \(B = A^\top A\), so that we can use \(A\) to transform \(x_i\). One general way is to use SVD: \(B = U^\top \Sigma V\). We can then set \(A = U\sqrt{\Sigma}\), such that \(A^\top A = U^\top \Sigma U \approx B\).

Let’s test it in R. First I’ll generate some data. I’ll make \(Y\) a subset of \(X\), so the optimal transform should just zero out the rest of the components in \(X\).

P <- 10
N <- 100
X <- matrix(rnorm(P*N, mean=1:P, sd=sqrt(1:P)), nrow=N)
Y <- X[,c(1,3,7,9)]

Next, I need the indexes of some sample pairs.

I <- matrix(sample(1:N, N*P, replace=T), ncol=2)
I <- I[I[,1] > I[,2],]
J <- I[,2]
I <- I[,1]

I fit the pairs as described above and calculate the transformation \(A\):

require(MASS, quietly=T)

K <- mapply(\(i,j) kronecker(X[i,]-X[j,], X[i,]-X[j,]),
            I, J, SIMPLIFY=FALSE)
K <- do.call(rbind, K)
z <- matrix(
  mapply(\(i,j) t(Y[i,]-Y[j,]) %*% (Y[i,]-Y[j,]), I, J),
  ncol=1)
B <- matrix(ginv(K) %*% z, nrow=P, byrow=FALSE)
S <- svd(B)
A <- S$u %*% diag(sqrt(S$d))

Let’s compare the sum of squared differences in the original and transformed space:

X_ <- X %*% A

z0 <- matrix(
  mapply(\(i,j) t(X[i,]-X[j,]) %*% (X[i,]-X[j,]), I, J),
  ncol=1)

z1 <- matrix(
  mapply(\(i,j) t(X_[i,]-X_[j,]) %*% (X_[i,]-X_[j,]), I, J),
  ncol=1)

data.frame(
  label=c("Original", "Transformed"),
  abs.err=round(c(mean(abs(z-z0)),mean(abs(z-z1))), 5))

##         label  abs.err
## 1    Original 166.8362
## 2 Transformed   0.0000

As required, \(A\) represents an optimal transform.