
The Moore-Penrose inverse

Linear algebra is a known quantity. It has a rich theory, there are lots of useful results, much of the key stuff is not so hard to understand, there are lots of geometric analogies, and so on. If you can get your data into a matrix format, you’re off to the races with just the linear algebra toolbox. All the lovely things I use daily are mostly linear algebra – PCA, SVD, factor analysis, LDA, GLMs, GAMs, SVMs, etc. – but I forget the fundamental, elementary stuff. So, I’ll write about the interesting bits in an effort to keep them fresh.

Without further ado, let me introduce the topic of today’s post: the Moore-Penrose (MP) inverse. If you’ve got an equation like \(Ax = b\) where \(x,b\) are vectors and \(A\) is a matrix, then you’d solve for \(x\) by multiplying both sides by the inverse matrix \(A^{-1}\), ending up with \(x = A^{-1}b\). If \(A\) is square and full rank (i.e. invertible), that all works out fine, but what about when \(A\) is rank deficient, or not even square?

For example, say I have a data matrix \(X\) and I’m fiddling about trying to understand what each data column means for the whole. I might try to factor \(X\) into something like \(X=A_iB\), where \(A_i\) is the same as \(X\) but with column \(i\) removed, and \(B\) is some matrix to be discovered. Following the recipe above and writing down \(A_i^{-1}X = B\) makes little sense here: \(A_i\) isn’t even square, so it has no ordinary inverse, and in general the best I can hope for is that \(A_iB\) approximates \(X\). So instead I’m looking to pick \(B\) to minimise \(||X-A_iB||\), where \(||\cdot||\) is a matrix norm measuring the size of the residual, such as the Frobenius norm: this is exactly what the MP inverse does. The MP inverse of \(A\) is written \(A^+\); when \(A\) is invertible, \(A^+ = A^{-1}\), whilst in all other cases \(B = A^+X\) minimises the Frobenius norm of \(X - A B\) (and, among all such minimisers, is itself the smallest). The MP inverse exists for every \(A\) and is always unique.
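To make that a bit more concrete, here is a minimal sketch of where \(A^+\) comes from (using pracma::pinv, which is an assumption on my part about which pinv is used later): take the SVD \(A = UDV^T\), invert the singular values that aren’t numerically zero, and reassemble as \(A^+ = VD^+U^T\). For an invertible \(A\) this collapses back to \(A^{-1}\).

library(pracma)   # for pinv (an assumed choice of implementation)

# a rank-deficient 3x3 example: the second row is twice the first
A = matrix(c(1, 2, 3,
             2, 4, 6,
             1, 0, 1), nrow = 3, byrow = TRUE)

# build A+ by hand from the SVD: A = U D V', so A+ = V D+ U',
# where D+ inverts the non-negligible singular values and zeroes the rest
s = svd(A)
tol = max(dim(A)) * .Machine$double.eps * max(s$d)
dplus = ifelse(s$d > tol, 1 / s$d, 0)
Aplus = s$v %*% diag(dplus) %*% t(s$u)

all.equal(Aplus, pinv(A))   # the two constructions should agree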

Now I can calculate \(A_i^+X = B\), and then measure \(||X-A_iB||\) to gauge the effect of column \(i\) on the approximation of \(X\): the worse the norm, the less the remaining columns are able to make up for the missing column.
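Before pointing this at real data, a quick check of that minimisation claim with some made-up matrices (the names below are just for illustration, still using pracma::pinv from above): with a tall, thin \(A\) there’s no hope of reproducing a random \(X\) exactly, but \(B = A^+X\) should still beat any perturbed alternative in Frobenius norm.

set.seed(1)
A_toy = matrix(rnorm(20), nrow = 10)   # 10 x 2: too thin to hit a 10 x 4 target exactly
X_toy = matrix(rnorm(40), nrow = 10)

B_mp = pinv(A_toy) %*% X_toy           # the MP solution
norm(X_toy - A_toy %*% B_mp, "F")

# nudging B away from the MP solution should only worsen the fit
B_off = B_mp + 0.1 * matrix(rnorm(length(B_mp)), nrow = nrow(B_mp))
norm(X_toy - A_toy %*% B_off, "F")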

Here is some data about “Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888”.

str(datasets::swiss)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

Now, let’s drop one column at a time, use the MP inverse to approximate \(X\) from the remaining columns, and take the Frobenius norm of the residual to see how dropping each column affects the approximation.

library(tidyverse)  # provides %>%, tibble, map_dbl, arrange
library(pracma)     # provides pinv; assuming this is the pinv in use here

# centre and scale the columns so they are on a comparable footing
X = datasets::swiss %>% as.matrix %>% scale

# for each column i: drop it, project X onto the remaining columns via the
# MP inverse, and record the Frobenius norm of what is left over
tibble(
  column = colnames(X),
  norm = seq_len(NCOL(X)) %>%
    map_dbl(\(i) norm(X - X[, -i] %*% pinv(X[, -i]) %*% X, "F"))) %>%
  arrange(-norm)
## # A tibble: 6 × 2
##   column            norm
##   <chr>            <dbl>
## 1 Infant.Mortality  5.90
## 2 Catholic          4.43
## 3 Agriculture       4.19
## 4 Fertility         3.67
## 5 Examination       3.49
## 6 Education         3.27
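
A quick sanity check: \(A_iA_i^+\) is the orthogonal projection onto the column space of the remaining columns, and every column other than \(i\) already lies in that space, so only the dropped column contributes to the residual. Each norm above should therefore match the residual norm from regressing column \(i\) on the others; here is a sketch of that cross-check for Infant.Mortality, which I’d expect to agree with the table up to numerical precision.

# regress the dropped column on the rest (no intercept needed: scale() has
# already centred the columns) and compare the residual norm with the table
i = which(colnames(X) == "Infant.Mortality")
fit = lm(X[, i] ~ X[, -i] + 0)
sqrt(sum(residuals(fit)^2))   # should land on ~5.90, as above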

Et voilà, a spot of linear algebra from (close to) first principles, made useful.