
Estimating gym goers: a mark and recapture experiment

Introduction

I recently started going to a new specialist gym that runs 3 classes per day during the working week and is closed the rest of the time. I've attended at a few different times on a few different days, and I was already seeing many of the same people from my first class. It occurred to me that the chance of seeing the same faces should scale with the number of people going to the gym, so it may be possible to estimate the total number of gym members from the number of people I repeatedly see. I then realised that this sounds very much like a mark and recapture (MR) experiment from ecology, so I did a little research to see how difficult it would be to estimate the number of gym members using MR techniques.

Lincoln–Petersen estimator

In the most elementary version of an MR experiment the scientist randomly samples from a population and marks a certain number of individuals. They then resample the same population and count the number of tagged individuals which have been “recaptured”. The population is assumed to be the same in both cases, and the samples are assumed uniformly random. A property of uniformly random samples is that they preserve proportions, which implies that the fraction of tagged individuals in the sample should be about the same as that in the population. Let \(N,n,T,t\) be the population size, the sample size, the number of tagged individuals and the number of tagged individuals recaptured respectively. Proportion preservation implies that \(\frac{t}{n} \approx \frac{T}{N}\) and thus \(N \approx \frac{nT}{t}\). This is known as the Lincoln–Petersen estimator.
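To make this concrete, here is the arithmetic in R with some made-up numbers (say I had marked 20 members, and later recognised 5 of them in a sample of 32):

T <- 20          # individuals marked in the first sample (made-up)
n <- 32          # size of the second sample
t <- 5           # marked individuals recognised in the second sample
N <- n * T / t   # Lincoln-Petersen estimate
N
# [1] 128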

Applying this estimator directly to my problem would require several generous assumptions, namely that the gym goer population is static, and that people mostly go to the gym only once a week and pick the day and time at random. In my case it is likely that the gym population over a period of 2-3 weeks is close to static, since it is a specialist gym, but it is unlikely that the members pick when they'll go at random. I think people typically follow a routine: I suspect most will go more than once a week, very few will go more than once a day, and the frequency of visits will likely differ between members.

Poisson regression

In order to take into account a more realistic set of assumptions I will need a more flexible model. One way of doing this is to cast the problem as a linear model, and then borrow from the bag of tricks available in the design of experiments literature to make adjustments. To that effect, recall that a log-linear model has the underlying form \(y = e^{\beta X + \epsilon}\), where \(y,\beta,X,\epsilon\) are the response, coefficients, design matrix and error term respectively. Using log as a link function I get \(log(y)= \beta X + \epsilon\): a log-linear model. A Poisson regression assumes that \(y\) is a count variable and interprets \(e^{\beta X + \epsilon}\) as its rate of occurrence. However, I can also model rates directly with a Poisson regression, since if \(\frac{y}{n}=e^{\beta X+\epsilon}\) then \(log(y)=log(n)+\beta X+\epsilon\) and I am back to a log-linear model. This trick is called “offsetting”, and it is directly supported by the R glm function using the offset function within a formula.

I can express a Poisson regression equivalent to the Lincoln–Petersen estimator above as follows. Let \(n=32\) (sample size) and \(t=5\) (recaptures); then in R:

> data.frame(t=5, n=32) %>%
+   glm(t ~ offset(log(n)), data=., family="poisson") %>%
+   summary


Call:
glm(formula=t~offset(log(n)), family="poisson", data=.)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.8563     0.4472  -4.151 3.31e-05 ***
...

The log transformed estimate of the recapture probability is given by the intercept \(\beta_0\), whose fitted value reproduces the observed rate exactly since this is a saturated model. That is, \(exp(\beta_0)\approx\frac{t}{n}\). I can get a population estimate by putting \(\beta_0\) back into the Lincoln–Petersen equation: \(N\approx\frac{nT}{t} = \frac{T}{exp(\beta_0)}\). The reason this is interesting is that I can now adjust the intercept by adding variables that take other factors into account. For example, say I sampled faces to remember in the first week at the gym but then counted recaptures over the following two weeks. That is, \(T\) stays the same but \(t_i,n_i\) become time indexed. I couldn't just pool the following two weeks because I'd likely recapture some members more than once. The Poisson regression gives us a way to deal with this situation. For example, let \(n_1=32,n_2=18,t_1=5,t_2=3\); then:

> data.frame(t=c(5,3), n=c(32,18)) %>%
+   glm(t ~ offset(log(n)), data=., family="poisson") %>%
+   summary


Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.8326     0.3536  -5.183 2.18e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
...

This takes me beyond the Lincoln–Petersen estimator by taking into account two recapture samples with different sample sizes, but I can still use the same formula for \(N\) to calculate a population estimate.
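For instance, continuing from the model just fitted and assuming, hypothetically, that \(T=20\) members had been marked in the first week, the population estimate falls straight out of the fitted intercept:

# refit the two-week model and convert the intercept into a population estimate
m <- glm(t ~ offset(log(n)),
         data = data.frame(t=c(5,3), n=c(32,18)),
         family = "poisson")

T <- 20                                        # hypothetical number marked in week one
N <- as.numeric(T / exp(coefficients(m)[1]))   # N = T / exp(beta_0)
N                                              # roughly 125 with these counts

In the next section I deal with trickier assumptions.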

Blocking

In the case of my gym, the main issues I see with using the Lincoln–Petersen estimator are that (1) people tend to visit the gym on a personal schedule (not randomly), and (2) they do not all visit the gym equally often. Issue (1) implies that the population being sampled may differ depending on the day/time. A practical way to deal with this is to sample these sub-populations as separate “blocks” within which member habits vary less. I account for blocks by adding variables representing the effect of each block in contrast to the overall mean. I can achieve this using so-called deviation coding, which adds \(k-1\) new variables (given \(k\) blocks). The model becomes: \[log(t_i) = log(n_i) + \beta_0 + \beta_{b_1}b_1 + ... + \beta_{b_{k-1}}b_{k-1}\] where \(b_1...b_{k-1}\) are contrast coded variables (taking values 1, 0 or -1) representing block membership; see the short illustration after the list below. Note that I need at least as many samples as there are parameters in the model. The net effect is that my intercept estimate will not inherit the variance attributable to block membership. How to decide the blocks depends on problem specifics and may require separate experiments. In my case, I'll make an educated guess:

  • Monday/Wednesday/Friday evening – People training 2-3 times a week. I think this fits most people and these tend to be the over-subscribed classes. It makes sense to book classes a day apart since most people want a rest-day between workouts.

  • Monday/Wednesday/Friday lunch-time – Less popular since it implies significant flexibility at lunch-time, but I would not expect to see these same people at the evening classes.

  • Tuesday/Thursday evening – For those that prefer smaller crowds and go less frequently.

  • Tuesday/Thursday lunch-time – For those that prefer smaller crowds and go less frequently.
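As a brief illustration of the deviation coding, here is the contrast matrix R builds for four blocks:

# deviation (sum-to-zero) coding: k-1 = 3 contrast variables for 4 blocks,
# with the last level coded as -1 on all of them; attached to a factor via
# contrasts(block_factor) <- contr.sum(4)
contr.sum(4)
#   [,1] [,2] [,3]
# 1    1    0    0
# 2    0    1    0
# 3    0    0    1
# 4   -1   -1   -1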

The main problem in issue (2) is that the probability of sampling an individual is not uniform. To a great extent this should also be alleviated by blocking, because blocking results in a set of populations that are more homogeneous within each block. In my case, I would expect most gym goers to attend at a frequency close to the mean (say 2.5 times a week) and the under/over-sampled members to be outliers. To some extent, the problems they cause can also be alleviated by a larger sample (or more samples), wherein frequent attendees will have proportionally less effect and infrequent attendees will be better accounted for.

Simulation

I will simulate some data featuring all the complications above to see how well I can recover the population size using progressively more complicated versions of the model. Ultimately, these models imply experimental designs, so each has a corresponding sampling scheme. Here is the code to generate 4 weeks of attendance data for \(N=230\) members split into 5 groups of different sizes, each with a separate attendance schedule, and a maximum class size of 32:

require(dplyr)

MAX_CLASS <- 32

# Generate one week of visits for a group of members: each member picks a
# number of visits from `visits`, then a day (without replacement) and a
# class time (with replacement) for each visit.
gen <- function(ids, days, times, visits) {
  lapply(ids, \(i) {
    v <- sample(visits,1)
    tibble(
      id = i,
      day = sample(days, v, replace=F),
      time = sample(times, v, replace=T))
  }) %>% bind_rows
}

# Five groups with disjoint id ranges (230 members in total), each with its
# own schedule and visit frequency.
gen_week <- function(week) {
  bind_rows(
    gen(1:30, c("Tu","Th"), c("12","12"), c(1,2,2,2)),
    gen(31:70, c("Tu","Th"), c("17","18"), c(1,2,2,2)),
    gen(71:120, c("Mo","We","Fr"), c("12","12"), c(2,3,3)),
    gen(121:200, c("Mo","We","Fr"), c("17","18"), c(2,3,3)),
    gen(201:230, c("Mo","Tu","We","Th","Fr"), c("12","17","18"), c(1,2,3))
    ) %>%
    mutate(week=week)
}

# Four weeks of attendance, with each class capped at MAX_CLASS attendees.
data <-
  lapply(c(1,2,3,4), gen_week) %>%
  bind_rows %>%
  group_by(week, day, time) %>%
  sample_n(size = min(MAX_CLASS, n())) %>%
  ungroup

The groups in the data are well separated in schedule, except for the last group, which overlaps with all of them. The resulting data looks like this:

# A tibble: 1,675 × 4
      id day   time   week
   <int> <chr> <chr> <dbl>
 1    73 Fr    12        1
 2    82 Fr    12        1
 3   117 Fr    12        1
 4    85 Fr    12        1
 5    74 Fr    12        1
 6    76 Fr    12        1
 7    98 Fr    12        1
 8    90 Fr    12        1
 9    91 Fr    12        1
10    87 Fr    12        1
# ℹ 1,665 more rows
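A couple of quick sanity checks on the simulated data (the exact numbers vary from run to run):

# distinct members who appear at least once across the 4 weeks
n_distinct(data$id)

# realised class sizes after the MAX_CLASS cap is applied
data %>%
  count(week, day, time) %>%
  summarise(largest_class = max(n), mean_class = mean(n))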

I'll start naively with the Lincoln–Petersen estimator. Say I attend 4 classes in Week 1 and mark 20 per cent of the distinct individuals I see, then attend 4 different classes in Week 2 and count how many marked individuals I see again. I can simulate this as follows:

lincoln_petersen_estimator <- function(i) {

  # week 1: attend four classes and mark 20% of the distinct individuals seen
  marked <- data$id[
    ((data$day == "Mo" & data$time == "12") |
      (data$day == "Tu" & data$time == "17") |
      (data$day == "We" & data$time == "12") |
      (data$day == "Fr" & data$time == "17")) &
      data$week == 1] %>%
    unique %>%
    sample(size=.2*NROW(.), replace=F)
  
  # week 2: attend four different classes and note the distinct individuals seen
  seen <- data$id[
    ((data$day == "Tu" & data$time == "18") |
      (data$day == "We" & data$time == "17") |
      (data$day == "Th" & data$time == "17") |
      (data$day == "Fr" & data$time == "12")) &
      data$week == 2] %>% 
    unique

  T <- NROW(marked)                    # marked individuals
  t <- NROW(intersect(marked, seen))   # marked individuals recaptured
  n <- NROW(seen)                      # size of the recapture sample
  n*T/t
}

The function above will sample the data and recalculate the Lincoln-Petersen population estimate. We can invoke it 10000 times to get a sense of how well this scheme would perform:

> est <- sapply(1:10000, lincoln_petersen_estimator)
> print(quantile(est, c(.025,.5,.975)))

    2.5%      50%    97.5% 
127.2727 175.0000 280.0000 

The median severely underestimates the true population of 230. This is likely because my sampling schedule does not span the member schedules: both the marked and the recaptured individuals are drawn from the same subset of members who frequent those classes, so the estimate reflects the size of that subset rather than the whole membership.

I next add the blocking variables to see how the estimate changes. Utilising blocking also changes the sampling scheme. I can still just attend 4 sessions in the first week, but they need to be spread out equally across the blocks. Here is the sampling scheme and the model in code:

poisson_estimator <- function(i) {
  
  # week 1: mark 20% of the distinct individuals seen in one class per block
  marked <- data$id[
    ((data$day == "Tu" & data$time == "12") |
      (data$day == "Th" & data$time == "18") |
      (data$day == "Mo" & data$time == "18") |
      (data$day == "Fr" & data$time == "12")) &
      data$week == 1] %>%
    unique %>%
    sample(size=.2*NROW(.), replace=F)
  
  # week 2: one recapture class per block, with per-class counts of
  # recaptures (t) and attendance (n)
  seen <- filter(
    data,
    ((day == "Th" & time == "12") |
      (day == "Tu" & time == "18") |
      (day == "Fr" & time == "17") |
      (day == "We" & time == "12")) &
      week == 2) %>%
    group_by(week,day,time) %>%
    summarise(
      t=NROW(intersect(id,marked)),
      n=NROW(id),
      .groups="drop")
  
  # assign each recapture class to its block
  map_group <- function(day, time) {
    if(day %in% c("Mo","We","Fr") & time == "12") return("MoWeFr12")
    if(day %in% c("Tu","Th") & time == "12") return("TuTh12")
    if(day %in% c("Mo","We","Fr") & time %in% c("17","18")) return("MoWeFr1718")
    if(day %in% c("Tu","Th") & time %in% c("17","18")) return("TuTh1718")
  }

  seen$group <- factor(mapply(map_group, seen$day, seen$time))
  contrasts(seen$group) <- contr.sum(4)   # deviation coding for the blocks
  m <- glm(t ~ offset(log(n)) + group, data=seen, family="poisson")

  # only keep the estimate if the intercept is significant at the 5% level
  if (summary(m)$coefficients[,"Pr(>|z|)"][1] < .05) {
    T <- NROW(marked)
    N <- as.numeric(T/exp(coefficients(m)[1]))
    return(N)
  }
  return(NA)
}

As before, we can re-run this experiment 10000 times to get a sense of how the distribution depends on the sample:

> est <- sapply(1:10000, poisson_estimator)
> print(paste("Failed experiments", sum(is.na(est)), "of", NROW(est)))
> print(quantile(est, c(.025,.5,.975), na.rm=T))

[1] "Failed experiments 2281 of 10000"

    2.5%      50%    97.5% 
172.3297 239.5435 358.2011 

The estimate looks very good given what ends up being only about 21 marked individuals. Note that because there are as many data points as parameters the model is saturated, so a single zero-recapture class produces a degenerate fit and spurious estimates. The intercept had a p-value above 5% in about 23 per cent of runs, which presents a practical problem. However, if I additionally recapture in week 3 (and therefore double the number of data points), the results look like this:

[1] "Failed experiments 353 of 10000"

    2.5%      50%    97.5% 
180.6158 237.7038 345.1362 

The median and the confidence interval are sharpened, and the number of failed experiments drops to less than 4 per cent.
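To illustrate the failure mode behind those discarded runs, here is a small standalone example with made-up per-class counts in which one class has zero recaptures:

# hypothetical recapture counts: one class saw no marked individuals
d <- data.frame(
  t = c(5, 4, 0, 3),                      # recaptures per class
  n = c(30, 25, 28, 20),                  # attendance per class
  group = factor(c("A", "B", "C", "D")))  # one class per block
contrasts(d$group) <- contr.sum(4)

m <- glm(t ~ offset(log(n)) + group, data=d, family="poisson")

# the zero count drags that class's fitted log-rate towards -Inf, which inflates
# the intercept's standard error and p-value in the saturated model
summary(m)$coefficients["(Intercept)", ]

Runs like this are the ones filtered out by the significance check in poisson_estimator.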

Complications aside, this situation, in which I've managed to guess the blocks just-so, is still optimistic. There is no way of getting around representative sampling, but that does not imply that one must know the sub-populations exactly. Blocking will work to some extent as long as the recapture rate varies less within blocks than it does between them. In the code below I have used Mon/Wed/Fri versus Tue/Thu as the only blocking factor:

poisson_estimator2 <- function(i) {
  
  marked <- data$id[
    ((data$day == "Tu" & data$time == "12") |
       (data$day == "Th" & data$time == "18") |
       (data$day == "Mo" & data$time == "18") |
       (data$day == "Fr" & data$time == "12")) &
      data$week == 1] %>%
    unique %>%
    sample(size=.2*NROW(.), replace=F)
  
  seen <- filter(
    data,
    ((day == "Th" & time == "12") |
       (day == "Tu" & time == "18") |
       (day == "Fr" & time == "17") |
       (day == "We" & time == "12")) &
      week == 2) %>%
    group_by(week,day,time) %>%
    summarise(
      t=NROW(intersect(id,marked)),
      n=NROW(id),
      .groups="drop")
  
  map_group <- function(day, time) {
    if(day %in% c("Mo","We","Fr")) return("MoWeFr")
    if(day %in% c("Tu","Th")) return("TuTh")
  }
  
  seen$group <- factor(mapply(map_group, seen$day, seen$time))
  contrasts(seen$group) <- contr.sum(2)
  m <- glm(t ~ offset(log(n)) + group, data=seen, family="poisson")
  if (summary(m)$coefficients[,"Pr(>|z|)"][1] < .05) {
    T <- NROW(marked)
    N <- as.numeric(T/exp(coefficients(m)[1]))
    return(N)
  }
  return(NA)
}

As before, I simulate the function 10000 times and present the failed experiments and quantiles.

> est <- sapply(1:10000, poisson_estimator2)
> print(paste("Failed experiments",sum(is.na(est)),"of",NROW(est)))
> print(quantile(est, c(.025,.5,.975), na.rm=T))

[1] "Failed experiments 12 of 10000"

    2.5%      50%    97.5% 
166.4932 227.4731 359.6665 

The median estimate remains very good, but with a slightly wider 95 per cent confidence interval. If I were to remove all the blocking and just analyse based on the repeated recaptures (sketched at the end of this section), the results are:

    2.5%      50%    97.5% 
156.1875 208.2500 312.3750 

Blocking makes a difference even when it is imprecise!
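For reference, here is a sketch of how that no-blocking variant could look, assuming the sampling scheme stays the same and only the group term is dropped from the model, which amounts to pooling the four recapture classes (poisson_estimator0 is just my label for it):

poisson_estimator0 <- function(i) {

  # same marking scheme as before
  marked <- data$id[
    ((data$day == "Tu" & data$time == "12") |
       (data$day == "Th" & data$time == "18") |
       (data$day == "Mo" & data$time == "18") |
       (data$day == "Fr" & data$time == "12")) &
      data$week == 1] %>%
    unique %>%
    sample(size=.2*NROW(.), replace=F)

  # same recapture classes as before
  seen <- filter(
    data,
    ((day == "Th" & time == "12") |
       (day == "Tu" & time == "18") |
       (day == "Fr" & time == "17") |
       (day == "We" & time == "12")) &
      week == 2) %>%
    group_by(week,day,time) %>%
    summarise(
      t=NROW(intersect(id,marked)),
      n=NROW(id),
      .groups="drop")

  # intercept-only model: no blocking variables, just the offset
  m <- glm(t ~ offset(log(n)), data=seen, family="poisson")
  T <- NROW(marked)
  as.numeric(T/exp(coefficients(m)[1]))
}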

Conclusions

I’ve described a gym goer population estimation problem solved using a “mark and recapture” experiment design and corresponding data analysis. I’ve shown how we can improve on the simple Lincoln-Petersen estimator with a Poisson regression by taking into account more of the factors confounding the population estimate. Finally, I’ve shown that under semi-realistic simulation assumptions, the method described can accurately estimate the population using a simple sampling scheme and elementary statistical techniques.