--- title: "Additional calculus of M-estimation examples using `geex`" author: "Bradley Saul" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Calculus of M-estimation examples using geex} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} references: - id: stefanski2002 author: - family: Stefanski given: Len A. - family: Boos given: Dennis D. container-title: The calculus of M-estimation type: article-journal URL: 'http://www.jstor.org/stable/3087324' DOI: 10.1198/000313002753631330 journal: The American Statistician issued: year: 2002 volume: 56 number: 1 pages: 29-38 - id: huber1964 author: - family: Huber given: Peter J. container-title: Robust estimation of a location parameter journal: The Annals of Mathematical Statistics type: article-journal volume: 35 number: 1 pages: 73-101 issued: year: 1964 --- ## Preliminaries This vignette implements examples from and is meant to read with @stefanski2002 (SB). Examples 4-9 use the `geexex` data set. For information on how the data are generated, run `help('geexex')`. ```{r, echo = TRUE} library(geex) ``` ## The Basic Approach ### Example 4: Instrumental variables Example 4 calculates an instumental variable estimator. The variables ($Y_3, W_1, Z_1$) are observed according to: \[ Y_{3i} = \alpha + \beta X_{1i} + \sigma_{\epsilon}\epsilon_{1, i} \] \[ W_{1i} =\beta X_{1i} + \sigma_{U}\epsilon_{2, i} \] and \[ Z_{1i} = \gamma + \delta X_{1i} + \sigma_{\tau}\epsilon_{3, i}. \] $Y_3$ is a linear function of a latent variable $X_1$, whose coefficient $\beta$ is of interest. $W_1$ is a measurement of the unobserved $X_1$, and $Z_1$ is an instrumental variable for $X_1$. The instrumental variable estimator is $\hat{\beta}_{IV}$ is the ratio of least squares regression slopes of $Y_3$ on $Z_1$ and $Y_3$ and $W_1$. The $\psi$ function below includes this estimator as $\hat{\theta}_4 = \hat{\beta}_{IV}$. In the example data, 100 observations are generated where $X_1 \sim \Gamma(\text{shape} = 5, \text{rate} = 1)$, $\alpha = 2$, $\beta = 3$, $\gamma = 2$, $\delta = 1.5$, $\sigma_{\epsilon} = \sigma_{\tau} = 1$, $\sigma_U = .25$, and $\epsilon_{\cdot, i} \sim N(0,1)$. \[ \psi(Y_{3i}, Z_{1i}, W_{1i}, \mathbf{\theta}) = \begin{pmatrix} \theta_1 - Z_{1i} \\ \theta_2 - W_{1i} \\ (Y_{3i} - \theta_3 W_{1i})(\theta_2 - W_{1i}) \\ (Y_{3i} - \theta_4 W_{1i})(\theta_1 - Z_{1i}) \\ \end{pmatrix} \] ```{r SB4_estFUN, echo = TRUE} SB4_estFUN <- function(data){ Z1 <- data$Z1; W1 <- data$W1; Y3 <- data$Y3 function(theta){ c(theta[1] - Z1, theta[2] - W1, (Y3 - (theta[3] * W1)) * (theta[2] - W1), (Y3 - (theta[4] * W1)) * (theta[1] - Z1) ) } } ``` ```{r SB4_run, echo = TRUE, message=FALSE} estimates <- m_estimate( estFUN = SB4_estFUN, data = geexex, root_control = setup_root_control(start = c(1, 1, 1, 1))) ``` The R packages [`AER`](https://cran.r-project.org/package=AER) and [`ivpack`](https://cran.r-project.org/package=ivpack) can compute the IV estimator and sandwich variance estimators respectively, which is done below for comparison. ```{r SB4_clsform, echo = TRUE, eval = FALSE, message = FALSE, results = 'hide'} ivfit <- AER::ivreg(Y3 ~ W1 | Z1, data = geexex) iv_se <- ivpack::cluster.robust.se(ivfit, clusterid = 1:nrow(geexex)) ``` ```{r SB4_results, eval = FALSE, echo = TRUE} coef(ivfit)[2] coef(estimates)[4] iv_se[2, 'Std. 
## Connections to the Influence Curve

### Example 5: Hodges-Lehmann estimator

This example shows the link between influence curves and the Hodges-Lehmann estimator.

\[
\psi(Y_{2i}, \theta) = 
\begin{pmatrix}
IC_{\hat{\theta}_{HL}}(Y_{2i}; \theta_0) - (\theta_1 - \theta_0) \\
\end{pmatrix}
\]

The functions used in `SB5_estFUN` are defined below:

```{r SB5_internals, echo = TRUE}
F0 <- function(y, theta0, distrFUN = pnorm){
  distrFUN(y - theta0, mean = 0)
}

f0 <- function(y, densFUN){
  densFUN(y, mean = 0)
}

integrand <- function(y, densFUN = dnorm){
  f0(y, densFUN = densFUN)^2
}

IC_denom <- integrate(integrand, lower = -Inf, upper = Inf)$value
```

```{r SB5_estFUN, echo = TRUE}
SB5_estFUN <- function(data){
  Yi <- data[['Y2']]
  function(theta){
    (1/IC_denom) * (F0(Yi, theta[1]) - 0.5)
  }
}
```

```{r SB5_run, echo = TRUE, message = FALSE}
estimates <- m_estimate(
  estFUN = SB5_estFUN,
  data   = geexex,
  root_control = setup_root_control(start = 2))
```

The `hl.loc` function of the [`ICSNP`](https://cran.r-project.org/package=ICSNP) package computes the Hodges-Lehmann estimator, and SB give a closed form estimator for the variance.

```{r SB5_clsform, echo = TRUE}
theta_cls <- ICSNP::hl.loc(geexex$Y2)
Sigma_cls <- 1/(12 * IC_denom^2) / nrow(geexex)
```

```{r SB5_results, echo = FALSE}
results <- list(geex = list(parameters = coef(estimates), vcov = vcov(estimates)),
                cls  = list(parameters = theta_cls, vcov = Sigma_cls))
results
```

## Non-smooth $\psi$ functions

### Example 6: Huber center of symmetry estimator

This example illustrates estimation with a nonsmooth $\psi$ function, computing the @huber1964 estimator of the center of a symmetric distribution.

\[
\psi_k(Y_{1i}, \theta) = 
\begin{cases}
Y_{1i} - \theta & \text{when } |Y_{1i} - \theta| \leq k \\
k \cdot \text{sgn}(Y_{1i} - \theta) & \text{when } |Y_{1i} - \theta| > k \\
\end{cases}
\]

```{r SB6_estFUN, echo = TRUE}
SB6_estFUN <- function(data, k = 1.5){
  Y1 <- data$Y1
  function(theta){
    x <- Y1 - theta[1]
    if(abs(x) <= k) x else sign(x) * k
  }
}
```

```{r SB6_run, echo = TRUE, message = FALSE}
estimates <- m_estimate(
  estFUN = SB6_estFUN,
  data   = geexex,
  root_control = setup_root_control(start = 3))
```

The point estimate from `geex` is compared to the estimate obtained from the `huber` function in the [`MASS`](https://cran.r-project.org/package=MASS) package. The variance estimate is compared to the estimator suggested by SB: $A_m = \sum_i [-\psi_k'(Y_i - \hat{\theta})]/m$ and $B_m = \sum_i \psi_k^2(Y_i - \hat{\theta})/m$, where $\psi_k'$ is the derivative of $\psi_k$ where it exists.

```{r SB6_clsform, echo = TRUE}
theta_cls <- MASS::huber(geexex$Y1, k = 1.5, tol = 1e-10)$mu

psi_k <- function(x, k = 1.5){
  if(abs(x) <= k) x else sign(x) * k
}

A <- mean(unlist(lapply(geexex$Y1, function(y){
  x <- y - theta_cls
  -numDeriv::grad(psi_k, x = x)
})))

B <- mean(unlist(lapply(geexex$Y1, function(y){
  x <- y - theta_cls
  psi_k(x = x)^2
})))

## closed form covariance
Sigma_cls <- matrix(1/A * B * 1/A / nrow(geexex))
```

```{r SB6_results, echo = TRUE}
results <- list(geex = list(parameters = coef(estimates), vcov = vcov(estimates)),
                cls  = list(parameters = theta_cls, vcov = Sigma_cls))
results
```
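Since $\psi_k$ is piecewise linear, its derivative, where it exists, is simply the indicator $I(|x| < k)$, so the numerical gradient used above can be double-checked analytically. The sketch below is an illustration, not part of SB's examples; the function `dpsi_k` and the chunk name are mine.

```{r SB6_check, echo = TRUE}
# Derivative of the Huber psi where it exists:
# psi_k'(x) = 1 for |x| < k and 0 for |x| > k (undefined at |x| = k)
dpsi_k <- function(x, k = 1.5) as.numeric(abs(x) < k)

# Recompute A_m analytically; the minus sign mirrors the
# -numDeriv::grad() convention above. The two values should
# agree up to numerical error.
A_check <- mean(-dpsi_k(geexex$Y1 - theta_cls))
c(A, A_check)
```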
### Example 7: Sample quantiles (approximation of $\psi$)

*Approximation of $\psi$ with `geex` is EXPERIMENTAL.*

Example 7 illustrates calculation of sample quantiles using M-estimation and approximation of the $\psi$ function.

\[
\psi(Y_{1i}, \theta) = 
\begin{pmatrix}
0.5 - I(Y_{1i} \leq \theta_1) \\
\end{pmatrix}
\]

```{r SB7_estFUN, echo = TRUE}
SB7_estFUN <- function(data){
  Y1 <- data$Y1
  function(theta){
    0.5 - (Y1 <= theta[1])
  }
}
```

Note that $\psi$ is not differentiable; however, `geex` includes the ability to approximate the $\psi$ function via an `approx_control` object. The `FUN` in an `approx_control` must be a function that takes in the $\psi$ function, modifies it, and returns a function of `theta`. For this example, I approximate $\psi$ with a spline function. The `eval_theta` argument sets the grid of points on which the spline is fit.

```{r approx}
spline_approx <- function(psi, eval_theta){
  y <- Vectorize(psi)(eval_theta)
  f <- splinefun(x = eval_theta, y = y)
  function(theta) f(theta)
}
```

```{r SB7_run, echo = TRUE, message = FALSE}
estimates <- m_estimate(
  estFUN = SB7_estFUN,
  data   = geexex,
  root_control   = setup_root_control(start = 4.7),
  approx_control = setup_approx_control(FUN = spline_approx,
                                        eval_theta = seq(3, 6, by = .05)))
```

A closed form comparison of the variance is not obvious, so only the point estimate is compared, to the sample median.

```{r SB7_results, echo = FALSE}
results <- list(geex = list(parameters = coef(estimates), vcov = vcov(estimates)),
                cls  = list(parameters = median(geexex$Y1), vcov = NA))
results
```

## Regression M-estimators

### Example 8: Robust regression

Example 8 demonstrates robust regression for estimating $\beta$ from 100 observations generated from $Y_{4i} = 0.1 + 0.1 X_{1i} + 0.5 X_{2i} + \epsilon_i$, where $\epsilon_i \sim N(0, 1)$, $X_1$ is defined as above, the first half of the observations have $X_{2i} = 1$, and the rest have $X_{2i} = 0$.

\[
\psi(Y_{4i}, \theta) = 
\begin{pmatrix}
\psi_k(Y_{4i} - \mathbf{x}_i^T \beta) \mathbf{x}_i
\end{pmatrix}
\]

```{r SB8_estFUN, echo = TRUE}
psi_k <- function(x, k = 1.345){
  if(abs(x) <= k) x else sign(x) * k
}

SB8_estFUN <- function(data){
  Yi <- data$Y4
  xi <- model.matrix(Y4 ~ X1 + X2, data = data)
  function(theta){
    r <- Yi - xi %*% theta
    c(psi_k(r) %*% xi)
  }
}
```

```{r SB8_run, echo = TRUE, message = FALSE}
estimates <- m_estimate(
  estFUN = SB8_estFUN,
  data   = geexex,
  root_control = setup_root_control(start = c(0, 0, 0)))
```

```{r SB8_clsform, echo = TRUE}
m <- MASS::rlm(Y4 ~ X1 + X2, data = geexex, method = 'M')
theta_cls <- coef(m)
Sigma_cls <- vcov(m)
```

```{r SB8_results, echo = TRUE}
results <- list(geex = list(parameters = coef(estimates), vcov = vcov(estimates)),
                cls  = list(parameters = theta_cls, vcov = Sigma_cls))
results
```

### Example 9: Generalized linear models

Example 9 illustrates estimation of a generalized linear model.

\[
\psi(Y_i, \theta) = 
\begin{pmatrix}
D_i(\beta)\frac{Y_i - \mu_i(\beta)}{V_i(\beta) \tau}
\end{pmatrix}
\]

```{r SB9_estFUN, echo = TRUE}
SB9_estFUN <- function(data){
  Y <- data$Y5
  X <- model.matrix(Y5 ~ X1 + X2, data = data, drop = FALSE)
  function(theta){
    lp <- X %*% theta
    mu <- plogis(lp)
    D  <- t(X) %*% dlogis(lp)
    V  <- mu * (1 - mu)
    D %*% solve(V) %*% (Y - mu)
  }
}
```

```{r SB9_run, echo = TRUE, message = FALSE}
estimates <- m_estimate(
  estFUN = SB9_estFUN,
  data   = geexex,
  root_control = setup_root_control(start = c(.1, .1, .5)))
```
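For the logistic link, `dlogis(lp)` equals `mu * (1 - mu)`, so $D_i V_i^{-1}$ reduces to $\mathbf{x}_i^T$ and the estimating function collapses to the familiar logistic score $\mathbf{x}_i^T(Y_i - \mu_i)$. The sketch below illustrates this equivalence; the function `SB9_estFUN_score` and the chunk name are mine, not from SB or `geex`.

```{r SB9_check, echo = TRUE, message = FALSE}
# Simplified logistic score: t(X) %*% (Y - mu)
SB9_estFUN_score <- function(data){
  Y <- data$Y5
  X <- model.matrix(Y5 ~ X1 + X2, data = data)
  function(theta){
    mu <- plogis(X %*% theta)
    c(t(X) %*% (Y - mu))
  }
}

estimates_score <- m_estimate(
  estFUN = SB9_estFUN_score,
  data   = geexex,
  root_control = setup_root_control(start = c(.1, .1, .5)))

# Point estimates should agree up to root-finder tolerance
rbind(full = coef(estimates), score = coef(estimates_score))
```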
Compare the point estimates to the `glm` coefficients and the covariance matrix to the `sandwich` estimator.

```{r SB9_clsform, echo = TRUE}
m9 <- glm(Y5 ~ X1 + X2, data = geexex, family = binomial(link = 'logit'))
theta_cls <- coef(m9)
Sigma_cls <- sandwich::sandwich(m9)
```

```{r SB9_results, echo = TRUE}
results <- list(geex = list(parameters = coef(estimates), vcov = vcov(estimates)),
                cls  = list(parameters = theta_cls, vcov = Sigma_cls))
results
```

## Application to test statistics

### Example 10: Testing equality of success probabilities

```{r SB10_setup, echo = FALSE}
shaq <- data.frame(
  game    = 1:23,
  ft_made = c(4, 5, 5, 5, 2, 7, 6, 9, 4, 1, 13, 5, 6, 9, 7, 3, 8, 1, 18, 3, 10, 1, 3),
  ft_attp = c(5, 11, 14, 12, 7, 10, 14, 15, 12, 4, 27, 17, 12, 9, 12, 10, 12, 6, 39, 13, 17, 6, 12))
```

Example 10 illustrates testing equality of success probabilities across the 23 games in the `shaq` data set of free throws made (`ft_made`) out of attempts (`ft_attp`).

\[
\psi(Y_i, n_i, \theta) = 
\begin{pmatrix}
\frac{(Y_i - n_i \theta_2)^2}{n_i \theta_2(1 - \theta_2)} - \theta_1 \\
Y_i - n_i \theta_2
\end{pmatrix}
\]

```{r SB10_estFUN, echo = TRUE}
SB10_estFUN <- function(data){
  Y <- data$ft_made
  n <- data$ft_attp
  function(theta){
    p <- theta[2]
    c(((Y - (n * p))^2)/(n * p * (1 - p)) - theta[1],
      Y - n * p)
  }
}
```

```{r SB10_run, echo = TRUE}
estimates <- m_estimate(
  estFUN = SB10_estFUN,
  data   = shaq,
  units  = 'game',
  root_control = setup_root_control(start = c(.5, .5)))
```

```{r SB10_clsform, echo = TRUE}
V11 <- function(p) {
  k <- nrow(shaq)
  sumn     <- sum(shaq$ft_attp)
  sumn_inv <- sum(1/shaq$ft_attp)
  term2_n  <- 1 - (6 * p) + (6 * p^2)
  term2_d  <- p * (1 - p)
  term2    <- term2_n/term2_d
  term3    <- ((1 - (2 * p))^2) / ((sumn/k) * p * (1 - p))
  2 + (term2 * (1/k) * sumn_inv) - term3
}

p_tilde <- sum(shaq$ft_made)/sum(shaq$ft_attp)
V11_hat <- V11(p_tilde)/23

# Compare variance estimates
V11_hat
vcov(estimates)[1, 1]

# Note the differences in the p-values
pnorm(35.51/23, mean = 1, sd = sqrt(V11_hat), lower.tail = FALSE)
pnorm(coef(estimates)[1], mean = 1,
      sd = sqrt(vcov(estimates)[1, 1]), lower.tail = FALSE)
```

This example shows that the empirical sandwich variance estimator may differ from other sandwich variance estimators that make assumptions about the structure of the $A$ and $B$ matrices.

## References