# 6 Random and Mixed Effects Models

In this chapter we use a new philosophy. Up to now, treatment effects
(the \(\alpha_i\)’s) were *fixed*, unknown quantities that we tried to
estimate. This means we were making a statement about a *specific*,
*fixed* set of treatments (e.g., some specific fertilizers or different
vaccine types). Such models are also called
**fixed effects models**.

## 6.1 Random Effects Models

### 6.1.1 One-Way ANOVA

Now we use another point of view: We consider situations where treatments are
**random samples** from a **large population** of treatments. This might seem
quite special at first sight, but it is actually very natural in many
situations. Think for example of investigating employee performance of those who
were randomly selected. Another example could be machines that were
randomly sampled from a large population of machines. Typically, we are
interested in making a statement about some properties of the whole population
and not of the observed individuals (here, employees or machines).

Going further with the machine example: Assume that every machine
produces some samples whose quality we assess. We denote the observed quality of
the \(j\)th sample on the \(i\)th machine with \(y_{ij}\). We can model such
data with the model
\[\begin{equation}
Y_{ij} = \mu + \alpha_i + \epsilon_{ij},
\tag{6.1}
\end{equation}\]
where \(\alpha_i\) is the effect of the \(i\)th machine. As the machines
were drawn randomly from a large population, we assume
\[
\alpha_i \; \textrm{i.i.d.} \sim N(0, \sigma_{\alpha}^2).
\]
We call \(\alpha_i\) a **random effect**. Hence,
(6.1) is a so-called **random effects
model**. For the error term we have the
usual assumption \(\epsilon_{ij}\) i.i.d. \(\sim N(0, \sigma^2)\). In addition,
we assume that \(\alpha_i\) and \(\epsilon_{ij}\) are independent. This
looks very similar to the old fixed effects model
(2.4) at first sight. However, note that the
\(\alpha_i\)’s in Equation (6.1) are **random
variables** (reflecting the sampling mechanism of the data) and *not*
fixed unknown parameters. This small change will have a large impact
on the properties of the model. In addition, we have a new parameter
\(\sigma_{\alpha}^2\) which is the variance of the random effect (here, the
variance between different machines). Sometimes, such models are also
called **variance components models** because of the different variances
\(\sigma_{\alpha}^2\) and \(\sigma^2\) (more complex models will have
additional variances).

Let us now inspect some properties of model (6.1).
The expected value of \(Y_{ij}\) is
\[
E[Y_{ij}] = \mu.
\]
For the variance of \(Y_{ij}\) we have
\[
\Var(Y_{ij}) = \sigma_{\alpha}^2 + \sigma^2.
\]
For the correlation structure we get
\[
\Cor(Y_{ij}, Y_{kl}) =
\left\{
\begin{array}{cc}
0 & i \neq k \\
\sigma_{\alpha}^2 / (\sigma_{\alpha}^2 + \sigma^2) & i = k, j \neq l \\
1 & i = k, j = l
\end{array}
\right.
\]
Observations from different machines (\(i \ne k\)) are uncorrelated while
observations from the *same* machine (\(i = k\)) are correlated.
We also call the correlation within the same machine
\(\sigma_{\alpha}^2 / (\sigma_{\alpha}^2 + \sigma^2)\)
**intraclass correlation (ICC)**. It is
large if \(\sigma_{\alpha}^2 \gg \sigma^2\). A large value means that observations
from the *same* “group” (here, machine) are much more similar than observations
from *different* groups (machines). Three different scenarios for an example
with six observations from each of the eight machines can be found in Figure
6.1.

This nonzero correlation within the same machine is in contrast to the fixed
effects model (2.4), where all values are
*independent*, because there, the \(\alpha_i\)’s are *parameters*, that is
*fixed*, unknown quantities, and therefore also uncorrelated.

Parameter estimation for the variance components \(\sigma_{\alpha}^2\) and
\(\sigma^2\) is nowadays typically being done with a technique called **restricted
maximum likelihood (REML)**, see for
example Jiang and Nguyen (2021) for more details. We could also use classical
maximum likelihood estimators here, but REML estimates are less biased. The
parameter \(\mu\) is estimated with maximum likelihood assuming that the variances
are known. Historically, method of moments estimators were (or are) also heavily
used, see for example
Kuehl (2000) or Oehlert (2000).

In `R`

, there are many packages that can fit such models. We will consider
`lme4`

(Bates et al. 2015) and later also `lmerTest`

(Kuznetsova, Brockhoff, and Christensen 2017), which
basically uses `lme4`

for model fitting and adds some statistical tests on
top.

Let us now consider Exercise 5.1 from Kuehl (2000) about an inheritance
study with beef animals. Five sires, male animals, were each mated to a
*separate* group of dams, female animals. The birth weights of eight male
calves (from different dams) in each of the five sire groups were
recorded. We want to find out what part of the total variation is due to
variation between different sires. In fact, such random effects models were used
very early in animal breeding programs, see for example Henderson (1963).

We first create the data set and visualize it.

```
## Create data set ####
weight <- c(61, 100, 56, 113, 99, 103, 75, 62, ## sire 1
75, 102, 95, 103, 98, 115, 98, 94, ## sire 2
58, 60, 60, 57, 57, 59, 54, 100, ## sire 3
57, 56, 67, 59, 58, 121, 101, 101, ## sire 4
59, 46, 120, 115, 115, 93, 105, 75) ## sire 5
sire <- factor(rep(1:5, each = 8))
animals <- data.frame(weight, sire)
str(animals)
```

```
## 'data.frame': 40 obs. of 2 variables:
## $ weight: num 61 100 56 113 99 103 ...
## $ sire : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 ...
```

```
## Visualize data ####
stripchart(weight ~ sire, vertical = TRUE, pch = 1, xlab = "sire",
data = animals)
```

At first sight it looks like the variation between different sires is rather small.

Now we fit the random effects model with the `lmer`

function in package `lme4`

.
We want to have a random effect per sire. This can be specified with the
notation `(1 | sire)`

in the model formula. This means that the “granularity” of
the random effect is specified after the vertical bar “`|`

”. All observations
sharing the same level of `sire`

will get the same random effect \(\alpha_i\). The
“`1`

” means that we only want to have a so-called random intercept (\(\alpha_i\))
per sire. In the regression setup, we could also have a random slope for which
we would use the notation `(x | sire)`

for a continuous predictor `x`

. However,
this is not of relevance here.

As usual, the function `summary`

gives an overview of the fitted model.

`summary(fit.animals)`

```
## Linear mixed model fit by REML ['lmerMod']
## ...
## Random effects:
## Groups Name Variance Std.Dev.
## sire (Intercept) 117 10.8
## Residual 464 21.5
## Number of obs: 40, groups: sire, 5
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 82.55 5.91 14
```

From the summary we can read off the table labelled `Random Effects`

that \(\widehat{\sigma}_{\alpha}^2 = 117\) (`sire`

) and
\(\widehat{\sigma}^2 = 464\) (`Residual`

). Note that the column `Std.Dev`

is
nothing more than the square root of the variance and *not* the standard
error (accuracy) of the variance estimate.

The variance of \(Y_{ij}\) is
therefore estimated as \(117 + 464 = 581\). Hence, only about \(117 / 581 = 20\%\) of the total
variance of the birth weight is due to sire, this is the intraclass
correlation. This confirms our visual inspection from above. It is good practice
to verify whether the grouping structure was interpreted
correctly. In the row `Number of obs`

we can read off that we have 40
observations in total and a grouping factor given by `sire`

which has five levels.

Under `Fixed effects`

we find the estimate \(\widehat{\mu} = 82.55\).
It is an estimate for the expected birth weight of a male calf
of a randomly selected sire, randomly selected from the *whole*
population of all sires.

Approximate 95% confidence intervals for the parameters can be obtained with the function
`confint`

(the argument `oldNames`

has to bet set to `FALSE`

to get a nicely
labelled output).

`confint(fit.animals, oldNames = FALSE)`

```
## 2.5 % 97.5 %
## sd_(Intercept)|sire 0.00 24.62
## sigma 17.33 27.77
## (Intercept) 69.84 95.26
```

Hence, an approximate 95% confidence interval for the standard deviation
\(\sigma_{\alpha}\) is given by \([0, \, 24.62]\). We see
that the estimate \(\widehat{\sigma}_{\alpha}\) is therefore quite
imprecise. The reason is that we only have five sires to estimate the
standard deviation. The corresponding confidence interval for the variance \(\sigma_{\alpha}^2\) would
be given by \([0^2, \, 24.62^2]\) (we can simply apply the square
function to the interval boundaries). The row labelled with `(Intercept)`

contains the corresponding
confidence interval for \(\mu\).

What would happen if we used the “ordinary” `aov`

function here?

```
options(contrasts = c("contr.sum", "contr.poly"))
fit.animals.aov <- aov(weight ~ sire, data = animals)
confint(fit.animals.aov)
```

```
## 2.5 % 97.5 %
## (Intercept) 75.637 89.463
## sire1 -12.751 14.901
## sire2 1.124 28.776
## sire3 -33.251 -5.599
## sire4 -18.876 8.776
```

As we have used `contr.sum`

, the confidence interval for \(\mu\) which can
be found under `(Intercept)`

is a confidence interval for the average of
the expected birth weight of the male offsprings of these five *specific*
sires. More precisely, each of the five sires has its expected
birth weight \(\mu_i, i = 1, \ldots, 5\) (of a male offspring). With
`contr.sum`

we have
\[
\mu = \frac{1}{5} \sum_{i=1}^g \mu_i,
\]
i.e., \(\mu\) is the average of the expected value of these five
*specific* sires. We see that this confidence interval is *shorter* than
the one from the random effects model. The reason is the different
interpretation. The random effects model allows to make inference about
the population of all sires (where we have seen five so far), while the
fixed effects model allows to make inference about these five *specific*
sires. Hence, we are facing a more difficult problem with the random
effects model; this is why we are less confident in our estimate
resulting in wider confidence intervals compared to the fixed effects
model.

We can also get “estimates” of the random effects \(\alpha_i\) with the function `ranef`

.

`ranef(fit.animals)`

```
## $sire
## (Intercept)
## 1 0.7183
## 2 9.9895
## 3 -12.9797
## 4 -3.3744
## 5 5.6462
##
## with conditional variances for "sire"
```

More precisely, as the random effects are random quantities and *not* fixed
parameters, what we get with `ranef`

are the conditional
means (given the observed data) which are the best
linear unbiased predictions, also
known as **BLUPs** (Robinson 1991). Typically, these estimates are shrunken
toward zero because of the normal assumption, hence we get slightly
different results compared to the fixed effects model. This is also known as the
so-called **shrinkage property**.

We should of course also check the model assumptions. Here, this includes in
addition normality of the random effects, though this is hard to check with only
five observations. We get the Tukey-Anscombe plot
with the `plot`

function.

`plot(fit.animals) ## TA-plot`

To get QQ-plots of the random effects and the residuals, we need to
extract them first and then use `qqnorm`

as usual.

```
par(mfrow = c(1, 2))
qqnorm(ranef(fit.animals)$sire[,"(Intercept)"],
main = "Random effects")
qqnorm(resid(fit.animals), main = "Residuals")
```

Depending on the model complexity, residual analysis for models including random
effects can be subtle, this includes the models we will learn about in Section
6.2 too, see for example Santos Nobre and Motta Singer (2007),
Loy and Hofmann (2015) or Loy, Hofmann, and Cook (2017). Some implementations of these papers can be found in
package `HLMdiag`

(Loy and Hofmann 2014).

Sometimes you also see statistical tests of the form \(H_0: \sigma_{\alpha} = 0\)
vs. \(H_A: \sigma_{\alpha} > 0\). For model (6.1) we could
actually get exact p-values, while for complex models later this is not the case
anymore. In these situations, approximate or simulation based methods have to be used. Some
can be found in package `RLRsim`

(Scheipl, Greven, and Kuechenhoff 2008).

### 6.1.2 More Than One Factor

So far this was a one-way ANOVA model with a random effect. We can extend this to the two-way ANOVA situation and beyond. For this reason, we consider the following example:

A large company randomly selected five employees and six batches of source material from the production process. The material from each batch was divided into 15 pieces which were randomized to the different employees such that each employee would build three test specimens from each batch. The response was the corresponding quality score. The goal was to quantify the different sources of variation.

We first load the data and get an overview.

```
book.url <- "https://stat.ethz.ch/~meier/teaching/book-anova"
quality <- readRDS(url(file.path(book.url, "data/quality.rds")))
str(quality)
```

```
## 'data.frame': 90 obs. of 3 variables:
## $ employee: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 ...
## $ batch : Factor w/ 6 levels "B1","B2","B3",..: 1 1 1 2 2 2 ...
## $ score : num 27.4 27.8 27.3 25.5 25.5 26.4 ...
```

`xtabs(~ batch + employee, data = quality)`

```
## employee
## batch 1 2 3 4 5
## B1 3 3 3 3 3
## B2 3 3 3 3 3
## B3 3 3 3 3 3
## B4 3 3 3 3 3
## B5 3 3 3 3 3
## B6 3 3 3 3 3
```

We can for example visualize this data set with an interaction plot.

```
with(quality, interaction.plot(x.factor = batch,
trace.factor = employee,
response = score))
```

There seems to be variation between different employees and between the different batches. The interaction does not seem to be very pronounced. Let us set up a model for this data. We denote the observed quality score of the \(k\)th sample of employee \(i\) using material from batch \(j\) by \(y_{ijk}\). A natural model is then \[\begin{equation} Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}, \tag{6.2} \end{equation}\] where \(\alpha_i\) is the random (main) effect of employee \(i\), \(\beta_j\) is the random (main) effect of batch \(j\), \((\alpha\beta)_{ij}\) is the random interaction term between employee \(i\) and batch \(j\) and \(\epsilon_{ijk} \textrm{ i.i.d.} \sim N(0, \sigma^2)\) is the error term. For the random effects we have the usual assumptions \[\begin{align*} \alpha_i & \textrm{ i.i.d.} \sim N(0, \sigma_{\alpha}^2), \\ \beta_j & \textrm{ i.i.d.} \sim N(0, \sigma_{\beta}^2), \\ (\alpha\beta)_{ij} & \textrm{ i.i.d.} \sim N(0, \sigma_{\alpha\beta}^2). \end{align*}\] In addition, we assume that \(\alpha_i\), \(\beta_j\), \((\alpha\beta)_{ij}\) and \(\epsilon_{ijk}\) are independent. As each random term in the model has its own variance component, we now have the variances \(\sigma_{\alpha}^2\), \(\sigma_{\beta}^2\), \(\sigma_{\alpha\beta}^2\) and \(\sigma^2\).

What is the interpretation of the different terms? The random (main) effect of employee is the variability between different employees, and the random (main) effect of batch is the variability between different batches. The random interaction term can for example be interpreted as quality inconsistencies of employees across different batches.

Let us fit such a model in `R`

. We want to have a random effect per employee (=
`(1 | employee)`

), a random effect per batch (= `(1 | batch)`

), and a random
effect per combination of employee and batch (= `(1 | employee:batch)`

).

```
fit.quality <- lmer(score ~ (1 | employee) + (1 | batch) +
(1 | employee:batch), data = quality)
summary(fit.quality)
```

```
## Linear mixed model fit by REML ['lmerMod']
## ...
## Random effects:
## Groups Name Variance Std.Dev.
## employee:batch (Intercept) 0.0235 0.153
## batch (Intercept) 0.5176 0.719
## employee (Intercept) 1.5447 1.243
## Residual 0.2266 0.476
## Number of obs: 90, groups: employee:batch, 30; batch, 6; employee, 5
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 27.248 0.631 43.2
```

From the output we get \(\widehat{\sigma}_{\alpha}^2 = 1.54\) (employee), \(\widehat{\sigma}_{\beta}^2 = 0.52\) (batch), \(\widehat{\sigma}_{\alpha\beta}^2 = 0.02\) (interaction of employee and batch) and \(\widehat{\sigma}^2 = 0.23\) (error term).

Hence, total variance is \(1.54 + 0.52 + 0.02 + 0.23 = 2.31\). We see that the largest contribution to the variance
is variability between different employees which contributes to about
\(1.54 / 2.31 \approx 67\%\)
of the total variance. These are all point estimates. Confidence intervals on
the scale of the standard deviations can be obtained by calling `confint`

.

`confint(fit.quality, oldNames = FALSE)`

```
## 2.5 % 97.5 %
## sd_(Intercept)|employee:batch 0.0000 0.3528
## sd_(Intercept)|batch 0.4100 1.4754
## sd_(Intercept)|employee 0.6694 2.4994
## sigma 0.4020 0.5741
## (Intercept) 25.9190 28.5766
```

Uncertainty is quite large. In addition, there is no statistical evidence that the random interaction term is really needed, as the corresponding confidence interval contains zero. To get a more narrow confidence interval for the standard deviation between different employees, we would need to sample more employees. Basically, each employee contributes one observation for estimating the standard deviation (or variance) between employees. The same reasoning applies to batch.

### 6.1.3 Nesting

To introduce a new concept, we consider the `Pastes`

data set in package `lme4`

.
The strength of a chemical paste product was measured for a total of 30
samples coming from 10 randomly selected delivery batches where each
contained three randomly selected casks. Each sample was measured twice, resulting
in a total of 60 observations. We want to check what part of the total
variability of strength is due to variability between batches, between casks and
due to measurement error.

```
## 'data.frame': 60 obs. of 4 variables:
## $ strength: num 62.8 62.6 60.1 62.3 62.7 63.1 ...
## $ batch : Factor w/ 10 levels "A","B","C","D",..: 1 1 1 1 1 1 ...
## $ cask : Factor w/ 3 levels "a","b","c": 1 1 2 2 3 3 ...
## $ sample : Factor w/ 30 levels "A:a","A:b","A:c",..: 1 1 2 2 3 3 ...
```

Note that the levels of `batch`

and `cask`

are given by upper- and lowercase
letters, respectively. If we carefully think about the data structure, we have
just discovered a new way of combining factors. Cask `a`

in batch `A`

has
nothing to do with cask `a`

in batch `B`

and so on. The level `a`

of `cask`

has
a different meaning for every level of `batch`

. Hence, the two factors `cask`

and `batch`

are *not* crossed. We say cask is
**nested** in batch. The data set also
contains an additional (redundant) factor `sample`

which is a unique identifier
for each sample, given by the combination of `batch`

and `cask`

.

We use package `ggplot2`

to visualize the data set (`R`

code for interested
readers only). The different panels are the different batches (`A`

to `J`

).

```
library(ggplot2)
ggplot(Pastes, aes(y = cask, x = strength)) + geom_point() +
facet_grid(batch ~ .)
```

The batch effect does not seem to be very pronounced, for example, there is no clear tendency that some batches only contain large values, while others only contain small values. Casks within the same batch can be substantially different, but the two measurements from the same cask are typically very similar. Let us now set up an appropriate random effects model for this data set.

Let \(y_{ijk}\) be the observed strength of the \(k\)th sample of cask \(j\) in batch \(i\).
We can then use the model
\[\begin{equation}
Y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \epsilon_{ijk},
\tag{6.3}
\end{equation}\]
where \(\alpha_i\) is the random effect of batch and \(\beta_{j(i)}\) is
the random effect of cask *within* batch. Note the special notation
\(\beta_{j(i)}\) which emphasizes that cask is nested in batch. This means that
each batch gets its *own* coefficient for the cask effect (from a technical
point of view this is like an interaction effect without the corresponding
main effect).
We make the usual assumptions for the random effects
\[
\alpha_i \, \textrm{ i.i.d.} \sim N(0, \sigma_{\alpha}^2), \quad
\beta_{j(i)} \, \textrm{ i.i.d.} \sim N(0, \sigma_{\beta}^2),
\]
and similarly for the error term \(\epsilon_{ijk} \sim N(0, \sigma^2)\). As before,
we assume independence between all random terms.

We have to tell `lmer`

about the nesting structure. There are multiple ways to
do so. We can use the notation `(1 | batch/cask)`

which means that we want to
have a random effect per batch and per cask *within* batch. We could also use
`(1 | batch) + (1 | cask:batch)`

which means that we want to have a random
effect per batch and a random effect per *combination* of batch and cask, which
is the same as a nested effect. Last but not least, here we could also use `(1 | batch) + (1 | sample)`

. Why? Because `sample`

is the combination of `batch`

and `cask`

. Note that we *cannot* use the notation `(1 | batch) + (1 | cask)`

here because then for example all casks `a`

(across different batches) would
share the same effect (similar for the other levels of `cask`

). This is not what
we want.

Let us fit a model using the notation `(1 | batch/cask)`

.

```
## Linear mixed model fit by REML ['lmerMod']
## ...
## Random effects:
## Groups Name Variance Std.Dev.
## cask:batch (Intercept) 8.434 2.904
## batch (Intercept) 1.657 1.287
## Residual 0.678 0.823
## Number of obs: 60, groups: cask:batch, 30; batch, 10
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 60.053 0.677 88.7
```

We have \(\widehat{\sigma}_{\alpha}^2 = 1.66\) (batch),
\(\widehat{\sigma}_{\beta}^2 = 8.43\) (cask) and \(\widehat{\sigma}^2 = 0.68\) (measurement error). This confirms that most variation is in
fact due to cask within batch. Confidence intervals could be obtained as usual
with the function `confint`

(not shown).

A nested effect has also some associated degrees of freedom. In the previous example, cask has 3 levels in each of the 10 batches. The reasoning is now as follows: As cask needs 2 degrees of freedom in each level of the 10 batches, the degrees of freedom of cask are \(10 \cdot 2 = 20\).

## 6.2 Mixed Effects Models

In practice, we often encounter models which contain both random and fixed
effects. We call them **mixed models** or **mixed effects
models**.

### 6.2.1 Example: Machines Data

We start with the data set `Machines`

in package `nlme`

(Pinheiro et al. 2021). As stated
in the help file: “Data on an experiment to compare three brands of machines
used in an industrial process […]. Six workers were chosen randomly among the
employees of a factory to operate each machine three times. The response is an
overall productivity score taking into account the number and quality of
components produced.”

```
data("Machines", package = "nlme")
## technical details for shorter output:
class(Machines) <- "data.frame"
Machines[, "Worker"] <- factor(Machines[, "Worker"], levels = 1:6,
ordered = FALSE)
str(Machines, give.attr = FALSE) ## give.attr to shorten output
```

```
## 'data.frame': 54 obs. of 3 variables:
## $ Worker : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 2 2 2 ...
## $ Machine: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 ...
## $ score : num 52 52.8 53.1 51.8 52.8 53.1 ...
```

Let us first visualize the data. In addition to plotting all individual data, we
also calculate the mean for each combination of worker and machine and connect
these values with lines for each worker (`R`

code for interested readers only).

```
ggplot(Machines, aes(x = Machine, y = score, group = Worker, col = Worker)) +
geom_point() + stat_summary(fun = mean, geom = "line") + theme_bw()
```

```
## A classical interaction plot would be (not shown)
with(Machines, interaction.plot(x.factor = Machine,
trace.factor = Worker,
response = score))
```

We observe that on average, productivity is largest on machine `C`

, followed by `B`

and `A`

. Most workers show a similar profile, with the exception of
worker 6 who performs badly on machine `B`

.

Let us now try to model this data. The goal is to make inference about the
specific machines at hand, this is why we treat machine as a fixed effect. We
assume that there is a population machine effect (think of an average profile
across *all* potential workers), but each worker is allowed to have its own
random deviation. With \(y_{ijk}\) we denote the observed \(k\)th productivity score of
worker \(j\) on machine \(i\). We use the model
\[\begin{equation}
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk},
\tag{6.4}
\end{equation}\]
where \(\alpha_i\) is the fixed effect of machine \(i\) (with a usual
side constraint), \(\beta_j\) is the random effect of worker \(j\) and
\((\alpha\beta)_{ij}\) is the corresponding random interaction effect. An
interaction effect between a random (here, worker) and a fixed effect (here,
machine) is treated as a random effect.

What is the interpretation of this model? The average (over the whole
population) productivity profile with respect to the three different machines is
given by the fixed effect \(\alpha_i\). The random deviation of an individual
worker consists of a general shift \(\beta_j\) (worker specific *general*
productivity level) and a worker specific preference \((\alpha\beta)_{ij}\) with
respect to the three machines. We assume that all random effects are normally
distributed, this means
\[
\beta_j \, \textrm{ i.i.d.} \sim N(0, \sigma_{\beta}^2), \quad
(\alpha\beta)_{ij} \, \textrm{ i.i.d.} \sim N(0, \sigma_{\alpha\beta}^2).
\]
For the error term we assume as always \(\epsilon_{ijk} \, \textrm{ i.i.d.} \sim N(0, \sigma^2)\). In
addition, all random terms are assumed to be independent.

We visualize model (6.4) step-by-step in Figure
6.2. The solid line is the population average of
the machine effect \((= \mu + \alpha_i)\), the dashed line is the population
average including the worker specific general productivity level \((= \mu + \alpha_i + \beta_j)\), i.e., a *parallel* shift of the solid line. The dotted line
in addition includes the worker specific machine preference \((= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij})\). Individual observations of worker \(j\) now
fluctuate around these values.

We could use the `lme4`

package to fit such a model. Besides the fixed effect of
machine type we want to have the following random effects:

- a random effect \(\beta_j\) per worker:
`(1 | Worker)`

- a random effect \((\alpha\beta)_{ij}\) per combination of worker and machine:
`(1 | Worker:Machine)`

Hence, the `lmer`

call would look as follows.

```
fit.machines <- lmer(score ~ Machine + (1 | Worker) +
(1 | Worker:Machine), data = Machines)
```

As `lme4`

does *not* calculate p-values for the fixed effects, we use the
package `lmerTest`

instead. Technically speaking, `lmerTest`

uses `lme4`

to fit
the model and then adds some statistical tests, i.e., p-values for the fixed
effects, to the output. There are still many open issues regarding statistical
inference in mixed models, see for example the “GLMM
FAQ” (the so-called
generalized linear mixed models frequently asked questions). However, for “nice”
designs (as the current one), we get proper statistical inference using the
`lmerTest`

package.

```
options(contrasts = c("contr.treatment", "contr.poly"))
library(lmerTest)
fit.machines <- lmer(score ~ Machine + (1 | Worker) +
(1 | Worker:Machine), data = Machines)
```

We get the ANOVA table for the fixed effects with the function `anova`

.

`anova(fit.machines)`

```
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Machine 38.1 19 2 10 20.6 0.00029
```

The fixed effect of machine is significant. If we closely inspect the output,
we see that an \(F\)-distribution with 2 and 10 degrees of freedom is being used.
Where does the value 10 come from? It seems to be a rather small number given
the sample size of 54 observations. Let us remind ourselves what the fixed
effect actually means. It is the *average* machine effect, where the average is
taken over the whole population of workers (where we have seen only six). We
know that every worker has its own random deviation of this effect. Hence, the
relevant question is whether the worker profiles just fluctuate around a
constant machine effect (all \(\alpha_i = 0\)) or whether the machine effect is
substantially larger than this worker specific machine variation. Technically
speaking, the worker specific machine variation is nothing more than the
*interaction* between worker and machine. Hence, this boils down to comparing
the variation between different machines (having 2 degrees of freedom) to the
variation due to the interaction between machines and workers (having \(2 \cdot 5 = 10\) degrees of freedom). The `lmerTest`

package automatically detects this
because of the structure of the random effects. This way of thinking also allows
another insight: If we want the estimate of the population average of the
machine effect (the fixed effect of machine) to have a desired accuracy, the
relevant quantity to increase is the number of workers.

If we call `summary`

on the fitted object, we get the individual
\(\widehat{\alpha}_i\)’s under `Fixed effects`

.

`summary(fit.machines)`

```
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## ...
## Random effects:
## Groups Name Variance Std.Dev.
## Worker:Machine (Intercept) 13.909 3.730
## Worker (Intercept) 22.858 4.781
## Residual 0.925 0.962
## Number of obs: 54, groups: Worker:Machine, 18; Worker, 6
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 52.36 2.49 8.52 21.06 1.2e-08
## MachineB 7.97 2.18 10.00 3.66 0.0044
## MachineC 13.92 2.18 10.00 6.39 7.9e-05
## ...
```

Let us first check whether the structure was interpreted correctly
(`Number of obs`

): There are 54 observations, coming from 6 different workers
and 18 (\(=3 \cdot 6\)) different combinations of workers and machines.

As usual, we have to be careful with the interpretation of the fixed effects
because it depends on the side constraint that is being used. For example, here
we can read off the output that the productivity score on machine `B`

is on
average 7.97 units larger than on machine `A`

. We can
also get the parameter estimates of the fixed effects only by calling
`fixef(fit.machines)`

(not shown).

Estimates of the different variance components can be found under
`Random effects`

. Often, we are not very much interested in the actual
values. We rather use the random effects as a “tool” to model correlated data
and to tell the fitting function that we are interested in the
population average and *not* the worker specific machine effects. In that
sense, including random effects in a model changes the interpretation of
the fixed effects. Hence, it really depends on the research question
whether we treat a factor as fixed or random. There is no right or wrong
here.

Approximate 95% confidence intervals can be obtained as usual with the
function `confint`

.

`confint(fit.machines, oldNames = FALSE)`

```
## 2.5 % 97.5 %
## sd_(Intercept)|Worker:Machine 2.353 5.432
## sd_(Intercept)|Worker 1.951 9.411
## sigma 0.776 1.235
## (Intercept) 47.395 57.316
## MachineB 3.738 12.195
## MachineC 9.688 18.145
```

For example, a 95% confidence interval for the expected value of the
difference between machine `A`

and `B`

is given by
\([3.7, 12.2]\).

We can do the residual analysis as outlined in Section 6.1.1.

```
## Tukey-Anscombe plot
plot(fit.machines)
```

```
## QQ-plots
par(mfrow = c(1, 3))
qqnorm(ranef(fit.machines)$Worker[, 1],
main = "Random effects of worker")
qqnorm(ranef(fit.machines)$'Worker:Machine'[, 1],
main = "Random interaction")
qqnorm(resid(fit.machines), main = "Residuals")
```

The Tukey-Anscombe plot looks good. The QQ-plots could look better; however, we do not have a lot of observations such that the deviations are still OK. Or in other words, it is difficult to detect clear violations of the normality assumption.

Again, in order to better understand the mixed model, we check what happens if we would fit a purely fixed effects model here. We use sum-to-zero side constraints in the following interpretation.

```
## Df Sum Sq Mean Sq F value Pr(>F)
## Machine 2 1755 878 949.2 <2e-16
## Worker 5 1242 248 268.6 <2e-16
## Machine:Worker 10 427 43 46.1 <2e-16
## Residuals 36 33 1
```

The machine effect is much more significant. This is because in the fixed
effects model, the main effect of machine makes a statement about the average
machine effect of these 6 *specific* workers and *not* about the population
average (in the same spirit as in the sire example in Section
6.1.1).

*Remark:* The model in Equation (6.4) with
\((\alpha\beta)_{ij} \, \textrm{ i.i.d.} \sim N(0, \sigma_{\alpha\beta}^2)\) is
also known as the so-called **unrestricted model**.
There is also a **restricted model** where we would
assume that a random interaction effect sums up to zero if we sum over the
*fixed* indices. For our example this would mean that if we sum up the interaction
effect for each worker across the different machines, we would get zero. Here,
the restricted model means that the random interaction term
does not contain any information about the general productivity level of a
worker, just about machine preference. Unfortunately, the restricted model is
currently not implemented in `lmer`

. Luckily, the inference with respect to the
*fixed* effects is the same for both models, only the estimates of the variance
components differ. For more details, see for example Burdick and Graybill (1992).

### 6.2.2 Example: Chocolate Data

We continue with a more complex design which is a relabelled version of Example 12.2 of Oehlert (2000) (with some simulated data). A group of 10 raters with rural background and 10 raters with urban background rated 4 different chocolate types. Every rater tasted and rated two samples from the same chocolate type in random order. Hence, we have a total of \(20 \cdot 4 \cdot 2 = 160\) observations.

```
book.url <- "http://stat.ethz.ch/~meier/teaching/book-anova"
chocolate <- read.table(file.path(book.url, "data/chocolate.dat"),
header = TRUE)
chocolate[,"rater"] <- factor(chocolate[,"rater"])
chocolate[,"background"] <- factor(chocolate[,"background"])
str(chocolate)
```

```
## 'data.frame': 160 obs. of 4 variables:
## $ choc : chr "A" "A" ...
## $ rater : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 ...
## $ background: Factor w/ 2 levels "rural","urban": 1 1 1 1 1 1 ...
## $ y : int 61 64 46 45 63 66 ...
```

Chocolate type is available in factor `choc`

(with levels `A`

, `B`

, `C`

and
`D`

), background of rater (with levels `rural`

and `urban`

) in `background`

, the
score in `y`

and a rater ID can be found in `rater`

. Note that the factor
`rater`

has only 10 levels. We have to be careful here and should not forget
that rater is actually *nested* in background; so, the raters labelled `1`

in
the urban and the rural group have *nothing* in common.

We can easily visualize this data with an interaction plot. We use the package
`ggplot2`

to get a more appealing plot compared to the
function `interaction.plot`

(`R`

code for interested readers only).

```
ggplot(chocolate, aes(x = choc, y = y,
group = interaction(background, rater),
color = background)) +
stat_summary(fun = mean, geom = "line") + theme_bw()
```

Questions that could arise are:

- Is there a background effect (urban vs. rural)?
- Is there a chocolate type effect?
- Does the effect of chocolate type depend on background (interaction)?
- How large is the variability between different raters regarding general chocolate liking level?
- How large is the variability between different raters regarding chocolate type preference?

Here, background and chocolate type are fixed effects, as we want to make a
statement about these *specific* levels. However, we treat rater as random, as we
want the fixed effects to be population average effects and we are interested
in the variation between different raters.

We can use the model \[\begin{equation} Y_{ijkl} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \delta_{k(i)} + (\beta\delta)_{jk(i)} + \epsilon_{ijkl}, \tag{6.5} \end{equation}\] where

- \(\alpha_i\) is the (fixed) effect of background, \(i = 1, 2\),
- \(\beta_j\) is the (fixed) effect of chocolate type, \(j = 1, \ldots, 4\),
- \((\alpha\beta)_{ij}\) is the corresponding interaction (fixed effect),
- \(\delta_{k(i)}\) is the random effect of rater: “general chocolate liking level of rater \(k(i)\) (nested in background!)”, \(k = 1, \ldots, 10\),
- \((\beta\delta)_{jk(i)}\) is the (random) interaction of rater and chocolate type: “specific chocolate type preference of rater \(k(i)\)”.

We use the usual assumptions for all random terms. In this setup, for both backgrounds there is (over the whole population) an average preference profile with respect to the 4 different chocolate types (given by \(\mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}\)). Every rater can have its individual random deviation from this profile. This deviation consists of a general shift \(\delta_{k(i)}\) (general chocolate liking of a rater) and a rater specific preference \((\beta\delta)_{jk(i)}\) with respect to the 4 chocolate types. This is visualized in Figure 6.4 (for urban raters only, \(i = 2\)). The situation is very similar to the previous example about machines in Section 6.2.1. The solid line is the population average \((= \mu + \alpha_2 + \beta_j + (\alpha\beta)_{2j})\). The dashed line is a parallel shift including the rater specific general chocolate liking level \((= \mu + \alpha_2 + \beta_j + (\alpha\beta)_{2j} + \delta_{k(2)})\). The dotted line in addition includes the rater specific chocolate type preference \((\mu + \alpha_2 + \beta_j + (\alpha\beta)_{2j} + \delta_{k(2)} + (\beta\delta)_{jk(2)})\). Therefore, the fixed effects have to be interpreted as population averages.

Let us now try to fit this model in `R`

. We want to have

- main effects and the interaction for the fixed effects of background
and chocolate type:
`background * choc`

- a random effect per rater (nested in background):
`(1 | rater:background)`

- a random effect per chocolate type and rater (nested in background):
`(1 | rater:background:choc)`

.

We always write `rater:background`

to get a unique rater ID. An
alternative would be to define another factor in the data set which
enumerates the raters from 1 to 20 (instead from 1 to 10).

Hence, the `lmer`

call (using package `lmerTest`

) looks as follows.

```
fit.choc <- lmer(y ~ background * choc + (1 | rater:background) +
(1 | rater:background:choc), data = chocolate)
```

As before, we get an ANOVA table with p-values for the fixed effects with the
function `anova`

.

`anova(fit.choc)`

```
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## background 263 263 1 18 27.61 5.4e-05
## choc 4219 1406 3 54 147.74 < 2e-16
## background:choc 64 21 3 54 2.24 0.094
```

We see that the interaction is not significant but both main effects are. This is what we already observed in the interaction plot in Figure 6.3. There, the profiles were quite parallel, but raters with an urban background rated higher on average than those with a rural background. In addition, there was a clear difference between different chocolate types.

Can we get an intuitive idea about the denominator degrees of freedom that are being used in these tests?

- The main effect of background can be thought of as a two-sample \(t\)-test with two groups having 10 observations each. Think of taking one average value per rater. Hence, we get \(2 \cdot 10 - 2 = 18\) degrees of freedom.
- As we allow for a rater specific chocolate type preference, we have to check whether the effect of chocolate is substantially larger than this rater specific variation. The rater specific variation can be thought of as the interaction between rater and chocolate type. As rater is nested in background, rater has 9 degrees of freedom in each background group. Hence, the interaction has (\((9 + 9) \cdot 3 = 54\)) degrees of freedom.
- The same argument as above holds true for the interaction between background and chocolate type.

Hence, if we want to get a more precise view about these population average effects, the relevant quantity to increase is the number of raters.

Approximate confidence intervals for the individual coefficients of the
fixed effects and the variance components could again be obtained by calling
`confint(fit.choc, oldNames = FALSE)`

(output not shown).

Most often, the focus is not on the actual values of the variance components. We use these random effects to change the meaning of the fixed effects. By incorporating the random interaction effect between rater and chocolate type, the meaning of the fixed effects changes. We have seen 20 different “profiles” (one for each rater). The fixed effects must now be understood as the corresponding population average effects. This also holds true for the corresponding confidence intervals. They all make a statement about a population average.

*Remark:* Models like this and the previous ones were of course fitted before
specialized mixed model software was available. For such a nicely balanced
design as we have here, we could in fact use the good old `aov`

function. In
order to have more readable `R`

code, we first create a new variable
`unique.rater`

which will have 20 levels, one for each rater (we could have done
this for `lmer`

too, see the comment above).

```
chocolate[,"unique.rater"] <- with(chocolate,
interaction(background, rater))
str(chocolate)
```

```
## 'data.frame': 160 obs. of 5 variables:
## $ choc : chr "A" "A" ...
## $ rater : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 ...
## $ background : Factor w/ 2 levels "rural","urban": 1 1 1 1 1 1 ...
## $ y : int 61 64 46 45 63 66 ...
## $ unique.rater: Factor w/ 20 levels "rural.1","urban.1",..: 1 1 1 1 1 1 ...
```

Now we have to tell the `aov`

function that it should use different so-called
error strata. We can do this by adding `Error()`

to the model formula. Here, we
use `Error(unique.rater/choc)`

. This is equivalent to adding a random effect for
each rater `(1 | unique.rater)`

and a rater-specific chocolate effect `(1 | unique.rater:choc)`

.

```
fit.choc.aov <- aov(y ~ background * choc +
Error(unique.rater/choc), data = chocolate)
summary(fit.choc.aov)
```

```
##
## Error: unique.rater
## Df Sum Sq Mean Sq F value Pr(>F)
## background 1 5119 5119 27.6 5.4e-05
## Residuals 18 3337 185
##
## Error: unique.rater:choc
## Df Sum Sq Mean Sq F value Pr(>F)
## choc 3 12572 4191 147.74 <2e-16
## background:choc 3 190 63 2.24 0.094
## Residuals 54 1532 28
## ...
```

We simply put the random effects into the `Error()`

term and get the same
results as with `lmer`

. However, the approach using `lmer`

is much more
flexible, especially for unbalanced data.

### 6.2.3 Outlook

Both the machines and the chocolate data were examples of so-called **repeated
measurement data**. Repeated measurements occur if
we have multiple measurements of the response variable from each experimental
unit, e.g., over time, space, experimental conditions, etc. In the last two
examples, we had observations across different experimental conditions
(different machines or chocolate types). For such data, we distinguish between
so-called between-subjects and within-subjects factors.

A **between-subjects factor** is a factor that splits the subjects into
different groups. For the machines data, such a thing does not exist. However,
for the chocolate data, background is a between-subjects factor because it
splits the 20 raters into two groups: 10 with rural and 10 with urban
background. On the other hand, a **within-subjects
factor** splits the
observations from an individual subject into different groups. For the machines
data, this is the machine brand (factor `Machine`

with levels `A`

, `B`

and `C`

).
For the chocolate data, this is the factor `choc`

with levels `A`

, `B`

, `C`

and
`D`

. Quite often, the within-subjects factor is time, e.g., when investigating
growth curves.

Why are such designs popular? They are of course needed if we are interested in individual, subject-specific, patterns. In addition, these designs are efficient because for the within-subjects factors we block on subjects. Subjects can even serve as their own control!

The `lmer`

model formula of the corresponding models all follow a similar
pattern. As a general rule, we always include a random effect for each subject
`(1 | subject)`

. This tells `lmer`

that the values from an individual subject
belong together and are therefore correlated. If treatment is a
within-subjects factor and if we have multiple observations from the same
treatment for each subject, we will typically introduce another random effect
`(1 | subject:treatment)`

as we did for the chocolate data. This means that the
effect of the treatment is slightly different for each subject. Hence, the
corresponding fixed effect of treatment has to be interpreted as the population
average effect. If we do not have replicates of the same treatment within the
subjects, we would just use the random effect per subject, that is `(1 | subject)`

.

If treatment is a between-subjects factor (meaning we randomize treatments to
subjects) and if we track subjects across different time-points (with one
observation for each time-point and subject) we could use a model of the form `y ~ treatment * time + (1 | subject)`

where `time`

is treated as a factor; we are
going to see this model again in Chapter 7. Here,
if we need the interaction between `treatment`

and `time`

it means that the
time-development is treatment-specific. Or in other words, each treatment would
have its own profile with respect to time. An alternative approach would be to
aggregate the values of each subject (a short time-series) into a single
meaningful number determined by the research question, e.g., slope, area under
the curve, time to peak, etc. The analysis is then much easier as we only have
one observation for each subject and we are back to a completely randomized
design, see Chapter 2. This means we would
not need any random effects anymore and `aov(aggregated.response ~ treatment)`

would be enough for the analysis. Such an approach is also called a **summary
statistic** or **summary measure
analysis** (very easy and very powerful!). For
more details about the analysis of repeated measurements data, see for example
Fitzmaurice, Laird, and Ware (2011).

In general, one has to be careful whether the corresponding statistical tests are performed correctly. For example, if we have a between-subjects factor like background for the chocolate data, the relevant sample size is the number of subjects in each group, and not the number of measurements made on all subjects. This is why the test for background only used 18 denominator degrees of freedom. Depending on the software that is being used, these tests are not always performed correctly. It is therefore advisable to perform some sanity checks.

We have only touched the surface of what can be done with the package `lme4`

(and `lmerTest`

). For example, we assumed independence between the different
random effects. When using a term like `(1 | treatment:subject)`

, each subject
gets independent random treatment effects all having the same variance. One
can also use `(treatment | subject)`

in the model formula. This allows not only
for correlation between the subject specific treatment effects, but also for
different variance components, one component for each level of `treatment`

. A
technical overview also covering implementation issues can be found in Bates et al. (2015).
With repeated measurements data, it is quite natural that we are faced with
serial (temporal) correlation. Implementations of a lot of correlation
structures can be found in package `nlme`

(Pinheiro et al. 2021). The corresponding book
(Pinheiro and Bates 2009) contains many examples. Some implementations can also
be found in package `glmmTMB`

(Brooks et al. 2017).