This tutorial is one of a series that accompanies An Adventure in Statistics (Field 2016) by me, Andy Field. These tutorials contain abridged sections from the book so there are some copyright considerations but I offer them under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License,1
Because these tutorials accompany my book An adventure in statistics, which uses a fictional narrative to teach the statistics, some of the examples might not make sense unless you know something about the story. For those of you who don’t have the book I begin each tutorial with a précis of the story. If you’re not interested then fair enough - click past this section.
It is the future. Zach, a rock musician and Alice, a geneticist, who have been together since high school live together in Elpis, the ‘City of Hope’.
Zach and Alice were born in the wake of the Reality Revolution which occurred after a Professor Milton Gray invented the Reality Prism – a transparent pyramid worn on the head – that brought honesty to the world. Propaganda and media spin became unsustainable, religions collapsed, advertising failed. Society could no longer be lied to. Everyone could know the truth about anything that they could look at. A gift, some said, to a previously self-interested, self-obsessed society in which the collective good had been eroded.
But also a curse. For, it soon became apparent that through this Reality Prism, people could no longer kid themselves about their own puffed-up selves as they could see what they were really like – by and large, pretty ordinary. And this caused mass depression. People lost faith in themselves. Artists abandoned their pursuits, believing they were untalented and worthless.
Zach and Alice have never worn a Reality Prism and have no concept of their limitations. They were born after the World Governance Agency (WGA) destroyed all Reality Prisms, along with many other pre-revolution technologies, with the aim of restoring community and well-being. However, this has not been straightforward and in this post-Prism world, society has split into pretty much two factions
Everyone has a star, a limitless space on which to store their digital world.
Zach and Alice are Clocktarians. Their technology consists mainly of:
Alice has been acting strangely, on edge for weeks, disconnected and uncommunicative, as if she is hiding something and Zach can’t get through to her. Arriving home from band practice, unusually, she already home and listening to an old album that the two of them enjoyed together, back in a simpler, less complicated time in their relationship. During an increasingly testy evening, that involves a discussion with the Head about whether or not a Proteus causes brain cancer, Alice is interrupted by an urgent call which she takes in private. She returns looking worried and is once again, distracted. She tells Zach that she has ‘a big decision to make’. Before going to bed, Zach asks her if he can help with the decision but she says he ‘already has’, thanking him for making ‘everything easier.’ He has no idea what she means and goes to sleep, uneasy.
On waking, Zach senses that something is wrong. And he is right. Alice has disappeared. Her clothes, her possessions and every photo of them together have gone. He can’t get hold of any of her family or friends as their contact information is stored on her Proteus, not on his diePad. He manages to contact the Beimeni Centre but is told that no one by the name of Alice Nightingale has ever worked there. He logs into their constellation but her star has gone. He calls her but finds that her number never existed. She has, thinks Zach, been ‘wiped from the planet.’ He summons The Head but he can’t find her either. He tells Zach that there are three possibilities: Alice has doesn’t want to be found, someone else doesn’t want her to be found or she never existed.
Zach calls his friend Nick, fellow band member and fan of the WGA-installed Repositories, vast underground repositories of actual film, books, art and music. Nick is a Chipper – solely for the purpose of promoting the band using memoryBank – and he puts the word out to their fans about Alice missing.
Thinking as hard as he can, Zach recalls the lyrics of the song she’d been playing the previous evening. Maybe they are significant? It may well be a farewell message and the Head is right. In searching for clues, he comes across a ‘memory stone’ which tells him to read what’s on there. File 1 is a research paper that Zach can’t fathom. It’s written in the ‘language of science’ and the Head offers to help Zach translate it and tells him that it looks like the results of her current work were ‘gonna blow the world’. Zach resolves to do ‘something sensible’ with the report.
Zach doesn’t want to believe that Alice has simply just left him. Rather, that someone has taken her and tried to erase her from the world. He decides to find her therapist, Dr Murali Genari and get Alice’s file. As he breaks into his office, Dr Genari comes up behind him and demands to know what he is doing. He is shaking but not with rage – with fear of Zach. Dr Genari turns out to be friendly and invites Zach to talk to him. Together they explore the possibilities of where Alice might have gone and the likelihood, rating her relationship satisfaction, that she has left him. During their discussion Zach is interrupted by a message on his diePad from someone called Milton. Zach is baffled as to who he is and how he knows that he is currently discussing reverse scoring. Out of the corner of his eye, he spots a ginger cat jumping down from the window ledge outside. The counsellor has to go but suggests that Zach and ‘his new friend Milton’ could try and work things out.
This tutorial uses the following packages:
boot to conduct bootstrapping (Canty and Ripley 2017)Hmisc to obtain confidence intervals (Harrell 2017)tidyverse for general data processing (Wickham 2017)These packages are automatically loaded within this tutorial. If you are working outside of this tutorial (i.e. in RStudio) then you need to make sure that the package has been installed by executing install.packages("package_name"), where package_name is the name of the package. If the package is already installed, then you need to reference it in your current session by executing library(package_name), where package_name is the name of the package.
Note that boot is installed by default in R, but you will need to execute library(boot) to use it.
This tutorial has the data files pre-loaded so you shouldn’t need to do anything to access the data from within the tutorial. However, if you want to play around with what you have learnt in this tutorial outside of the tutorial environment (i.e. in a stand-alone RStudio session) you will need to download the data files and then read them into your R session. This tutorial uses the following file:
You can load the file in several ways:
jig_tib by executing:
jig_tib <- readr::read_csv("../data/ais_c05_jigsaw.csv")jig_tib <- readr::read_csv("http://www.discoveringstatistics.com/repository/ais_data/ais_c05_jigsaw.csv")After breaking into the JIG:SAW complex and catching a glimpse of Alice, Zach is in shock. Returning home he is desolate: the evidence of his eyes suggests that Alice is not being held hostage at JIG:SAW and the reality kicks in that she may in fact have left him. He feels worthless. He meets up with Nick, the drummer in his band, at the Repository and they discuss what has happened. Nick is utterly convinced that Alice wouldn’t have left Zach out of choice and Zach returns home even more determined to get back into JIG:SAW to speak to her and find out for himself. Milton is reluctant and tries to convince Zach that he would need to step up his understanding of statistics to a level of which he is not capable. Based on data showing that performing under a different name in a maths test can free mental blocks to maths, it is decided that Zach needs to dress up as Florence Nightingale - a ‘gifted statistician’ as well as a nurse. Feel free to do the same.
Milton goes on to explain confidence intervals to Zach.
In a previous tutorial we looked at how to summarize data to get summary statistics such as the mean, median, variance and so on. It is often useful to get the confidence interval too. This section looks at how to do that. We’re going to use the data in the tibble called jig_tib again, which contains data about the strength, speed, and visual acuity of JIG:SAW employees and non-employees. Remember that this tibble contains 7 variables:
In the previous tutorial we looked at combining group_by() and summarize() in a pipe command to create a tibble containing summary statistics (split by group). For example, we used the following command to create a tibble called speed_sum that contained the mean of the variable footspeed grouped by combinations of the variables employee and sex from the tibble called jig_tib.
speed_sum <-
  jig_tib %>%
  dplyr::group_by(employee, sex) %>%
  dplyr::summarize(
    mean_speed = mean(footspeed)
    )To incorporate confidence intervals into this command isn’t entirely straightforward. One way is to use the mean_cl_normal() function from the ggplot2 package. (Under the hood this function is using the package Hmisc so if you’re working outside of this tutorial then you’ll need to install and load this package.) This function takes any model as its input and returns an object that contains estimates from the model and their confidence intervals. In general it takes the following form:
ggplot2::mean_cl_normal(object, conf.int = 0.95, na.rm = TRUE)
To keep things simple, imagine the ‘object’ we put into this function is a variable, like footspeed. We’d execute:
ggplot2::mean_cl_normal(jig_tib$footspeed, conf.int = 0.95, na.rm = TRUE)
If we place only jig_tib$footspeed into the function we’ll get a 95% confidence interval and missing values will be removed, which is usually what we want. To get a different confidence interval specify a proportion in conf.int =. For example, ggplot2::mean_cl_normal(jig_tib$footspeed, conf.int = 0.9) will give us a 90% confidence interval around the mean of footspeed. When you put a variable into the function it returns an object containing the mean (labelled y in the object), the lower boundary of the confidence interval (labelled ymin), and the upper boundary of the confidence interval (labelled ymax). To access these values simultaneously we execute the function (or create an object from it and execute that), but if we want to access them individually we can do that by appending their name after the $ symbol. For example:
ggplot2::mean_cl_normal(jig_tib$footspeed)$ymin
returns the value of the lower boundary of the (by default 95%) confidence interval. If we add this command within the summarize() statement in our pipe, we’ll get the lower boundaries for the confidence intervals around the mean when the data are grouped by employee and sex. One final point is that we wouldn’t include this line exactly as I’ve written it because we have specified the jig_tib tibble earlier in the pipe, which means it will be carried forward into the summarize statement. This means that we can specify the variable of interest as simply footspeed rather than jig_tib$footspeed because the function will already know to look for footspeed within the jig_tib tibble.
Therefore, to create a tibble with means and confidence intervals grouped by employee and sex we can adapt the code that we’ve already used in previous tutorials to include the mean_cl_normal() function:
speed_sum <-
  jig_tib %>%
  dplyr::group_by(employee, sex) %>%
  dplyr::summarize(
    mean_speed = mean(footspeed),
    ci_low = ggplot2::mean_cl_normal(footspeed)$ymin,
    ci_upp = ggplot2::mean_cl_normal(footspeed)$ymax
    )Note that the code is the same as we used before except that we have added two lines (which should make sense based on the explanation above):
ci_low = ggplot2::mean_cl_normal(footspeed)$ymin creates a variable called ci_low by applying the function mean_cl_normal() to the variable footspeed and extracting the value labelled “CI lower”, which is the value of the lower boundary of the 95% confidence interval.ci_upp = ggplot2::mean_cl_normal(footspeed)$ymax creates a variable called ci_upp by applying the function mean_cl_normal() to the variable footspeed and extracting the value labelled “CI upper”, which is the value of the upper boundary of the 95% confidence interval.Copy the command into the code box and run it. Also, run the name of the speed_sum tibble that you create to see what it contains. Hopefully you’ll see a tibble with the mean and 95% confidence interval boundaries split by employee and sex. Try to create tibbles called strength_sum and vis_sum that tabulate the mean and confidence intervals for the variables strength and vision respectively.
speed_sum <-
  jig_tib %>%
  dplyr::group_by(employee, sex) %>%
  dplyr::summarize(
    mean_speed = mean(footspeed),
    ci_low = ggplot2::mean_cl_normal(footspeed)$ymin,
    ci_upp = ggplot2::mean_cl_normal(footspeed)$ymax
    )
speed_sumWe’ve seen in previous tutorials how to plot means. For example we created a plot of the strength data split by JIG:SAW employees and non-employees. We used this code:
strength_plot <- ggplot2::ggplot(jig_tib, aes(employee, strength))
strength_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +
  labs(x = "Employee status", y = "Maximal push force (n)") +
  coord_cartesian(ylim = c(1000, 1800)) +
  scale_y_continuous(breaks = seq(1000, 1800, 100)) +
  theme_bw()Let’s remind ourselves of what each part of this command is doing:
strength_plot <- ggplot(jig_tib, aes(employee, strength)) creates an object called strength_plot that contains the plot. The ggplot() function is then used to specify that the plot uses the jig_tib tibble and plots the variable employee on the x-axis and the variable strength on the y-axis.strength_plot + stat_summary(fun = "mean", geom = "point", size = 4) takes the object strength_plot and adds a statistics summary to it. The stat_summary() uses fun = "mean" to plot the mean using a dot (geom = "point") of size 4 (size = 4).labs(x = "Employee status", y = "Maximal push force (n)") applies descriptive labels to the x and y axes.coord_cartesian(ylim = c(1000, 1800)) sets the limits of the y-axisscale_y_continuous(breaks = seq(1000, 1800, 100)) sets the breaks along the y-axistheme_bw() applies a black and white theme to the plotIf we want to plot confidence intervals around the means all we need to do is to edit the details of stat_summary(). Specifically there are three changes that we need to make:
fun = "mean" to fun.data = "mean_cl_normal". This creates the data to be plotted for the mean and confidence intervals.geom = "point" to geom = "pointrange". This changes the geom to one that can display a confidence interval.size = 4 because the default size for the ‘pointrange’ geom will look better than size 4.The code box contains the code above and shows the resulting plot (that we created in previous tutorials). Change the two things listed above and run the code to see how the plot changes.
strength_plot <- ggplot(jig_tib, aes(employee, strength))
strength_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +
  labs(x = "Employee status", y = "Maximal push force (n)") +
  coord_cartesian(ylim = c(1000, 1800)) +
  scale_y_continuous(breaks = seq(1000, 1800, 100)) +
  theme_bw() 
strength_plot <- ggplot(jig_tib, aes(employee, strength))
strength_plot +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange") +
  labs(x = "Employee status", y = "Maximal push force (n)") +
  coord_cartesian(ylim = c(1000, 1800)) +
  scale_y_continuous(breaks = seq(1000, 1800, 100)) +
  theme_bw()Hopefully your plot how looks like dots showing the means with vertical lines showing the confidence intervals. (Incidentally, you could layer a stat_summary() function underneath the confidence intervals that uses a bar geom to plot the mean. Something like stat_summary(fun = "mean", geom = "bar", alpha = 0.6) + placed on the line above the existing stat_summary() function would work. The result would be a bar graph of means with error bars layered on top. However, adding the bars does not add any new information, but does add a lot of unnecessary ink, so I wouldn’t bother.)
What about more complex plots? In a previous tutorial we also created a similar plot but differentiated males and females by colours. The code box below replicates this code and shows the plot to remind you of how it looks. To add error bars we make exactly the same changes to stat_summary() as we did above:
fun = "mean" to fun.data = "mean_cl_normal".geom = "point" to geom = "pointrange".size = 4.Make these changes and run the code. You should see the same plot but with confidence intervals as well as means.
strength_plot <- ggplot(jig_tib, aes(employee, strength, colour = sex))
strength_plot +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +
  labs(x = "Employee status", y = "Maximal push force (n)", colour = "Biological sex") +
  coord_cartesian(ylim = c(1000, 1800)) +
  scale_y_continuous(breaks = seq(1000, 1800, 100)) +
  scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
  theme_bw() 
strength_plot <- ggplot(jig_tib, aes(employee, strength, colour = sex))
strength_plot +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
  labs(x = "Employee status", y = "Maximal push force (n)", colour = "Biological sex") +
  coord_cartesian(ylim = c(1000, 1800)) +
  scale_y_continuous(breaks = seq(1000, 1800, 100)) +
  scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
  theme_bw()In the box below can you create a similar plot of the vision scores split by job_type on the x-axis and employees and non-employees in different colours.
vis_plot <- ggplot(jig_tib, aes(job_type, vision, colour = employee))
vis_plot +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
  labs(x = "Type of job", y = "Visual acuity", colour = "Employee status") +
  coord_cartesian(ylim = c(0, 1)) +
  scale_y_continuous(breaks = seq(0, 1, 0.1)) +
  scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
  theme_bw()In Chapter 9, Zach steals the data we have been analysing that relates to JIG:SAW employees. Milton teaches him how the mean and confidence intervals can be biased by skew, kurtosis and outliers. Milton introduces the idea of the trimmed mean.
To get the trimmed mean for a variable we can adapt the code we have already used. For example, to get the mean footspeed split by employee and sex we used the code below. We discovered before that the mean has an option to specify a trim (trim = 0), so to produce a summary of trimmed means we can simply add this option to the command that gets the mean. For example, to get a 20% trimmed mean we might replace mean_speed = mean(footspeed) with:
tr_mean_speed = mean(footspeed, trim = 0.2)
Try adding this line to the summarize() function in the code box so that we store both the mean and the 20% trimmed mean (don’t forget to include a comma after the first command within summarize())
speed_sum <-
  jig_tib %>%
  dplyr::group_by(employee, sex) %>%
  dplyr::summarize(
    mean_speed = mean(footspeed)
    )
speed_sumspeed_sum <-
  jig_tib %>%
  dplyr::group_by(employee, sex) %>%
  dplyr::summarize(
    mean_speed = mean(footspeed),
    tr_mean_speed = mean(footspeed, trim = 0.2)
    )
speed_sumTry this for strength as well:
strength_sum <-
  jig_tib %>%
  dplyr::group_by(employee, sex) %>%
  dplyr::summarize(
    mean_strength = mean(strength),
    tr_mean_strength = mean(strength, trim = 0.2)
    )
strength_sumMilton also introduces Zach to the idea of bootstrapping as a way to create robust confidence intervals.
We can get bootstrap confidence intervals in R by using the boot package, which is installed by default. However, you do need to load it before using it if you are working outside of this tutorial environment. The process of getting the bootstrap intervals is - and I’m not going to lie - complicated. You should not feel bad if this section makes no sense whatsoever (but by all means feel a sense of pride if you get through it in one piece).
The boot() function works in the following way: it takes a bootstrap sample from the data you specify, sends it out to a function (that the user defines) that returns the statistic of interest (in this case the mean), it stores this information and the repeats the process lots of times. The general format of the boot() function is:
boot::boot(data, statistic, R)
It has other options too, but the three key ones are data which specifies the data from which to take bootstrap samples, statistic which specifies a function that computes the information you want (e.g., the mean), and R which is the number of times you want to repeat the process. So, to get bootstrap samples of the strength data we could specify something like this:
boot::boot(jig_tib$strength, statistic, R = 1000)
This specifies that the data from which to take bootstrap samples is jig_tib$strength and that we want to repeat the process 1000 times (i.e. use 1000 bootstrap samples). The question is what do we replace statistic with. Given we’re interested in the mean, you might reasonably think we can replace statistic with mean() or something. However, we can’t do this because the function mean() will not return information about which bootstrap sample was used. Instead we need to write out own function that will include information about the particular bootstrap sample that is being used to compute the mean. We do this by executing:
mean_boot <- function(data, index) mean(data[index], na.rm = TRUE)
This creates a new function called mean_boot(). This function takes data and an index value as input function(data, index) and it returns the mean of the data with a particular index value after removing missing values mean(data[index], na.rm = TRUE). In the main boot() function we would replace statistic with this function that we have created:
boot::boot(jig_tib$strength, mean_boot, R = 1000)
To sum up, the boot() function is going to send data (the bootstrap samples) with an associated index value (from 1 to 1000) to the function called mean_boot. For each data set with each index value mean_boot will return the mean.
The code box contains these commands. Run the code.
mean_boot <- function(data, index) mean(data[index], na.rm = TRUE)
boot::boot(jig_tib$strength, mean_boot, R = 1000)The output doesn’t give us the confidence intervals. Instead we’re told the original mean and an estimate of bias. To get the confidence intervals there is one more step, which is to use the boot.mean_cl_normal() function, which takes the general form:
boot::boot.ci(boot.object, conf = 0.95, type = "all")
There are other options but these ones are the main ones. First you input an object created with the boot() function, then specify the width of confidence interval that you want (the default is 0.95 so can leave this option out if that’s what you want), and the type that you want. By default it will compute 5 different types of interval. For most purposes you don’t need five different intervals so it would be best to change this option to specify a particular type. There’s much to recommend Bias Corrected and Accelerated intervals (type = "bca") or if those fail then a percentile interval (type  = "perc").
The code box below combines what we have learnt. First it creates the function to get the mean. The second line inserts our boot() function into the boot.mean_cl_normal() function and asked for BCa confidence intervals. Run the code.
mean_boot <- function(data, index) mean(data[index], na.rm = TRUE)
boot::boot.ci(boot(jig_tib$strength, mean_boot, R = 1000), type = "bca")There is too much output to embed into a summarize function. We want to discard everything except for the numeric values that represent the lower and upper bound of the confidence interval. These values are stored in a variable called bca within the boot.mean_cl_normal() object. (Obviously if you chose a type of interval other than bca, the variable name will be different; for example, if you opted for type = "Perc" then the variable will be called perc). Like any variable, we can access it using $. For example, to see the information relating to the BCa confidence intervals we could execute:
boot::boot.ci(boot(jig_tib$strength, mean_boot, R = 1000), type = "bca")$bca
Try it in the code box above and you’ll see 5 values appear.Values 4 and 5 are the lower and upper limits of the interval. We can access these specific values using:
boot::boot.ci(boot(jig_tib$strength, mean_boot, R = 1000), type = "bca")$bca[4] boot::boot.ci(boot(jig_tib$strength, mean_boot, R = 1000), type = "bca")$bca[5]
Try these commands in the code box above. We can include these lines in the summarize function of the pipe that we used at the start of the tutorial. Remember that we can drop the jig_tib$ from the command because that tibble is specified earlier in the pipe. The code box below shows the commands.
mean_boot <- function(data, index) mean(data[index], na.rm = TRUE)
strength_mean <-
  jig_tib %>%
  dplyr::group_by(employee, sex) %>%
  dplyr::summarize(
    mean_strength = mean(strength),
    ci_low_boot = boot::boot.ci(boot(strength, mean_boot, R = 1000), type = "bca")$bca[4],
    ci_upp_boot = boot::boot.ci(boot(strength, mean_boot, R = 1000), type = "bca")$bca[5]
    )
strength_meanRun this code and you should see that it creates a tibble with the means, the lower limit of the 95% bootstrap confidence interval for the mean (ci_low_boot) and the upper limit of the 95% bootstrap confidence interval for the mean (ci_upp_boot) split b all combinations of employee and sex.
The last section was, I imagine, not the most fun that you have ever had. I expect you are dreading the added complexity of plotting these bootstrap intervals. Well, fear not. Earlier on we plotted this graph:
strength_plot <- ggplot(jig_tib, aes(employee, strength, colour = sex))
strength_plot +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
  labs(x = "Employee status", y = "Maximal push force (n)", colour = "Biological sex") +
  coord_cartesian(ylim = c(1000, 1800)) +
  scale_y_continuous(breaks = seq(1000, 1800, 100)) +
  scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
  theme_bw()To convert these confidence intervals to bootstrap confidence intervals requires only one change to the command:
stat_summary() function change mean_cl_normal to mean_cl_boot.It’s a simple as that. Make the change in the code box above and re-run the code to observe the change on the plot.
Canty, Angelo, and Brian Ripley. 2017. “Boot: Bootstrap R (S-Plus) Functions.” Journal Article. R Package Version 1.3-19. https://CRAN.R-project.org/package=boot.
Field, Andy P. 2016. An Adventure in Statistics: The Reality Enigma. Book. London: Sage.
Field, Andy P., Jeremy N. V. Miles, and Zoe C. Field. 2012. Discovering Statistics Using R: And Sex and Drugs and Rock ’N’ Roll. Book. London: Sage.
Harrell, F E. 2017. “Hmisc: A Package of Miscellaneous R Functions.” Journal Article. R Package Version 4.0-3. https://CRAN.R-project.org/package=Hmisc.
Wickham, Hadley. 2017. “Tidyverse: Easily Install and Load ’Tidyverse’ Packages.” Journal Article. R Package Version 1.1.1. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, and G. Grolemund. 2017. R for Data Science. Book. Sebastopol, CA: O’Reilly.
Basically you can use this tutorial for teaching and non-profit activities but do not meddle with it or claim it as your own work.↩