This tutorial is one of a series that accompanies An Adventure in Statistics (Field 2016) by me, Andy Field. These tutorials contain abridged sections from the book so there are some copyright considerations but I offer them under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License,1
Because these tutorials accompany my book An adventure in statistics, which uses a fictional narrative to teach the statistics, some of the examples might not make sense unless you know something about the story. For those of you who don’t have the book I begin each tutorial with a précis of the story. If you’re not interested then fair enough - click past this section.
It is the future. Zach, a rock musician and Alice, a geneticist, who have been together since high school live together in Elpis, the ‘City of Hope’.
Zach and Alice were born in the wake of the Reality Revolution which occurred after a Professor Milton Gray invented the Reality Prism – a transparent pyramid worn on the head – that brought honesty to the world. Propaganda and media spin became unsustainable, religions collapsed, advertising failed. Society could no longer be lied to. Everyone could know the truth about anything that they could look at. A gift, some said, to a previously self-interested, self-obsessed society in which the collective good had been eroded.
But also a curse. For, it soon became apparent that through this Reality Prism, people could no longer kid themselves about their own puffed-up selves as they could see what they were really like – by and large, pretty ordinary. And this caused mass depression. People lost faith in themselves. Artists abandoned their pursuits, believing they were untalented and worthless.
Zach and Alice have never worn a Reality Prism and have no concept of their limitations. They were born after the World Governance Agency (WGA) destroyed all Reality Prisms, along with many other pre-revolution technologies, with the aim of restoring community and well-being. However, this has not been straightforward and in this post-Prism world, society has split into pretty much two factions
Everyone has a star, a limitless space on which to store their digital world.
Zach and Alice are Clocktarians. Their technology consists mainly of:
Alice has been acting strangely, on edge for weeks, disconnected and uncommunicative, as if she is hiding something and Zach can’t get through to her. Arriving home from band practice, unusually, she already home and listening to an old album that the two of them enjoyed together, back in a simpler, less complicated time in their relationship. During an increasingly testy evening, that involves a discussion with the Head about whether or not a Proteus causes brain cancer, Alice is interrupted by an urgent call which she takes in private. She returns looking worried and is once again, distracted. She tells Zach that she has ‘a big decision to make’. Before going to bed, Zach asks her if he can help with the decision but she says he ‘already has’, thanking him for making ‘everything easier.’ He has no idea what she means and goes to sleep, uneasy.
On waking, Zach senses that something is wrong. And he is right. Alice has disappeared. Her clothes, her possessions and every photo of them together have gone. He can’t get hold of any of her family or friends as their contact information is stored on her Proteus, not on his diePad. He manages to contact the Beimeni Centre but is told that no one by the name of Alice Nightingale has ever worked there. He logs into their constellation but her star has gone. He calls her but finds that her number never existed. She has, thinks Zach, been ‘wiped from the planet.’ He summons The Head but he can’t find her either. He tells Zach that there are three possibilities: Alice has doesn’t want to be found, someone else doesn’t want her to be found or she never existed.
Zach calls his friend Nick, fellow band member and fan of the WGA-installed Repositories, vast underground repositories of actual film, books, art and music. Nick is a Chipper – solely for the purpose of promoting the band using memoryBank – and he puts the word out to their fans about Alice missing.
Thinking as hard as he can, Zach recalls the lyrics of the song she’d been playing the previous evening. Maybe they are significant? It may well be a farewell message and the Head is right. In searching for clues, he comes across a ‘memory stone’ which tells him to read what’s on there. File 1 is a research paper that Zach can’t fathom. It’s written in the ‘language of science’ and the Head offers to help Zach translate it and tells him that it looks like the results of her current work were ‘gonna blow the world’. Zach resolves to do ‘something sensible’ with the report.
Zach doesn’t want to believe that Alice has simply just left him. Rather, that someone has taken her and tried to erase her from the world. He decides to find her therapist, Dr Murali Genari and get Alice’s file. As he breaks into his office, Dr Genari comes up behind him and demands to know what he is doing. He is shaking but not with rage – with fear of Zach. Dr Genari turns out to be friendly and invites Zach to talk to him. Together they explore the possibilities of where Alice might have gone and the likelihood, rating her relationship satisfaction, that she has left him. During their discussion Zach is interrupted by a message on his diePad from someone called Milton. Zach is baffled as to who he is and how he knows that he is currently discussing reverse scoring. Out of the corner of his eye, he spots a ginger cat jumping down from the window ledge outside. The counsellor has to go but suggests that Zach and ‘his new friend Milton’ could try and work things out.
This tutorial uses the following packages:
tidyverse (Wickham 2017)This package is automatically loaded within this tutorial. If you are working outside of this tutorial (i.e. in RStudio) then you need to make sure that the package has been installed by executing install.packages("package_name"), where package_name is the name of the package. If the package is already installed, then you need to reference it in your current session by executing library(package_name), where package_name is the name of the package.
This tutorial has the data files pre-loaded so you shouldn’t need to do anything to access the data from within the tutorial. However, if you want to play around with what you have learnt in this tutorial outside of the tutorial environment (i.e. in a stand-alone RStudio session) you will need to download the data files and then read them into your R session. This tutorial uses the following file:
You can load the file in several ways:
ras_tib by executing:
ras_tib <- readr::read_csv("../data/ais_04_ras.csv")ras_tib <- readr::read_csv("http://www.discoveringstatistics.com/repository/ais_data/ais_04_ras.csv")In chapter 4 of the book Milton and Zach retreat to Occam’s cafe where Milton helps Zach to explore some of the data that he stole from Alice’s file when he broke into her counsellor’s office. Still working on the hypothesis that Alice left him because she was dissatisfied with their relationship, Zach and Milton look at Alice’s responses on the relationship assessment scale (RAS), which is a 7-item scale measuring relationship satisfaction. Each item has a 5-point response scale yielding a total score that ranges from 7 (completely dissatisfied) to 35 (completely satisfied). Alice completed this measure each of the 10 weeks if her counselling. The data are in a tibble called ras_tib, use the code box to view this tibble.
ras_tibYou should see a tibble with two variables: week indicates in which week of counselling (from 1 to 10) Alice completed the RAS and rating shows her total score on the RAS. In the book Milton shows Zach how to compute the mean, median, variance, standard deviation and inter-quartile range of these scores. We can do the same in R using these commands: mean(), median, var(), sd(), and IQR().
We’ve seen already that functions take the form of a command followed by brackets. We also saw that there are usually options that can be placed within those brackets (for example, in the previous tutorial we changed the binwidth and bar colour of the histogram and line size and style of a frequency polygon). Let’s look at this idea in more detail with the function mean(). The full format of the function is:
mean(variable, trim = 0, na.rm = FALSE)
Which just says that you need to include a reference to the data that you want the mean for, and that you can set two options:
The function for the median has a similar format except that there isn’t an option to trim the data because that wouldn’t make sense (the median is effectively the data with a 50% trim):
median(variable, na.rm = FALSE)
If you are happy with the default settings you don’t need to specify those options explicitly. For example, to find the mean of Alice’s RAS scores you can execute:
mean(ras_tib$rating)
However, if you wanted to remove missing values you would need to execute to override the default setting not to remove missing values:
mean(ras_tib$rating, na.rm = T)
Remember from the last tutorial that because the ratings are in the variable rating, and that variable is in the tibble ras_tib we access them with ras_tib$rating, which translates as ‘the variable rating within the tibble ras_tib’. The na.rm = T tells R to remove missing values before computing the mean.
In the code box execute commands to get the mean and median of Alice’s RAS ratings.
Tip: To find the list of options available for a particular function remember that you can get help by executing ? and the name of the function (e.g., ?mean).
mean(ras_tib$rating)
median(ras_tib$rating)In the book, Zach and Milton spot an unusual score (an outlier) in Alice’s data by plotting a histogram. Using what you have learnt from previous tutorials plot a histogram of Alice’s RAS scores. (Clicking solution will reveal code for a fairly nicely-formatted histogram, don’t feel that you need to be as thorough as that!)
ras_hist <- ggplot2::ggplot(ras_tib, aes(rating))
ras_hist +
  geom_histogram(binwidth = 1, fill = "#56B4E9", colour = "#56B4E9", alpha = 0.6) +
  labs(x = "Relationship assessment score (7-35)", y = "Frequency") +
  coord_cartesian(xlim = c(7, 35)) +
  scale_x_continuous(breaks = 7:35) +
  theme_bw()The outlier is the score of 11 (Alice’s score in the final week of counselling - the week before she vanished). Using what you’ve learnt from previous tutorials, create a new tibble called ras_no_outlier that excludes data in the week where Alice’s RAS rating was 11?
#This is one way:
ras_no_outlier <- ras_tib %>% filter(rating > 11)
#This would do the same thing:
ras_no_outlier <- ras_tib %>% filter(rating != 11)Recalculate the mean and median rating without the outlier of 11 included (i.e., use the tibble that you have just created).
mean(ras_no_outlier$rating)
median(ras_no_outlier$rating)You should find that the mean has increased from 28.1 to 30 (because the score of 11 no longer drags it down) but the median remains 30. This example illustrates that the median is more robust than the mean to outliers.
Milton also calculates the variance, standard deviation and inter-quartile range (IQR) of Alice’s ratings (without the outlier). Using R we can get these values using the functions var(), sd() and IQR() respectively. These functions behave exactly like mean() in that we input the variable for which we want the variance, standard deviation or IQR and specify how we treat missing values (by default they are not removed):
var(variable_name, na.rm = FALSE)sd(variable_name, na.rm = FALSE)IQR(variable_name, na.rm = FALSE, type = 7)The IQR() function has an additional option of type = which allows you to specify one of 8 different ways to calculate the IQR. The default is 7. If you want to mimic what IBM SPSS Statistics does then you’d need to over-ride this default with type = 6, and there is an argument for using type  = 8, which uses a method recommended by (Hyndman and Fan 1996). In the code box below, use these functions to get the variance, standard deviation and IQR for Alice’s ratings (with the outlier excluded).
var(ras_no_outlier$rating)
sd(ras_no_outlier$rating)
IQR(ras_no_outlier$rating, type = 8)Now try computing these statistics for the original data (i.e. with the outlier included).
var(ras_tib$rating)
sd(ras_tib$rating)
IQR(ras_tib$rating, type = 8)Note that with the outlier included the variance increases from 1.5 to a whopping 37.43, which illustrates the dramatic impact that outliers have on the variance. In contrast, the IQR is relatively unaffected (it increases from 2 to 2.08) because it is the boundaries of the middle 50% of scores and so largely ignores the top and bottom 25% of scores (where the outliers dwell).
Field, Andy P. 2016. An Adventure in Statistics: The Reality Enigma. Book. London: Sage.
Field, Andy P., Jeremy N. V. Miles, and Zoe C. Field. 2012. Discovering Statistics Using R: And Sex and Drugs and Rock ’N’ Roll. Book. London: Sage.
Hyndman, R. J., and Y. Fan. 1996. “Sample Quantiles in Statistical Packages.” Journal Article. American Statistician 50: 361–65.
Wickham, Hadley. 2017. “Tidyverse: Easily Install and Load ’Tidyverse’ Packages.” Journal Article. R Package Version 1.1.1. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, and G. Grolemund. 2017. R for Data Science. Book. Sebastopol, CA: O’Reilly.
Basically you can use this tutorial for teaching and non-profit activities but do not meddle with it or claim it as your own work.↩