In previous posts, I have talked about the value of knowing a scripting language, like R, for statistical analysis. As open-source software, R allows you to do advanced statistical analysis and build robust models for prediction and analysis, in addition to being an excellent tool for data wrangling and data visualization. However, the biggest barrier to entry for most people is learning the language itself, but that doesn't need to be the case.
Scripting languages are often approached by learning the grammar of the language first through drills, before eventually getting to statistical analysis and visualization. What's interesting to me about that approach is that it's not at all how people learn a language. When we learn to speak, or learn a new language in general, we typically do so through imitation, experimentation, and a whole lot of trial and error. Learning a scripting language is the same. This blog post aims to help users get up and running quickly in R with some simple code that can be adapted for statistical analysis. All charts and analysis can be replicated using the code chunks below or the raw code and data. If possible, I suggest using an Integrated Development Environment (IDE) like RStudio.
To get started, I have found it’s easiest to take a linear model approach, such as:
```r
function(Y ~ X, data = DataSetName)
```
Below is a short description of each part in the pseudo-code above:
- Y is the outcome of interest (response variable)
- X is some explanatory variable (or you can use “1” as a placeholder if there is no explanatory variable)
- DataSetName represents the dataset loaded into the R environment
Note that an R function specifies something you want to do with your data.
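As a concrete instance of this template, here is a simple linear model using the mtcars dataset that ships with base R, so nothing needs to be downloaded:

```r
# Fit a linear model of fuel efficiency (mpg) on car weight (wt),
# following the same Y ~ X, data = DataSetName template:
lm(mpg ~ wt, data = mtcars)
```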
Setting up your environment
Most people that use open source programs like Python or R will agree on one thing: the packages are what make them great. R, in its most basic format – which is often referred to as “Base R” – works like a really big calculator. Base R comes with a number of functions for statistical analysis and plotting, but since R is open source, there is a large community of creators who share packages. This expands the capability of the program tremendously.
There are two steps when you want to use a package:
- Install the package from the source
- Call the package for use in your program
The best analogy I have heard is if you think of an R package like a song you buy online. You buy each song one time to add them to your music library. If you want to make a playlist, you need to go to your music library and select the songs you want to hear. R works the same way, but with one major difference: everything is free. Below is how you would install and load the packages necessary for this analysis:
```r
# Note: the hashtag character (#) designates user comments.
# Don't be afraid to use them *very* generously to document your code.

# Install packages (you only have to do this once)
install.packages("tidyverse")
install.packages("mosaic")

# Load packages (do this at the start of each session)
library(tidyverse)
library(mosaic)
```
For this post, I have included a dataset from my GitHub account covering two different half marathon training seasons: one from 2020 & one from 2021. I used the same running app for both, which had a consistent structure to the training plans, allowing for a number of comparisons to be made. For this post, we will keep things simple and focus on only these four variables to demonstrate a variety of functions, visualizations, and tests:
- Attempt – Categorical variable indicating if a run was on the first or second attempt with the program.
- Distance – Continuous variable representing the distance, measured in miles.
- Workout – Categorical variable representing the five possible workout types: Easy, Intervals, Long, Race & Segments.
- Session – Ordinal variable, numbered 1-46, representing which run (i.e. "session") it was within the program. Note: given the large number of levels, we will treat this as a continuous variable for these examples.
Loading & Viewing Data
Thanks to the open source nature of this software, there are packages and functions for nearly every type of data file. In addition, there are some simple functions that allow you to inspect the data in a variety of ways. The best part is it doesn’t take too much coding:
```r
# Data intake:
Running <- read.csv("https://raw.githubusercontent.com/scottatchison/The-Data-Runner/8c1162e60a0c3af4e900ed38c222304da1542cb9/Half_1_2.csv")

# View the data frame:
Running
```
In the data frame above, we can see that there are 92 rows (i.e. observations) on 13 variables. As long as you have loaded tidyverse into your IDE (i.e. RStudio), you should be able to scroll freely through the data interactively. Typically, we just want to peek at the data though, which you can do with these functions:
- names() – Shows the names of the headers (i.e. variable names)
- head() – Shows the first several rows of a matrix or data frame
- tail() – Shows the last several rows of a matrix or data frame
- glimpse() – Transposed preview from the tidyverse, making it possible to see every column in a data frame
- str() – Compactly displays the structure of an object, including each variable's type and its first few values
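Each of these takes the data frame as its argument, so a quick inspection looks like:

```r
# Quick ways to inspect the Running data frame:
names(Running)    # variable (i.e. header) names
head(Running)     # first several rows
tail(Running)     # last several rows
glimpse(Running)  # every column, transposed (requires tidyverse)
str(Running)      # structure: each variable's type and first few values
```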
More often than not, we need to do some manipulation of the data before any kind of analysis. If I am being honest, cleaning, joining, and reshaping data typically makes up about 80% of my time on data projects. R can be great for this, and one helpful tool from the tidyverse is the pipe operator. The pipe operator (%>%) allows you to think linearly and "pipe" data through different functions, almost endlessly. This can be really helpful with filtering, mutating, reshaping, and cleaning data.
Below is a simple example of the pipe operator in action; selecting only the four variables of interest: Attempt, Session, Workout, & Distance. From there, the head() function shows the first few rows of data, demonstrating how we have only those variables of interest:
```r
# Select the variables of interest and overwrite the "Running" data frame.
# Note the pipe operator (%>%): it "pipes" the data into the next function.
Running <- Running %>%
  select(Attempt, Session, Workout, Distance)

# View just the first few rows to confirm only the variables of interest:
head(Running)
```
Base R has a number of built-in functions for summarizing data. These can come in handy when you need quick calculations, and the mosaic package extends several of them to work in the linear formula format we have referenced above. In this example, we calculate the mean of the Distance variable within the Running data frame:
```r
mean(Distance ~ 1, data = Running)
```
We can also use base R's $ syntax, where the $ symbol connects the variable to its data frame, yielding the same result:
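That dollar-sign version looks like this:

```r
# Same calculation, referencing the column directly with $:
mean(Running$Distance)
```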
Some other summary statistic functions that are built into base R include:
- mean() – Calculates the arithmetic mean (i.e. “average”) of the column selected
- median() – Calculates the median of the column selected
- mode() – Caution: despite its name, base R's mode() returns an object's storage mode (i.e. type), not the statistical mode; base R has no built-in statistical mode function
- min() – Determines the minimum value of the column selected
- max() – Determines the maximum value of the column selected
- sd() – Calculates the standard deviation of the column selected
- sum() – Calculates the sum of the column
- range() – Returns the lowest and highest values of the column selected (use diff(range()) for the distance between them)
- IQR() – Calculates the interquartile range (the spread of the middle 50% of the data) of the column selected
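Applied to the Running data with the $ syntax, a few of these look like:

```r
# Base R summary statistics on the Distance column:
median(Running$Distance)
sd(Running$Distance)
range(Running$Distance)  # returns the min and max as a pair of values
IQR(Running$Distance)
```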
Instead of using the options above, the mosaic package contains a number of functions for computation, calculus, statistics, & modeling. For example, the favstats() function returns a number of summary statistics at once, including the five-number summary (min, first quartile, median, third quartile, & max), along with the standard deviation, mean, number of missing observations, and total number of observations:
```r
# Five-number summary (and more) using the favstats() function:
favstats(Running$Distance)

# Same thing coded another way (consistent with the format of later examples):
favstats(Distance ~ 1, data = Running)
```
The favstats() function also allows you to summarize by groups. In this example, the same statistics are calculated for the Distance variable across the five different workout types:
```r
# favstats, separated by Workout type:
favstats(Distance ~ Workout, data = Running)
```
One of R’s greatest advantages is its ability to create customized visualizations. When you incorporate the pipe operator, you can write in layers, adding more detail with each one. You simply start with the kind of chart you want to use, like:
- gf_histogram() – Plots a histogram
- gf_density() – Generates a density plot
- gf_boxplot() – Creates a boxplot
- gf_violin() – Generates a violin plot
- gf_point() – Creates a scatterplot
Then “pipe” the customizations you want to add:
- gf_labs() – Adds labels to the plot
- gf_theme() – Allows the user to customize layout & themes
- gf_lm() – Adds an Ordinary Least Squares (OLS) line to the plot
- gf_smooth() – Adds a smoothed trend line (rather than a straight OLS line) to account for curvature in the data
- geom_jitter() – Adds a small amount of noise to the plotted points to remove overlaps (i.e. "to break ties")
Below are some examples of basic data visualizations using the ggformula functions listed above. Notice how each chart becomes increasingly customized by layering on these functions:
```r
# Plot histogram of Distance variable:
gf_histogram(~ Distance, data = Running)
```
```r
# Density plot of Distance variable, adding a title:
gf_density(~ Distance, data = Running) %>%
  gf_labs(title = "Distances Ran")
```
```r
# Boxplot of Distance by Attempt, adding a title, subtitle, & caption:
gf_boxplot(Distance ~ Attempt, data = Running) %>%
  gf_labs(title = "Boxplots of Distance by Attempt",
          subtitle = "Half Marathon Running Data",
          caption = "Up & Running in R")
```
Basic Statistical Modeling
Now that we have the basic linear approach to coding in R, we can pair visualization with modeling to gain a clearer picture of the data. Below are some examples of basic statistical tests using variables from the Running dataset.
Two Sample T-Test
The variable 'Attempt' refers to whether the run was on the first or second attempt at the program. The running plan was based on time intervals, not mileage, so comparing distance between attempts makes a logical comparison. Given that 'Distance' is a continuous variable and 'Attempt' is a categorical variable with two levels, we can evaluate this relationship using the t.test() function.
```r
# Two-sample t-test of Distance by Attempt:
t.test(Distance ~ Attempt, data = Running)
```
Note – since a two-sample t-test is just a linear model with one binary predictor, we can express the same comparison as a linear model. (One subtlety: t.test() defaults to Welch's unequal-variance correction, so add var.equal = TRUE if you want it to match the lm() output exactly.)
```r
# Same test, but using a linear model approach:
model_1 <- lm(Distance ~ Attempt, data = Running)
summary(model_1)
```
Visualizing these variables can be done with box plots, or through violin plots like in the chart below. Violin plots work like a hybrid of a box plot and a kernel density plot, showing the peaks and valleys in the data:
```r
# Violin plot of Distance by Attempt, adding a title, subtitle, & caption:
gf_violin(Distance ~ Attempt, data = Running) %>%
  gf_labs(title = "Violin Plots of Distances Ran",
          subtitle = "Half Marathon Running Data",
          caption = "Up & Running with R")
```
Simple Linear Regression
To investigate distances ran over time, we can plot (and test) these using an Ordinary Least Squares (OLS) model (i.e. “regression”). In this example, we have Distance as the dependent (i.e. “outcome”) variable and Session as the independent (i.e. “predictor”) variable:
```r
# Simple linear model of Distance over Session:
model_2 <- lm(Distance ~ Session, data = Running)
summary(model_2)
```
This model can be visualized with a simple scatterplot. An OLS regression line was added to the plot to demonstrate how well the model fits the data. As we can see in the plot below, there is clearly a non-zero (positive) slope. However, there are also some clear patterns and bifurcations in the data that the model does not account for, providing a clear example of under-fitting:
```r
# Scatterplot of Distance by Session, with an OLS line:
gf_point(Distance ~ Session, data = Running) %>%
  gf_lm() %>%
  gf_labs(title = "Scatterplot of Distances Ran by Session",
          subtitle = "Half Marathon Running Data",
          caption = "Up & Running with R")
```
Analysis of Variance (ANOVA)
The variable 'Workout' has five different categories: Easy, Intervals, Long, Segments, & Race. With these five levels, we can investigate a continuous variable like Distance using the same linear approach. In this example, we compare Distance across Workout types using an Analysis of Variance (ANOVA):
```r
# One-way ANOVA of Distance by Workout type:
model_3 <- lm(Distance ~ Workout, data = Running)

# Summarize model:
summary(model_3)

# For the classic ANOVA table, use:
anova(model_3)
```
To visualize a one-way ANOVA like this, we can use boxplots or violin plots. In the example below, we can see clear differences in distances ran by workout type, which is not surprising given the structure of the running program:
```r
# Boxplots of Distance by Workout, adding jittered points:
gf_boxplot(Distance ~ Workout, data = Running, fill = ~ Workout) %>%
  gf_labs(title = "Boxplots of Distances Ran by Workout Type",
          subtitle = "Half Marathon Running Data") %>%
  gf_theme(legend.position = "none") + geom_jitter()
```
Since R is a vector-based language, it works great with linear models, which are additive by nature. In the ANOVA example above, we saw clear divisions in the data with respect to distances ran by workout type. Consequently, any model with Distance as the dependent variable should include the Workout variable, in addition to the Session variable from the regression example. This provides a good illustration of the additive nature of linear models, with Distance as the dependent (i.e. "outcome") variable and both Session & Workout as independent (i.e. "predictor") variables:
```r
# Create model of Distance over Session, by Workout:
model_4 <- lm(Distance ~ Session + Workout, data = Running)

# Summarize model:
summary(model_4)
```
In the output above, we have a multiple regression model with multiple slopes and intercepts: each coefficient estimate adjusts the fitted line, and the significance codes (i.e. "*", "**", & "***") on the right indicate which predictors are statistically significant. Visually, this final model is represented in the plot below, with Distance on the Y axis, Session on the X axis, and each Workout type represented by a color. Notice the clear example of multiple slopes and intercepts in this visual:
```r
# Scatterplot of Distances ran by Workout type over Session:
gf_point(Distance ~ Session, data = Running, color = ~ Workout) %>%
  gf_labs(title = "Distances Ran by Session & Workout Type",
          subtitle = "Half Marathon Running Data") %>%
  gf_theme(legend.position = "right") %>%
  gf_lm()
```
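To read those multiple slopes and intercepts numerically rather than visually, you can pull the fitted coefficients straight out of the model object:

```r
# Extract the fitted coefficients from model_4:
coef(model_4)
# The (Intercept) and Session terms define the baseline line (for the
# reference Workout type); each Workout coefficient shifts that line
# up or down for the corresponding workout type.
```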
By now you should be able to set up the R environment, load & view data, create basic statistical visualizations, and model data using a linear approach. With the code chunks provided, you should be able to adapt the code to look at these data however you see fit. Even better would be to branch out and analyze data of your choosing. Numerous datasets come standard with R and its packages, and don't even need to be downloaded. Some commonly used examples are the iris, cars, mtcars, and Titanic datasets in base R, plus the diamonds dataset from ggplot2. If you want to keep with the running theme, I have numerous datasets and analyses (with code examples) at the links below:
- Couch to 5K (C25K) data & analysis
- Faster5K data & analysis
- First Half Marathon data & analysis
- Couch to Half Marathon data & analysis