Up & Running with R

In previous posts, I have talked about the value of knowing a scripting language, like R, for statistical analysis. As open-source software, R allows you to do advanced statistical analysis and build robust models for prediction and analysis, in addition to being an excellent tool for data wrangling and data visualization. However, the biggest barrier to entry for most people is learning the language itself, and that doesn’t need to be the case.

Scripting languages are often approached by learning the grammar of the language first through drills, before eventually getting to statistical analysis and visualization. What’s interesting to me about that approach is that it’s not at all how people learn a language. When we learn to speak – or learn a new language in general – we typically do so through imitation, experimentation, and a whole lot of trial and error. Learning a scripting language is the same. This blog post aims to help users get up and running quickly in R with some simple code that can be adapted for statistical analysis. All charts and analysis can be replicated using the code chunks below or the raw code and data. If possible, I suggest using an Integrated Development Environment (IDE) like RStudio.

The Basics

To get started, I have found it’s easiest to take a linear model approach, such as:

function( Y ~ X, data = DataSetName )

Below is a short description of each part in the pseudo-code above:

  • Y is the outcome of interest (response variable)
  • X is some explanatory variable (or you can use “1” as a placeholder if there is no explanatory variable)
  • DataSetName represents the data loaded into the R environment

Note that function is a placeholder for whatever you want to do with your data (for example mean(), t.test(), or lm()).
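To make the pseudo-code concrete, here is a minimal sketch using mtcars, a dataset that ships with R. It is not the running data used in this post, and the choice of mpg and wt is only for illustration:

```r
# Built-in dataset used only for illustration (not the running data)
# Y = mpg (response), X = wt (explanatory), data = mtcars
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)            # intercept and slope

# Using 1 as a placeholder fits an intercept-only model;
# its single coefficient is the overall mean of Y
fit_mean <- lm(mpg ~ 1, data = mtcars)
coef(fit_mean)
mean(mtcars$mpg)     # same value as the coefficient above
```

The same Y ~ X, data = ... pattern carries over to most of the functions used later in this post.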

Setting up your environment

Most people who use open-source programs like Python or R will agree on one thing: the packages are what make them great. R, in its most basic format – which is often referred to as “Base R” – works like a really big calculator. Base R comes with a number of functions for statistical analysis and plotting, but since R is open source, there is a large community of creators who share packages. This expands the capability of the program tremendously.

There are two steps when you want to use a package:

  1. Install the package from the source
  2. Call the package for use in your program

The best analogy I have heard is to think of an R package like a song you buy online. You buy each song one time to add it to your music library. If you want to make a playlist, you need to go to your music library and select the songs you want to hear. R works the same way, but with one major difference: everything is free. Below is how you would install and load the packages necessary for this analysis:

# Note: the hashtag character (#) designates user comments... Don't be afraid to use them *very* generously to document your code.

# Install Packages (you only have to do this once)
install.packages("tidyverse")
install.packages("mosaic")

# Load Packages
library(tidyverse)
library(mosaic)
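If you rerun a script often, you may not want to reinstall packages every time. Below is one possible sketch that checks which packages are missing before installing. The helper name missing_packages() is my own invention for this example, not part of R or any package:

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "mosaic"))
to_install

# Install only what is missing (uncomment to run)
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that nothing is downloaded unless you choose to run it.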

The Data

For this post, I have included a dataset from my GitHub account of two different half marathon training seasons: one from 2020 & one from 2021. I used the same running app for both of them, which had a consistent structure to the training plans, allowing for a number of comparisons to be made. For this post, we will keep things simple and focus only on these 4 variables to demonstrate a variety of functions, visualizations, and tests:

  • Attempt – Categorical variable indicating if a run was on the first or second attempt with the program.
  • Distance – Continuous variable representing the distance, measured in miles.
  • Workout – Categorical variable representing the five possible workout types: Easy, Intervals, Long, Race & Segments.
  • Session – Ordinal variable, numbered 1-46, representing which run (i.e. “session”) it was within the program. Note: given the large number of levels, we will treat this as a continuous variable for these examples.

Loading & Viewing Data

Thanks to the open source nature of this software, there are packages and functions for nearly every type of data file. In addition, there are some simple functions that allow you to inspect the data in a variety of ways. The best part is it doesn’t take too much coding:

# Data intake:

Running <- read.csv("https://raw.githubusercontent.com/scottatchison/The-Data-Runner/8c1162e60a0c3af4e900ed38c222304da1542cb9/Half_1_2.csv")

# View the data frame:

Running
Figure 1 – Printout of Running dataframe

In the data frame above, we can see that there are 92 rows (i.e. observations) on 13 variables. As long as you have loaded tidyverse in your IDE (e.g. RStudio), you should be able to scroll freely through the data interactively. Typically, we just want to peek at the data though, which you can do with these functions:

  • names() – Shows the names of the headers (i.e. variable names)
  • head() – Shows the first several rows of a matrix or data frame
  • tail() – Shows the last several rows of a matrix or data frame
  • glimpse() – Transposed version of print, making it possible to see every column in a data frame.
  • str() – Compactly displays the structure of an object; for a data frame, it lists each column with its type and first few values (i.e. similar to glimpse(), but built into base R).
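Here is a quick sketch of the base R functions from this list in action. The runs data frame is a small made-up stand-in for the Running data, so it can be run without downloading anything:

```r
# Small made-up data frame standing in for the Running data
runs <- data.frame(
  Session  = 1:8,
  Workout  = c("Easy", "Intervals", "Easy", "Long",
               "Easy", "Segments", "Easy", "Long"),
  Distance = c(2.1, 2.8, 2.3, 4.0, 2.5, 3.1, 2.6, 5.0)
)

names(runs)       # column names
head(runs, 3)     # first 3 rows
tail(runs, 3)     # last 3 rows
str(runs)         # type and first values of each column
dim(runs)         # number of rows, then number of columns
```

Both head() and tail() accept a second argument for the number of rows to show; the default is 6.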

More times than not, we need to do some manipulation of the data before we do any kind of analysis. If I am being honest, cleaning, joining, and reshaping data typically makes up about 80% of my time on data projects. R can be great for this, and one helpful function from the Tidyverse is the “pipe operator.” The pipe operator (%>%) allows you to think linearly and “pipe” data through different functions, almost endlessly. This can be really helpful with filtering, mutating, reshaping, and cleaning data.

Below is a simple example of the pipe operator in action; selecting only the four variables of interest: Attempt, Session, Workout, & Distance. From there, the head() function shows the first few rows of data, demonstrating how we have only those variables of interest:

# Select variables of interest and overwrite "Running" dataframe

Running <- Running %>% select(Attempt, Session, Workout, Distance)

## Note the pipe operator (%>%) above: it passes the Running data frame into select()

# view just the first few rows to confirm only the variables of interest
head(Running)
Figure 2 – Example of head() function showing variables of interest
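For comparison, the same kind of column selection can be done in base R without loading any packages. This is only a sketch: the runs data frame and its Notes column are made up for the example, and the native pipe (|>) assumes R 4.1 or later:

```r
# Made-up data frame with one extra column we do not need
runs <- data.frame(
  Attempt  = c("First", "First", "Second", "Second"),
  Session  = c(1, 2, 1, 2),
  Workout  = c("Easy", "Long", "Easy", "Long"),
  Distance = c(2.1, 4.0, 2.4, 4.6),
  Notes    = c("", "hot", "", "windy")
)

# Keep only the four variables of interest
keep <- c("Attempt", "Session", "Workout", "Distance")
runs_small <- runs[, keep]

# The native pipe (|>) chains steps much like %>% does
runs_small |> head(2)
```

Which style you use is mostly a matter of taste; the tidyverse version tends to read more naturally once several steps are chained together.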

Summarizing Data

Base R has a number of built-in functions for summarizing data. These come in handy for quick calculations, and once the Mosaic package is loaded they also work in the linear format we have referenced above (on its own, base R’s mean() does not accept a formula). In this example we are calculating the mean of the Distance variable within the Running data frame:

mean(Distance ~ 1, data = Running)

We can also use this approach – note the $ symbol connecting the variable within the data frame – yielding the same result:

mean(Running$Distance)

Some other summary statistic functions that are built into base R include:

  • mean() – Calculates the arithmetic mean (i.e. “average”) of the column selected
  • median() – Calculates the median of the column selected
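  • mode() – Careful: this returns the storage mode (i.e. data type) of the column selected, not the most frequent value; base R has no built-in function for the statistical mode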
  • min() – Determines the minimum value of the column selected
  • max() – Determines the maximum value of the column selected
  • sd() – Calculates the standard deviation of the column selected
  • sum() – Calculates the sum of the column
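  • range() – Returns the lowest and highest data points of the column selected (use diff(range()) for the distance between them)
  • IQR() – Calculates the interquartile range (the spread of the middle 50% of the data) of the column selected; Mosaic also provides a lowercase iqr()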
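A few of these functions behave differently than their names suggest, so here is a short sketch on a made-up vector. The stat_mode() helper is hypothetical (written for this example), since base R does not ship a function for the most frequent value:

```r
x <- c(2, 3, 3, 4, 5, 5, 5, 8)

mode(x)          # "numeric": the storage mode, not the most common value
range(x)         # two values: the minimum and the maximum
diff(range(x))   # the distance between them
IQR(x)           # interquartile range (note the capital letters in base R)

# Hypothetical helper for the statistical mode (most frequent value);
# if there is a tie, the value that appears first in v is returned
stat_mode <- function(v) {
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}
stat_mode(x)     # 5
```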

In addition to the options above, the Mosaic package contains a number of functions for computation, calculus, statistics, & modeling. For example, the favstats() function computes a number of summary statistics at once, including a five number summary (min, first quartile, median, third quartile, & max), along with standard deviation, mean, number of missing observations, and total number of observations:

# Five Number Summary using favstats function:

favstats(Running$Distance)

# Same thing coded another way (consistent with the format of later examples):

favstats(Distance ~ 1, data = Running)
Figure 3 – example of favstats() function showing summary statistics

The favstats() function also allows you to summarize by groups. In this example, the same statistics are calculated for the Distance variable by the 5 different workout types:

# Favstats, separated by Workout type

favstats(Distance ~ Workout, data = Running)
Figure 4 – Example of favstats() function by group
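If Mosaic is not available, grouped summaries like this can be approximated in base R. The sketch below uses a tiny made-up data frame rather than the Running data, and it does not reproduce every column that favstats() reports:

```r
runs <- data.frame(
  Workout  = c("Easy", "Easy", "Easy", "Long", "Long", "Long"),
  Distance = c(2.0, 2.5, 3.0, 4.0, 5.0, 6.0)
)

# Mean distance for each workout type, using the same formula format
agg <- aggregate(Distance ~ Workout, data = runs, FUN = mean)
agg

# Several statistics at once: one column per workout type
stats <- sapply(split(runs$Distance, runs$Workout), function(d) {
  c(min = min(d), median = median(d), max = max(d),
    mean = mean(d), sd = sd(d), n = length(d))
})
stats
```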

Plotting Basics

One of R’s greatest advantages is its ability to create customized visualizations. When you incorporate the pipe operator, you can write in layers, adding more detail with each one. You simply start with the kind of chart you want to use, like:

  • gf_histogram() – Plots a histogram
  • gf_density() – Generates a density plot
  • gf_boxplot() – Creates a boxplot
  • gf_violin() – Generates a violin plot
  • gf_point() – Creates a scatterplot

Then “pipe” the customizations you want to add:

  • gf_labs() – Adds labels to the plot
  • gf_theme() – Allows the user to customize layout & themes
  • gf_lm() – Adds an Ordinary Least Squares (OLS) line to the plot
  • gf_smooth() – Adds a smoothed trend line that can follow curvature in the data.
  • geom_jitter() – A ggplot2 layer (added with +) that adds a small amount of random noise to the points to reduce overlap (i.e. “to break ties”)

Below are some examples of basic data visualizations using the ggformula functions listed above. Notice how each chart becomes increasingly customized as more of these functions are added:

Histogram

# Plot histogram of Distance variable:

gf_histogram(~Distance, data = Running)
Plot 1 – Histogram of Distance variable

Density Plot

# Density plot of Distance variable, adding a title:

gf_density(~Distance, data = Running) %>%
  gf_labs(title = "Distances Ran")
Plot 2 – Density plot of Distance variable

Boxplots

# Boxplot of Distance by Attempt; adding subtitle & caption

gf_boxplot(Distance ~ Attempt, data = Running) %>%
  gf_labs(title = "Boxplots of Distance by Attempt", subtitle = "Half Marathon Running Data", caption = "Up & Running with R")
Plot 3 – Boxplot of Distance by Attempt

Basic Statistical Modeling

Now that we have the basic linear approach to coding in R, we can pair visualizing with modeling to gain a clearer picture of the data. Below are some examples of some basic statistical tests with the variables from the Running dataset.

Two Sample T-Test

The variable ‘Attempt’ refers to whether or not the run was on the first or second attempt at the program. The running plan was based on time intervals, not mileage, so looking at distance between attempts would create a logical comparison. Given that ‘Distance’ is a continuous variable and Attempt is a categorical variable with two levels, we can evaluate this model using the t.test() function.

# Two Sample T-test between Distance and Attempt
t.test(Distance ~ Attempt, data = Running)
Figure 5 – Example of t.test() function

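Note – A two sample t-test is a special case of the linear model, so the same comparison can be fit with lm(). The results match exactly when t.test() is run with var.equal = TRUE; the default (Welch) version does not assume equal variances, so its t statistic and degrees of freedom can differ slightly:

# Same comparison, but using a linear model approach: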

model_1 <- lm(Distance ~ Attempt, data = Running)

summary(model_1)
Figure 6 – Example of T-test using a linear approach
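The link between the two sample t-test and the linear model can be checked directly. The sketch below uses sleep, a small dataset built into R, purely as a stand-in with one numeric variable and one two-level factor. Setting var.equal = TRUE requests the pooled-variance t-test, which is the version lm() reproduces; the default Welch test can give slightly different numbers:

```r
# Built-in `sleep` data: `extra` is numeric, `group` is a two-level factor
tt  <- t.test(extra ~ group, data = sleep, var.equal = TRUE)
fit <- lm(extra ~ group, data = sleep)
lm_row <- summary(fit)$coefficients["group2", ]

# Same size of t statistic (the sign depends on which group is subtracted)
abs(unname(tt$statistic))
abs(unname(lm_row["t value"]))

# Same p-value
tt$p.value
unname(lm_row["Pr(>|t|)"])
```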

Visualizing these variables can be done with box plots, or through violin plots like in the chart below. Violin plots work like a hybrid of a box plot and a kernel density plot, showing the peaks and valleys in the data:

# Violin plot of Distance by Attempt; adding caption
gf_violin(Distance ~ Attempt, data = Running) %>%
  gf_labs(title = "Violin Plots of Distances Ran", subtitle = "Half Marathon Running Data", caption = "Up & Running with R")
Plot 4 – Violin plot of Distance by Attempt

Simple Linear Regression

To investigate distances ran over time, we can plot (and test) these using an Ordinary Least Squares (OLS) model (i.e. “regression”). In this example, we have Distance as the dependent (i.e. “outcome”) variable and Session as the independent (i.e. “predictor”) variable:

# Simple linear model of Distance over Session:
model_2 <- lm(Distance ~ Session, data = Running)

summary(model_2)
Figure 7 – Example of a simple linear (i.e. Ordinary Least Squares) regression
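The printed summary is useful to read, but individual pieces of a model can also be pulled out by name. Here is a sketch using the built-in cars dataset (stopping distance by speed) rather than the Running data:

```r
# Simple linear regression on a built-in dataset
fit <- lm(dist ~ speed, data = cars)

slope <- coef(fit)["speed"]            # estimated slope
r2    <- summary(fit)$r.squared        # proportion of variance explained
ci    <- confint(fit)["speed", ]       # 95% confidence interval for the slope

slope
r2
ci
```

If the confidence interval for the slope does not include zero, that lines up with a significant slope in the summary() output at the 5% level.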

This model can be visualized with a simple scatterplot. An OLS regression line was added to this plot to demonstrate how well the model fits the data. As we can see in the plot below, there is clearly a non-zero (positive) slope to this model. However, there are also some clear patterns and bifurcations in this data that are not well accounted for by this model, providing a clear example of under-fitting:

gf_point(Distance ~ Session, data = Running) %>%
  gf_lm() %>%
  gf_labs(title = "Scatterplot of Distances Ran by Session", subtitle = "Half Marathon Running Data", caption = "Up & Running with R")
Plot 5 – Scatterplot of Distance over Session

Analysis of Variance (ANOVA)

The variable ‘Workout’ has five different categories: Easy, Intervals, Long, Segments, & Race. With these five levels, we can compare a continuous variable like Distance across groups by employing the same linear approach. In this example, we have Distance separated by Workout type, using an Analysis of Variance (ANOVA):

# One Way ANOVA of Distance by Workout Type:
model_3 <- lm(Distance ~ Workout, data = Running)

# Summarize model:
summary(model_3)
Figure 8 – Example of One Way ANOVA using linear approach
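When an ANOVA is fit with lm(), the overall F test appears on the last line of the summary() output. Passing the same model to anova() gives the classic ANOVA table. This sketch uses the built-in PlantGrowth dataset (plant weight across three groups) as a stand-in for Distance by Workout:

```r
# One-way ANOVA on a built-in dataset, using the linear approach
fit <- lm(weight ~ group, data = PlantGrowth)

# Classic ANOVA table (degrees of freedom, sums of squares, F, p-value)
tab <- anova(fit)
tab

# The F statistic is the same one reported at the bottom of summary(fit)
f_anova   <- tab[["F value"]][1]
f_summary <- unname(summary(fit)$fstatistic["value"])
c(f_anova, f_summary)
```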

To visualize a One-Way ANOVA like this example, we can use box-plots or Violin plots. In the example below, we can see clear differences in distances ran by workout type, which is not surprising, given the structure of the running program:

# Boxplots of Distance by Workout, adding the 'Jitter' function:

gf_boxplot(Distance ~ Workout, data = Running, fill = ~ Workout) %>%
  gf_labs(title = "Boxplots of Distances ran by Workout Type", subtitle = "Half Marathon Running Data") %>%
  gf_theme(legend.position = "none") +
  geom_jitter()
Plot 6 – Boxplots of Distance by Workout

Multivariate Modeling

Since R is a vector-based language, it works great with linear models, which are additive by nature. In the ANOVA example above, we saw some clear divisions in the data with respect to distances ran by workout type. Consequently, any model we may want to build with Distance as the dependent variable should include the Workout variable, in addition to the Session variable in the regression example. This provides a good illustration of the additive nature of linear models with Distance as the dependent (i.e. “outcome”) variable, and both Session & Workout as the independent (i.e. “predictor”) variables:

# Create Model of Distance over Session, by Workout:

model_4 <- lm(Distance ~ Session + Workout, data = Running)

# Summarize model:

summary(model_4)
Figure 9 – Example of multiple regression model

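In the output above, we have an example of a multiple regression model with one shared slope for Session and a separate intercept for each Workout type (each Workout coefficient is an offset from the reference level). The significance codes (i.e. “*”, “**”, & “***”) on the right flag the coefficients with small p-values. Visually, the data behind this final model is represented in the plot below, with Distance on the Y axis, Session on the X axis, and each Workout type represented by a color. Note that gf_lm() fits a separate line for each color group, so the plot shows a different slope and intercept for each Workout type, which corresponds to a model with an interaction term (Distance ~ Session * Workout) rather than the purely additive model above: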

# Scatterplot of Distances ran by Workout Type over Session:

gf_point(Distance ~ Session, data = Running, color = ~ Workout) %>%
  gf_labs(title = "Distances Ran by Session & Workout Type", subtitle = "Half Marathon Running Data") %>%
  gf_theme(legend.position = "right") %>%
  gf_lm()
Plot 7 – Scatterplot of Distance by Session & Workout
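The difference between an additive model (one shared slope, separate intercepts) and an interaction model (separate slopes and intercepts) is easiest to see by comparing coefficients. This sketch uses the built-in iris dataset as a stand-in, with Petal.Length playing the role of Session and Species playing the role of Workout:

```r
# Additive model: one slope for Petal.Length, one intercept shift per species
fit_add <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Interaction model: each species also gets its own slope adjustment
fit_int <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)

names(coef(fit_add))   # 4 coefficients
names(coef(fit_int))   # 6 coefficients (2 extra slope adjustments)

# F test of whether the extra slopes improve the fit
anova(fit_add, fit_int)
```

The same comparison could be run on the running data by swapping in Distance ~ Session + Workout and Distance ~ Session * Workout.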

Final Thoughts

By now you should be able to set up the R environment, load & view data, create basic statistical visualizations, and model data using a linear approach. With the code chunks provided, you should be able to adapt the code to look at these data however you see fit. Even better would be to branch out and analyze data of your choosing. Numerous datasets come standard with R, and don’t even need to be loaded from a file. Some commonly used examples are the iris, cars, mtcars, and Titanic datasets; the diamonds dataset becomes available once ggplot2 (part of the tidyverse) is loaded. If you want to keep with the running theme, I have numerous datasets and analyses (with code examples) at the links below:

Thanks for reading!

From Couch to Half Marathon

In the fall of 2020 I set out to be more active and took up running as a hobby. Right as I completed the Couch to 5K Program (C25K), lockdowns were being implemented across the country and I found myself with a lot more time on my hands. So, I set out to improve speed next by getting my 5K time to under 30 minutes before shifting my focus to running my first ever half marathon. This blog post hopes to take you on the journey with the data I collected along the way.

Going from couch to half marathon took me through three different running plans, using two different iPhone apps. The first running plan I used was the Couch to 5K Program (C25K); a standalone app and plan created by Active. To improve speed, I used the “Tempo Run: 5k” training plan, followed by distance using the “Half Marathon Goal” plan, both found within the RunTracker Pro app. Each of these apps had simple to follow prompts telling you when to run, walk, or pick up the pace, and are designed to progressively build speed and endurance over time.

The C25K running training plan utilizes the run / walk method and includes 3 runs per week – each between 20 and 30 minutes – with the program lasting 9 weeks in total. Over the course of the 27 training runs, the proportion of walking decreases while the proportion of running increases, culminating with three 30 minute runs in the last week of the program. The “Tempo Run: 5k” plan consisted of three runs per week for a total of eight weeks, with the same structure each week: an interval run, a tempo run, and a base run. Similar to the C25K plan, runs progressively increase in both mileage and intensity throughout. Finally, the “Half Marathon Goal” running plan consisted of four runs per week – a base run, an interval run, a tempo run, and a long run – for a total of twelve weeks. In this plan, each week ends with a long, slow distance (LSD) run, culminating in a final run of 2 hours and 15 minutes in the last week of the program. In the graphs below, we see great representations of both normal (bottom) and positively skewed (top) distributions when we look at speed and distances ran throughout these programs:

Overall Distribution of Running Distances & Paces

Given that each program had different goals, we see some clear distinctions between each of them. Unsurprisingly, the Half Marathon program featured the longest runs and the largest spread (i.e. variance) with respect to distance, but the least amount of variability with respect to speed. Another expected result was with the Tempo Run: 5K program, which featured the fastest runs with the least amount of variability in distance throughout the program. These results are clearly represented in the box plots below:

Distribution of Running Distances and Paces by Program

Since there was an ordered component to these programs, the best way to view these data is through a scatter plot, which allows us to visualize progress over time. We can see that running pace improved at a significantly greater rate in the C25K & Faster 5K programs when compared to the Half Marathon plan, which makes sense, given their respective goals. This also explains the curvature in the data when looking at running pace. When investigating distance, we see that most runs stayed within 2 to 4 miles throughout each program, with the exception of the long weekend runs in the Half Marathon plan, which clearly separate themselves from the pack linearly over time:

Scatter Plots of Running Distances & Paces over Time

Final Thoughts

While I initially did not set out to go from Couch to Half Marathon, that is what ended up happening, thanks to a few inexpensive running apps and some extra time on my hands due to a global pandemic. The C25K app is a great resource for anyone who is looking to get into running. Employing the run/walk method, the program consists of 27 runs, spread out over 9 weeks. To run faster I completed the Tempo Run: 5K (i.e. Faster 5K) plan, before tackling the Half Marathon Goal plan, both of which are included in the RunTracker Pro app. Both of these apps are inexpensive and helpful resources for those who are interested in getting into running or improving their running.

One word of caution: Many people who have completed this program stress that you should not be afraid to add extra rest days or repeat workouts as needed. I would agree with that. More importantly, you absolutely should not skip ahead, nor should you run on back-to-back days in the beginning. The quickest way to halt any progress is through injury, so take your time and enjoy the run!

Below are links to posts breaking down each of the programs individually, along with the raw data and code used to create the charts and analysis.

Thanks for reading!

Couch to 5K

Faster 5K

Half Marathon Goal


# clean up (this clears out the previous environment)
rm(list = ls())

# Load Packages 
library(tidyverse)
library(wordcloud2)
library(mosaic)
library(readxl)
library(hrbrthemes)
library(viridis)

# Likert Data Packages
library(psych)
library(FSA)
library(lattice)
library(boot)
library(likert)

#install.packages("wordcloud")
library(wordcloud)
library(tm)


# Grid Extra for Multiplots
library("gridExtra")

# Multiple plot function (just copy paste code)

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}


# Couch to Half

# Import data from CSV, no factors

Couch2Half <- read.csv("Couch2Half.csv", stringsAsFactors = FALSE)

Couch2Half <- Couch2Half %>%
  na.omit()

Couch2Half

Couch2Half %>% 
  count(Program)

ggplot(Couch2Half, aes(x = Program, fill = Program)) +
  geom_bar() + 
  labs( x ="", y = "Number of Runs", title = "Runs by Program",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
  scale_fill_manual(values=c('#999999','#E69F00', '#56B4E9'))

# Plot 1 - Density Plot of Running Distances

p1 <- ggplot(Couch2Half, aes(x=Distance)) + 
  geom_density(color="#E69F00", fill="#999999") + labs( x ="Distance (Miles)", y = "", title = "Running Distances",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

# Plot 2 - Density Plot of Running Paces

p2 <- ggplot(Couch2Half, aes(x=Pace_MPH)) + 
  geom_density(color="#E69F00", fill="#56B4E9") + 
  labs( x ="Pace (Miles per Hour)", y = "", title = "Running Paces",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

# Combine plots using multi-plot function:

multiplot( p1, p2, cols=1)


# Plot 3 - Boxplots of Distance by Program
# (only the manual fill scale is kept; a second fill scale would simply be overridden)
p3 <- Couch2Half %>%
  ggplot( aes(x=Program, y= Distance, fill=Program)) +
  geom_boxplot() +
  geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="", y = "Distance (Miles)", title = "Distance by Program",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
  scale_fill_manual(values=c('#999999','#E69F00', '#56B4E9'))
  

# Plot 4 - Boxplots of Speed by Program
# (only the manual fill scale is kept; a second fill scale would simply be overridden)
p4 <- Couch2Half %>%
  ggplot( aes(x=Program, y= Pace_MPH, fill=Program)) +
  geom_boxplot() +
  geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="", y = "Speed (Miles per Hour)", title = "Speed by Program",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
  scale_fill_manual(values=c('#999999','#E69F00', '#56B4E9'))


# Combine plots using multi-plot function
multiplot( p3, p4, cols=2)


# Plot 5 - Scatterplot of Pace over Training Session, with OLS line
p5 <- ggplot(Couch2Half, aes(x=Run, y= Pace_MPH, color = Program)) +
  geom_point() +
  geom_smooth(method=lm , color="Black", se=TRUE) +
  labs( x ="Training Session", y = "Pace (Miles per Hour)", title = "Running Pace",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank()) +
  scale_color_manual(values=c('#999999','#E69F00', '#56B4E9'))



# Plot 6 - Scatterplot of Distance over Training Session, with OLS line
p6 <- ggplot(Couch2Half, aes(x=Run, y= Distance, color = Program)) +
  geom_point() +
  geom_smooth(method=lm , color="Black", se=TRUE) +
  labs( x ="Training Session", y = "Distance (Miles)", title = "Running Distance",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank()) +
  scale_color_manual(values=c('#999999','#E69F00', '#56B4E9'))

# Combine plots using multi-plot function:

multiplot( p5, p6, cols=1)


# Summary Statistics of Distance
favstats(Couch2Half$Distance)

# Summary Statistics of Pace
favstats(Couch2Half$Pace_MPH)

# Pearson Product Correlation of Distance over Time (session)
cor.test(Couch2Half$Session, Couch2Half$Distance, method = "pearson")

# Pearson Product Correlation of Pace over Time (session)
cor.test(Couch2Half$Session, Couch2Half$Pace_MPH, method = "pearson")
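
For readers who want to work with the correlation results rather than just print them, cor.test() returns an object whose pieces can be extracted by name. This is a sketch on the built-in cars dataset, not the Couch2Half data:

```r
# Pearson correlation on a built-in dataset
ct <- cor.test(cars$speed, cars$dist, method = "pearson")

r <- unname(ct$estimate)   # correlation coefficient
r
ct$p.value                 # p-value for the test that the correlation is zero
ct$conf.int                # 95% confidence interval for the correlation

# With a single predictor, r squared equals the R-squared of the regression
r2_lm <- summary(lm(dist ~ speed, data = cars))$r.squared
c(r^2, r2_lm)
```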

Running Through the Data: C25K

I am not typically one for New Year’s resolutions, but in 2020 I made a really small one: keeping better track of my activity with my Apple Watch. I thought that by simply tracking it, I might improve my overall activity level. I was pretty sedentary at the beginning, but after a few weeks I saw a noticeable improvement in my activity level, and I felt better. So, I decided to see if I could raise the bar a bit more by completing the Couch to 5K program, using the C25K running app.

C25K Training App by Active

The C25K running app is based on Josh Clark’s running program, which scaffolds participants through a series of manageable expectations. The training plan includes 3 runs per week – each between 20 and 30 minutes – with the program lasting 9 weeks in total. The most noticeable feature of the training plan is that it combines both running and walking. Over the course of the 27 training runs, the proportion of walking decreases while the proportion of running increases, culminating with three 30 minute runs in the last week of the program.

C25K Training Plan Example (Week 1, Day 1)

This was my second time using the C25K program. My first time trying the app, I completed it, generally enjoyed it, and even ran a few 5K’s afterwards. However, I ended up getting hurt / burned out, and within a few years was definitely back to square one. This time, I made my primary goal to stay injury and pain free, so I focused more closely on listening to my body, slowing down, and taking rest as needed. Since I am still running today, I decided to take a look back at those training runs and share the data with anyone who is interested.

Speed, Distance, & Progress

The two most obvious variables to look at were speed and distance. The distances ran throughout this program ranged from 2.01 to 3.74 miles, with an average of 2.65 miles per run. Running speed ranged from 4.02 mph (14:55 min/mile) to 5.51 mph (10:53 min/mile), with an average of 4.79 mph (12:31 min/mile). The distributions of my runs by distance and speed for the C25K program can be seen in the density plots below:

Distances & Speeds Ran in C25K Training Plan (Fall, 2020)
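
Since running apps report speed in miles per hour but runners usually talk in minutes per mile, the conversion is worth having on hand. Below is a small sketch; mph_to_pace() is a helper name I made up for this example, and it truncates to whole seconds, which matches the paces quoted above:

```r
# Hypothetical helper: convert speed (mph) to pace ("minutes:seconds" per mile)
mph_to_pace <- function(mph) {
  total_sec <- floor(3600 / mph)        # whole seconds needed to cover one mile
  sprintf("%d:%02d",
          as.integer(total_sec %/% 60), # minutes
          as.integer(total_sec %% 60))  # leftover seconds
}

mph_to_pace(c(4.02, 5.51, 4.79))
# "14:55" "10:53" "12:31"
```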

Since people are generally more interested in seeing progress, below are scatter plots of distances covered and running speeds over the course of the 27 training runs. At first glance, we see that both training volume (mileage) and intensity (speed) increase steadily over the duration of the training program. When you take a closer look at both scatter plots, you see clear cycles ebbing and flowing along the positive slopes. Most training plans are designed to take on this kind of shape, so neither of these results is surprising:

Running Progress (Distance & Speed), Fall 2020

Looking back at the data two years removed, a number of interesting things stand out to me. The first one is how tightly packed, and predictable, the data is. Both speed and distance remain very similar in adjacent runs. This is how the program is designed, and completely makes sense when developing a fitness base. However, most of the training I do now is very different than that. Speed and distance vary widely from run to run, to allow for different kinds of stress and recovery. The second novel finding was how steep the slope was for both variables. When first starting out, the good news is you are probably going to improve very quickly – although it may not feel like it at the time. The longer you run, the more the rate of improvement slows down. Most of my work now as a runner is built on slow, gradual gains, so improvement like this over such a short period would put me at risk of injury now. The key difference for me now is I can run much farther and have a much higher top speed, but the rate of progress is far less noticeable.

Final Thoughts

With millions of downloads, the C25K app has consistently been one of the most popular training apps for new runners, and for good reason. Based on a series of running plans developed by Josh Clark in the 90s, the C25K training plans are structured to build runners up slowly, using a run / walk method. Whenever I talk with people who are interested in starting a running routine, one of the first things I recommend is they get this app, primarily because it employs the run / walk method. Many people think running should not include walking, or that walking is cheating or a sign of weakness. Objectively, it is not. The longer you run, the more important it is that you find your ideal pace, in order to keep your heart rate down and breathing under control. The run / walk method accomplishes this by slowly increasing the proportion of running to walking over time. Also, you would be amazed at how fast and how far some people who use the run / walk method can go.

A couple of words of caution about the program, though. First and foremost, no one training app is going to fit everyone. Depending on your current level of fitness and a variety of other factors, the training program may take longer than 9 weeks. One of the most consistent pieces of advice you will find on the C25K program is that you should not be afraid to repeat runs, repeat weeks, or add extra rest if your body needs it. I couldn’t agree with this more. There are a few times when the increase in running volume felt like a lot (week 5, for example), so don’t be scared to slow down or add some extra rest. Definitely don’t skip ahead or run on back-to-back days. The app is built so you will get faster and you will run farther as you progress through the program. That’s baked in, but none of that will matter if you get hurt. Increasing speed or volume too quickly is the fastest way to get injured, but if you listen to your body and aren’t afraid to slow down (i.e. walk more), then C25K should work great.

Below are some links to C25K reviews, along with the raw data and code used to create the charts and analysis. For my next post, I plan to break down the data for the Faster 5k Training Plan that I used to shave a few minutes off my 5k time by introducing speed work.

Thanks for reading!  

Resources & Code:

C25K Running Data can be found here. The code I used (in R) to create plots and analysis is below:

# FRONT MATTER

### All packages can be downloaded using the install.packages() function. This only needs to be done once before loading. 

## Load Packages 
library(tidyverse)
library(wordcloud2)
library(mosaic)
library(readxl)

## Grid Extra for Multiplots
library("gridExtra")

## Multiple plot function
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots <- length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots == 1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

# COUCH TO 5K

# Data Intake

C25K<- read.csv("https://raw.githubusercontent.com/scottatchison/The-Data-Runner/8c1162e60a0c3af4e900ed38c222304da1542cb9/Half_1_2.csv")

C25K

## Plot 1 - Density Plot of Running Distances

p1 <- ggplot(C25K, aes(x=Distance)) + 
  geom_density(color="Green", fill="Purple") + labs( x ="Distance (Miles)", y = "", title = "Distribution of Running Distances",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

## Plot 2 - Density Plot of Running Speeds

p2 <- ggplot(C25K, aes(x=Pace_MPH)) + 
  geom_density(color="Purple", fill="Green") + 
  labs( x ="Speed (Miles per Hour)", y = "", title = "Distribution of Running Speeds",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

## Combine plots using multi-plot function:

multiplot( p1, p2, cols=1)

## Plot 3 - Scatter Plot of Running Distance over Time (with linear trend line)

p3 <- ggplot(C25K, aes(x=Session, y= Distance)) + geom_point(color="blue") +  geom_smooth(method=lm , color="red", se=TRUE) + labs(x ="Training Session", y = "Distance (Miles)", title = "Progression of Running Distance",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
   theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Plot 4 - Scatter Plot of Running Speed over Time (with linear trend line)

p4<- ggplot(C25K, aes(x=Session, y= Pace_MPH)) + geom_point(color="red") +  geom_smooth(method=lm , color="blue", se=TRUE) + labs( x ="Training Session", y = "Speed (Miles per Hour)", title = "Progression of Running Speed",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Combine plots using multi-plot function
multiplot( p3, p4, cols=1)

## Summary Statistics of Distance
favstats(C25K$Distance)

# Summary Statistics of Pace
favstats(C25K$Pace_MPH)

# Pearson Product Correlation of Distance over Time (session)
cor.test(C25K$Session, C25K$Distance, method = "pearson")

# Pearson Product Correlation of Pace over Time (session)
cor.test(C25K$Session, C25K$Pace_MPH, method = "pearson")

# Simple Linear Model of Distance & Session
Distance <- lm(Distance ~ Session, data =C25K)
summary(Distance)

# Simple Linear Model of Pace & Session
Speed <- lm(Pace_MPH ~ Session, data =C25K)
summary(Speed)
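
summary() prints the full regression table, but if you only want the slope I talk about above, you can pull it out directly. This is a rough sketch rather than part of my original analysis: `toy` is made-up data (not the real C25K values) and `fit` is just an illustrative name. `coef()`, `confint()`, and `cor()` are standard Base R functions.

```r
# Toy data for illustration only; these are NOT the real C25K values
toy <- data.frame(
  Session  = 1:8,
  Pace_MPH = c(4.0, 4.1, 4.3, 4.2, 4.5, 4.6, 4.8, 4.9)
)

# Same model form as above: speed explained by training session
fit <- lm(Pace_MPH ~ Session, data = toy)

# Slope: estimated change in speed (mph) per training session
slope <- unname(coef(fit)["Session"])

# 95% confidence interval for the slope (a 1 x 2 matrix: lower, upper)
slope_ci <- confint(fit, "Session", level = 0.95)

# With a single predictor, R-squared equals the squared Pearson correlation
r_squared <- summary(fit)$r.squared
r <- cor(toy$Session, toy$Pace_MPH)

slope
slope_ci
c(r_squared = r_squared, r_squared_from_cor = r^2)
```

The last line is a handy sanity check: for a simple linear model like these, the correlation test and the regression are telling the same story in two different ways.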