It’s Time to Ditch SPSS

If you were trained in a social science, there is a good chance that you have had to use, or still use, SPSS. The Statistical Package for the Social Sciences (SPSS) is a powerful program and arguably the industry standard in academic social science research. It is menu-driven software that is easy enough to learn – as long as teachers provide assignments and tutorials – to get you analyzing data quickly. In other words, SPSS is a good program for statistical analysis with a low barrier to entry, making it a great fit for social scientists who are just learning statistics. However, SPSS has its limitations, and there are options out there that are not only more robust but completely open source (i.e., free).

Open Source Software:

One of the advantages of menu-driven statistical software has been its ease of use. Fortunately, many of the open source options – like Python and R – have come a long way toward lowering the learning curve in recent years. Interfaces such as RStudio and Jupyter Notebooks (among many others) have made these languages much easier to learn, and they come with many features unavailable in menu-driven programs like SPSS. In addition, when working on projects in Python or R, you can save them in a way that allows you to replicate, expand, and share your analysis easily. Finally, what really sets languages like R and Python apart – besides being free – is that there are a number of packages available from a community of people who like to share and help solve problems. This culture of learning and sharing has created options for advanced statistical analysis and modeling – including machine learning – in addition to highly customizable data visualizations and tools for extracting, transforming, and loading (ETL) data to create efficient pipelines.

Transparency:

The importance of transparency in your data analysis cannot be stressed enough. In statistics, it is easy to put a variable in the wrong place or select the wrong test, yet still get results that are statistically significant (and wrong). Typically, the people reading your analysis don’t get to see your raw data or the processes you went through; black-box programs like SPSS only compound this. With language-based tools like R and Python, you show every aspect of your work along the way, providing greater credibility in the process. Letting other researchers see how you reached your conclusion – or at least giving them the ability to – only strengthens your research and analysis.

Demand:

One more advantage to learning a code-based language like Python or R is that those skills are in demand. If you are a graduate student or an academic looking for jobs outside of the academy, skills in data science are very marketable. Industries everywhere want to gain greater insights into their sales, operations, and consumer base. In addition, the federal government and many academic institutions need people with the skills to handle large databases that aren’t easily compatible with menu-driven programs like SPSS.

What about SAS and Stata?

Both SAS and Stata are statistical software packages that rely on a code-based language. These programs are not free, but they are very powerful, offer a lot of options, and work well with large datasets. For years, SAS has been the preferred software in manufacturing and the medical field, while Stata is often used in political science and survey research. Some organizations stick with them simply because they don’t want to spend the time and money to retrain their employees in other languages, not to mention the legacy code they have relied on for years, if not decades. The big drawback of these two programs, besides cost, is that you can’t easily access the packages available in open source languages like Python or R.

Versatility of Python and R

If you know how to program in Python or R, it is much easier to switch over to other languages. To code effectively in Python or R, you have to clearly understand your variables and your algorithm. In other words, you have to tell the software exactly what to do. Once you get past the basics, these languages are not difficult to understand. The problem for most people, though, is getting past the basics, which just takes some patience and, honestly, a lot of trial and error. Once you are comfortable summarizing, analyzing, and visualizing data with a language-based tool, you’ll have a deeper understanding of what the data are actually doing. More importantly, you can easily transfer those skills to other programs.

Personally, I began on SPSS. It did more than what I needed for my classes in behavioral statistics. When I took a couple of applied statistics courses, we used Minitab, in part because it was developed and is still housed at Penn State (so it was free to us). Those two applied stats classes didn’t care what software you used, but they had clear tutorials that went along with our free access to Minitab, so I generally used that. Eventually, I started taking even more advanced stats classes that worked exclusively in R. We were required to turn in all homework through R Markdown, knitting the documents to HTML. We got up and running quickly with the mosaic package and were able to do some interesting analysis and data visualization right away. After I completed my graduate certificate in Applied Statistics, I just kept learning how to code better and analyze data in R. Once I had a solid foundation in R, I was able to pick up other languages like SQL, SAS, and Python fairly easily (when needed), because the conceptual foundation had already been set. As a result, I ended up being competitive for numerous jobs outside of academia in both government and industry.

Which option to choose?

This choice depends on what kind of work you see yourself doing. Personally, I prefer R because it works like a really fancy calculator, making it great for modeling. There are also great packages for visualization, like ggplot2, in addition to packages like dplyr for extracting, transforming, and loading (ETL) data. That said, I have worked at places where everyone uses SAS, so I did too. Finally, most positions in data science and statistics now typically list Python (in addition to SQL) as a necessary language. So if I were just starting out, that is probably where I would invest my time. Relatedly, there is one software program you will almost never find in data job postings: SPSS.

Thanks for reading!  

Running Through the Data: Half Marathon Goal by RunTracker

In the latter half of 2020, I set a new goal for myself: run 13.1 miles by the end of the year. Earlier in the year I had completed the Couch to 5K program and later set a goal to improve my time to under 30 minutes. Given the extra time at home thanks to a global pandemic, I set my sights on the half marathon distance. Since I was already familiar with the RunTracker app, I decided to stick with it and used their “Half Marathon Goal” training plan.

The RunTracker app, made by Fitness 22, features a series of running plans tailored to individuals’ current fitness levels and goals. The “Half Marathon Goal” plan consisted of four runs per week for a total of twelve weeks, with a consistent structure throughout most of the program. After a series of base runs in the first week, the next ten weeks featured a base run on Tuesdays, segments on Thursdays, intervals on Fridays, and a long run on Sundays. The duration of the workouts increased steadily over the first ten weeks before tapering in the final two weeks of the program.

My experience with this running plan was great once I got used to the structure. Previously, the most I had run was three days a week, while this program requires four. That meant runs on consecutive days, which I was not used to. Having just finished a training plan geared toward speed work, I quickly learned I would need to slow down if I was going to keep from getting injured. Once I settled into the format, mileage built progressively and speed eventually followed. By the end of the twelve-week program, I was able to confidently run 13.1 miles using my usual training route, which coincidentally looked like a shoe:

Distance & Pace

Since my goal was to complete a half marathon, the primary variable of interest was obviously distance. Like most runners, I also tend to focus on times, so average running pace served as the secondary variable of interest. Distances run throughout the training program ranged from 2.14 to 13.12 miles per run, with a mean of 4.85 miles. Running speeds ranged from 5.16 to 6.1 miles per hour (11:38 to 9:50 min/mile), with a mean of 5.54 miles per hour (10:50 min/mile). The distributions of my runs by distance and speed for this program can be seen in the density plots below:
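As an aside, the pace figures above are just unit conversions. A quick sketch in R of converting speed (mph) to pace (min:sec per mile) – `mph_to_pace` is a hypothetical helper for illustration, not part of the original analysis:

```r
## Convert a speed in miles per hour to a pace in minutes:seconds per mile.
## (Hypothetical helper, included only to show the conversion.)
mph_to_pace <- function(mph) {
  secs_per_mile <- round(3600 / mph)   # total seconds to cover one mile
  sprintf("%d:%02d", secs_per_mile %/% 60, secs_per_mile %% 60)
}

mph_to_pace(5.54)   # "10:50"
mph_to_pace(6.10)   # "9:50"
```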

Comparing Workouts

When we take a closer look at these distributions by workout type, some clear patterns emerge. Distances for base runs, interval sessions, and segments remained relatively close to one another, ranging from 2.14 to 6.02 miles per run. The long runs on Sundays, though, lived up to their name, ranging from 5.7 to 13.12 miles, with an average of 9.16. Running pace was fairly consistent across workout types, with each averaging between 5.5 and 5.6 miles per hour. Distributions by workout type for distance and pace can be seen in the box plots below:

Training Progress

Given that there is an ordered component to training, we can look at these data linearly (i.e., with regression). Below are scatter plots of distances covered and running speeds over the course of the 46 training runs in the program. We see a slightly positive association in training volume (mileage), while intensity (pace) remained relatively stable throughout the program. Taking a closer look at the distance plot, we can see that the majority of training volume is gained through the long runs on weekends, which is typical of most long-distance training programs:

Cadence & Heart Rate

Two important considerations for runners are heart rate and cadence. When runners let their heart rates get too high, they tire much more quickly, so distance runners constantly work to keep their heart rate down while still running fast. This can be aided by increasing cadence to approximately 180 steps per minute. Increasing cadence allows runners to develop better efficiency in their technique – typically by shortening the stride – which over time can lead to a lower heart rate. This translates into better performance with respect to both speed and endurance. In the plot below, we can see that both cadence and heart rate are positively associated with running pace, with a clear interaction between these two variables as speed increases, represented by the slopes crossing one another:

Final Thoughts

The “Half Marathon Goal” plan on the RunTracker app is geared towards regular runners who are ready to tackle the 13.1-mile distance. The training structure consists of four runs per week: a base run, a session of mile repeats, an interval session, and one long run on the weekend. The variety of workouts is designed primarily to build the strength and endurance to run a half marathon, with some speed work included to build anaerobic capacity as well. For anyone who has been running for a while and is ready to tackle longer distances, this program could be an excellent option.

Below are some links related to running a first half marathon, along with the raw data and code used to create the charts and analysis.

Thanks for reading!

Resources & Code

# FRONT MATTER

### Note: The HM_1.xlsx file will need to be converted to HM_1.csv to read in correctly. Also, all packages can be downloaded using the install.packages() function. This only needs to be done once, before loading.
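Alternatively, since the readxl package is loaded below anyway, the Excel file can in principle be read directly, skipping the CSV conversion – a sketch, assuming the data sit on the first sheet:

```r
## Optional alternative to the CSV conversion:
## read the Excel file directly (first sheet by default).
library(readxl)
HM_1 <- read_excel("HM_1.xlsx")
```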

## clean up (this clears out the previous environment)
rm(list = ls())

## Load Packages 
library(tidyverse)
library(wordcloud2)
library(mosaic)
library(readxl)
library(hrbrthemes)
library(viridis)

## Likert Data Packages
library(psych)
library(FSA)
library(lattice)
library(boot)
library(likert)

## Grid Extra for Multiplots
library("gridExtra")

## Multiple plot function (just copy paste code)

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}


# HALF MARATHON GOAL by RUNTRACKER

## Import data from CSV, no factors

HM_1 <- read.csv("HM_1.csv", stringsAsFactors = FALSE)

HM_1 <- HM_1  %>%
  na.omit()

HM_1 


## Plot 1

p1 <- ggplot(HM_1 , aes(x=Distance)) + 
  geom_density(color="Pink", fill="Pink") + labs( x ="Distance (Miles)", y = "", title = "Running Distances",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())


## Plot 2

p2 <- ggplot(HM_1, aes(x=Pace_MPH)) + 
  geom_density(color="light blue", fill="light blue") + 
  labs( x ="Speed (Miles per Hour)", y = "", title = "Running Pace",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())


## Combine plots using multi-plot function:

multiplot( p1, p2, cols=1)

## Plot 3

p3 <- ggplot(HM_1 , aes(x= Session, y= Distance)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Red", se=TRUE) + labs(x ="Training Session", y = "Distance (Miles)", title = "Running Distance",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
   theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Plot 4

p4<- ggplot(HM_1 , aes(x=Session, y= Pace_MPH)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Blue", se=TRUE) + labs( x ="Training Session", y = "Speed (Miles per Hour)", title = "Running Pace",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Combine plots using multi-plot function
multiplot( p3, p4, cols=1)

## Summary Statistics of Distance
favstats(HM_1$Distance)

## Summary Statistics of Pace
favstats(HM_1$Pace_MPH)



## Pearson Product Correlation of Distance over Time (session)
cor.test(HM_1$Session, HM_1$Distance, method = "pearson")

## Pearson Product Correlation of Pace over Time (session)
cor.test(HM_1$Session, HM_1$Pace_MPH, method = "pearson")


## Plot
p5 <-  HM_1 %>%
  filter(Workout != "Race") %>%
  ggplot( aes(x=Workout, y= Distance, fill=Workout)) +
  geom_boxplot() +
    geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="Workout Type", y = "Distance (Miles)", title = "Comparing Distances",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
    scale_fill_brewer(palette="Reds")
  
## Plot
p6  <-  HM_1 %>%
  filter(Workout != "Race") %>%
  ggplot( aes(x=Workout, y= Pace_MPH, fill=Workout)) +
    geom_boxplot() +
    geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="Workout Type", y = "Speed (Miles per Hour)", title = "Comparing Paces",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
    scale_fill_brewer(palette="Blues")

## Combine plots using multi-plot function
multiplot( p5, p6, cols=2)

## Plot 7

p7 <- ggplot(HM_1 , aes(x= Cadence, y= Distance)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Red", se=TRUE) + labs(x ="Average Running Cadence", y = "Distance (Miles)", title = "Cadence by Distance",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
   theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())


## Plot 8

p8<- ggplot(HM_1 , aes(x=Cadence, y= Pace_MPH)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Green", se=TRUE) + labs( x ="Average Running Cadence", y = "Speed (Miles per Hour)", title = "Cadence by Pace",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())


## Plot 9

p9 <- ggplot(HM_1 , aes(x= Avg_Heart_Rate, y= Distance)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Blue", se=TRUE) + labs(x ="Average Heart Rate", y = "Distance (Miles)", title = "Heart Rate by Distance",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
   theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Plot 10

p10<- ggplot(HM_1 , aes(x=Avg_Heart_Rate, y= Pace_MPH)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Purple", se=TRUE) + labs( x ="Average Heart Rate", y = "Speed (Miles per Hour)", title = "Heart Rate by Pace",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Combine plots using multi-plot function
multiplot( p7, p8, p9, p10, cols=2)

## Pivot data from wide to long for next chart

HM_1A <- gather(HM_1, Measurement, BPM, Cadence, Avg_Heart_Rate)

HM_1A
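As a side note, gather() has been superseded in current versions of tidyr; the same reshape can be written with pivot_longer(), which produces an equivalent long-format table. A sketch using the same column names:

```r
## Equivalent reshape with tidyr's newer pivot_longer()
## (same result as gather() above, with clearer argument names)
HM_1A <- HM_1 %>%
  pivot_longer(cols = c(Cadence, Avg_Heart_Rate),
               names_to  = "Measurement",
               values_to = "BPM")
```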

## Plot 11

p11 <- ggplot(HM_1A, aes(x = Pace_MPH, y = BPM, color = Measurement)) +
     geom_point() +
     geom_smooth(method = "lm", alpha = .15, aes(fill = Measurement)) + labs(x ="Average Pace (Miles per Hour)", y = "Beats per Minute", title = "Heart Rate & Cadence by Pace",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

p11

## Plot 12

p12 <- ggplot(HM_1A, aes(x = Distance, y = BPM, color = Measurement)) +
     geom_point() +
     geom_smooth(method = "lm", alpha = .15, aes(fill = Measurement)) + labs( x ="Average Distance in Miles", y = "Beats per Minute", title = "Heart Rate & Cadence by Distance",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

p12

# Combine plots using multi-plot function
multiplot( p11, p12, cols=1)




Running Through the Data: C25K

I am not typically one for New Year’s resolutions, but in 2020 I made a really small one: keeping better track of my activity (using my Apple Watch). I thought that simply measuring my activity might result in an overall increase. I was pretty sedentary at the beginning, but after a few weeks of tracking I saw a noticeable improvement in my activity level, and I felt better. So, I decided to see if I could raise the bar a bit by completing the Couch to 5K program, using the C25K running app.

C25K Training App by Active

The C25K running app is based on Josh Clark’s running programs, which scaffold participants through a series of manageable expectations. The training plan includes three runs per week – each between 20 and 30 minutes – and lasts nine weeks in total. The most noticeable feature of the plan is that it combines both running and walking. Over the course of the 27 training runs, the proportion of walking decreases while the proportion of running increases, culminating in three 30-minute runs in the last week of the program.

C25K Training Plan Example (Week 1, Day 1)

This was my second time using the C25K program. The first time trying the app, I completed it, generally enjoyed it, and even ran a few 5Ks afterwards. However, I ended up getting hurt and burned out, and within a few years I was definitely back to square one. This time, my primary goal was to stay injury- and pain-free, so I focused more closely on listening to my body, slowing down, and taking rest as needed. Since I am still running today, I decided to take a look back at those training runs and share the data with anyone who is interested:

Speed, Distance, & Progress

The two most obvious variables to look at were speed and distance. The distances run throughout this program ranged from 2.01 to 3.74 miles, with an average of 2.65 miles per run. Running speed ranged from 4.02 mph (14:55 min/mile) to 5.51 mph (10:53 min/mile), with an average of 4.79 mph (12:31 min/mile). The distributions of my runs by distance and speed for the C25K program can be seen in the density plots below:

Distances & Speeds Ran in C25K Training Plan (Fall, 2020)

Since people are generally more interested in seeing progress, below are scatter plots of distances covered and running speeds over the course of the 27 training runs. At first glance, we see a strong positive association in both training volume (mileage) and intensity (speed) throughout the program. When you take a closer look at both scatter plots, you see clear cycles ebbing and flowing along the positive slopes. Most training plans are designed to take on this kind of shape, so neither of these results is surprising:

Running Progress (Distance & Speed), Fall 2020

Looking back at the data two years removed, a number of interesting things stand out to me. The first is how tightly packed, and predictable, the data are. Both speed and distance remain very similar in adjacent runs. This is how the program is designed, and it makes complete sense when developing a fitness base. However, most of the training I do now is very different: speed and distance vary widely from run to run, to allow for different kinds of stress and recovery. The second notable finding was how strong the slope was for both variables. When first starting out, the good news is that you are probably going to improve very quickly – although it may not feel like it at the time. The longer you run, the more the rate of improvement slows. Most of my progress now is built on slow, gradual gains; improvement this rapid over this short a period would put me at risk of injury today. The key difference is that I can now run much longer distances and have a much higher top speed, but the rate of progress is far less noticeable.

Final Thoughts

With millions of downloads, the C25K app has consistently been one of the most popular training apps for new runners, and for good reason. Based on a series of running plans developed by Josh Clark in the ’90s, the C25K training plans are structured to build runners up slowly, using a run/walk method. Whenever I talk with people who are interested in starting a running routine, one of the first things I recommend is that they get this app, primarily because it employs the run/walk method. Many people think running should not include walking, or that walking is cheating or a sign of weakness. Objectively, it is not. The longer you run, the more important it is to find your ideal pace, in order to keep your heart rate down, your breathing under control, and your running form intact. The run/walk method accomplishes this by slowly increasing the proportion of running to walking over time. Also, you would be surprised by how fast and how far some people who use the run/walk method can go.

A couple of words of caution about the program, though. First and foremost, no one training app is going to fit everyone. Depending on your current level of fitness and a variety of other factors, the training program may take longer than nine weeks. One of the most consistent pieces of advice you will find on the C25K program is that you should not be afraid to repeat runs, repeat weeks, or add extra rest if your body needs it. I couldn’t agree more. There were a few times when the increase in running volume felt like a lot (week 5, for example), so don’t be scared to slow down or add some extra rest. Definitely don’t skip ahead or run on back-to-back days. The app is built so that you will get faster and run further as you progress through the program. That’s baked in, but none of it will matter if you get hurt. Increasing speed or volume too quickly is the fastest way to injury, but if you listen to your body and aren’t afraid to slow down (i.e., walk more), then C25K could be a great way to get started.

Below are some links to C25K reviews, along with the raw data and code used to create the charts and analysis. For my next post, I plan to break down the data from the Faster 5K training plan that I used to shave a few minutes off my 5K time by introducing speed work.

Thanks for reading!  

Resources & Code:

C25K Running Data can be found here. The code I used (in R) to create plots and analysis is below:

# FRONT MATTER

### All packages can be downloaded using the install.packages() function. This only needs to be done once before loading. 

## Load Packages 
library(tidyverse)
library(wordcloud2)
library(mosaic)
library(readxl)

## Grid Extra for Multiplots
library("gridExtra")

## Multiple plot function
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

# COUCH TO 5K

# Data Intake

C25K<- read.csv("https://raw.githubusercontent.com/scottatchison/The-Data-Runner/8c1162e60a0c3af4e900ed38c222304da1542cb9/Half_1_2.csv")

C25K

## Plot 1 - Density Plot of Running Distances

p1 <- ggplot(C25K, aes(x=Distance)) + 
  geom_density(color="Green", fill="Purple") + labs( x ="Distance (Miles)", y = "", title = "Distribution of Running Distances",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

## Plot 2 - Density Plot of Running Speeds

p2 <- ggplot(C25K, aes(x=Pace_MPH)) + 
  geom_density(color="Purple", fill="Green") + 
  labs( x ="Speed (Miles per Hour)", y = "", title = "Distribution of Running Speeds",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

## Combine plots using multi-plot function:

multiplot( p1, p2, cols=1)

## Plot 3 - Scatter Plot of Running Distance over Time

p3 <- ggplot(C25K, aes(x=Session, y= Distance)) + geom_point(color="blue") +  geom_smooth(method=lm , color="red", se=TRUE) + labs(x ="Training Session", y = "Distance (Miles)", title = "Progression of Running Distance",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
   theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Plot 4 - Scatter Plot of Running Speed over Time

p4<- ggplot(C25K, aes(x=Session, y= Pace_MPH)) + geom_point(color="red") +  geom_smooth(method=lm , color="blue", se=TRUE) + labs( x ="Training Session", y = "Speed (Miles per Hour)", title = "Progression of Running Speed",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Combine plots using multi-plot function
multiplot( p3, p4, cols=1)

## Summary Statistics of Distance
favstats(C25K$Distance)

# Summary Statistics of Pace
favstats(C25K$Pace_MPH)

# Pearson Product Correlation of Distance over Time (session)
cor.test(C25K$Session, C25K$Distance, method = "pearson")

# Pearson Product Correlation of Pace over Time (session)
cor.test(C25K$Session, C25K$Pace_MPH, method = "pearson")

# Simple Linear Model of Distance & Session
Distance <- lm(Distance ~ Session, data =C25K)
summary(Distance)

# Simple Linear Model of Pace & Session
Speed <- lm(Pace_MPH ~ Session, data =C25K)
summary(Speed)