Blog Feed

FROM COUCH TO 50K

Like a lot of people, I used the lockdown to take up a new hobby: running. Having had a few failed attempts at developing a running habit in the past, I tried a slower, more methodical approach. Eventually, I developed a more sustainable running practice and found the longer distances – especially on trails – could be the most enjoyable, leading me to ultra-running. Fortunately, I stumbled upon a few lessons along the way.

For the longest time, running was simply a means to an end for me. While I was active in sports when I was younger, running was typically utilized in one of three ways: conditioning, weight loss, or punishment. For most sports, we would run to build endurance, or as negative reinforcement when we played poorly. In addition, I was often logging 50 miles or more a week to make weigh-ins for wrestling. With all those miles though, running was still simply a means to an end, and not something to be enjoyed.

Unfortunately, these associations with running would persist for me well into adulthood. As I became older and less active in sports, the need for running dissipated, and I generally became less active over time. Every few years I would get the jolt of motivation to get back in shape, bringing me back to running, but always with the same result: injured, burnt out, and back on the couch.

However, in 2020 several things changed. It was my last semester of grad school, and I was the least active – and at the unhealthiest weight – of my life. I was also spending much of my time analyzing data and presenting findings to business audiences, when I was reminded of the following quote:

“You can’t improve what you don’t measure.”

Peter Drucker

With the Apple Watch my wife had gotten me, I set a simple goal: measure my activity level and work to steadily improve it. After a few months of finding a way to carve out time for exercise – thanks in part to a global pandemic – I eventually came back to running. Within three years, I went from not being able to run for more than a few minutes to running my first ultramarathon. During that span I learned a number of lessons that helped me become a stronger, more durable runner, leading to a more sustainable practice. Some of those lessons were:

Checking the Right Boxes

My goal was not to run races or to get faster, but to build a habit. So, I started by answering a simple categorical question:

“Did I exercise for at least 30 minutes today?”

The image below shows three months of activity from my Apple Watch. The left panel is from the month before I started purposefully tracking my activity. The center panel represents the first month I started tracking activity. The right panel is from a few months later, once I had established a routine of exercising at least 30 minutes a day. Once that routine was in place, everything else fell into place for me with running.

December, 2019 (left) / January, 2020 (center) / June, 2020 (right)

It is important to note that exercise could be just about anything: running, walking, stretching, foam rolling, etc. It all counted, as long as it added up to at least 30 minutes. While this may scream “consistency” to most people, please know that is not what I was striving for. Doing the same workout, or working at the same intensity, every day will lead to boredom, burnout, or injury. I wanted continuity, or the state of being unbroken. This is why I focused on carving out time each day, but allowed for a lot of flexibility within it.

Embracing Variability

In a perfect world, our progress would be linear, consistent, and predictable. However, that is rarely how anything works, especially in running. You have to allow yourself to be ok with a tremendous amount of variability, especially as distances increase. For example, walking is prominently featured in ultra-running, due to the terrain and the distance. There is even a name for it: hiking. This means you may have some fast miles and some really slow miles right next to each other, simply out of necessity. In addition, most training plans utilize a polarized approach, where roughly 20% of your workouts are hard (i.e. speed or hill workouts), and the other 80% are at an easy (i.e. slow) pace. Finally, most training cycles have weeks where you intentionally cut back mileage or intensity to allow for recovery.

Scatterplot of distance over time (left) / Box-plot of distance by Workout Type (right)

In the charts above, you can see the distance of each run throughout a half-marathon training program. The chart on the left shows a scatterplot of distance over time with a clear, positive slope (red line), but lots of variance in the data. The chart on the right shows box-plots representing the dispersion of running distance by workout type. Notice how the data on the right show some workout types with very little variance (like segments and intervals), while others have a ton (like the long runs). Context matters, especially when working with data.

Leaning into Qualitative Data 

While it is easy to gravitate towards numeric (i.e. quantitative) data, qualitative data is the real MVP when it comes to running. Qualitative data may come in the form of text data, like a running journal, or images, video, audio, timelines, etc. Below is a perfect example of the difference between quantitative and qualitative data from the same run:

Weekend long run at Redondo Beach (June, 2022)

On the left are the mile splits of a typical long run I would do on the weekends. Nothing stands out about that run on the left. The image on the right is from a run along Redondo Beach. I didn’t eat breakfast and got a late start, so I was fading at the end. Right as I was finishing up, I ran into my wife at the beginning of her run. So, I joined her for a few more miles where I really hit a wall and felt awful. Afterwards, we got tacos and beer on the beach, and it was all better, but I learned an important lesson about fueling for runs that day.

While both of those images are from the same run, only one of them was memorable. That’s the power of qualitative data. This is why people recommend keeping a running journal or using descriptive measures like perceived effort during a workout. The better I got at listening to my body, the further I could run and the more enjoyable the experience became, all without getting injured or burnt out. That came in part from trying to take a broader view, and not becoming overly obsessed with numbers.

Final Thoughts

As I mentioned in the beginning, my original goal was to develop a long-lasting running habit. By establishing a regular (and achievable) routine, being patient, and learning to listen to my body, I was able to develop something sustainable to build upon. With each new challenge or increase in mileage, I found myself coming back to these principles more and more. Over the course of a few years, that took me from completely sedentary to running my first ultra-marathon at the Salmon Falls 50K:

2023 Salmon Falls 50K

While it is tempting to end this post with a “couch to 50K training plan,” I will leave that to other experts on the Internet. The truth is, nearly everyone is starting at different places, has their own obstacles to overcome, and their own unique running goals. With that being said, I am happy to share my own timeline and progress from couch to 50K:

  • Spring 2020 – Started tracking daily activity and later completed the C25K training program, which I write about here.
  • Summer 2020 – Wanted to work on running faster, so I completed the faster 5K program, which I write about here. 
  • Fall 2020 – Completed the Half Marathon Goal program by RunTracker, which I write about here
  • Summer 2021 – Worked over the summer to improve my 5k/10k times.
  • Fall 2021 – Trained for and ran my first half marathon (Philadelphia Half Marathon).
  • Spring 2022 – Participated in a training group through Fleet Feet. Ran the Shamrock’n Half Marathon. Joined a trail running group.
  • Summer 2022 – Ran the Dirty Secret Trail Run, the Napa to Sonoma Half Marathon, and the Blood Sweat & Beers Trail Run.
  • Fall 2022 – Trained for and ran my first marathon (California International Marathon).
  • Spring 2023 – Trained for and ran my first ultramarathon (Salmon Falls 50K).

As you can see, I opted to take the long-term approach. Also, it should be noted that running a 50K was not something I initially set out to do. Not even close. Things just went that way once I was introduced to trails. Surprises like that have been one of the most enjoyable parts of running for me. It seems like there is always some new adventure just around the corner.

If you have any questions, or want to share your thoughts, I would love to hear them in the comment section below. Thanks for reading!


Up & Running with R

In previous posts, I have talked about the value of knowing a scripting language, like R, for statistical analysis. As open-source software, R allows you to do advanced statistical analysis and build robust models for prediction, in addition to being an excellent tool for data wrangling and data visualization. However, the biggest barrier to entry for most people is learning the language itself, but that doesn’t need to be the case.

Scripting languages are often approached by learning the grammar of the language first through drills, before eventually getting to statistical analysis and visualization. What’s interesting to me about that approach is that it’s not at all how people learn a language. When we learn to speak – or learn a new language in general – we typically do so through imitation, experimentation, and a whole lot of trial and error. Learning a scripting language is the same. This blog post aims to help users get up and running quickly in R with some simple code that can be adapted for statistical analysis. All charts and analysis can be replicated using the code chunks below or the raw code and data. If possible, I suggest using an Integrated Development Environment (IDE) like RStudio.

The Basics

To get started, I have found it’s easiest to take a linear model approach, such as:

function( Y ~ X, data = DataSetName )

Below is a short description of each part in the pseudo-code above:

  • Y is the outcome of interest (response variable)
  • X is some explanatory variable (or you can use “1” as a placeholder if there is no explanatory variable)
  • DataSetName represents the data set loaded into the R environment

Note that the R function is whatever you want to do with your data (calculate a summary statistic, fit a model, draw a plot, etc.).
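For example (purely as an illustration, using the mtcars data that ships with R rather than the running data introduced later), the same template works across very different functions:

# Model fuel economy (mpg) as a function of car weight (wt)
lm(mpg ~ wt, data = mtcars)

# Boxplots of fuel economy by number of cylinders
boxplot(mpg ~ cyl, data = mtcars)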

Setting up your environment

Most people that use open source programs like Python or R will agree on one thing: the packages are what make them great. R, in its most basic format – which is often referred to as “Base R” – works like a really big calculator. Base R comes with a number of functions for statistical analysis and plotting, but since R is open source, there is a large community of creators who share packages. This expands the capability of the program tremendously.

There are two steps when you want to use a package:

  1. Install the package from the source
  2. Call the package for use in your program

The best analogy I have heard is to think of an R package like a song you buy online. You buy each song once to add it to your music library. If you want to make a playlist, you go to your music library and select the songs you want to hear. R works the same way, but with one major difference: everything is free. Below is how you would install and load the packages necessary for this analysis:

# Note: the hashtag character (#) designates user comments... Don't be afraid to use them *very* generously to document your code.

# Install Packages (you only have to do this once)
install.packages("tidyverse")
install.packages("mosaic")

# Load Packages
library(tidyverse)
library(mosaic)

The Data

For this post, I have included a dataset from my GitHub account covering two different half marathon training seasons: one from 2020 and one from 2021. I used the same running app for both of them, which had a consistent structure to the training plans, allowing for a number of comparisons to be made. For this post, we will keep things simple and focus only on these four variables to demonstrate a variety of functions, visualizations, and tests:

  • Attempt – Categorical variable indicating if a run was on the first or second attempt with the program.
  • Distance – Continuous variable representing the distance, measured in miles.
  • Workout – Categorical variable representing the five possible workout types: Easy, Intervals, Long, Race & Segments.
  • Session – Ordinal variable, numbered 1-46, representing which run (i.e. “session”) it was within the program. Note: given the large number of levels, we will treat this as a continuous variable for these examples.

Loading & Viewing Data

Thanks to the open source nature of this software, there are packages and functions for nearly every type of data file. In addition, there are some simple functions that allow you to inspect the data in a variety of ways. The best part is it doesn’t take too much coding:

# Data intake:

Running <- read.csv("https://raw.githubusercontent.com/scottatchison/The-Data-Runner/8c1162e60a0c3af4e900ed38c222304da1542cb9/Half_1_2.csv")

# View the data frame:

Running
Figure 1 – Printout of Running dataframe

In the data frame above, we can see that there are 92 rows (i.e. observations) on 13 variables. As long as you have loaded tidyverse into your IDE (i.e. RStudio), you should be able to scroll freely through the data interactively. Typically, we just want to peek at the data though, which you can do with these functions (a quick example follows the list):

  • names() – Shows the names of the headers (i.e. variable names)
  • head() – Shows the first several rows of a matrix or data frame
  • tail() – Shows the last several rows of a matrix or data frame
  • glimpse() – Transposed version of print, making it possible to see every column in a data frame.
  • str() – Compactly displays the structure of an object, including the variable types and the first few values of each column.
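Here is a minimal sketch of those functions applied to the Running data frame:

# Peek at the data in a few different ways
names(Running)     # variable names
head(Running)      # first several rows
tail(Running)      # last several rows
glimpse(Running)   # every column, transposed (tidyverse)
str(Running)       # structure: variable types and first few values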

More often than not, we need to do some manipulation of the data before we do any kind of analysis. If I am being honest, cleaning, joining, and reshaping data typically makes up about 80% of my time on data projects. R can be great for this, and one helpful tool from the tidyverse is the “pipe operator.” The pipe operator (%>%) allows you to think linearly and “pipe” data through different functions, almost endlessly. This can be really helpful with filtering, mutating, reshaping, and cleaning data.

Below is a simple example of the pipe operator in action; selecting only the four variables of interest: Attempt, Session, Workout, & Distance. From there, the head() function shows the first few rows of data, demonstrating how we have only those variables of interest:

# Select variables of interest and overwrite "Running" dataframe

Running <- Running %>% select(Attempt, Session, Workout, Distance)

## Note the pipe operator (i.e. %>%) above. This is a great tool for "piping" data through a sequence of functions.

# view just the first few rows to confirm only the variables of interest
head(Running)
Figure 2 – Example of head() function showing variables of interest

Summarizing Data

Base R has a number of built-in functions for summarizing data. These can come in handy when needing to make quick calculations and, once the mosaic package is loaded, they also work in the formula format we have referenced above. In this example we are calculating the mean of the Distance variable within the Running data frame:

mean(Distance ~ 1, data = Running)

We can also use this approach – note the $ symbol connecting the variable within the data frame – yielding the same result:

mean(Running$Distance)

Some other summary statistic functions that are built into base R include (a few are demonstrated after the list):

  • mean() – Calculates the arithmetic mean (i.e. “average”) of the column selected
  • median() – Calculates the median of the column selected
  • mode() – Note: in base R, mode() returns the storage type of an object, not the statistical mode of the column selected
  • min() – Determines the minimum value of the column selected
  • max() – Determines the maximum value of the column selected
  • sd() – Calculates the standard deviation of the column selected
  • sum() – Calculates the sum of the column
  • range() – Returns the minimum and maximum values of the column selected
  • IQR() – Calculates the interquartile range (the middle 50% of the data) of the column selected
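For instance, applied to the Distance column:

# A few base R summary functions applied to the Distance column
median(Running$Distance)   # middle value
sd(Running$Distance)       # standard deviation
range(Running$Distance)    # minimum and maximum values
IQR(Running$Distance)      # interquartile range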

As an alternative to the options above, the mosaic package contains a number of functions for computation, calculus, statistics, & modeling. For example, the favstats() function computes a number of summary statistics, including a five number summary (min, first quartile, median, third quartile, & max), along with the standard deviation, mean, number of missing observations, and total number of observations:

# Five Number Summary using favstats function:

favstats(Running$Distance)

# Same thing coded another way (consistent with the format of later examples):

favstats(Distance ~ 1, data = Running)
Figure 3 – example of favstats() function showing summary statistics

The favstats() function also allows you to summarize by groups. In this example, the same statistics are calculated for the Distance variable by the 5 different workout types:

# Favstats, separated by Workout type

favstats(Distance ~ Workout, data = Running)
Figure 4 – Example of favstats() function by group
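The same formula-by-group pattern also works with the individual summary functions once mosaic is loaded; for example:

# Mean and standard deviation of Distance by Workout type (mosaic formula interface)
mean(Distance ~ Workout, data = Running)
sd(Distance ~ Workout, data = Running)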

Plotting Basics

One of R’s greatest advantages is its ability to create customized visualizations. When you incorporate the pipe operator, you can write in layers, adding more detail with each one. You simply start with the kind of chart you want to use, like:

  • gf_histogram() – Plots a histogram
  • gf_density() – Generates a density plot
  • gf_boxplot() – Creates a boxplot
  • gf_violin() – Generates a violin plot
  • gf_point() – Creates a scatterplot

Then “pipe” the customizations you want to add:

  • gf_labs() – Adds labels to the plot
  • gf_theme() – Allows the user to customize layout & themes
  • gf_lm() – Adds an Ordinary Least Squares (OLS) line to the plot
  • gf_smooth() – Adds a smoothed trend line (rather than a straight OLS line) to account for curvature in the data.
  • geom_jitter() – Adds noise to a numeric vector to remove overlaps (i.e. “to break ties”)

Below are some examples of basic data visualizations using the ggformula functions listed above. Notice how each chart becomes increasingly customized by layering on these functions:

Histogram
# Plot histogram of Distance variable:

gf_histogram(~Distance, data = Running)
Plot 1 – Histogram of Distance variable
Density Plot
# Density plot of Distance variable, adding a title:

gf_density(~Distance, data = Running) %>%
gf_labs(title = "Distances Ran")
Plot 2 – Density plot of Distance variable
Boxplots
# Boxplot of Distance by Attempt; adding subtitle & caption

gf_boxplot(Distance ~ Attempt, data = Running) %>%
  gf_labs(title = "Boxplots of Distance by Attempt", subtitle = "Half Marathon Running Data", caption = "Up & Running with R")
Plot 3 – Boxplot of Distance by Attempt

Basic Statistical Modeling

Now that we have the basic linear approach to coding in R, we can pair visualizing with modeling to gain a clearer picture of the data. Below are some examples of basic statistical tests with the variables from the Running dataset.

Two Sample T-Test

The variable ‘Attempt’ refers to whether the run was on the first or second attempt at the program. The running plan was based on time intervals, not mileage, so comparing distances between attempts makes for a logical comparison. Given that ‘Distance’ is a continuous variable and Attempt is a categorical variable with two levels, we can evaluate this model using the t.test() function.

# Two Sample T-test between Distance and Attempt
t.test(Distance ~ Attempt, data = Running)
Figure 5 – Example of t.test() function

Note – A two-sample t-test is just a special case of the linear model, so we can fit the same comparison with lm() and get equivalent results:

# Same test, but using a linear model approach, yielding the same result:

model_1 <- lm(Distance ~ Attempt, data = Running)

summary(model_1)
Figure 6 – Example of T-test using a linear approach
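One small caveat: by default, t.test() runs Welch’s version of the test, which does not assume equal variances, so its numbers will differ slightly from the lm() output. To make the two match exactly, assume equal variances:

# Pooled-variance t-test; this reproduces the t statistic from summary(model_1)
t.test(Distance ~ Attempt, data = Running, var.equal = TRUE)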

Visualizing these variables can be done with box plots, or through violin plots like in the chart below. Violin plots work like a hybrid of a box plot and a kernel density plot, showing the peaks and valleys in the data:

# Violin plot of Distance by Attempt; adding caption
gf_violin(Distance ~ Attempt, data = Running) %>%
  gf_labs(title = "Violin Plots of Distances Ran", subtitle = "Half Marathon Running Data", caption = "Up & Running with R")
Plot 4 – Violin plot Distance by Attempt

Simple Linear Regression

To investigate distances ran over time, we can plot (and test) these using an Ordinary Least Squares (OLS) model (i.e. “regression”). In this example, we have Distance as the dependent (i.e. “outcome”) variable and Session as the independent (i.e. “predictor”) variable:

# Simple linear model of Distance over Session:
model_2 <- lm(Distance ~ Session, data = Running)

summary(model_2)
Figure 7 – Example of a simple linear (i.e. Ordinary Least Squares) regression

This model can be visualized with a simple scatterplot. An OLS regression line was added to this plot to demonstrate how well the model fits the data. As we can see in the plot below, there is clearly a non-zero (positive) slope to this model. However, there are also some clear patterns and bifurcations in these data that are not well accounted for by this model, providing a clear example of under-fitting:

gf_point(Distance ~ Session, data = Running)%>%
gf_lm() %>%
  gf_labs(title = "Scatterplot of Distances Ran by Session", subtitle = "Half Marathon Running Data", caption = "Up & Running with R")
Plot 5 – Scatterplot of Distance over Session
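Since under-fitting is mentioned above, one quick check (a sketch using the gf_smooth() function listed earlier) is to swap the straight OLS line for a flexible smoother:

# Same scatterplot, but with a smoother instead of a straight OLS line
gf_point(Distance ~ Session, data = Running) %>%
  gf_smooth() %>%
  gf_labs(title = "Scatterplot of Distances Ran by Session", subtitle = "Half Marathon Running Data", caption = "Up & Running with R")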

Analysis of Variance (ANOVA)

The variable ‘Workout’ has five different categories: Easy, Intervals, Long, Segments, & Race. With these five levels, we can investigate continuous variables like Distance by employing the same linear approach. In this example, we have Distance separated by Workout type, using an Analysis of Variance (ANOVA):

# One Way ANOVA of Distance by Workout Type:
model_3 <- lm(Distance ~ Workout, data = Running)

# Summarize model:
summary(model_3)
Figure 8 – Example of One Way ANOVA using linear approach
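Note that summary() on a linear model prints the regression coefficients. If you want the classic ANOVA table (sums of squares and an overall F test), you can pass the same model to anova():

# One Way ANOVA table for the same model:
anova(model_3)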

To visualize a One-Way ANOVA like this example, we can use box-plots or Violin plots. In the example below, we can see clear differences in distances ran by workout type, which is not surprising, given the structure of the running program:

# Boxplots of Distance by Workout, adding the 'Jitter' function:

gf_boxplot(Distance ~ Workout, data = Running, fill = ~ Workout) %>%
  gf_labs(title = "Boxplots of Distances ran by Workout Type", subtitle = "Half Marathon Running Data") %>%
  gf_theme(legend.position = "none") +
  geom_jitter()
Plot 6 – Boxplots of Distance by Workout

Multivariate Modeling

Since R is a vector-based language, it works great with linear models, which are additive by nature. In the ANOVA example above, we saw some clear divisions in the data with respect to distances ran by workout type. Consequently, any model we may want to build with Distance as the dependent variable should include the Workout variable, in addition to the Session variable from the regression example. This provides a good illustration of the additive nature of linear models, with Distance as the dependent (i.e. “outcome”) variable, and both Session & Workout as the independent (i.e. “predictor”) variables:

# Create Model of Distance over Session, by Workout:

model_4 <- lm(Distance ~ Session + Workout, data = Running)

# Summarize model:

summary(model_4)
Figure 9 – Example of multiple regression model

In the output above, we have an example of a multiple regression model, with a coefficient estimate for each predictor and significance codes (i.e. “*”, “**”, & “***”) shown to the right. Visually, this final model is represented in the plot below, with Distance on the Y axis, Session on the X axis, and each Workout type represented by a color. Notice the separate lines – each with its own slope and intercept – in this visual example:

# Scatterplot of Distances ran by Workout Type over Session:

gf_point(Distance ~ Session, data = Running, color = ~ Workout) %>%
  gf_labs(title = "Distances Ran by Session & Workout Type", subtitle = "Half Marathon Running Data") %>%
  gf_theme(legend.position = "right") %>%
  gf_lm()
Plot 7 – Scatterplot of Distance by Session & Workout
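As a side note (not part of the original output), one way to confirm that adding Workout improves on the Session-only regression is to compare the two nested models directly:

# Compare the simple regression (model_2) to the multivariate model (model_4)
anova(model_2, model_4)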

Final Thoughts

By now you should be able to set up the R environment, load & view data, create basic statistical visualizations, and model data using a linear approach. With the code chunks provided, you should be able to adapt the code to look at these data however you see fit. Even better would be to branch out and analyze data of your choosing. Numerous datasets come standard in R, and don’t even need to be loaded. Some commonly used examples are the iris, cars, mtcars, diamonds, and titanic datasets. If you want to keep with the running theme, I have numerous datasets and analyses (with code examples) at the links below:
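As one quick sketch of branching out, the same formula-style code from this post can be pointed at the built-in iris data:

# Summary statistics of sepal length by species (mosaic)
favstats(Sepal.Length ~ Species, data = iris)

# Boxplots of sepal length by species (ggformula)
gf_boxplot(Sepal.Length ~ Species, data = iris) %>%
  gf_labs(title = "Sepal Length by Species", caption = "Built-in iris data")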

Thanks for reading!

It’s Time to Ditch SPSS

If you were trained in a social science, there is a good chance that you have had to use, or still use, SPSS. For most social science research, SPSS is a powerful program and arguably the industry standard in academia. The Statistical Package for the Social Sciences (SPSS) is menu-driven software that is easy enough to learn – as long as teachers provide assignments and tutorials – to get you analyzing data quickly. In other words, SPSS is a good program for statistical analysis, and the barrier to entry is not too high, making it a great fit for social scientists who are just learning statistics. However, SPSS does have its limitations, and there are options out there that are not only more robust, but completely open source (i.e. free).

Open Source Software:

One of the advantages of menu-driven statistical software has been the ease of use. Fortunately, many of the open source options – like Python and R – have come a long way towards lowering the learning curve in recent years. Graphical User Interfaces (GUIs), such as RStudio & Jupyter Notebooks (among many others), have made these programs much easier to learn, and come with many features unavailable in menu-driven programs like SPSS. In addition, when working on projects with Python or R, you can save them in a way that allows you to replicate, expand, and share your analysis easily. Finally, what really sets languages like R and Python apart – besides being free – is that there are a number of packages available from a community of people who like to share and help answer problems. This culture of learning and sharing has created options to do advanced statistical analysis and modeling – including machine learning – in addition to highly customizable data visualizations and tools for extracting, transforming, and loading (ETL) data to create efficient data pipelines.

Transparency:

The ability to be transparent in your data analysis cannot be stressed enough. In statistics, it is easy to put a variable in the wrong place or select the wrong test, but still get results that are statistically significant (and wrong). Typically, the people who are reading your analysis don’t get to see your raw data or the process you went through with your analysis. Black box programs, like SPSS, only compound this. With language-based programs like R and Python, you show every aspect of your work along the way, providing greater credibility in the process. Letting other researchers see how you reached your conclusion – or at least giving them the ability to – only strengthens your research and analysis.

Demand:

One more advantage to learning a code-based program like Python or R is that those skills are in demand. If you are a graduate student or an academic looking for jobs outside of the academy, skills in data science are very marketable. Industries everywhere want to gain greater insights into their sales, operations, and consumer base. In addition, the federal government and many academic institutions need people with the skills to handle large databases that aren’t easily compatible with menu-driven software programs like SPSS.

What about SAS and Stata?

Both SAS and Stata are statistical software packages that rely on a code-based language. These programs are not free, but they are very powerful, offer a lot of options, and work well with large datasets. For years, SAS has been the preferred software in manufacturing and the medical field, while Stata is often used in political science and survey research. Many organizations simply stick with them because they don’t want to spend the time and money to retrain their employees in other languages, not to mention rewrite the legacy code they have relied on for years, if not decades. The big drawback of these two programs, besides cost, is that you can’t easily access the packages available in open source software like Python or R.

Versatility of Python and R

If you know how to program in Python or R, it is much easier to switch over to other languages. To effectively code in Python or R, you have to clearly understand your variables and the algorithm. In other words, you have to tell the software exactly what to do. Once you get past the basics, these languages are not difficult to understand. The problem for most people, though, is getting past the basics, which just takes some patience and, honestly, a lot of trial and error. Once you are comfortable summarizing, analyzing, and visualizing data with language-based software, you’ll have a deeper understanding of what the data is actually doing. More importantly, you can easily transfer those skills to other programs.

Personally, I began on SPSS. It did more than what I needed for my classes in behavioral statistics. When I took a couple of applied statistics courses we used Minitab, in part because it was developed and is still housed at Penn State (so it was free to us). Those two applied stats classes didn’t care what software you used, but they had clear tutorials that went along with our free access to Minitab, so I generally used that. Eventually, I started taking even more advanced stats classes which worked exclusively in R. We were required to turn all homework in through R Markdown, knitting the document to HTML. We got up and running quickly with the mosaic package and were able to do some interesting analysis and data visualization right away. After I completed my graduate certificate in Applied Statistics, I just kept on learning how to code better and analyze data in R. Once I had solid foundations with R, I was able to learn other languages like SQL, SAS, & Python fairly easily (when needed), because the conceptual foundation had already been set. As a result, I ended up being competitive for numerous jobs outside of academia in both government and industry.

Which option to choose?

This choice all depends on what kinds of work you see yourself doing. Personally, I prefer R, because it works like a really fancy calculator, making it great for modeling. Also, there are great packages for visualizing, like ggplot2, in addition to packages like dplyr for extracting, transforming, and loading (ETL) data. With that being said, I have worked at places where everyone uses SAS, so I did too. Finally, most positions in data science & statistics now typically list Python (in addition to SQL) as necessary languages. So if I were just starting out, that is probably where I would invest my time. Relatedly, there is one software program you will almost never find in data job postings: SPSS.

Thanks for reading!  

From Couch to Half Marathon

In early 2020 I set out to be more active and took up running as a hobby. Right as I completed the Couch to 5K Program (C25K), lockdowns were being implemented across the country and I found myself with a lot more time on my hands. So, I set out to improve my speed next by getting my 5K time to under 30 minutes before shifting my focus to running my first ever half marathon. This blog post hopes to take you on the journey with the data I collected along the way.

Going from couch to half marathon took me through three different running plans, using two different iPhone apps. The first running plan I used was the Couch to 5K Program (C25K), a standalone app and plan created by Active. To improve speed, I used the “Tempo Run: 5k” training plan, followed by distance using the “Half Marathon Goal” plan, both found within the RunTracker Pro app. Each of these apps had simple-to-follow prompts telling you when to run, walk, or pick up the pace, and they are designed to progressively build speed and endurance over time.

The C25K running training plan utilizes the run / walk method and includes 3 runs per week – each between 20 and 30 minutes – with the program lasting 9 weeks in total. Over the course of the 27 training runs, the proportion of walking decreases while the proportion of running increases, culminating with three 30 minute runs in the last week of the program. The “Tempo Run: 5k” plan consisted of three runs per week for a total of eight weeks, with the same structure each week: an interval run, a tempo run, and a base run. Similar to the C25K plan, runs progressively increase in both mileage and intensity throughout. Finally, the “Half Marathon Goal” running plan consisted of four runs per week – a base run, an interval run, a tempo run, and a long run – for a total of twelve weeks. In this plan, each week ends with a long, slow distance (LSD) run, culminating in a final run of 2 hours and 15 minutes in the last week of the program. In the graphs below, we see great representations of both normal (bottom) and positively skewed (top) distributions when we look at speed and distances ran throughout these programs:

Overall Distribution of Running Distances & Paces

Given that each program had different goals, we see some clear distinctions between each of them. Unsurprisingly, the Half Marathon program featured the longest runs and the largest spread (i.e. variance) with respect to distance, but the least amount of variability with respect to speed. Another expected result was with the Tempo Run: 5K program, which featured the fastest runs with the least amount of variability in distance throughout the program. These results are clearly represented in the box plots below:

Distribution of Running Distances and Paces by Program

Since there was an ordered component to these programs, the best way to view these data is through a scatter plot, which allows us to visualize progress over time. We can see that running pace improved at a significantly greater rate in the C25K & Faster 5K programs when compared to the Half Marathon plan, which makes sense, given their respective goals. This also explains the curvature in the data when looking at running pace. When investigating distance, we see that most runs stayed within 2 to 4 miles throughout each program, with the exception of the long weekend runs in the Half Marathon plan, which clearly separate themselves from the pack linearly over time:

Scatter Plots of Running Distances & Paces over Time

Final Thoughts

While I initially did not set out to go from couch to half marathon, that is what ended up happening, thanks to a few inexpensive running apps and some extra time on my hands due to a global pandemic. The C25K app is a great resource for anyone who is looking to get into running. Employing the run/walk method, the program consists of 27 runs, spread out over 9 weeks. To run faster I completed the Tempo Run: 5K (i.e. Faster 5K) plan, before tackling the Half Marathon Goal plan, both of which are included in the RunTracker Pro app. Both of these apps are inexpensive and helpful resources for those who are interested in getting into, or improving, their running.

One word of caution: many people who have completed this program emphasize that you should not be afraid to add extra rest days or repeat workouts as needed. I would agree with that. More importantly, you absolutely should not skip ahead, nor should you run on back-to-back days in the beginning. The quickest way to halt any progress is through injury, so take your time and enjoy the run!

Below are links to posts breaking down each of the programs individually, along with the raw data and code used to create the charts and analysis.

Thanks for reading!

Couch to 5K

Faster 5K

Half Marathon Goal


# Clean up (clear out objects from the previous environment)
rm(list = ls())

# Load Packages 
library(tidyverse)
library(wordcloud2)
library(mosaic)
library(readxl)
library(hrbrthemes)
library(viridis)

# Likert Data Packages
library(psych)
library(FSA)
library(lattice)
library(boot)
library(likert)

# Word cloud & text mining packages
# install.packages("wordcloud")   # only needs to be installed once
library(wordcloud)
library(tm)


# Grid Extra for Multiplots
library("gridExtra")

# Multiple plot function (just copy paste code)

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}


# Couch to Half

# Import data from CSV, no factors

Couch2Half <- read.csv("Couch2Half.csv", stringsAsFactors = FALSE)

Couch2Half <- Couch2Half %>%
  na.omit()

Couch2Half

Couch2Half %>% 
  count(Program)

ggplot(Couch2Half, aes(x = Program, fill = Program)) +
  geom_bar() + 
  labs( x ="", y = "Number of Runs", title = "Runs by Program",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
  scale_fill_manual(values=c('#999999','#E69F00', '#56B4E9'))

# Plot 1 - Density Plot of Running Distances

p1 <- ggplot(Couch2Half, aes(x=Distance)) + 
  geom_density(color="#E69F00", fill="#999999") + labs( x ="Distance (Miles)", y = "", title = "Running Distances",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

# Plot 2 - Density Plot of Running Paces

p2 <- ggplot(Couch2Half, aes(x=Pace_MPH)) + 
  geom_density(color="#E69F00", fill="#56B4E9") + 
  labs( x ="Pace (Miles per Hour)", y = "", title = "Running Paces",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

# Combine plots using multi-plot function:

multiplot( p1, p2, cols=1)


# Plot
p3 <- Couch2Half %>%
  ggplot( aes(x=Program, y= Distance, fill=Program)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="", y = "Distance (Miles)", title = "Distance by Workout",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
  scale_fill_manual(values=c('#999999','#E69F00', '#56B4E9'))
  

# Plot
p4 <- Couch2Half %>%
  ggplot( aes(x=Program, y= Pace_MPH, fill=Program)) +
  geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="", y = "Speed (Miles per Hour)", title = "Speed by Workout",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
  scale_fill_manual(values=c('#999999','#E69F00', '#56B4E9'))


# Combine plots using multi-plot function
multiplot( p3, p4, cols=2)


p5 <- ggplot(Couch2Half, aes(x=Run, y= Pace_MPH, color = Program)) + geom_point() +  geom_smooth(method=lm , color="Black", se=TRUE) + labs( x ="Training Session", y = "Pace (Miles per Hour)", title = "Running Pace",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank()) + scale_color_manual(values=c('#999999','#E69F00', '#56B4E9'))



p6<- ggplot(Couch2Half, aes(x=Run, y= Distance, color = Program)) + geom_point() +  geom_smooth(method=lm , color="Black", se=TRUE) + labs( x ="Training Session", y = "Distance (Miles)", title = "Running Distance",  subtitle = "Couch to Half Marathon", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank()) + scale_color_manual(values=c('#999999','#E69F00', '#56B4E9'))

# Combine plots using multi-plot function:

multiplot( p5, p6, cols=1)


# Summary Statistics of Distance
favstats(Couch2Half$Distance)

# Summary Statistics of Pace
favstats(Couch2Half$Pace_MPH)

# Pearson Product Correlation of Distance over Time (session)
cor.test(Couch2Half$Session, Couch2Half$Distance, method = "pearson")

# Pearson Product Correlation of Pace over Time (session)
cor.test(Couch2Half$Session, Couch2Half$Pace_MPH, method = "pearson")

Stumbling Into Statistics

One of the most common questions I get is how I found my way into statistics and data science. Honestly, it wasn’t on purpose, nor was it something I ever imagined. It just happened to work out that way, and I have found that the job can be quite interesting and fulfilling. Nevertheless, for someone who hadn’t taken a stats class until their forties, ending up as a data scientist may seem wild, so allow me to explain:

When I was in grad school, my research interests were in Online Learning and Self-Efficacy (i.e. confidence), both of which required a deeper understanding of survey design and measurement. Anyone who has done much survey research will tell you that it is relatively easy to conduct a survey, but incredibly difficult to do it well. Fortunately, many universities have courses that are specific to deepening those skills, but they often require additional coursework in statistics as a prerequisite. While I could have taken another class in social statistics, I wanted to see what it would be like to take a class in applied statistics. Also, I was curious if I could hang in a class with people from the hard sciences. Turns out I could, and I lucked out with a great professor who further sparked my interest. Inspired to learn more, the following semester I took another course in Applied Statistics (Sampling Methods), in addition to the Survey Design class I had originally wanted to take. By that point, I was fully invested in learning as much as I could and followed up with coursework in Regression Methods and Design of Experiments. In these classes I learned to use scripting languages, like Python and R, to clean, visualize, and model data. Once I had those skills, things really started to take off.

The coursework where we were required to write in R helped me tremendously. First off, I am a strong believer that “writing is thinking.” When you use a scripting language for statistical analysis, you literally have to write out your models, which reinforces understanding. Since R is a vector-based language, it works like a really big calculator, making it great for modeling and visualizing, as well as extracting, transforming, and loading (ETL) data. Statistics and machine learning can often seem like computer magic to some people. I can assure you they’re not. Most of the time the math is based on relatively simple concepts, and we let the software do the heavy lifting with respect to calculation. Using a scripting language like R also gives you the ability to create projects that are replicable, transparent, and easy to share.


By the end of grad school,  I had leveraged these skills into part time work in statistics and data science to bring in some extra income, while doing something I enjoyed. Some of the projects I worked on ended up being a lot of fun, and were very well received, which led to more and more work. Then came a global pandemic that upended how many people viewed their work / life balance, so I decided to find a full time position as a data scientist and haven’t looked back since. 

Reflecting on my transition into statistics and data science, a few lessons stand out. The most important one is the role of finding data projects to work on. There is a reason why educational theorists emphasize the importance of Project-Based Learning (PBL). Being able to investigate a problem by finding, cleaning, transforming, analyzing, and communicating the story of the data is the most valuable experience you can have if you are looking into a career in data science. This is where the role of a scripting language comes in. While menu-driven statistical programs like SPSS and Minitab used to be the norm, scripting languages such as Python, R, SQL, etc. are now the standard for their flexibility and the fact that most of them are completely open source. Finally, I wasn’t prepared for how much my experience teaching and presenting would help me as a data scientist. Many people are scared of math, and many statisticians aren’t the best at communicating with non-statisticians. So, if you are able to tell the story of the data, you can bring a lot of value to an organization.

Below are some resources I have found particularly useful along the way. If you have any questions or advice about data science, please leave them in the comments below!

Thanks for reading!

Resources

Running Through the Data: Half Marathon Goal by RunTracker

In the latter half of 2020, I set a new goal for myself: run 13.1 miles by the end of the year. Earlier in the year I had completed the Couch to 5K program and later set the goal to improve my time to under 30 minutes. Given the extra time at home thanks to a global pandemic, I set my sights on the half marathon distance. Since I was already familiar with the RunTracker app, I decided to stick with that and used their “Half Marathon Goal” training plan.

The RunTracker app, made by the Fitness 22 company, features a series of running plans tailored to individuals’ current fitness levels and goals. The “Half Marathon Goal” running plan consisted of four runs per week for a total of twelve weeks, with a consistent structure throughout most of the program. After a series of base runs in the first week, the next ten weeks featured a base run on Tuesdays, segments on Thursdays, intervals on Fridays, and a long run on Sundays. The duration of workouts increases steadily over the course of the first ten weeks before tapering in the final two weeks of the program.

My experience with this running plan was great once I got used to the structure. Previously, the most I had run was three days a week, while this program requires four. This meant there would be runs on consecutive days, which I was not used to. Having just finished a training plan geared towards speed work, I quickly learned I would need to slow down if I was going to keep from getting injured. Once I got settled into the format, mileage built progressively and speed eventually followed. By the end of the twelve-week program, I was able to confidently run 13.1 miles using my usual training route, which coincidentally looked like a shoe:

Distance & Pace

Since my goal was to complete a half marathon, the primary variable of interest was obviously distance. Like most runners, I also tend to focus on times, so average running pace served as the secondary variable of interest. Distances ran throughout the training program ranged from 2.14 to 13.12 miles per run, with a mean of 4.85 miles per run. Running paces ranged from 5.16 to 6.1 miles per hour (11:38 to 9:50 min/mile), with a mean of 5.54 miles per hour (10:50 min/mile). The distributions of my runs by distance and speed for this program can be seen in the density plots below:

Comparing Workouts

When taking a closer look at these distributions by workout type, we can see some clear patterns in the data. Distances for base runs, interval sessions, and segments remained relatively close to one another, ranging from 2.14 to 6.02 miles per run. The long runs on Sundays, though, lived up to their name, ranging from 5.7 to 13.12 miles, with an average of 9.16. Running pace was fairly consistent across workout types, with each averaging between 5.5 and 5.6 miles per hour. Distributions by workout type for distance and pace can be seen in the box plots below:

Training Progress

Given that there is an ordered component to training, we can look at these data linearly (i.e. regression). Below are scatter plots of distances covered and running speeds over the course of the 46 training runs in the program. We see a slightly positive association in training volume (mileage), while intensity (pace) remained relatively stable throughout the training program. Taking a closer look at the distance plot, we can see how the majority of volume is gained through the long runs on weekends, which is typical of most long distance training programs:

Cadence & Heart Rate

Two important considerations for runners are heart rate and cadence. When runners let their heart rates get too high, they tire much quicker. So, distance runners constantly work to keep their heart rate down while still running quickly. This can be aided by increasing cadence to a rate of approximately 180 steps per minute. Increasing cadence allows runners to develop better efficiency in their technique – typically by shortening the stride – which over time can lead to a lower heart rate. This translates into better performance with respect to both speed and endurance. In the plot below we can see that both cadence and heart rate are positively associated with running pace, with a clear interaction between these two variables as speed increases, represented by the slopes crossing one another:

Final Thoughts

The “Half Marathon Goal” plan on the RunTracker app is geared towards regular runners who are ready to tackle the 13.1-mile distance. The training structure consists of four runs per week: a base run, a segment session, an interval session, and one long run on the weekend. The variety of workouts in the program is designed primarily to build the strength and endurance to run a half marathon, with some speed work included to build anaerobic capacity as well. For anyone who has been running for a while and is ready to tackle longer distances, this program could be an excellent option.

Below are some links related to running a first half marathon, along with the raw data and code used to create the charts and analysis.

Thanks for reading!

Resources & Code

# FRONT MATTER

### Note: The HM_1.xlsx file will need to be converted to HM_1.csv to read in correctly. Also, all packages can be installed using the install.packages() function. This only needs to be done once before loading.

## Clean up (clear out objects from the previous environment)
rm(list = ls())

## Load Packages 
library(tidyverse)
library(wordcloud2)
library(mosaic)
library(readxl)
library(hrbrthemes)
library(viridis)

## Likert Data Packages
library(psych)
library(FSA)
library(lattice)
library(boot)
library(likert)

## Grid Extra for Multiplots
library("gridExtra")

## Multiple plot function (just copy paste code)

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}


# HALF MARATHON GOAL by RUNTRACKER

## Import data from CSV, no factors

HM_1 <- read.csv("HM_1.csv", stringsAsFactors = FALSE)

HM_1 <- HM_1  %>%
  na.omit()

HM_1 


## Plot 1

p1 <- ggplot(HM_1 , aes(x=Distance)) + 
  geom_density(color="Pink", fill="Pink") + labs( x ="Distance (Miles)", y = "", title = "Running Distances",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())


## Plot 2

p2 <- ggplot(HM_1, aes(x=Pace_MPH)) + 
  geom_density(color="light blue", fill="light blue") + 
  labs( x ="Speed (Miles per Hour)", y = "", title = "Running Pace",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())


## Combine plots using multi-plot function:

multiplot( p1, p2, cols=1)

## Plot 3

p3 <- ggplot(HM_1 , aes(x= Session, y= Distance)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Red", se=TRUE) + labs(x ="Training Session", y = "Distance (Miles)", title = "Running Distance",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
   theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Plot 4

p4<- ggplot(HM_1 , aes(x=Session, y= Pace_MPH)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Blue", se=TRUE) + labs( x ="Training Session", y = "Speed (Miles per Hour)", title = "Running Pace",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Combine plots using multi-plot function
multiplot( p3, p4, cols=1)

## Summary Statistics of Distance
favstats(HM_1$Distance)

## Summary Statistics of Pace
favstats(HM_1$Pace_MPH)



## Pearson Product Correlation of Distance over Time (session)
cor.test(HM_1$Session, HM_1$Distance, method = "pearson")

## Pearson Product Correlation of Pace over Time (session)
cor.test(HM_1$Session, HM_1$Pace_MPH, method = "pearson")


## Plot
p5 <-  HM_1 %>%
  filter(Workout != "Race") %>%
  ggplot( aes(x=Workout, y= Distance, fill=Workout)) +
  geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="Workout Type", y = "Distance (Miles)", title = "Comparing Distances",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
    scale_fill_brewer(palette="Reds")
  
## Plot
p6  <-  HM_1 %>%
  filter(Workout != "Race") %>%
  ggplot( aes(x=Workout, y= Pace_MPH, fill=Workout)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="Workout Type", y = "Speed (Miles per Hour)", title = "Comparing Paces",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
    scale_fill_brewer(palette="Blues")

## Combine plots using multi-plot function
multiplot( p5, p6, cols=2)

## Plot 7

p7 <- ggplot(HM_1 , aes(x= Cadence, y= Distance)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Red", se=TRUE) + labs(x ="Average Running Cadence", y = "Distance (Miles)", title = "Cadence by Distance",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
   theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())


## Plot 8

p8<- ggplot(HM_1 , aes(x=Cadence, y= Pace_MPH)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Green", se=TRUE) + labs( x ="Average Running Cadence", y = "Speed (Miles per Hour)", title = "Cadence by Pace",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())


## Plot 9

p9 <- ggplot(HM_1 , aes(x= Avg_Heart_Rate, y= Distance)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Blue", se=TRUE) + labs(x ="Average Heart Rate", y = "Distance (Miles)", title = "Heart Rate by Distance",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
   theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Plot 10

p10<- ggplot(HM_1 , aes(x=Avg_Heart_Rate, y= Pace_MPH)) + geom_point(color="Black") +  geom_smooth(method=lm , color="Purple", se=TRUE) + labs( x ="Average Heart Rate", y = "Speed (Miles per Hour)", title = "Heart Rate by Pace",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Combine plots using multi-plot function
multiplot( p7, p8, p9, p10, cols=2)

## Pivot data from wide to long for next chart

HM_1A <- gather(HM_1, Measurement, BPM, Cadence, Avg_Heart_Rate)

HM_1A
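## Note: gather() has been superseded in newer tidyverse releases; the same reshape
## can also be done with pivot_longer() (an equivalent call, assuming the same columns):
# HM_1A <- pivot_longer(HM_1, cols = c(Cadence, Avg_Heart_Rate),
#                       names_to = "Measurement", values_to = "BPM")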

## Plot 11

p11 <- ggplot(HM_1A, aes(x = Pace_MPH, y = BPM, color = Measurement)) +
     geom_point() +
     geom_smooth(method = "lm", alpha = .15, aes(fill = Measurement)) + labs(x ="Average Pace (Miles per Hour)", y = "Beats per Minute", title = "Heart Rate & Cadence by Pace",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

p11

## Plot 12

p12 <- ggplot(HM_1A, aes(x = Distance, y = BPM, color = Measurement)) +
     geom_point() +
     geom_smooth(method = "lm", alpha = .15, aes(fill = Measurement)) + labs( x ="Average Distance in Miles", y = "Beats per Minute", title = "Heart Rate & Cadence by Distance",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

p12

# Combine plots using multi-plot function
multiplot( p11, p12, cols=1)



## Plot 13
p13 <- ggplot(HM_1A , aes(x = Pace_MPH, y = BPM, color = Measurement) ) +
     geom_point() +
     geom_smooth(method = "lm", alpha = .15, aes(fill = Measurement)) + labs(x ="Average Pace (Miles per Hour)", y = "Beats per Minute", title = "Heart Rate & Cadence by Pace",  subtitle = "Half Marathon Goal by Runtracker", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"))

Running Through the Data: Tempo Run: 5K by Runtracker

In the summer of 2020, I set a really simple goal for myself: run a 5K in under 30 minutes. At the time, I had just completed the Couch to 5K (C25K) program and could cover the distance in around 32-33 minutes, but I couldn’t seem to get much quicker than that and wanted to see if a different training plan would help. After some experimenting, I settled on the Tempo Run: 5K plan on the Runtracker app to help me break the 30-minute mark.

Runtracker is an app made by the Fitness 22 company, featuring a series of running plans tailored to individuals’ current fitness levels and goals. Since I was a runner who could already run the 5K distance and ran about three times a week, the app recommended the “Tempo Run: 5K” plan. This running plan consisted of three runs per week for a total of eight weeks, with the same structure each week. The first run of the week consisted of interval training of various lengths throughout the program, while the second run of the week was always a tempo run of steadily increasing duration. The third and final run each week was a 35-minute base run at an easy pace. This format remained consistent over the course of all eight weeks and was built to progressively increase both mileage and intensity.

Tempo Run: 5K Training Plan, by Runtracker

My experience with this running plan was great for a variety of reasons. The most structured running I had done before was the run/walk method used in Couch to 5K (C25K). Interval sessions, which included high-intensity running, easy-pace running, and walking, helped build power and dial in pacing. Tempo sessions pushed me to find the gear between interval and easy pace, which helped develop the habit of running the second half of my runs faster than the first (i.e. “negative splits”). The long easy sessions on the weekends helped build confidence and efficiency. By the end of the program, I had taken minutes off my 5K time and had a far better understanding of pacing, which was the biggest takeaway for me. Many of the things I do now as a runner mirror the types of workouts I was first introduced to in this app, so this data has been fun to look at a few years removed.  

Training Progress 

To get a better picture of my progress throughout the program, three primary variables came into focus: Pace, measured in miles per hour (mph); Distance, measured in miles; and Training Session, numbered 1 to 24 and completed in order. Running paces ranged from 5.09 to 6.58 mph (11:47 min/mile to 9:07 min/mile), with a mean of 5.83 mph (10:18 min/mile), while distances run ranged from 2.4 to 5.43 miles, with a mean of 3.44 miles per run. Since there is an ordered component to these workouts (by session), progress can be visualized through scatter plots. Below are plots of running distance and pace over the course of the 24 workout sessions. Notice how the spread in the data opens up as training progresses, especially with respect to distance run. This “fanning effect” would normally be problematic in statistics, but in running it is often a desired feature of training: 

Image by Author
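As an aside, converting a speed in miles per hour back to a min/mile pace only takes a couple of lines of R. This little helper is my own sketch, not part of the analysis code further down:

## Convert speed in miles per hour to a min/mile pace string
mph_to_pace <- function(mph) {
  sec_per_mile <- round(3600 / mph)   # seconds needed to cover one mile
  sprintf("%d:%02d min/mile", sec_per_mile %/% 60, sec_per_mile %% 60)
}

mph_to_pace(6.58)   # "9:07 min/mile"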

Workout Type

As I mentioned above, the biggest takeaway of the program for me was my understanding of pacing. Interval sessions, tempo runs, and base runs require very different kinds of effort, all of which can improve performance. Interval sessions remained the most consistent with respect to running pace, but had the largest range and the highest average number of miles run. Tempo runs and base runs remained relatively consistent in terms of mileage, with tempo runs having the widest range along with the highest average running pace. These findings can be better visualized through the box plots below for both paces and distances run:

Comparing with C25K

In my previous blog post, we went through the data of the C25K program. Since both of these training plans focused on the same distance, I thought it would be fun to compare progress side by side on the primary variable of interest: pace. The C25K program had a range of 4.01 to 5.51 mph, with an average of 4.79 mph, while the Tempo Run: 5K program had a range of 5.09 to 6.58 mph, with an average of 5.83 mph. Given that both programs had a sequential component (i.e. “training session”), these data can also be expressed as a regression. Below are box plots of running pace distributions (left) and scatter plots of running pace throughout training (right) for both programs. Notice how the Tempo Run: 5K program is noticeably higher on average than the C25K program, while the C25K program has a more positive slope. Since the Couch to 5K program is designed to take runners from sedentary to being able to complete a 3.1-mile run, there are naturally going to be much greater gains (i.e. a higher slope) in the beginning, with later improvements occurring more incrementally:

Image by Author

Final Thoughts

The Tempo Run: 5K plan on the Runtracker app is geared towards regular runners who can currently run a 5K and are interested in improving performance. The training structure consists of three runs per week: one interval session, one tempo run, and one 35-minute steady-state run. The variety of workouts in the program is designed to build both aerobic (endurance) and anaerobic (speed) capacity in runners. For anyone who is new to running, or hasn’t had structured training before, this program could be an excellent introduction. 

Below are some links related to improving 5K times, along with the raw data and code used to create the charts and analysis. If you are interested in my experience with Couch to 5K, you can find that post here, and my post on my first half marathon is here.

Thanks for reading! 

Resources & Code:

# FRONT MATTER

### Note: All packages can be downloaded using the install.packages() function. This only needs to be done once before loading. 

# Clean up (this clears out the previous environment)
rm(list = ls())

# Load Packages 
library(tidyverse)
library(wordcloud2)
library(mosaic)
library(readxl)
library(hrbrthemes)
library(viridis)

# Likert Data Packages
library(psych)
library(FSA)
library(lattice)
library(boot)
library(likert)

#install.packages("wordcloud")
library(wordcloud)
library(tm)
library(wordcloud)


# Grid Extra for Multiplots
library("gridExtra")

# Multiple plot function (just copy paste code)

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}



# FASTER 5K

# Data Intake

Faster5K <- read.csv("https://raw.githubusercontent.com/scottatchison/The-Data-Runner/master/Faster5k.csv")

Faster5K <- Faster5K %>%
  na.omit()

Faster5K

# Plot 1 - Density Plot of Running Distances

p1 <- ggplot(Faster5K, aes(x=Distance)) + 
  geom_density(color="light blue", fill="Pink") + labs( x ="Distance (Miles)", y = "", title = "Running Distances",  subtitle = "Tempo Run: 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

p1

# Plot 2 - Density Plot of Running Speeds

p2 <- ggplot(Faster5K, aes(x=Pace_MPH)) + 
  geom_density(color="Pink", fill="light blue") + 
  labs( x ="Speed (Miles per Hour)", y = "", title = "Running Speeds",  subtitle = "Tempo Run: 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

p2

# Combine plots using multi-plot function:

multiplot( p1, p2, cols=1)

# Plot 3 - Scatter Plot of Running Distance over Time

p3 <- ggplot(Faster5K, aes(x= Session, y= Distance)) + geom_point(color="Purple") +  geom_smooth(method=lm , color="Green", se=TRUE) + labs(x ="Training Session", y = "Distance (Miles)", title = "Running Distance",  subtitle = "Tempo Run: 5K Training Plan", caption = "Data source: TheDataRunner.com") +
   theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

p3

# Plot 4 - Scatter Plot of Running Speed over Time

p4<- ggplot(Faster5K, aes(x=Session, y= Pace_MPH)) + geom_point(color="Green") +  geom_smooth(method=lm , color="Purple", se=TRUE) + labs( x ="Training Session", y = "Speed (Miles per Hour)", title = "Running Speed",  subtitle = "Tempo Run: 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

p4

# Combine plots using multi-plot function
multiplot( p3, p4, cols=1)

# Summary Statistics of Distance
favstats(Faster5K$Distance)

# Summary Statistics of Pace
favstats(Faster5K$Pace_MPH)

# Pearson Product Correlation of Distance over Time (session)
cor.test(Faster5K$Session, Faster5K$Distance, method = "pearson")

# Pearson Product Correlation of Pace over Time (session)
cor.test(Faster5K$Session, Faster5K$Pace_MPH, method = "pearson")


# Pearson Product Correlation of Pace over Time (session) for the C25K data
# (requires the C25K data frame loaded in the Couch to 5K section below)
cor.test(C25K$Session, C25K$Pace_MPH, method = "pearson")

# Simple Linear Model of Distance & Session
Distance <- lm(Distance ~ Session, data = Faster5K)
summary(Distance)

# Simple Linear Model of Pace & Session
Speed <- lm(Pace_MPH ~ Session, data = Faster5K)
summary(Speed)


# Import data from CSV, no factors

Plans_5K <- read.csv("5K_Plans.csv",  stringsAsFactors = FALSE)

Plans_5K

# Plot
p7 <- Faster5K %>%
  ggplot( aes(x=Workout, y= Distance, fill=Workout)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="", y = "Distance (Miles)", title = "Distance by Workout",  subtitle = "Tempo Run: 5K Running Plan", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
    scale_fill_brewer(palette="Greens")
  

# Plot
p8 <- Faster5K %>%
  ggplot( aes(x=Workout, y= Pace_MPH, fill=Workout)) +
  geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="", y = "Speed (Miles per Hour)", title = "Speed by Workout",  subtitle = "Tempo Run: 5K Running Plan", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
    scale_fill_brewer(palette="Purples")


# Combine plots using multi-plot function
multiplot( p7, p8, cols=1)

# Combine plots using multi-plot function
multiplot( p7, p8, cols=2)


# Combine plots using multi-plot function
multiplot( p1, p7, cols=2)


# Combine plots using multi-plot function
multiplot( p2, p8, cols=2)

# Mean pace grouped by workout type
aggregate(Faster5K$Pace_MPH, list(Faster5K$Workout), FUN=mean) 


# Summarize Mean Distance & Pace by Workout Type
Faster5K  %>%
  group_by(Workout) %>%
  summarise_at(vars(Distance, Pace_MPH), list(Average = mean))

Plans_5K  %>%
  group_by(Program) %>%
  summarise_at(vars(Distance, Pace_MPH), list(Average = mean))

# Plot
p5 <- Plans_5K %>%
  ggplot( aes(x=Program, y= Pace_MPH, fill=Program)) +
  geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="Training Session", y = "Speed (Miles per Hour)", title = "Comparing Paces",  subtitle = "C25K & Tempo Run: 5K Training Plans", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
    scale_fill_brewer(palette="BuPu")

p5

# Plot
p6 <- Plans_5K %>%
  ggplot( aes(x=Program, y= Distance, fill=Program)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="Black", size=0.4, alpha=0.9) + 
  labs( x ="Training Session", y = "Distance (Miles)", title = "Comparing Distances",  subtitle = "C25K & Tempo Run: 5K Training Plans", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank(),
    legend.position = "none") +
    scale_fill_brewer(palette="PRGn")

p6


multiplot( p5, p6, cols=2)

t.test(Pace_MPH ~ Program, data = Plans_5K)

t.test(Distance ~ Program, data = Plans_5K)

# Plot

p10 <- ggplot(Plans_5K, aes(x=Session, y= Pace_MPH, color = Program )) + geom_point() +  geom_smooth(method=lm , se=TRUE,aes(color=Program)) + labs( x ="Training Session", y = "Speed (Miles per Hour)", title = "Pace Through Training",  subtitle = "C25K & Tempo Run: 5K Training Plans", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank()) + 
  scale_color_manual(values=c('blue', 'orange'))+
  theme(legend.position="none")


p10


multiplot( p5, p10, cols=2)

Running Through the Data: C25K

I am not typically one for New Year’s resolutions, but in 2020 I made a really small one: keeping better track of my activity (using my Apple Watch). I thought that simply measuring my activity might lead to an overall increase. I was pretty sedentary at the beginning, but after a few weeks of tracking I saw a noticeable improvement in my activity level, and I felt better. So, I decided to see if I could raise the bar a bit by completing the Couch to 5K program, using the C25K running app.

C25K Training App by Active

The C25K running app is based on Josh Clark’s running programs, which scaffold participants through a series of manageable expectations. The training plan includes three runs per week – each between 20 and 30 minutes – with the program lasting nine weeks in total. The most noticeable feature of the training plan is that it combines both running and walking. Over the course of the 27 training runs, the proportion of walking decreases while the proportion of running increases, culminating with three 30-minute runs in the last week of the program. 

C25K Training Plan Example (Week 1, Day 1)

This was my second time using the C25K program. My first time trying the app, I completed it, generally enjoyed it, and even ran a few 5Ks afterwards. However, I ended up getting hurt / burned out, and within a few years I was definitely back to square one. This time, I made my primary goal to stay injury- and pain-free, so I focused more closely on listening to my body, slowing down, and taking rest as needed. Since I am still running today, I decided to take a look back at those training runs and share the data with anyone who is interested:

Speed, Distance, & Progress

The two most obvious variables to look at were speed and distance. The distances run throughout this program ranged from 2.01 to 3.74 miles, with an average of 2.65 miles per run. Running speed ranged from 4.02 mph (14:55 min/mile) to 5.51 mph (10:53 min/mile), with an average of 4.79 mph (12:31 min/mile). The distributions of my runs by distance and speed for the C25K program can be seen in the density plots below:

Distances & Speeds Ran in C25K Training Plan (Fall, 2020)

Since people are generally more interested in seeing progress, below are scatter plots of distances covered and running speeds over the course of the 27 training runs. At first glance, we see a strong positive association between training session and both volume (mileage) and intensity (speed) throughout the duration of the program. When you take a closer look at both scatter plots, you see clear cycles ebbing and flowing along the positive slopes. Most training plans are designed to take on this kind of shape, so neither of these results is surprising:

Running Progress (Distance & Speed), Fall 2020

Looking back at the data two years removed, a number of interesting things stand out to me. The first is how tightly packed, and predictable, the data are. Both speed and distance remain very similar in adjacent runs. This is how the program is designed, and it makes complete sense when developing a fitness base. However, most of the training I do now is very different from that. Speed and distance vary widely from run to run, to allow for different kinds of stress and recovery. The second notable finding was how strong the slope was for both variables. When first starting out, the good news is you are probably going to improve very quickly – although it may not feel like it at the time. The longer you run, the more the rate of improvement slows. Most of my work now as a runner is built on slow, gradual gains, so improvement this rapid over such a short period would put me at risk of injury today. The key difference for me now is that I can run much longer distances and have a much higher top speed, but the rate of progress is far less noticeable.

Final Thoughts

With millions of downloads, the C25K app has consistently been one of the most popular training apps for new runners, and for good reason. Based on a series of running plans developed by Josh Clark in the ’90s, the C25K training plans are structured to build runners up slowly, using a run/walk method. Whenever I talk with people who are interested in starting a running routine, one of the first things I recommend is that they get this app, primarily because it employs the run/walk method. Many people think running should not include walking, or that walking is cheating or a sign of weakness. Objectively, it is not. The longer you run, the more important it is to find your ideal pace, in order to keep your heart rate down, your breathing under control, and your running form intact. The run/walk method accomplishes this by slowly increasing the proportion of running to walking over time. You would also be surprised by how fast and how far some people who use the run/walk method can go.

A couple of words of caution about the program, though. First and foremost, no one training app is going to fit everyone. Depending on your current level of fitness and a variety of other factors, the program may take longer than nine weeks. One of the most consistent pieces of advice you will find on the C25K program is that you should not be afraid to repeat runs, repeat weeks, or add extra rest if your body needs it. I couldn’t agree with this more. There are a few times when the increase in running volume felt like a lot (week 5, for example), so don’t be scared to slow down or add some extra rest. Definitely don’t skip ahead or run back-to-back days. The app is built so that you will get faster and run further as you progress through the program. That’s baked in, but none of it will matter if you get hurt. Increasing speed or volume too quickly is the fastest way to injury, but if you listen to your body and aren’t afraid to slow down (i.e. walk more), then C25K could be a great way to get started. 

Below are some links to C25K reviews, along with the raw data and code used to create the charts and analysis. For my next post, I plan to break down the data for the Faster 5K training plan that I used to shave a few minutes off my 5K time by introducing speed work.

Thanks for reading!  

Resources & Code:

C25K Running Data can be found here. The code I used (in R) to create plots and analysis is below:

# FRONT MATTER

### All packages can be downloaded using the install.packages() function. This only needs to be done once before loading. 

## Load Packages 
library(tidyverse)
library(wordcloud2)
library(mosaic)
library(readxl)

## Grid Extra for Multiplots
library("gridExtra")

## Multiple plot function
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

# COUCH TO 5K

# Data Intake

C25K<- read.csv("https://raw.githubusercontent.com/scottatchison/The-Data-Runner/8c1162e60a0c3af4e900ed38c222304da1542cb9/Half_1_2.csv")

C25K

## Plot 1 - Density Plot of Running Distances

p1 <- ggplot(C25K, aes(x=Distance)) + 
  geom_density(color="Green", fill="Purple") + labs( x ="Distance (Miles)", y = "", title = "Distribution of Running Distances",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

## Plot 2 - Density Plot of Running Speeds

p2 <- ggplot(C25K, aes(x=Pace_MPH)) + 
  geom_density(color="Purple", fill="Green") + 
  labs( x ="Speed (Miles per Hour)", y = "", title = "Distribution of Running Speeds",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14),
    plot.caption = element_text(hjust = 1, face = "italic"), 
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.background = element_blank())

## Combine plots using multi-plot function:

multiplot( p1, p2, cols=1)

## Plot 3 - Scatter Plot of Running Distance over Time

p3 <- ggplot(C25K, aes(x=Session, y= Distance)) + geom_point(color="blue") +  geom_smooth(method=lm , color="red", se=TRUE) + labs(x ="Training Session", y = "Distance (Miles)", title = "Progression of Running Distance",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
   theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Plot 4 - Scatter Plot of Running Speed over Time

p4<- ggplot(C25K, aes(x=Session, y= Pace_MPH)) + geom_point(color="red") +  geom_smooth(method=lm , color="blue", se=TRUE) + labs( x ="Training Session", y = "Speed (Miles per Hour)", title = "Progression of Running Speed",  subtitle = "Couch to 5K Training Plan", caption = "Data source: TheDataRunner.com") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 14), 
    plot.caption = element_text(hjust = 1, face = "italic"),
    panel.background = element_blank())

## Combine plots using multi-plot function
multiplot( p3, p4, cols=1)

## Summary Statistics of Distance
favstats(C25K$Distance)

# Summary Statistics of Pace
favstats(C25K$Pace_MPH)

# Pearson Product Correlation of Distance over Time (session)
cor.test(C25K$Session, C25K$Distance, method = "pearson")

# Pearson Product Correlation of Pace over Time (session)
cor.test(C25K$Session, C25K$Pace_MPH, method = "pearson")

# Simple Linear Model of Distance & Session
Distance <- lm(Distance ~ Session, data =C25K)
summary(Distance)

# Simple Linear Model of Pace & Session
Speed <- lm(Pace_MPH ~ Session, data =C25K)
summary(Speed)

The Problem With Polling

I have gotten a lot of questions about political polls lately and I have found myself having the same conversation over and over about the reliability of polling in general. Those conversations have centered around the concept of Total Survey Error (TSE), or all the different ways that a survey can go wrong. So, I thought I would take a break from what I should be doing and write about the five basic forms of TSE.

Tallying the results of the presidential election on Nov. 2, 1948.
PHOTO: CBS PHOTO ARCHIVE/GETTY IMAGES

Coverage Error – This form of error occurs when your sampling frame (i.e. your list of people to potentially poll) does not accurately represent the population you are measuring. For example, if you want to conduct a poll by phone and your list only includes landlines, then you are leaving out everyone who does not have a landline. “Dewey Defeats Truman” is an example of this kind of TSE. Pollsters that year (1948) only surveyed people with telephones, who were typically far wealthier and more likely to vote for Dewey rather than Truman, the eventual winner. This coverage issue also led to non-response bias (see below).

Specification Error – This error occurs when what is being measured isn’t clear. Typically, this is reserved for psychological constructs, which are oftentimes multidimensional. A political example of this would be ideology. We know that most people’s political beliefs lie along a spectrum, and those beliefs may be nuanced and context dependent. The Pew Research Center has an excellent example of measuring ideology as a construct. Fortunately, there is an easy way around this for political polls: ask respondents specifically which candidate(s) they are voting for.

Nonresponse Error – This form of bias has to do with who responded to the poll and, relatedly, who didn’t. This can be unit nonresponse (i.e. someone refuses to participate at all) or item nonresponse (i.e. someone refuses to answer a specific question). Again using a phone poll example: even if you had a list of all numbers (cell phones and landlines) to call for your poll, people with caller ID are less likely to pick up, and almost all cell phones have caller ID built in. This means that people with landlines – who are typically older – are more likely to answer; younger people, less so.

Measurement Error – This form of error is probably the most well studied in the world of survey methodology, because it has so many parts to it. The order of the questions being asked, the tone of the interviewer’s voice, the interviewer’s appearance, or the wording of the questions themselves may unintentionally cause someone to answer a certain way. For example, I have seen many projections based solely on party identification, which does not account for people who plan on voting for one party in every race except one (i.e. “ticket splitters“). I imagine there will be a large number of people who cast their votes for all but one member of their preferred party this election. If you want to see an example of how not to predict an outcome, I humbly submit this one as an example of both specification error and measurement error.

Processing Error – Processing error covers all the ways that things can go wrong with the data AFTER it is collected. Some forms of this occur in encoding, editing, and weighting. The weighting piece is especially tricky, because it adjusts results based on known population parameters. For example, if we know that 80% of a poll’s respondents were female, we would need to adjust the weights of the males in the survey to account for the fact that the population parameter is known to be roughly 50%. Now, imagine that we are also accounting for race, income, education level, and age; you can see that things get complicated in a hurry. One strategy to account for this is an iterative approach known as “raking.”
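As a toy illustration of that weighting idea (my own sketch, with made-up numbers rather than anything from an actual poll), a basic post-stratification weight is just the known population share divided by the observed sample share; raking repeats this kind of adjustment across several margins until the weights settle down:

# Hypothetical sample that is 80% female / 20% male, weighted back to a ~50/50 population
sample_share <- c(Female = 0.80, Male = 0.20)
pop_share    <- c(Female = 0.50, Male = 0.50)

weights <- pop_share / sample_share
weights
# Female   Male 
#  0.625   2.500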

Supporters of presidential candidate Hillary Clinton watch televised coverage of the U.S. presidential election at Comet Tavern in the Capitol Hill neighborhood of Seattle on Nov. 8. (Photo by Jason Redmond/AFP/Getty Images)

So, what does all this mean? There are lots of ways things can go wrong, and good surveys are incredibly expensive. They take time to construct and a shocking amount of money and manpower to collect. Also, many political polls are collected to drive media viewership, which means they are often more concerned with expediency than accuracy. That right there should be enough to give you pause. The 2016 election gave polling – and to a certain extent, statistics – a bad name. However, people don’t realize that the national polls (i.e. the popular vote) were right on the money. The popular vote is one model. The electoral college tally is 51 models (all 50 states plus DC), which may take different strategies for collecting and analyzing, depending on the state. That leaves lots of room for mistakes. If we want to predict who will likely win the popular vote, the statistical evidence that Biden will win it is pretty solid. Does that mean it is a certainty? Objectively, no. Of course, the election is decided by the electoral college, which again is 51 separate models. Some of those states are pretty clear. Others, not so much.

“A margin of error of plus or minus 3 percentage points at the 95 percent confidence level means that if we fielded the same survey 100 times, we would expect the result to be within 3 percentage points of the true population value 95 of those times.”

5 key things to know about the margin of error in election polls
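As a quick back-of-the-envelope check (my own sketch, not part of the quoted article), the familiar “plus or minus 3 points” roughly corresponds to a simple random sample of about 1,000 respondents:

# 95% margin of error for a proportion p with sample size n
p <- 0.5
n <- 1000
moe <- qnorm(0.975) * sqrt(p * (1 - p) / n)
round(moe * 100, 1)   # ~3.1 percentage points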

Finally, it appears that we are headed for record levels of turnout, due in part to enthusiasm, mail-in voting, COVID-19, etc. The unprecedented nature of these factors only makes polling even more fraught with potential error. I would encourage anyone following the polls closely to lower their expectations considerably. That doesn’t mean the polls are wrong, but they should be viewed with a healthy amount of circumspection. With that being said, if you are like me and cannot help yourself, look at Nate Cohn and Nate Silver’s stuff. It is typically the most robust and transparent, and not surprisingly, theirs are often the most accurate predictions.

Tl;dr – Ignore the polls. We won’t really know much of anything until we see actual vote totals being counted. The rest is just theater.

EDA: Open & Closed Data

Introduction

Each summer, nearly 8,000 incoming students attend New Student Orientation (NSO) at Penn State’s University Park campus. At the conclusion of NSO, each student is sent a survey to gather their perspectives on various aspects of their experiences at their respective sessions. Questions range from their experiences at check-in to their understanding of student services and various initiatives, such as Penn State’s commitment to diversity and inclusion. These data provide an opportunity to assess which aspects of NSO warrant further exploration.

Cleaning & Inspecting the Data

Survey results were provided by the office of New Student Orientation for the sessions occurring in the summers of 2017, 2018, & 2019. Each of these databases was inspected to find which variables were consistent across all three spreadsheets. Variables that were not consistent across each survey were discarded, and the remaining variables were combined into one master database, coded by their respective years. The variables that were consistent across all three years were:

  • Leader Connection – The extent to which a meaningful connection was made with their Orientation Leader during NSO.
  • Substances – The extent to which their understanding changed related to the consequences of alcohol and drug use and abuse during NSO.
  • Assault Resources – The extent to which their understanding changed as a result of attending NSO related to reporting and support services Penn State provides for victims of sexual harassment and sexual assault.
  • Bystander Prevention – The extent to which their understanding changed as a result of attending NSO related to how to handle dangerous situations.
  • Health Resources – The extent to which their understanding changed as a result of attending NSO related to support services Penn State provides for mental health, physical health, and personal well-being.
  • Safety Resources – The extent to which their understanding changed as a result of attending NSO related to support services Penn State provides to help keep them safe.
  • Diversity / Inclusion – The extent to which their understanding changed related to the importance of diversity and inclusion on our campus.
  • Definition of Consent – An open-ended survey question asking participants to define the term “consent.”

The final step in cleaning the data was to remove any personally identifiable information and missing values. Basic demographic information, such as race, gender, sexual orientation, resident status, and matriculation date, was maintained but not utilized for this analysis. Any observations that broke off from the survey prior to answering the 8 variables of interest were discarded as well.
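For readers who want a sense of what that cleaning might look like, below is a minimal sketch in R. The data frame and column names (nso_2017, nso_2018, nso_2019, and the variables of interest) are placeholders of my own, not the actual survey exports:

library(tidyverse)

# Variables that appeared in all three years of the survey (placeholder names)
common_vars <- c("Leader_Connection", "Substances", "Assault_Resources",
                 "Bystander_Prevention", "Health_Resources", "Safety_Resources",
                 "Diversity_Inclusion", "Consent_Definition")

# Keep only the consistent variables, tag each year, and stack into one data frame
nso <- bind_rows(
  nso_2017 %>% select(all_of(common_vars)) %>% mutate(Year = 2017),
  nso_2018 %>% select(all_of(common_vars)) %>% mutate(Year = 2018),
  nso_2019 %>% select(all_of(common_vars)) %>% mutate(Year = 2019)
) %>%
  drop_na(all_of(common_vars))   # drop break-offs missing any variable of interest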

Exploratory Data Analysis

Once data were gathered and cleaned, an exploratory data analysis was conducted to examine patterns in the data. Survey questions that utilized a Likert scale were compared against one another, revealing similar, right-skewed distributions on each of the factors, with the exception of leader connection, which was more widely distributed across Likert responses (figure 3.1). The leader connection variable, when examined in a bar chart grouped by year (figure 3.2), showed the widest variety of distributions in comparison to the remaining variables visualized in the same way (figure 3.3). Since Likert scale data are ordinal in nature, a Kruskal-Wallis test was conducted on each variable to examine differences by year, followed by a post-hoc analysis using the Dunn-Bonferroni correction to reveal where differences may occur. Each variable showed an upward trend in Likert responses over time, with statistically significant differences (p < .05) for every variable except the one measuring the importance of diversity and inclusion at Penn State.
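A minimal sketch of that test for a single variable, reusing the placeholder nso data frame from above (the FSA package provides dunnTest()):

# Kruskal-Wallis test of Leader Connection ratings across years
nso$Year <- factor(nso$Year)
kruskal.test(Leader_Connection ~ Year, data = nso)

# Post-hoc pairwise comparisons with a Bonferroni correction
library(FSA)
dunnTest(Leader_Connection ~ Year, data = nso, method = "bonferroni")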

The open-ended survey question asking for a definition of consent revealed interesting results across the three surveys. In 2017 (figure 3.4) and 2018 (figure 3.6), the top two words used to define consent were “consent” and “yes,” respectively. However, in 2019 (figure 3.8), the top two words were “given” and “freely.” The 2019 version of the Results Will Vary interactive play, which all NSO students see, featured a scene related to consent. In that scene, the acronym F.R.I.E.S. is used to represent that consent must be freely given, reversible, informed, enthusiastic, & specific. Each of these words, along with the acronym itself, appears in the top 10 words for the 2019 survey, while none of them were found in the top 10 of the previous two NSO years.
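Below is a minimal sketch of how open-ended responses like these can be tallied, using the tm and wordcloud packages loaded in the earlier scripts (Consent_Definition is the placeholder column from above, and in practice the responses would be split by year):

library(tm)
library(wordcloud)

# Build and clean a corpus from the open-ended consent definitions
corp <- Corpus(VectorSource(nso$Consent_Definition))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("english"))

# Rank terms by frequency and plot a word cloud
tdm   <- TermDocumentMatrix(corp)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freqs, 10)                                  # top 10 words overall
wordcloud(names(freqs), freqs, max.words = 50)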

Likert Data Comparison (figure 3.1)

Leader Connection (FIGURE 3.2)

Protocols, Services, & Resources (FIGURE 3.3)

2017 Open Ended Data (FIGURE 3.4)

Frequency chart of top 20 Words (FIGURE 3.5)

2018 Open Ended Data (FIGURE 3.6)

Frequency chart of top 20 Words (FIGURE 3.7)

2019 Open Ended Data (FIGURE 3.8)

Frequency chart of top 20 Words (FIGURE 3.9)

Conclusion / Suggestions for the Future

While it is difficult to make inferences from observational data, we can see some trends towards greater understanding among students who attend NSO at Penn State’s University Park campus. These increases in understanding could be due to any variety of factors, including changes in the population of interest (e.g. incoming students) or changes in a variety of areas within the NSO experience. The open-ended survey data defining consent showed the clearest picture of the differences between years, with the 2019 data pointing clearly towards connections made with a scene dedicated to consent in the Results Will Vary interactive play.

To gain a greater understanding of students’ perceptions of new student orientation, a variety of opportunities exist. A clear definition of what kinds of insights you would like to gain from students regarding NSO should inform question formation. For example, an argument could be made that leader connection is a multi-dimensional construct that cannot be measured accurately with one question. Notably, leader connection was the variable with both the lowest score and the widest distribution of responses among participants; further investigation is warranted. Finally, the use of open-ended survey responses could provide a wealth of feedback on specific initiatives, particularly if they are formed in conjunction with specific experiences during NSO. For example, the F.R.I.E.S. scene from the Results Will Vary interactive play demonstrated clear connections to changes in understanding of consent. Future new student orientations could benefit from exploring these connections in other topics covered within Results Will Vary, to measure both the effectiveness of the play and the perceptions of students.