Who doesn’t want to know what makes a song popular? As a musician, I have spent decades trying to answer that question, with little success. People just seem to like what they like. So, when I enrolled in an applied statistics course that took a deep dive into regression analysis, I got my chance. We were required to conduct an exploratory data analysis (EDA) on a dataset of our choosing, but we had to go out and find it. This led me to Kaggle and Spotify’s million song dataset.
The purpose of this EDA was to investigate what variables may influence song popularity while developing a greater understanding of statistical procedures. More specifically, the following questions were to be addressed:
- What variables are associated with popularity of song choice by Spotify users?
- Is one variable associated with popularity above others?
- If there is an association, is it linear?
My intuition before conducting the analysis was that danceability, energy, and valence would be the variables most strongly associated with song popularity, though not necessarily in that order or in a linear manner.
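One common way to probe the third question is to compare a straight-line fit against a fit with a curved term. The sketch below is only an illustration on simulated data, not the analysis from this project: the relationship between `danceability` and `popularity` is made up, and the quadratic term is just one of several ways to allow for curvature.

```r
# Simulated data: the curved relationship below is invented for illustration
set.seed(123)
n <- 500
danceability <- runif(n)
popularity <- 40 + 60 * danceability - 50 * danceability^2 + rnorm(n, sd = 5)

# Straight-line model versus a model that allows curvature
linear_fit <- lm(popularity ~ danceability)
curved_fit <- lm(popularity ~ danceability + I(danceability^2))

# Pearson correlation only captures the straight-line part of an association
r <- cor(popularity, danceability)

# F-test: does the squared term explain significantly more variance?
comparison <- anova(linear_fit, curved_fit)
p_value <- comparison[["Pr(>F)"]][2]

print(r)
print(comparison)
```

A small p-value for the second model suggests the association is not purely linear, which is worth knowing before trusting a correlation coefficient on its own.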
Description of Data
The original dataset included 228,159 observations and 17 variables that describe the Spotify Tracks Database created by Tim Igolo and posted on Kaggle.com. Data was harvested through Spotify for Developers in April of 2019. Unfortunately, the data included soundtracks, “movie music,” opera (but not classical), and a number of other musical styles that could make any kind of regression analysis difficult. So, I filtered the data to include only popular music types, resulting in 130,663 observations (i.e., songs) and 11 variables of interest: popularity, acousticness, danceability, duration (in milliseconds), energy, liveness, loudness, mode (major or minor), speechiness, tempo, and valence. The response variable is popularity, and the remaining 10 variables are the potential explanatory variables.
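The filtering step can be sketched in base R roughly as follows. This is a toy stand-in rather than the actual Kaggle file: the `genre` column, its labels, and the list of excluded genres are all assumptions made for illustration, and only three of the variables of interest are shown.

```r
# Toy stand-in for the Kaggle file; `genre` and its labels are hypothetical
tracks <- data.frame(
  genre        = c("Pop", "Opera", "Rock", "Soundtrack", "Hip-Hop", "Movie"),
  popularity   = c(71, 12, 64, 30, 80, 5),
  danceability = c(0.78, 0.21, 0.55, 0.30, 0.85, 0.40),
  energy       = c(0.80, 0.15, 0.90, 0.25, 0.70, 0.35),
  track_id     = c("a1", "b2", "c3", "d4", "e5", "f6")
)

# Genres to drop before any regression work (assumed labels)
excluded <- c("Opera", "Soundtrack", "Movie")

# Keep only the popular-music rows and the variables of interest
vars_of_interest <- c("popularity", "danceability", "energy")
DATA <- tracks[!(tracks$genre %in% excluded), vars_of_interest]

print(nrow(DATA))   # number of songs left after filtering
print(names(DATA))  # variables kept
```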
Exploratory Data Analysis (EDA)
Originally proposed by John Tukey, the inventor of the Tukey test, EDA is typically used to summarize a dataset’s main characteristics. This can be done through simple summary statistics (measures of central tendency, the five-number summary, etc.), but it often includes data visualization as well. Simply put, we use EDA to look for any patterns or problems in the data, and right away I found one:

The mode variable indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. As you can see, there are more songs in minor (79,409) than in major (51,254), and the mode does not appear to have an appreciable effect on the popularity of a song:

While these two visualizations seem pretty clear, there is one glaring issue: most songs in popular music are neither major nor minor. Instead, they are modal, and they sometimes shift tonalities throughout, so categorizing all of these songs dichotomously is a problem. Upon further investigation, you also see some songs listed multiple times and in multiple categories. This is almost certainly due to how Spotify classifies its songs, which is fine, but it poses problems when trying to investigate the questions stated above.
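For anyone who wants to reproduce these checks, here is a minimal base R sketch on a toy data frame. The column names `track_name`, `artist_name`, and `genre` are hypothetical, the coding of `mode` assumes 1 = major and 0 = minor, and matching songs on artist plus title is only one possible way to spot repeated listings.

```r
# Toy stand-in for the filtered data; track_name, artist_name and genre
# are hypothetical column names
songs <- data.frame(
  track_name  = c("Song A", "Song A", "Song B", "Song C", "Song C", "Song D"),
  artist_name = c("X", "X", "Y", "Z", "Z", "W"),
  genre       = c("Pop", "Dance", "Rock", "Pop", "Indie", "Rap"),
  mode        = c(1, 1, 0, 0, 0, 1),  # assumes 1 = major, 0 = minor
  popularity  = c(70, 70, 55, 62, 62, 48)
)

# Label the mode so tables and plots are readable
songs$mode_label <- factor(songs$mode, levels = c(0, 1),
                           labels = c("minor", "major"))

# How many rows fall in each mode?
mode_counts <- table(songs$mode_label)

# Median popularity per mode, a rough check for any effect of mode
mode_medians <- aggregate(popularity ~ mode_label, data = songs, FUN = median)

# Flag songs listed more than once, keyed on artist and title
key <- paste(songs$artist_name, songs$track_name, sep = " - ")
listing_counts <- table(key)
repeated <- names(listing_counts)[as.vector(listing_counts > 1)]

# Keep only the first listing of each song
deduped <- songs[!duplicated(key), ]

print(mode_counts)
print(mode_medians)
print(repeated)
```

Dropping repeated listings changes the counts per mode, so it is worth deciding how to handle them before reading too much into the bar chart or the comparison of popularity by mode.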
Adjusting Plots for Better Visualization:
I was very new to using R at this point and didn’t have the time to sort through these issues, so I ended up choosing another dataset to analyze, but I figured while I had this data I would go ahead and figure out some ways to visualize it. Fortunately, this dataset did provide some interesting obstacles to overcome with data visualization.
The first problem was the sheer number of observations. I used the mplot function from the mosaic package for many of these initial plots, which is super convenient for beginners in R. However, since the dataset was so large, many of the plots were not helpful, like these two below:


These first two plots show the relationships between continuous variables, but they are not the most helpful. We can get a sense of some linear relationships in the data, but it’s pretty tough to really see what is going on simply due to the number of data points. So, one solution was to use a heat plot to show correlations across the continuous variables instead:

Similarly, the plot below of these two variables shows the patterns in the data much more clearly, simply by changing the size and transparency of the data points themselves:

While this was an early foray into R for me and not a dataset I wanted to investigate further or draw inferences from, it did provide some interesting obstacles to overcome in using the software to visualize the data. Ultimately, my interests shifted to a different question, Investigating Factors for School Ranking in the US News & World Report, which can be found here. The code I used to create the plots above follows:
R Code Chunks
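# Plot 1: Scatterplot matrix of Popularity vs Explanatory Variables
# (plot() on a data frame draws every pairwise scatterplot)
library(corrplot)  # loaded here, used for the heat plot in Plot 3
plot(DATA)

# Plot 2: Scatter Plot of Popularity vs Speechiness
library(mosaic)    # attaches ggformula, which provides gf_point() and gf_labs()
library(ggplot2)
library(dplyr)     # makes sure the %>% pipe is available
gf_point(Popularity ~ Speechiness, data = DATA) %>%
  gf_labs(title = "Popularity vs Speechiness", caption = "Spotify Dataset")

# Plot 3: Heat plot of Popularity vs Explanatory Variables
heat <- cor(DATA)  # requires every column of DATA to be numeric
corrplot(heat, type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)

# Plot 4: Scatterplot of Speechiness vs. Popularity
# Small, nearly transparent points reduce overplotting
DATA %>%
  ggplot(aes(x = Speechiness, y = Popularity)) +
  geom_point(color = "darkblue", size = 1.75, alpha = 0.01) +
  theme(legend.position = "none")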
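The heat plot in Plot 3 is drawn from a correlation matrix, and the same matrix can be read directly to see which variable is most strongly correlated with popularity. The sketch below uses simulated data with made-up relationships (it is not the Spotify data, and the lowercase column names are just for this example), so the numbers it prints say nothing about real songs.

```r
# Simulated stand-in data; the relationships are invented for illustration
set.seed(7)
n <- 2000
energy      <- runif(n)
loudness    <- -20 + 15 * energy + rnorm(n, sd = 2)
speechiness <- rbeta(n, 1, 8)
popularity  <- 30 + 10 * energy + rnorm(n, sd = 15)
sim <- data.frame(popularity, energy, loudness, speechiness)

# Pairwise Pearson correlations, the same input corrplot() visualizes
cor_mat <- cor(sim)

# Pull out the popularity row, dropping popularity's correlation with itself
pop_cor <- cor_mat["popularity", colnames(cor_mat) != "popularity"]

# Rank explanatory variables by the strength of their correlation
ranked <- pop_cor[order(abs(pop_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Keep in mind that a correlation coefficient only measures linear association, so a variable with a weak correlation could still be related to popularity in a curved way.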