2017 US News Rankings (Part 1)

Since 1985, U.S. News & World Report has collected, compiled, and published a list of the top colleges and universities around the country. The report is based on annual surveys sent to each school, as well as general opinion surveys of university faculty and administrators who are not affiliated with the schools on the list. These rankings are among the most widely quoted of their kind in the United States and play an important role in many students’ college decisions. However, other factors may prove meaningful when making those decisions. For example, is cost of tuition associated with the ranking of a university? Said another way, do “better” schools cost more money to attend? Are other factors, such as enrollment, state population, cost of living, and region associated with these rankings? For prospective students looking to get ahead in a global economy, these may be important considerations, especially for those who come from lower socioeconomic backgrounds.

Objectives & Variables of Interest

The purpose of this study is to investigate the associations of tuition, enrollment, cost of living, state population, and region of the country with the 2017 US News & World Report’s Best College Rankings. The variables of interest are university ranking, undergraduate tuition, undergraduate enrollment, cost of living index by state, state population according to 2017 Census data, and region of the country (Northeast, Midwest, South, & West). The response variable is ranking, and the potential explanatory variables are undergraduate tuition, undergraduate enrollment, cost of living index, state population, and region of the country.

Concerns

Cost of living (COL) data was available by state rather than by community. A university located in a community with a high cost of living may be in a state with an overall low COL index score (and vice versa), which reduces the precision of our predictions. In addition, many schools chose not to participate in this ranking report, which introduces non-response bias into the design.

Exploratory Data Analysis

The first step in exploratory data analysis was to look at the shape of each of the variables from a univariate standpoint (not pictured in this analysis). From there, I explored the associations between continuous variables, represented in the correlation matrix and the correlation plots below:

Correlation Matrix (2017 US News & World Report School Rankings)
Correlation Plot (2017 US News & World Report School Rankings)

Looking at the data, we see some moderate to strong associations between rank, tuition, and enrollment that warrant further investigation. Before building models, the data was explored by comparing some of these associations by region:

Boxplots of Rank by Region (2017 US News & World Report School Rankings)
Scatter Plot of Rank vs Enrollment, coded by Region (2017 US News & World Report School Rankings)

Finally, we can see the strongest association between two variables by visualizing Rank vs Tuition. When coded by Region we see some slight curvature in the data, but a similar negative shape and slope across parts of the United States:

Scatter Plot of Rank vs Tuition, coded by Region (2017 US News & World Report School Rankings)

Summary & Initial Observations of EDA:


Within the explanatory variables, we see a strong association between cost of tuition and ranking in the US News and World Report metric. One outlier (Brigham Young University–Provo) holds a relatively strong ranking (68) despite its low tuition ($5,300). This observation appears to influence the line of best fit and weaken the correlation coefficient, yet even with that influential data point we still see a strong negative correlation (-.75) between tuition and rank. Enrollment does not appear to have a meaningful association with university ranking, but it does appear to be positively associated with tuition (.37) and negatively associated with cost of living (-.19). Cost of Living Index (COL) and state population appear to have a weak, negative association with university ranking, and a moderate, positive association with the cost of tuition. The data may indicate an interaction among some of the explanatory variables, such as tuition, cost of living, and enrollment, in their relationship with rank, warranting further investigation.
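As a quick check on that point's influence, the correlation between tuition and rank can be recomputed with and without it. This is a minimal sketch that assumes the same DATA placeholder used in the code chunks below, along with a School column holding the university names (that column name, and the exact school label, are assumptions):

# Tuition vs Rank correlation, with and without the BYU-Provo observation
with(DATA, cor(Tuition, Rank))

no_byu <- subset(DATA, School != "Brigham Young University-Provo")  # assumed column and label
with(no_byu, cor(Tuition, Rank))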

The relationships among these variables are summarized in the heat map below:

Through this EDA, we have a better understanding of the shape of and relationships among the variables, which should inform model construction and analysis. The second part of this project, found here, tackles those objectives. Code for the plots above can be found below:

R Code Chunks

# Correlation Matrix (numeric variables only; Region is a factor and is excluded)
num_vars <- DATA[sapply(DATA, is.numeric)]
cor(num_vars)

# Plot 1: Correlation Plot (scatterplot matrix)
plot(num_vars)

# Plot 2: Boxplots of Ranking (by Region)
library(ggplot2)
ggplot(DATA, aes(x = Region, y = Rank, fill = Region)) + 
    geom_boxplot(alpha = 0.3) +
    theme(legend.position = "none")

# Plot 3: Interactive Plot of Tuition vs Rank (by Region)
library(plotly)  # assuming plotly was used to render the interactive versions
p_tuition <- ggplot(data = DATA, aes(x = Tuition, y = Rank)) +
  geom_point(aes(text = paste("Enrollment:", Enrollment)), size = .5) +
  geom_smooth(aes(colour = Region, fill = Region))
ggplotly(p_tuition, tooltip = "text")

# Plot 4: Interactive Plot of Enrollment vs Rank (by Region)
p <- ggplot(data = DATA, aes(x = Enrollment, y = Rank)) +
  geom_point(aes(text = paste("Enrollment:", Enrollment)), size = .5) +
  geom_smooth(aes(colour = Region, fill = Region))
ggplotly(p, tooltip = "text")

# Plot 5: Heat Map Correlation Plot
library(corrplot)
heat <- cor(num_vars)
corrplot(heat, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

Exploratory Data Analysis: Spotify Song Popularity

Who doesn’t want to know what makes a song popular? As a musician, I have spent decades trying to get an answer to that question, with little success. People just seem to like what they like. So, when I enrolled in an applied statistics course that took a deep dive into regression analysis, I got my chance. We were required to conduct an exploratory data analysis (EDA) on a data set of our choosing, but we had to go out and find it. This led me to Kaggle and the Spotify Tracks Database.

The purpose of this EDA was to investigate what variables may influence song popularity while developing a greater understanding of statistical procedures. More specifically, the following questions were to be addressed:

  1. What variables are associated with popularity of song choice by Spotify users? 
  2. Is one variable associated with popularity above others? 
  3. If there is an association, is it linear? 

My intuition before conducting the analysis was that danceability, energy, and valence would be the most highly associated with song popularity, but not necessarily in a linear manner (or in that order).

Description of Data

The original dataset included 228,159 observations and 17 variables from the Spotify Tracks Database created by Tim Igolo and posted on Kaggle.com. The data was harvested through Spotify for Developers in April of 2019. Unfortunately, the data included soundtrack and movie music, opera (but not classical), and a number of other musical styles that could make any kind of regression analysis difficult. So, I filtered the data to include only popular music types, resulting in 130,663 observations (i.e. songs) with 11 variables of interest. Those variables of interest are popularity, acousticness, danceability, duration (in milliseconds), energy, liveness, loudness, mode (major or minor), speechiness, tempo, & valence. The response variable is popularity, and the potential explanatory variables are acousticness, danceability, duration, energy, liveness, loudness, mode, speechiness, tempo, & valence.
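Below is a minimal sketch of that filtering step. It assumes the raw Kaggle export is a CSV called SpotifyFeatures.csv with lower-case column names like genre, popularity, and duration_ms; the file name, the column names, and the exact genre labels to drop are all assumptions and would need to match the actual file:

# Read the raw Kaggle export, drop non-popular-music styles, and keep the
# eleven variables of interest (file name, column names, and genre labels assumed)
library(dplyr)

raw <- read.csv("SpotifyFeatures.csv")

DATA <- raw %>%
  filter(!genre %in% c("Movie", "Soundtrack", "Opera")) %>%
  select(Popularity = popularity, Acousticness = acousticness,
         Danceability = danceability, Duration = duration_ms,
         Energy = energy, Liveness = liveness, Loudness = loudness,
         Mode = mode, Speechiness = speechiness, Tempo = tempo,
         Valence = valence)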

Exploratory Data Analysis (EDA)

Originally proposed by John Tukey, the inventor of the Tukey Test, EDAs are typically used to summarize a dataset’s main characteristics. This can be done through simple summary statistics (measures of central tendency, the five number summary, etc.), but it often includes data visualization as well. Simply put, we use EDAs to look for any patterns or problems in the data, and right away I found one:

The mode variable indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0. As you can see, there are more songs in minor (79,409) than in major (51,254), and mode does not appear to have an appreciable effect on the popularity of a song:
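Those counts, and the comparison of popularity across the two modes, can be reproduced with a short sketch like the one below, using the mosaic package mentioned later in this post (DATA, Mode, and Popularity are the placeholder names used in the code chunks at the end):

# Tally the two modes, then summarize popularity within each mode
library(mosaic)
tally(~ Mode, data = DATA)                # counts of songs in each mode
favstats(Popularity ~ Mode, data = DATA)  # min, quartiles, mean, and sd of popularity by mode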

While these two visualizations seem pretty clear, there is one glaring issue: most songs in popular music are neither major nor minor. Instead, they are modal, and they sometimes shift tonalities throughout, so categorizing all of these songs dichotomously is a problem. Upon further investigation, you also see some songs listed multiple times and in multiple categories. This is almost certainly due to how Spotify classifies its songs, which is fine, but it poses problems when trying to investigate the questions stated above. 
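One way to surface those duplicates is to count how many times each track appears in the raw data. This sketch assumes the unfiltered data frame (raw, from the earlier sketch) keeps track_name and artist_name columns; those names are assumptions:

# Tracks that appear more than once, often under several genre labels
library(dplyr)
raw %>%
  count(track_name, artist_name, sort = TRUE) %>%
  filter(n > 1)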

Adjusting Plots for Better Visualization:

I was very new to R at this point and didn’t have the time to sort through these issues, so I ended up choosing another dataset to analyze. Still, while I had this data, I figured I would go ahead and work out some ways to visualize it. Fortunately, this dataset did provide some interesting obstacles to overcome with data visualization. 

The first problem was the sheer number of observations. I used the mplot function within the mosaic package for many of these initial plots, which is super convenient for beginners in R. However, since the dataset was so large, many of the plots were not helpful, like these two below:

Correlation Plot (i.e. corrplot) of Continuous Variables
Popularity with respect to Speechiness in Spotify Dataset

These first two plots show the relationships between continuous variables, but they are not the most helpful. We can get a sense of some linear relationships in the data, but it’s pretty tough to see what is going on simply due to the number of data points. So, a solution was to use a heat plot instead to show correlations across the continuous variables:

Heat Plot of Continuous Variables

Similarly, the plot below of these same two variables makes the patterns in the data much clearer, simply by changing the size and transparency of the data points themselves:

Popularity with respect to Speechiness in Spotify Dataset (adjusted size & transparency of data points)

While this was an early foray for me into R and not a dataset I wanted to investigate further to make inferences from, it did provide some interesting obstacles to overcome in using the software to visualize data. Ultimately, my interests gravitated toward a different question: investigating factors for school ranking in the US News & World Report, which can be found here, and the code I used to create the plots above is here:

R Code Chunks

# Plot 1: Correlation plot (scatterplot matrix) of Popularity vs the explanatory variables
plot(DATA)

# Plot 2: Scatter Plot of Popularity vs Speechiness
library(mosaic)    # loads ggformula, which provides gf_point() and gf_labs()
library(ggplot2)
gf_point(Popularity ~ Speechiness, data = DATA) %>% 
  gf_labs(title = "Popularity vs Speechiness", caption = "Spotify Dataset")

# Plot 3: Heat plot of Popularity vs Explanatory Variables
library(corrplot)
heat <- cor(DATA)
corrplot(heat, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

# Plot 4: Scatterplot of Speechiness vs. Popularity (smaller, transparent points)
DATA %>%
  ggplot(aes(x = Speechiness, y = Popularity)) +
  geom_point(color = "darkblue", size = 1.75, alpha = 0.01) +
  theme(legend.position = "none")

My Pathway to PBL

In 2008, I was hired to teach on the music faculty at Texas A & M – Commerce. Great job. Great people there. The final sentence of my contract said “other duties as assigned,” and one of those duties ended up being a music technology course. I had no experience teaching a class like that and no formal training for one, but that didn’t change the fact that I was going to teach it. So, I got started on figuring out how to do it and not feel like an idiot in the process.

The first step was to find an existing syllabus, but none was to be found. The next step was to ask the faculty what had been covered in the class in the past, which nobody really knew (this is not uncommon). There was a textbook that had been used, but it was not geared towards the population of students I was going to be teaching (according to my boss…and he wasn’t wrong). So, the next logical step was to get a sense from the faculty (and my bosses) of what they would like me to cover in the course.

What I got was a laundry list of software and technologies students should be able to use, many of which were incredibly outdated or not relevant to the entire class. For example, being able to use drill-writing software like Pyware or EnVision is not going to be a meaningful exercise for people who intend to teach choir, orchestra, elementary, or middle school. It is also really difficult to learn, because it requires knowledge in other areas besides just how to operate the software. This left me with these questions to resolve:

  1. What software and skills are transferable across all students in the class?
  2. What can we reasonably expect students to know and be able to do within the confines of the semester and the technological resources available?
  3. What kinds of activities and assessments can we create to achieve these goals?

The result was a series of projects centered on 3 main areas of being a music teacher and how technology could be used to serve them. Those areas were teaching, creativity, and administration. Projects were designed to scaffold learners in a way that not only helped them gain knowledge and understanding of the software, but also showed them how to leverage those skills in service of their own interests to create new knowledge and digital fluency. Some projects were the same across degree tracks (choral, instrumental, general music, etc.), while others were specific to each track. Examples of projects that remained consistent across tracks included the missing part assignment and the digital audio assignments, while some of the projects that bifurcated were the orchestration projects and the final projects.

The result was something unexpected. Students got really into the projects, and we all ended up helping each other learn. Eventually I became the most knowledgeable person in the room with respect to the various technologies, but we never stopped learning together. With each passing year, I found that the more interesting and authentic the assignments were, and the more they allowed students’ interests and creativity to come out, the more wonderful the projects became.

Years later, I came to understand that this approach was known as Problem Based Learning (PBL) and Authentic Context Learning (ACL). These approaches have become huge interests of mine across all domains of learning, and I have had some wonderful experiences in classes that use them, such as those in the Applied Statistics program at Penn State. Now, as I finish my PhD in the time of Covid-19, I see the importance of creating meaningful projects that both students and teachers can learn from to increase engagement and understanding. Below are some examples of projects I have developed in the past, for anyone who is interested:

Music Notation Projects:

  1. Recreate a Score
  2. Transform a Score
  3. Creating & Composing with Music Notation Software

Digital Audio Projects:

  1. Recording & Editing Project
  2. Digital Audio Transformation Project
  3. Composing & Creating with Digital Audio Workstations

Music Administration:

  1. Data Cleaning Project
  2. Tracking Expenses
  3. Create a Mail Merge