Since 1985, the U.S. News and World Report has collected, compiled, and published a list of the top colleges and universities around the country. This report is based on annual surveys sent to each school as well as general opinion surveys of university faculties and administrators who do not belong to the schools on the list. These rankings are among the most widely quoted of their kind in the United States and have played an important role among students making their college decisions. However, other factors may prove to be meaningful when making these decisions. For example, is cost of tuition associated with the ranking of a university? Said another way, do “better” schools cost more money to attend? Are other factors, such as enrollment, state population, cost of living, and region associated with these rankings? For potential students looking to get ahead in a global economy, these may be important considerations, especially for those who come from lower socioeconomic backgrounds.
Objectives & Variables of Interest
The purpose of this study is to investigate how tuition, enrollment, cost of living, population, and region of the country are associated with a school's position in the 2017 US News & World Report’s Best College Rankings. The variables of interest are university ranking, undergraduate tuition, undergraduate enrollment, cost of living index by state, state population according to 2017 Census data, and region of the country (Northeast, Midwest, South, & West). The response variable is ranking, and the potential explanatory variables are undergraduate tuition, undergraduate enrollment, cost of living index, state population, and region of the country.
Cost of living (COL) data was available by state instead of by community. A university that is located in a community with a high cost of living may be in a state with an overall low COL index score (and vice versa), which reduces the precision of any prediction based on this variable. In addition, many schools chose not to participate in this ranking report, which introduces non-response bias into the design.
Exploratory Data Analysis
The first step in exploratory data analysis was to look at the shape of each of the variables from a univariate standpoint (not pictured in this analysis). From there, I explored the associations between continuous variables, represented in the correlation matrix and the correlation plots below:
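Since the univariate step is not pictured, here is a minimal sketch of what such a check could look like. It assumes a data frame with numeric columns plus a categorical Region column; the `toy` data frame and its values are hypothetical and are not the study data.

```r
# Toy data frame for illustration only (not the study data).
toy <- data.frame(
  Rank       = c(1, 25, 60, 110, 180),
  Tuition    = c(52000, 48000, 35000, 28000, 15000),
  Enrollment = c(6000, 12000, 30000, 22000, 18000),
  Region     = factor(c("Northeast", "West", "Midwest", "South", "South"))
)

# Keep only the numeric columns; Region is categorical.
numeric_vars <- toy[sapply(toy, is.numeric)]

# Quartiles, mean, min, and max for each numeric variable.
summary(numeric_vars)

# One histogram per numeric variable to inspect shape (skew, outliers).
op <- par(mfrow = c(1, ncol(numeric_vars)))
for (v in names(numeric_vars)) {
  hist(numeric_vars[[v]], main = v, xlab = v)
}
par(op)

# Counts per category for the categorical variable.
region_counts <- table(toy$Region)
region_counts
```

Swapping `toy` for the real data frame would give the same summaries and histograms for every numeric variable in the study.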
Looking at the data, we see some moderate to strong associations between rank, tuition, and enrollment that warrant further investigation. Before building models, I explored the data further by comparing some of these associations by region:
Finally, we can see the strongest association between two variables by visualizing Rank vs Tuition. When the points are coded by Region, we see some slight curvature in the data, but a similar negative shape and slope across the regions of the United States:
Summary & Initial Observations of EDA:
Among the explanatory variables, cost of tuition shows the strongest association with ranking in the US News and World Report metric. One outlier (Brigham Young University–Provo) reports a relatively high ranking (68) in comparison to its tuition ($5,300). This observation appears to be influencing the line of best fit and weakening the correlation. Even with that influential data point, we still see a strong negative correlation (-.75) between tuition and rank. Enrollment does not appear to be strongly associated with university ranking, but it does appear to be positively associated with tuition (.37) and negatively associated with cost of living (-.19). Cost of Living Index (COL) and state population appear to have a weak, negative association with university ranking, and a moderate positive association with the cost of tuition. The data may indicate an interaction between some of the explanatory variables, such as tuition, cost of living, and enrollment, in their relationship with rank, warranting further investigation.
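One way to check how much a single observation influences the tuition and rank correlation is to recompute the coefficient with that observation left out. The sketch below is only an illustration: the helper name `cor_drop_one` and the `toy` values are hypothetical and are not the study data, and the column names Tuition and Rank are assumed to match the data frame used in the code chunks.

```r
# Hypothetical helper: correlation between two columns,
# computed on all rows and again with one row removed.
cor_drop_one <- function(data, x, y, drop_row) {
  with_all <- cor(data[[x]], data[[y]])
  without  <- cor(data[[x]][-drop_row], data[[y]][-drop_row])
  c(all_rows = with_all, row_dropped = without)
}

# Toy values for illustration only (not the study data).
toy <- data.frame(
  Tuition = c(50000, 45000, 40000, 30000, 20000, 5000),
  Rank    = c(5, 20, 40, 90, 150, 60)
)

# Row 6 mimics a low-tuition school with a comparatively good rank.
res <- cor_drop_one(toy, "Tuition", "Rank", drop_row = 6)
res
```

If the coefficient moves noticeably further from zero once the row is dropped, that supports treating the point as influential when building models.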
The relationships between the variables are summarized in the heat map below:
Through this EDA, we have a better understanding of the shape and relationships among the variables, which should inform model construction and analysis. The second part of this project tackles those objectives, found here. Code for the plots above can be found below:
R Code Chunks
# Packages used below; plotly is assumed for the interactive plots
library(ggplot2)
library(plotly)
library(corrplot)

# cor() requires numeric input, so keep only the numeric columns (drops Region)
NUM <- DATA[sapply(DATA, is.numeric)]

# Correlation Matrix
cor(NUM)

# Plot 1: Correlation Plot
plot(NUM)

# Plot 2: Boxplots of Ranking (by Region)
ggplot(DATA, aes(x = Region, y = Rank, fill = Region)) +
  geom_boxplot(alpha = 0.3) +
  theme(legend.position = "none")

# Plot 3: Interactive Plot of Tuition vs Rank (by Region)
# The text aesthetic is only used by ggplotly() for the hover tooltip
p <- ggplot(data = DATA, aes(x = Tuition, y = Rank)) +
  geom_point(aes(text = paste("Enrollment:", Enrollment)), size = .5) +
  geom_smooth(aes(colour = Region, fill = Region))
ggplotly(p)

# Plot 4: Interactive Plot of Enrollment vs Rank (by Region)
p <- ggplot(data = DATA, aes(x = Enrollment, y = Rank)) +
  geom_point(aes(text = paste("Enrollment:", Enrollment)), size = .5) +
  geom_smooth(aes(colour = Region, fill = Region))
ggplotly(p)

# Plot 5: Heat Map Correlation Plot
heat <- cor(NUM)
corrplot(heat, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)