Each year, U.S. News and World Report collects, compiles, and publishes a list of the top colleges and universities around the country. The report is based on annual surveys sent to each school as well as general opinion surveys of university faculty and administrators who do not belong to the schools on the list. These rankings are among the most widely quoted of their kind in the United States and play an important role for students making their college decisions. However, other factors may also prove meaningful when making these decisions. The data may indicate an interaction between some of the explanatory variables, such as tuition, cost of living, enrollment, and rank, which warrants further investigation.
The data consist of 222 observations with 8 variables that describe the 2017 edition of the US News Universities Rankings, as well as the cost of living and population by state based on the US Census Bureau predictions for 2017. The datasets for this analysis are available on data.world and the US Census Bureau website.
Building the Model:
To choose a model, both stepwise regression and best subsets selection were used. Before running stepwise regression, the full model was evaluated.
According to the summary of the full model, the adjusted R-squared is 0.6588, indicating that the full model explains about 65.9% of the variance of the response variable after adjusting for the number of predictors. Since the p-value of the overall F-test is below .001 (< 2.2e-16), this association is unlikely to have occurred by chance. Based on the ANOVA of the full model, several explanatory variables, including tuition and enrollment, appear to be significant predictors and are candidates for the best-fitting model.
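For reference, the adjusted R-squared reported above is related to the ordinary R-squared by the standard formula below, where n is the number of observations and p is the number of estimated slope coefficients (excluding the intercept). The value of p for the full model depends on how many dummy variables Region contributes, which is not stated here, so the formula is given in general form only.

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Because the penalty factor grows with p, adding a predictor raises the adjusted R-squared only when it improves the fit by more than would be expected by chance.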
Regression Assumptions of the Final Model:
The next step is to evaluate the regression assumptions, which are listed below:
- Linear: Mean ranking at each set of the explanatory variables is a linear function of the explanatory variables.
- Independent: The observations in the data set do not depend on each other.
- Normal: Ranking at each set of the explanatory variables is normally distributed.
- Equal Variance: Ranking at each set of the explanatory variables has equal variance (i.e. homoscedastic).
To analyze linearity and equal variance, a residuals vs. fitted values plot is used. To evaluate normality, a normal Q-Q plot is generated.
According to the residuals vs. fitted values plot, there is no visible pattern in the residuals, so we conclude that the linearity and equal variance assumptions have been met. According to the Q-Q plot, there is some deviation at the tails of the distribution, but the normality assumption appears to have been met.
The Cook’s distance plot indicates three potential outliers influencing the line of best fit. Surprisingly, BYU was not one of the observations exerting the most influence on the model. Instead, those were:
- University of Central Florida (#51) – Rank: 176; Tuition: 22467; Enrollment: 54513.
- University of Hawaii at Manoa (#56) – Rank: 169; Tuition: 33764; Enrollment: 13689.
- SUNY College of Environmental Science and Forestry (#141) – Rank: 99; Tuition: 17620; Enrollment: 1839.
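As a rough illustration of how the most influential observations can be listed rather than read off the plot, the sketch below ranks observations by Cook's distance using base R. The data are simulated stand-ins (only the column names mirror this analysis), and one unusual observation is planted on purpose, so the output says nothing about the real schools.

```r
# Simulated stand-in data; column names mirror the post, values are made up.
set.seed(1)
n <- 100
sim <- data.frame(
  Tuition    = runif(n, 10000, 55000),
  Enrollment = runif(n, 1500, 55000)
)
sim$Rank <- 250 - 0.004 * sim$Tuition - 0.0005 * sim$Enrollment + rnorm(n, sd = 25)

# Plant one deliberately unusual observation so there is something to find.
sim$Rank[7] <- sim$Rank[7] + 400

fit <- lm(Rank ~ Tuition + Enrollment, data = sim)

# Cook's distance for every observation, largest first.
cd   <- cooks.distance(fit)
top3 <- head(sort(cd, decreasing = TRUE), 3)

# Show the rows behind the three largest distances.
print(sim[as.integer(names(top3)), ])
print(round(top3, 3))
```

Listing the rows alongside their distances makes it easier to judge whether a flagged school is a data error or simply an unusual but legitimate case.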
Based on these findings, the full model should suffice for concluding that there is a meaningful relationship between tuition, enrollment, and university ranking. However, further exploration is needed to determine whether this is the best model to explain the potential relationship between the explanatory variables and the response variable.
Following both the stepwise and best subsets regression, we see that tuition, enrollment, and population are recommended as predictors in the regression model:
When comparing the reduced model to the full model through an F-test, we see that there is not a significant difference (p-value: .77) between the two models:
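The nested-model F-test described above can be sketched as follows. This is a minimal illustration on simulated data (the column names mirror the analysis, and Region is left out for brevity); it computes the partial F statistic by hand from the residual sums of squares and checks it against base R's `anova()`.

```r
# Simulated stand-in data; values are made up for illustration.
set.seed(42)
n <- 120
sim <- data.frame(
  Tuition      = runif(n, 10000, 55000),
  Enrollment   = runif(n, 1500, 55000),
  Population   = runif(n, 6e5, 4e7),
  CostOfLiving = runif(n, 85, 190)
)
sim$Rank <- 250 - 0.004 * sim$Tuition - 0.0005 * sim$Enrollment + rnorm(n, sd = 25)

full    <- lm(Rank ~ Tuition + Enrollment + Population + CostOfLiving, data = sim)
reduced <- lm(Rank ~ Tuition + Enrollment + Population, data = sim)

# Partial F statistic from the residual sums of squares and degrees of freedom.
rss_full    <- sum(resid(full)^2)
rss_reduced <- sum(resid(reduced)^2)
df_full     <- df.residual(full)
df_reduced  <- df.residual(reduced)

f_stat  <- ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

# The built-in comparison reports the same F statistic and p-value.
cmp <- anova(reduced, full)
print(cmp)
print(c(F = f_stat, p = p_value))
```

A large p-value here means the extra terms in the larger model do not reduce the residual sum of squares by more than chance would explain, which is the argument for preferring the smaller model.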
The stepwise regression indicates that the model with tuition, enrollment, and population has an AIC of 1628.18, while another model that also includes cost of living has an AIC of 1630. A model that accounts for the interaction between population and cost of living is worth exploring. However, Variance Inflation Factors for that model show strong evidence of multicollinearity between population and cost of living.
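To show where a variance inflation factor comes from, the sketch below computes it by hand: each predictor column is regressed on the remaining columns, and the VIF is 1 / (1 - R²) of that auxiliary regression. The helper `manual_vif` is a hypothetical name written for this illustration, not a function from the `car` package, and the data are simulated with independent predictors.

```r
# Simulated stand-in data with independent predictors; values are made up.
set.seed(7)
n <- 150
sim <- data.frame(
  Tuition      = runif(n, 10000, 55000),
  Enrollment   = runif(n, 1500, 55000),
  Population   = runif(n, 6e5, 4e7),
  CostOfLiving = runif(n, 85, 190)
)
sim$Rank <- 250 - 0.004 * sim$Tuition - 0.0005 * sim$Enrollment + rnorm(n, sd = 25)

# Hypothetical helper: VIF of each design-matrix column, computed by
# regressing that column on all the other columns.
manual_vif <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

with_interaction <- lm(Rank ~ Tuition + Enrollment + Population * CostOfLiving, data = sim)
main_effects     <- lm(Rank ~ Tuition + Enrollment + Population, data = sim)

print(round(manual_vif(with_interaction), 2))
print(round(manual_vif(main_effects), 2))
```

Even with independent simulated predictors, the product term is strongly correlated with its own components, so some inflation in an interaction model is structural. Centering the predictors before forming the product is a common way to reduce it.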
Finally, a Variance Inflation Factor (VIF) test was conducted on the reduced model. It found no evidence of multicollinearity, suggesting that the reduced model is the better choice. After the VIF test, the regression assumptions were checked again, with results similar to those for the full model and all conditions met. The model was then validated using k-fold cross-validation.
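The cross-validation step is not shown in the code at the end of this post, so the following is only a minimal base-R sketch of how k-fold cross-validation of the reduced model could look. The number of folds (k = 5), the error metric (RMSE), and the simulated data are all assumptions made for illustration.

```r
# Simulated stand-in data; values are made up for illustration.
set.seed(2017)
n <- 200
sim <- data.frame(
  Tuition    = runif(n, 10000, 55000),
  Enrollment = runif(n, 1500, 55000),
  Population = runif(n, 6e5, 4e7)
)
sim$Rank <- 250 - 0.004 * sim$Tuition - 0.0005 * sim$Enrollment + rnorm(n, sd = 25)

# Assign each row to one of k folds at random (k = 5 is an assumption).
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, then score on the held-out fold.
rmse <- sapply(seq_len(k), function(i) {
  train <- sim[fold != i, ]
  test  <- sim[fold == i, ]
  fit   <- lm(Rank ~ Tuition + Enrollment + Population, data = train)
  sqrt(mean((test$Rank - predict(fit, newdata = test))^2))
})

print(round(rmse, 2))
print(mean(rmse))
```

If the held-out RMSE is close to the residual standard error of the model fitted on all the data, that is some reassurance that the model is not overfitting.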
Summary & Conclusions:
After the initial exploratory data analysis (EDA), a number of patterns emerged. The cost of tuition is clearly strongly associated with ranking in the US News & World Report. What was not clear were the effects of the other variables (enrollment, region, and cost of living) on the response variable. While enrollment is weakly associated with ranking, it is moderately associated with tuition and cost of living.
After initially analyzing the full model, we find statistical evidence that tuition and enrollment are related to university ranking. Best subsets and stepwise regression both suggested a model with tuition, enrollment, and population as the predictor variables. Comparing this model against the full model did not yield a significant difference between the two, which suggests that the smaller model with tuition, enrollment, and population is preferable. An additional model (model 3) was investigated to include the interaction of cost of living and population, but it showed strong evidence of multicollinearity between those two variables. Multicollinearity was examined in the reduced model (model 2) using a VIF test, which found no evidence of multicollinearity within that model.
Assumptions of linearity, normality, and equal variance were satisfied after examining a plot of residuals vs. fitted values as well as a Q-Q plot of the residuals. With a sample of 221 observations and a p-value of less than .001, we have statistical evidence to suggest that tuition, enrollment, and population are significant predictors of performance in the US News and World Report’s Best College Ranking. The final model is summarized below:
As previously stated, cost of living data was available by state instead of by the city or county where the university is located. So, a university located in a community with a high cost of living may be in a state with an overall low COL index score, and vice versa. This reduces the precision of our predictions. In addition, this list consists of the 231 schools which opted to participate in the US News & World Report Ranking Index. According to US News, there are over 4,000 colleges and universities in the United States. This raises the concern of non-response bias and limits generalizability beyond the institutions participating in the US News Rankings.
One example of this is the University of Minnesota, which chose not to participate in the US News Best College Rankings. Minnesota’s in-state undergraduate tuition and fees are $14,142. The enrollment is 19,819, and the state population is 5,568,155 (in 2017):
This results in a 95% prediction interval of 111 to 266 for the University of Minnesota’s US News Best College Ranking. However, in the 2019 US News rankings, the University of Minnesota is ranked #76 (tied with Virginia Tech), which falls outside that interval. This suggests that the model is not an accurate predictor of an individual school's ranking, but rather serves as an illustration of overall national trends between tuition, enrollment, and population with regard to university ranking in this report.
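Since `predict()` can return either a confidence interval for the mean response or a prediction interval for a single new school, the sketch below shows the difference in base R. The model is fitted to simulated stand-in data, so the numbers it prints are not the intervals reported above; only the input values for the new school are taken from the `predict` call in this post's code.

```r
# Simulated stand-in data; values are made up for illustration.
set.seed(3)
n <- 200
sim <- data.frame(
  Tuition    = runif(n, 10000, 55000),
  Enrollment = runif(n, 1500, 55000),
  Population = runif(n, 6e5, 4e7)
)
sim$Rank <- 250 - 0.004 * sim$Tuition - 0.0005 * sim$Enrollment + rnorm(n, sd = 25)

fit <- lm(Rank ~ Tuition + Enrollment + Population, data = sim)

# Inputs copied from the post's own predict() call.
new_school <- data.frame(Tuition = 14142, Enrollment = 29819, Population = 5568155)

# Confidence interval: uncertainty about the MEAN rank of schools like this one.
conf_int <- predict(fit, new_school, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about ONE school's rank, so it also
# includes the residual scatter and is always wider.
pred_int <- predict(fit, new_school, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

For judging whether a single school's actual rank is consistent with the model, the prediction interval is the relevant one.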
R Code Chunks:
    # Packages used below: corrplot (corrplot), mosaic (mplot, msummary),
    # leaps (regsubsets), car (vif).
    # The data frames `rankings` and `rankingsreduced` are assumed to be loaded already.
    library(corrplot)
    library(mosaic)
    library(leaps)
    library(car)

    # Heat Map Correlation Plot:
    heat <- cor(rankingsreduced)
    corrplot(heat, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)

    # Full Model with all variables:
    fullmodel <- lm(Rank ~ Tuition + Enrollment + Region + CostOfLiving + Population,
                    data = rankings)

    # Model Summary (Full Model):
    summary(fullmodel)

    # ANOVA Table (Full Model):
    anova(fullmodel)

    # Residuals vs. Fitted:
    mplot(fullmodel, which = 1)

    # Q-Q Plot:
    mplot(fullmodel, which = 2)

    # Cook's Distance:
    mplot(fullmodel, which = 4)

    # Stepwise Regression:
    step(fullmodel, direction = "both")

    # Best subsets:
    BestSubsets <- regsubsets(Rank ~ Tuition + Enrollment + Region + CostOfLiving + Population,
                              data = rankings, method = "exhaustive", nbest = 2)
    Result <- summary(BestSubsets)

    # Append fit statistics to include R^2, adj R^2, Mallows' Cp, BIC:
    data.frame(Result$outmat, Result$rsq, Result$adjr2, Result$cp, Result$bic)

    # Model #2 (Based on Stepwise & Best Subsets):
    mod2 <- lm(Rank ~ Tuition + Enrollment + Population, data = rankings)

    # Model Summary (Reduced Model):
    msummary(mod2)

    # Model Comparison between Full & Reduced Model:
    anova(mod2, fullmodel)

    # Model with Interaction of Population & Cost of Living:
    mod3 <- lm(Rank ~ Tuition + Enrollment + Population + CostOfLiving + Population:CostOfLiving,
               data = rankings)

    # Model Summary (Model 3):
    msummary(mod3)

    # Variance inflation factor (Model 3):
    VIFtest1 <- lm(formula = Rank ~ Tuition + Enrollment + Population + CostOfLiving +
                     Population:CostOfLiving, data = rankings)
    vif(VIFtest1)

    # Variance inflation factor (small model):
    VIFtest2 <- lm(formula = Rank ~ Tuition + Enrollment + Population, data = rankings)
    vif(VIFtest2)

    # Checking model accuracy against "real world" data:
    minnesota <- data.frame(Tuition = 14142, Enrollment = 29819, Population = 5568155)

    # Prediction Interval:
    predict(mod2, minnesota, interval = "prediction")