# Survey Data Analysis (The Hard Way)

## Introduction

In the summer of 2019, Penn State held 38 New Student Orientation (NSO) sessions and 3 International Student Orientation (ISO) sessions, during which all incoming freshmen watched an interactive play titled “Results Will Vary.” This play touches on a variety of topics related to the college experience and typical pitfalls for students in their first year. At the conclusion of each night of the play, incoming students filled out a card with questions they had about the play and/or the college experience.

The “Results Will Vary” database consists of 7,588 handwritten responses that incoming freshmen provided after being given the following prompt: “What lingering questions do you have regarding the show?” The 2019 freshman class at University Park was approximately 8,000 students, providing a coverage rate close to 100% and a response rate approaching 95%. With this information, we have the opportunity to gain perspectives on the effectiveness of the play and the general concerns of incoming students. After reading, counting, and numbering these cards, some themes began to emerge. Questions of consent, consequences of underage drinking, alcohol, drugs, roommate issues, and various campus services rose to the top.

## Sampling Procedure

Due to the size of the dataset and available resources, the decision was made to draw a sample from the full dataset (N = 7,588). Since we knew the size of the population, the desired sample size was calculated using the finite population correction:
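A common way to write this calculation is Cochran's sample-size formula followed by the finite population correction. The form below is an assumption, chosen because it reproduces the sample size reported in this section. Here $z$ is the critical value for the chosen confidence level, $p$ is the assumed population proportion, $e$ is the margin of error, and $N$ is the population size:

$$
n_0 = \frac{z^2 \, p(1 - p)}{e^2}, \qquad n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}
$$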

Because this analysis is exploratory in nature, the population proportion is unknown, so we chose the most conservative estimate of .5, resulting in a desired sample of 366 cards:
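The arithmetic can be checked with a short script. This is a minimal sketch in Python rather than the R environment used for the analysis, and it assumes Cochran's formula with the finite population correction, z = 1.96 (95% confidence), and a 5% margin of error. The function name `cochran_sample_size` is introduced here for illustration only.

```python
import math


def cochran_sample_size(population, proportion=0.5, margin=0.05, z=1.96):
    """Sample size from Cochran's formula with the finite population correction.

    Assumed defaults: p = .5 (most conservative), 5% margin, z = 1.96.
    """
    # Sample size for an effectively infinite population
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    # Shrink it to account for the known, finite population
    n = n0 / (1 + (n0 - 1) / population)
    # Round up so the margin of error is not exceeded
    return math.ceil(n)


sample_size = cochran_sample_size(7588)
print(sample_size)  # 366
```

With N = 7,588 the uncorrected value is about 384; the correction brings it down to 366.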

The sample itself was drawn across all observations using a random number generator. The size of this sample provides us with a ± 5% margin of error at the .05 level of significance, and all analysis was conducted strictly on the sampled data.
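A simple random sample of numbered cards, and the margin of error it implies, might be sketched as follows. This is an illustration in Python, not the generator actually used; the seed value and function names are hypothetical, and the margin-of-error formula assumes p = .5 and z = 1.96 with the finite population correction.

```python
import math
import random

POPULATION_SIZE = 7588  # numbered cards, from the text
SAMPLE_SIZE = 366       # desired sample, from the text


def draw_card_sample(population_size, sample_size, seed):
    """Draw card numbers without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    card_ids = range(1, population_size + 1)
    return sorted(rng.sample(card_ids, sample_size))


def margin_of_error(population_size, sample_size, proportion=0.5, z=1.96):
    """Margin of error for a proportion, with the finite population correction."""
    variance = proportion * (1 - proportion) / sample_size
    fpc = (population_size - sample_size) / (population_size - 1)
    return z * math.sqrt(variance * fpc)


sampled_ids = draw_card_sample(POPULATION_SIZE, SAMPLE_SIZE, seed=2019)  # arbitrary seed
moe = margin_of_error(POPULATION_SIZE, SAMPLE_SIZE)
print(len(sampled_ids), round(moe, 3))
```

Fixing the seed is a design choice that lets the same cards be re-drawn later if the sample ever needs to be audited.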

## Cleaning & Inspecting the Data

Once the cards were sampled, the text from each one was entered into Excel verbatim and open coded once again. Using RStudio, the text was cleaned by converting it to lowercase and removing punctuation. In text analysis, it is standard practice to remove extremely common words such as “if”, “and”, “but”, and “the,” as they have little to no value in determining key terms in the vocabulary. These words are called “stop words.”
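The cleaning steps can be sketched in a few lines. The example below is in Python rather than R, uses only the four stop words named above instead of a full stop-word list, and runs on a made-up response, so it illustrates the idea rather than reproducing the actual pipeline.

```python
import string

# Only the four examples named in the text; a real list would be much longer
BASIC_STOP_WORDS = {"if", "and", "but", "the"}


def clean_response(text, stop_words=BASIC_STOP_WORDS):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in stop_words]


# Hypothetical response, not taken from the sampled cards
print(clean_response("If the RA finds out, what happens?"))
# ['ra', 'finds', 'out', 'what', 'happens']
```

Note that `string.punctuation` covers ASCII punctuation only, so curly quotes typed into a spreadsheet would need separate handling.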

Zipf’s law, named after linguist and mathematician George Zipf, states that given a large sample of words, the frequency of any word is inversely proportional to its rank in the frequency table. This creates a long-tailed distribution as the number of words approaches infinity, with the most frequent word occurring approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. By removing “stop words”, we trim the bulk of the data along the Zipfian distribution and gain a greater understanding of what the central themes are in the data. Within this database, we removed standard English stop words in addition to a custom list of stop words to uncover the central themes within the data. Those words were “can”, “people”, “get”, “penn”, “state”, “student”, “thing”, “what”, “what’s”, “just”, “one”, “know”, “like”, “students”, and “campus”.
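To make both ideas concrete, the sketch below (Python, for illustration only) filters the custom stop words listed above, counts the remaining terms, and computes the count that an idealized Zipf curve would predict at a given rank. The token list is invented and the helper names are hypothetical.

```python
from collections import Counter

# Custom stop words from the text (straight apostrophe used for "what's")
CUSTOM_STOP_WORDS = {
    "can", "people", "get", "penn", "state", "student", "thing", "what",
    "what's", "just", "one", "know", "like", "students", "campus",
}


def term_frequencies(tokens, stop_words=CUSTOM_STOP_WORDS):
    """Count tokens that are not in the stop-word list."""
    return Counter(token for token in tokens if token not in stop_words)


def zipf_expected(top_count, rank):
    """Count predicted at a rank if frequency is exactly proportional to 1/rank."""
    return top_count / rank


# Invented tokens, already lowercased and stripped of punctuation
tokens = "what can students get in trouble for drinking drinking on campus".split()
freqs = term_frequencies(tokens)
print(freqs.most_common(1))   # [('drinking', 2)]
print(zipf_expected(120, 3))  # 40.0
```

Under an exact Zipf distribution, a top word seen 120 times would imply about 60 occurrences at rank two and 40 at rank three.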

Once “stop words” were removed, the result was 792 unique words, represented in a word cloud below:

Upon further inspection, we see that the most frequent words are associated with alcohol, safety, health services, consent, and the consequences of underage drinking. These can be seen in the frequency plot below:

## Developing Themes

After inspecting term frequency, open codes were revisited, resulting in 42 unique codes, with some observations requiring multiple codes. For example, one survey respondent wrote:

“I want to get involved in LGBT clubs / activities, but I am not out to my parents and they aren’t very accepting. Will they find out I am in those organizations (Through the internet or something)? Like, can they see what clubs I join?”

This example fell under the following four codes:

1. Student resources
2. FERPA / HIPAA / Privacy
3. LGBTQ Community, concerns, resources
4. Clubs

In addition, some responses didn’t fit within any theme and were simply labeled “miscellaneous / unrelated.” Examples included “What is your favorite color?” and “Who’s got the best gas on campus?”
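Because a single response can carry several codes, theme counts are tallied per code rather than per card. A minimal Python sketch of that tally follows; the three coded responses mirror the examples above, and the data structure is an assumption rather than the format actually used.

```python
from collections import Counter

# Each response is represented by the set of codes assigned to it
coded_responses = [
    {
        "Student resources",
        "FERPA / HIPAA / Privacy",
        "LGBTQ Community, concerns, resources",
        "Clubs",
    },
    {"Miscellaneous / unrelated"},
    {"Miscellaneous / unrelated"},
]


def tally_codes(responses):
    """Count how many responses received each code."""
    counts = Counter()
    for codes in responses:
        counts.update(codes)
    return counts


code_counts = tally_codes(coded_responses)
print(code_counts.most_common(1))  # [('Miscellaneous / unrelated', 2)]
```

One consequence of multi-coding is that the code counts sum to more than the number of responses (six codes across three responses here).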

Questions regarding alcohol, underage drinking, Responsible Action Protocol (RAP), and the consequences of alcohol/underage drinking occupied the largest portion of the data. In addition, questions of campus safety, student resources, residence halls, and consent emerged as central themes in the data. Feedback on the show was generally complimentary, with some questions regarding the intended meaning of scenes. For example, some participants cited the “ASMR scene” and the “Rollercoaster Scene” as areas of confusion. The distribution of the codes can be found in the frequency plot of all themes:

## Sentiment & Emotional Analysis

Sentiment analysis refers to the use of natural language processing, text analysis, and computational linguistics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews of products and services, as well as open-ended survey data.

With RStudio, we have access to open-source packages that use natural language processing and text analysis to examine the sentiment and emotional content of each observation. Using the sentimentr package, text files were analyzed by comparing the data against known words that are associated with positive, negative, or neutral sentiments. Each sentence is extracted and scored from -1 to 1, with any sentence scored above .3 considered positive and any sentence scored below -.3 considered negative. Sentences scoring between -.3 and .3 are considered neutral. This data fell overwhelmingly in the neutral category, with a mean of .031, a median of 0, and a standard deviation of .243. When sentiment is examined by participant and ordered by time, patterns in the data suggest that the sentiment of the audience changed at times throughout the run of the show:
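The thresholding step can be illustrated independently of the scoring package. The Python sketch below assumes sentence scores have already been produced and uses invented values, so its summary statistics are not the ones reported above; `label_sentiment` is a hypothetical helper.

```python
import statistics


def label_sentiment(score, threshold=0.3):
    """Label a sentence score using the cutoffs described in the text."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Invented scores for illustration only
example_scores = [0.45, 0.0, -0.1, 0.25, -0.5, 0.0]

labels = [label_sentiment(score) for score in example_scores]
summary = {
    "mean": statistics.mean(example_scores),
    "median": statistics.median(example_scores),
    "stdev": statistics.stdev(example_scores),  # sample standard deviation
}
print(labels)
print(summary)
```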

Finally, we conducted an emotional analysis, once again using the sentimentr package in RStudio. This analysis compares words known to be associated with various emotions (anger, disgust, fear, joy, sadness, surprise, anticipation, trust, etc.) against the data set, classifying each sentence within an emotional category. Trust (n = 153) was the most common emotion, with more than twice the number of occurrences of the second most common emotion, joy (n = 71). Fear (n = 59), anticipation (n = 57), sadness (n = 38), surprise (n = 20), anger (n = 15), and disgust (n = 13) rounded out the emotions identified within the sampled data. These results are displayed in the frequency chart and pie chart below:
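The reported counts can be ranked and converted to shares with a few lines of Python. The counts are taken directly from the paragraph above; the shares are derived here for illustration and are relative to the classified sentences, not to the 366 sampled cards.

```python
# Emotion counts as reported in the text
emotion_counts = {
    "trust": 153,
    "joy": 71,
    "fear": 59,
    "anticipation": 57,
    "sadness": 38,
    "surprise": 20,
    "anger": 15,
    "disgust": 13,
}

total = sum(emotion_counts.values())
shares = {emotion: count / total for emotion, count in emotion_counts.items()}
ranked = sorted(emotion_counts.items(), key=lambda item: item[1], reverse=True)

for emotion, count in ranked:
    print(f"{emotion:<13}{count:>4}  {shares[emotion]:.1%}")
```

The ranking confirms the comparison made above: trust occurs more than twice as often as joy (153 versus 71).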

## Conclusion / Suggestions for the Future

The 2019 interactive theater experience, “Results Will Vary,” discussed some of the common pitfalls of first-year students, including sex, consent, alcohol, drug use, and peer pressure. Following each performance of the show, the incoming freshmen who viewed the production were asked: “What lingering questions do you have about the show?” These questions were written on index cards and collected at the end of each orientation session, resulting in 7,588 responses. Survey cards (N = 7,588) were then read, numbered, sampled (n = 366), transcribed, and analyzed. The results indicated that the top questions of students were related to the consequences of alcohol and underage drinking. In addition, questions regarding campus safety, student resources, residence halls, and consent emerged as central themes from the audience members. With the sentimentr package in RStudio, the text was analyzed for overall sentiment and emotional content, which suggested that students enjoyed the show, with the most common emotional response being “trust,” followed by “joy.” This production, which is written and performed by current Penn State students, provides an interesting model for engaging in difficult conversations.

As the Office of New Student Orientation and Penn State’s School of Theatre begin plans for the future, a survey instrument that includes both open- and closed-ended questions may provide a better window into student perceptions and understanding. In addition, implementation of an online questionnaire that can store and share results quickly through mobile devices should be considered. The inclusion of an online questionnaire, if properly executed, should also allow for more granularity within the data and increased speed of data collection and analysis.