If you were trained in a social science, there is a good chance that you have had to use, or still use, SPSS. SPSS is a powerful program and arguably the standard in academia for most social science research. The Statistical Package for the Social Sciences (SPSS) is menu-driven software that is easy enough to learn – as long as teachers provide assignments and tutorials – to get you analyzing data quickly. In other words, SPSS is a good program for statistical analysis, and the barrier to entry is not too high, making it a great fit for social scientists who are just learning statistics. However, SPSS does have its limitations, and there are options out there that are not only more robust but also completely open source (and free to use).
Open Source Software:
One of the advantages of menu-driven statistical software has been its ease of use. Fortunately, many of the open source options, like Python and R, have come a long way towards lowering the learning curve in recent years. Graphical environments such as RStudio and Jupyter Notebook (among many others) have made these languages much easier to learn, and they come with many features unavailable in menu-driven programs like SPSS. In addition, when you work on a project in Python or R, your analysis is saved as a script, which lets you replicate, expand, and share it easily. Finally, what really sets languages like R and Python apart, besides being free, is that there are a large number of packages available from a community of people who like to share and help answer problems. This culture of learning and sharing has created options for advanced statistical analysis and modeling, including machine learning, in addition to highly customizable data visualizations and tools for extracting, transforming, and loading (ETL) data to create efficient data pipelines.
The ability to be transparent in your data analysis cannot be stressed enough. In statistics, it is easy to put a variable in the wrong place or select the wrong test and still get results that are statistically significant (and wrong). Typically, the people who read your analysis don’t get to see your raw data or the steps you went through to analyze it. Black box programs like SPSS only compound this. With language-based tools like R and Python, every step of your work is written down in the script, which gives the process greater credibility. Letting other researchers see how you reached your conclusion, or at least giving them the ability to, only strengthens your research and analysis.
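To make the transparency point concrete, here is a minimal sketch of what a scripted analysis can look like in Python. The scores, group names, and the choice of Welch's t statistic are all made up for illustration; the point is only that the data, the summary step, and the test formula are written out where a reader can check them.

```python
import math
import statistics

# Hypothetical scores for two groups (illustrative numbers only).
control = [72, 75, 78, 71, 69, 74]
treatment = [78, 82, 80, 77, 85, 79]


def describe(scores):
    """Return sample size, mean, and sample standard deviation."""
    return len(scores), statistics.mean(scores), statistics.stdev(scores)


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    n_a, mean_a, sd_a = describe(a)
    n_b, mean_b, sd_b = describe(b)
    # Standard error of the difference in means, written out explicitly.
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_b - mean_a) / standard_error


if __name__ == "__main__":
    for name, scores in [("control", control), ("treatment", treatment)]:
        n, mean, sd = describe(scores)
        print(f"{name}: n={n}, mean={mean:.2f}, sd={sd:.2f}")
    print(f"Welch t = {welch_t(control, treatment):.2f}")
```

Anyone who receives this file can rerun it, see which group was subtracted from which, and swap in a different test if they disagree with the choice made here.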
One more advantage of learning a code-based language like Python or R is that those skills are in demand. If you are a graduate student or an academic who is looking for jobs outside of the academy, data science skills are very marketable. Industries everywhere want to gain greater insight into their sales, operations, and consumer base. In addition, the federal government and many academic institutions need people with the skills to handle large databases that are not easily compatible with many menu-driven software programs like SPSS.
What about SAS and Stata?
Both SAS and Stata are statistical software packages that rely on a code-based language. These programs are not free, but they are very powerful, offer a lot of options, and work with large datasets. For years, SAS has been the preferred software in manufacturing and the medical field, while Stata is often used in political science and survey research. Consequently, some organizations keep using them simply because they don’t want to spend the time and money to train their employees in other languages, not to mention the legacy code they have relied on for years, if not decades. The big drawback of these two programs, besides cost, is that you can’t easily access the packages available in open source languages like Python or R.
Versatility of Python and R
If you know how to program in Python or R, it is much easier to switch over to other languages. To code effectively in Python or R, you have to clearly understand your variables and the algorithm. In other words, you have to tell the software exactly what to do. Once you get past the basics, these languages are not difficult to understand. The problem for most people, though, is getting past the basics, which just takes some patience and, honestly, a lot of trial and error. Once you are comfortable summarizing, analyzing, and visualizing data with language-based software, you’ll have a deeper understanding of what is actually being done to the data. More importantly, you can easily transfer those skills to other programs.
Personally, I began on SPSS. It did more than what I needed for my classes in behavioral statistics. When I took a couple of applied statistics courses, we used Minitab, in part because it was developed and is still housed at Penn State (so it was free to us). Those two applied stats classes didn’t care what software you used, but they had clear tutorials that went along with our free access to Minitab, so I generally used that. Eventually, I started taking even more advanced stats classes, which worked exclusively in R. We were required to turn in all homework through R Markdown, knitting the document to HTML. We got up and running quickly with the mosaic package and were able to do some interesting analysis and data visualization right away. After I completed my graduate certificate in Applied Statistics, I just kept on learning how to code better and analyze data in R. Once I had a solid foundation in R, I was able to learn other languages like SQL, SAS, and Python fairly easily (when needed), because the conceptual foundation had already been set. As a result, I ended up being competitive for numerous jobs outside of academia in both government and industry.
Which option to choose?
This choice depends on what kind of work you see yourself doing. Personally, I prefer R because it works like a really fancy calculator, making it great for modeling. There are also great packages for visualization, like ggplot2, in addition to packages like dplyr for extracting, transforming, and loading (ETL) data. That being said, I have worked at places where everyone uses SAS, so I did too. Finally, most positions in data science and statistics now list Python (in addition to SQL) as a required language. So if I were just starting out, that is probably where I would invest my time. Relatedly, there is one software program you will almost never find in data job postings: SPSS.
Thanks for reading!