This Step-by-Step post is the first in a series of 3 blogposts, each post will be written by an ESI analyst and used to explain how we use a particular software as it pertains to our business lines and past projects. This week, James Midkiff shows us how to use the statistical software R.
For those of you familiar with the internal happenings at ESI, you may already know that we are gradually shifting our work onto a new statistical software called “R”. For those who are not in the know, R is essentially a beefed-up version of Excel in that it performs many of the same tasks that excel does – merging data, data manipulation, graphing, etc. – but faster, with more options, and with larger data sets than excel can handle. Furthermore, it is code-based thus allowing you to alter and re-run your work again and again without destroying your original data set. Of course, it also does many tasks that Excel cannot, such as complex regression and statistical analysis. But the two biggest selling points of R, are that it is completely free and that the code is open-source, which allows people to design add-on packages to extend R’s functionality. My goal with this blog post is to offer you a window into data work here at ESI and how to use R for graphing.
A screenshot of the four important sections I see when working in R: These are (clockwise from top-left) the Source area where I compile my code, the Environment, which shows my datasets in use, the Plot area, which displays the graphs I make, and the Console, which shows the results of my code. Datasets stay behind the Source area unless I choose to make them a separate window.
The directors at ESI have sought out specific projects with which analysts can strengthen their R skills; for me this opportunity came with a new project from the Philadelphia Real Estate Council (PREC). PREC had conducted a survey from the second quarter (Q2) of 2011 to the present in which they asked real estate professionals whether they thought real estate conditions had worsened, stayed the same, or improved over the previous quarter and how they would change in the next quarter. The survey looked at:
1) General conditions for commercial real estate
2) Commercial rents (higher is better)
3) Commercial operating expenses (lower is better)
4) Capitalization rates (lower is better)
5) Credit Conditions
6) Personal gross assets (higher is better)
Our job was to see if there were any connections between how people thought real estate conditions would be and how real estate conditions actually turned out to be, and then to present this information visually.
The results from each survey quarter came in separate excel tabs, so the first task was to manually compile the data into one excel document. I then saved the information as a Comma-Separated-Value file (.csv) in order to transfer it to R. A CSV file makes it incredibly easy to transfer spreadsheet data between different software because it separates every cell of data with a comma, which the receiving system then knows to use as a data separator. CSV files are beautiful because they are a ubiquitous standard, and standardization equals efficiency. Once in R, I aligned the past-looking and future-looking questions together and generated a mean-sentiment value to graph for each question. The sentiment value ranged from -1 if everyone thought conditions had worsened to 1 if everyone thought conditions had improved. A mean sentiment of zero means either that everyone thought conditions had stayed the same or that as many people thought conditions had improved as those who thought conditions had worsened.
There are many ways to plot data in R; I used the popular add-on package ggplot2. What is remarkable about ggplot2 is that it allows you to construct plots via layering. I knew I wanted to create a line graph that also had points for each question’s mean sentiment value. In addition, I needed to group the future-looking responses separately from the past-looking responses. So I started with those three layers, which I will show on the graph. Any further edits to a layer I will then show in parentheses.
This plot looks like a monstrosity. The x-axis displays the year and the y-axis displays the mean sentiments, and I successfully grouped the future-looking and past-looking questions by color, but there is still way too much going on here. It thus makes sense to add a layer displaying the six different questions on different plots.
Much better! You can now start to see some of the trends in the data. But looking at it, I wanted to improve the graph’s clarity by better distinguishing the forward-looking and backward-looking groups and by improving the legend. Plus, it made sense to show the number of responses to each question since few responses to a question could cause the mean value to be an outlier.
First, I improved the grouping by giving the future-responses a light blue color, a high level of transparency, and a dashed-line style and by giving the past-responses a solid dark green line. I then went back to my first layer and sized the points based upon the number of responses to that question. Additionally, I shifted the points rightwards so that responses for Q1 2012, for instance, appear past the 2012 line rather than directly on the line as they had previously. Then I cleaned up the legends. With a few more touch-up effects, the plots should be done.
The Final Product:
Finished! I added a title, subtitle, and a dashed horizontal line where the mean sentiment equals zero. And with a single line of code, I applied the pre-loaded theme used by the news website FiveThirtyEight to spruce up the background and fonts. Looking at the graph, some interesting trends become clear:
- Respondents’ predictions for the next quarter closely mirrored how they actually felt after that quarter had passed, even if they had slightly more pessimistic predictions about capitalization rates or credit conditions than how those specific categories actually turned out.
- For five of the six questions, respondents thought conditions were improving from late 2012 through early 2016. Respondents almost always thought expenses had gotten worse, however.
- Except for Q1 2012 which appears to be an outlier, opinion tends to shift gradually over time rather than abruptly.
- And with the exception of General Conditions, respondents became less optimistic (or more pessimistic) about all categories from Q2 2016 through Q1 2017.
After I completed this work, Director Jonathan Tannen presented this graph and several others to PREC who appreciated the data visualization. Given that the code to make the graphs and perform the analysis is saved, it is now easy to update the results whenever ESI receives the latest quarter of PREC’s survey results. In fact, we have already received and inputted the predictions for Q2 2017, which are not in the above graph. I can tell you, however, that survey respondents are becoming more optimistic about Q2 2017 compared to their predictions for Q1 2017. Does this mean that real estate conditions will improve in the next quarter? Given the strong correlation between respondents’ predictions and actual conditions, it very well might.
Special thanks to Jonathan Tannen for his assistance with this project!
James Midkiff was previously a Research Analyst at Econsult Solutions, Inc. and joined ESI after graduating from George Washington University with double majors in International Affairs and Economics and a minor in Spanish.