Data Comparison on Cost of Living in Different States
The skills the author demonstrates here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Contributed by Joseph Wang. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which runs from April 11th to July 1st, 2016. This post is based on his second class project, R Shiny (due in the 4th week of the program).
Motivation:
In the past few years, I have traveled across the country as a postdoctoral and industrial researcher. This is due in part to the harshness of the permanent academic job market, and also to the downturn in the energy industry, especially back in my hometown of Houston, Texas. The most immediate need for my family is to find a new job opportunity in an area that is affordable for long-term living. It occurred to me that it would be great to have a Shiny data application in R that would be useful for people who want or need to relocate.
My goal is to show some facts that may go against intuition about how cost of living standards and salary ranges relate in different cities. I would also like to examine whether data scientists are paid fairly relative to living standards, and hopefully suggest the top cities for data scientists in terms of the realistic aspects of life.
Data Sources:
I used two data sets for my analysis. The first is the set of cost of living indices for 325 cities nationwide in 2010. The index for each category is expressed relative to a national average of 100%. The overall index is made up of 13% grocery items, 29% housing, 10% utilities, 12% transportation, 4% health care, and 32% miscellaneous goods and services. The data can be accessed through the following website: "http://www.infoplease.com/business/economy/cost-living-index-us-cities.html".
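To make the category shares concrete, here is a minimal Python sketch (an illustration only, not the R code behind the application). It assumes the overall index is a plain weighted average of the six category indices using the shares listed above. The function name `composite_index` and the example city values are hypothetical and are not taken from the data set.

```python
# Category shares as stated in the text; they sum to 100%.
WEIGHTS = {
    "grocery": 0.13,
    "housing": 0.29,
    "utilities": 0.10,
    "transportation": 0.12,
    "health_care": 0.04,
    "misc": 0.32,
}


def composite_index(category_indices):
    """Weighted average of per-category indices (national average = 100).

    Assumption: the composite is a simple weighted sum of the categories.
    """
    missing = set(WEIGHTS) - set(category_indices)
    if missing:
        raise KeyError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[name] * category_indices[name] for name in WEIGHTS)


# Hypothetical city, values invented purely for illustration.
example_city = {
    "grocery": 110.0,
    "housing": 150.0,
    "utilities": 95.0,
    "transportation": 105.0,
    "health_care": 100.0,
    "misc": 102.0,
}
print(round(composite_index(example_city), 1))
```

Under this assumption, a city that sits at exactly 100 in every category gets a composite of 100, which matches the national-average convention described above.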
The other data set is the average income for data scientists in major cities in 2016. It covers San Jose, CA; San Francisco, CA; Seattle, WA; New York City (Manhattan), NY; San Diego, CA; Boston, MA; Los Angeles-Long Beach, CA; Austin, TX; Chicago, IL; Atlanta, GA; Minneapolis, MN; and the Washington, D.C. metropolitan area. Even though the two data sets were not collected in the same year, this should not change our interpretation, which is mostly based on relative measures among cities.
Application Demo:
Here I introduce the application visually to explain its features and how to operate it. In the upper left corner, the scroll bar sets the income on which one can live comfortably before relocation. The city where one currently lives (in blue) and the city one plans to move to (in red) can be chosen with the selection widget below the scroll bar. Once the cities are selected, a bar chart shows the anticipated salary, estimated from the ratio of the two cities' overall composite living indices.
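The salary estimate described above can be sketched in a few lines of Python (an illustration, not the application's R code). It assumes the anticipated salary is the current salary scaled by the ratio of the destination city's composite index to the current city's composite index. The function name `equivalent_salary` and the index values in the example are hypothetical.

```python
def equivalent_salary(current_salary, index_current, index_target):
    """Scale a salary by the ratio of two composite cost of living indices.

    Assumption: equal purchasing power means salary is proportional to the
    composite index (national average = 100).
    """
    if index_current <= 0 or index_target <= 0:
        raise ValueError("composite indices must be positive")
    return current_salary * index_target / index_current


# Hypothetical move from a city at the national average (100)
# to a city whose composite index is 125.
print(equivalent_salary(100_000, 100.0, 125.0))
```

Because only the ratio of the two indices matters, the estimate does not depend on what the national average spending actually is in dollars.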
A second bar chart, titled actual median annual salary for data scientists in 2016, follows immediately and shows how data scientists are actually paid relative to the living standard in the corresponding cities. A city name appears at the top of the panel when data scientist income data exists for that city.
At the top of the right-hand section, the living cost indices for the individual components are shown, with a value of 100 representing the national average across all the cities in the survey. Another nice feature is the integration with Google Maps, which shows the geographical location of the destination. Using this map, one can zoom in to check the neighborhoods and businesses of the city. This helps in judging the quality of life, such as shopping and amenities, in each city one is considering moving to.
Data Exploration through Shiny GUI applications:
In this section, I use the application to evaluate whether data scientists are paid fairly relative to national living standards. As an extreme case, we see that data scientists in New York City are not paid in line with the overall living cost. Fixing the salary at the average income in San Francisco, CA (approximately $120k), we would expect data scientists in New York City to be paid an average annual wage above $150k.
However, the actual pay in New York City in 2016 is only about $100k. One can repeat the exercise by changing the destination city to the Washington, D.C. area. I found that data scientists in D.C. are among the worst-paid groups in the nation's major cities.
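The New York City comparison can be written out as simple arithmetic. The Python sketch below uses only the rounded figures quoted above (a $120k baseline in San Francisco, an expected salary of at least $150k, and an actual salary of about $100k), so its results are approximate. The helper names are hypothetical.

```python
def implied_index_ratio(baseline_salary, expected_salary):
    """Ratio of composite indices implied by a baseline and an expected salary."""
    return expected_salary / baseline_salary


def pay_gap(expected_salary, actual_salary):
    """Actual minus expected salary; a negative value means underpayment."""
    return actual_salary - expected_salary


# Rounded figures quoted in the text; $150k is a lower bound on the expectation.
baseline_sf = 120_000
expected_nyc = 150_000
actual_nyc = 100_000

ratio = implied_index_ratio(baseline_sf, expected_nyc)
gap = pay_gap(expected_nyc, actual_nyc)
print(f"implied NYC/SF index ratio: at least {ratio:.2f}")
print(f"gap between actual and expected pay: {gap:,} or worse")
```

With these rounded inputs, the actual salary falls short of the cost-adjusted expectation by at least $50k.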
As for cities where data scientists are paid fairly, Seattle, Washington looks pretty good: even though its actual median salary is low compared with other cities, the lower living cost justifies it. One can end up with an additional $10k in annual savings in Seattle. From my analysis, Atlanta, Georgia; Austin, Texas; and Chicago, Illinois are places where data scientists can have far fewer financial concerns in the long run.
The other interesting aspect I found is that data scientist salaries seem to follow the trend of the living standard once the housing factor is taken out, as illustrated in the application demo. It seems to me that companies do not factor the employee's home ownership cost into their salary offers. As one can imagine, living costs are dominated by housing, so these salaries are not enough to compensate for the cost of living in highly populated cities.
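One way to take the housing factor out is to drop the housing category and renormalize the remaining shares. The Python sketch below illustrates that idea. It is an assumption about how such an index could be computed, not necessarily how the application does it, and the function name and example city values are hypothetical.

```python
# Category shares as stated in the data source description.
WEIGHTS = {
    "grocery": 0.13,
    "housing": 0.29,
    "utilities": 0.10,
    "transportation": 0.12,
    "health_care": 0.04,
    "misc": 0.32,
}


def composite_excluding(category_indices, excluded="housing"):
    """Weighted average over all categories except `excluded`.

    The remaining shares are rescaled so that they sum to 1 again.
    """
    if excluded not in WEIGHTS:
        raise KeyError(f"unknown category: {excluded}")
    kept = {name: w for name, w in WEIGHTS.items() if name != excluded}
    total_weight = sum(kept.values())
    weighted_sum = sum(w * category_indices[name] for name, w in kept.items())
    return weighted_sum / total_weight


# Hypothetical city: average everywhere except very expensive housing.
expensive_housing_city = {name: 100.0 for name in WEIGHTS}
expensive_housing_city["housing"] = 300.0
print(round(composite_excluding(expensive_housing_city), 1))
```

In this hypothetical case the housing-free index stays at the national average even though housing costs three times the average, which is the kind of gap between salaries and full living costs discussed above.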
Conclusions and discussions:
In this project, I demonstrated a trial version of a GUI application for living cost estimation in R. For research purposes, I showed how one can explore the correlation between living costs and earnings for data scientists. For those who need to relocate, the estimator can also give a sense of the relative living cost differences between cities. It would be interesting to see whether our conclusions hold for occupations other than data scientist. The living cost indices alone provide only relative information between cities.
To draw absolute, quantitative conclusions, one would need to find the national average spending in each category. With that additional information, one could build a statistical machine learning model to predict the actual salary earned based on the spending in each category of the indices. The main assumption in this study is that the relative differences in living costs are not time sensitive within a 6-year range.