Updates to language vs income breakdown post

bpodgursky Uncategorized August 22, 2013 / August 24, 2013 2 Minutes

Thanks to everyone who commented and read through my post last night. The post got a lot more attention than I expected (on Hacker News and Reddit at least). Many comments both here and on those threads quite reasonably pointed out problems with the data presented. I should have been much clearer initially about the caveats and issues, and put those at the front of the post instead of the end.

I’d like to try to address some of the concerns raised when possible, and be clear about which problems I don’t see an easy way of fixing:

Confidence intervals

Many commenters have noted that the results can’t be called significant without confidence measures. In retrospect, I should have calculated confidence intervals from the beginning instead of reporting only the mean values; I had incorrectly assumed that the n=100 cutoff would keep the error low enough to ignore, but that was a mistake. Below is an updated graph with 95% confidence intervals:

and the numbers:

Language	Mean	95% CI Lower	95% CI Upper	Samples
Puppet	87,589.29	77,726.24	97,452.33	112
Haskell	89,973.82	82,773.72	97,173.92	191
PHP	94,031.19	90,956.90	97,105.47	978
CoffeeScript	94,890.80	90,025.16	99,756.45	435
VimL	94,967.11	90,735.70	99,198.51	532
Shell	96,930.54	93,771.76	100,089.33	979
Lua	96,930.69	86,169.26	107,692.13	101
Erlang	97,306.55	88,631.11	105,981.98	168
Clojure	97,500.00	91,448.24	103,551.76	269
Python	97,578.87	95,481.64	99,676.10	2314
JavaScript	97,598.75	95,897.67	99,299.83	3443
Emacs Lisp	97,774.65	92,503.64	103,045.65	355
C#	97,823.31	94,116.76	101,529.86	665
Ruby	98,238.74	96,471.81	100,005.68	3242
C++	99,147.93	95,633.62	102,662.23	845
CSS	99,881.40	95,361.99	104,400.82	527
Perl	100,295.45	97,172.79	103,418.12	990
C	100,766.51	98,602.83	102,930.19	2120
Go	101,158.01	94,435.87	107,880.15	231
Scala	101,460.91	94,925.79	107,996.02	243
ColdFusion	101,536.70	93,627.35	109,446.05	109
Objective-C	101,801.60	97,560.43	106,042.77	562
Groovy	102,650.86	94,601.74	110,699.99	116
Java	103,179.39	100,474.36	105,884.42	1402
XSLT	106,199.19	96,887.72	115,510.65	123
ActionScript	108,119.47	99,297.36	116,941.58	113

As it turns out, the commenters who suspected that the top and bottom languages ended up there because of small samples were correct. Although the confidence intervals of the top and bottom groups don’t overlap, the difference is not as clear-cut as the means would suggest.

I’m going to try to gather more data for the sparsely represented languages to clean this up, and will update here when I have better numbers (this might take a while because of API rate limiting).
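For readers who want to reproduce this kind of interval, here is a minimal sketch of a 95% confidence interval for a mean. It assumes a normal approximation (mean ± 1.96 standard errors), which may not be exactly how the numbers above were produced; the function name `mean_confidence_interval` and the sample incomes are made up for illustration.

```python
import math
import statistics


def mean_confidence_interval(incomes, z=1.96):
    """Return (mean, lower, upper) using a normal approximation.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    n = len(incomes)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(incomes)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(incomes) / math.sqrt(n)
    half_width = z * sem
    return mean, mean - half_width, mean + half_width


# Made-up household incomes, purely for illustration (not from the data set).
sample = [62_000, 75_000, 88_000, 91_000, 104_000, 120_000, 135_000, 150_000]
mean, lower, upper = mean_confidence_interval(sample)
print(f"mean={mean:,.2f} lower={lower:,.2f} upper={upper:,.2f}")
```

Because the half-width scales with 1/sqrt(n), languages with few samples get much wider intervals: in the table above, Lua (101 samples) spans roughly ±10,800 around its mean while JavaScript (3443 samples) spans roughly ±1,700.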

Household Income vs Personal Income

Many commenters noted that these numbers use household income rather than personal income. This is a limitation of the data set I’m using rather than a choice; the Rapleaf API only returns household income. Rather than give up, I decided to use the household measure instead.

This is not ideal, but I don’t think it is a critical flaw; for this difference to skew the results, users of certain languages would need significantly different marriage patterns, or a tendency to marry richer / poorer spouses, relative to users of other languages. This is not impossible, but I think the results are still useful with this caveat in mind.

If anyone can suggest a data set with personal incomes I can use instead, I’ll gladly use those. Otherwise I’ll be more clear that the incomes are household rather than personal.

Correcting for Confounding Variables

The original numbers did not attempt to adjust for any other variables, some of the more obvious being age and location. It’s been suggested that I look into using partial dependence plots to separate out other variables. I’ll be taking a look at that over the next few days.
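Short of full partial dependence plots, a cruder way to check for an age effect is to compare languages within the same age bracket. The sketch below only illustrates that stratification idea and is not the method used for the numbers in this post; the record layout, the `mean_income_by_stratum` helper, and every value in `records` are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def mean_income_by_stratum(records, bracket_width=10):
    """Average income per (language, age bracket) cell.

    records is an iterable of (language, age, income) tuples.
    """
    cells = defaultdict(list)
    for language, age, income in records:
        # Floor the age to the start of its bracket, e.g. 34 -> 30.
        bracket = (age // bracket_width) * bracket_width
        cells[(language, bracket)].append(income)
    return {key: fmean(values) for key, values in cells.items()}


# Hypothetical records, invented for illustration only.
records = [
    ("Java", 31, 98_000),
    ("Java", 38, 110_000),
    ("Java", 45, 125_000),
    ("JavaScript", 24, 70_000),
    ("JavaScript", 33, 101_000),
    ("JavaScript", 36, 105_000),
]
strata = mean_income_by_stratum(records)
for (language, bracket), mean in sorted(strata.items()):
    print(f"{language:<12} {bracket}s  {mean:,.0f}")
```

Comparing only cells with the same bracket keeps an age gap between language communities from showing up as a language effect, at the cost of fewer samples per cell and therefore wider confidence intervals.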

Missing Languages

Unfortunately there’s not a lot I can do about many missing languages; many are not recognized by GitHub (SQL, among others). As I gather more data, I’ll include the languages which were omitted here because of sample size.

Thanks again to everyone who read and commented. I’m going to process the lessons here and be more careful when posting numbers in the future (I’d still like to give similar breakdowns for gender and age soon.)


18 thoughts on “Updates to language vs income breakdown post”

Pingback: Average Income per Programming Language | Ben Podgursky
Pan-Wei Ng says:

August 22, 2013 at 12:09 pm

I think the median income would be more useful than average to take account of statistical outliers. At the national level (not just the software development), most countries talk about median income. It would be great if you can provide a chart of median income.

Chris says:

August 22, 2013 at 2:11 pm

You should have made this chart and table using D3!

1. bpodgursky says:
  
  August 22, 2013 at 5:40 pm
  
  I thought about that, but I don’t think I’m able to use d3 here since it requires javascript; I’d have to host wordpress myself.
  
Dmitri says:

August 22, 2013 at 4:06 pm

Interesting. Among the comments on the previous post, the correlation with programmers’ age touched on a significant point, but I suspect that the geographic correlation is much more significant.
There is quite some difference in incomes between the US, Eastern Europe, India, etc. Different social and age trends would also lead to different patterns in how people from different regions contribute to Open Source projects.
And of course, commercial vs. open source: the absence of Cobol says a lot about this difference. 🙂

1. bpodgursky says:
  
  August 24, 2013 at 4:04 am
  
  I would need to double check, but I suspect the income data I’m using here is almost exclusively US (another thing I should have mentioned originally), in which case it wouldn’t be thrown off by international incomes. Of course, within the US there are definitely differences as well…
  
nathan says:

August 22, 2013 at 4:07 pm

“…but I don’t think it is a critical flaw; for this difference to skew the results, authors of certain languages would need significantly different marriage patterns or a tendency to marry richer / poorer spouses relative to other languages.”

I think you may be missing the most important contributor to a discrepancy between household and individual income and that is whether or not a person is married/cohabitating at all. If there is any truth to the conventional wisdom that some languages are hip and favored by the young crowd then this would likely significantly distort the income numbers because young people are less likely to be married. Therefore, you would largely be considering individual incomes versus household incomes when comparing a language favored by the young versus one favored by the experienced.

You point out the data we really want may not be available. To this I only wish to add that sometimes misleading data is worse than no data at all.

PS, a lot of your traffic may have been due to the fact that your original post was featured in the CodeProject newsletter. That’s how I found you. They have 10 million members, although not all of them subscribe to the newsletter.

1. bpodgursky says:
  
  August 24, 2013 at 4:15 am
  
  True, that is a possibility. Possibly that could be mitigated by comparing language + age combinations specifically, to compare the incomes of 30 year old JS developers vs 30 year old Java developers?
  
  Thanks for pointing out CodeProject, I actually hadn’t heard of it before. I just signed up for the newsletter.
  
Vadim says:

August 23, 2013 at 10:48 pm

I would also sort languages with respect to the length of the confidence intervals. Did you try resampling to estimate uncertainty?

unperson102 says:

August 24, 2013 at 10:17 am

why did you delete my comment?

1. bpodgursky says:
  
  August 24, 2013 at 11:42 am
  
  I didn’t delete any comments, they just don’t appear until approved.
  
unperson102 says:

August 24, 2013 at 10:22 am

A huge problem with this data is the absence of outliers. Look at how closely clustered the data is. Especially considering that this is household income, there should be a higher variation–the data should be more spread out.

Language Mean Min Max Samples

Python 97,578 95,481 99,676 2314
JavaScript 97,598 95,897 99,299 3443
Ruby 98,238 96,471 100,005 3242

In this update, you included the min and max. I have selected three languages for brevity from the updated data you provided. In Ruby, there is a min of 96K and a max of 100K, for example. There is no way that if you sample 3242 Ruby programmers, none is going to make less than 96K and none will make more than 100K. Just not going to happen. There will be a much, much wider variation in incomes, especially for household income, because some will be married, some spouses will work, and some will have better-paying jobs.

But even if the data were for programmer incomes alone, the data should be more spread out.

Same goes for all the other languages.
There is something very wrong with this data.

1. bpodgursky says:
  
  August 24, 2013 at 11:41 am
  
  Sorry, that was labeled confusingly before. Those are just the upper and lower confidence bounds from the graph above, not overall min / max points.
  
  1. unperson102 says:
    
    August 24, 2013 at 5:23 pm
    
    so where is the data?
    
Robert Burlingame says:

August 25, 2013 at 2:15 am

You are missing Fortran. This would get some really ancient programmers, and probably high earners.

infoRene says:

September 2, 2013 at 1:11 am

The data (I guess) mostly reflects web applications that do multimedia and social networking. We need serious (like banks/insurance firms) online business applications that require “solid” programming languages. Well, Java is definitely there, but where are J2EE, JSP, and even Cobol, ASP or ASP.net??

1. bpodgursky says:
  
  September 2, 2013 at 2:04 am
  
  J2EE and JSP projects should count as Java projects I believe (I don’t see any specific label for them on GitHub.)
  
  GitHub doesn’t recognize Cobol or ASP.NET as languages in their language breakdowns. ASP has a few projects but not many.
  
Donnie Berkholz says:

September 19, 2013 at 3:50 pm

Interesting how this doesn’t really agree with another post using salary data from Indeed: http://blog.panictank.net/passionate-programmer-programming-language-and-wage-premium/.
