Thanks to everyone who read and commented on my post last night. The post got a lot more attention than I expected (on Hacker News and Reddit at least). Many comments, both here and on those threads, quite reasonably pointed out problems with the data presented. I should have been much clearer initially about the caveats and issues, and put them at the front of the post instead of the end.
I’d like to address the concerns raised where possible, and be clear about which problems I don’t see an easy way of fixing:
Many commenters noted that the results can’t be judged significant without confidence measures. In retrospect I should have calculated confidence intervals from the beginning instead of reporting only the mean values; I had assumed that the n=100 cutoff would keep the error low enough to ignore, but that was a mistake. Below is an updated graph with 95% confidence intervals:
and the numbers:
As it turns out, the commenters who suggested that the top and bottom rankings were likely artifacts of small samples were correct. Although the confidence ranges of the top and bottom groups don’t overlap, the difference is not as clear-cut as the means would suggest.
I’m going to try to gather more data for the sparsely represented languages to clean this up, and will update here when I have better numbers (this might take a while because of API rate limiting).
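For readers who want to check this kind of interval themselves, here is a minimal sketch of a 95% confidence interval for a mean, plus an overlap check between two groups. It assumes a normal approximation (z = 1.96), which may not be how the intervals above were produced, and the function names and sample values are hypothetical.

```python
import math
import statistics


def mean_confidence_interval(incomes, z=1.96):
    """Return (mean, lower, upper) using a normal approximation.

    z=1.96 corresponds to an approximate 95% interval; this is an
    assumption, not necessarily the method used for the graph above.
    """
    n = len(incomes)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(incomes)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    std_err = statistics.stdev(incomes) / math.sqrt(n)
    return mean, mean - z * std_err, mean + z * std_err


def intervals_overlap(a, b):
    """True if two (mean, lower, upper) intervals share any range."""
    return a[1] <= b[2] and b[1] <= a[2]


# Hypothetical incomes for illustration only (not the post's data).
group_a = [90_000, 100_000, 110_000]
group_b = [150_000, 160_000, 170_000]
ci_a = mean_confidence_interval(group_a)
ci_b = mean_confidence_interval(group_b)
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

The interval narrows with the square root of the sample size, which is why languages with only a few hundred samples get much wider ranges than languages with several thousand.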
Household Income vs Personal Income
Many commenters noted that these numbers use household income rather than personal income. This is a limitation of the data sets I’m using rather than a choice; the Rapleaf API only returns household income. Rather than give up, I decided to use the household measure instead.
This is not ideal, but I don’t think it is a critical flaw; for this difference to skew the results, users of certain languages would need significantly different marriage patterns, or a tendency to marry richer or poorer spouses, relative to users of other languages. This is not impossible, but I think the results are still useful with this caveat in mind.
If anyone can suggest a data set with personal incomes, I’ll gladly use it instead. Otherwise I’ll be clearer that the incomes are household rather than personal.
Correcting for Confounding Variables
The original numbers did not attempt to adjust for any other variables, some of the more obvious being age and location. It’s been suggested that I look into using partial dependence plots to separate out other variables. I’ll be taking a look at that over the next few days.
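As a rough illustration of what adjusting for one variable can look like, the sketch below compares mean income within each language and age band, so that languages are only compared against each other inside the same age group. This is plain stratification, a simpler stand-in for the partial dependence plots mentioned above, and the record layout (language, age, income) and all values are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def age_band(age, width=10):
    """Map an age to a band label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def mean_income_by_language_and_age(records, width=10):
    """Group (language, age, income) records and average each group.

    Returns a dict keyed by (language, age band).
    """
    groups = defaultdict(list)
    for language, age, income in records:
        groups[(language, age_band(age, width))].append(income)
    return {key: fmean(values) for key, values in groups.items()}


# Hypothetical records for illustration only (not the post's data).
records = [
    ("Java", 31, 100_000),
    ("Java", 38, 120_000),
    ("JavaScript", 33, 90_000),
    ("JavaScript", 24, 70_000),
]
for key, mean in sorted(mean_income_by_language_and_age(records).items()):
    print(key, mean)
```

Each added variable splits the samples into smaller groups, so the confidence intervals within each group get wider; with small per-language samples this limits how many variables can be separated this way.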
Unfortunately there’s not a lot I can do about many of the missing languages; several are not recognized by GitHub (SQL, among others). As I gather more data, I’ll include the languages which were omitted here because of sample size.
Thanks again to everyone who read and commented. I’m going to process the lessons here and be more careful when posting numbers in the future (I’d still like to give similar breakdowns for gender and age soon).
18 thoughts on “Updates to language vs income breakdown post”
I think the median income would be more useful than the average, to account for statistical outliers. At the national level (not just in software development), most countries talk about median income. It would be great if you could provide a chart of median income.
You should have made this chart and table using D3!
Interesting. Among the comments on the previous post, the correlation with programmers’ age touched on a significant point, but I suspect that the geographic correlation is much more significant.
There is quite some difference in incomes between US, Eastern Europe, India, etc. Different social and age trends would also lead to different patterns in people from different regions contributing to Open Source projects.
And of course, commercial vs. open source: the absence of Cobol tells heaps about this difference. 🙂
I would need to double check, but I suspect the income data I’m using here is almost exclusively US (another thing I should have mentioned originally), in which case it wouldn’t be thrown off by international incomes. Of course, within the US there are definitely differences as well…
“…but I don’t think it is a critical flaw; for this difference to skew the results, authors of certain languages would need significantly different marriage patterns or a tendency to marry richer / poorer spouses relative to other languages.”
I think you may be missing the most important contributor to a discrepancy between household and individual income and that is whether or not a person is married/cohabitating at all. If there is any truth to the conventional wisdom that some languages are hip and favored by the young crowd then this would likely significantly distort the income numbers because young people are less likely to be married. Therefore, you would largely be considering individual incomes versus household incomes when comparing a language favored by the young versus one favored by the experienced.
You point out the data we really want may not be available. To this I only wish to add that sometimes misleading data is worse than no data at all.
PS, a lot of your traffic may have been due to the fact that your original post was featured in the CodeProject newsletter. That’s how I found you. They have 10 million members, although not all of them subscribe to the newsletter.
True, that is a possibility. It could perhaps be mitigated by comparing language + age combinations specifically, for example the incomes of 30-year-old JS developers vs 30-year-old Java developers?
Thanks for pointing out CodeProject, I actually hadn’t heard of it before. I just signed up for the newsletter.
I would also sort languages with respect to the length of the confidence intervals. Did you try resampling to estimate uncertainty?
why did you delete my comment?
I didn’t delete any comments, they just don’t appear until approved.
A huge problem with this data is the absence of outliers. Look at how closely clustered the data is. Especially considering that this is household income, there should be a higher variation–the data should be more spread out.
Language Mean Min Max Samples
Python 97,578 95,481 99,676 2314
Ruby 98,238 96,471 100,005 3242
In this update, you included the min and max. I have selected three languages for brevity from the updated data you provided. In Ruby, there is a min of 96K and a max of 100K, for example. There is no way that if you sample 3242 Ruby programmers, none is going to make less than 96K and none will make more than 100K. Just not going to happen. There will be a much, much wider variation in incomes, especially for household income, because some will be married, some spouses will work, and some will have better-paying jobs.
But even if the data were for programmer incomes alone, the data should be more spread out.
Same goes for all the other languages.
There is something very wrong with this data.
Sorry, that was labeled confusingly before. Those are just the upper and lower confidence bounds from the graph above, not overall min / max points.
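To see why narrow bounds on the mean are compatible with widely spread incomes, one can back out the spread implied by the bounds. The sketch below assumes the bounds are a symmetric 95% normal-approximation interval (z = 1.96), which is an assumption about how they were computed; the function name is hypothetical.

```python
import math


def implied_std_dev(lower, upper, n, z=1.96):
    """Estimate the sample standard deviation implied by a CI on the mean.

    Assumes the interval is mean +/- z * (std_dev / sqrt(n)).
    """
    half_width = (upper - lower) / 2
    std_err = half_width / z
    return std_err * math.sqrt(n)


# Ruby row from the table quoted above: bounds 96,471 to 100,005, 3242 samples.
print(round(implied_std_dev(96_471, 100_005, 3242)))
```

Under that assumption, the Ruby row implies a standard deviation on the order of $50,000, so individual incomes are far more spread out than the bounds on the mean suggest.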
so where is the data?
You are missing Fortran. This would get some really ancient programmers, and probably high earners.
The data (I guess) only reflects mostly web applications that do multimedia and social networking. We need serious online business applications (like those of banks and insurance firms) that require “solid” programming languages. Java is definitely there, but where are J2EE, JSP, and even Cobol, ASP, or ASP.NET?
I believe J2EE and JSP projects should count as Java projects (I don’t see any specific label for them on GitHub).
GitHub doesn’t recognize Cobol or ASP.NET as languages in its language breakdowns. ASP has a few projects but not many.
Interesting how this doesn’t really agree with another post using salary data from Indeed: http://blog.panictank.net/passionate-programmer-programming-language-and-wage-premium/.