Update 8/21: I’ve gotten a lot of feedback about issues with these rankings from comments, and have tried to address some of them here. The data there has been updated to include confidence intervals.
A few weeks ago I described how I used Git commit metadata plus the Rapleaf API to build aggregate demographic profiles for popular GitHub organizations (blog post here, per-organization data available here).
I was also interested in slicing the data somewhat differently, breaking down demographics per programming language instead of per organization. Stereotypes about developers of various languages abound, but I was curious how these lined up with reality. The easiest place to start was age, income, and gender breakdowns per language. Given the data I’d already collected, this wasn’t too challenging:
- For each repository, I used GitHub’s estimate of the repository’s language composition. For example, GitHub estimates this project at 75% Java.
- For each language, I aggregated incomes for all developers who have contributed to a project which is at least 50% that language (by the above measure).
- I filtered for languages with > 100 available income data points.
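The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code actually used for this post: the function name, the input dictionaries, and the choice to count each developer once per language are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

MIN_LANGUAGE_SHARE = 0.5  # a repo counts toward a language at >= 50%
MIN_DATA_POINTS = 100     # keep only languages with > 100 income data points


def average_income_by_language(repo_languages, repo_contributors, incomes,
                               min_share=MIN_LANGUAGE_SHARE,
                               min_points=MIN_DATA_POINTS):
    """Return {language: (average_income, data_points)}.

    Hypothetical inputs (assumed shapes, not a real API):
      repo_languages:    {repo: {language: share between 0 and 1}}
      repo_contributors: {repo: iterable of developer ids}
      incomes:           {developer id: household income}, partial coverage
    """
    # Steps 1 and 2: collect the developers who contributed to any repo
    # that is at least `min_share` of a given language.
    # Assumption: each developer is counted once per language.
    devs_by_language = defaultdict(set)
    for repo, shares in repo_languages.items():
        for language, share in shares.items():
            if share >= min_share:
                devs_by_language[language].update(
                    repo_contributors.get(repo, ()))

    results = {}
    for language, devs in devs_by_language.items():
        # Only developers with known income contribute a data point.
        values = [incomes[dev] for dev in devs if dev in incomes]
        # Step 3: drop languages with too few data points.
        if len(values) > min_points:
            results[language] = (mean(values), len(values))
    return results
```

Sorting `results.items()` by the average then gives the lowest-to-highest ordering. Passing a smaller `min_points` is handy for trying the function on toy data.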
Here are the results for income, sorted from lowest average household income to highest:
|Language|Average Household Income ($)|Data Points|
Here’s the same data in chart form:
Most of the language rankings were roughly in line with my expectations, to the extent I had any:
- Haskell is a very academic language, and academia is not known for generous salaries
- PHP is a very accessible language, and it makes sense that casual / younger / lower paid programmers can easily contribute
- On the high end of the spectrum, Java and ActionScript are used heavily in enterprise software, and enterprise software is certainly known to pay well
On the other hand, I’m unfamiliar with some of the other languages on the high/low ends like XSLT, Puppet, and CoffeeScript. Any ideas on why these languages ranked higher or lower than average?
Some caveats before drawing too many conclusions from the data here:
- These are all open-source projects, which may not accurately represent compensation among closed-source developers
- Rapleaf data does not have total income coverage, and the sample may be biased
- I have not corrected for any other skew (age, gender, etc.)
- I haven’t crawled all repositories on GitHub, so the users for whom I have data may not be a representative sample
That said, even though the absolute numbers may be biased, I think this is a good starting point when comparing relative compensation between languages.
Let me know if you have any thoughts or suggestions about the methodology or the results. I’ll follow up soon with age and gender breakdowns per language in a similar fashion.