Thanks to everyone who read and commented on my post last night. The post got a lot more attention than I expected (on Hacker News and Reddit at least). Many comments, both here and on those threads, quite reasonably pointed out problems with the data presented. I should have been much clearer from the start about the caveats and issues, and put them at the front of the post instead of the end.
I’d like to address the concerns raised where possible, and be clear about which problems I don’t see an easy way of fixing:
Many commenters have noted that the results cannot be judged significant without confidence measures. In retrospect I should have calculated confidence intervals from the beginning instead of just the mean values; I had assumed that the n=100 cutoff would keep the error low enough to ignore, but that was a mistake. Below is an updated graph with 95% confidence intervals:
and the numbers:
As it turns out, the commenters who suspected that the top and bottom rankings were likely artifacts of small samples were correct. Although the confidence intervals of the top and bottom groups don’t overlap, the difference is not as clear-cut as the means would suggest.
I’m going to try to gather more data for the sparsely represented languages to clean this up, and will update here when I have better numbers (this might take a while because of API rate limiting).
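For anyone who wants to run the same kind of check on their own numbers, here is a minimal sketch of a 95% confidence interval for a mean, using the normal approximation (mean ± 1.96 standard errors). This is one common way to do it and not necessarily how the graph above was produced; the function names and the sample incomes are made up for illustration. Income data is usually skewed, so with small samples a bootstrap interval may be a better choice.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval.

    z=1.96 corresponds to roughly 95% coverage; this is an approximation
    that gets less reliable for small or heavily skewed samples.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    margin = z * sem
    return mean, mean - margin, mean + margin


def intervals_overlap(a, b):
    """True if two (mean, lower, upper) intervals share any values."""
    return a[1] <= b[2] and b[1] <= a[2]


# Made-up household incomes (in dollars) for two hypothetical languages.
lang_a = [52000, 61000, 58000, 75000, 49000, 66000]
lang_b = [88000, 93000, 79000, 101000, 85000, 97000]

ci_a = mean_confidence_interval(lang_a)
ci_b = mean_confidence_interval(lang_b)
print("lang_a:", ci_a)
print("lang_b:", ci_b)
print("overlap:", intervals_overlap(ci_a, ci_b))
```

Non-overlapping intervals are a rough signal that two groups differ, but overlapping intervals do not by themselves show that they are the same.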
Household Income vs Personal Income
Many commenters noted that these numbers use household income rather than personal income. This is a limitation of the data sets I’m using rather than a choice; the Rapleaf API only returns household income. Rather than give up, I decided to use the household measure instead.
This is not ideal, but I don’t think it is a critical flaw; for this difference to skew the results, authors working in certain languages would need significantly different marriage patterns, or a tendency to marry richer or poorer spouses, relative to authors working in other languages. This is not impossible, but I think the results are still useful with this caveat in mind.
If anyone can suggest a data set with personal incomes, I’ll gladly use it instead. Otherwise, I’ll make it clearer that the incomes are household rather than personal.
Correcting for Confounding Variables
The original numbers did not attempt to adjust for any other variables, some of the more obvious being age and location. It’s been suggested that I look into using partial dependence plots to separate out other variables. I’ll be taking a look at that over the next few days.
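To make the suggestion concrete, here is a rough sketch of what a partial dependence calculation does: for each value of one feature, overwrite that feature in every row, run the model, and average the predictions. Everything below is hypothetical (the `toy_model` function, the feature names, and the rows); a real analysis would use a model fitted to the actual data.

```python
from statistics import fmean


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`.

    `predict` maps one row (a dict) to a number; `rows` is a list of dicts.
    Other features keep their observed values, so their effect is averaged out.
    """
    curve = []
    for value in grid:
        # Copy every row with the chosen feature overwritten.
        modified = [{**row, feature: value} for row in rows]
        curve.append((value, fmean([predict(r) for r in modified])))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a fitted income model."""
    bonus = 5000 if row["language"] == "lang_b" else 0
    return 1000 * row["age"] + bonus


# Made-up rows: the lang_b users happen to be older.
rows = [
    {"age": 20, "language": "lang_a"},
    {"age": 30, "language": "lang_b"},
    {"age": 40, "language": "lang_b"},
]

for value, avg in partial_dependence(toy_model, rows, "language", ["lang_a", "lang_b"]):
    print(value, avg)
```

In this toy data the raw group means differ by far more than the language effect built into the model, because age and language are correlated; the partial dependence curve isolates the language contribution.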
Unfortunately, there’s not a lot I can do about many of the missing languages, since GitHub does not recognize them (SQL, among others). As I gather more data, I’ll include the languages that were omitted here because of sample size.
Thanks again to everyone who read and commented. I’m going to take the lessons here to heart and be more careful when posting numbers in the future (I’d still like to give similar breakdowns for gender and age soon).