Taxi Loading at SFO

I usually avoid catching Taxis whenever possible, but when I arrived in SFO last week the trains were no longer running and I hadn’t arranged for a shuttle, so I ended up waiting in line to catch a Taxi.  The line was structured something like this:

Taxi line 1 (1)

  • There was a loading area about four cars long where Taxis were loading passengers
  • Would-be passengers waited in line along the curb to the left, waiting for a Taxi
  • Likewise, taxis waited in line for passengers on the other side of the curb
  • As people loaded into Taxis and departed, each line advanced to the right, matching the front of the Taxi line with the front of the passenger line
  • An airport employee stood stood near the front of the line, shepherding people and cabs around to enforce this flow

Of course, this felt like an extremely inefficient system; I was waiting next to a cab which was waiting for a passenger; had we been allowed, I would have just jumped in the cab next to me and we both would have been happier.  However, since the line of people was denser than the Taxi line, I would have been cutting in front of other people in line.

In college I took a couple classes where we learned about queuing algorithms and the standard trade-offs involved.  On the ride back I thought about how they applied to the Taxi-loading situation here:

  • Throughput: how many passengers per hour could the system match to Taxis?   This was not being optimized for, or I could have gotten into the Taxi beside me.
  • Fairness: this was pretty clearly what was being optimized for–both the Taxi line and the passenger line were being processed in First-In-First-Out (FIFO) ordering. 
  • Average wait time:  I don’t think wait time was being taken into account, especially since passengers with less luggage (and therefore faster loading) would have been given priority over passengers with many bags.

A couple other issues were specific to this situation:

  • The matching process should not involve an inordinate amount of walking by prospective passengers (a passenger should never have to walk the entire length of the Taxi queue to find a cab)
  • If cabs frequently have to pass other cabs to advance to the head of the queue, it increases the odds of an accident (or of getting run over, if you are loading your bags into the trunk.)

I’d like to think that a better system exists (“there has to be a better way!”), even if it sacrifices some amount of fairness, since clearly this system would scale poorly if the airport was busier.

If anyone knows of airports/malls/etc that do a better job, I’d be interested in knowing how they manage it.  I didn’t waste an enormous amount of time in line (~10 minutes), but if the line is on average 50 people long, that’s actually a huge amount of time being squandered over the course of a year.

Posted in Algorithms | 1 Comment

Updates to language vs income breakdown post

Thanks to everyone who commented and read through my post last night.  The post got a lot more attention than I expected (on hacker news and reddit at least).    Many comments both here and on those threads quite reasonably pointed out problems with the data presented.  I should have been a lot more clear initially about the caveats and issues, and put those at the front of the post instead of the end.

I’d like to try to address some of the concerns raised when possible, and be clear about which problems I don’t see an easy way of fixing:

Confidence intervals

Many commenters have noted that the results are not significant without including confidence measures.  In retrospect I should have calculated confidence intervals from the beginning instead of just the mean values; I had assumed incorrectly that the n=100 cutoff would keep the error low enough to ignore, but that was a mistake.  Below is an updated graph with 95% confidence intervals:


and the numbers:

Language Mean Lower Upper Samples
Puppet 87,589.29 77,726.24 97,452.33 112
Haskell 89,973.82 82,773.72 97,173.92 191
PHP 94,031.19 90,956.90 97,105.47 978
CoffeeScript 94,890.80 90,025.16 99,756.45 435
VimL 94,967.11 90,735.70 99,198.51 532
Shell 96,930.54 93,771.76 100,089.33 979
Lua 96,930.69 86,169.26 107,692.13 101
Erlang 97,306.55 88,631.11 105,981.98 168
Clojure 97,500.00 91,448.24 103,551.76 269
Python 97,578.87 95,481.64 99,676.10 2314
JavaScript 97,598.75 95,897.67 99,299.83 3443
Emacs Lisp 97,774.65 92,503.64 103,045.65 355
C# 97,823.31 94,116.76 101,529.86 665
Ruby 98,238.74 96,471.81 100,005.68 3242
C++ 99,147.93 95,633.62 102,662.23 845
CSS 99,881.40 95,361.99 104,400.82 527
Perl 100,295.45 97,172.79 103,418.12 990
C 100,766.51 98,602.83 102,930.19 2120
Go 101,158.01 94,435.87 107,880.15 231
Scala 101,460.91 94,925.79 107,996.02 243
ColdFusion 101,536.70 93,627.35 109,446.05 109
Objective-C 101,801.60 97,560.43 106,042.77 562
Groovy 102,650.86 94,601.74 110,699.99 116
Java 103,179.39 100,474.36 105,884.42 1402
XSLT 106,199.19 96,887.72 115,510.65 123
ActionScript 108,119.47 99,297.36 116,941.58 113

As it turns out, the commenters who noted that the top and bottom languages were likely because of small samples were correct.  Although the confidence ranges of the top and bottom groups don’t overlap, the difference is not as clear-cut as the means would suggest.

I’m going to try to gather some data from sparser-represented languages to clean this up, and will update here when I have better numbers (this might take a while because of API rate limiting.)

Household Income vs Personal Income

Many commenters noted that these numbers use household income rather than personal income.  This is a limitation of the data sets I’m using rather than voluntary; the Rapleaf API only returns household income.  Rather than give up I decided to use the household measure instead.

This is not ideal, but I don’t think it is a critical flaw; for this difference to skew the results, authors of certain languages would need significantly different marriage patterns or a tendency to marry richer / poorer spouses relative to other languages.  This is not impossible, but I think the results are still useful with this caveat in mind.

If anyone can suggest a data set with personal incomes I can use instead, I’ll gladly use those.  Otherwise I’ll be more clear that the incomes are household rather than personal.

Correcting for Confounding Variables

The original numbers did not attempt to adjust for any other variables, some of the more obvious being age and location.  It’s been suggested that I look into using partial dependence plots to separate out other variables.  I’ll be taking a look at that over the next few days.

Missing Languages

Unfortunately there’s not a lot I can do about many missing languages; many are not recognized by GitHub (SQL, among others).  As I gather more data, I’ll include the languages which were omitted here because of sample size.

Thanks again to everyone who read and commented.  I’m going to process the lessons here and be more careful when posting numbers in the future (I’d still like to give similar breakdowns for gender and age soon.)

Posted in Uncategorized | 18 Comments

Average Income per Programming Language

Update 8/21:  I’ve gotten a lot of feedback about issues with these rankings from comments, and have tried to address some of them here The data there has been updated to include confidence intervals.


A few weeks ago I described how I used Git commit metadata plus the Rapleaf API to build aggregate demographic profiles for popular GitHub organizations (blog post here, per-organization data available here).

I was also interested in slicing the data somewhat differently, breaking down demographics per programming language instead of per organization.  Stereotypes about developers of various languages abound, but I was curious how these lined up with reality.  The easiest place to start was age, income, and gender breakdowns per language. Given the data I’d already collected, this wasn’t too challenging:

  • For each repository I used GitHub’s estimate of a repostory’s language composition.  For example, GitHub estimates this project at 75% Java.
  • For each language, I aggregated incomes for all developers who have contributed to a project which is at least 50% that language (by the above measure).
  • I filtered for languages with > 100 available income data points.

Here are the results for income, sorted from lowest average household income to highest:

Language Average Household Income ($) Data Points
Puppet 87,589.29 112
Haskell 89,973.82 191
PHP 94,031.19 978
CoffeeScript 94,890.80 435
VimL 94,967.11 532
Shell 96,930.54 979
Lua 96,930.69 101
Erlang 97,306.55 168
Clojure 97,500.00 269
Python 97,578.87 2314
JavaScript 97,598.75 3443
Emacs Lisp 97,774.65 355
C# 97,823.31 665
Ruby 98,238.74 3242
C++ 99,147.93 845
CSS 99,881.40 527
Perl 100,295.45 990
C 100,766.51 2120
Go 101,158.01 231
Scala 101,460.91 243
ColdFusion 101,536.70 109
Objective-C 101,801.60 562
Groovy 102,650.86 116
Java 103,179.39 1402
XSLT 106,199.19 123
ActionScript 108,119.47 113

Here’s the same data in chart form:

Language vs Income

Most of the language rankings were roughly in line with my expectations, to the extent I had any:

  • Haskell is a very academic language, and academia is not known for generous salaries
  • PHP is a very accessible language, and it makes sense that casual / younger / lower paid programmers can easily contribute
  • On the high end of the spectrum, Java and ActionScript are used heavily in enterprise software, and enterprise software is certainly known to pay well

On the other hand, I’m unfamiliar with some of the other languages on the high/low ends like XSLT, Puppet, and CoffeeScript.  Any ideas on why these languages ranked higher or lower than average?

Caveats before making too many conclusions from the data here:

  • These are all open-source projects, which may not accurately represent compensation among closed-source developers
  • Rapleaf data does not have total income coverage, and the sample may be biased
  • I have not corrected for any other skew (age, gender, etc)
  • I haven’t crawled all repositories on GitHub, so the users for whom I have data may not be a representative sample

That said, even though the absolute numbers may be biased, I think this is a good starting point when comparing relative compensation between languages.

Let me know any thoughts or suggestions about the methodology or the results.  I’ll follow up soon with age and gender breakdowns per language in a similar fashion.

Posted in Github, Open Source, Uncategorized, Visualization | 197 Comments

Using CoreNLP, d3.js, and dagre.js to visualize sentence parse trees

I’ve always been casually interested in the field of Natural Langauge Processing (NLP), a  field of computer science interested in extracting information from natural human language. I have no training or education whatsoever in the field so I’m not in a position to contribute much to the field, but I am definitely interested in seeing where the state of the art is, and in particular how powerful open-source NLP libraries have gotten (Google and Microsoft certainly have more powerful closed-source systems, but that doesn’t really help me.)

A few years ago I started playing with Apache’s OpenNLP project.  I’m a big fan of the Apache foundation and their libraries, but I found myself very frustrated by OpenNLP’s lack of documentation and the hacky-feeling interfaces the library exposed.  However recently I took another look at the available NLP libraries and came across Stanford’s CoreNLP project.   CoreNLP, as it turns out, is an awesome project, and it took almost zero effort to get their example demo working.

As a total NLP beginnner, the sentence parsing functionality was the most immediately approachable example.  Sentence parsing takes a natural-English sentence:

“I am parsing an example sentence.”

and breaks it down into component tokens and their relations:

(ROOT (S (NP (PRP I)) (VP (VBP am) (VP (VBG parsing) (NP (DT an) (NN example) (NN sentence)))) (. .)))

where each token type corresponds to a particular word type–“NP”  means “Noun Phrase”, VBG means “Verb, gerund or present participle”, and so forth (I’ve been referencing this as a complete token list.)

I’ve also been looking into JavaScript graph visualization libraries recently (I’ve struggled to find a JS library remotely as powerful and pretty as graphviz), and wanted to test out the dagre library, which re-implements a simplified dot algorithm in javascipt and can render the results to d3 (the current coolest-kid-on-the-block JS graph library).  So I put the two together and put together a simple visualization which uses dagre to show CoreNLP’s sentence parse tree.  It’s pretty simple, but you can play with it here.


When I have time to work with the two libraries a bit more I’ll hopefully update with something more interesting.

Posted in Uncategorized | 3 Comments

Github Demographics

For the past couple weeks I’ve been working on a project to visualize and compare the demographics of popular GitHub organizations by combining data from the the RapLeaf and GitHub APIs.   By pulling emails from Git commit data and querying the Rapleaf API for demographic data, I was able to put together an aggregate picture of the age + gender + income of people who have contributed to a GitHub organization (shown below for the Rails organization)


  • See more details on how the data was gathered here,
  • See organization ranked by age / gender / income here
  • Browse all available organizations here.

I’ll be following up soon with some thoughts on the results.  For now, I’ll just point out that Linux kernel developers make serious bank.

Posted in Uncategorized | 8 Comments

Fast asymmetric Hadoop joins using Bloom Filters and Cascading

In a recent post for the Liveramp blog I describe how we use Bloom filters to optimize our Hadoop jobs:

We recently open-sourced a number of internal tools we’ve built to help our engineers write high-performance Cascading code as the cascading_ext project. Today I’m going to to talk about a tool we use to improve the performance of asymmetric joins—joins where one data set in the join contains significantly more records than the other, or where many of the records in the larger set don’t share a common key with the smaller set.

Check out the rest of the post here.

Posted in Open Source | Tagged , | Leave a comment