Catalog of Life Taxonomic Tree

A small visualization I’ve wanted to do for a while is a tree of life graph — visualizing all known species and their relationships.

Recently I found that the Catalog of Life project has a very accessible database of all known species / taxonomic groups and their relationships, available for download here.  This let me put together a simple site backed by their database, available here:

http://taxontree.bpodgursky.com/

uncharted-screenshot

All the source code is available on Github.

Design

I’ve used dagre + d3 on a number of other graph visualization projects, so dagre-d3 was the natural choice for the  visualization component.  The actual code required to do the graph building and rendering is pretty trivial.

The data fetching was a bit trickier.  Since pre-loading tens of millions of records was obviously unrealistic, I had to implement a graph class (BackedBiGraph) which lazily expands and collapses, using user-provided callbacks to fetch new data.  In this case, the callbacks were just ajax calls back to the server.

The Catalog of Life database did not come with a Java client, so I thought this would be a good opportunity to use jOOQ to generate Java models and query builders corresponding to the COL database, since I am allergic to writing actual SQL queries.  This ended up working very well — configuring the jOOQ Maven plugin was simple, and the generated code made writing the queries trivial:

 private Collection<TaxonNodeInfo> taxonInfo(Condition condition) {
return context.select()
.from(TAXON)
.leftOuterJoin(TAXON_NAME_ELEMENT)
.on(TAXON.ID.equal(TAXON_NAME_ELEMENT.TAXON_ID))
.leftOuterJoin(SCIENTIFIC_NAME_ELEMENT)
.on(SCIENTIFIC_NAME_ELEMENT.ID.equal(TAXON_NAME_ELEMENT.SCIENTIFIC_NAME_ELEMENT_ID))
.where(condition).fetch().stream().map(record -> new TaxonNodeInfo(
record.into(TAXON),
record.into(TAXON_NAME_ELEMENT),
record.into(SCIENTIFIC_NAME_ELEMENT)
)).collect(Collectors.toList());
}

All in all, there are a lot of rough edges still, but dagre, d3 and jOOQ made this a much easier project than expected.  The code is on Github, so suggestions, improvements, or bugfixes are always welcome.

 

 

D3 NLP Visualizer Update

A couple years ago I put together a simple NLP parse tree visualizer demo which used d3 and the dagre layout library.  At the time, integrating dagre with d3 required a bit of manual work (or copy-paste); however, since then dagre-d3 library was split out of dagre which added an easy API for adding and removing nodes.  Even though the library isn’t under active development, I think it’s still the most powerful pure-JS directed graph layout library out there.

An example from the wiki shows how simple the the dagre-d3 API is for creating directed graphs, abbreviated here:

    // Create the input graph
    var g = new dagreD3.graphlib.Graph()
      .setGraph({})
      .setDefaultEdgeLabel(function() { return {}; });

    // Here we"re setting nodeclass, which is used by our custom drawNodes function
    // below.
    g.setNode(0,  { label: "TOP", class: "type-TOP" });

    // ... snip

    // Set up edges, no special attributes.
    g.setEdge(3, 4);
    
    // ... snip
    // Create the renderer
    var render = new dagreD3.render();

    var svg = d3.select("svg"),
        svgGroup = svg.append("g");

    // Run the renderer. This is what draws the final graph.
    render(d3.select("svg g"), g);

Since that original demo site still gets a fair amount of traffic, I thought it would be good to update it to use dagre-d3 instead of the original hand-rolled bindings (along with other cleanup).  You can see the change set required here.

The other goal was to re-familiarize myself with d3 and dagre, since I have a couple projects in mind which would make heavy use of both.  Hopefully I’ll have something to post here in the next couple months.

Average Income per Programming Language

Update 8/21:  I’ve gotten a lot of feedback about issues with these rankings from comments, and have tried to address some of them here The data there has been updated to include confidence intervals.

———————————————————————————————————

A few weeks ago I described how I used Git commit metadata plus the Rapleaf API to build aggregate demographic profiles for popular GitHub organizations (blog post here, per-organization data available here).

I was also interested in slicing the data somewhat differently, breaking down demographics per programming language instead of per organization.  Stereotypes about developers of various languages abound, but I was curious how these lined up with reality.  The easiest place to start was age, income, and gender breakdowns per language. Given the data I’d already collected, this wasn’t too challenging:

  • For each repository I used GitHub’s estimate of a repostory’s language composition.  For example, GitHub estimates this project at 75% Java.
  • For each language, I aggregated incomes for all developers who have contributed to a project which is at least 50% that language (by the above measure).
  • I filtered for languages with > 100 available income data points.

Here are the results for income, sorted from lowest average household income to highest:

Language Average Household Income ($) Data Points
Puppet 87,589.29 112
Haskell 89,973.82 191
PHP 94,031.19 978
CoffeeScript 94,890.80 435
VimL 94,967.11 532
Shell 96,930.54 979
Lua 96,930.69 101
Erlang 97,306.55 168
Clojure 97,500.00 269
Python 97,578.87 2314
JavaScript 97,598.75 3443
Emacs Lisp 97,774.65 355
C# 97,823.31 665
Ruby 98,238.74 3242
C++ 99,147.93 845
CSS 99,881.40 527
Perl 100,295.45 990
C 100,766.51 2120
Go 101,158.01 231
Scala 101,460.91 243
ColdFusion 101,536.70 109
Objective-C 101,801.60 562
Groovy 102,650.86 116
Java 103,179.39 1402
XSLT 106,199.19 123
ActionScript 108,119.47 113

Here’s the same data in chart form:

Language vs Income

Most of the language rankings were roughly in line with my expectations, to the extent I had any:

  • Haskell is a very academic language, and academia is not known for generous salaries
  • PHP is a very accessible language, and it makes sense that casual / younger / lower paid programmers can easily contribute
  • On the high end of the spectrum, Java and ActionScript are used heavily in enterprise software, and enterprise software is certainly known to pay well

On the other hand, I’m unfamiliar with some of the other languages on the high/low ends like XSLT, Puppet, and CoffeeScript.  Any ideas on why these languages ranked higher or lower than average?

Caveats before making too many conclusions from the data here:

  • These are all open-source projects, which may not accurately represent compensation among closed-source developers
  • Rapleaf data does not have total income coverage, and the sample may be biased
  • I have not corrected for any other skew (age, gender, etc)
  • I haven’t crawled all repositories on GitHub, so the users for whom I have data may not be a representative sample

That said, even though the absolute numbers may be biased, I think this is a good starting point when comparing relative compensation between languages.

Let me know any thoughts or suggestions about the methodology or the results.  I’ll follow up soon with age and gender breakdowns per language in a similar fashion.