3D Map of our Solar Neighborhood using three.js

A few months ago I stumbled on three.js, a library which exposes a simple WebGL interface.  I was really impressed by both the performance of WebGL and how easy three.js makes it to build high-performance animations in the browser.

I thought this would be a good opportunity to put together a visualization I’ve been looking for for a while: a map of our solar neighborhood, showing the stars closest to our own.  The dataset I used was the HYG database of nearby stars, a compilation of all stars within 50 parsecs.  I’ve cross-referenced stars against Wikipedia where available.  The site is available here:

http://uncharted.bpodgursky.com/

Screenshot:

[Screenshot: the star map at uncharted.bpodgursky.com]

The whole project is open-source, hosted here.

The project isn’t nearly as polished as I would have liked (there’s a long to-do list on the GitHub page), but I’m trying to commit to releasing projects rather than letting them die silently.  I’m hoping to iterate on the remaining issues over the next few months.

Thoughts, suggestions, and contributions are always welcome here or on the GitHub page.


Simple Boolean Expression Manipulation in Java

I’ve worked on a couple projects recently where I needed to be able to do some lightweight propositional expression manipulation in Java.  Specifically, I wanted to be able to:

  • Let a user input simple logical expressions, and parse them into Java data structures
  • Evaluate the truth of the statement given values for each variable
  • Incrementally update the expression as values are assigned to the variables
  • If the statement given some variable assignments is not definitively true or false, show which terms remain
  • Perform basic simplification of redundant terms (full satisfiability is of course NP-hard, so this only includes basic simplification)

I couldn’t find a Java library which made this particularly easy; a couple of Stack Overflow questions I found didn’t turn up any particularly easy solutions either.  I decided to take a shot at implementing a basic library.  The result is on GitHub as the jbool_expressions library.

(Most of the rest of this post is copied from the README, so feel free to read it there.)

Using the library, a basic propositional expression is built out of the types And, Or, Not, Variable and Literal. All of these extend the base type Expression.  An Expression can be built programmatically:

    Expression expr = And.of(
        Variable.of("A"),
        Variable.of("B"),
        Or.of(Variable.of("C"), Not.of(Variable.of("C"))));
    System.out.println(expr);

or by parsing a string:

    Expression expr =
        ExprParser.parse("( ( (! C) | C) & A & B)");
    System.out.println(expr);

The expression is the same either way:

    ((!C | C) & A & B)

We can do some basic simplification to eliminate the redundant terms:

    Expression simplified = RuleSet.simplify(expr);
    System.out.println(simplified);

and see that the redundant term (!C | C) simplifies to “true” and drops out:

    (A & B)

We can assign a value to one of the variables, and see the expression simplify further once “A” is known:

    Expression halfAssigned = RuleSet.assign(
        simplified,
        Collections.singletonMap("A", true)
    );
    System.out.println(halfAssigned);

We can see the remaining expression:

    B

If we assign a value to the remaining variable, we can see the expression evaluate to a literal:

    Expression resolved = RuleSet.assign(
        halfAssigned,
        Collections.singletonMap("B", true)
    );
    System.out.println(resolved);

    true

All expressions are immutable (we got a new expression back each time we performed an operation), so we can see that the original expression is unmodified:

    System.out.println(expr);

    ((!C | C) & A & B)

Expressions can also be converted to sum-of-products form:

    Expression nonStandard = PrefixParser.parse(
        "(* (+ A B) (+ C D))"
    );
    System.out.println(nonStandard);

    ((A | B) & (C | D))

    Expression sopForm = RuleSet.toSop(nonStandard);
    System.out.println(sopForm);

    ((A & C) | (A & D) | (B & C) | (B & D))

You can build the library yourself or grab it via Maven:

    <dependency>
        <groupId>com.bpodgursky</groupId>
        <artifactId>jbool_expressions</artifactId>
        <version>1.3</version>
    </dependency>

Happy to hear any feedback / bugs / improvements etc. I’d also be interested in hearing how other people have dealt with this problem, and if there are any better libraries out there.


The future looks familiar

One thing that has bothered me about the technological progress we’ve seen over the past few decades is how invisible it is in our daily lives.

I took a trip to IKEA recently to get a couch for our apartment.  For those who have never been to an IKEA, it is an enormous store where furniture is sold both à la carte and as fully furnished rooms (for those with cash and no patience for design).  I didn’t think to take pictures at the time, but the example kitchens looked something like this image I ripped from the IKEA website:

[Image: an example IKEA kitchen, from the IKEA website]

The kitchen is of course more beautiful than any kitchen I will ever own.  But what bothers me is that if you had shown me that room in 1983, I doubt I would have realized I was looking 30 years into the future.  It’s difficult for me to find a single tool or utensil there which would have been unrecognizable; perhaps the stove is slightly more programmable, or the lightbulbs are CFLs, but we’re basically interacting with the kitchen using the same tools people were using thirty years ago.

A quick search gives a rough timeline of kitchen inventions here (if anyone has a more complete list I’d be interested in seeing it).

  • 1909:  First commercially successful electric toaster
  • 1913:  First electric dishwasher on the market
  • 1919:  First automatic pop-up toaster
  • 1927:  First garbage disposal
  • 1945:  Magnetron discovered to melt candy, pop corn, and cook an egg (the basis of the microwave oven)
  • 1952:  First automatic coffeepot
  • 1963:  GE introduces the self-cleaning oven

Each of these innovations freed up enormous amounts of human capital to solve more interesting problems; ten minutes a day that 150 million people spend not brewing coffee adds up to about 13,000 extra lifetimes each year (150 million people × 10 minutes a day is roughly a million person-years annually, or about 13,000 lifetimes at 80 years apiece).

But since then, what really radical improvements have we seen?  I was curious and did a casual inspection of my own kitchen.  The newest invention I can find is a George Foreman grill, which was invented 19 years ago.  Have we really perfected the art of cooking, or did we just give up and start solving other problems?

If I see that kitchen when I visit IKEA in 2043, I’m going to be very disappointed.  All the film and television depictions of the future from my childhood envisioned radically faster and more efficient household appliances.  Back to the Future 2 had food hydrators:

[Image: the food hydrator from Back to the Future 2]

and robotic waiters for meals away from home:

[Image: a robot waiter from Back to the Future 2]

Star Trek jumped straight to food replicators:

[Image: a Star Trek food replicator]

I never watched the Jetsons, but I do know they had a robotic maid:

[Image: the Jetsons’ robotic maid]

Outside the kitchen I feel the same way about transportation.  To get to work, I take this bus (albeit a different route):

[Image: a San Francisco city bus]

Bus rides in San Francisco are uniformly slow, crowded, and miserable; buses in downtown San Francisco average a blazing five miles an hour.  I refuse to believe that as a society we cannot beat this form of transportation (the underground lines are somewhat faster, but are now extraordinarily expensive to build).

I’m not contesting that technology overall is improving as rapidly as it was fifty years ago.  The transistor-density increases which power Moore’s law may be slowing, but they seem to be more than offset by improvements in parallel processing, and mobile phones and the internet have radically changed how we work and interact with each other.  But I can’t shake the feeling that we’ve stopped trying to optimize the routines which still take hours out of each day.


Taxi Loading at SFO

I avoid catching taxis whenever possible, but when I arrived at SFO last week the trains were no longer running and I hadn’t arranged for a shuttle, so I ended up waiting in line for one.  The line was structured something like this:

[Diagram: the taxi loading line, with passengers and taxis queued on either side of the curb]

  • There was a loading area about four cars long where taxis were loading passengers
  • Would-be passengers lined up along the curb to the left, waiting for a taxi
  • Likewise, taxis lined up for passengers on the other side of the curb
  • As people loaded into taxis and departed, each line advanced to the right, matching the front of the taxi line with the front of the passenger line
  • An airport employee stood near the front of the line, shepherding people and cabs around to enforce this flow

Of course, this felt like an extremely inefficient system: I was waiting next to a cab which was itself waiting for a passenger, and had we been allowed, I would have just jumped in and we both would have been happier.  However, since the line of people was denser than the taxi line, I would have been cutting in front of other people in line.

In college I took a couple of classes where we learned about queuing algorithms and the standard trade-offs involved.  On the ride back I thought about how they applied to the taxi-loading situation here:

  • Throughput: how many passengers per hour could the system match to taxis?  This was not being optimized for, or I could have gotten into the taxi beside me.
  • Fairness: this was pretty clearly what was being optimized for; both the taxi line and the passenger line were being processed in First-In-First-Out (FIFO) ordering.
  • Average wait time: I don’t think wait time was being taken into account; if it had been, passengers with less luggage (and therefore faster loading times) would have been given priority over passengers with many bags (see the sketch after this list).
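
As a toy illustration of that last point, here’s a sketch comparing average waits under FIFO versus a shortest-loading-first ordering, assuming a single loading position and made-up loading times (a deliberately simplistic model, not a simulation of the real SFO line):

    import java.util.Arrays;
    import java.util.Random;

    public class TaxiQueueSim {

      public static void main(String[] args) {
        // 50 passengers, each taking a random 30-150 seconds to load
        Random rand = new Random(42);
        double[] loadTimes = new double[50];
        for (int i = 0; i < loadTimes.length; i++) {
          loadTimes[i] = 30 + rand.nextInt(120);
        }

        System.out.printf("FIFO average wait: %.0fs%n", averageWait(loadTimes));

        // "shortest job first": unfair, but minimizes the average wait
        double[] byLoadTime = loadTimes.clone();
        Arrays.sort(byLoadTime);
        System.out.printf("SJF average wait: %.0fs%n", averageWait(byLoadTime));
      }

      // With one loading position, each passenger waits for the combined
      // loading times of everyone served before them.
      static double averageWait(double[] loadTimes) {
        double clock = 0, totalWait = 0;
        for (double t : loadTimes) {
          totalWait += clock;
          clock += t;
        }
        return totalWait / loadTimes.length;
      }
    }

This is the classic shortest-job-first result: the mean wait drops, but the passenger with the most luggage waits longest, which is presumably part of why airports stick with FIFO.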

A couple other issues were specific to this situation:

  • The matching process should not involve an inordinate amount of walking by prospective passengers (a passenger should never have to walk the entire length of the taxi queue to find a cab)
  • If cabs frequently have to pass other cabs to advance to the head of the queue, it increases the odds of an accident (or of someone getting run over while loading bags into a trunk)

I’d like to think that a better system exists (“there has to be a better way!”), even if it sacrifices some amount of fairness, since this system would clearly scale poorly if the airport were busier.

If anyone knows of airports/malls/etc that do a better job, I’d be interested in knowing how they manage it.  I didn’t waste an enormous amount of time in line (~10 minutes), but if the line is on average 50 people long, that’s actually a huge amount of time being squandered over the course of a year.


Updates to language vs income breakdown post

Thanks to everyone who commented on and read through my post last night.  The post got a lot more attention than I expected (on Hacker News and Reddit, at least).  Many comments, both here and on those threads, quite reasonably pointed out problems with the data presented.  I should have been much clearer about the caveats and issues initially, and put them at the front of the post instead of the end.

I’d like to address some of the concerns raised where possible, and be clear about which problems I don’t see an easy way of fixing:

Confidence intervals

Many commenters noted that the results can’t be judged significant without confidence measures.  In retrospect I should have calculated confidence intervals from the beginning instead of just the mean values; I had assumed the n=100 cutoff would keep the error low enough to ignore, but that was a mistake.  Below is an updated graph with 95% confidence intervals:

[Chart: average household income per language, with 95% confidence intervals]

and the numbers:

Language | Mean ($) | 95% Lower ($) | 95% Upper ($) | Samples
Puppet | 87,589.29 | 77,726.24 | 97,452.33 | 112
Haskell | 89,973.82 | 82,773.72 | 97,173.92 | 191
PHP | 94,031.19 | 90,956.90 | 97,105.47 | 978
CoffeeScript | 94,890.80 | 90,025.16 | 99,756.45 | 435
VimL | 94,967.11 | 90,735.70 | 99,198.51 | 532
Shell | 96,930.54 | 93,771.76 | 100,089.33 | 979
Lua | 96,930.69 | 86,169.26 | 107,692.13 | 101
Erlang | 97,306.55 | 88,631.11 | 105,981.98 | 168
Clojure | 97,500.00 | 91,448.24 | 103,551.76 | 269
Python | 97,578.87 | 95,481.64 | 99,676.10 | 2314
JavaScript | 97,598.75 | 95,897.67 | 99,299.83 | 3443
Emacs Lisp | 97,774.65 | 92,503.64 | 103,045.65 | 355
C# | 97,823.31 | 94,116.76 | 101,529.86 | 665
Ruby | 98,238.74 | 96,471.81 | 100,005.68 | 3242
C++ | 99,147.93 | 95,633.62 | 102,662.23 | 845
CSS | 99,881.40 | 95,361.99 | 104,400.82 | 527
Perl | 100,295.45 | 97,172.79 | 103,418.12 | 990
C | 100,766.51 | 98,602.83 | 102,930.19 | 2120
Go | 101,158.01 | 94,435.87 | 107,880.15 | 231
Scala | 101,460.91 | 94,925.79 | 107,996.02 | 243
ColdFusion | 101,536.70 | 93,627.35 | 109,446.05 | 109
Objective-C | 101,801.60 | 97,560.43 | 106,042.77 | 562
Groovy | 102,650.86 | 94,601.74 | 110,699.99 | 116
Java | 103,179.39 | 100,474.36 | 105,884.42 | 1402
XSLT | 106,199.19 | 96,887.72 | 115,510.65 | 123
ActionScript | 108,119.47 | 99,297.36 | 116,941.58 | 113
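
For the curious, here’s a minimal sketch of the standard 95% normal-approximation interval (mean ± 1.96 standard errors); this is shown for illustration and isn’t the actual analysis code:

    import java.util.Arrays;

    public class Confidence {

      // 95% normal-approximation confidence interval for a sample mean:
      // mean +/- 1.96 * (sample stddev / sqrt(n))
      static double[] confidenceInterval95(double[] incomes) {
        int n = incomes.length;
        double mean = Arrays.stream(incomes).average().orElse(0);
        double variance = Arrays.stream(incomes)
            .map(x -> (x - mean) * (x - mean))
            .sum() / (n - 1);
        double stderr = Math.sqrt(variance / n);
        return new double[] { mean - 1.96 * stderr, mean + 1.96 * stderr };
      }
    }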

As it turns out, the commenters who suspected the top and bottom rankings were artifacts of small samples were correct.  Although the confidence ranges of the top and bottom groups don’t overlap, the differences are not as clear-cut as the means alone would suggest.

I’m going to try to gather some data for the more sparsely represented languages to clean this up, and will update here when I have better numbers (this might take a while because of API rate limiting).

Household Income vs Personal Income

Many commenters noted that these numbers use household income rather than personal income.  This is a limitation of the data sets I’m using rather than a deliberate choice; the Rapleaf API only returns household income, so rather than give up, I used the household measure instead.

This is not ideal, but I don’t think it is a critical flaw; for this difference to skew the results, developers of certain languages would need significantly different marriage patterns, or a tendency to marry richer or poorer spouses, relative to developers of other languages.  This is not impossible, but I think the results are still useful with this caveat in mind.

If anyone can suggest a data set with personal incomes I can use instead, I’ll gladly use those.  Otherwise I’ll be more clear that the incomes are household rather than personal.

Correcting for Confounding Variables

The original numbers did not attempt to adjust for any other variables, some of the more obvious being age and location.  It’s been suggested that I look into using partial dependence plots to separate out other variables.  I’ll be taking a look at that over the next few days.

Missing Languages

Unfortunately there’s not a lot I can do about most of the missing languages; many are simply not recognized by GitHub (SQL, among others).  As I gather more data, I’ll include the languages which were omitted here because of sample size.

Thanks again to everyone who read and commented.  I’m going to process the lessons here and be more careful when posting numbers in the future (I’d still like to post similar breakdowns for gender and age soon).


Average Income per Programming Language

Update 8/21:  I’ve gotten a lot of feedback about issues with these rankings from the comments, and have tried to address some of them here.  The data there has been updated to include confidence intervals.

———————————————————————————————————

A few weeks ago I described how I used Git commit metadata plus the Rapleaf API to build aggregate demographic profiles for popular GitHub organizations (blog post here, per-organization data available here).

I was also interested in slicing the data somewhat differently, breaking down demographics per programming language instead of per organization.  Stereotypes about developers of various languages abound, but I was curious how these lined up with reality.  The easiest place to start was age, income, and gender breakdowns per language. Given the data I’d already collected, this wasn’t too challenging:

  • For each repository, I used GitHub’s estimate of the repository’s language composition.  For example, GitHub estimates this project at 75% Java.
  • For each language, I aggregated incomes over all developers who have contributed to a project which is at least 50% that language by the above measure (a rough sketch of this aggregation follows the list).
  • I filtered for languages with > 100 available income data points.
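
To make the aggregation concrete, here’s a rough sketch of the logic.  It’s illustrative only; the maps and names below stand in for the crawled GitHub and Rapleaf data rather than the actual pipeline code:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class LanguageIncomes {

      // langShares:   repo -> (language -> fraction of the repo)
      // contributors: repo -> users who have committed to it
      // incomeByUser: user -> household income, where Rapleaf returned data
      static Map<String, List<Double>> incomesPerLanguage(
          Map<String, Map<String, Double>> langShares,
          Map<String, Set<String>> contributors,
          Map<String, Double> incomeByUser) {

        // collect the distinct contributors per language, counting each user once
        Map<String, Set<String>> usersPerLang = new HashMap<>();
        for (String repo : langShares.keySet()) {
          for (Map.Entry<String, Double> lang : langShares.get(repo).entrySet()) {
            if (lang.getValue() >= 0.5) { // repo is at least 50% this language
              usersPerLang.computeIfAbsent(lang.getKey(), k -> new HashSet<>())
                  .addAll(contributors.getOrDefault(repo, new HashSet<>()));
            }
          }
        }

        // look up incomes, keeping only languages with > 100 data points
        Map<String, List<Double>> incomesPerLang = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : usersPerLang.entrySet()) {
          List<Double> incomes = new ArrayList<>();
          for (String user : entry.getValue()) {
            Double income = incomeByUser.get(user);
            if (income != null) {
              incomes.add(income);
            }
          }
          if (incomes.size() > 100) {
            incomesPerLang.put(entry.getKey(), incomes);
          }
        }
        return incomesPerLang;
      }
    }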

Here are the results for income, sorted from lowest average household income to highest:

Language | Average Household Income ($) | Data Points
Puppet | 87,589.29 | 112
Haskell | 89,973.82 | 191
PHP | 94,031.19 | 978
CoffeeScript | 94,890.80 | 435
VimL | 94,967.11 | 532
Shell | 96,930.54 | 979
Lua | 96,930.69 | 101
Erlang | 97,306.55 | 168
Clojure | 97,500.00 | 269
Python | 97,578.87 | 2314
JavaScript | 97,598.75 | 3443
Emacs Lisp | 97,774.65 | 355
C# | 97,823.31 | 665
Ruby | 98,238.74 | 3242
C++ | 99,147.93 | 845
CSS | 99,881.40 | 527
Perl | 100,295.45 | 990
C | 100,766.51 | 2120
Go | 101,158.01 | 231
Scala | 101,460.91 | 243
ColdFusion | 101,536.70 | 109
Objective-C | 101,801.60 | 562
Groovy | 102,650.86 | 116
Java | 103,179.39 | 1402
XSLT | 106,199.19 | 123
ActionScript | 108,119.47 | 113

Here’s the same data in chart form:

[Chart: average household income per language]

Most of the language rankings were roughly in line with my expectations, to the extent I had any:

  • Haskell is a very academic language, and academia is not known for generous salaries
  • PHP is a very accessible language, and it makes sense that casual / younger / lower-paid programmers can contribute easily
  • On the high end of the spectrum, Java and ActionScript are used heavily in enterprise software, and enterprise software is certainly known to pay well

On the other hand, I’m unfamiliar with some of the other languages on the high/low ends like XSLT, Puppet, and CoffeeScript.  Any ideas on why these languages ranked higher or lower than average?

Some caveats before drawing too many conclusions from the data here:

  • These are all open-source projects, which may not accurately represent compensation among closed-source developers
  • Rapleaf data does not have total income coverage, and the sample may be biased
  • I have not corrected for any other skew (age, gender, etc)
  • I haven’t crawled all repositories on GitHub, so the users for whom I have data may not be a representative sample

That said, even though the absolute numbers may be biased, I think this is a good starting point when comparing relative compensation between languages.

Let me know any thoughts or suggestions about the methodology or the results.  I’ll follow up soon with age and gender breakdowns per language in a similar fashion.


Using CoreNLP, d3.js, and dagre.js to visualize sentence parse trees

I’ve always been casually interested in Natural Language Processing (NLP), the field of computer science concerned with extracting information from natural human language. I have no training or education whatsoever in NLP, so I’m not in a position to contribute much to the field, but I am definitely interested in seeing where the state of the art is, and in particular how powerful open-source NLP libraries have gotten (Google and Microsoft certainly have more powerful closed-source systems, but that doesn’t really help me).

A few years ago I started playing with Apache’s OpenNLP project.  I’m a big fan of the Apache foundation and their libraries, but I found myself very frustrated by OpenNLP’s lack of documentation and the hacky-feeling interfaces the library exposed.  Recently, though, I took another look at the available NLP libraries and came across Stanford’s CoreNLP project.  CoreNLP, as it turns out, is an awesome project, and it took almost zero effort to get their example demo working.

As a total NLP beginner, I found the sentence-parsing functionality the most immediately approachable.  Sentence parsing takes a natural English sentence:

“I am parsing an example sentence.”

and breaks it down into component tokens and their relations:

(ROOT (S (NP (PRP I)) (VP (VBP am) (VP (VBG parsing) (NP (DT an) (NN example) (NN sentence)))) (. .)))

where each tag corresponds to a particular word or phrase type: “NP” means “Noun Phrase”, “VBG” means “Verb, gerund or present participle”, and so forth (I’ve been referencing this as a complete tag list).
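
Getting that parse out of CoreNLP takes very little code.  Here’s a minimal sketch along the lines of the CoreNLP demo (it assumes the CoreNLP jar and its models are on the classpath; the class name and sentence are mine):

    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.trees.Tree;
    import edu.stanford.nlp.trees.TreeCoreAnnotations;
    import edu.stanford.nlp.util.CoreMap;

    public class ParseDemo {

      public static void main(String[] args) {
        // tokenize -> sentence split -> POS tag -> parse
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("I am parsing an example sentence.");
        pipeline.annotate(document);

        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
          Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
          System.out.println(tree); // the bracketed parse shown above
        }
      }
    }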

I’ve also been looking into JavaScript graph visualization libraries recently (I’ve struggled to find a JS library remotely as powerful and pretty as graphviz), and wanted to test out the dagre library, which re-implements a simplified dot layout algorithm in JavaScript and can render the results with d3 (the current coolest-kid-on-the-block JS visualization library).  So I wired the two together into a simple visualization which uses dagre to show CoreNLP’s sentence parse trees.  It’s pretty simple, but you can play with it here.

[Screenshot: a sentence parse tree rendered with dagre and d3]

When I have time to work with the two libraries a bit more I’ll hopefully update with something more interesting.
