Using TravisCI to deploy to Maven Central via Git tagging (aka death to commit clutter)

jbool_expressions is a small OSS library I maintain.  To make it easy to use, artifacts are published to Maven Central.  I have never been happy with my process for releasing to Maven Central; the releases were done manually (no CI) on my laptop and felt fragile.

I wanted to streamline this process to meet a few requirements:

  • All deployments should happen from a CI system; nothing depends on laptop state (I don’t want to lose my encryption key when I get a new laptop)
  • Every commit to master should automatically publish a SNAPSHOT artifact to Sonatype (no manual snapshot release process)
  • Cutting a proper release to Maven Central, when needed, should be straightforward and hard to mess up
  • Performing releases should not generate commit clutter.  Namely, no more of this:

Luckily for me, we recently set up a very similar process for our OSS projects at LiveRamp.  I don’t want to claim I figured this all out myself — others at LiveRamp (namely, Josh Kwan) were heavily involved in setting up the encryption parts for the LiveRamp OSS projects which I used as a model.  

There’s a lot of information out there about Maven deploys, but I had trouble finding a simple guide which ties it all together in a painless way, so I decided to write it up here.

tl;dr: through some TravisCI and Maven magic, jbool_expressions now publishes SNAPSHOT artifacts to Sonatype on each master commit, and I can now deploy a release to Maven Central with these commands:

$ git tag 1.23
$ git push origin 1.23

To update and publish the next SNAPSHOT version, I can just change and push the version:

$ mvn versions:set -DnewVersion=1.24-SNAPSHOT
$ git commit -am "Update to version 1.24-SNAPSHOT"
$ git push origin master

At no point is anything auto-committed by Maven; the only commits in the Git history are ones I did manually.  Obviously I could script these last few steps, but I like that these are all primitive commands, with no magic scripts which could go stale or break halfway through.

The thousand-foot view of the CI build I set up for jbool_expressions looks like this:

  • jbool_expressions uses TravisCI to run tests and deploys on every commit to master
  • Snapshots deploy to Sonatype; releases deploy to Maven Central (via a Sonatype staging repository)
  • Tagged commits publish with a version corresponding to the git tag, by using the versions-maven-plugin mid-build.

The rest of this post walks through the setup necessary to get this all working, and points out the important files (if you’d rather just look around, you can check out the repository itself).

Set up a Sonatype account

Sonatype generously provides OSS projects free artifact hosting and mirroring to Maven Central.  To set up an account with Sonatype OSS hosting, follow the guide here.  After creating the JIRA account for issues.sonatype.org, hold onto the credentials — we’ll use those later to publish artifacts.

Creating a “New Project” ticket and getting it approved isn’t an instant process; the approval is manual, because Maven Central requires artifact coordinates to reflect actual domain control.  Since I wanted to publish my artifacts under the com.bpodgursky coordinates, I needed to prove ownership of bpodgursky.com.  

Once a project is approved, we have permission to deploy artifacts to two important places:

  • the Sonatype snapshot repository (for SNAPSHOT artifacts)
  • the Sonatype staging repository (for release artifacts, which are synced to Maven Central)

We’ll use both of these later.

Set up TravisCI for your GitHub project.

TravisCI is easy to set up.  Follow the documentation here.

Configure Maven + Travis to publish to central 

Once we have our Sonatype and Travis accounts enabled, we need to configure a few files to get ourselves publishing to central:

  • pom.xml
  • .travis/maven-settings.xml
  • .travis.yml
  • .travis/gpg.asc.enc
  • deploy

I’ll walk through the interesting parts of each.

pom.xml

Most of the interesting publishing configuration happens in the “publish” profile here.  The first three Maven plugins — maven-javadoc-plugin, maven-source-plugin, and maven-gpg-plugin — are all standard and generate artifacts necessary for a central deploy (the gpg plugin requires some configuration we’ll do later).  The last one — nexus-staging-maven-plugin — is a replacement for the maven-deploy-plugin, and tells Maven to deploy artifacts through Sonatype.
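
For reference, the standard nexus-staging-maven-plugin configuration from Sonatype’s documentation looks something like this (a sketch; the version number may have drifted from what the repository actually uses):

```xml
<plugin>
  <groupId>org.sonatype.plugins</groupId>
  <artifactId>nexus-staging-maven-plugin</artifactId>
  <version>1.6.8</version>
  <extensions>true</extensions>
  <configuration>
    <serverId>ossrh</serverId>
    <nexusUrl>https://oss.sonatype.org/</nexusUrl>
    <autoReleaseAfterClose>true</autoReleaseAfterClose>
  </configuration>
</plugin>
```

The serverId here must match the server id in the Maven settings, so the plugin can find the Sonatype credentials.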

<distributionManagement>
  <snapshotRepository>
    <id>ossrh</id>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
  </snapshotRepository>
  <repository>
    <id>ossrh</id>
    <url>https://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
  </repository>
</distributionManagement>

Fairly straightforward — we want to publish release artifacts to Maven Central, but SNAPSHOT artifacts to Sonatype (Maven Central doesn’t accept SNAPSHOTs).

.travis/maven-settings.xml

The Maven settings here are configurations used during deploys; they inject credentials and build-server-specific configuration:

<servers>
  <server>
    <id>ossrh</id>
    <username>${env.OSSRH_USERNAME}</username>
    <password>${env.OSSRH_PASSWORD}</password>
  </server>
</servers>

The Maven server configurations let us tell Maven how to log into Sonatype so we can deploy artifacts.  Later I’ll explain how we get these variables into the build.

<profiles>
  <profile>
    <id>ossrh</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <gpg.executable>gpg</gpg.executable>
      <gpg.keyname>${env.GPG_KEY_NAME}</gpg.keyname>
      <gpg.passphrase>${env.GPG_PASSPHRASE}</gpg.passphrase>
    </properties>
  </profile>
</profiles>

The maven-gpg-plugin we configured earlier has a number of user properties which need to be configured.  Here’s where we set them.

.travis.yml

.travis.yml completely defines how and when TravisCI builds a project.   I’ll walk through what we do in this one.

language: java
jdk:
- openjdk10

The header of .travis.yml configures a straightforward Java 10 build.  Note, we can (and do) still build at a Java 8 language level; we just don’t want to use a deprecated JDK.

install: "/bin/true"
script:
- "./test"

Maven installs the dependencies it needs during building, so we can just disable the “install” phase entirely.  The test script can be inlined if you prefer; it runs a very simple suite:

mvn clean install -q -B

(by running through the “install” phase instead of just “test”, we also catch the “integration-test” and “verify” Maven phases, which are nice things to run in a PR, if they exist).

env:
  global:
  - secure: ...
  - secure: ...
  - secure: ...
  - secure: ...

The next four encrypted secrets were all added via the travis CLI.  For more details on how TravisCI handles encrypted secrets, see this article.  These hold the four variables we’ll use in the deploy script:

travis encrypt OSSRH_PASSWORD='<sonatype password>' --add env.global
travis encrypt OSSRH_USERNAME='<sonatype username>' --add env.global
travis encrypt GPG_KEY_NAME='<gpg key name>' --add env.global
travis encrypt GPG_PASSPHRASE='<pass>' --add env.global

The first two variables are the Sonatype credentials we created earlier; these are used to authenticate to Sonatype to publish snapshots and release artifacts to central. The last two are for the GPG key we’ll be using to sign the artifacts we publish to Maven central.  Setting up GPG keys and using them to sign artifacts is outside the scope of this post; Sonatype has documented how to set up your GPG key here, and how to use it to sign your Maven artifacts here.

Next we need to make our GPG key available to the Travis build, so it can sign the artifacts we publish.  Again, set this up by having Travis encrypt the entire file:

travis encrypt-file .travis/gpg.asc --add

In this case, gpg.asc is the gpg key you want to use to sign artifacts.  This will create .travis/gpg.asc.enc — commit this file, but do not commit gpg.asc.   

Travis will have added a block to .travis.yml that looks something like this:

before_deploy:
- openssl aes-256-cbc -K $encrypted_f094dd62560a_key -iv $encrypted_f094dd62560a_iv -in .travis/gpg.asc.enc -out .travis/gpg.asc -d

Travis defaults to decrypting in “before_install”, but in this case we don’t need gpg.asc available until the artifact is deployed.

deploy:
-
  skip_cleanup: true
  provider: script
  script: ./deploy
  on:
    branch: master
-
  skip_cleanup: true
  provider: script
  script: ./deploy
  on:
    tags: true

Here we actually set up the artifact to deploy (the whole point of this exercise).  We tell Travis to deploy under two circumstances: first, on any commit to master; second, on any pushed tag.  We’ll use the distinction between the two later.

deploy

The deploy script handles the actual deployment. After a bit of input validation, we get to the important parts:

gpg --fast-import .travis/gpg.asc

This imports the gpg key we decrypted earlier — we need to use this to sign artifacts.

if [ ! -z "$TRAVIS_TAG" ]
then
    echo "on a tag -> set pom.xml <version> to $TRAVIS_TAG"
    mvn --settings "${TRAVIS_BUILD_DIR}/.travis/mvn-settings.xml" org.codehaus.mojo:versions-maven-plugin:2.1:set -DnewVersion=$TRAVIS_TAG 1>/dev/null 2>/dev/null
else
    echo "not on a tag -> keep snapshot version in pom.xml"
fi

This is the important part for eliminating commit clutter.  Whenever a tag is pushed — that is, whenever $TRAVIS_TAG is set — we use the versions-maven-plugin to temporarily set the project’s version to that tag.  Specifically, after

$ git tag 1.23
$ git push origin 1.23

the build publishes artifacts versioned 1.23.  The committed artifact version in pom.xml doesn’t change; it doesn’t matter what version master declares, because we want to publish this build as 1.23.

mvn deploy -P publish -DskipTests=true --settings "${TRAVIS_BUILD_DIR}/.travis/mvn-settings.xml"

Last but not least, the actual deploy.  Since we configured our distributionManagement section above with different snapshot and release repositories, we don’t need to think about the version anymore — if it’s still SNAPSHOT (like in the pom), it goes to Sonatype; if we pushed a release tag, it’s headed for central.

That’s it!

Before this setup, I was never really happy with my process for cutting a release and getting artifacts into Maven Central — the auto-generated git commits cluttered the history and it was too easy for a release to fail halfway.  With this process, it’s almost impossible for me to mess up a release. 

Hopefully this writeup helps a few people skip the trial-and-error part of getting OSS Java artifacts released. Let me know if I missed anything, or there’s a way to make this even simpler.

Migrating LiveRamp’s Hadoop infrastructure to GCP

At Google Next 2019, co-workers Sasha Kipervarg, Patrick Raymond and I presented on how we migrated our company’s on-premise big-data Hadoop environment to GCP, from both a technical and cultural perspective.  You can view the presentation and slides on YouTube:

Our Google Next presentation did not give us enough time to go into deep technical details; to give a more in-depth technical view of our migration, I’ve put together a series of blog posts on the LiveRamp engineering blog, with an emphasis on how we migrated our Hadoop environment, the infrastructure my team is responsible for maintaining:

Part 1 where I discuss our Hadoop environment, why we decided to migrate LiveRamp to the cloud, and how we chose GCP.

Part 2 where I discuss the design we chose for our Hadoop environment on GCP.

Part 3 where I discuss the migration process and system architecture which let teams incrementally migrate applications to the cloud.

Part 4 written by coworker Porter Westling, where he discusses how we worked around our data center egress bandwidth restrictions.

Part 5 where I discuss LiveRamp’s use of the cloud going forward, and the cultural changes enabled by migrating to the cloud from an on-premise environment.

LiveRamp’s migration into GCP has been my (and my team’s) primary objective for over a year, and we’ve learned (sometimes painfully) a ton on the way.  Hopefully these articles help others who are planning big-data cloud migrations skip a few painful lessons.

Procedural star rendering with three.js and WebGL shaders

Over the past few months I’ve been working on a WebGL visualization of earth’s solar neighborhood — that is, a 3D map of all stars within 75 light years of Earth, rendering stars and (exo)planets as accurately as possible.  In the process I’ve had to learn a lot about WebGL (specifically three.js, the WebGL library I’ve used).  This post goes into more detail about how I ended up doing procedural star rendering using three.js.  

The first iteration of this project rendered stars as large balls, with colors roughly mapped to star temperature.  The balls did technically tell you where a star was, but it’s not a particularly compelling visual:

uncharted-screenshot

Pretty much any interesting WebGL or OpenGL animation uses vertex and fragment shaders to render complex details on surfaces.  In some cases this just means mapping a fixed image onto a shape, but shaders can also be generated randomly, to represent flames, explosions, waves etc.  three.js makes it easy to attach custom vertex and fragment shaders to your meshes, so I decided to take a shot at semi-realistic (or at least, cool-looking) star rendering with my own shaders.  

Some googling brought me to a very helpful guide on the Seeds of Andromeda dev blog which outlined how to procedurally render stars using OpenGL.  This post outlines how I translated a portion of this guide to three.js, along with a few tweaks.

The full code for the fragment and vertex shaders is on GitHub.  I have images here, but the visuals are most interesting on the actual tool (http://uncharted.bpodgursky.com/), since they are larger and animated.

Usual disclaimer — I don’t know anything about astronomy, and I’m new to WebGL, so don’t assume that anything here is “correct” or implemented “cleanly”.  Feedback and suggestions welcome.

My goal was to render something along the lines of this false-color image of the sun:

sun

In the final shader I implemented:

  • the star’s temperature is mapped to an RGB color
  • noise functions try to emulate the real texture
    • a base noise function to generate granules
    • a targeted negative noise function to generate sunspots
    • a broader noise function to generate hotter areas
  • a separate corona is added to show the star at long distances

Temperature mapping
The color of a star is determined by its temperature, following the black-body radiation color spectrum:

color_temperature_of_a_black_body

(sourced from wikipedia)

Since we want to render stars at the correct temperature, it makes sense to access this gradient in the shader where we choose colors for pixels.  Unfortunately, WebGL limits uniform arrays to a couple hundred elements on most hardware, making it tough to pack this data into the shader.

In theory WebGL implements vertex texture mapping, which would let the shader fetch the RGB coordinates from a loaded texture, but I wasn’t sure how to do this.  So instead I broke the black-body radiation color vector into a large, horrifying, stepwise function:

bool rbucket1 = i < 60.0;                 // 0, 255 in 60
bool rbucket2 = i >= 60.0 && i < 236.0;   // 255, 255
…
float r =
    float(rbucket1) * (0.0 + i * 4.25) +
    float(rbucket2) * (255.0) +
    float(rbucket3) * (255.0 + (i - 236.0) * -2.442) +
    float(rbucket4) * (128.0 + (i - 288.0) * -0.764) +
    float(rbucket5) * (60.0 + (i - 377.0) * -0.4477) +
    float(rbucket6) * 0.0;

Pretty disgusting.  But it works!  The full function is in the shader here.
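
To sanity-check the buckets, here is the same stepwise red channel ported to plain JavaScript.  This is a sketch rather than the shader itself: the 236/288/377 boundaries come from the offsets in the excerpt above, and the ~511 cutoff is an assumption (simply where the last ramp reaches zero).

```javascript
// Sketch of the shader's stepwise red channel in plain JavaScript.
// Boundaries 236/288/377 are taken from the offsets in the GLSL excerpt;
// the 511 cutoff is an assumption (where the final ramp hits zero).
function redChannel(i) {
  if (i < 60.0) return i * 4.25;                       // ramp 0 -> 255
  if (i < 236.0) return 255.0;                         // flat at 255
  if (i < 288.0) return 255.0 + (i - 236.0) * -2.442;  // 255 -> ~128
  if (i < 377.0) return 128.0 + (i - 288.0) * -0.764;  // 128 -> ~60
  if (i < 511.0) return 60.0 + (i - 377.0) * -0.4477;  // 60 -> ~0
  return 0.0;
}
```

The green and blue channels work the same way, just with different breakpoints.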

Plugging in the Sun’s temperature (5,778 K) gives us an exciting shade of off-white:

sun-no-noise

While beautiful, we can do better.

Base noise function (granules)

Going forward I diverge a bit from the SoA guide.  While the SoA guide chooses a temperature and then varies the intensity of the texture based on a noise function, I instead fix high and low surface temperatures for the star, and use the noise function to vary between them.  The high and low temperatures are passed into the shader as uniforms:

 var material = new THREE.ShaderMaterial({
   uniforms: {
     time: uniforms.time,
     scale: uniforms.scale,
     highTemp: {type: "f", value: starData.temperatureEstimate.value.quantity},
     lowTemp: {type: "f", value: starData.temperatureEstimate.value.quantity / 4}
   },
   vertexShader: shaders.dynamicVertexShader,
   fragmentShader: shaders.starFragmentShader,
   transparent: false,
   polygonOffset: -.1,
   usePolygonOffset: true
 });

All the noise functions below shift the pixel temperature, which is then mapped to an RGB color.
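
That per-pixel flow can be sketched in plain JavaScript (names here are illustrative, not the shader’s):

```javascript
// Illustrative sketch (not the shader's actual code): noise output in
// [0, 1] picks a temperature between the star's low and high surface
// temperatures; that temperature is then mapped to an RGB color.
function pixelTemperature(lowTemp, highTemp, noiseValue) {
  return lowTemp + (highTemp - lowTemp) * noiseValue;
}
```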

Convection currents on the surface of the sun generate noisy “granules” of hotter and cooler areas.  To represent these granules I used an available WebGL implementation of 3D simplex noise.  The base noise for a pixel is just the simplex noise at the vertex coordinates, plus some magic numbers (tuned to whatever looked “realistic”):

void main( void ) {
  float noiseBase = (noise(vTexCoord3D, .40, 0.7) + 1.0) / 2.0;

The number of octaves in the simplex noise determines the “depth” of the noise as zoom increases.  The tradeoff, of course, is that each octave increases the work the GPU does each frame, so more octaves == fewer frames per second.  Here is the sun rendered at 2 octaves:

sun-2-octaves

4 octaves (which I ended up using):

sun-4-octaves

and 8 octaves (too intense to render real-time with acceptable performance):

sun-8-octaves

Sunspots

Sunspots are areas on the surface of a star with a reduced surface temperature due to magnetic field flux.  My implementation of sunspots is pretty simple; I take the same noise function we used for the granules, but with a decreased frequency, higher amplitude and initial offset.  By only taking the positive values (the max function), the sunspots show up as discrete features rather than continuous noise.  The final value (“ss”) is then subtracted from the initial noise.

float frequency = 0.04;
float t1 = snoise(vTexCoord3D * frequency)*2.7 -  1.9;
float ss = max(0.0, t1);

This adds only a single snoise call per pixel, and looks reasonably good:

sunspots-impl

Additional temperature variation

To add a bit more variation, the noise function is used one last time, this time to raise the temperature across broader areas:

float brightNoise = snoise(vTexCoord3D * .02) * 1.4 - .9;
float brightSpot = max(0.0, brightNoise);

float total = noiseBase - ss + brightSpot;

All together, this is what the final shader looks like:

sun-final

Corona

Stars are very small on an interstellar scale.  The main goal of this project is to be able to visually hop around the Earth’s solar neighborhood, so we need to be able to see stars at a long distance (like we can in real life).

The easiest solution is to just have a very large fixed sprite attached at the star’s location.  This solution has some issues though:

  • being inside a large semi-opaque sprite (ex, when zoomed up towards a star) occludes vision of everything else
  • scaled sprites in Three.js do not play well with raycasting (the raycaster misses the sprite, making it impossible to select stars by mousing over them)
  • a fixed sprite will not vary its color by star temperature

I ended up implementing a corona shader with:

  • RGB color based on the star’s temperature (same implementation as above)
  • color near the focus trending towards pure white
  • size proportional to camera distance (up to a max distance)
  • a bit of lens flare (this didn’t work very well)

Full code here.  Lots of magic constants for aesthetics, like before.

Close to the target star, the corona is mostly occluded by the detail mesh:

corona-close

At a distance the corona remains visible:

corona-distance

On a cooler (temperature) star:

corona-flare

The corona mesh serves two purposes:

  • calculating intersections during raycasting (to enable targeting stars via mouseover and clicking)
  • star visibility

Using a custom shader to implement both of these use-cases let me cut the number of rendered three.js meshes in half; this is great, because rendering half as many objects means each frame renders twice as quickly.

Conclusions

This shader is a pretty good first step, but I’d like to make a few improvements and additions when I have a chance:

  • Solar flares (and other 3D surface activity)
  • More accurate sunspot rendering (the size and frequency aren’t based on any real science)
  • Fix coronas to more accurately represent a star’s real visual magnitude — the most obvious ones here are the largest ones, not necessarily the brightest ones

My goal is to follow up this post with a couple of others about parts of this project I think turned out well, starting with the orbit controls (the logic for panning the camera around a fixed point while orbiting).

3D map of Solar Neighborhood using three.js (again!)

A few years ago I posted about a WebGL visualization of the neighborhood around our sun.  It was never as polished as I wanted, so on-and-off over the past few months I’ve been working on making it more interesting.  The project is still located here:

http://uncharted.bpodgursky.com/

The code is still hosted on GitHub:

https://github.com/bpodgursky/uncharted

There are two improvements I’m especially excited about.  First, the star rendering now uses GLSL shaders based on the star’s temperature, giving cool (and animated!) visuals:

alpha2centauri

Second, all known exoplanets (planets orbiting stars besides our Sun) are rendered around their parent stars.  The textures here are of course fake, but the orbits are accurate where the data is known:

rocky-planet

I’ve also included all the planets in our solar system with full textures and (hopefully accurate) orbits:

mercury

I’ve updated the README on the GitHub project with all the changes (I’ve also totally reworked the controls).

I’m going to try to write some more granular posts about what actually went into the three.js and glsl to implement this, since I learned a ton in the process.

Catalog of Life Taxonomic Tree

A small visualization I’ve wanted to do for a while is a tree of life graph — visualizing all known species and their relationships.

Recently I found that the Catalog of Life project has a very accessible database of all known species / taxonomic groups and their relationships, available for download here.  This let me put together a simple site backed by their database, available here:

http://taxontree.bpodgursky.com/

uncharted-screenshot

All the source code is available on Github.

Design

I’ve used dagre + d3 on a number of other graph visualization projects, so dagre-d3 was the natural choice for the  visualization component.  The actual code required to do the graph building and rendering is pretty trivial.

The data fetching was a bit trickier.  Since pre-loading tens of millions of records was obviously unrealistic, I had to implement a graph class (BackedBiGraph) which lazily expands and collapses, using user-provided callbacks to fetch new data.  In this case, the callbacks were just ajax calls back to the server.
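
A minimal sketch of that idea in JavaScript (hypothetical names; the real BackedBiGraph is more involved, and its callbacks are ajax calls rather than synchronous functions):

```javascript
// Hypothetical sketch of a lazily-expanding graph: children are fetched
// through a user-provided callback only when a node is first expanded,
// and cached until the node is collapsed.
function LazyGraph(fetchChildren) {
  this.fetchChildren = fetchChildren; // callback: id -> array of child ids
  this.children = {};                 // cache of already-fetched expansions
}

LazyGraph.prototype.expand = function (id) {
  if (!(id in this.children)) {
    this.children[id] = this.fetchChildren(id); // fetch once, then cache
  }
  return this.children[id];
};

LazyGraph.prototype.collapse = function (id) {
  delete this.children[id]; // drop the subtree so it can be re-fetched
};
```

Expanding a node fetches its children once and caches them; collapsing drops the cache so the subtree can be fetched fresh next time.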

The Catalog of Life database did not come with a Java client, so I thought this would be a good opportunity to use jOOQ to generate Java models and query builders corresponding to the COL database, since I am allergic to writing actual SQL queries.  This ended up working very well — configuring the jOOQ Maven plugin was simple, and the generated code made writing the queries trivial:

private Collection<TaxonNodeInfo> taxonInfo(Condition condition) {
  return context.select()
      .from(TAXON)
      .leftOuterJoin(TAXON_NAME_ELEMENT)
      .on(TAXON.ID.equal(TAXON_NAME_ELEMENT.TAXON_ID))
      .leftOuterJoin(SCIENTIFIC_NAME_ELEMENT)
      .on(SCIENTIFIC_NAME_ELEMENT.ID.equal(TAXON_NAME_ELEMENT.SCIENTIFIC_NAME_ELEMENT_ID))
      .where(condition)
      .fetch()
      .stream()
      .map(record -> new TaxonNodeInfo(
          record.into(TAXON),
          record.into(TAXON_NAME_ELEMENT),
          record.into(SCIENTIFIC_NAME_ELEMENT)))
      .collect(Collectors.toList());
}

All in all, there are a lot of rough edges still, but dagre, d3 and jOOQ made this a much easier project than expected.  The code is on Github, so suggestions, improvements, or bugfixes are always welcome.

D3 NLP Visualizer Update

A couple years ago I put together a simple NLP parse tree visualizer demo which used d3 and the dagre layout library.  At the time, integrating dagre with d3 required a bit of manual work (or copy-paste); since then, however, the dagre-d3 library has been split out of dagre, adding an easy API for adding and removing nodes.  Even though the library isn’t under active development, I think it’s still the most powerful pure-JS directed graph layout library out there.

An example from the wiki shows how simple the dagre-d3 API is for creating directed graphs, abbreviated here:

    // Create the input graph
    var g = new dagreD3.graphlib.Graph()
      .setGraph({})
      .setDefaultEdgeLabel(function() { return {}; });

    // Here we're setting nodeclass, which is used by our custom drawNodes function
    // below.
    g.setNode(0,  { label: "TOP", class: "type-TOP" });

    // ... snip

    // Set up edges, no special attributes.
    g.setEdge(3, 4);
    
    // ... snip
    // Create the renderer
    var render = new dagreD3.render();

    var svg = d3.select("svg"),
        svgGroup = svg.append("g");

    // Run the renderer. This is what draws the final graph.
    render(d3.select("svg g"), g);

Since that original demo site still gets a fair amount of traffic, I thought it would be good to update it to use dagre-d3 instead of the original hand-rolled bindings (along with other cleanup).  You can see the change set required here.

The other goal was to re-familiarize myself with d3 and dagre, since I have a couple projects in mind which would make heavy use of both.  Hopefully I’ll have something to post here in the next couple months.

Simple Boolean Expression Manipulation in Java

I’ve worked on a couple projects recently where I needed to be able to do some lightweight propositional expression manipulation in Java.  Specifically, I wanted to be able to:

  • Let a user input simple logical expressions, and parse them into Java data structures
  • Evaluate the truth of the statement given values for each variable
  • Incrementally update the expression as values are assigned to the variables
  • If the statement given some variable assignments is not definitively true or false, show which terms remain.
  • Perform basic simplification of redundant terms (full satisfiability is of course NP-hard, so this would only include basic simplification)

I couldn’t find a Java library which made this particularly easy; a couple Stack Overflow questions I found didn’t turn up any good solutions either.  I decided to take a shot at implementing a basic library.  The result is on GitHub as the jbool_expressions library.

(most of the rest of this is copied from the README, so feel free to read it there.)

Using the library, a basic propositional expression is built out of the types And, Or, Not, Variable and Literal. All of these extend the base type Expression.  An Expression can be built programmatically:

    Expression expr = And.of(
        Variable.of("A"),
        Variable.of("B"),
        Or.of(Variable.of("C"), Not.of(Variable.of("C"))));
    System.out.println(expr);

or by parsing a string:

    Expression expr =
        ExprParser.parse("( ( (! C) | C) & A & B)");
    System.out.println(expr);

The expression is the same either way:

    ((!C | C) & A & B)

We can do some basic simplification to eliminate the redundant terms:

    Expression simplified = RuleSet.simplify(expr);
    System.out.println(simplified);

to see that the redundant term simplifies to “true” and drops out:

    (A & B)

We can assign a value to one of the variables, and see that the expression is simplified after assigning “A” a value:

    Expression halfAssigned = RuleSet.assign(
        simplified,
        Collections.singletonMap("A", true)
    );
    System.out.println(halfAssigned);

We can see the remaining expression:

    B

If we assign a value to the remaining variable, we can see the expression evaluate to a literal:

    Expression resolved = RuleSet.assign(
        halfAssigned,
        Collections.singletonMap("B", true)
    );
    System.out.println(resolved);
    true

All expressions are immutable (we got a new expression back each time we performed an operation), so we can see that the original expression is unmodified:

    System.out.println(expr);
    ((!C | C) & A & B)

Expressions can also be converted to sum-of-products form:

    Expression nonStandard = PrefixParser.parse(
        "(* (+ A B) (+ C D))"
    );

    System.out.println(nonStandard);

    Expression sopForm = RuleSet.toSop(nonStandard);
    System.out.println(sopForm);
    ((A | B) & (C | D))
    ((A & C) | (A & D) | (B & C) | (B & D))

You can build the library yourself or grab it via Maven:

<dependency>
    <groupId>com.bpodgursky</groupId>
    <artifactId>jbool_expressions</artifactId>
    <version>1.3</version>
</dependency>

Happy to hear any feedback / bugs / improvements etc. I’d also be interested in hearing how other people have dealt with this problem, and if there are any better libraries out there.

Average Income per Programming Language

Update 8/21:  I’ve gotten a lot of feedback about issues with these rankings from comments, and have tried to address some of them here.  The data there has been updated to include confidence intervals.

———————————————————————————————————

A few weeks ago I described how I used Git commit metadata plus the Rapleaf API to build aggregate demographic profiles for popular GitHub organizations (blog post here, per-organization data available here).

I was also interested in slicing the data somewhat differently, breaking down demographics per programming language instead of per organization.  Stereotypes about developers of various languages abound, but I was curious how these lined up with reality.  The easiest place to start was age, income, and gender breakdowns per language. Given the data I’d already collected, this wasn’t too challenging:

  • For each repository I used GitHub’s estimate of a repository’s language composition.  For example, GitHub estimates this project at 75% Java.
  • For each language, I aggregated incomes for all developers who have contributed to a project which is at least 50% that language (by the above measure).
  • I filtered for languages with > 100 available income data points.
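
The steps above can be sketched as follows (hypothetical data shapes, not the actual crawl format; for simplicity this skips deduplicating developers who contribute to multiple repositories):

```javascript
// Sketch of the aggregation: each repo has a language composition
// (fractions summing to ~1) and the incomes of its contributors.
function incomeByLanguage(repos, minPoints) {
  var incomes = {}; // language -> array of contributor incomes
  repos.forEach(function (repo) {
    Object.keys(repo.languages).forEach(function (lang) {
      if (repo.languages[lang] >= 0.5) { // repo is at least 50% this language
        incomes[lang] = (incomes[lang] || []).concat(repo.incomes);
      }
    });
  });
  var averages = {};
  Object.keys(incomes).forEach(function (lang) {
    if (incomes[lang].length > minPoints) { // drop thinly-sampled languages
      var sum = incomes[lang].reduce(function (a, b) { return a + b; }, 0);
      averages[lang] = sum / incomes[lang].length;
    }
  });
  return averages;
}
```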

Here are the results for income, sorted from lowest average household income to highest:

Language Average Household Income ($) Data Points
Puppet 87,589.29 112
Haskell 89,973.82 191
PHP 94,031.19 978
CoffeeScript 94,890.80 435
VimL 94,967.11 532
Shell 96,930.54 979
Lua 96,930.69 101
Erlang 97,306.55 168
Clojure 97,500.00 269
Python 97,578.87 2314
JavaScript 97,598.75 3443
Emacs Lisp 97,774.65 355
C# 97,823.31 665
Ruby 98,238.74 3242
C++ 99,147.93 845
CSS 99,881.40 527
Perl 100,295.45 990
C 100,766.51 2120
Go 101,158.01 231
Scala 101,460.91 243
ColdFusion 101,536.70 109
Objective-C 101,801.60 562
Groovy 102,650.86 116
Java 103,179.39 1402
XSLT 106,199.19 123
ActionScript 108,119.47 113

Here’s the same data in chart form:

Language vs Income

Most of the language rankings were roughly in line with my expectations, to the extent I had any:

  • Haskell is a very academic language, and academia is not known for generous salaries
  • PHP is a very accessible language, and it makes sense that casual / younger / lower paid programmers can easily contribute
  • On the high end of the spectrum, Java and ActionScript are used heavily in enterprise software, and enterprise software is certainly known to pay well

On the other hand, I’m unfamiliar with some of the other languages on the high/low ends like XSLT, Puppet, and CoffeeScript.  Any ideas on why these languages ranked higher or lower than average?

Caveats before making too many conclusions from the data here:

  • These are all open-source projects, which may not accurately represent compensation among closed-source developers
  • Rapleaf data does not have total income coverage, and the sample may be biased
  • I have not corrected for any other skew (age, gender, etc)
  • I haven’t crawled all repositories on GitHub, so the users for whom I have data may not be a representative sample

That said, even though the absolute numbers may be biased, I think this is a good starting point when comparing relative compensation between languages.

Let me know any thoughts or suggestions about the methodology or the results.  I’ll follow up soon with age and gender breakdowns per language in a similar fashion.

Fast asymmetric Hadoop joins using Bloom Filters and Cascading

In a recent post for the Liveramp blog I describe how we use Bloom filters to optimize our Hadoop jobs:

We recently open-sourced a number of internal tools we’ve built to help our engineers write high-performance Cascading code as the cascading_ext project. Today I’m going to talk about a tool we use to improve the performance of asymmetric joins—joins where one data set in the join contains significantly more records than the other, or where many of the records in the larger set don’t share a common key with the smaller set.

Check out the rest of the post here.