Fast asymmetric Hadoop joins using Bloom Filters and Cascading

bpodgursky, Open Source, April 10, 2013 (updated July 31, 2015)

In a recent post for the LiveRamp blog, I describe how we use Bloom filters to optimize our Hadoop jobs:

We recently open-sourced a number of internal tools we’ve built to help our engineers write high-performance Cascading code as the cascading_ext project. Today I’m going to talk about a tool we use to improve the performance of asymmetric joins: joins where one data set in the join contains significantly more records than the other, or where many of the records in the larger set don’t share a common key with the smaller set.
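
To make the idea concrete, here is a minimal sketch in plain Java of a Bloom-filter-assisted join. It is not the cascading_ext API and does not use Cascading or Hadoop at all; the class `BloomJoinSketch`, its methods, and the filter sizing are hypothetical names and values chosen for illustration. It builds a filter from the keys of the smaller data set, uses it to discard records from the larger data set that cannot possibly match, and then runs an exact join on the survivors to remove any false positives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BloomJoinSketch {

    /** Minimal Bloom filter over String keys (illustrative, not tuned). */
    static final class BloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        BloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // Derive the i-th bit position from two base hashes (double hashing).
        private int position(String key, int i) {
            int h1 = key.hashCode();
            int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(position(key, i));
            }
        }

        // May return true for a key that was never added (false positive),
        // but never returns false for a key that was added.
        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(position(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Step 1: build the filter from the keys of the smaller data set. */
    static BloomFilter buildFilter(Collection<String> smallKeys, int numBits, int numHashes) {
        BloomFilter filter = new BloomFilter(numBits, numHashes);
        for (String key : smallKeys) {
            filter.add(key);
        }
        return filter;
    }

    /** Step 2: drop records of the larger set whose key cannot be in the smaller set. */
    static List<String[]> prefilter(List<String[]> large, BloomFilter filter) {
        List<String[]> candidates = new ArrayList<>();
        for (String[] record : large) {
            if (filter.mightContain(record[0])) {
                candidates.add(record);
            }
        }
        return candidates;
    }

    /** Step 3: exact join on the surviving records; this removes false positives. */
    static List<String> exactJoin(List<String[]> candidates, Map<String, String> small) {
        List<String> joined = new ArrayList<>();
        for (String[] record : candidates) {
            String match = small.get(record[0]);
            if (match != null) {
                joined.add(record[0] + "," + record[1] + "," + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> small = new HashMap<>();
        small.put("a", "A");
        small.put("b", "B");
        List<String[]> large = Arrays.asList(
                new String[] {"a", "1"}, new String[] {"x", "2"},
                new String[] {"b", "3"}, new String[] {"y", "4"});

        BloomFilter filter = buildFilter(small.keySet(), 1 << 16, 3);
        System.out.println(exactJoin(prefilter(large, filter), small));
    }
}
```

In this in-memory toy the exact lookup is cheap, so the filter saves nothing. The pattern pays off when the exact join is expensive, for example when every record that survives the filter has to be shuffled to a reducer, because the filter is small enough to apply to each record of the larger set before that cost is paid.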

Check out the rest of the post here.

Tagged: hadoop, java
