Fast asymmetric Hadoop joins using Bloom Filters and Cascading

In a recent post for the Liveramp blog I describe how we use Bloom filters to optimize our Hadoop jobs:

We recently open-sourced a number of internal tools we’ve built to help our engineers write high-performance Cascading code as the cascading_ext project. Today I’m going to to talk about a tool we use to improve the performance of asymmetric joins—joins where one data set in the join contains significantly more records than the other, or where many of the records in the larger set don’t share a common key with the smaller set.

Check out the rest of the post here.

Advertisements
This entry was posted in Open Source and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s