Cascading
Cascading is a software abstraction layer for Apache Hadoop. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs. It is open source and available under the GPL license. Commercial OEM licenses are available from Concurrent, Inc.
Cascading was originally authored by Chris Wensel, who later founded Concurrent, Inc. Cascading is being actively developed by the community and a number of add-on modules are available.
Architecture
To use Cascading, Apache Hadoop must also be installed, and the Hadoop job .jar must contain the Cascading .jars. Cascading consists of a data processing API, an integration API, a process planner and a process scheduler.
Cascading leverages the scalability of Hadoop but abstracts standard data processing operations away from the underlying map and reduce tasks. Developers use Cascading to create a .jar file that describes the required processes. It follows a ‘source-pipe-sink’ paradigm: data is captured from sources, passes through reusable ‘pipes’ that perform data analysis processes, and the results are stored in output files or ‘sinks’. Pipes are created independently of the data they will process. Once pipes are tied to data sources and sinks, the result is called a ‘flow’. Flows can be grouped into a ‘cascade’, and the process scheduler ensures that a given flow does not execute until all of its dependencies are satisfied. Pipes and flows can be reused and reordered to support different business needs.
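The source-pipe-sink idea can be illustrated with a small plain-Java model. This is only a sketch of the vocabulary used above: the types `FlowSketch`, `Pipe` and `Flow` and the helpers `dropBlankLines` and `upperCase` are hypothetical names invented for this illustration and are not the Cascading API. In-memory lists stand in for the sources and sinks that would be files on a Hadoop cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy model of the source-pipe-sink paradigm. All types here are
// hypothetical; this is NOT the Cascading API.
class FlowSketch {

    // A "pipe" is a reusable transformation defined without any data attached.
    interface Pipe extends Function<List<String>, List<String>> {
        // Chain two pipes into a longer one.
        default Pipe then(Pipe next) {
            return input -> next.apply(this.apply(input));
        }
    }

    // A "flow" is a pipe bound to a concrete source and sink.
    static final class Flow {
        private final List<String> source;
        private final Pipe pipe;
        private final List<String> sink;

        Flow(List<String> source, Pipe pipe, List<String> sink) {
            this.source = source;
            this.pipe = pipe;
            this.sink = sink;
        }

        // Read the source, run it through the pipe, write to the sink.
        void complete() {
            sink.addAll(pipe.apply(source));
        }
    }

    static Pipe dropBlankLines() {
        return lines -> lines.stream()
                .filter(line -> !line.trim().isEmpty())
                .collect(Collectors.toList());
    }

    static Pipe upperCase() {
        return lines -> lines.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The pipe is built first, independently of any data.
        Pipe cleanup = dropBlankLines().then(upperCase());

        // Binding it to a source and a sink yields a flow.
        List<String> source = Arrays.asList("get /index", "", "post /login");
        List<String> sink = new ArrayList<>();
        new Flow(source, cleanup, sink).complete();

        System.out.println(sink); // prints [GET /INDEX, POST /LOGIN]
    }
}
```

Because `cleanup` holds no reference to its input, the same pipe could be bound to a different source and sink to form another flow, which is the kind of reuse the paragraph above describes.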
Developers write the code in a JVM-based language and do not need to learn MapReduce. The resulting program can be regression tested and integrated with external applications like any other Java application.
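The dependency rule for cascades described above (a flow does not run until the flows it depends on have finished) can be sketched in a similar way. The class `CascadeSketch` and its methods are hypothetical and are not Cascading's actual scheduler. The sketch assumes dependencies are declared explicitly by flow name and only computes a valid execution order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of cascade ordering; not Cascading's real scheduler.
class CascadeSketch {

    // Flow name -> names of the flows it depends on.
    private final Map<String, Set<String>> dependencies = new LinkedHashMap<>();

    CascadeSketch addFlow(String name, String... dependsOn) {
        dependencies.put(name, new LinkedHashSet<>(Arrays.asList(dependsOn)));
        return this;
    }

    // Returns an order in which every flow appears after all of its dependencies.
    List<String> executionOrder() {
        List<String> order = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        while (done.size() < dependencies.size()) {
            boolean progressed = false;
            for (Map.Entry<String, Set<String>> entry : dependencies.entrySet()) {
                String flow = entry.getKey();
                // A flow is ready once everything it depends on has completed.
                if (!done.contains(flow) && done.containsAll(entry.getValue())) {
                    done.add(flow);
                    order.add(flow);
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic or missing dependency");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        CascadeSketch cascade = new CascadeSketch()
                .addFlow("report", "join")
                .addFlow("join", "parseLogs", "parseUsers")
                .addFlow("parseLogs")
                .addFlow("parseUsers");

        // prints [parseLogs, parseUsers, join, report]
        System.out.println(cascade.executionOrder());
    }
}
```

The flows are registered in the "wrong" order on purpose: the computed order still places `join` after both parsing flows and `report` last.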
Cascading is most often used for ad targeting, log file analysis, bioinformatics, machine learning, predictive analytics, web content mining, and extract, transform and load (ETL) applications.
Uses of Cascading
Cascading was cited by SD Times in 2011 as one of the top five most powerful Hadoop projects, has been cited as a major open source project relevant to bioinformatics, and is included in Hadoop: The Definitive Guide, by Tom White. The project is also widely cited in presentations, conference proceedings and Hadoop user group meetings as a useful tool for working with Hadoop.
- MultiTool on Amazon Web Services was developed using Cascading.
- LogAnalyzer for Amazon CloudFront was developed using Cascading.
- BackType - social analytics platform
- Etsy - marketplace
- FlightCaster - predicting flight delays
- Ion Flux - analyzing DNA sequence data
- RapLeaf - personalization and recommendation systems
- Razorfish - digital advertising
Other users are listed on the cascading.org site.
Domain-Specific Languages Built on Cascading
- Cascading.jruby - developed by Gregoire Marabout, available on GitHub
- Cascalog - authored by Nathan Marz, available on GitHub