Carrot2
Carrot² is an open source search results clustering engine. It can automatically cluster small collections of documents, e.g. search results or document abstracts, into thematic categories. Apart from two specialized search results clustering algorithms, Carrot² offers ready-to-use components for fetching search results from various sources. Carrot² is written in Java and distributed under the BSD license.
History
The initial version of Carrot² was implemented in 2001 by Dawid Weiss as part of his MSc thesis to validate the applicability of the STC clustering algorithm to clustering search results in Polish. In 2003, a number of other search results clustering algorithms were added, including Lingo, a novel text clustering algorithm designed specifically for clustering search results. Although the source code of Carrot² had been available since 2002, version 1.0 was not officially released until 2006. In the same year, version 2.0 was released with an improved user interface and an extended tool set. In 2009, version 3.0 brought significant improvements in clustering quality, a simplified API and a new GUI application for tuning clustering, based on the Eclipse Rich Client Platform.
Release | Release Date | Major changes and new features |
---|---|---|
3.5.2 | September 2011 | Ajax support in Document Clustering Server, Bing document source improved, Workbench improvements, bug fixes. |
3.5.1 | June 2011 | Bug fixes, visualization integration improvements, support for Yahoo BOSS API removed. |
3.5.0 | May 2011 | FoamTree visualization, bisecting k-means clustering, resource management improvements |
3.4.3 | March 2011 | Distribution to Maven central repository |
3.4.2 | October 2010 | Bug fixes |
3.4.1 | September 2010 | Solr 1.4.x compatibility package, bug fixes |
3.4.0 | August 2010 | .NET API for calling Carrot² clustering |
3.3.0 | April 2010 | Significant scalability improvements in the STC clustering algorithm |
3.2.0 | March 2010 | Experimental support for clustering Arabic and Korean content, command line application for clustering in batch mode, LGPL-licensed dependencies removed |
3.1.0 | September 2009 | Experimental support for clustering Chinese content, search results clustering plugin for Apache Solr |
3.0.1 | March 2009 | Document Clustering Workbench available for Mac OS X |
3.0.0 | January 2009 | Document Clustering Workbench added for easy experimenting with Carrot² clustering, radically simplified Java API, search results clustering web application re-implemented, user manual available |
2.1.0 | August 2007 | Document Clustering Server added for exposing clustering as a REST service |
2.0.0 | September 2006 | New user interface of the search results clustering web application |
1.0.0 | January 2006 | First official release, binaries available on SourceForge |
0.0.0 | since 2002 | Incubation releases, source code available on SourceForge |
Architecture and components
The architecture of Carrot² is based on processing components arranged into pipelines. The two major groups of processing components in Carrot² are document sources and clustering algorithms.
Document sources
Document sources provide data for further processing. Typically, they fetch search results from an external search engine or a Lucene / Solr index, or load text files from a local disk.
Currently, Carrot² has built-in support for the following document sources:
- Yahoo! Search BOSS API
- Google Search API
- Bing Search API
- Google Desktop
- Open Search
- PubMed
- Lucene index
- Solr server
- eTools metasearch engine
- Generic XML files
Other document sources can be easily integrated based on the code examples provided with the Carrot² distribution.
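The pipeline arrangement described above, in which a document source feeds a clustering algorithm, can be illustrated with a minimal sketch. It does not use the Carrot² API: the types `Doc`, `DocumentSource` and `ClusteringAlgorithm`, the in-memory corpus and the toy "group by first title word" algorithm are all hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PipelineSketch {
    /** Hypothetical document type: a title plus a short snippet. */
    public static final class Doc {
        public final String title;
        public final String snippet;

        public Doc(String title, String snippet) {
            this.title = title;
            this.snippet = snippet;
        }
    }

    /** Hypothetical first pipeline stage: produces documents for a query. */
    public interface DocumentSource {
        List<Doc> fetch(String query);
    }

    /** Hypothetical second pipeline stage: maps documents to labeled groups. */
    public interface ClusteringAlgorithm {
        Map<String, List<Doc>> cluster(List<Doc> docs);
    }

    /** Runs the two stages in order: the source first, then the algorithm. */
    public static Map<String, List<Doc>> process(
            String query, DocumentSource source, ClusteringAlgorithm algorithm) {
        return algorithm.cluster(source.fetch(query));
    }

    /** Wires an in-memory source to a toy algorithm (illustration only). */
    public static Map<String, List<Doc>> demo(String query) {
        final List<Doc> corpus = Arrays.asList(
                new Doc("Apache Solr", "search server"),
                new Doc("Apache Lucene", "search library"),
                new Doc("PubMed", "biomedical abstracts"));

        // Source: keep documents whose title or snippet mentions the query.
        DocumentSource inMemory = q -> {
            List<Doc> hits = new ArrayList<>();
            for (Doc d : corpus) {
                String text = (d.title + " " + d.snippet).toLowerCase();
                if (text.contains(q.toLowerCase())) {
                    hits.add(d);
                }
            }
            return hits;
        };

        // Algorithm: group documents by the first word of their title.
        ClusteringAlgorithm byFirstWord = docs -> {
            Map<String, List<Doc>> groups = new LinkedHashMap<>();
            for (Doc d : docs) {
                String label = d.title.split(" ")[0];
                groups.computeIfAbsent(label, k -> new ArrayList<>()).add(d);
            }
            return groups;
        };

        return process(query, inMemory, byFirstWord);
    }

    public static void main(String[] args) {
        demo("search").forEach((label, docs) ->
                System.out.println(label + ": " + docs.size() + " document(s)"));
    }
}
```

Because the two stages only meet in `process`, either one can be swapped without touching the other, which is the property that makes adding a new document source or algorithm a local change.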
Clustering algorithms
Carrot² offers two specialized document clustering algorithms that place emphasis on the quality of cluster labels:
- Lingo: a clustering algorithm based on the singular value decomposition
- STC: Suffix Tree Clustering
Other algorithms can be easily added to Carrot².
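To give a feel for what "emphasis on the quality of cluster labels" can mean in practice, the sketch below groups documents under phrases they share, so that every cluster comes with a readable label by construction. This is neither Lingo nor STC: it is a deliberately simplified stand-in that uses fixed two-word phrases and a hash map instead of a suffix tree, and it does not merge overlapping clusters. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class PhraseClusterSketch {
    /**
     * Groups document indices under every two-word phrase that occurs in
     * at least minDocs distinct documents. The phrase itself is the label.
     */
    public static Map<String, List<Integer>> cluster(List<String> docs, int minDocs) {
        // Phrase -> indices of the documents that contain it.
        Map<String, Set<Integer>> phraseToDocs = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            String[] words = docs.get(i).toLowerCase().split("\\W+");
            for (int j = 0; j + 1 < words.length; j++) {
                if (words[j].isEmpty()) {
                    continue; // split() may yield a leading empty token
                }
                String phrase = words[j] + " " + words[j + 1];
                phraseToDocs.computeIfAbsent(phrase, k -> new TreeSet<>()).add(i);
            }
        }

        // Keep only phrases shared by enough documents.
        Map<String, List<Integer>> clusters = new TreeMap<>();
        for (Map.Entry<String, Set<Integer>> e : phraseToDocs.entrySet()) {
            if (e.getValue().size() >= minDocs) {
                clusters.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Apache Solr search server",
                "Open source search server for enterprises",
                "Suffix tree data structure",
                "Building a suffix tree quickly");
        // Prints {search server=[0, 1], suffix tree=[2, 3]}
        System.out.println(cluster(docs, 2));
    }
}
```

A document can appear under several labels here, which is acceptable for search results, where overlapping categories are often more useful than a strict partition.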
Java API
Being implemented in Java, Carrot² can be integrated with Java software through its native Java API.
C# / .NET API
Carrot² provides a native C# API for calling clustering from C# / .NET software without installing a Java runtime. The Carrot² C# API requires .NET Framework version 3.5 or later.
Other platforms
Other platforms can call Carrot² clustering through the REST service exposed by the Document Clustering Server. Example integration code is provided for PHP5, C#, Ruby and cURL.
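As a rough illustration of what such a call involves, the sketch below prepares a form-encoded request body of the kind a REST client would POST to the server. The endpoint URL and the parameter names (`dcs.source`, `query`, `dcs.output.format`) are assumptions made for this example and should be checked against the Document Clustering Server documentation for the version in use; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class DcsRequestSketch {
    // Assumed address of a locally running Document Clustering Server.
    public static final String ENDPOINT = "http://localhost:8080/dcs/rest";

    /** Encodes parameters as an application/x-www-form-urlencoded body. */
    public static String formBody(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return body.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(ex);
        }
    }

    /** Builds request parameters; all parameter names are assumptions. */
    public static Map<String, String> clusteringParams(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("dcs.source", "etools");      // assumed: which document source to query
        params.put("query", query);              // assumed: the search query
        params.put("dcs.output.format", "JSON"); // assumed: response format
        return params;
    }

    public static void main(String[] args) {
        System.out.println("POST " + ENDPOINT);
        System.out.println(formBody(clusteringParams("data mining")));
    }
}
```

Any HTTP client can send the resulting body, which is why integration from languages other than Java and C# needs no Carrot²-specific library.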
Tools
Carrot² offers a number of supporting tools that can be used to quickly set up clustering on custom data, further tune clustering results and expose Carrot² clustering as a remote service:
- Carrot2 Document Clustering Workbench: a standalone GUI application for experimenting with Carrot² clustering on data from common search engines or custom data,
- Carrot2 Document Clustering Server: exposes Carrot² clustering as a REST service,
- Carrot2 Command Line Interface: applications that allow invoking Carrot² clustering from the command line,
- Carrot2 Web Application: exposes Carrot² clustering as a web application for end users.
Carrot Search
Carrot Search, a commercial spin-off of the Carrot² project, works on further development of Carrot², offers a real-time text clustering algorithm compliant with the Carrot² framework as well as text mining consulting services based on open source and proprietary software.
Carrot Search Labs
Carrot² gave rise to a number of independent open source projects released under the umbrella of Carrot Search Labs. Currently, the following projects are available:
- High Performance Primitive Collections for Java: Lists, Sets, Maps and other collections of primitives for Java tuned for highest performance and memory efficiency.
- jSuffixArrays: Several Java implementations of the Suffix Array data structure with different performance and memory characteristics.
- JUnitBenchmarks: A set of extensions for turning JUnit4 tests into performance micro-benchmarks with GC monitoring, time variance measurement and simple graphical visualizations.
- SmartSprites: Fully automatic maintenance of CSS sprites, with no tedious copying and pasting into the CSS when adding or changing sprited images.