YaCy
YaCy is a free distributed search engine built on principles of peer-to-peer (P2P) networks. Its core is a computer program written in Java, distributed on several hundred computers, so-called YaCy-peers. Each YaCy-peer independently crawls the Internet, analyzes and indexes the web pages it finds, and stores the indexing results in a common database (the so-called index), which is shared with other YaCy-peers using P2P principles.
Compared to semi-distributed search engines, the YaCy network has a decentralised architecture. All YaCy-peers are equal and no central server exists. YaCy can be run either in a crawling mode or as a local proxy server, indexing the web pages visited by the person running YaCy on his or her computer. (Several mechanisms are provided to protect the user's privacy.)
Access to the search functions is provided by a locally running web server, which offers a search box for entering search terms and returns results in a format similar to other popular search engines.
The program is released under the GPL license.
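Since search access goes through the peer's own local web server, a search can be issued with a plain HTTP request. The sketch below builds such a request URL; the default port (8090) and the JSON endpoint name (`yacysearch.json`) are assumptions based on a stock YaCy install and may need adjusting for a particular setup.

```python
from urllib.parse import urlencode

# Assumed defaults for a local YaCy peer; adjust host/port/endpoint
# to match your own installation.
YACY_BASE = "http://localhost:8090"

def build_search_url(term: str, max_results: int = 10) -> str:
    """Build a search URL for the locally running YaCy web interface."""
    params = urlencode({"query": term, "maximumRecords": max_results})
    return f"{YACY_BASE}/yacysearch.json?{params}"

# Fetching the results would then be, for example:
#   import json, urllib.request
#   hits = json.load(urllib.request.urlopen(build_search_url("wikipedia")))
print(build_search_url("distributed search"))
```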
Architecture
The YaCy search engine is based on four elements:
- Crawler: a search robot that traverses from web page to web page and analyzes their content.
- Indexer: creates a Reverse Word Index (RWI), i.e. each word in the RWI has its own list of relevant URLs and ranking information. Words are saved in the form of word hashes.
- Search and administration interface: provided as a web interface by a local HTTP servlet with a servlet engine.
- Data storage: used to store the Reverse Word Index database, utilizing a distributed hash table (DHT).
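The indexer's Reverse Word Index can be illustrated with a toy sketch: each (hashed) word maps to the set of URLs whose text contains it. YaCy's actual hash function and storage format differ; this only shows the idea of keying the index by word hashes rather than by the words themselves.

```python
import hashlib
from collections import defaultdict

def word_hash(word: str) -> str:
    """Stand-in for YaCy's word hash: a short digest of the lowercased word."""
    return hashlib.sha1(word.lower().encode()).hexdigest()[:12]

def build_rwi(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word hash to the set of URLs whose text contains the word."""
    rwi: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in text.split():
            rwi[word_hash(word)].add(url)
    return rwi

# Two illustrative "crawled" pages.
pages = {
    "http://example.org/a": "yacy is a distributed search engine",
    "http://example.org/b": "a distributed hash table stores the index",
}
rwi = build_rwi(pages)
print(sorted(rwi[word_hash("distributed")]))  # both URLs contain "distributed"
```

In the real system this mapping is what gets partitioned across peers via the distributed hash table, so the peer responsible for a word hash can answer lookups for that word.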
Advantages
- As there is no central server, the results cannot be censored, and the reliability is (at least theoretically) higher.
- Because the engine is not owned by a company, there is no centralized advertising.
- Because of the design of YaCy, it can be used to index the 'hidden web', including Tor, I2P or Freenet.
- It is possible to achieve a high degree of privacy.
- The YaCy protocol uses HTTP requests, which preserves transparency and discoverability while aiding diagnosis and investigation. With the use of compression such as gzip, performance can be increased to near that of binary-only protocols (like TCP and UDP; see the Disadvantages section).
- Built-in support for OpenSearch.
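The compression point above can be demonstrated directly: a repetitive text-based protocol payload (the sample body below is purely illustrative, not actual YaCy protocol traffic) shrinks substantially under gzip, narrowing the gap with binary protocols.

```python
import gzip

# Illustrative text payload standing in for an HTTP message body;
# repetitive markup like this compresses very well.
payload = ("<result><url>http://example.org/</url></result>" * 200).encode()

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)
print(f"{len(payload)} bytes -> {len(compressed)} bytes "
      f"({ratio:.1%} of original)")
```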
Disadvantages
- As there is no central server and the YaCy network is open to anyone, malicious peers could (in theory) insert inaccurate or commercially biased search results. However, no search result displayed to the user can be 'wrong', since all results are verified by downloading each page from the result set to check that the searched word actually exists on the page at the result URL.
- The YaCy protocol uses HTTP requests, which can be slower than non-text (binary-only) protocols if left uncompressed.
- Ranking of sites is done on the client side in YaCy (most users are encouraged to run their own YaCy server, as using a local server is necessary to gain many of the benefits of YaCy). Ranking algorithms, although easily customized, do not have their workload distributed, and are limited to the YaCy word index and whatever analysis can be done on the object being ranked. Therefore, more complex ranking algorithms such as those used by Google (which rank using a variety of contextual factors developed during web spidering) are not available in YaCy, placing severe limits on most users' means to retrieve the results they seek. For instance, none of the top 10 results returned by YaCy's public search for the query "Google" actually refer to Google's homepage.
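The result-verification step described above can be sketched as a simple filter: before a hit is shown, the page is fetched and checked for the searched word, so a malicious peer cannot slip in a URL that lacks the term. The fetcher below is a stand-in keyed on sample pages; a real client would perform an HTTP GET per URL.

```python
def verify_results(term: str, candidate_urls, fetch) -> list[str]:
    """Keep only URLs whose fetched text actually contains the term."""
    term = term.lower()
    return [url for url in candidate_urls
            if term in fetch(url).lower()]

# Simulated fetch: URL -> page text. One URL plays the role of a bogus,
# injected search result that does not contain the query term.
fake_pages = {
    "http://example.org/good": "All about distributed search engines.",
    "http://example.org/spam": "Buy cheap watches now!",
}
verified = verify_results("distributed", fake_pages, fake_pages.get)
print(verified)  # only the page that really contains the term survives
```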
See also
- Dooble, an open-source web browser with an integrated YaCy tool widget for search engines
- Sciencenet, a search engine for scientific knowledge, based on YaCy
External links
- YaCy website
- YaCy Web Search
- Peer-Search.net — a public YaCy search client
- English forum
- German forum
- The YaCy-Wiki
- Developer page at BerliOS
- Demo — search the internet through a random YaCy-member
- YaCy on Twitter
- YaCy Demo 'kupferhammer-keller'