DataparkSearch
DataparkSearch is a search engine designed to organize search within a website, group of websites, intranet or local system.
DataparkSearch is written in C. Distributed under the terms of the GNU General Public License, DataparkSearch is free software.
In 2005, DataparkSearch participated in the US National Institute of Standards and Technology's Text Retrieval Conference (TREC). Their submission is available in PDF, as are the results of their runs, dpsearch1 and dpsearch2.
Key features
- Support for http, https, ftp, nntp and news URL schemes.
- htdb virtual URL scheme for indexing SQL databases.
- Indexes text/html, text/xml, text/plain, audio/mpeg (mp3) and image/gif MIME types natively.
- External parser support for other document types, including Microsoft Word, Excel, RTF, PowerPoint, Adobe Acrobat PDF and Flash.
- Can index multilingual sites using content negotiation.
- Can search all word forms using ispell affixes and dictionaries.
- Synonym, acronym and abbreviation query expansion based on editable dictionaries, specified by language and charset.
- Stop-word, synonym and acronym lists.
- Options to query with all words, all words near each other, any words, or Boolean queries. A subset of VQL (Verity Query Language) is supported.
- Popularity Rank based on a neural network model.
- Results can be sorted by relevancy (using vector calculation), by popularity rank (either "Goo", which adds weight for incoming links, or "Neo", a neural network model), by last modified time, or by "importance" (a combination of relevancy and popularity rank).
- Supports a wide range of character sets, with automated character set and language detection.
- Offers an accent-insensitive search option.
- Provides phrase segmenting (tokenizing) for Chinese, Japanese, Korean and Thai.
- Includes an indexer and a web CGI front-end, as well as a search module for the Apache web server (mod_dpsearch).
- Handles Internationalized Domain Names (IDN).
- Summary Extraction Algorithm automatically sums up each document in several sentences.
- Uses If-Modified-Since for efficient transfer of only changed files.
- Can rewrite URLs that contain session IDs or other unusual formats, including some JavaScript link decoding.
- Can perform parallel and multi-threaded indexing for faster updating.
- Flexible update scheduling, including options for checking some sections of a site more frequently.
- Handles basic authentication (user name and password) and cookies.
- Stores a compressed text version of the documents for extracting and viewing.
- Can specify a default character set and language for a server or subdirectory, or a list of possible languages.
- Noindex tags: <!--UdmComment-->, <NOINDEX> and <!--noindex-->. Google's special comments <!-- google_ad_section_start -->, <!-- google_ad_section_start(weight=ignore) --> and <!-- google_ad_section_end --> are also treated as tags for including or excluding content.
- Can specify a content body tag.
- Spellchecking for query words with aspell.
- Flexible options and commands to customize search result pages.
- Effective caching significantly reduces search times.
- Query logging stores the query, query parameters and the number of results found.
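The accent-insensitive search option listed among the features above can be illustrated with a minimal Python sketch. This is not DataparkSearch's own C implementation: the function names strip_accents and accent_insensitive_contains are hypothetical, and the technique shown (Unicode NFD decomposition followed by dropping combining marks) is one common approach assumed here purely for illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining diacritical marks removed (illustrative helper)."""
    # NFD splits a precomposed character such as "é" into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Accent marks produced by the decomposition have a non-zero
    # canonical combining class, so they are filtered out here.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


def accent_insensitive_contains(document: str, query: str) -> bool:
    """Hypothetical helper: True if query occurs in document, ignoring accents and case."""
    folded_doc = strip_accents(document).casefold()
    folded_query = strip_accents(query).casefold()
    return folded_query in folded_doc


if __name__ == "__main__":
    print(strip_accents("café"))
    print(accent_insensitive_contains("Le Café de Flore", "cafe"))
```

With this approach both the indexed text and the query are folded the same way before comparison. Letters that have no Unicode decomposition, such as "ø", are left unchanged, so a real engine would likely need additional per-language mapping tables.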
External links
- Official page of the project
- Home at Google Code
- FreeBSD's port
- Search Tools Product Report: DataparkSearch Engine
- Newslookup.com -- A news service using DataparkSearch Engine.