CiteSeer
CiteSeer was a public search engine and digital library
for scientific and academic papers. It is often considered the first automated citation indexing system and a predecessor of academic search tools such as Google Scholar and Microsoft Academic Search. It was replaced by CiteSeerX, and all queries to CiteSeer are redirected to it. It was created by researchers Steve Lawrence, Kurt Bollacker and Lee Giles while they were at the NEC Research Institute (now NEC Labs) in Princeton, New Jersey, USA. CiteSeer's goal was to actively crawl and harvest academic and scientific documents on the web and to use autonomous citation indexing to permit querying by citation or by document, ranking results by citation impact. After NEC, it was hosted as CiteSeer.IST on the World Wide Web at the College of Information Sciences and Technology, The Pennsylvania State University, and had over 700,000 documents, primarily in the fields of computer and information science and engineering.
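The idea behind a citation index can be illustrated with a small sketch. It is not CiteSeer's actual implementation: it assumes the reference lists have already been extracted from the documents and resolved to identifiers (the hard part that autonomous citation indexing automates), and it uses a plain citation count as a stand-in for citation impact. The document identifiers and function names are hypothetical.

```python
from collections import defaultdict


def build_citation_index(references):
    """Invert {citing_doc: [cited_doc, ...]} into {cited_doc: {citing_doc, ...}}."""
    cited_by = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            if cited != citing:  # ignore self-references in this toy model
                cited_by[cited].add(citing)
    return cited_by


def rank_by_citations(cited_by):
    """Return (doc, citation_count) pairs, most cited first; ties broken by doc id."""
    pairs = ((doc, len(citers)) for doc, citers in cited_by.items())
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical reference lists, keyed by the citing document.
references = {
    "paper_a": ["paper_c", "paper_d"],
    "paper_b": ["paper_c"],
    "paper_c": ["paper_d"],
    "paper_e": ["paper_c", "paper_c"],  # a repeated reference is counted once
}

index = build_citation_index(references)
ranking = rank_by_citations(index)
```

Querying "by citation" then amounts to looking up a cited document in the index to find which later documents cite it, while the ranking orders documents by how often they are cited.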
CiteSeer freely provided Open Archives Initiative metadata for all indexed documents and, when possible, linked indexed documents to other sources of metadata such as DBLP and the ACM Portal.
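As a rough illustration of how such metadata is typically harvested, the sketch below builds a request URL for the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It assumes the metadata was exposed through OAI-PMH; the base URL is a placeholder, not CiteSeer's actual endpoint, and the function name is hypothetical. The `ListRecords` verb, the `oai_dc` metadata prefix and the `from` and `resumptionToken` arguments come from the OAI-PMH specification.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc",
                         from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it continues
        # an earlier list request and must not be combined with the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            params["from"] = from_date  # selective harvesting by datestamp
    return base_url + "?" + urlencode(params)


# Placeholder endpoint, used only for illustration.
url = oai_list_records_url("http://example.org/oai2", from_date="2005-01-01")
```

A harvester would fetch this URL, parse the returned XML records, and repeat the request with the resumption token from each response until the repository reports no further records.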
CiteSeer's goal was to improve the dissemination of and access to academic and scientific literature. As a non-profit service that could be freely used by anyone, it has been considered part of the open access movement, which is attempting to change academic and scientific publishing to allow greater access to scientific literature.
The name can be construed in at least two ways. As a pun on 'sightseer', a tourist who looks at the sights, a 'cite seer' would be a researcher who looks at cited papers. Alternatively, a 'seer' is a prophet, so a 'cite seer' would be a prophet of citations.
CiteSeer had not been comprehensively updated since 2005 due to limitations in its architecture design. It had a representative sampling of research documents in computer and information science but was limited in coverage because it only had access to papers that were publicly available, usually at an author's homepage, or that were submitted by an author. To overcome these limitations, a modular and open source architecture for CiteSeer was designed.
The new version and design of CiteSeer can be found at the Next Generation CiteSeer, CiteSeerX, website. CiteSeer-like engines and archives usually harvest documents only from publicly available websites and do not crawl publisher websites. As such, authors whose documents are freely available are more likely to be represented in the index.
Other CiteSeer engines
The CiteSeer model had been extended to cover academic documents in business with SmealSearch and in e-business with eBizSearch. However, these were not maintained by their sponsors. An older version of both could once be found at BizSeer.IST, but it is no longer in service. For enhanced access and performance, similar versions of CiteSeer were supported at universities such as the Massachusetts Institute of Technology, the University of Zürich and the National University of Singapore. However, these versions of CiteSeer proved difficult to maintain and are no longer available.
Versions of CiteSeer have been or are available at the following links:
Other Seer-like search and repository systems have been built for chemistry (ChemXSeer) and for archaeology (ArchSeer). Another was built for robots.txt file search (BotSeer). All of these are built on the open source tool SeerSuite, which uses the open source indexer Lucene.
Next Generation CiteSeer (CiteSeerx)
The Next Generation CiteSeer project, CiteSeerX, funded by the National Science Foundation and initially by Microsoft Research, enhanced CiteSeer both as a search engine and as a digital library and continues in the CiteSeer tradition. As an example, CiteSeer's extension of the notion of "contribution" to acknowledgments, in addition to citations, made it the first automatically generated acknowledgment index.
CiteSeerX is designed differently from CiteSeer, with new algorithms for entity extraction and a modular, expandable, robust, scalable architecture based on the open source tool SeerSuite, which uses Solr and many other Apache projects. As such, CiteSeerX hopes to promote the creation of other Seer-like systems. This design has permitted CiteSeerX to add a new table search feature and a feature for author disambiguation.
CiteSeerX regularly shares its data resources, such as document PDFs, ASCII text, databases and metadata, with other researchers and scholars. The current model of distribution is rsync.
The Next Generation CiteSeer, CiteSeerX, is now available in beta at http://citeseerx.ist.psu.edu, with nearly 2 million documents indexed, and is constantly growing.
See also
- Citation index
- CiteSeerX
- CiteULike
- ChemXSeer
- The Collection of Computer Science Bibliographies
- DBLP (Digital Bibliography & Library Project)
- getCITED
- Google Scholar
- Institute for Scientific Information's Web of Science
- Libra (Academic Search)
- List of academic databases and search engines
- Scirus
- Scopus
- SeerSuite
- SmealSearch
External links
- Official website of CiteSeerx
- ParaCite
- Citebase
- DBLP
- The Collection of Computer Science Bibliographies (includes among other collections also CiteSeer and DBLP)
- CiteSeer search tool
- Digital Libraries and Autonomous Citation Indexing by Steve Lawrence, C. Lee Giles and Kurt Bollacker
- Libra Academic Search, MSRA