List of Web Archiving Initiatives
This page contains a list of Web archiving initiatives worldwide. For easier reading, the information is divided into three tables: web archiving initiatives, archived data and access methods.
Web archiving initiatives
Initiative | Country | Start year | Technologies | Full-time staff | Part-time staff | Comments |
---|---|---|---|---|---|---|
Australia's Web Archive (PANDORA) | Australia | 1996 | PANDORA Digital Archiving System (PANDAS), NLA Trove, HTTrack. | 4 | >4.25 | A collaborative program of 11 agencies that provide an estimated average monthly staffing equivalent to 4 FTE. Outsourced IT support: 0.25 person-month. Whole-domain harvests are conducted by the Internet Archive using Heritrix and the Wayback Machine. |
Our digital island, a Tasmanian Web Archive | Australia | 1996 | HTTrack. Experimentally: Web Curator, Heritrix and Wayback Machine | 1 | ||
PageFreezer | Canada, US, Netherlands, Belgium | 2005 | PageFreezer's Deep Web Crawler, Lucene, Solr | Enterprise-class on-demand service to archive and replay websites, blogs, Ajax, Flash, video, audio & social media for litigation protection, eDiscovery and regulatory compliance with FDA, FINRA, FSA, SEC, SOX, Federal Rules of Evidence and records management laws. | ||
Web@rchive Austria | Austria | 2008 | Archive-access tools and NetarchiveSuite.dk | 2 | ||
DILIMAG (Digital Literature Magazines) | Austria | 2007 | WebCurator | 2 | One technician, one for collecting and metadata. | |
Government of Canada Web Archive (GCWA) | Canada | 2005 | Heritrix, Wayback Machine and NutchWAX. | 2 | ||
Web Information Collection and Preservation - WICP (Chinese Web Archive) | China | 2003 | Heritrix, Wayback Machine and NutchWAX. | |||
Croatian Web Archive (Hrvatski arhiv weba - HAW) | Croatia | 2004 | Lucene | 4 | 3 | 2 librarians full time, 2 librarians part time, 1 IT professional (National and University Library in Zagreb), 1 or 2 IT professionals (from Zagreb University Computing Centre (Srce), our partner) |
WebArchiv (National Library of the Czech Republic) | Czech Republic | 2000 | Nutch, NutchWAX and WERA tools. | 5 | 3.5 FTE library staff + approx. 1.5 FTE technical staff | |
Netarkivet.dk | Denmark | 2005 | NetarchiveSuite.dk and Heritrix. | 18 | 18 people involved (developers, librarians, operations staff, project managers). Altogether 5 FTE. | |
Finnish Web Archive | Finland | 2008 | NutchWAX | 2 | >2 | A group of librarians who select, on a part-time basis, what to archive from the Finnish web space. |
BnF - BnF Web Legal Deposit | France | 2006 | Heritrix, Wayback Machine, NutchWAX and NetarchiveSuite. | 9 | ||
Ina (Institut National de l'Audiovisuel) | France | 2009 | Crawl: PhagoSite, Croket, Heritrix / Access: Dowser | 6 | Staff of 80 documentalists taking part in nominating sites and QA | |
E-diaspora (Télécom ParisTech, FMSH) | France | 2010 | Crawl: PhagoSite | 1 | 30 researchers taking part in nominating sites | |
Internet Memory Foundation (ATN service) | France, Netherlands | 2004 | IM large scale crawler (under development), Heritrix, Hanzo's crawler, IM Access software. Storage of Web content: HBase | 21 | 0 | 11 people for quality crawls (QA, crawl engineering, project management), 9 developers & infrastructure, 1 manager. |
Bibliotheksservice-Zentrum Baden-Württemberg | Germany | 2003 | 7.5 | |||
Web archive of the German Bundestag | Germany | 2005 | ||||
Iceland | Iceland | 2004 | Heritrix, Wayback Machine | |||
Japan Web Archiving Project | Japan | 2004 | Heritrix, Solr. Previously: Wget, Accela BizSearch | 10 | 2 | Launched in April 2004 as a pilot project, WARP (Web Archiving Project) has been in full-scale operation since July 2007. |
National Library of Korea - OASIS (Online Archiving & Searching Internet Sources) | Korea | 2001 | Own system based on Oracle DBMS and a specialized search engine (IRS) that performs data management and search functions. | 3 | 11 | |
Koninklijke Bibliotheek | Netherlands | 2006 | Heritrix, KB e-Depot system | 1 | ~7 | |
National Library of Latvia | Latvia | 2005 | Heritrix | 1 | Currently only storing for preservation; public access is in development (ETA June 2012). The Latvian term for web harvesting is "rasmošana". | |
New Zealand Web Archive | New Zealand | 1999 | Wayback Machine | 3 | >10 | 3-4 people at the National Library (various hours) and 2 people at the Internet Archive during the time of domain harvests. Selective web archiving = 3 full-time staff. Technical services = 1 staff member responds to technical problems when they arise. National Digital Library = 2-3 staff members ad hoc. NDHA (National Digital Heritage Archive) = various staff members respond to web archiving issues as they arise. |
The National Library of Norway | Norway | |||||
Portuguese Web Archive | Portugal | 2007 | Heritrix, Wayback Machine, NutchWAX | 4 | 1 | |
Web archive of Čačak | Serbia | 2009 | HTTrack | 1 | ||
Web Archive Singapore | Singapore | Wayback Machine, Heritrix, NutchWAX, WERA | ||||
Slovenian Web Archive | Slovenia | 2007 | Heritrix, Wayback Machine | 1 | ||
Digital Preservation of .ES domain | Spain | 2006 | Internet Archive | 2 | >2 | Can pool additional resources if necessary from computing controllers and financial department. |
Digital Heritage of Catalonia (PADICAT) | Spain | 2006 | Heritrix, Wayback Machine, WERA, NutchWAX and Web Curator. | 4 | ||
Basque Digital Heritage Archive | Spain | 2008 | Heritrix, Wayback Machine, NutchWAX and Web Curator. | 1 | ||
Sweden (Kulturarw3) | Sweden | 1996 | Heritrix. Own system for storage, maintenance and access | 1.25 | Pause in operation from November 2009 to May 2011. | |
Aleph Archives | Switzerland/USA | 2010 | Distributed crawler, ArchiView access plugin, high-performance search engine, near-real-time indexing, Web monitoring tools | 7 | Enterprise-grade Web archiving platform for online heritage (content, brands) preservation and eDiscovery, aimed at corporations, institutions, and the legal and government sectors seeking to preserve their web content regardless of type (websites, wikis, social media, forums...). | |
Web Archive Switzerland | Switzerland | 2008 | Heritrix, Wayback Machine | 3 | 1 crawl engineer, 1 person for quality assurance, 1 coordinator. The curators, who do the selection, are partner libraries all over Switzerland. | |
NTU Web Archiving System, NTUWAS | Taiwan | 2007 | Lucene | 3 | ||
Web Archive Taiwan | Taiwan | 2007 | ||||
The UK Web Archive | UK | 2004 | Heritrix, Web Curator Tool and Wayback Machine; moving to Solr for searching. | |||
Hanzo Archives | UK | 2006 | Hanzo Crawler, Search, and Access Tools. | Commercial web archiving services and appliances, for government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA. | ||
UK Government Web Archive | UK | 2004 | ATN Service | 4 | 2 | Technical side of our web archiving operation is contracted out to the Internet Memory Foundation so the figures account for QA, curatorial and management staff only |
Internet Archive (provides Archive-it service) | USA | 1996 | Heritrix, Wayback Machine, NutchWAX and other tools developed by the Internet Archive | 12 | ||
Reed Technology Web Archiving Services | USA | 2010 | TrueArchive™ Technology | Reed Technology Web Archiving Services provides support for Litigation Protection, Compliance, e-Discovery and Social Media Management. | ||
Columbia University Libraries Web Resources Collection Program | USA | 2009 | Archive-it service | 3 | >1 | Part-time consultation/supervision from other librarians adding up to about 1 FTE. |
North Carolina State Government Web Site Archives | USA | 2005 | Archive-it service | 3 | ||
Latin American Web Archiving Project | USA | 2005 | Archive-it service | |||
Web Archiving Project for the Pacific Islands | USA | Archive-it service | 4 | |||
Library of Congress Web Archives | USA | 2000 | Heritrix, Wayback Machine, and the DigiBoard, an in-house curatorial/permissions tool | 6 | 80 | The part-time workers spend a few hours per month (on average) selecting content for the collections. |
Harvard University Library: the Web Archive Collection Service (WAX) | USA | 2006 | Own system based on Archive-access and other open-source tools. | >6 | 3 part-time staff on IT support. External curators work within 3 units of unknown size. | |
Web Archiving Service from California Digital Library (WAS service) | USA | 2005 | Heritrix, Wayback Machine, NutchWAX | 4 | >1 | The number of hours that curators devote to the service is very variable. |
University of Michigan Web Archives Project | USA | 2000 | WAS service | 2 | ||
University of Texas at San Antonio Web Archives | USA | 2009 | Archive-It | 3 | The number of hours varies depending on how the crawls are scheduled. | |
qumram | Switzerland | 2010 | Chronos Web Archiving Software Suite | Commercial web archiving software suite. Provides both harvesting and transactional web archiving. Allows integration with any repository (database, file system, electronic archive or records management system). Specializes in regulatory compliance. | ||
SAPERION | Germany | 2011 | SAPERION ECM Web Content Archive | Commercial enterprise content management suite specializing in regulatory compliance. The product provides both harvesting and transactional web archiving based on the integration of qumram's Chronos Web Archiving Software Suite. Web content is just another channel through which content reaches SAPERION; others include scanners, fax, e-mail, mobile devices, office suites, or any other content-creating system such as ERP systems. | ||
Bibliotheca Alexandrina's Internet Archive | Egypt | 2002 | Heritrix, Wayback Machine | 3 | Current crawling interests: Egypt beyond January 25, Arab League ccTLDs | |
Archived data
Archived Contents (millions) | Archive Format | Selective Crawls (Yes/No) |
---|---|---|---|---|---|---|---|---|---|
Australia's Web Archive (PANDORA) | 3100 | 104.5 | ARC/WARC | .AU | Y | .AU crawls (2005-2009): 3 billion files (100 TB). Selective crawls (1996-today): 100 million files (4.5 TB). There are 3 copies of each content. |
Our digital island, a Tasmanian Web Archive | 0.336 | HTTrack | Y | Preserves online contents related to Tasmania. ODI has operated since its inception under the assumption that web sites fall within the definition of ‘Book’ in the Tasmanian Library Act 1984. Thus, no permission to capture from publishers is required. | |||||
Web@rchive Austria | 455 | 6.61 | ARC | .AT | Y | A copy of the data will be stored in a high security data storage unit. | |||
DILIMAG (Digital Literature Magazines) | 0.03 | 0.996 | ARC | The DILIMAG project ran from 2007-03-01 until 2010-12-23 to collect, describe and archive digital German literary magazines. | |||||
Government of Canada Web Archive (GCWA) | 170 | 7 | Y | Selective crawls of the web domain of the Federal Government of Canada (.GC.CA) | |||||
Web Information Collection and Preservation - WICP (Chinese Web Archive) | .GOV.CN | Y | Harvests web pages about events that have great influence on society, the economy and so on, as well as the sites in the 'gov.cn' domain. | ||||||
Croatian Web Archive (Hrvatski arhiv weba - HAW) | 81 | 3.4 | Y | ||||||
WebArchiv (National Library of the Czech Republic) | 526 | 24 | .CZ | Y | Harvesting began in 2001. | ||||
Netarkivet.dk | 6008 | 190 | ARC/WARC | .DK | Y | It uses NetarchiveSuite, which was developed by two Danish libraries, and Heritrix. | |||
Finnish Web Archive | 494 | 23 | .FI, .AX | Y | Also crawls contents hosted on machines physically located in Finland, independently from their domain. | ||||
BnF - BnF Web Legal Deposit | 14000 | 200 | ARC/WARC | .FR | Y | ||||
Ina (Institut National de l'Audiovisuel) | 8400 | 56 | DAFF | N | Y | Digital Archive file format handles file redundancies. The size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be 665 TB. | |||
E-diaspora (Télécom ParisTech, FMSH) | 237 | 2 | DAFF | N | N | Digital Archive file format handles file redundancies. The size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be 10 TB. | |||
Internet Memory Foundation (ATN service) | 180 | WARC | Can be done by partners | Y | Formerly European Archive. Provides the Archive The Net Service (ATN Service). Selective crawls (140 TB), Domain crawls (40 TB), expect to grow to 1 PB in 2011. New datacenter and a new crawler in 2011. | ||||
Bibliotheksservice-Zentrum Baden-Württemberg | 1 | HTTrack | Y | Bibliotheksservice-Zentrum Baden-Württemberg (German) operates the following web archives: (1) Baden-Württembergisches Online-Archiv (BOA), (2) Saardok, (3) Literatur im Netz des Deutschen Literaturarchivs Marbach. | |||||
Web archive of the German Bundestag | Y | German Federal Parliament. Selective. Snapshots of www.bundestag.de and other web presences of the German Bundestag are made at regular intervals or at certain events. These are available in the web archive. | |||||||
Iceland | |||||||||
Japan Web Archiving Project | 319.8 | 38.2 | WARC | - | Y | 15 TB of selective crawls based on permission (2002-2010). Started the web archiving of official institution sites based on the legislation from April of 2010. | |||
National Library of Korea - OASIS (Online Archiving & Searching Internet Resource) | 24 | Y | Requires consent before archiving. Targets 56,401 websites. Web archiving is managed under digital resource management systems. In 2011 the web archiving system will be rebuilt. | ||||||
Koninklijke Bibliotheek | 5 | ARC | Y | ||||||
New Zealand Web Archive | 346 | 13 | .NZ | Y | .NZ crawls: 105 million URLs (4.1 TB) in 2008, 170 million URLs (6.1 TB) in 2010. Selective crawls of 7 599 websites in the National Digital Heritage Archive (2.8 TB), 71 million contents estimated. Legal deposit covers born-digital material (including websites). | ||||
The National Library of Norway | |||||||||
Portuguese Web Archive | 889 | 25 | ARC | .PT, .CV, .AO, .MZ | Y | TLD crawls and integration of external collections since 2007, selective crawls since 2010. | |||
Web archive of Čačak | 0.255 | 0.013 | HTTrack | Y | Selective crawls of 130 sites related to the city of Čačak. Collaboration with the WebArchiv team from the National Library of the Czech Republic. | ||||
Web Archive Singapore | .SG | Y | Selective crawls of 1000 Singapore-related sites, with the written consent of the owners. Whole .SG domain archiving. | ||||||
Slovenian Web Archive | 1.5 | WARC | Selective crawls | ||||||
Digital Preservation of .ES domain | 855 | 30 | ARC | .ES | Collaboration with Internet Archive. Domain crawl of .ES, harvested quarterly. Not launched publicly yet. | ||||
Digital Heritage of Catalonia (PADICAT) | 200 | 7.7 | ARC | .CAT | Y | In accordance with the general trend, the archive model is a hybrid system consisting of: mass compilation of open-access digital resources published on the Internet (.cat); systematic archiving of the web site output of Catalan organizations; and fostering of lines of research through themed integration of the digital resources pertaining to specific events in Catalan public life (elections, museums, etc.). | |||
Basque Digital Heritage Archive | 21 | 0.8 | ARC | Y | |||||
Sweden (Kulturarw3) | 1710 | 71.3 | Multipart MIME | .se, Swedish .nu and geolocation for other TLDs | Y | Bulk crawls approximately twice a year. Selective crawls of about 140 newspapers every day. | |||
Aleph Archives | 23 | WARC, WARC2, ARC and HTTrack to WARC migration tools | Y | Enterprise-grade Web archiving platform for online heritage (content, brands) preservation and eDiscovery, aimed at corporations, institutions, and the legal and government sectors seeking to preserve their web contents regardless of type (websites, wikis, social media, forums...). | |||||
Web Archive Switzerland | 0.1 | ARC | Y | ||||||
NTU Web Archiving System, NTUWAS | 200 | 14 | Y | ||||||
Web Archive Taiwan | |||||||||
The UK Web Archive | 6.9 | ARC | Y | Selective crawls with previous permission. Expect to run wholesale UK domain-scale crawls once Legal Deposit legislation is implemented in April 2011. The UKWA is a spin-off from the UK Web Archiving Consortium that ended in 2007. | |||||
Hanzo Archives | 7 | WARC | Y | Commercial web archiving services and appliances, for government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA. | |||||
UK Government Web Archive | 32 | ARC | The UKGWA is a spin-off from the UK Web Archiving Consortium that ended in 2007. | ||||||
Internet Archive (provides Archive-it service) | 150000 | 5500 | World-wide | Y | Provides the Archive-it service and leads the Archive-access project (Internet Archive ARC access tools). Collection is mirrored at the Bibliotheca Alexandrina in Egypt. | ||||
Reed Technology Web Archiving Services | |||||||||
Columbia University Libraries Web Resources Collection Program | 23.1 | 1.8 | ARC/WARC | Y | Selective crawls with permission or notification; primarily thematic collections. | ||||
North Carolina State Government Web Site Archives | 51.5 | 3.8 | WARC | Y | |||||
Latin American Web Archiving Project | Y | ||||||||
Web Archiving Project for the Pacific Islands | 5.5 | ARC/WARC | Y | Includes sites of 18 countries. | |||||
Library of Congress Web Archives | 5 | 230 | ARC/WARC | Y | Formerly MINERVA. Selective crawls with notification and permission; primarily event and thematic collections. | ||||
Harvard University Library: the Web Archive Collection Service (WAX) | 19 | 0.661 | ARC | Y | Selective crawls with no previous authorization. | ||||
Web Archiving Service from California Digital Library (WAS service) | 216 | 25.2 | ARC/WARC | Can be done by partners | Y | Provides Web Archiving Service (WAS) to partners world-wide. Was developed at the California Digital Library. | |||
University of Michigan Web Archives Project | 0.65 | ARC/WARC | Y | WAS service since 2010. | |||||
University of Texas at San Antonio Web Archives | 26 | 1.135 | ARC ARC (file format) ARC is a lossless data compression and archival format by System Enhancement Associates . It was very popular during the early days of networked dial-up BBS. The file format and the program were both called ARC... /WARC Web ARChive The Web ARChive archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format [ARC_IA] that has traditionally been used to store "web crawls" as... |
Y | University administration, faculty and student sites, as well as selective captures on San Antonio and South Texas subject areas, including San Antonio organizations; San Antonio Online Journals and Blogs; Tejano and Conjunto music; Gay, Lesbian, Bisexual, Transgender and Queer Related Web sites in Texas, San Antonio and the Rio Grande Valley; Immigration/Borderlands; Mexican Cooking Blogs; San Antonio Restaurants; Renewable Energy in Texas; Rio Grande Valley Organizations; and Rio Grande Watershed and Texas Water Issues. |
Access methods
URL history (Yes/No) | Full-text search (Yes/No) |
---|---|---|---|---|---|---|
Australia's Web Archive | N | Y | Y | Selected sites are publicly available through a directory structure. Domain harvests are not. The PANDORA Archive is indexed and searchable through the NLA's single search service, Trove. The Australian Domain Harvests are full-text indexed but are not currently publicly available. |
Our digital island, a Tasmanian Web Archive | Y | Y | N | Presents thumbnails generated through Html To Image supplemented in HTTrack. Information is organized in a directory: A-Z Subject listing, A-Z Title listing. |
||
Web@rchive Austria | Y | N | N | Only accessible on special terminals at the Austrian National Library. Presents thumbnail previews of archived pages and supports keyword search within URLs. |
||
DILIMAG (Digital Literature Magazines) | Y | Y | N | Metadata are publicly available; archived versions are offered with free or restricted access, depending on the agreement with the rights holders. Full-text search was not implemented due to lack of resources. | ||
Government of Canada Web Archive (GCWA) | Y | Y | Y | Technical details available. | ||
Web Information Collection and Preservation - WICP (Chinese Web Archive) | Y | Archive content is only available on the intranet of the National Library of China. Some collections are publicly available, with metadata search and browsing by collection. | ||||
Croatian Web Archive (Hrvatski arhiv weba - HAW) | Y | Y | Y | |||
WebArchiv (National Library of the Czech Republic) |
Y | Y | Due to copyright restrictions, only a limited number of archived websites for which agreements were signed with the publishers is available online. For other resources you can find out whether a given website was archived and the number of harvested versions. Unlimited access to all resources in WebArchiv is available from public terminals in the National Library. | |||
Netarkivet.dk | Y | N | N | Online access is granted only to researchers, using a proxy solution that accesses an archive index. User access through the Wayback Machine is planned. A framework has been established for running batch jobs, with the possibility of data mining. |
||
Finnish Web Archive | Y | N | 30% of material. | URL search is available, but contents can only be accessed on-site. Full-text search covers 30% of the material. | ||
BnF - BnF Web Legal Deposit | Y | N | 15% of the collection | Accessible to authorized users of the BnF, through the reading rooms of the Research Library located in Paris and Avignon. The Wayback Machine interface was translated into French. Full-text search is available only for a relatively small portion of the collection (15% of 200 TB), indexed by the Internet Archive. No full-text search is currently implemented in the workflow. Builds special collection galleries based on a selection from the archive on a given topic. |
||
Ina (Institut National de l'Audiovisuel) |
Y | Y | Y | Full-text indexing is based on Lucene. To accommodate results from frequent crawls (up to every 2 hours for home pages), similar versions of pages are clustered. | ||
E-diaspora (Télécom ParisTech, FMSH) | Y | N | N | 1381 sites are currently crawled to build an archive on migrants' usage of the web. Social studies researchers have launched a long-running project based on this archive (http://ediasporas.ticmigrations.fr/). Ina is handling crawls and storage. | ||
Internet Memory Foundation (ATN service) | Y | Y | Y | Provides access and search services according to partners policy. | ||
Bibliotheksservice-Zentrum Baden-Württemberg | Y | Y | Y | Search available (under development). | ||
Web archive of the German Bundestag | Y | N | N | The web archive itself consists of snapshots of www.bundestag.de and other websites. Navigation is possible by clicking on the years. | ||
Iceland | ||||||
Japan Web Archiving Project | Y | Y | Y | Public access to sites after permission of the site owners. Open access to important publications such as white papers. | ||
National Library of Korea - OASIS (Online Archiving & Searching Internet Resource) | Y | Y | Y | 100% of the archive is indexed. Enables search by topic classification (e.g. Religion, Science, Arts). Search available. | ||
Koninklijke Bibliotheek | The web archive will become available online during the first half of the year 2010. | |||||
New Zealand Web Archive | Y | Y | N | Domain harvests are available to selected staff only, using Wayback, and are limited to URL searches. For selective harvests, each website is described in the catalogue (providing subject, author, title and URL searches) and can be viewed by the public via the Internet by clicking on the link to the archived copy. The websites themselves, however, are not indexed. | ||
The National Library of Norway | N | Y | Sites are integrated in the Catalog. Left bar enables facet navigation with drill-down. | |||
Portuguese Web Archive |
Y | Y | Y | 20% of the archive is indexed and an experimental full-text service is available. Archived data can be mined through a Hadoop platform. | ||
Web archive of Čačak | N | N | N | Plans to develop a search engine in the future. One drawback of HTTrack is that it renames files during archiving, so the original structure of the website is lost, as well as the original file names. |
||
Web Archive Singapore | ||||||
Slovenian Web Archive | Y | N | N | The archive is not public yet. Plans to implement full-text search. | ||
Digital Preservation of .ES domain | Y (Future) | Y (Future) | Plan to grant access through computers available at a given hall. | |||
Digital Heritage of Catalonia PADICAT |
Y | Y | Y | Full open access. | ||
Basque Digital Heritage Archive | Y | Y | Y | |||
Sweden (Kulturarw3) | Y | N | N | Public access through dedicated machines in the library building. | ||
Aleph Archives | Y | Y | Y | The full-text search engine supports automatic metadata extraction and native results deduplication. Also included: antivirus checker (~250 million pages/day), archive statistics, text summarizer, archive exports (PDF, PNG, TIFF), etc. | ||
Web Archive Switzerland | Y (in 2011) | Y (in 2011) | The archived versions of the sites are not yet accessible. Web Archive Switzerland will be open to the public by spring 2011 - only access within the National Library and the partner libraries will be possible. The sites are being catalogued and the records are integrated in our library catalog Helveticat. | |||
NTU Web Archiving System, NTUWAS | Y | Y | Y | Presents page thumbnails, archived pages mapped to geographical locations. | ||
Web Archive Taiwan | Y | Y | Y | |||
PageFreezer | Y | Y | Y | Enterprise Class On Demand service to archive and replay websites, blogs, Ajax, Flash, video, audio & social media for litigation protection, eDiscovery and regulatory compliance with FDA, FINRA, FSA, SEC, SOX, Federal Rules of Evidence and records management laws. Used by government agencies and public listed corporations in Pharmaceutical, Food, Finance, Healthcare and Retail industry. | ||
The UK Web Archive | Y | Y | N | |||
Hanzo Archives | Y | Y | Y | Commercial web archiving services and appliances. Access includes full-text search, annotations, redaction, URL/History, archive policy and temporal browsing, and configurable metadata schema for advanced e-discovery applications. Used in government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA. | ||
UK Government Web Archive | Y | Y | Y | Full text search is operational on the UK Government Web Archive. Users can browse the collection using a full A-Z list of all sites and a set of categories. | ||
Internet Archive (provides Archive-it service) |
Y | Y | Y | URL history is available for all archived data. Metadata and full-text search are available only for selected crawls. Until 2002 it had a mining platform for research, composed of the Alexa Shell Perl Tools av_tools and the p2 platform for parallel processing. It was replaced by a simpler, direct access method that enables automatic access to files but provides no platform for processing. |
||
Reed Technology Web Archiving Services | ||||||
Columbia University Libraries Web Resources Collection Program | Y | Y | Y | Accessible through Archive-it service. | ||
North Carolina State Government Web Site Archives | Y | Y | Y | Accessible through Archive-it service. | ||
Latin American Web Archiving Project | Y | Y | Y | Content can be accessed via full-text search, or by browsing by country or by specialized sample collection. | ||
Web Archiving Project for the Pacific Islands | Y | Y | Y | Supported by Archive-it service. | ||
Library of Congress Web Archives | Y | Y | N | Access provided via http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html. Records in MODS (Metadata Object Description Schema) format. | ||
Harvard University Library: the Web Archive Collection Service (WAX) | Y | Y | Y | |||
Web Archiving Service from California Digital Library (WAS service) | Y | Y | Y | Access for private study, scholarship and research. Most archives built with WAS have not yet been published because it is up to the partners to decide whether they want to provide access. There are 16 partners using the service; they have created over 80 web archives, of which only 30 are publicly accessible. NutchWAX performance did not permit full archive search. The upcoming transition to SOLR will permit both full-archive and collection-specific full-text search. | ||
University of Michigan Web Archives Project | Y | Y | Y | Powered by the WAS from the California Digital Library. Access is public but usage is restricted for private study, scholarship and research. | ||
University of Texas at San Antonio Web Archives | Y | Y | Y | Accessible through Archive-it service and the Texas Archival Repositories Online database |
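
Several entries in the access methods table mention handling repeated captures of the same page, for example Ina's clustering of similar versions produced by crawls as frequent as every 2 hours. The table does not describe how that clustering works, so the following is only a minimal sketch of the simplest related idea: collapsing consecutive captures whose content is byte-identical. The function name `collapse_versions` and the `(timestamp, content)` tuple layout are hypothetical, chosen for illustration, and exact hashing is an assumption that would miss near-duplicates.

```python
import hashlib
from itertools import groupby


def collapse_versions(captures):
    """Collapse runs of consecutive, byte-identical captures of one URL.

    `captures` is a list of (timestamp, content_bytes) tuples, already
    sorted by timestamp. Returns one (first_ts, last_ts, count) tuple
    per run of identical content.
    """
    def digest(capture):
        # Hash the payload so that equal content yields an equal key.
        return hashlib.sha1(capture[1]).hexdigest()

    runs = []
    # groupby only merges *adjacent* items with the same key, so a page
    # that changes and later reverts produces separate runs.
    for _, group in groupby(captures, key=digest):
        group = list(group)
        runs.append((group[0][0], group[-1][0], len(group)))
    return runs


captures = [
    ("20110101000000", b"<html>home v1</html>"),
    ("20110101020000", b"<html>home v1</html>"),
    ("20110101040000", b"<html>home v2</html>"),
    ("20110101060000", b"<html>home v1</html>"),
]
runs = collapse_versions(captures)
```

Grouping only adjacent captures keeps the timeline intact, which matters for URL-history views: a reverted page still appears as a distinct later run rather than being merged with its earlier copy.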