List of Web Archiving Initiatives
This page contains a list of Web archiving initiatives worldwide. For easier reading, the information is divided into three tables: web archiving initiatives, archived data and access methods.

Web archiving initiatives

Initiative Country Year Technologies Full-time Part-time Comments
Australia's Web Archive (Pandora Archive) Australia 1996 PANDORA Digital Archiving System (PANDAS), NLA Trove, HTTrack. 4 >4.25 It is a collaborative program of 11 agencies that provide an estimated average monthly staffing equivalent to 4 FTE. IT outsourced support: 0.25 person-month. Whole Domain Harvests are conducted by the Internet Archive using Heritrix, Wayback Machine.
Our digital island, a Tasmanian Web Archive Australia 1996 HTTrack, Experimentally: Web Curator, Heritrix and Wayback Machine 1
PageFreezer Canada, US, Netherlands, Belgium 2005 PageFreezer's Deep Web Crawler, Lucene, Solr Enterprise Class On Demand service to archive and replay websites, blogs, Ajax, Flash, video, audio & social media for litigation protection, eDiscovery and regulatory compliance with FDA, FINRA, FSA, SEC, SOX, Federal Rules of Evidence and records management laws.
Web@rchive Austria Austria 2008 Archive-access tools and NetarchiveSuite.dk 2
DILIMAG (Digital Literature Magazines) Austria 2007 WebCurator 2 One technician, one for collecting and metadata.
Government of Canada Web Archive (GCWA) Canada 2005 Heritrix, Wayback Machine and NutchWAX. 2
Web Information Collection and Preservation - WICP (Chinese Web Archive) China 2003 Heritrix, Wayback Machine and NutchWAX.
Croatian Web Archive (Hrvatski arhiv weba - HAW) Croatia 2004 Lucene 4 3 2 librarians full time, 2 librarians part time, 1 IT professional (National and University Library in Zagreb), 1 or 2 IT professionals (from the Zagreb University Computing Centre (Srce), our partner)
WebArchiv (National Library of the Czech Republic) Czech Republic 2000 Nutch, NutchWAX and WERA tools. 5 3.5 FTE library staff + approx. 1.5 FTE technical staff
Netarkivet.dk Denmark 2005 NetarchiveSuite.dk and Heritrix. 18 18 people involved (developers, librarians, operations staff, project managers). Altogether 5 FTE.
Finnish Web Archive Finland 2008 NutchWAX 2 >2 A group of librarians who select, on a part-time basis, what to archive from the Finnish web space.
BnF - BnF Web Legal Deposit France 2006 Heritrix, Wayback Machine and NutchWAX. NetarchiveSuite. 9
Ina (Institut National de l'Audiovisuel) France 2009 Crawl: PhagoSite, Croket, Heritrix / Access: Dowser 6 Staff of 80 documentalists taking part in nominating sites and QA
E-diaspora (Télécom ParisTech, FMSH) France 2010 Crawl: PhagoSite 1 30 researchers taking part in nominating sites
Internet Memory Foundation (ATN service) France, Netherlands 2004 IM large scale crawler (under development), Heritrix, Hanzo's crawler, IM Access software. Storage of Web Content: HBase 21 0 11 people for quality crawls (QA, crawl engineering, project management), 9 developers & infrastructure, 1 manager.
Bibliotheksservice-Zentrum Baden-Württemberg Germany 2003 7.5
Web archive of the German Bundestag Germany 2005
Iceland Iceland 2004 Heritrix, Wayback Machine
Japan Web Archiving Project Japan 2004 Heritrix, Solr. Previously: Wget, Accela BizSearch 10 2 Launched in April 2004 as a pilot project, WARP (Web Archiving Project) has been in full-scale operation since July 2007.
National Library of Korea - OASIS (Online Archiving & Searching Internet Sources) Korea 2001 Own system based on Oracle DBMS and a specialized search engine (IRS) that performs data management and search functions. 3 11
Koninklijke Bibliotheek Netherlands 2006 Heritrix, KB e-Depot system 1 ~7
National Library of Latvia Latvia 2005 Heritrix 1 Currently only storing for preservation; public access is in development (ETA June 2012). The Latvian term for web harvesting is "rasmošana".
New Zealand Web Archive New Zealand 1999 Wayback Machine 3 >10 3-4 people at the National Library (various hours) and 2 people at the Internet Archive during the time of domain harvests.
Selective web archiving = 3 full time staff.
Technical services = 1 staff member responds to technical problems when they arise.
National Digital library = 2-3 staff members ad hoc.
NDHA (National Digital Heritage Archive) = various staff members respond to web archiving issues as they arise.
The National Library of Norway Norway
Portuguese Web Archive Portugal 2007 Heritrix, Wayback Machine, NutchWAX 4 1
Web archive of Čačak Serbia 2009 HTTrack 1
Web Archive Singapore Singapore Wayback Machine, Heritrix, NutchWAX, WERA
Slovenian Web Archive Slovenia 2007 Heritrix, Wayback Machine 1
Digital Preservation of .ES domain Spain 2006 Internet Archive 2 >2 Can pool additional resources if necessary from computing controllers and financial department.
Digital Heritage of Catalonia (PADICAT) Spain 2006 Heritrix, Wayback Machine, WERA, NutchWAX and Web Curator. 4
Basque Digital Heritage Archive Spain 2008 Heritrix, Wayback Machine, NutchWAX and Web Curator. 1
Sweden (Kulturarw3) Sweden 1996 Heritrix. Own system for storage, maintenance and access 1.25 Pause in operation November 2009 - May 2011.
Aleph Archives Switzerland/USA 2010 Distributed crawler, ArchiView access plugin, High performance search engine, Near real time indexing, Web Monitoring tools 7 Enterprise-grade Web archiving platform for online heritage (content, brands) preservation and eDiscovery, aimed at corporations, institutions, and legal and government organizations seeking to preserve their web content regardless of type (websites, wikis, social media, forums...).
Web Archive Switzerland Switzerland 2008 Heritrix, Wayback Machine 3 1 crawl engineer, 1 person for quality assurance, 1 coordinator. The curators, who do the selection, are partner libraries all over Switzerland.
NTU Web Archiving System, NTUWAS Taiwan 2007 Lucene 3
Web Archive Taiwan Taiwan 2007
The UK Web Archive UK 2004 Heritrix, Web Curator Tool, Wayback Machine and moving to Solr for searching.
Hanzo Archives UK 2006 Hanzo Crawler, Search, and Access Tools. Commercial web archiving services and appliances, for government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA.
UK Government Web Archive UK 2004 ATN Service 4 2 Technical side of our web archiving operation is contracted out to the Internet Memory Foundation so the figures account for QA, curatorial and management staff only
Internet Archive (provides Archive-it service) USA 1996 Heritrix, Wayback Machine, NutchWAX and other tools developed by the Internet Archive 12
Reed Technology Web Archiving Services USA 2010 TrueArchive™ Technology Reed Technology Web Archiving Services provides support for Litigation Protection, Compliance, e-Discovery and Social Media Management.
Columbia University Libraries Web Resources Collection Program USA 2009 Archive-it service 3 >1 Part-time consultation/supervision from other librarians adding up to about 1 FTE.
North Carolina State Government Web Site Archives USA 2005 Archive-it service 3
Latin American Web Archiving Project USA 2005 Archive-it service
Web Archiving Project for the Pacific Islands USA Archive-it service 4
Library of Congress Web Archives USA 2000 Heritrix, Wayback Machine, and the DigiBoard, an in-house curatorial/permissions tool 6 80 The part-time workers spend a few hours per month (on average) selecting content for the collections.
Harvard University Library: the Web Archive Collection Service (WAX) USA 2006 Own system based on Archive-access and other open-source tools. >6 3 part-time staff on IT support. External curators work within 3 units, whose sizes are not known.
Web Archiving Service from California Digital Library (WAS service) USA 2005 Heritrix, Wayback Machine, NutchWAX 4 >1 The number of hours that curators devote to the service varies widely.
University of Michigan Web Archives Project USA 2000 WAS service 2
University of Texas at San Antonio Web Archives USA 2009 Archive-It 3 The number of hours varies depending on how the crawls are scheduled.
qumram Switzerland 2010 Chronos Web Archiving Software Suite Commercial web archiving software suite. Provides both harvesting and transactional web archiving. Allows integration with any repository (database, file system, electronic archive or records management system). Specializes in regulatory compliance.
SAPERION Germany 2011 SAPERION ECM Web Content Archive Commercial enterprise content management suite that specializes in regulatory compliance. The product provides both harvesting and transactional web archiving based on the integration of qumram's Chronos Web Archiving Software Suite. Web content is just another channel through which content reaches SAPERION. Others include scanners, fax, e-mail, mobile devices, office suites or any other system that creates content, such as ERP systems.
Bibliotheca Alexandrina's Internet Archive Egypt 2002 Heritrix, Wayback Machine 3 Current crawling interests: Egypt beyond January 25, Arab League ccTLDs

Archived data

Archived Contents (millions) Archive Format Selective Crawls (Yes/No)
Australia's Web Archive 3100 104.5 ARC/WARC .AU Y .AU crawls (2005-2009): 3 billion files (100 TB). Selective crawls (1996-today): 100 million files (4.5 TB). Three copies of each item are kept.
Our digital island, a Tasmanian Web Archive 0.336 HTTrack Y Preserves online contents related to Tasmania. ODI has operated since its inception under the assumption that web sites fall within the definition of 'Book' in the Tasmanian Library Act 1984. Thus, no permission to capture from publishers is required.
Web@rchive Austria 455 6.61 ARC .AT Y A copy of the data will be stored in a high security data storage unit.
DILIMAG (Digital Literature Magazines) 0.03 0.996 ARC Project ran from 2007-03-01 until 2010-12-23. The DILIMAG project collected, described and archived digital German literary magazines.
Government of Canada Web Archive (GCWA) 170 7 Y Selective crawls of the web domain of the Federal Government of Canada (.GC.CA)
Web Information Collection and Preservation - WICP (Chinese Web Archive) .GOV.CN Y Harvests web pages about events that have great influence on society, the economy and so on, as well as the sites in the 'gov.cn' domain.
Croatian Web Archive (Hrvatski arhiv weba - HAW) 81 3.4 Y
WebArchiv (National Library of the Czech Republic) 526 24 .CZ Y Harvesting began in 2001.
Netarkivet.dk 6008 190 ARC/WARC .DK Y It uses NetarchiveSuite, which was developed by two Danish libraries, and Heritrix.
Finnish Web Archive 494 23 .FI, .AX Y Also crawls contents hosted on machines physically located in Finland, regardless of their domain.
BnF - BnF Web Legal Deposit 14000 200 ARC/WARC .FR Y
Ina (Institut National de l'Audiovisuel) 8400 56 DAFF N Y The Digital Archive File Format (DAFF) handles file redundancies. The size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be 665 TB.
E-diaspora (Télécom ParisTech, FMSH) 237 2 DAFF N N The Digital Archive File Format (DAFF) handles file redundancies. The size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be 10 TB.
Internet Memory Foundation (ATN service) 180 WARC Can be done by partners Y Formerly European Archive. Provides the Archive The Net Service (ATN Service). Selective crawls (140 TB), domain crawls (40 TB); expected to grow to 1 PB in 2011. New datacenter and a new crawler in 2011.
Bibliotheksservice-Zentrum Baden-Württemberg 1 HTTrack Y The Bibliotheksservice-Zentrum Baden-Württemberg operates the following web archives:
1- Baden-Württembergisches Online-Archiv (BOA)
2- Saardok
3- Literatur im Netz des Deutschen Literaturarchivs Marbach.
Web archive of the German Bundestag Y German Federal Parliament. Selective. Snapshots of www.bundestag.de and other web presences of the German Bundestag are made at regular intervals or on certain events. These snapshots are available in the web archive.
Iceland
Japan Web Archiving Project 319.8 38.2 WARC - Y 15 TB of selective crawls based on permission (2002-2010). Started the web archiving of official institution sites based on the legislation from April 2010.
National Library of Korea - OASIS (Online Archiving & Searching Internet Resource) 24 Y Requires consent before archiving. Targets 56,401 websites. Web archiving is managed under digital resource management systems. In 2011 the web archiving system will be rebuilt.
Koninklijke Bibliotheek 5 ARC Y
New Zealand Web Archive 346 13 .NZ Y .NZ crawls: 105 million URLs (4.1 TB) in 2008, 170 million URLs (6.1 TB) in 2010. Selective crawls of 7 599 websites in the National Digital Heritage Archive (2.8 TB), 71 million contents estimated. Legal deposit covers born-digital material (including websites).
The National Library of Norway
Portuguese Web Archive 889 25 ARC .PT, .CV, .AO, .MZ Y TLD crawls and integration of external collections since 2007, selective crawls since 2010.
Web archive of Čačak 0.255 0.013 HTTrack Y Selective crawls of 130 sites related to the city of Čačak. Collaboration with the WebArchiv team from the National Library of the Czech Republic.
Web Archive Singapore .SG Y Selective crawls of 1000 Singapore-related sites, with the written consent of the owners. Whole .SG domain archiving.
Slovenian Web Archive 1.5 WARC Selective crawls
Digital Preservation of .ES domain 855 30 ARC .ES Collaboration with Internet Archive. Domain crawl of .ES, harvested quarterly. Not launched publicly yet.
Digital Heritage of Catalonia (PADICAT) 200 7.7 ARC .CAT Y In accordance with the general trend, the archive model is a hybrid system consisting of: mass compilation of open-access digital resources published on the Internet (.cat); systematic archiving of the web site output of Catalan organizations; and fostering of lines of research through themed integration of the digital resources pertaining to specific events in Catalan public life (elections, museums, etc.).
Basque Digital Heritage Archive 21 0.8 ARC Y
Sweden (Kulturarw3) 1710 71.3 Multipart MIME .se, Swedish .nu and geolocation for other TLDs Y Bulk crawls approximately twice a year. Selective crawls of about 140 newspapers every day.
Aleph Archives 23 WARC, WARC2, ARC and HTTrack to WARC migration tools Y Enterprise-grade Web archiving platform for online heritage (content, brands) preservation and eDiscovery, aimed at corporations, institutions, and the legal and government sectors seeking to preserve their web content regardless of its type (websites, wikis, social media, forums...).
Web Archive Switzerland 0.1 ARC Y
NTU Web Archiving System, NTUWAS 200 14 Y
Web Archive Taiwan
The UK Web Archive 6.9 ARC Y Selective crawls with prior permission. Expects to run UK domain-scale crawls once Legal Deposit legislation is implemented in April 2011. The UKWA is a spin-off from the UK Web Archiving Consortium, which ended in 2007.
Hanzo Archives 7 WARC Y Commercial web archiving services and appliances, for government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA.
UK Government Web Archive 32 ARC The UKGWA is a spin-off from the UK Web Archiving Consortium, which ended in 2007.
Internet Archive (provides Archive-it service) 150000 5500 World-wide Y Provides the Archive-it service and leads the Archive-access project (Internet Archive ARC access tools). The collection is mirrored at the Bibliotheca Alexandrina in Egypt.
Reed Technology Web Archiving Services
Columbia University Libraries Web Resources Collection Program 23.1 1.8 ARC/WARC Y Selective crawls with permission or notification; primarily thematic collections.
North Carolina State Government Web Site Archives 51.5 3.8 WARC Y
Latin American Web Archiving Project Y
Web Archiving Project for the Pacific Islands 5.5 ARC/WARC Y Includes sites of 18 countries.
Library of Congress Web Archives 5 230 ARC/WARC Y Formerly MINERVA. Selective crawls with notification and permission; primarily event and thematic collections.
Harvard University Library: the Web Archive Collection Service (WAX) 19 0.661 ARC Y Selective crawls with no previous authorization.
Web Archiving Service from California Digital Library (WAS service) 216 25.2 ARC/WARC Can be done by partners Y Provides the Web Archiving Service (WAS) to partners world-wide. The service was developed at the California Digital Library.
University of Michigan Web Archives Project 0.65 ARC/WARC Y WAS service since 2010.
University of Texas at San Antonio Web Archives 26 1.135 ARC/WARC Y University administration, faculty and student sites, as well as selective captures on San Antonio and South Texas subject areas, including San Antonio organizations; San Antonio online journals and blogs; Tejano and Conjunto music; Gay, Lesbian, Bisexual, Transgender and Queer related web sites in Texas, San Antonio and the Rio Grande Valley; Immigration/Borderlands; Mexican cooking blogs; San Antonio restaurants; renewable energy in Texas; Rio Grande Valley organizations; and Rio Grande Watershed and Texas water issues.
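
Several comments in the table above quote per-crawl figures that can be cross-checked against the summary columns. The following Python sketch is illustrative only: it assumes that the second numeric column of the table is the archive size in terabytes (the flattened header does not state this), and the helper names `total` and `dedup_ratio` are hypothetical.

```python
# Figures copied from the comments column of the table above.
# Each entry is (archived contents in millions, size in TB).
# Assumption: the second numeric column of the table is a size in TB.
PANDORA_CRAWLS = {
    "domain (.AU, 2005-2009)": (3000, 100.0),
    "selective (1996-today)": (100, 4.5),
}

# Internet Memory Foundation: only sizes (TB) are quoted per crawl type.
IMF_CRAWLS_TB = {"selective": 140, "domain": 40}


def total(crawls):
    """Sum contents (millions) and size (TB) over all crawl types."""
    contents = sum(c for c, _ in crawls.values())
    size_tb = sum(s for _, s in crawls.values())
    return contents, size_tb


def dedup_ratio(arc_equivalent_tb, daff_tb):
    """Ratio of the compressed-ARC equivalent size to the DAFF size on disk."""
    return arc_equivalent_tb / daff_tb


if __name__ == "__main__":
    print(total(PANDORA_CRAWLS))          # contents (millions), size (TB)
    print(sum(IMF_CRAWLS_TB.values()))    # total size (TB)
    print(round(dedup_ratio(665, 56), 1)) # Ina
    print(dedup_ratio(10, 2))             # E-diaspora
```

On these figures, the sums reproduce the 3100 and 104.5 listed for Australia's Web Archive and the 180 listed for the Internet Memory Foundation, while the ratios suggest that DAFF holds the Ina collection in roughly one twelfth, and the E-diaspora collection in one fifth, of the space that compressed ARC would need.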

Access methods

URL history (Yes/No) Full-text search (Yes/No)
Australia's Web Archive N Y Y Selected sites are publicly available through a directory structure. Domain harvests are not. The PANDORA Archive is indexed and searchable through the NLA's single search service Trove. The Australian Domain Harvests are full-text indexed but are not currently publicly available.
Our digital island, a Tasmanian Web Archive Y Y N Presents thumbnails generated through Html To Image supplemented in HTTrack. Information is organized in a directory: A-Z subject listing and A-Z title listing.
Web@rchive Austria Y N N Only accessible on special terminals at the Austrian National Library. Presents thumbnail previews of archived pages and supports keyword search within URLs.
DILIMAG (Digital Literature Magazines) Y Y N Metadata are publicly available; access to the archived versions is free or restricted, depending on the agreement with the rights holders. Full-text search was not implemented due to lack of resources.
Government of Canada Web Archive (GCWA) Y Y Y Technical details available.
Web Information Collection and Preservation - WICP (Chinese Web Archive) Y Archive content is only available on the intranet of the National Library of China. Some collections are publicly available, with metadata search and browsing by collection.
Croatian Web Archive (Hrvatski arhiv weba - HAW) Y Y Y
WebArchiv (National Library of the Czech Republic)
Y Y Due to copyright restrictions, only a limited number of archived websites for which agreements were signed with the publishers is available online. For other resources you can find out whether a given website was archived and the number of harvested versions. Unlimited access to all resources in WebArchiv is available from public terminals in the National Library.
Netarkivet.dk Y N N Online access is granted only to researchers, using a proxy solution that accesses an archive index. It plans to set up user access through the Wayback Machine soon. It has established a framework for running batch jobs with the possibility of data mining.
Finnish Web Archive Y N 30% of material. URL search is available, but contents can only be accessed on site. Full-text search is available for 30% of the material.
BnF - BnF Web Legal Deposit Y N 15% of the collection Accessible to authorized users of the BnF, through the reading rooms of the Research Library located in Paris and Avignon. The Wayback Machine interface was translated into French. Full-text search is available only for a relatively small portion of the collection (15% of 200 TB) indexed by the Internet Archive. No full-text search is currently implemented in the workflow. Builds special collection galleries based on a selection from the archive on a given topic.
Ina (Institut National de l'Audiovisuel) Y Y Y Full-text indexing is based on Lucene. To accommodate results from frequent crawls (up to every 2 hours for home pages), similar versions of pages are clustered.
E-diaspora (Télécom ParisTech, FMSH) Y N N 1381 sites are currently crawled to build an archive on migrants' usage of the web. Social studies researchers have launched a long-term project based on this archive (http://ediasporas.ticmigrations.fr/). Ina is handling crawls and storage.
Internet Memory Foundation (ATN service) Y Y Y Provides access and search services according to partners' policies.
Bibliotheksservice-Zentrum Baden-Württemberg Y Y Y Search available (in development).
Web archive of the German Bundestag Y N N The web archive itself consists of snapshots of www.bundestag.de and other websites. Navigation is possible by clicking on the years.
Iceland
Japan Web Archiving Project Y Y Y Public access to sites is provided with the permission of the site owners. Open access to important publications such as white papers.
National Library of Korea - OASIS (Online Archiving & Searching Internet Resource) Y Y Y 100% of the archive is indexed. Enables search by topic classification (e.g. Religion, Science, Arts). Search available.
Koninklijke Bibliotheek The web archive will become available online during the first half of the year 2010.
New Zealand Web Archive Y Y N Domain harvests are available to selected staff only, using Wayback, and are limited to URL searches. For selective harvests, each website is described in the catalogue (providing subject, author, title and URL searches) and can be viewed by the public via the Internet by clicking on the link to the archived copy. The websites themselves, however, are not indexed.
The National Library of Norway N Y Sites are integrated in the Catalog. Left bar enables facet navigation with drill-down.
Portuguese Web Archive Y Y Y 20% of the archive is indexed and an experimental full-text service is available. Archived data can be mined through a Hadoop platform.
Web archive of Čačak N N N Plans to develop a search engine in the future. One drawback of HTTrack is that it renames files during archiving, so the original structure of the website is lost, as well as the file names.
Web Archive Singapore
Slovenian Web Archive Y N N The archive is not public yet. Plans to implement full-text search.
Digital Preservation of .ES domain Y (Future) Y (Future) Plans to grant access through computers available in a designated hall.
Digital Heritage of Catalonia (PADICAT) Y Y Y Full open access.
Basque Digital Heritage Archive Y Y Y
Sweden (Kulturarw3) Y N N Public access through dedicated machines in the library building.
Aleph Archives Y Y Y The full-text search engine supports automatic metadata extraction and native deduplication of results. Also included: antivirus checker (~250 million pages/day), archive statistics, text summarizer, archive exports (PDF, PNG, TIFF), etc.
Web Archive Switzerland Y (in 2011) Y (in 2011) The archived versions of the sites are not yet accessible. Web Archive Switzerland will be open to the public by spring 2011 - only access within the National Library and the partner libraries will be possible. The sites are being catalogued and the records are integrated in our library catalog Helveticat.
NTU Web Archiving System, NTUWAS Y Y Y Presents page thumbnails, archived pages mapped to geographical locations.
Web Archive Taiwan Y Y Y
PageFreezer Y Y Y Enterprise-class on-demand service to archive and replay websites, blogs, Ajax, Flash, video, audio and social media for litigation protection, eDiscovery and regulatory compliance with FDA, FINRA, FSA, SEC, SOX, Federal Rules of Evidence and records management laws. Used by government agencies and publicly listed corporations in the pharmaceutical, food, finance, healthcare and retail industries.
The UK Web Archive Y Y N
Hanzo Archives Y Y Y Commercial web archiving services and appliances. Access includes full-text search, annotations, redaction, URL/History, archive policy and temporal browsing, and configurable metadata schema for advanced e-discovery applications. Used in government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA.
UK Government Web Archive Y Y Y Full text search is operational on the UK Government Web Archive. Users can browse the collection using a full A-Z list of all sites and a set of categories.
Internet Archive (provides Archive-it service) Y Y Y URL history is available for all archived data. Metadata and full-text search are available only for selected crawls. Until 2002 it had a mining platform for research composed of Alexa Shell Perl Tools av_tools and the p2 platform for parallel processing. It was replaced by a simpler, direct access method that enables automatic access to files but provides no platform for processing.
Reed Technology Web Archiving Services
Columbia University Libraries Web Resources Collection Program Y Y Y Accessible through Archive-it service.
North Carolina State Government Web Site Archives Y Y Y Accessible through Archive-it service.
Latin American Web Archiving Project Y Y Y Content can be accessed via full-text search, or by browsing by country or by specialized sample collection.
Web Archiving Project for the Pacific Islands Y Y Y Supported by Archive-it service.
Library of Congress Web Archives Y Y N Access provided via http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html. Records are in MODS (Metadata Object Description Schema) format.
Harvard University Library: the Web Archive Collection Service (WAX) Y Y Y
Web Archiving Service from California Digital Library (WAS service) Y Y Y Access for private study, scholarship and research. Most archives built with WAS have not yet been published because it is up to the partners to decide whether they want to provide access. There are 16 partners using the service and they have created over 80 web archives, of which only 30 are publicly accessible. NutchWAX performance did not permit full archive search. The upcoming transition to Solr will permit both full-archive and collection-specific full-text search.
University of Michigan Web Archives Project Y Y Y Powered by the WAS from the California Digital Library. Access is public but usage is restricted for private study, scholarship and research.
University of Texas at San Antonio Web Archives Y Y Y Accessible through Archive-it service and the Texas Archival Repositories Online database
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 