Web ARChive
The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format [ARC_IA], which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.
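To make the idea of an aggregate file of records concrete, the following is a minimal Python sketch, not a conforming implementation. It assumes the general record layout of WARC 1.0: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. The helper names `build_record` and `read_records`, the example URI, and the payloads are hypothetical and chosen only for illustration; a real archive should be written and read with a library that implements the full specification.

```python
import uuid
from datetime import datetime, timezone


def build_record(record_type: str, payload: bytes, content_type: str,
                 target_uri: str) -> bytes:
    """Serialize one simplified WARC-style record (hypothetical helper)."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # Header, blank line, content block, then two CRLFs that end the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


def read_records(data: bytes):
    """Yield (headers, block) pairs from concatenated records (hypothetical helper)."""
    pos = 0
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        # lines[0] is the version line; the rest are "Name: value" fields.
        headers = dict(line.split(": ", 1) for line in lines[1:])
        start = head_end + 4
        length = int(headers["Content-Length"])
        yield headers, data[start:start + length]
        pos = start + length + 4  # skip the two CRLFs after the block


# Primary content and assigned metadata stored side by side in one archive.
archive = (
    build_record("resource", b"<html>hello</html>", "text/html",
                 "http://example.org/")
    + build_record("metadata", b"note: example", "text/plain",
                   "http://example.org/")
)
records = list(read_records(archive))
for headers, block in records:
    print(headers["WARC-Type"], len(block))
```

The sketch stores a primary `resource` record next to a `metadata` record about the same URI, which mirrors the distinction the text draws between primary content and related secondary content. Because each record carries its own `Content-Length`, records can simply be concatenated and read back one after another.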
External links
- http://archive-access.sourceforge.net/warc/
- http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
- http://www.iso.org/iso/pressrelease.htm?refid=Ref1255
- http://www.archive.org/about/about.php