Link rot
Link rot, also known as link death or link breaking, is an informal term for the process by which, either on individual websites or the Internet in general, increasing numbers of links point to web pages, servers, or other resources that have become permanently unavailable. The phrase also describes the effects of failing to update out-of-date web pages that clutter search engine results. A link that no longer works is called a broken link, dead link, or dangling link.
Causes
A link may become broken for several reasons. The most common result of a dead link is a 404 error, which indicates that the web server responded but the specific page could not be found.
Some news sites contribute to the link rot problem by keeping only recent news articles online where they are freely accessible at their original URLs, then removing them or moving them to a paid subscription area. This causes a heavy loss of supporting links in sites discussing newsworthy events and using news sites as references.
Another type of dead link occurs when the server that hosts the target page stops working or relocates to a new domain name.
In this case the browser may return a DNS error, or it may display a site unrelated to the content sought. The latter can occur when a domain name is allowed to lapse and is subsequently reregistered by another party. Domain names acquired in this manner are attractive to those who wish to exploit the stream of unsuspecting visitors they bring, inflating hit counters and PageRank.
A link might also be broken because of some form of blocking such as content filters or firewalls.
Dead links, commonplace on the Internet, can also occur on the authoring side, when website content is assembled, copied, or deployed without properly verifying the targets, or is simply not kept up to date.
Prevalence
The 404 "Not Found" response is familiar to even the occasional Web user. A number of studies have examined the prevalence of link rot on the Web, in academic literature, and in digital libraries. In a 2003 experiment, Fetterly et al. discovered that about one link out of every 200 disappeared each week from the Internet. McCown et al. (2005) discovered that half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after publication, and other studies have shown link rot in academic literature to be even worse (Spinellis, 2003; Lawrence et al., 2001). Nelson and Allen (2002) examined link rot in digital libraries and found that about 3% of the objects were no longer accessible after one year.
Discovering
Detecting link rot for a given URL is difficult using automated methods. If a URL is accessed and returns an HTTP 200 (OK) response, it may be considered accessible, but the contents of the page may have changed and may no longer be relevant. Some web servers also return a soft 404: a page served with a 200 (OK) response instead of the 404 that would indicate the URL is no longer accessible. Bar-Yossef et al. (2004) developed a heuristic for automatically discovering soft 404s.
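The soft-404 problem can be approximated in code. The sketch below (Python, standard library only; the probe-URL idea follows the spirit of Bar-Yossef et al.'s heuristic rather than their exact algorithm) classifies a URL by its HTTP status and by whether the same server also answers 200 for a path that almost certainly does not exist:

```python
import urllib.error
import urllib.parse
import urllib.request
import uuid

def fetch(url, timeout=10):
    """Return (status, body) for a URL, or (None, b"") on a network failure."""
    req = urllib.request.Request(url, headers={"User-Agent": "link-checker/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as e:
        return e.code, b""               # server answered with 4xx/5xx
    except (urllib.error.URLError, OSError):
        return None, b""                 # DNS failure, timeout, connection refused

def classify(status, probe_status, bodies_match):
    """Pure classification logic, separated out so it is testable offline."""
    if status is None:
        return "unreachable"
    if status != 200:
        return "dead"
    if probe_status == 200 and bodies_match:
        return "soft-404"                # nonsense path got the same 200 page
    return "ok"

def check_link(url):
    """Classify a live URL as 'ok', 'dead', 'soft-404', or 'unreachable'."""
    status, body = fetch(url)
    if status != 200:
        return classify(status, None, False)
    # Probe: request a random path on the same host that should not exist.
    parts = urllib.parse.urlsplit(url)
    probe = urllib.parse.urlunsplit(
        (parts.scheme, parts.netloc, "/" + uuid.uuid4().hex, "", ""))
    probe_status, probe_body = fetch(probe)
    return classify(status, probe_status, probe_body == body)
```

Comparing bodies byte-for-byte is deliberately strict; real checkers compare page similarity instead, since soft-404 pages often embed the requested path in their text.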
Combating
Dead links give an unprofessional image to both the linking site and the site linked to, so multiple solutions are available to tackle them: some work to prevent broken links in the first place, while others try to resolve them after they have occurred. Several tools have been developed to help combat link rot.
Server side
- Avoiding unmanaged hyperlink collections
- Avoiding links to pages deep in a website ("deep linking")
- Using redirection mechanisms (e.g. HTTP "301 Moved Permanently") to automatically refer browsers and crawlers to the new location of a URL
- Content management systems may offer built-in management of links, e.g. updating links when content is changed or moved on the site. WordPress guards against link rot by replacing non-canonical URLs with their canonical versions.
- IBM's Peridot attempts to automatically fix broken links.
- Permalinks stop broken links by guaranteeing that the content will never move. Another form of permalinking is linking to a permalink that then redirects to the actual content, ensuring that even though the real content may be moved, links pointing to the resource stay intact.
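The "301 Moved Permanently" redirection mechanism listed above can be sketched as a tiny WSGI application (Python standard library; the paths in the redirect map are invented for illustration, not from any real site):

```python
# Old URLs permanently forward to their new locations, so browsers and
# crawlers learn about the move instead of hitting a 404. The REDIRECTS
# map is hypothetical; a real site would maintain it as content moves.
REDIRECTS = {
    "/old-article": "/archive/2003/old-article",  # page was moved
    "/products.html": "/products/",               # site was restructured
}

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path in REDIRECTS:
        # The Location header carries the new URL; 301 tells clients
        # the move is permanent and they should update stored links.
        start_response("301 Moved Permanently",
                       [("Location", REDIRECTS[path])])
        return [b""]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not Found"]
```

Serve it with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`; a client requesting /old-article then receives the new path in the Location header instead of a dead end.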
User side
- The Linkgraph widget gets the URL of the correct page based upon the old broken URL by using historical location information.
- The Google 404 Widget employs Google technology to 'guess' the correct URL, and also provides the user a Google search box to find the correct page.
- When a user receives a 404 response, the Google Toolbar attempts to assist the user in finding the missing page.
- Deadurl.com gathers and ranks alternate URLs for a broken link using Google Cache, the Internet Archive, and user submissions. Typing deadurl.com/ to the left of a broken link in the browser's address bar and pressing Enter loads a ranked list of alternate URLs, or (depending on user preference) immediately forwards to the best one.
Web archiving
To combat link rot, web archivists are actively engaged in collecting the Web, or particular portions of it, and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public. The largest web archiving organization is the Internet Archive, which strives to maintain an archive of the entire Web, taking periodic snapshots of pages that can then be accessed for free via the Wayback Machine many years later, without registration, simply by typing in the URL or automatically by using browser extensions. National libraries, national archives, and various consortia of organizations are also involved in archiving culturally important Web content.
Individuals may also use a number of tools that allow them to archive web resources that may go missing in the future:
- WebCite, a tool specifically for scholarly authors, journal editors, and publishers to permanently archive "on-demand" and retrieve cited Internet references (Eysenbach and Trudel, 2005).
- Archive-It, a subscription service that allows institutions to build, manage, and search their own web archives.
- Some social bookmarking websites, such as Furl, make private copies of web pages bookmarked by their users.
- Google keeps a text-based cache (temporary copy) of the pages it has crawled, which can be used to read the information of recently removed pages. Unlike pages in archiving services, however, cached pages are not stored permanently.
- The Wayback Machine, at the Internet Archive, is a free website that archives old web pages. It does not archive websites whose owners have stated they do not want their website archived.
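Wayback Machine lookups can also be scripted. The sketch below (Python, standard library only) queries the Internet Archive's public availability endpoint at archive.org/wayback/available, which returns JSON describing the closest archived snapshot of a URL; the parsing is factored out so it can be exercised without network access:

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def availability_query(url):
    """Build the API request URL for a page we want an archived copy of."""
    return API + "?" + urllib.parse.urlencode({"url": url})

def closest_snapshot(response):
    """Extract the closest archived snapshot URL from a decoded API
    response dict, or None if the page was never archived."""
    snap = response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

def find_archived_copy(url, timeout=10):
    """Query the live API (requires network access)."""
    with urllib.request.urlopen(availability_query(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

A tool built on this could, for example, rewrite each dead link it finds to the snapshot URL the API returns.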
Authors citing URLs
A number of studies have shown how widespread link rot is in academic literature (see Prevalence, above). Authors of scholarly publications have also developed best practices for combating link rot in their work:
- Avoiding URL citations that point to resources on a researcher's personal home page (McCown et al., 2005)
- Using Persistent Uniform Resource Locators (PURLs) and digital object identifiers (DOIs) whenever possible
- Using web archiving services (e.g. WebCite) to permanently archive and retrieve cited Internet references (Eysenbach and Trudel, 2005)
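DOIs resist link rot through indirection: a citation names the stable identifier, and the resolver at doi.org redirects to wherever the publisher currently hosts the object, so the citation survives relocations. A minimal sketch (Python standard library; the DOI in the docstring is only illustrative):

```python
import urllib.parse
import urllib.request

def doi_url(doi):
    """Form the resolver URL for a DOI string such as "10.1000/182"."""
    return "https://doi.org/" + urllib.parse.quote(doi)

def resolve_doi(doi, timeout=10):
    """Follow the resolver's redirects to the object's current location
    (requires network access); the cited identifier itself never changes."""
    with urllib.request.urlopen(doi_url(doi), timeout=timeout) as resp:
        return resp.url  # final URL after any chain of redirects
```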
See also
- Digital preservation
- Internet Archive
- Permalink
- Slashdot effect
- Web archiving
- WebCite
External links
- Future-Proofing Your URIs
- Jakob Nielsen, "Fighting Linkrot", Jakob Nielsen's Alertbox, June 14, 1998
- Warrick, a tool for recovering lost websites from the Internet Archive and search engine caches
- Pagefactor and UndeadLinks.com, user-contributed databases of moved URLs
- W3C Link Checker
- mod_brokenlink, an Apache module that reports broken links