Link rot
Link rot, also known as link death or link breaking, is an informal term for the process by which, either on individual websites or the Internet in general, increasing numbers of links point to web pages, servers, or other resources that have become permanently unavailable. The phrase also describes the effects of failing to update out-of-date web pages that clutter search engine results. A link that no longer works is called a broken link, dead link, or dangling link.
Causes
A link may become broken for several reasons. The most common result of a dead link is a 404 error, which indicates that the web server responded but the specific page could not be found.
Some news sites contribute to the link rot problem by keeping only recent news articles online where they are freely accessible at their original URLs, then removing them or moving them to a paid subscription area. This causes a heavy loss of supporting links in sites discussing newsworthy events and using news sites as references.
Another type of dead link occurs when the server that hosts the target page stops working or relocates to a new domain name.
In this case the browser may return a DNS error, or it may display a site unrelated to the content sought. The latter can occur when a domain name is allowed to lapse and is subsequently reregistered by another party. Domain names acquired in this manner are attractive to those who wish to exploit the stream of unsuspecting visitors they bring, inflating hit counters and PageRank.
A link might also be broken because of some form of blocking such as content filters or firewalls.
Dead links, commonplace on the Internet, can also occur on the authoring side, when website content is assembled, copied, or deployed without properly verifying the targets, or is simply not kept up to date.
Prevalence
The 404 "Not Found" response is familiar to even the occasional Web user. A number of studies have examined the prevalence of link rot on the Web, in academic literature, and in digital libraries. In a 2003 experiment, Fetterly et al. discovered that about one link out of every 200 disappeared each week from the Internet. McCown et al. (2005) discovered that half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after publication, and other studies have shown link rot in academic literature to be even worse (Spinellis, 2003; Lawrence et al., 2001). Nelson and Allen (2002) examined link rot in digital libraries and found that about 3% of the objects were no longer accessible after one year.
Discovering
Detecting link rot for a given URL is difficult using automated methods. If a URL is accessed and returns an HTTP 200 (OK) response, it may be considered accessible, but the contents of the page may have changed and may no longer be relevant. Some web servers also return a soft 404: a page served with a 200 (OK) response instead of the 404 that would indicate the URL is no longer accessible. Bar-Yossef et al. (2004) developed a heuristic for automatically discovering soft 404s.
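The soft-404 problem can be approximated in code. The sketch below (Python, standard library only; the probe-URL idea follows the spirit of Bar-Yossef et al.'s heuristic rather than their exact algorithm) classifies a URL by its HTTP status and by whether the same server also answers 200 for a path that almost certainly does not exist:

```python
import urllib.error
import urllib.parse
import urllib.request
import uuid

def fetch(url, timeout=10):
    """Return (status, body) for a URL, or (None, b"") on a network failure."""
    req = urllib.request.Request(url, headers={"User-Agent": "link-checker/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as e:
        return e.code, b""               # server answered with 4xx/5xx
    except (urllib.error.URLError, OSError):
        return None, b""                 # DNS failure, timeout, connection refused

def classify(status, probe_status, bodies_match):
    """Pure classification logic, separated out so it is testable offline."""
    if status is None:
        return "unreachable"
    if status != 200:
        return "dead"
    if probe_status == 200 and bodies_match:
        return "soft-404"                # nonsense path got the same 200 page
    return "ok"

def check_link(url):
    """Classify a live URL as 'ok', 'dead', 'soft-404', or 'unreachable'."""
    status, body = fetch(url)
    if status != 200:
        return classify(status, None, False)
    # Probe: request a random path on the same host that should not exist.
    parts = urllib.parse.urlsplit(url)
    probe = urllib.parse.urlunsplit(
        (parts.scheme, parts.netloc, "/" + uuid.uuid4().hex, "", ""))
    probe_status, probe_body = fetch(probe)
    return classify(status, probe_status, probe_body == body)
```

Comparing bodies byte-for-byte is deliberately strict; real checkers compare page similarity instead, since soft-404 pages often embed the requested path in their text.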
Combating
Dead links give an unprofessional image to both the linking site and the site linked to, so multiple solutions are available to tackle them: some work to prevent broken links in the first place, while others try to resolve them after they have occurred. Several tools have been developed to help combat link rot.
Server side
- Avoiding unmanaged hyperlink collections
- Avoiding links to pages deep in a website ("deep linking")
- Using redirection mechanisms (e.g. HTTP "301 Moved Permanently") to automatically refer browsers and crawlers to the new location of a URL
- Content management systems may offer built-in management of links, e.g. updating links when content is changed or moved on the site. WordPress guards against link rot by replacing non-canonical URLs with their canonical versions.
- IBM's Peridot attempts to automatically fix broken links.
- Permalinks stop broken links by guaranteeing that the content will never move. Another form of permalinking is linking to a permalink that then redirects to the actual content, ensuring that even though the real content may be moved, links pointing to the resource stay intact.
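The "301 Moved Permanently" redirection mechanism listed above can be sketched as a tiny WSGI application (Python standard library; the paths in the redirect map are invented for illustration, not from any real site):

```python
# Old URLs permanently forward to their new locations, so browsers and
# crawlers learn about the move instead of hitting a 404. The REDIRECTS
# map is hypothetical; a real site would maintain it as content moves.
REDIRECTS = {
    "/old-article": "/archive/2003/old-article",  # page was moved
    "/products.html": "/products/",               # site was restructured
}

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path in REDIRECTS:
        # The Location header carries the new URL; 301 tells clients
        # the move is permanent and they should update stored links.
        start_response("301 Moved Permanently",
                       [("Location", REDIRECTS[path])])
        return [b""]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not Found"]
```

Serve it with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`; a client requesting /old-article then receives the new path in the Location header instead of a dead end.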
User side
- The Linkgraph widget gets the URL of the correct page based upon the old broken URL by using historical location information.
- The Google 404 Widget employs Google technology to 'guess' the correct URL, and also provides the user a Google search box to find the correct page.
- When a user receives a 404 response, the Google Toolbar attempts to assist the user in finding the missing page.
- Deadurl.com gathers and ranks alternate URLs for a broken link using Google Cache, the Internet Archive, and user submissions. Typing deadurl.com/ to the left of a broken link in the browser's address bar and pressing Enter loads a ranked list of alternate URLs, or (depending on user preference) immediately forwards to the best one.
Web archiving
To combat link rot, web archivists are actively engaged in collecting the Web, or particular portions of it, and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public. The largest web archiving organization is the Internet Archive, which strives to maintain an archive of the entire Web, taking periodic snapshots of pages that can then be accessed for free via the Wayback Machine many years later, without registration, simply by typing in the URL or automatically by using browser extensions. National libraries, national archives, and various consortia of organizations are also involved in archiving culturally important Web content.
Individuals may also use a number of tools that allow them to archive web resources that may go missing in the future:
- WebCite, a tool specifically for scholarly authors, journal editors, and publishers to permanently archive "on-demand" and retrieve cited Internet references (Eysenbach and Trudel, 2005).
- Archive-It, a subscription service that allows institutions to build, manage, and search their own web archives.
- Some social bookmarking websites, such as Furl, make private copies of web pages bookmarked by their users.
- Google keeps a text-based cache (temporary copy) of the pages it has crawled, which can be used to read the information of recently removed pages. Unlike pages in archiving services, however, cached pages are not stored permanently.
- The Wayback Machine, at the Internet Archive, is a free website that archives old web pages. It does not archive websites whose owners have stated they do not want their website archived.
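Wayback Machine lookups can also be scripted. The sketch below (Python, standard library only) queries the Internet Archive's public availability endpoint at archive.org/wayback/available, which returns JSON describing the closest archived snapshot of a URL; the parsing is factored out so it can be exercised without network access:

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def availability_query(url):
    """Build the API request URL for a page we want an archived copy of."""
    return API + "?" + urllib.parse.urlencode({"url": url})

def closest_snapshot(response):
    """Extract the closest archived snapshot URL from a decoded API
    response dict, or None if the page was never archived."""
    snap = response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

def find_archived_copy(url, timeout=10):
    """Query the live API (requires network access)."""
    with urllib.request.urlopen(availability_query(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

A tool built on this could, for example, rewrite each dead link it finds to the snapshot URL the API returns.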
Authors citing URLs
A number of studies have shown how widespread link rot is in academic literature (see Prevalence, above). Authors of scholarly publications have also developed best practices for combating link rot in their work:
- Avoiding URL citations that point to resources on a researcher's personal home page (McCown et al., 2005)
- Using Persistent Uniform Resource Locators (PURLs) and digital object identifiers (DOIs) whenever possible
- Using web archiving services (e.g. WebCite) to permanently archive and retrieve cited Internet references (Eysenbach and Trudel, 2005)
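DOIs resist link rot through indirection: a citation names the stable identifier, and the resolver at doi.org redirects to wherever the publisher currently hosts the object, so the citation survives relocations. A minimal sketch (Python standard library; the DOI in the docstring is only illustrative):

```python
import urllib.parse
import urllib.request

def doi_url(doi):
    """Form the resolver URL for a DOI string such as "10.1000/182"."""
    return "https://doi.org/" + urllib.parse.quote(doi)

def resolve_doi(doi, timeout=10):
    """Follow the resolver's redirects to the object's current location
    (requires network access); the cited identifier itself never changes."""
    with urllib.request.urlopen(doi_url(doi), timeout=timeout) as resp:
        return resp.url  # final URL after any chain of redirects
```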
See also
- Digital preservation
- Internet Archive
- Permalink
- Slashdot effect
- Web archiving
- WebCite
External links
- Future-Proofing Your URIs
- Jakob Nielsen, "Fighting Linkrot", Jakob Nielsen's Alertbox, June 14, 1998
- Warrick, a tool for recovering lost websites from the Internet Archive and search engine caches
- Pagefactor and UndeadLinks.com, user-contributed databases of moved URLs
- W3C Link Checker
- mod_brokenlink, an Apache module that reports broken links