URL normalization
URL normalization is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so that it is possible to determine whether two syntactically different URLs may be equivalent.
Search engines employ URL normalization in order to assign importance to web pages and to reduce indexing of duplicate pages. Web crawlers perform URL normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached.
Normalization process
There are several types of normalization that may be performed. Some of them preserve semantics and some do not.
Semantic preserving normalizations
The following normalizations are described in RFC 3986 as resulting in equivalent URLs:
- Converting the scheme and host to lower case. The scheme and host components of the URL are case-insensitive. Most normalizers will convert them to lowercase. Example:
HTTP://www.Example.com/ → http://www.example.com/
- Capitalizing letters in escape sequences. All letters within a percent-encoding triplet (e.g., "%3A") are case-insensitive, and should be capitalized. Example:
http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b
- Decoding percent-encoded octets of unreserved characters. For consistency, percent-encoded octets in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers. Example:
http://www.example.com/%7Eusername/ → http://www.example.com/~username/
- Adding a trailing "/". Directories are indicated with a trailing slash, which should be included in URLs. Example:
http://www.example.com → http://www.example.com/
- Removing the default port. The default port (port 80 for the “http” scheme) may be removed from (or added to) a URL. Example:
http://www.example.com:80/bar.html → http://www.example.com/bar.html
- Removing dot-segments. The segments “..” and “.” are usually removed from a URL according to the algorithm described in RFC 3986 (or a similar algorithm). Example:
http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html
These normalizations can be applied on URLs without changing the semantics.
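The semantics-preserving steps listed above can be sketched in Python using only the standard library. This is a minimal illustration rather than a complete RFC 3986 normalizer: the function names (`normalize`, `fix_escapes`, `remove_dot_segments`) are hypothetical, the dot-segment routine is a simplified stack-based variant that assumes an absolute path, and URLs with userinfo or IPv6 literals are assumed not to occur.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Only the "http" default port is taken from the text; extend as needed.
DEFAULT_PORTS = {"http": 80}
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def fix_escapes(text):
    """Decode escapes of unreserved characters; uppercase all other triplets."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return _ESCAPE.sub(repl, text)


def remove_dot_segments(path):
    """Drop "." and ".." segments from an absolute path (simplified sketch)."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg in (".", ".."):
            if seg == ".." and len(out) > 1:
                out.pop()  # step back one level, but never above the root
            if is_last:
                out.append("")  # keep the trailing slash
        else:
            out.append(seg)
    return "/".join(out)


def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.hostname or ""  # urlsplit returns the hostname in lower case
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += f":{parts.port}"
    # An empty path becomes "/" (the trailing-slash rule).
    path = remove_dot_segments(fix_escapes(parts.path)) or "/"
    query = fix_escapes(parts.query)
    return urlunsplit((scheme, netloc, path, query, parts.fragment))
```

Escapes are decoded before dot-segments are removed, so an encoded period such as "%2E" is treated like a literal "." when the path is simplified.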
Semantic changing normalizations
Applying the following normalizations results in a semantically different URL, although it may refer to the same resource:
- Removing directory index. Default directory indexes are generally not needed in URLs. Examples:
http://www.example.com/default.asp → http://www.example.com/
http://www.example.com/a/index.html → http://www.example.com/a/
- Removing the fragment. The fragment component of a URL is usually removed. Example:
http://www.example.com/bar.html#section1 → http://www.example.com/bar.html
- Removing IP. Check whether the IP address corresponds to a domain name and, if so, replace it with the domain name. Example:
http://208.77.188.166/ → http://www.example.com/
- Limiting protocols. Different application layer protocols may be limited to a single one. For example, the “https” scheme could be replaced with “http”. Example:
https://www.example.com/ → http://www.example.com/
- Removing duplicate slashes. Paths which include two adjacent slashes should be converted to use a single slash. Example:
http://www.example.com/foo//bar.html → http://www.example.com/foo/bar.html
- Removing “www” as the first domain label. Some websites operate in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting the least significant label from the name of the first. For example, http://www.example.com/ and http://example.com/ may access the same website. Although many websites redirect the user to the non-www address (or vice versa), some do not. A normalizer may perform extra processing to determine if there is a non-www equivalent and then normalize all URLs to the non-www prefix. Example:
http://www.example.com/ → http://example.com/
- Sorting the variables of active pages. Some active web pages have more than one variable in the URL. A normalizer can remove all the variables with their data, sort them into alphabetical order (by variable name), and reassemble the URL. However, Web servers differ in whether they allow the same variable to appear multiple times, and how this should be represented. Example:
http://www.example.com/display?lang=en&article=fred → http://www.example.com/display?article=fred&lang=en
- Removing arbitrary querystring variables. An active page may expect certain variables to appear in the querystring; all unexpected variables should be removed. Example:
http://www.example.com/display?id=123&fakefoo=fakebar → http://www.example.com/display?id=123
- Removing default querystring variables. A default value in the querystring will render identically whether it is there or not. When a default value appears in the querystring, it can be removed. Example:
http://www.example.com/display?id=&sort=ascending → http://www.example.com/display
- Removing the "?" when the querystring is empty. When the querystring is empty, there is no need for the "?". Example:
http://www.example.com/display? → http://www.example.com/display
- Standardizing character encoding. When the URL contains special characters such as a slash, dot, or space, check to see if the encoded forms such as "%2F" and the unencoded forms such as "/" are the same. Example:
http://www.example.com/display?category=foo/bar+baz → http://www.example.com/display?category=foo%2Fbar%2Bbaz
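A few of the normalizations in this list (removing the directory index, the fragment, duplicate slashes and the empty "?", and sorting or filtering querystring variables) can be sketched as follows. The function name `normalize_lossy`, the `allowed_params` parameter and the `INDEX_NAMES` set are assumptions made for illustration; a real normalizer would choose them per website.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of directory index file names, taken from the examples above.
INDEX_NAMES = {"index.html", "default.asp"}


def normalize_lossy(url, allowed_params=None):
    """Apply several semantics-changing normalizations to one URL."""
    parts = urlsplit(url)

    # Collapse runs of adjacent slashes in the path.
    path = re.sub(r"/{2,}", "/", parts.path)

    # Remove a default directory index file name.
    head, _, tail = path.rpartition("/")
    if tail in INDEX_NAMES:
        path = head + "/"

    # Keep only the expected variables (if a list is given), then sort by name.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if allowed_params is not None:
        pairs = [(k, v) for k, v in pairs if k in allowed_params]
    pairs.sort(key=lambda kv: kv[0])  # stable: repeated names keep their order
    query = urlencode(pairs)

    # Passing "" drops the fragment; urlunsplit omits "?" for an empty query.
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))
```

Because `parse_qsl` decodes the values and `urlencode` encodes them again, this sketch also rewrites the character encoding of the querystring. It treats "+" as an encoded space, which differs from the literal reading of "+" in the character-encoding example above.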
Normalization based on URL lists
Some normalization rules may be developed for specific websites by examining URL lists obtained from previous crawls or web server logs. For example, if the URL
http://foo.org/story?id=xyz
appears in a crawl log several times along with
http://foo.org/story_xyz
we may assume that the two URLs are equivalent and can be normalized to one of the URL forms.
Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URLs with similar text) rules that can be applied to URL lists. They showed that once the correct DUST rules were found and applied with a canonicalization algorithm, they were able to find up to 68% of the redundant URLs in a URL list.
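Once a site-specific rule is known, applying it to a URL list is straightforward. The sketch below uses a single hand-written substitution rule for the foo.org example; it is not the DustBuster heuristic, which discovers such rules automatically, and the names `RULES`, `canonicalize` and `group_duplicates` are hypothetical.

```python
import re
from collections import defaultdict

# Hand-written rule for the foo.org example: /story?id=X is rewritten to /story_X.
RULES = [
    (re.compile(r"^(http://foo\.org)/story\?id=([^&#]+)$"), r"\1/story_\2"),
]


def canonicalize(url):
    """Rewrite a URL with every rule in order."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


def group_duplicates(urls):
    """Map each canonical URL to the original URLs that normalize to it."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonicalize(url)].append(url)
    return dict(groups)
```

Any group with more than one member then marks URLs that a crawler could treat as redundant under the chosen rules.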