Spider Trap
A spider trap is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash. Web crawlers are also called web spiders, from which the name is derived. Spider traps may be created to "catch" spambots or other crawlers that waste a website's bandwidth. They may also be created unintentionally by calendars that use dynamic pages with links that continually point to the next day or year.
Common techniques used are:
- creation of indefinitely deep directory structures like http://foo.com/bar/foo/bar/foo/bar/foo/bar/..... (see the sketch after this list)
- dynamic pages like calendars that produce an infinite number of pages for a web crawler to follow.
- pages filled with a large number of characters, crashing the lexical analyzer parsing the page.
- pages with session IDs based on required cookies.
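The first two techniques above can be produced by only a few lines of server-side code. The following is a minimal, hypothetical Python sketch of such a trap, not any real site's implementation: every response links one directory level deeper, so a crawler that blindly follows links walks an endless foo/bar path like the one above. The handler class, host, and port are assumptions made for the example.

    # Hypothetical sketch of an intentional spider trap: each page links to an
    # ever-deeper URL, so a naive crawler never runs out of pages to request.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class TrapHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Alternate "bar" and "foo" segments: /bar -> /bar/foo -> /bar/foo/bar -> ...
            path = self.path.rstrip("/")
            deeper = path + ("/foo" if path.endswith("bar") else "/bar")
            body = '<html><body><a href="%s">next</a></body></html>' % deeper
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body.encode("utf-8"))

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), TrapHandler).serve_forever()

A calendar trap works the same way: the page for any date links to the page for the next date, which the server can always generate.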
There is no algorithm to detect all spider traps. Some classes of traps can be detected automatically, but new, unrecognized traps arise quickly.
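As an illustration of the kind of partial detection that is possible, the sketch below (an assumption-laden example, not a standard algorithm) flags URLs that are extremely long, very deep, or built from repeating path segments. Heuristics like these catch the directory-depth trap above, but say nothing about, for example, a calendar that emits superficially ordinary pages.

    # Hypothetical crawler-side heuristics; the thresholds and the helper name
    # looks_like_trap are assumptions chosen for this example.
    from collections import Counter
    from urllib.parse import urlparse

    MAX_URL_LENGTH = 2000    # generated URLs tend to grow without bound
    MAX_PATH_DEPTH = 15      # /foo/bar/foo/bar/... keeps adding segments
    MAX_SEGMENT_REPEATS = 3  # the same segment recurring suggests a loop

    def looks_like_trap(url):
        """Return True if the URL matches simple trap patterns (False is no guarantee)."""
        if len(url) > MAX_URL_LENGTH:
            return True
        segments = [s for s in urlparse(url).path.split("/") if s]
        if len(segments) > MAX_PATH_DEPTH:
            return True
        if segments and Counter(segments).most_common(1)[0][1] > MAX_SEGMENT_REPEATS:
            return True
        return False

    print(looks_like_trap("http://foo.com/bar/foo/bar/foo/bar/foo/bar/foo/bar"))  # True
    print(looks_like_trap("http://foo.com/2009/05/spider-traps"))                 # False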
Politeness
A spider trap causes a web crawler to enter something like an infinite loop, which wastes the spider's resources, lowers its productivity, and, in the case of a poorly written crawler, can crash the program. Polite spiders alternate requests between different hosts and don't request documents from the same server more than once every several seconds, so a "polite" web crawler is affected to a much lesser degree than an "impolite" one.
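A minimal sketch of that per-host delay, assuming a single-threaded crawler and a caller-supplied fetch function, might look as follows; a production crawler usually enforces the same rule with a scheduling queue that interleaves hosts instead of sleeping.

    # Hypothetical politeness wrapper: never hit the same host more often than
    # once every MIN_DELAY seconds, which bounds the damage a trap can do.
    import time
    from urllib.parse import urlparse

    MIN_DELAY = 5.0          # seconds between requests to any single host
    _last_request = {}       # host -> time of the most recent request to it

    def polite_fetch(url, fetch):
        host = urlparse(url).netloc
        wait = MIN_DELAY - (time.monotonic() - _last_request.get(host, float("-inf")))
        if wait > 0:
            time.sleep(wait)  # even inside a trap, the request rate stays capped
        _last_request[host] = time.monotonic()
        return fetch(url)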
In addition, sites with spider traps usually have a robots.txt file telling bots not to go to the trap, so a legitimate "polite" bot would not fall into it, whereas an "impolite" bot that disregards the robots.txt settings would be affected by the trap.
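For example, a crawler written in Python can check robots.txt with the standard-library robotparser module before requesting anything; the user agent string and URLs below are placeholders.

    # Consult robots.txt before crawling, as a "polite" bot would.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://foo.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    url = "http://foo.com/bar/foo/bar/"  # a path the site may have disallowed
    if rp.can_fetch("ExampleBot", url):
        print("allowed to fetch", url)
    else:
        print("robots.txt asks bots to stay out of", url)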