Spider Trap
A spider trap is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash. Web crawlers are also called web spiders, from which the name is derived. Spider traps may be created to "catch" spambots or other crawlers that waste a website's bandwidth. They may also be created unintentionally by calendars that use dynamic pages with links that continually point to the next day or year.
Common techniques used are:
- creation of indefinitely deep directory structures like http://foo.com/bar/foo/bar/foo/bar/foo/bar/..... (see the sketch after this list)
- dynamic pages like calendars that produce an infinite number of pages for a web crawler to follow.
- pages filled with a large number of characters, crashing the lexical analyzer parsing the page.
- pages with session IDs based on required cookies.
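The first two techniques above can be produced by only a few lines of server-side code. The following is a minimal, hypothetical Python sketch of such a trap, not any real site's implementation: every response links one directory level deeper, so a crawler that blindly follows links walks an endless foo/bar path like the one above. The handler class, host, and port are assumptions made for the example.

    # Hypothetical sketch of an intentional spider trap: each page links to an
    # ever-deeper URL, so a naive crawler never runs out of pages to request.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class TrapHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Alternate "bar" and "foo" segments: /bar -> /bar/foo -> /bar/foo/bar -> ...
            path = self.path.rstrip("/")
            deeper = path + ("/foo" if path.endswith("bar") else "/bar")
            body = '<html><body><a href="%s">next</a></body></html>' % deeper
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body.encode("utf-8"))

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), TrapHandler).serve_forever()

A calendar trap works the same way: the page for any date links to the page for the next date, which the server can always generate.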
There is no algorithm to detect all spider traps. Some classes of traps can be detected automatically, but new, unrecognized traps arise quickly.
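As an illustration of the kind of partial detection that is possible, the sketch below (an assumption-laden example, not a standard algorithm) flags URLs that are extremely long, very deep, or built from repeating path segments. Heuristics like these catch the directory-depth trap above, but say nothing about, for example, a calendar that emits superficially ordinary pages.

    # Hypothetical crawler-side heuristics; the thresholds and the helper name
    # looks_like_trap are assumptions chosen for this example.
    from collections import Counter
    from urllib.parse import urlparse

    MAX_URL_LENGTH = 2000    # generated URLs tend to grow without bound
    MAX_PATH_DEPTH = 15      # /foo/bar/foo/bar/... keeps adding segments
    MAX_SEGMENT_REPEATS = 3  # the same segment recurring suggests a loop

    def looks_like_trap(url):
        """Return True if the URL matches simple trap patterns (False is no guarantee)."""
        if len(url) > MAX_URL_LENGTH:
            return True
        segments = [s for s in urlparse(url).path.split("/") if s]
        if len(segments) > MAX_PATH_DEPTH:
            return True
        if segments and Counter(segments).most_common(1)[0][1] > MAX_SEGMENT_REPEATS:
            return True
        return False

    print(looks_like_trap("http://foo.com/bar/foo/bar/foo/bar/foo/bar/foo/bar"))  # True
    print(looks_like_trap("http://foo.com/2009/05/spider-traps"))                 # False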
Politeness
A spider trap causes a web crawler to enter something like an infinite loop, which wastes the spider's resources, lowers its productivity, and, in the case of a poorly written crawler, can crash the program. Polite spiders alternate requests between different hosts and don't request documents from the same server more than once every several seconds, so a "polite" web crawler is affected to a much lesser degree than an "impolite" one.
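A minimal sketch of that per-host delay, assuming a single-threaded crawler and a caller-supplied fetch function, might look as follows; a production crawler usually enforces the same rule with a scheduling queue that interleaves hosts instead of sleeping.

    # Hypothetical politeness wrapper: never hit the same host more often than
    # once every MIN_DELAY seconds, which bounds the damage a trap can do.
    import time
    from urllib.parse import urlparse

    MIN_DELAY = 5.0          # seconds between requests to any single host
    _last_request = {}       # host -> time of the most recent request to it

    def polite_fetch(url, fetch):
        host = urlparse(url).netloc
        wait = MIN_DELAY - (time.monotonic() - _last_request.get(host, float("-inf")))
        if wait > 0:
            time.sleep(wait)  # even inside a trap, the request rate stays capped
        _last_request[host] = time.monotonic()
        return fetch(url)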
In addition, sites with spider traps usually have a robots.txt file telling bots not to go to the trap, so a legitimate "polite" bot would not fall into it, whereas an "impolite" bot that disregards the robots.txt settings would be affected by the trap.
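For example, a crawler written in Python can check robots.txt with the standard-library robotparser module before requesting anything; the user agent string and URLs below are placeholders.

    # Consult robots.txt before crawling, as a "polite" bot would.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://foo.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    url = "http://foo.com/bar/foo/bar/"  # a path the site may have disallowed
    if rp.can_fetch("ExampleBot", url):
        print("allowed to fetch", url)
    else:
        print("robots.txt asks bots to stay out of", url)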