E-mail address harvesting
Email harvesting is the process of obtaining lists of email addresses using various methods for use in bulk email
or other purposes usually grouped as spam.
Methods
The simplest method involves spammers purchasing or trading lists of email addresses from other spammers. Another common method is the use of special software known as "harvesting bots" or "harvesters", which spider Web pages, postings on Usenet, mailing list archives, internet forums and other online sources to obtain email addresses from public data.
Spammers may also use a form of dictionary attack, known as a directory harvest attack, in which valid email addresses at a specific domain are found by guessing addresses built from common usernames. For example, the attacker tries alan@example.com, alana@example.com, alanb@example.com, and so on; any address that the recipient email server accepts for delivery, instead of rejecting, is added to the list of theoretically valid email addresses for that domain.
Another method of email address harvesting is to offer a product or service free of charge as long as the user provides a valid email address, and then use the addresses collected from users as spam targets. Common products and services offered are jokes of the day, daily bible quotes, news or stock alerts, free merchandise, or even registered sex offender alerts for one's area. Another technique was used in late 2007 by the company iDate, which used email harvesting directed at subscribers to the Quechup website to spam the victim's friends and contacts.
Spam differs from other forms of direct marketing in many ways, one of them being that it costs little more to send to a larger number of recipients than to a smaller number. For this reason, there is little pressure upon spammers to limit the number of addresses targeted in a spam run, or to restrict it to persons likely to be interested. One consequence is that many people receive spam written in languages they cannot read: a good deal of spam sent to English-speaking recipients is in Chinese or Korean, for instance. Likewise, lists of addresses sold for use in spam frequently contain malformed addresses, duplicate addresses, and addresses of role accounts such as postmaster.
Spammers may harvest email addresses from a number of sources. A popular method uses email addresses which their owners have published for other purposes. Usenet posts, especially those in archives such as Google Groups, frequently yield addresses. Simply searching the Web with spambots for pages that contain addresses, such as corporate staff directories or membership lists of professional societies, can yield thousands of addresses, most of them deliverable. Spammers have also subscribed to discussion mailing lists for the purpose of gathering the addresses of posters. The DNS and WHOIS systems require the publication of technical contact information for all Internet domains; spammers have illegally trawled these resources for email addresses. Spammers have also concluded that the email addresses at a business's domain name generally follow the same basic pattern, and are thus able to accurately guess the addresses of employees whose addresses they have not harvested. Many spammers use programs called web spiders to find email addresses on web pages. Usenet article message-IDs often look enough like email addresses that they are harvested as well. Spammers have also harvested email addresses directly from Google search results, without actually spidering the websites found in the search.
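The confusion between message-IDs and real addresses is easy to reproduce with a naive address pattern. The following is a minimal sketch for illustration only: the regular expression, the function name find_address_like and the sample header text are all assumptions, not taken from any actual harvester.

```python
import re

# A deliberately naive address pattern (an assumption for illustration;
# real address syntax is more permissive than this).
NAIVE_ADDRESS = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Made-up Usenet-style headers: one author address and one Message-ID.
SAMPLE_HEADERS = (
    "From: alice@example.com\n"
    "Message-ID: <20070101.1234.abcd@news.example.net>\n"
)


def find_address_like(text: str) -> list[str]:
    """Return every substring that looks like an email address."""
    return NAIVE_ADDRESS.findall(text)


matches = find_address_like(SAMPLE_HEADERS)
print(matches)
# ['alice@example.com', '20070101.1234.abcd@news.example.net']
```

The second match is the message-ID, which is not a mailbox at all; this is one reason harvested lists contain undeliverable entries.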
Spammer viruses may include a function which scans the victimized computer's disk drives (and possibly its network interfaces) for email addresses. These scanners discover email addresses which have never been exposed on the Web or in WHOIS. A compromised computer located on a shared network segment may capture email addresses from traffic addressed to its network neighbors. The harvested addresses are then returned to the spammer through the botnet created by the virus.
A controversial tactic, called "e-pending", involves the appending of email addresses to direct-marketing databases. Direct marketers normally obtain lists of prospects from sources such as magazine subscriptions and customer lists. By searching the Web and other resources for email addresses corresponding to the names and street addresses in their records, direct marketers can send targeted spam email. However, as with most spammer "targeting", this is imprecise; users have reported, for instance, receiving solicitations to mortgage their house at a specific street address, where the address was clearly a business address including mail stop and office number.
Spammers sometimes use various means to confirm addresses as deliverable. For instance, including a hidden Web bug in a spam message written in HTML may cause the recipient's mail client to transmit the recipient's address, or any other unique key, to the spammer's Web site. Users can defend against such abuses by turning off their mail program's option to display images, or by reading email as plain text rather than formatted.
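To make the Web bug mechanism concrete, the sketch below lists the remote images referenced by an HTML message body, which is roughly what a mail client has to identify before it can refuse to load them. It is a hedged illustration: the class RemoteImageFinder, the function remote_images, and the tracker URL with its id parameter are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class RemoteImageFinder(HTMLParser):
    """Collect the src of every <img> that points at a remote server."""

    def __init__(self) -> None:
        super().__init__()
        self.remote_sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        # Only http(s) sources trigger a request to a third-party server.
        if src.lower().startswith(("http://", "https://")):
            self.remote_sources.append(src)


def remote_images(html_body: str) -> list[str]:
    """Return the remote image URLs found in an HTML message body."""
    finder = RemoteImageFinder()
    finder.feed(html_body)
    finder.close()
    return finder.remote_sources


# Hypothetical spam body: the 1x1 image URL carries a per-recipient key.
SAMPLE_BODY = (
    "<p>Great offer!</p>"
    '<img src="http://tracker.example/pixel.gif?id=abc123" width="1" height="1">'
)

print(remote_images(SAMPLE_BODY))
# ['http://tracker.example/pixel.gif?id=abc123']
```

If the client fetched that URL, the server would learn that the mailbox tied to the key abc123 is read by someone; blocking images or rendering plain text prevents the request from ever being made.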
Likewise, spammers sometimes operate Web pages which purport to remove submitted addresses from spam lists. In several cases, these have been found to subscribe the entered addresses to receive more spam.
When a person fills out a web form, the submitted data is often sold to a spammer, with a web service or HTTP POST used to transfer it. The transfer is immediate and drops the address into various spammer databases. The revenue from the spammer is shared with the source. For instance, if someone applies online for a mortgage, the owner of the site may have made a deal with a spammer to sell the address. Spammers consider these the best addresses, because they are fresh and the user has just signed up for a product or service of a kind that is often marketed by spam.
Legality
In Australia, the creation or use of email-address harvesting programs (address harvesting software) is illegal under the 2003 anti-spam legislation only if the intent is to use those programs to send unsolicited commercial email. The legislation is intended to prohibit emails with "an Australian connection": spam originating in Australia and sent elsewhere, and spam sent to an Australian address.
In the United States, the CAN-SPAM Act of 2003 made it illegal to initiate email to a recipient where the electronic mail address of the recipient was obtained:
- Using an automated means that generates possible electronic mail addresses by combining names, letters, or numbers into numerous permutations.
- Using an automated means to extract electronic mail addresses from an Internet website or proprietary online service operated by another person, and such website or online service included, at the time the address was obtained, a notice stating that the operator of such website or online service will not give, sell, or otherwise transfer addresses maintained by such website or online service to any other party for the purposes of initiating, or enabling others to initiate, electronic mail messages.
Furthermore, website operators may not distribute their legitimately collected lists. The CAN-SPAM Act of 2003 requires that operators of web sites and online services include a notice that the site or service will not give, sell, or otherwise transfer addresses maintained by such website or online service to any other party for the purposes of initiating, or enabling others to initiate, electronic mail messages.
Anti-harvesting methods
Address munging : Address munging (for example, changing "bob@example.com" to "bob at example dot com") is a common technique to make harvesting email addresses more difficult. Though relatively easy to overcome (a search engine query for the munged form can find such addresses), it is still effective. It is somewhat inconvenient to users, who must examine the address and manually correct it.
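Both the technique and its weakness can be sketched in a few lines. This is an illustrative example under stated assumptions: the function names munge and unmunge are hypothetical, and the reversal handles only the simple "at"/"dot" wording used above.

```python
import re


def munge(address: str) -> str:
    """Rewrite 'bob@example.com' as 'bob at example dot com'."""
    local, _, domain = address.partition("@")
    return f"{local} at {domain.replace('.', ' dot ')}"


def unmunge(text: str) -> str:
    """Reverse the munging above, as a harvester trivially could."""
    text = re.sub(r"\s+at\s+", "@", text, count=1)
    return re.sub(r"\s+dot\s+", ".", text)


munged = munge("bob@example.com")
print(munged)           # bob at example dot com
print(unmunge(munged))  # bob@example.com
```

The protection therefore depends on harvesters not bothering to look for the munged form, not on the transformation being hard to undo.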
Images : Using images to display part or all of an email address is a very effective harvesting countermeasure. The processing required to automatically extract text from images is not economically viable for spammers. It is very inconvenient for users, who must manually launch their email client and transcribe the address.
Contact forms : Email contact forms, which send an email but do not reveal the recipient's address, avoid publishing an email address in the first place. Insecure forms, however, may actually aid spammers by effectively serving as an open mail relay. This method prevents users from composing in their preferred client and limits message content to plain text.
JavaScript obfuscation : JavaScript email obfuscation produces a normal, clickable email link for users while obscuring the address from spiders. In the source code seen by harvesters, the email address is scrambled, encoded, or otherwise obfuscated. In practice, a simple ROT13 encoding has been found to be very effective. For users with a JavaScript-enabled browser, this solution is entirely transparent and very convenient; however, it reduces accessibility, for example in text-based browsers and some screen readers.
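The ROT13 step itself is small. The sketch below shows, in Python, the encoding a site might apply on the server before embedding the string in a page; the script that decodes it in the browser is not shown. The helper name rot13 is an assumption made for this example.

```python
import codecs


def rot13(text: str) -> str:
    """Apply ROT13: letters rotate 13 places, other characters are unchanged."""
    return codecs.encode(text, "rot_13")


encoded = rot13("mailto:bob@example.com")
print(encoded)         # znvygb:obo@rknzcyr.pbz
print(rot13(encoded))  # mailto:bob@example.com
```

Because ROT13 is its own inverse, the same function decodes the string. A harvester that does not run the page's script sees only the encoded form, which still resembles an address but is not the real one.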
HTML obfuscation : In HTML, email addresses may be obfuscated in many ways, such as inserting hidden elements within the address or listing parts out of order and using CSS to restore the correct order. Each has the benefit of being transparent to most users, but none support clickable email links and none are accessible to text-based browsers and screen readers.
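The hidden-element variant can be sketched as follows. This is a minimal illustration, not a recommended implementation: the function names, the decoy text REMOVE and the inline display:none style are assumptions chosen for the example.

```python
import re


def obfuscate_with_hidden_span(address: str, decoy: str = "REMOVE") -> str:
    """Insert a CSS-hidden decoy element before the '@' of the address."""
    local, _, domain = address.partition("@")
    hidden = f'<span style="display:none">{decoy}</span>'
    return f"{local}{hidden}@{domain}"


def strip_tags(html: str) -> str:
    """What a crude harvester sees if it removes tags but ignores CSS."""
    return re.sub(r"<[^>]+>", "", html)


markup = obfuscate_with_hidden_span("bob@example.com")
print(markup)
# bob<span style="display:none">REMOVE</span>@example.com
print(strip_tags(markup))
# bobREMOVE@example.com
```

A browser that applies the style shows bob@example.com, while a harvester that only strips tags collects a wrong address. Clients that ignore CSS, such as text-based browsers, also see the decoy, which is the accessibility cost noted above.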
CAPTCHA : Requiring users to complete a CAPTCHA before giving out an email address is an effective harvesting countermeasure. A popular solution is the reCAPTCHA Mailhide service.
CAN-SPAM Notice : To enable prosecution of spammers under the CAN-SPAM Act of 2003, a website operator must post a notice that "the site or service will not give, sell, or otherwise transfer addresses maintained by such website or online service to any other party for the purposes of initiating, or enabling others to initiate, electronic mail messages."
Mail Server Monitoring : A method that can be implemented at the recipient email server for combatting directory harvesting attacks is to reject all email addresses as invalid from any sender that has specified more than one invalid recipient address; however, this carries a risk of legitimate email being blocked too.
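The policy described above can be sketched as a small bookkeeping class. This is a simplified illustration under stated assumptions: the class name RecipientGuard, keying senders by IP address, and the in-memory counter are all hypothetical, and a real mail server would also need to expire counts over time.

```python
from collections import defaultdict


class RecipientGuard:
    """Reject every recipient from a sender once it exceeds the invalid limit."""

    def __init__(self, valid_addresses: set[str], max_invalid: int = 1) -> None:
        self.valid_addresses = valid_addresses
        self.max_invalid = max_invalid
        # Number of invalid recipients seen so far, per sender.
        self.invalid_counts = defaultdict(int)

    def accept(self, sender_ip: str, recipient: str) -> bool:
        """Return True if the recipient should be accepted for this sender."""
        is_valid = recipient in self.valid_addresses
        if not is_valid:
            self.invalid_counts[sender_ip] += 1
        # More than max_invalid bad guesses: treat everything as invalid.
        if self.invalid_counts[sender_ip] > self.max_invalid:
            return False
        return is_valid


guard = RecipientGuard({"alan@example.com"})
results = [
    guard.accept("203.0.113.5", "alana@example.com"),  # first bad guess
    guard.accept("203.0.113.5", "alanb@example.com"),  # second bad guess
    guard.accept("203.0.113.5", "alan@example.com"),   # valid, but now blocked
]
print(results)  # [False, False, False]
```

The last result shows both the benefit and the risk: the attacker can no longer confirm that alan@example.com exists, but a legitimate sender who mistyped two addresses would be blocked in the same way.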
Spider Traps : A spider trap is a part of a website which is a honeypot designed to combat email harvesting spiders. Well-behaved spiders are unaffected, as the website's robots.txt file will warn spiders to stay away from that area, a warning that malicious spiders do not heed. Some traps block access from the client's IP as soon as the trap is accessed. Others, like a network tarpit, are designed to waste the time and resources of malicious spiders by slowly and endlessly feeding the spider useless information. The "bait" content may contain large numbers of fake addresses, a technique known as list poisoning, though some consider this practice harmful.
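The robots.txt warning can be illustrated with Python's standard urllib.robotparser module, which is how a well-behaved spider might check a URL before fetching it. The robots.txt content, the /trap/ path and the helper name may_fetch are assumptions made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /trap/ is the honeypot area (the path is an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("http://www.example.com/contact.html"))     # True
print(may_fetch("http://www.example.com/trap/page1.html"))  # False
```

A spider that performs this check never enters the trap, so any client that requests a URL under the disallowed path has identified itself as ignoring the rules.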
See also
- Anti-spam techniques
- Email spam
- Web crawler
- Web scraping
- Web harvesting
External links
- Spam laws
- Spamwise.org anti-harvesting tools and info
- ZDNet report on harvesting from Twitter
- How to deal with email harvesters