Complement set email filtering
Encyclopedia
Complement Set Filtering (CSF) is a method for filtering unsolicited bulk email
(UBE or spam
) The technique utilizes at least two email
accounts: the primary account where spam and non-spam is received and secondary accounts that receive only spam. CSF calculates the set theoretic difference between the primary and secondary email sets (email accounts) and identifies email messages contained in both sets.
The UBE account is established by creating a mailbox (or alias) incorporating a common first name (to help spammers guess the address) and the domain of the primary account, then exposing the UBE account to the internet. For example, if the primary mailbox is johnm@domain.com, the UBE account might be john@domain.com (see diagram below). After the UBE mailbox is set up, the email address is given to spammers by posting it to message boards, portal groups, “Who Is” listings, ecommerce sites and Usenet.
CSF works especially well in corporate environments where the domain is targeted by spammers and UBE tends to be very similar from mailbox to mailbox. Also, because CSF does not depend on characteristics of past UBE to identify current UBE it is particularly well suited for identifying UBE with new subject matter.
CSF has at least three advantages over Bayesian
and pattern analysis algorithms. First, CSF does not depend on content analysis other than what is required to find similarities between messages in the primary and UBE accounts. Second, CSF does not utilize scoring (word ranking) that can be circumvented with message obfuscating (V!agra instead of Viagra, for example). Third, CSF takes advantage of the fact most UBE contains identical message content, particularly messages targeted at specific corporate domains.
E-mail spam
Email spam, also known as junk email or unsolicited bulk email , is a subset of spam that involves nearly identical messages sent to numerous recipients by email. Definitions of spam usually include the aspects that email is unsolicited and sent in bulk. One subset of UBE is UCE...
(UBE or spam
E-mail spam
Email spam, also known as junk email or unsolicited bulk email , is a subset of spam that involves nearly identical messages sent to numerous recipients by email. Definitions of spam usually include the aspects that email is unsolicited and sent in bulk. One subset of UBE is UCE...
) The technique utilizes at least two email
Email
Electronic mail, commonly known as email or e-mail, is a method of exchanging digital messages from an author to one or more recipients. Modern email operates across the Internet or other computer networks. Some early email systems required that the author and the recipient both be online at the...
accounts: the primary account where spam and non-spam is received and secondary accounts that receive only spam. CSF calculates the set theoretic difference between the primary and secondary email sets (email accounts) and identifies email messages contained in both sets.
Implementation
CSF is implemented by comparing message content in a UBE account (separate mailbox or alias) with the message content in a primary account. By definition, messages contained in the UBE account are spam so messages in the primary account that are substantially similar to messages in the UBE account are also spam. When the same message is found in both the primary account and the UBE account, it is deleted from the primary account.The UBE account is established by creating a mailbox (or alias) incorporating a common first name (to help spammers guess the address) and the domain of the primary account, then exposing the UBE account to the internet. For example, if the primary mailbox is johnm@domain.com, the UBE account might be john@domain.com (see diagram below). After the UBE mailbox is set up, the email address is given to spammers by posting it to message boards, portal groups, “Who Is” listings, ecommerce sites and Usenet.
CSF works especially well in corporate environments where the domain is targeted by spammers and UBE tends to be very similar from mailbox to mailbox. Also, because CSF does not depend on characteristics of past UBE to identify current UBE it is particularly well suited for identifying UBE with new subject matter.
Advantages of CSF
Many spam-filtering techniques search for patterns and known spam subject matter in the headers and bodies of messages. Others use probabilities (Bayesian statistical methods, for example) to identify unwanted messages. CSF is effective as a stand alone filter or can be combined with other techniques.CSF has at least three advantages over Bayesian
Bayesian probability
Bayesian probability is one of the different interpretations of the concept of probability and belongs to the category of evidential probabilities. The Bayesian interpretation of probability can be seen as an extension of logic that enables reasoning with propositions, whose truth or falsity is...
and pattern analysis algorithms. First, CSF does not depend on content analysis other than what is required to find similarities between messages in the primary and UBE accounts. Second, CSF does not utilize scoring (word ranking) that can be circumvented with message obfuscating (V!agra instead of Viagra, for example). Third, CSF takes advantage of the fact most UBE contains identical message content, particularly messages targeted at specific corporate domains.