Sri Lanka Association for Artificial Intelligence (SLAAI) Proceedings of the third Annual Sessions
Recent Developments in Bayesian Approach in Filtering Junk E-mail
Department of Electrical and Computer Engineering,
Abstract
Junk mail is one of the main problems on the Internet. There are several methods for the automated construction of filters to eliminate such unwanted messages from a user’s mail system. This paper is mainly concerned with the Bayesian filtering method and its different types of applications in junk e-mail filtering. The Bayesian technique is trained automatically to
implementations that use Bayesian techniques are available as software. Any user can apply this software in different layers on the client side or the server side. But spammers are now trying to defeat Bayesian filters by including random dictionary words and/or short stories in their messages. The Bayesian filter can be modified to block the new spammer techniques. The efficiency of the Bayesian filter is greater than that of the other e-mail filters. Anyone who wants to filter spam out of e-mail is strongly recommended not to delete messages automatically. The same is true for your real email; instead of deleting it, move it to another folder. That way, you'll build a collection of spam and non-spam messages, which will come in handy for training filters.
Key words: Bayesian, spam, email filtering
1. Introduction
The number of users connected to the Internet is increasing daily. Electronic mail (e-mail) is quickly becoming the fastest and most economical service available to Internet users. Every e-mail user can manage his/her mail account and mailboxes as he/she needs. Unfortunately, some of the virtues that have made e-mail popular have also invited a flood of unwanted e-mail. With the proliferation of direct marketers on the Internet and the increased availability of enormous e-mail address mailing lists, the volume of junk mail has grown widely in past
standardized definition for spam; however, in general the word “spam” is used to refer to unwanted “garbage” e-mail messages. Spam constitutes a major problem for both e-mail users and Internet Service Providers (ISPs). The
As a result of this growing situation, automated and manual methods for filtering such junk from legitimate e-mail are needed. Many junk mail filtering products are available [4] which allow users to handcraft a set of logical rules to filter junk mail. The
construction of rule sets to detect junk mail. It points out the need for adaptive methods for dealing with this problem. Automated learning of rules to classify e-mail is introduced in [4]. While such approaches have shown some success for general classification tasks based on the text of messages, the average number of spam messages received continues to increase exponentially. Figure 01 shows
messages received by one e-mail user and
Figure 01: Annual Spam Evolutions
The cost of spam to the ISP can be seen at two levels: increased load on e-mail servers and wasted bandwidth. The slower Internet access is arising according to the bandwidth of the
Spam filtering can be applied at the client level
techniques for filtering spam were proposed in the literature. They are based on the header
analysis, address lists, key word lists, digital signatures and content statistical analysis
solution in the spam filter, which constitutes
software. Generally the Bayesian technique is used in conjunction with the other techniques in the spam filtering process. They can be applied as several layers. The Bayesian filter divides manually the corpus of a high number of e-mail messages into two classes: legitimate (ham) or
2. Anti Spam Technologies
Due to the huge increase of spam in the past years, researchers are paying more attention to filtering spam. Many researchers are presently working on the implementation of new filters
destination either by blocking it at the server level or the client level. In January 2003 and 2004, a conference on spam took place at MIT
Unsolicited Commercial E-mail (CAUCE) [8] was established. While CAUCE is trying to
groups and companies are trying to fight/block
2.1. Centralized filtering server
This architecture uses a single anti-spam filter that runs on a centralized organization-wide mail
2.2. Gateway filtering
All inbound e-mail is routed through a filtering gateway before being delivered to the mail server. Gateway services work well with web
2.3. List-based filtering
This method is richer than the other methods, and it also differs from them. This method operates at the server level. Today, black-listing and white-listing are ineffective, although server-based solutions adopt them as an auxiliary technique, often to be integrated
resources have become less effective since
address to get around the recipient’s defenses.
2.4. Rule based filtering
Rule based filters assign a spam score to each e-mail based on whether the e-mail contains features typical of spam messages, such as
2.5. Heuristic filtering
intelligence to deliver an automated spam
known spam patterns. The main advantage is that no human action is needed for filtering here.
2.6. Receipt-time filtering
SMTP connection, it can use a wide variety of techniques to detect and reject spam. An effective heuristic test is to see if the incoming connection has valid reverse DNS (rDNS), giving the sending host’s domain name as well as IP address. While there’s no technical requirement that all sending hosts have rDNS, many people have noted that most hosts
AOL started rejecting all mail from hosts without rDNS, which impelled the few legitimate senders without working rDNS to get theirs in
2.7. Content filtering
Once the SMTP server has decided to accept a message, the sender transfers the entire set of message headers and the message body. (For SMTP purposes, the message headers are just part of the message, and do not affect message delivery.) Many filtering schemes work on the header and body.
2.8. Hybrid filtering
While all of the filtering techniques above can be somewhat effective, a combination of many
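As an illustration of how the list-based and rule-based techniques of Sections 2.3 and 2.4 might be combined into the layered arrangement that hybrid filtering refers to, the following minimal sketch chains two checks. Everything in it is an assumption made for illustration: the message layout (a dict with "sender", "subject" and "body"), the example addresses and phrases, the score threshold of 2, and the function names are hypothetical and are not taken from any product discussed in this paper.

```python
from typing import Callable, List, Optional

# Hypothetical message representation: a dict with "sender", "subject", "body".
Message = dict
# A layer returns "spam", "ham", or None when it has no opinion.
Check = Callable[[Message], Optional[str]]

# Illustrative lists only; a real deployment maintains these separately.
WHITELIST = {"colleague@example.org"}
BLACKLIST = {"bulk@spam.example"}
SPAM_PHRASES = ("free money", "act now")


def list_check(msg: Message) -> Optional[str]:
    """List-based layer: decide only when the sender is on a list."""
    if msg["sender"] in WHITELIST:
        return "ham"
    if msg["sender"] in BLACKLIST:
        return "spam"
    return None


def rule_check(msg: Message) -> Optional[str]:
    """Rule-based layer: one point per typical spam phrase found."""
    text = (msg["subject"] + " " + msg["body"]).lower()
    score = sum(1 for phrase in SPAM_PHRASES if phrase in text)
    return "spam" if score >= 2 else None  # threshold is an assumption


def classify(msg: Message, layers: List[Check]) -> str:
    """Run the layers in order; the first layer with an opinion wins."""
    for layer in layers:
        verdict = layer(msg)
        if verdict is not None:
            return verdict
    return "ham"  # no layer objected, so the message is delivered


LAYERS = [list_check, rule_check]
```

In this arrangement a cheap check runs first and the more expensive content checks only see the mail that the earlier layers could not decide; a Bayesian content filter could be appended to `LAYERS` as a further check.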
Many spam filters can be applied as a series. Typically a mail server will use DNS based blacklists (DNSBLs) to reject some mail, then use body filters on the mail that makes it past
Some other available filtering methods are:
methods, Content-based filtering, Checksum-based filtering, Sender-supported whitelists
3. The Bayesian Method
Bayesian filtering is a statistical approach which has been used by many researchers to build spam filters. Paul Graham made a significant contribution to this domain in implementing and testing one of the first Bayesian spam filters. [7]
improvement to this filter. He produced a number of alternative approaches to combining
architecture of the Bayesian spam system has three different parts: Tokenizing, combining
3.1. Mathematical explanation
Bayesian e-mail filters take advantage of Bayes’ theorem. Bayes’ theorem, in the context of spam, says that the probability that an e-mail is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam e-mail, times the probability that any e-mail is spam, divided by the probability of finding those
Bayesian inference uses a numerical estimate of the degree of belief in a hypothesis before evidence has been observed. Bayesian inference usually relies on degrees of belief, or subjective probabilities. Bayes’ theorem adjusts probabilities given new evidence in the
proportional to the prior probability times the likelihood. In addition, the ratio Pr(E|H0)/Pr(E) is sometimes called the standardized
H0 represents a hypothesis, called a null
P(H0) is called the prior probability of H0
probability of seeing the evidence E given that the hypothesis H0 is true. It is also called the likelihood function when it is
P(E) is called the marginal probability of
P(H0|E) is called the posterior probability
The factor P(E|H0)/P(E) represents the impact that the evidence has on the belief in the hypothesis. If it is likely that the evidence will
consideration is true, then this factor will be large. Multiplying the prior probability of the hypothesis by this factor would result in a large posterior probability of the hypothesis given the evidence. Under Bayesian inference,
much new evidence should alter a belief in a
probability that is greater than 1. Since P(E) is
yield a posterior probability of 1. Therefore the posterior probability could yield a probability greater than 1 only if P(E) were less than
Alternative Bayes’
Bayes’ theorem is often extended by noting
(often called “not H0”). The theorem can be
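For reference, the form of Bayes’ theorem that the terms defined in Section 3.1 describe can be written out as below. This is the standard textbook statement, reconstructed from those definitions rather than copied from the paper's own equations: the first line is the theorem itself, the second is the usual expansion of P(E) over H0 and its complement “not H0”, and the third is the odds form with likelihood ratio Λ.

```latex
\begin{align}
P(H_0 \mid E) &= \frac{P(E \mid H_0)\, P(H_0)}{P(E)} \\
P(E) &= P(E \mid H_0)\, P(H_0) + P(E \mid \neg H_0)\, P(\neg H_0) \\
O(H_0 \mid E) &= \Lambda \cdot O(H_0), \qquad
\Lambda = \frac{P(E \mid H_0)}{P(E \mid \neg H_0)}, \quad
O(H_0) = \frac{P(H_0)}{P(\neg H_0)}
\end{align}
```

As a worked example with made-up numbers: let H0 be “the e-mail is spam” and E be “the e-mail contains a given word”, with P(H0) = 0.5, P(E|H0) = 0.4 and P(E|not H0) = 0.01. Then P(E) = 0.4 × 0.5 + 0.01 × 0.5 = 0.205 and P(H0|E) = 0.2 / 0.205 ≈ 0.976. The odds form gives the same answer: Λ = 40 and O(H0) = 1, so the posterior odds are 40, that is, a probability of 40/41 ≈ 0.976.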
Bayes’ theorem in terms of odds and likelihood ratio can be explained as follows:
Bayes’ theorem can also be written neatly in terms of a likelihood ratio Λ and odds O as
3.3. Process
Using Bayesian analysis to classify spam and non-spam was suggested by Paul Graham. A Bayesian filter takes each word in a message and looks it up in a database to see how many times that word has appeared in prior spam and
then lets it combine those counts into an overall probability estimate to check whether
Particular words have particular probabilities of occurring in spam e-mail and in legitimate e-mail. For instance, most e-mail users will frequently encounter the word Viagra in spam e-mail, but will seldom see it in other e-mail. The filter does not know these probabilities in advance, and must first be trained so that it can build them up. To train the filter, the user must manually indicate whether a new e-mail is spam or not. For all words in each training e-mail, the filter will adjust the probabilities that each word will appear in spam or legitimate e-mail in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words “Viagra” and “refinance”, but a very low spam probability for words seen only in legitimate e-mail, such as the names of friends and family members.
After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an e-mail with a particular set of words in it belongs to either category. Each word in the e-mail contributes to the e-mail’s spam probability. This contribution is called the posterior probability and is computed using Bayes’ theorem. Then, the e-mail’s spam probability is computed over all words in the e-mail, and if the total exceeds a certain threshold (say 95%), the filter will mark the e-mail as spam. E-mail marked as spam can then be automatically moved to a “Junk” e-mail folder, or even deleted outright.
4. Spam filtering mechanism of Bayesian technology
There are several algorithms that use various modifications of the Bayesian technique.
This section mainly explains Bayes’ theorem as used in junk mail filtering. According to Sections 3.1 and 3.2 we can summarize Bayes’ theorem in junk mail detection as
appearing in a spam/garbage message, and
Use those odds as input to Bayes’ formula to determine if the message is garbage or
The first thing to do is to teach the Bayesian filter the difference between garbage and non-garbage messages. We can identify spam or garbage e-mails from the content of the e-mails. Most spam e-mails contain certain key words. Instinctively, people know that a message containing these words or phrases is garbage/spam because of their
The Bayesian filter does not have the benefit of our years of experience, so we have to teach it what spam/garbage messages look like, and how they differ from non-garbage messages. The filter needs to introduce the garbage mail
message to the filter, it finds every word in the message and stores it (along with how many
Separate databases are kept for garbage and non-garbage mail messages. The filter uses a looser definition of a word than humans do – a
word (more properly called a token) can also be an IP address, a host name, an HTML tag, or a price (such as “Rs100”). However, a token cannot be random strings, words less than
The filter scans through the message, creating a list of every word it knows about (in other words, every word in the message that’s also in the token databases). In this example, the words it knows about are “prescription”, “when”, “today”, “visit”, and “your”. Once the filter has the list of words it knows about, for each word it calculates the probability that
This probability value assigned to each word is commonly referred to as spamicity, and ranges from 0.0 to 1.0. A spamicity value greater than 0.5 means that a message containing a particular word is likely to be spam/garbage, while a spamicity value less than 0.5 indicates that a message containing that word is likely to
value of 0.5 is neutral, meaning that it has no effect on the decision as to whether a message
spam/garbage mail senders (spammers) are
spam/garbage words in their spam/garbage
The first (and obvious) goal was to see if the spammer could sneak the message past the Bayesian filter by including obviously
The secondary goal was to try to get the filter to start recognizing the words “congresswoman” and “soybean” as
can get the filter to assign a high spamicity
This circumvention method also has limited
senders (spammers) start including a large pile of legitimate text (such as an article from the
growth supplements that also contains an article about the relative value of the Euro is going to confuse the audience so much they
Figure 02: Creating a word database for the
Before mail can be filtered using this method, the user needs to generate a database with words and tokens collected from a sample of spam/garbage mail and valid mail (referred to
A probability value is then assigned to each word or token; the probability is based on calculations that take into account how often that word occurs in spam/garbage as opposed to legitimate mail (ham). This is done by analyzing the users’ outbound mail and by
When we create the ham database the Bayesian
method does not require an initial learning period, it has 2 major flaws:
The ham data file is publicly available and
and therefore bypassed. If the ham data file
Such a ham data file is a general one, and
Besides ham mail, the Bayesian filter also relies on a spam/garbage data file. This spam/garbage data file must include a large sample of known spam/garbage and must be
software. This will ensure that the Bayesian filter is aware of the latest spam/garbage tricks, resulting in a high spam/garbage detection rate (note: this is achieved once the required initial
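The database-building and scoring steps described in Section 4 (separate spam and ham token counts, a per-token spamicity between 0.0 and 1.0, and a combined message probability compared against a threshold) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any filter named in this paper: the tokenizer pattern, the clamping of spamicity to the range 0.01 to 0.99, the choice of the 15 most significant tokens, equal priors for spam and ham, and the naive product formula used for combining are all assumptions, and the class and method names are hypothetical.

```python
import re
from collections import Counter
from math import prod


def tokenize(text):
    """Return the set of lowercase tokens in a message (assumed token pattern)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class BayesFilter:
    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spam messages containing it
        self.ham_counts = Counter()   # token -> number of ham messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        """Add one manually labelled message to the spam or ham database."""
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(tokenize(text))
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def spamicity(self, token):
        """Return the token's spamicity, or None if the token is unknown."""
        s = self.spam_counts[token]
        h = self.ham_counts[token]
        if s + h == 0:
            return None
        p_spam = s / max(self.n_spam, 1)  # share of spam messages with the token
        p_ham = h / max(self.n_ham, 1)    # share of ham messages with the token
        p = p_spam / (p_spam + p_ham)     # Bayes' theorem with equal priors
        return min(0.99, max(0.01, p))    # clamp so no single token is decisive

    def spam_probability(self, text, top_n=15):
        """Combine the most significant token spamicities into one probability."""
        scores = [p for p in map(self.spamicity, tokenize(text)) if p is not None]
        # Keep the tokens furthest from the neutral value 0.5.
        scores.sort(key=lambda p: abs(p - 0.5), reverse=True)
        scores = scores[:top_n]
        if not scores:
            return 0.5  # nothing known about this message
        spam = prod(scores)
        ham = prod(1 - p for p in scores)
        return spam / (spam + ham)
```

A message would then be treated as spam/garbage when `spam_probability` exceeds a chosen threshold such as 0.9. Training on more labelled mail changes the counts, which is how such a filter adapts to each user's mail.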
When actual filtering is working, once the ham
calculated and the filter is ready for use. When a new mail arrives, it is broken down into words and the most relevant words (i.e., those that are most significant in identifying whether the mail is spam/garbage or not) are singled out. From these words, the Bayesian filter calculates the probability of the new message being spam/garbage or not. If the probability is greater than a threshold, say 0.9, then the message is classified as spam/garbage. This Bayesian approach to spam/garbage is highly effective: a May 2003 BBC article reported that spam/garbage detection rates of over
5. Applications in E-mail filtering with Bayesian technology
The Bayesian technique is widely used in many technological areas. Image processing [12], microscopic image analysis, medical research [13] and detecting speech recognition errors [14] are the areas where this technique is most used. This review paper concerns the Bayesian technique in the junk mail
5.1. SpamAssassin
SpamAssassin is a rules-based filter written in Perl. It was used for a while but spammers rapidly figured out how to get around each new rule. So it was becoming less and less effective [15]. In version 2.5 the developers
problem. Besides, since it is still in Perl, it is
5.2. Bogofilter
Bogofilter was one of the first Bayesian filters. Originally by über-hacker Eric S. Raymond, it’s written in good old-fashioned C and runs fast [16]. If it has a weakness, it is being a little too conservative about rating things as spam.
5.3. Quick Spam Filter (QSF)
QSF is a more recent Bayesian filter. It is also
bogofilter. The scores it generates seem to skew somewhat higher than bogofilter’s, to the point where it gives a lot of false positives
5.4. Bayesian Mail Filter (BMF)
BMF is another option. It is very small - only 4600 lines of code, 110 KB. It is quite fast. In addition to SourceForge any person can find it
5.5. iFile
ifile collects statistics on the occurrences of
filed/refiled, and uses that to determine a “best guess” of where new mail should best be filed. Some researchers have done quite a lot of work to tune it to provide decent performance.
The idea is to collect a dictionary of statistics on the number of occurrences of words in
messages are compared to the dictionary, and are filed to the folder to which they have the
misclassified by the filter), the dictionary is revised [23]. Words that are not commonly used are eliminated, so that the dictionary does
ifile uses naive Bayesian filtering as a statistical approach to direct messages to MH folders to which they have the highest degree
The “naive” assumption that is made is that correlations need only be done on the basis of individual word occurrences, that is, we count the number of times that the word “stop” is used, and do not consider combinations of words (e.g. “stop it” or “stop and go”).
The weakness of this scheme is that while it is
classifying dissimilar documents into different groups, it has nothing that makes it good at
5.6. dbacl
dbacl, a digramic Bayesian filter, is not restricted to just spam and non-spam. This mail filter will classify a message into one of many categories
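The foldering idea described for ifile and dbacl, in which per-folder word statistics are kept and a new message is filed to the best-matching category under the “naive” assumption of independent word occurrences, can be sketched as a small multi-category naive Bayes classifier. This is a minimal sketch and not the code of either tool: the tokenizer, the add-one smoothing, the folder names used in the usage note and the class and method names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def words(text):
    """Split text into lowercase words (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


class FolderClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> occurrences
        self.msg_counts = Counter()              # folder -> number of messages
        self.vocab = set()                       # every word seen in training

    def train(self, folder, text):
        """Record the words of a message that the user filed in `folder`."""
        ws = words(text)
        self.word_counts[folder].update(ws)
        self.msg_counts[folder] += 1
        self.vocab.update(ws)

    def log_score(self, folder, text):
        """Log of prior times the product of per-word likelihoods for a trained folder."""
        counts = self.word_counts[folder]
        total_words = sum(counts.values())
        total_msgs = sum(self.msg_counts.values())
        score = math.log(self.msg_counts[folder] / total_msgs)
        for w in words(text):
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            score += math.log((counts[w] + 1) / (total_words + len(self.vocab)))
        return score

    def best_folder(self, text):
        """Return the trained folder with the highest score.

        Assumes at least one non-empty message has been trained.
        """
        return max(self.msg_counts, key=lambda f: self.log_score(f, text))
```

For example, after training on a few messages filed under hypothetical folders such as "work", "family" and "spam", `best_folder("budget meeting")` returns whichever folder's word statistics best match those two words. Only single-word counts are used, so phrases such as “stop it” are never considered as a unit.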
Other junk mail filters are also available, such as SPASTIC and SpamProbe. These are all based on
6. The need of Bayesian technique
There are some experimental results for Bayesian filters and non-Bayesian filters. Each program was installed according to its documentation. For the filters that required training, the training set data was supplied. Each filter was taken in turn and executed once for each e-mail in the spam and legitimate sets and the
The standard metrics for text classification are recall and precision. Spam classified as non-spam is known as a false negative. Non-spam classified as spam is known as a false positive. Precision is the percentage of messages that were classified as spam that actually are spam. High precision is essential to prevent the messages we want to read being classified as spam. A low precision indicates that there are many false positives. Recall is the percentage of actual spam messages that were classified as spam messages. High recall is necessary in order to prevent our inbox filling with spam. A low recall indicates that there are many false negatives. According to the experiment by using the several types of Bayesian mail filters
features against the RIPPER rule-learning algorithm in different e-mail classification tasks. In learning a user’s foldering preferences and learning to detect spam, the Bayesian filter
6.1. Advantages of Bayesian filter
The Bayesian method takes the whole message into account. It recognizes keywords that identify spam/garbage, but it also recognizes
filtering is a much more intelligent approach because it examines all aspects of a message, as opposed to keyword checking that classifies a mail as spam/garbage on the basis of a single
A Bayesian filter is constantly self-adapting - By learning from new spam/garbage and new valid outbound mails, the Bayesian filter
techniques. For example, when spam/garbage mail senders (spammers) started using “f-r-e-e” instead of “free” they succeeded in evading keyword checking until “f-r-e-e” was also included in the keyword database. On the other hand, the Bayesian filter automatically notices such tactics; in fact if the word “f-r-e-e” is found, it is an even better spam/garbage indicator, since it is unlikely to occur in a ham
The Bayesian technique is sensitive to the user
company/organization and understands that, for example, the word ‘mortgage’ might
company/organization running the filter is, say, a car dealership, whereas it would not indicate it as spam/garbage if the company/organization
The Bayesian method is multi-lingual and international - A Bayesian anti-spam/garbage filter, being adaptive, can be used for any language required. Most keyword lists are available in English only and are therefore quite useless in non English-speaking regions.
A Bayesian filter is difficult to fool, as opposed to a keyword filter - An advanced spammer who wants to trick a Bayesian filter
indicate spam/garbage (such as free, Viagra, etc), or more words that generally indicate valid mail (such as a valid contact name, etc).
7. Conclusion
The Bayesian filters, after training, offer better recall than the heuristic filters. Catching a higher proportion of spam is clearly good, since that is the reason people use them. With insufficient training, however, the Bayesian filters perform poorly in comparison with SpamAssassin in terms of recall. But some Bayesian filters (Quick Spam) work very poorly compared with other Bayesian filters.
The Bayesian filter outperforms by far the keyword-based filter, even with very small
As future work, we plan to implement alternative anti-spam filters based on other machine learning algorithms. The filter can also include the foldering mechanism to check the spam if the user needs it. Server-side mail filters are most effective, because many users can be protected by the filters. More than one filter can be activated on a server as a set of layers, which is more effective than one filter because if one filter fails another one can be
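The precision and recall measures that Section 6 defines, and on which the comparison in the conclusion rests, can be computed from a labelled test run as in the following minimal sketch. The function name and the use of booleans (True meaning spam) are assumptions made for illustration; the returned values are fractions between 0 and 1 rather than percentages.

```python
def precision_recall(actual, predicted):
    """Compute (precision, recall) for the spam class.

    `actual` and `predicted` are equal-length sequences of booleans,
    where True means "spam".
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)      # spam caught
    fp = sum(1 for a, p in pairs if not a and p)  # ham wrongly marked as spam
    fn = sum(1 for a, p in pairs if a and not p)  # spam that slipped through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

With these definitions, every false positive lowers precision and every false negative lowers recall, which is why a filter that rarely marks anything as spam can show high precision and poor recall at the same time.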
8. References
[1]. L. Pelletier, J. Almhana, V. Choulakian, Adaptive Filtering of SPAM, Proceedings of
[2]. Mingjun Lan, Wanlei Zhou, Spam Filtering based on Preference Ranking, Proceedings of the 2005 The Fifth International Conference on
[3]. M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, A Bayesian approach to filtering junk e-mail, AAAI’98 Workshop on Learning for Text Categorization, Madison, WI, July 27, 1998.
[4]. W. W. Cohen, Learning rules that classify e-mail, In Proceedings of the 1996 AAAI Spring
[6]. Paul Graham, Better Bayesian filtering,
[8]. Paul Graham, A Plan for Spam, August 2002,
[9]. The Coalition Against Unsolicited Commercial
[10]. X. Carreras and L. Màrquez, Boosting trees for anti-spam e-mail filtering, In Proceedings of RANLP-2001, 4th International Conference on
[11]. L. Pelletier, J. Almhana, and V. Choulakian, Adaptive Filtering of SPAM, Proceedings of the
[12]. CNN 2005, http://www.cnn.com
[13]. Nobuo Kumagai, Masayoshi Aritsugi. On Applying an Image Processing Technique to Detecting Spam/garbages. Department of Computer Science, Faculty of Engineering, Gunma University, 1–5–1 Tenjin-cho, Kiryu 376-8515, Japan. 2005
[14]. S B Tan. Introduction to Bayesian Methods for Medical Research. Division of Clinical Trials and Epidemiological Sciences, National Cancer Centre. 2001
[15]. Lina Zhou1, Jinjuan Feng1, Andrew Sears1, Yongmei Shi2. Applying the Naïve Bayes Classifier to Assist Users in Detecting Speech Recognition Errors. 1Information Systems
2Computer Science and Electrical Engineering
[16]. SpamAssassin, http://spamassassin.apache.org/
[17]. Bogofilter, http://bogofilter.sourceforge.net/
[20]. dbacl, http://www.lbreyer.com/gpl.html
http://spastic.sourceforge.net/index.html
[22]. SpamProbe, http://spamprobe.sourceforge.net/
[24]. Jason D. M. Rennie, ifile: An Application of
Artificial Intelligence Lab, Massachusetts
experimental comparison of naive bayesian and
personal e-mail messages. In Proceedings of
[27]. Jefferson Provost, Naïve-Bayes vs. Rule-Learning in classification of e-mail, Department of Computer Sciences, The University of Texas