
Sri Lanka Association for Artificial Intelligence (SLAAI) Proceedings of the third Annual Sessions

Recent Developments in Bayesian Approach in Filtering Junk E-mail

Department of Electrical and Computer Engineering, [...]

Abstract

Junk mail is one of the main problems on the Internet. There are several methods for the automated construction of filters to eliminate such unwanted messages from a user's mail system. This paper is mainly concerned with the Bayesian filtering method and its different types of applications in junk e-mail filtering. The Bayesian technique is trained automatically to [...] implementations that use Bayesian techniques are available as software. Any user can apply this software in different layers on the client side or the server side. But spammers are now trying to defeat Bayesian filters by including random dictionary words and/or short stories in their messages. The Bayesian filter can be adjusted to block the new spammer techniques. The efficiency of the Bayesian filter is greater than that of the other e-mail filters. Anyone who wants to filter spam out of e-mail is strongly recommended not to delete messages automatically. The same is true for real e-mail; instead of deleting it, move it to another folder. That way, you build a collection of spam and non-spam messages, which will come in handy for training filters.

Key words: Bayesian, spam, email filtering

1. Introduction

The number of users connected to the Internet is increasing daily. Electronic mail (e-mail) is quickly becoming the fastest and most economical service available to Internet users. Every e-mail user can manage his/her mail account and mailboxes as he/she needs. Unfortunately, some of the virtues that have made e-mail popular have also enticed a flood of unwanted e-mail. With the proliferation of direct marketers on the Internet and the increased availability of enormous e-mail address mailing lists, the volume of junk mail has grown widely in the past [few years]. [There is no] standardized definition for spam; however, in general the word “spam” is used to refer to unwanted “garbage” e-mail messages. Spam constitutes a major problem for both e-mail users and Internet Service Providers (ISPs). The [...]

As a result of this growing situation, automated and manual methods for filtering such junk from legitimate e-mail are needed. Many junk mail filtering products are available [4] which allow users to handcraft a set of logical rules to filter junk mail. The [...] construction of rule sets to detect junk mail. It points out the need for adaptive methods for dealing with this problem. Automated methods for learning rules to classify e-mail are introduced in [4]. While such approaches have shown some success for general classification tasks based on the text of messages, the average number of spam messages received continues to increase exponentially. Figure 01 shows [...] real messages received by one e-mail user and [...]

Figure 01: Annual Spam Evolutions

The cost of spam to the ISP can be seen at two levels: increased load on e-mail servers and wasted bandwidth. Slower Internet access arises according to the bandwidth of the [...]

Spam filtering can be applied at the client level [...] techniques for filtering spam were proposed in the literature. They are based on header analysis, address lists, keyword lists, digital signatures and statistical content analysis [...] solution in the spam filter, which constitutes [...] software. Generally the Bayesian technique is used in conjunction with the other techniques in the spam filtering process. They can be applied as several layers. The Bayesian filter relies on manually dividing a corpus of a large number of e-mail messages into two classes: legitimate (ham) or [spam].

2. Anti Spam Technologies

Due to the huge increase of spam in the past years, researchers pay more attention to filtering spam. Many researchers are presently working on the implementation of new filters [...] destination, either by blocking it at the server level or at the client level. In January 2003 and 2004, a conference on spam took place at MIT. [The Coalition Against] Unsolicited Commercial E-mail (CAUCE) [8] was established. While CAUCE is trying to [...], groups and companies are trying to fight/block [...]

2.1. Centralized filtering server

This architecture uses a single anti-spam filter that runs on a centralized organization-wide mail [server].

2.2. Gateway filtering

All inbound e-mail is routed through a filtering gateway before being delivered to the mail server. Gateway services work well with web [...]

2.3. List-based filtering

This method is richer than the other methods, and it also differs from them. This [...] address to get around the recipient’s defenses.

2.4. Rule based filtering

Rule based filters assign a spam score to each e-mail based on whether the e-mail contains features typical of spam messages, such as [...]

2.5. Heuristic filtering

[...] intelligence to deliver an automated spam [...] known spam patterns. The main advantage is that no human action is needed for filtering here.

2.6. Receipt-time filtering

[...] SMTP connection, it can use a wide variety of techniques to detect and reject spam. An effective heuristic test is to see if the incoming connection has valid reverse DNS (rDNS), giving the sending host’s domain name as well as IP address. While there’s no technical requirement that all sending hosts have rDNS, many people have noted that most hosts [...] AOL started rejecting all mail from hosts without rDNS, which impelled the few legitimate senders without working rDNS to get theirs in [order].

2.7. Content filtering

Once the SMTP server has decided to accept a message, the sender transfers the entire set of message headers and the message body. (For SMTP purposes, the message headers are just part of the message, and do not affect message delivery.) Many filtering schemes work on the header and body.

2.8. Hybrid filtering
While all of the filtering techniques above can be somewhat effective, a combination of many [...]

Many spam filters can be applied as a series. Typically a mail server will use DNS based blacklists (DNSBLs) to reject some mail, then use body filters on the mail that makes it past [...]

Some other available filtering methods are: [...] methods, content-based filtering, checksum-based filtering, sender-supported whitelists [...]

[...] method is operating at the server level. Today, black-listing and white-listing are ineffective, although server-based solutions adopt them as an auxiliary technique, often to be integrated [...] resources have become less effective since [...]

3. The Bayesian Method

Bayesian filtering is a statistical approach which has been used by many researchers to build spam filters. Paul Graham made a significant contribution to this domain by implementing and testing one of the first Bayesian spam [filters] [7]. [...] improvement to this filter. He produced a number of alternative approaches to combining [...] architecture of the Bayesian spam system has three different parts: tokenizing, combining [...]

3.1. Mathematical explanation

Bayesian e-mail filters take advantage of Bayes’ theorem. Bayes’ theorem, in the context of spam, says that the probability that an e-mail is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam e-mail, times the probability that any e-mail is spam, divided by the probability of finding those [words in any e-mail].

Bayesian inference uses a numerical estimate of the degree of belief in a hypothesis before evidence has been observed. Bayesian inference usually relies on degrees of belief, or subjective probabilities. Bayes’ theorem adjusts probabilities given new evidence in the [following way]:

P(H0|E) = P(E|H0) P(H0) / P(E)

H0 represents a hypothesis, called a null [hypothesis] [...]
P(H0) is called the prior probability of H0.
[P(E|H0) is called the conditional] probability of seeing the evidence E given that the hypothesis H0 is true. It is also called the likelihood function when it is [...]
P(E) is called the marginal probability of [E] [...]
P(H0|E) is called the posterior probability [of H0 given E].

The factor P(E|H0)/P(E) represents the impact that the evidence has on the belief in the hypothesis. If it is likely that the evidence will [be observed when the hypothesis under] consideration is true, then this factor will be large. Multiplying the prior probability of the hypothesis by this factor would result in a large posterior probability of the hypothesis given the evidence. Under Bayesian inference, [Bayes’ theorem therefore measures how] much new evidence should alter a belief in a [hypothesis].

[Multiplying the prior probability by this factor will never yield a] probability that is greater than 1. Since P(E) is [at least as great as P(E ∩ H0), which equals P(E|H0) P(H0), replacing P(E) with P(E ∩ H0) in the factor would] yield a posterior probability of 1. Therefore the posterior probability could yield a probability greater than 1 only if P(E) were less than [P(E ∩ H0), which is never true].

Alternative [forms of Bayes’ theorem]

Bayes’ theorem is often expanded by noting [...] (often called “not H0”). The theorem can be [...] proportional to the prior probability times the likelihood. In addition, the ratio Pr(E|H0)/Pr(E) is sometimes called the standardized [likelihood].

Bayes’ theorem in terms of odds and likelihood ratio can be explained as follows. Bayes’ theorem can also be written neatly in terms of a likelihood ratio Λ and odds O as [O(H0|E) = O(H0) · Λ, where O(H0) = P(H0)/P(not H0) and Λ = P(E|H0)/P(E|not H0)].

There are several algorithms that use various modifications of the Bayesian technique.

3.3. Process

Using Bayesian analysis to classify spam and non-spam was suggested by Paul Graham. A Bayesian filter takes each word in a message and looks it up in a database to see how many times that word has appeared in prior spam and [non-spam messages, and] then lets it combine those counts into an overall probability estimate to check whether [the message is spam].

Particular words have particular probabilities of occurring in spam e-mail and in legitimate e-mail. For instance, most e-mail users will frequently encounter the word Viagra in spam e-mail, but will seldom see it in other e-mail. The filter does not know these probabilities in advance, and must first be trained so that it can build them up. To train the filter, the user must manually indicate whether a new e-mail is spam or not. For all words in each training e-mail, the filter will adjust the probabilities that each word will appear in spam or legitimate e-mail in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words “Viagra” and “refinance”, but a very low spam probability for words seen only in legitimate e-mail, such as the names of friends and family members.

After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an e-mail with a particular set of words in it belongs to either category. Each word in the e-mail contributes to the e-mail’s spam probability. This contribution is called the posterior probability and is computed using Bayes’ theorem. Then the e-mail’s spam probability is computed over all words in the e-mail, and if the total exceeds a certain threshold (say 95%), the filter will mark the e-mail as spam. E-mail marked as spam can then be automatically moved to a “Junk” e-mail folder, or even deleted outright.

4. Spam filtering mechanism of Bayesian technology

This section mainly explains Bayes’ theorem as used in junk mail filtering. According to sections 3.1 and 3.2, we can summarize Bayes’ theorem in junk mail detection as:

[...] appearing in a spam/garbage message, and
use those odds as input to Bayes’ formula to determine if the message is garbage or [not].

The first thing needed is to teach the Bayesian filter the difference between garbage and non-garbage messages. We can identify spam or garbage e-mails from their content. Most spam e-mails contain certain key words. Instinctively, people know that a message containing these words or phrases is garbage/spam because of their [experience].

The Bayesian filter does not have the benefit of our years of experience, so we have to teach it what spam/garbage messages look like, and how they differ from non-garbage messages. The filter needs to be introduced to the garbage mail [...] message to the filter, it finds every word in the message and stores it (along with how many [times it appears]). Separate databases are kept for garbage and non-garbage mail messages.

The filter uses a looser definition of a word than humans do: a word (more properly called a token) can also be an IP address, a host name, an HTML tag, or a price (such as “Rs100”). However, a token cannot be random strings, words less than [...]

Figure 02: Creating a word database for the [filter]

[...] article about the relative value of the Euro is going to confuse the audience so much they [...]

The filter scans through the message, creating a list of every word it knows about (in other words, every word in the message that’s also in the token databases). In this example, the words it knows about are “prescription”, “when”, “today”, “visit”, and “your”. Once the filter has the list of words it knows about, for each word it calculates the probability that [...]
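The training and lookup steps described in Sections 3.3 and 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any filter named in this paper: the class `BayesianFilter`, its method names, the tokenizer regular expression, the add-one smoothing and the equal prior for spam and ham are all hypothetical choices made for the example. Only the overall flow (separate spam and ham token databases, a per-word probability, a combined probability compared with a threshold such as 95%) comes from the text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase tokens (words and simple prices such as rs100)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class BayesianFilter:
    """Hypothetical sketch of a two-database Bayesian spam filter."""

    def __init__(self):
        # Separate token databases for spam and legitimate (ham) mail.
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record every distinct token of a message the user marked 'spam' or 'ham'."""
        self.token_counts[label].update(set(tokenize(text)))
        self.message_counts[label] += 1

    def known_tokens(self, text):
        """Tokens of the message that already appear in either database."""
        return sorted(
            t for t in set(tokenize(text))
            if t in self.token_counts["spam"] or t in self.token_counts["ham"]
        )

    def word_spam_probability(self, token):
        """P(spam | token) by Bayes' theorem, assuming equal priors for spam and ham.

        Add-one smoothing (an assumption of this sketch) keeps the result
        strictly between 0 and 1 for tokens seen in only one class.
        """
        p_token_spam = (self.token_counts["spam"][token] + 1) / (self.message_counts["spam"] + 2)
        p_token_ham = (self.token_counts["ham"][token] + 1) / (self.message_counts["ham"] + 2)
        return p_token_spam / (p_token_spam + p_token_ham)

    def spam_probability(self, text):
        """Combine the per-word probabilities, treating words as independent."""
        log_odds = 0.0
        for token in self.known_tokens(text):
            p = self.word_spam_probability(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        return 1.0 / (1.0 + math.exp(-log_odds))

    def is_spam(self, text, threshold=0.95):
        """Mark the message as spam when the combined probability exceeds the threshold."""
        return self.spam_probability(text) > threshold


if __name__ == "__main__":
    spam_filter = BayesianFilter()
    spam_filter.train("Buy Viagra today", "spam")
    spam_filter.train("Refinance your mortgage today", "spam")
    spam_filter.train("Lunch with Alice tomorrow", "ham")
    spam_filter.train("Alice sent the meeting notes", "ham")
    for message in ("Viagra refinance offer", "Notes from Alice"):
        print(message, spam_filter.known_tokens(message), spam_filter.spam_probability(message))
```

Summing log-odds is equivalent to multiplying the per-word odds, and it avoids numerical underflow on long messages. A message with no known tokens receives a neutral probability of 0.5 in this sketch.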
This probability value assigned to each word is commonly referred to as spamicity, and ranges from 0.0 to 1.0. A spamicity value greater than 0.5 means that a message containing a particular word is likely to be spam/garbage, while a spamicity value less than 0.5 indicates that a message containing that word is likely to [be ham]. [A spamicity] value of 0.5 is neutral, meaning that it has no effect on the decision as to whether a message [is spam or not].

[...] spam/garbage mail senders (spammers) are [...] spam/garbage words in their spam/garbage [...]

The first (and obvious) goal was to see if the spammer could sneak the message past the Bayesian filter by including obviously [...]

The secondary goal was to try to get the filter to start recognizing the words “congresswoman” and “soybean” as [...] can get the filter to assign a high spamicity [...]

This circumvention method also has limited [...] senders (spammers) start including a large pile of legitimate text (such as an article from the [...] growth supplements that also contains an [...]

Before mail can be filtered using this method, the user needs to generate a database with words and tokens collected from a sample of spam/garbage mail and valid mail (referred to [as ham]).

A probability value is then assigned to each word or token; the probability is based on calculations that take into account how often that word occurs in spam/garbage as opposed to legitimate mail (ham). This is done by analyzing the users’ outbound mail and by [...]

When we create the ham database [...] the Bayesian method does not require an initial learning period, it has 2 major flaws:

The ham data file is publicly available and [...] and therefore bypassed. If the ham data file [...]
Such a ham data file is a general one, and [...]

Besides ham mail, the Bayesian filter also relies on a spam/garbage data file. This spam/garbage data file must include a large sample of known spam/garbage and must be [...] software. This will ensure that the Bayesian filter is aware of the latest spam/garbage tricks, resulting in a high spam/garbage detection rate (note: this is achieved once the required initial [...]

When actual filtering is working, once the ham [...] calculated and the filter is ready for use. When a new mail arrives, it is broken down into words and the most relevant words (i.e., those that are most significant in identifying whether the mail is spam/garbage or not) are singled out. From these words, the Bayesian filter calculates the probability of the new message being spam/garbage or not. If the probability is greater than a threshold, say 0.9, then the message is classified as spam/garbage. This Bayesian approach to spam/garbage is highly effective: a May 2003 BBC article reported that spam/garbage detection rates of over [...]

5. Applications in E-mail filtering with Bayesian technology

The Bayesian technique is widely used in many technological areas. Image processing [12], microscopic image analysis, medical research [13] and detecting speech recognition errors [14] are the areas where this technique is mostly used. This review paper is concerned with the Bayesian technique in junk mail [filtering].

5.1. SpamAssassin

SpamAssassin is a rules-based filter written in Perl. It was used for a while, but spammers rapidly figured out how to get around each new rule, so it was becoming less and less effective [15]. In version 2.5 the developers [...] problem. Besides, since it is still in Perl, it is [...]

5.2. Bogofilter

Bogofilter was one of the first Bayesian filters. Originally by über-hacker Eric S. Raymond, it’s written in good old-fashioned C and runs fast [16]. If it has a weakness, it is being a little too conservative about rating things as spam.

5.3. Quick Spam Filter (QSF)

QSF is a more recent Bayesian filter. It is also [...] bogofilter. The scores it generates seem to skew somewhat higher than bogofilter’s, to the point where it gives a lot of false positives [...]

5.4. Bayesian Mail Filter (BMF)

BMF is another option. It is very small: only 4600 lines of code, 110 KB. It is quite fast. In addition to SourceForge any person can find it [...]

5.5. iFile

ifile collects statistics on the occurrences of [...] filed/refiled, and uses that to determine a “best guess” of where new mail should best be filed. Some researchers have done quite a lot of work to tune it to provide decent performance. The idea is to collect a dictionary of statistics on the number of occurrences of words in [...] messages are compared to the dictionary, and are filed to the folder to which they have the [...] misclassified by the filter), the dictionary is revised [23]. Words that are not commonly used are eliminated, so that the dictionary does [...]

ifile uses naive Bayesian filtering as a statistical approach to direct messages to MH folders to which they have the highest degree [...]

The “naive” assumption that is made is that correlations need only be done on the basis of individual word occurrences; that is, we count the number of times that the word “stop” is used, and do not consider combinations of words (e.g. “stop it” or “stop and go”).

The weakness of this scheme is that while it is [...] classifying dissimilar documents into different groups, it has nothing that makes it good at [...]

5.6. dbacl

dbacl, a digramic Bayesian filter, is not restricted to just spam and non-spam. This mail filter will classify a message into one of many categories [...]

Other junk mail filters are also available, like SPASTIC and SpamProbe. These all are based on [...]

[...] mail senders (spammers) started using “f-r-e-e” instead of “free” they succeeded in evading keyword checking until “f-r-e-e” was also included in the keyword database. On the other hand, the Bayesian filter automatically notices such tactics; in fact if the word “f-r-e-e” is found, it is an even better spam/garbage indicator, since it is unlikely to occur in a ham [mail].

The Bayesian technique is sensitive to the user [...] company/organization and understands that, for example, the word ‘mortgage’ might [...] company/organization running the filter is, say, a car dealership, whereas it would not indicate it as spam/garbage if the company/organization [...]

The Bayesian method is multi-lingual and international - A Bayesian anti-spam/garbage filter, being adaptive, can be used for any language required. Most keyword lists are available in English only and are therefore quite useless in non English-speaking regions.

A Bayesian filter is difficult to fool, as opposed to a keyword filter - An advanced [...]

6. The need of Bayesian technique

There are some experimental results for Bayesian filters and non-Bayesian filters. Each program was installed according to its documentation. For the filters that required training, the training set data was supplied. Each filter was taken in turn and executed once for each e-mail in the spam and legitimate sets and the [...]

The standard metrics for text classification are recall and precision. Spam classified as non-spam is known as a false negative. Non-spam classified as spam is known as a false positive. Precision is the percentage of messages that were classified as spam that actually are spam. High precision is essential to prevent the messages we want to read being classified as spam. A low precision indicates that there are many false positives. Recall is the percentage of actual spam messages that were classified as spam messages. High recall is necessary in order to prevent our inbox filling with spam. A low recall indicates that there are many false negatives.
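The two metrics can be computed directly from the counts of true positives, false positives and false negatives. The following Python sketch is illustrative only: the function name `precision_recall`, the boolean encoding (`True` means spam) and the convention of returning 0.0 when a denominator is zero are assumptions of this example, not part of the experiments reported here.

```python
def precision_recall(actual_is_spam, predicted_is_spam):
    """Return (precision, recall) for a spam filter.

    actual_is_spam and predicted_is_spam are equal-length sequences of booleans,
    one entry per message; True means spam.
    """
    if len(actual_is_spam) != len(predicted_is_spam):
        raise ValueError("both sequences must have the same length")

    true_positives = false_positives = false_negatives = 0
    for actual, predicted in zip(actual_is_spam, predicted_is_spam):
        if predicted and actual:
            true_positives += 1       # spam correctly classified as spam
        elif predicted and not actual:
            false_positives += 1      # legitimate mail classified as spam
        elif actual and not predicted:
            false_negatives += 1      # spam classified as legitimate mail

    flagged = true_positives + false_positives
    actual_spam = true_positives + false_negatives
    # Assumption of this sketch: report 0.0 when a ratio is undefined.
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual_spam if actual_spam else 0.0
    return precision, recall


if __name__ == "__main__":
    actual = [True, True, True, False, False]
    predicted = [True, True, False, True, False]
    print(precision_recall(actual, predicted))
```

In the usage example, one legitimate message is flagged (a false positive, which lowers precision) and one spam message is missed (a false negative, which lowers recall).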
According to the experiment by [...] using the several types of Bayesian mail filters [...] features against the RIPPER rule-learning algorithm in different e-mail classification tasks. In learning a user’s foldering preferences and learning to detect spam, the Bayesian filter [...]

6.1. Advantages of Bayesian filter

The Bayesian method takes the whole message into account. It recognizes keywords that identify spam/garbage, but it also recognizes [...] filtering is a much more intelligent approach because it examines all aspects of a message, as opposed to keyword checking that classifies mail as spam/garbage on the basis of a single [word].

A Bayesian filter is constantly self-adapting - By learning from new spam/garbage and new valid outbound mails, the Bayesian filter [...] techniques. For example, when spam/garbage [...]

[...] spammer who wants to trick a Bayesian filter [...] indicate spam/garbage (such as free, Viagra, etc), or more words that generally indicate valid mail (such as a valid contact name, etc).

7. Conclusion

The Bayesian filters, after training, offer better recall than the heuristic filters. Catching a higher proportion of spam is clearly good, since that is the reason people use them. With insufficient training, however, the Bayesian filters perform poorly in comparison with SpamAssassin in terms of recall. Some Bayesian filters (Quick Spam) also work very poorly compared with other Bayesian filters.

The Bayesian filter outperforms by far the keyword-based filter, even with very small [...]

As future work, we plan to implement alternative anti-spam filters based on other machine learning algorithms. The filter can also include the foldering mechanism to check the spam if the user needs it. Server-side mail filters are most effective, because many users can be protected by the filters. More than one filter can be activated on a server as a set of layers, and this is more effective than one filter because if one filter fails another one can be [...]

8. References

[1]. L. Pelletier, J. Almhana, V. Choulakian, Adaptive Filtering of SPAM, Proceedings of [...]
[2]. Mingjun Lan, Wanlei Zhou, Spam Filtering based on Preference Ranking, Proceedings of the 2005 The Fifth International Conference on [...]
[3]. M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, A Bayesian approach to filtering junk e-mail, AAAI’98 Workshop on Learning for Text Categorization, Madison, WI, July 27, 1998.
[4]. W. W. Cohen, Learning rules that classify e-mail, In Proceedings of the 1996 AAAI Spring [...]
[6]. Paul Graham, Better Bayesian filtering, [...]
[8]. Paul Graham, A Plan for Spam, August 2002, [...]
[9]. The Coalition Against Unsolicited Commercial [...]
[...] experimental comparison of naive bayesian and [...] personal e-mail messages. In Proceedings of [...]
[10]. X. Carreras and L. Màrquez, Boosting trees for anti-spam e-mail filtering, In Proceedings of RANLP-2001, 4th International Conference on [...]
[11]. L. Pelletier, J. Almhana, and V. Choulakian, Adaptive Filtering of SPAM, Proceedings of the [...]
[12]. CNN 2005, [...]
[13]. Nobuo Kumagai, Masayoshi Aritsugi. On Applying an Image Processing Technique to Detecting Spam/garbages. Department of Computer Science, Faculty of Engineering, Gunma University, 1–5–1 Tenjin-cho, Kiryu 376-8515, Japan. 2005.
[14]. S B Tan. Introduction to Bayesian Methods for Medical Research. Division of Clinical Trials and Epidemiological Sciences, National Cancer Centre. 2001.
[15]. Lina Zhou, Jinjuan Feng, Andrew Sears, Yongmei Shi. Applying the Naïve Bayes Classifier to Assist Users in Detecting Speech Recognition Errors. Information Systems [...] Computer Science and Electrical Engineering [...]
[16]. SpamAssassin, [...]
[17]. Bogofilter, [...]
[20]. dbacl, [...]
[22]. SpamProbe, [...]
[24]. Jason D. M. Rennie, ifile: An Application of [...] Artificial Intelligence Lab, Massachusetts [...]
[27]. Jefferson Provost, Naïve-Bayes vs. Rule-Learning in classification of e-mail, Department of Computer Sciences, The University of Texas [...]

