Project Details

SPAM - Jeaws & JBayes




These two small programs were developed as part of a paper for the "IT-Security" lecture at Fulda University of Applied Sciences. The topic of the paper was "Spam".
When it comes to spam, there will always be two opposing parties: the spammers and those who try to detect and filter that spam. To show other students which tools are used on either side, and to demonstrate that such tools can be built without much effort, these two small programs were created:

Jeaws

Jeaws stands for "Java E-Mail Address Web Spiders". It browses the web, scans websites for e-mail addresses and stores them on the hard disk. To begin, you give the tool a starting URL. Besides searching the current page for e-mail addresses, it also looks for links that lead to new pages. This way it automatically moves on to other websites and, in principle, never stops.
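
The scan-and-follow loop described above can be sketched roughly as follows. This is not the Jeaws source: the class name SpiderSketch, the two regular expressions and the in-memory map that stands in for real HTTP requests are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the original Jeaws code.
class SpiderSketch {
    // Deliberately simple patterns; real addresses and links are messier.
    static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    static final Pattern LINK =
        Pattern.compile("href\\s*=\\s*\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns every match of the given group, in order of first appearance.
    static Set<String> findAll(Pattern pattern, String text, int group) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(group));
        }
        return found;
    }

    // Breadth-first crawl over an in-memory "web" (URL -> HTML) instead of real HTTP.
    static Set<String> crawl(String start, Map<String, String> web) {
        Set<String> emails = new LinkedHashSet<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(start);
        seen.add(start);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            String html = web.getOrDefault(url, "");
            emails.addAll(findAll(EMAIL, html, 0));
            for (String link : findAll(LINK, html, 1)) {
                if (seen.add(link)) {   // queue each link only once
                    frontier.add(link);
                }
            }
        }
        return emails;
    }

    public static void main(String[] args) {
        Map<String, String> web = Map.of(
            "http://a.example/", "<a href=\"http://b.example/\">b</a> alice@a.example",
            "http://b.example/", "<a href=\"http://a.example/\">a</a> bob@b.example");
        System.out.println(crawl("http://a.example/", web));
    }
}
```

The set of already seen links is what keeps the two example pages, which link to each other, from being visited forever. On the real web the queue of pending links still grows without bound, which matches the "never stops" behaviour described above.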

To run the program, use it like this:

java -Xmx768m Jeaws http://www.website.com

The -Xmx768m option sets the JVM's maximum heap size to 768 MB, which makes sure there is enough memory to store the links that are still to be visited. It is a JVM option, so it has to appear before the class name; anything after the class name is passed to the program as an ordinary argument. This program is not very optimized and I can think of a lot of improvements. For example, it would be nice if the program stopped scanning a given domain once X of its pages have yielded no e-mail addresses, because it is then very unlikely that other pages of that domain contain any. If the spiders get to eBay, for instance, they won't come out again very quickly, because eBay pages contain an enormous number of links to other eBay pages, yet it is very unlikely that many e-mail addresses will be found there. So it would be nice to remove all eBay links from the database of pending links automatically.
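
One possible reading of this proposed improvement is sketched below. It is not part of Jeaws: the class DomainBudget, its method names and the rule "skip a host once X pages were scanned and none of them contained an address" are assumptions.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the "give up on unproductive sites" idea.
class DomainBudget {
    private final int maxEmptyPages;                                  // the "X" from the text
    private final Map<String, Integer> emptyPages = new HashMap<>();  // host -> pages without addresses
    private final Set<String> productiveHosts = new HashSet<>();      // hosts that yielded at least one address

    DomainBudget(int maxEmptyPages) {
        this.maxEmptyPages = maxEmptyPages;
    }

    // Uses the host name as the grouping key; URI.create throws on malformed URLs.
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase();
    }

    // Call once per scanned page with the number of addresses found on it.
    void record(String url, int emailsFound) {
        String host = hostOf(url);
        if (emailsFound > 0) {
            productiveHosts.add(host);
        } else {
            emptyPages.merge(host, 1, Integer::sum);
        }
    }

    // True once a host has produced X empty pages and never a single address.
    boolean shouldSkip(String url) {
        String host = hostOf(url);
        return !productiveHosts.contains(host)
            && emptyPages.getOrDefault(host, 0) >= maxEmptyPages;
    }

    public static void main(String[] args) {
        DomainBudget budget = new DomainBudget(3);
        for (int i = 1; i <= 3; i++) {
            budget.record("http://www.ebay.com/item/" + i, 0);
        }
        System.out.println(budget.shouldSkip("http://www.ebay.com/item/4")); // true
        System.out.println(budget.shouldSkip("http://www.example.org/"));    // false
    }
}
```

A spider would call shouldSkip before queueing a link. Because the key is the full host name, subdomains of the same site are counted separately; grouping them would need extra logic.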

The program saves its results in two files: spider.email for the e-mail addresses and spider.websites for the links that are still to be visited. However, it does not yet read this information back when it starts, so every run begins from scratch. Another point for improvement, as you can see! But as I said, this program, like the next one, is just for teaching purposes.
Although the program is not that fancy, I added a neat little feature: while it is running, you can open a web page that shows current statistics at http://localhost:9999
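
The missing restart step could look roughly like this. The sketch assumes that spider.email and spider.websites hold one entry per line, which the description does not confirm; the class SpiderState and its load method are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; the real file format of Jeaws may differ.
class SpiderState {
    // Reads one entry per line (assumed format); a missing file yields an empty set.
    static Set<String> load(Path file) throws IOException {
        Set<String> entries = new LinkedHashSet<>();
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file)) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    entries.add(trimmed);   // the set drops duplicates
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Set<String> emails = load(Path.of("spider.email"));
        Set<String> pending = load(Path.of("spider.websites"));
        System.out.println(emails.size() + " addresses, " + pending.size() + " links restored");
    }
}
```

On startup the spider would seed its list of known addresses and its queue of pending links from these two sets instead of starting with only the URL from the command line.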

JBayes

We have seen a tool that a spammer could use to collect many e-mail addresses in no time. But what about the people who receive spam (so basically everybody who has an e-mail account and has actually used it)? Since 2002, Bayesian filtering has become very popular. This algorithm classifies messages based on a knowledge base that was trained in the past. Training is fairly simple. In the case of spam classification (and such filters are not limited to "spam" versus "not spam"), the user just tells the filter which messages are spam. That's it. The knowledge base is extended automatically and the filter has learned something new. From then on it takes the new information into account whenever new mail arrives and a decision has to be made.

The program shown here is based on an article by John Graham-Cumming, "Build Your Own Bayesian Spam Filter", from May 2005. He explained how to implement a Naïve Bayesian filter in Perl. So what I basically did was take this idea and implement the same procedure in Java.
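
To make the idea concrete, here is a minimal naive Bayes text classifier. It is a sketch under assumptions, not the JBayes source and not a port of the Perl code from the article: the class NaiveBayesSketch, the tokenizer, the add-one smoothing and the uniform prior over categories are choices made for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a multinomial naive Bayes classifier.
class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // category -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // category -> number of words
    private final Set<String> vocabulary = new HashSet<>();

    // Lower-cases the text and splits on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9]+");
    }

    // Training only means counting words per category.
    void train(String category, String text) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Log score per category: sum of log P(word | category) with add-one smoothing.
    // A uniform prior over categories is assumed, so no prior term is added.
    Map<String, Double> score(String text) {
        Map<String, Double> scores = new HashMap<>();
        for (String category : wordCounts.keySet()) {
            int total = totalWords.getOrDefault(category, 0);
            double logScore = 0.0;
            for (String word : tokenize(text)) {
                if (word.isEmpty()) continue;
                int count = wordCounts.get(category).getOrDefault(word, 0);
                logScore += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            scores.put(category, logScore);
        }
        return scores;
    }

    // Returns the category with the highest score, or null if nothing was trained.
    String classify(String text) {
        return score(text).entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElse(null);
    }

    public static void main(String[] args) {
        NaiveBayesSketch filter = new NaiveBayesSketch();
        filter.train("spam", "cheap pills buy now");
        filter.train("spam", "buy cheap watches now");
        filter.train("ham", "meeting agenda for monday");
        filter.train("ham", "lunch on monday");
        System.out.println(filter.classify("buy cheap pills")); // spam
    }
}
```

Working with sums of logarithms instead of products of probabilities avoids numeric underflow on long texts. The scores are therefore negative numbers, and the category whose score is closest to zero wins.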

The result is a program that can handle any number of classes (categories), which you create yourself. You can add text to any of these classes at any time and the filter is trained automatically. You can also try to classify a message at any time and the filter will show you the result.
Here is how it works. To add a new text file to a certain category, do the following:

java JBayes add category myfile.txt

The program itself reads only one file per call, but it is easy to wrap the call in a script that adds many files. The program trains and saves its knowledge base to a file called database.ser. It reads this file automatically on every execution, so nothing is lost. If you want to reset the knowledge base, just delete this file.
Now let's say you have trained the program and want to classify a text file. This is what you would have to do:
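
The file name database.ser suggests standard Java object serialization, although that is a guess. Under this assumption, saving and restoring a knowledge base could be sketched as follows; the class KnowledgeBaseStore and the nested-map layout (category, then word, then count) are hypothetical.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical sketch; the real layout of database.ser is not documented here.
class KnowledgeBaseStore {
    // Writes the word counts (category -> word -> count) with Java serialization.
    static void save(HashMap<String, HashMap<String, Integer>> counts, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(counts);
        }
    }

    // Reads the counts back; a missing file means an empty, untrained knowledge base.
    @SuppressWarnings("unchecked")
    static HashMap<String, HashMap<String, Integer>> load(Path file)
            throws IOException, ClassNotFoundException {
        if (!Files.exists(file)) {
            return new HashMap<>();
        }
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, HashMap<String, Integer>>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args.length > 0 ? args[0] : "database.ser");
        HashMap<String, HashMap<String, Integer>> counts = load(file);
        // Count one more occurrence of "cheap" in the "spam" category, then persist.
        counts.computeIfAbsent("spam", c -> new HashMap<>()).merge("cheap", 1, Integer::sum);
        save(counts, file);
        System.out.println(counts);
    }
}
```

Loading at the start and saving at the end of every call gives the behaviour described above: each run builds on the previous ones, and deleting the file resets the filter.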

java JBayes classify myfile.txt

The program will now print every category together with a number. This number tells you the probability that the given text file falls into that category: the higher the number, the more likely the match.

Attached you will find the source code as well as a pre-compiled version.
