bayesian_filter

description

Runs a Bayesian Filter over the last 20 Tweets from the TSN (Twitter Screen Name).

purpose

Email anti-spam engines use Bayesian filtering to sift through email body content, comparing it to a corpus of known good and known bad emails. If the email under test is more like the bad emails than the good ones, it can be flagged as spam with a good degree of confidence.

Mathematical models that represent and define both twam and non-twam Tweets from the public timeline are built on a regular basis in order to track the Tweet constructs of both typical and spam accounts. These models are then fed into a Bayesian Filter during the comparison of the TSN Tweets.
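As an illustration of the general technique, here is a minimal naive Bayes sketch in Python. It is not the TWASE implementation: the whitespace tokenizer, the add-one smoothing, the equal priors, and the names `train` and `spam_probability` are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def train(tweets):
    """Build a token-count model from a list of Tweet strings."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


def spam_probability(tweet, spam_counts, ham_counts):
    """Estimate P(spam | tweet) with add-one smoothing and equal priors."""
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    log_spam = 0.0
    log_ham = 0.0
    for token in tokenize(tweet):
        # The extra +1 in the denominator reserves mass for unseen tokens.
        log_spam += math.log((spam_counts[token] + 1) / (spam_total + vocab_size + 1))
        log_ham += math.log((ham_counts[token] + 1) / (ham_total + vocab_size + 1))
    # Turn the two log-likelihoods into a posterior; clamp to avoid overflow.
    diff = max(min(log_ham - log_spam, 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(diff))


# Tiny made-up corpora, for illustration only.
spam_model = train(["win free money now", "free followers click here"])
ham_model = train(["having lunch with friends", "great game last night"])

p_spam_like = spam_probability("free money", spam_model, ham_model)
p_ham_like = spam_probability("lunch with friends", spam_model, ham_model)
```

With these toy corpora, the first Tweet scores above 0.5 and the second below it, which is the "more like the bad than the good" comparison described above.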

grading

GRADE   DESC
1       Mean average of 0-20% Spam
2       Mean average of 21-40% Spam
3       Mean average of 41-60% Spam
4       Mean average of 61-80% Spam
5       Mean average of 81-100% Spam
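The grading bands can be expressed as a small lookup. This is a sketch only: the function name `grade_from_spam_mean` is hypothetical, and treating each band's upper bound as inclusive for non-integer values (for example 20.5 mapping to grade 2) is an assumption, since the table lists whole percentages only.

```python
def grade_from_spam_mean(spam_mean):
    """Map a spam mean percentage (0-100) to a grade from 1 to 5."""
    if not 0 <= spam_mean <= 100:
        raise ValueError("spam_mean must be between 0 and 100")
    # (inclusive upper bound of the band, grade)
    bands = [(20, 1), (40, 2), (60, 3), (80, 4), (100, 5)]
    for upper, grade in bands:
        if spam_mean <= upper:
            return grade


grade_high = grade_from_spam_mean(99)
grade_low = grade_from_spam_mean(32)
```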

example

The module output below comes from processing a definite twam account: none of the last 20 Tweets is ham, and the Bayesian Filter has classified all of them as spam with an extremely high mean probability of 99%. This result gives the TSN a grade of 5.

<bayesian_filter> 
	<date>1265537834</date> 
	<exec_time>636</exec_time> 
	<raw_data> 
		<tweets>20</tweets> 
		<ham_count>0</ham_count> 
		<ham_mean>0</ham_mean> 
		<spam_count>20</spam_count> 
		<spam_mean>99</spam_mean> 
	</raw_data> 
	<result>5</result> 
</bayesian_filter> 

In the next example, the Bayesian Filter has classified 5 of the TSN's last 20 Tweets as spam-like, but with a mean probability of only 32%. The remaining 15 Tweets are classified as ham with a mean probability of 29%.

The grade is derived from the spam mean alone: 32% falls in the 21-40% band, giving a result of 2.

<bayesian_filter> 
	<date>1265538058</date> 
	<exec_time>516</exec_time> 
	<raw_data> 
		<tweets>20</tweets> 
		<ham_count>15</ham_count> 
		<ham_mean>29</ham_mean> 
		<spam_count>5</spam_count> 
		<spam_mean>32</spam_mean> 
	</raw_data> 
	<result>2</result> 
</bayesian_filter> 
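A consumer of this output could read the fields with Python's standard `xml.etree.ElementTree` module. The sketch below is illustrative only: the function name `parse_module_output` is hypothetical, and reading every field as an integer is an assumption based on the two samples shown here.

```python
import xml.etree.ElementTree as ET

# The second sample output from this page, reproduced as a string.
SAMPLE = """<bayesian_filter>
	<date>1265538058</date>
	<exec_time>516</exec_time>
	<raw_data>
		<tweets>20</tweets>
		<ham_count>15</ham_count>
		<ham_mean>29</ham_mean>
		<spam_count>5</spam_count>
		<spam_mean>32</spam_mean>
	</raw_data>
	<result>2</result>
</bayesian_filter>"""


def parse_module_output(xml_text):
    """Return the module's fields as a dict of integers."""
    root = ET.fromstring(xml_text)
    raw = root.find("raw_data")
    fields = {
        "date": int(root.findtext("date")),
        "exec_time": int(root.findtext("exec_time")),
        "result": int(root.findtext("result")),
    }
    # Every child of <raw_data> is read as an integer (assumption).
    for child in raw:
        fields[child.tag] = int(child.text)
    return fields


parsed = parse_module_output(SAMPLE)
# In both samples on this page, ham_count + spam_count equals tweets.
counts_add_up = parsed["ham_count"] + parsed["spam_count"] == parsed["tweets"]
```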

data

Models of what constitutes both 'good' and 'spammy' Tweets are maintained by TWASE and are regularly re-computed over an internal corpus to keep them current. The corpus consists of approximately 10,000 of the latest Tweets from Twitter Screen Names considered non-spammy and 10,000 confirmed spam Tweets.

notes

none

 
module/bayesian_filter.txt · Last modified: 2010/02/28 10:46 (external edit)