bayesian_filter
description
Runs a Bayesian Filter over the last 20 Tweets from a TSN (Twitter Screen Name).
purpose
Email anti-spam engines use Bayesian filtering to sift through email body content, comparing it to a corpus of known good and known bad emails. If the email under test is more like the bad emails than the good ones, it can be flagged as spam with a good degree of confidence.
Mathematical models that represent both twam and non-twam Tweets from the public timeline are rebuilt on a regular basis in order to track how typical accounts and spam accounts construct their Tweets. These models are then fed into the Bayesian Filter when the TSN's Tweets are compared against them.
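The comparison described above can be illustrated with a minimal naive Bayes sketch. This is not TWASE's implementation: the tokenizer, the equal priors, the add-one smoothing, and the names `tokenize`, `train` and `spam_probability` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a Tweet and split it into word-like tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def train(tweets):
    """Build a token-count model from a list of Tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


def spam_probability(tweet, ham_counts, spam_counts):
    """Return P(spam | tweet) under naive Bayes, assuming equal priors."""
    vocab_size = len(set(ham_counts) | set(spam_counts))
    ham_total = sum(ham_counts.values())
    spam_total = sum(spam_counts.values())
    log_ham = 0.0
    log_spam = 0.0
    for token in tokenize(tweet):
        # Add-one smoothing; the extra +1 in the denominator reserves
        # probability mass for tokens seen in neither corpus.
        log_ham += math.log((ham_counts[token] + 1) / (ham_total + vocab_size + 1))
        log_spam += math.log((spam_counts[token] + 1) / (spam_total + vocab_size + 1))
    # Clamp the log-odds so math.exp cannot overflow on long inputs.
    diff = max(min(log_ham - log_spam, 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(diff))


# Tiny hypothetical corpora, purely for demonstration.
ham_model = train(["lunch with friends today", "reading a good book"])
spam_model = train(["free followers click here", "click here win free prize"])

spammy = spam_probability("click here for free followers", ham_model, spam_model)
hammy = spam_probability("good lunch today", ham_model, spam_model)
```

With these toy corpora, `spammy` lands above 0.5 and `hammy` below it. A real model would be trained on a far larger corpus.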
grading
| GRADE | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| DESC | Spam mean of 0-20% | Spam mean of 21-40% | Spam mean of 41-60% | Spam mean of 61-80% | Spam mean of 81-100% |
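The grading table can be expressed as a small lookup. The sketch below assumes the spam mean is a percentage between 0 and 100; the function name `grade_from_spam_mean` is hypothetical, and the handling of fractional values (for example 20.5 mapping to grade 2) is an assumption, since the table only lists whole percentages.

```python
import math


def grade_from_spam_mean(spam_mean):
    """Map a spam mean percentage (0-100) to a grade from 1 to 5."""
    if not 0 <= spam_mean <= 100:
        raise ValueError("spam_mean must be between 0 and 100")
    # 0-20 -> 1, 21-40 -> 2, 41-60 -> 3, 61-80 -> 4, 81-100 -> 5.
    # max() keeps a spam mean of exactly 0 in grade 1.
    return max(1, math.ceil(spam_mean / 20))
```

For instance, `grade_from_spam_mean(99)` gives 5 and `grade_from_spam_mean(32)` gives 2.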
example
The module output below comes from processing a definite twam account: none of the last 20 Tweets are ham, and the Bayesian Filter has classified all of them as spam with an extremely high mean probability of 99%. This result gives the TSN a grade of 5.
<bayesian_filter>
  <date>1265537834</date>
  <exec_time>636</exec_time>
  <raw_data>
    <tweets>20</tweets>
    <ham_count>0</ham_count>
    <ham_mean>0</ham_mean>
    <spam_count>20</spam_count>
    <spam_mean>99</spam_mean>
  </raw_data>
  <result>5</result>
</bayesian_filter>
In the next example, the Bayesian Filter has classified 5 of the TSN's last 20 Tweets as spam-like, but with a mean probability of only 32%. The remaining 15 Tweets are classified as ham with a mean probability of 29%.
Only the spam mean is used for grading. Here 32% falls in the 21-40% band, so the result is 2.
<bayesian_filter>
  <date>1265538058</date>
  <exec_time>516</exec_time>
  <raw_data>
    <tweets>20</tweets>
    <ham_count>15</ham_count>
    <ham_mean>29</ham_mean>
    <spam_count>5</spam_count>
    <spam_mean>32</spam_mean>
  </raw_data>
  <result>2</result>
</bayesian_filter>
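A consumer of this module output might read it as follows. This is a sketch using Python's standard `xml.etree.ElementTree`; the function name `parse_bayesian_filter` and the choice to treat every field as an integer are assumptions based only on the two samples shown above.

```python
import xml.etree.ElementTree as ET

# The second example output from above, reproduced as a string.
SAMPLE = (
    "<bayesian_filter> <date>1265538058</date> <exec_time>516</exec_time> "
    "<raw_data> <tweets>20</tweets> <ham_count>15</ham_count> "
    "<ham_mean>29</ham_mean> <spam_count>5</spam_count> "
    "<spam_mean>32</spam_mean> </raw_data> <result>2</result> "
    "</bayesian_filter>"
)


def parse_bayesian_filter(xml_text):
    """Return the module output as a flat dict of integers."""
    root = ET.fromstring(xml_text)
    # Every child of <raw_data> holds an integer in the documented samples.
    fields = {child.tag: int(child.text) for child in root.find("raw_data")}
    for tag in ("date", "exec_time", "result"):
        fields[tag] = int(root.findtext(tag))
    return fields


parsed = parse_bayesian_filter(SAMPLE)
# Sanity check: ham and spam counts should account for every Tweet examined.
counts_add_up = parsed["ham_count"] + parsed["spam_count"] == parsed["tweets"]
```

The `counts_add_up` check holds for both samples in this section and is a cheap way to detect truncated or malformed output.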
data
Models of what constitutes both 'good' and 'spammy' Tweets are maintained by TWASE and are regularly recomputed over an internal corpus to keep them current. The corpus consists of approximately 10,000 recent Tweets from Twitter Screen Names considered non-spammy and 10,000 confirmed spam Tweets.
notes
none