## -- mark for start of standard filter set -- ##
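# Rule format, as inferred from the entries in this file (a reading of the
# examples below, not a specification):
#   pattern (fields) score
# A pattern appears to be either a plain string or a regex between slashes.
# The field list and the score both seem to be optional; the "Impolite words"
# section omits the score and describes its rules as scoring one each.
# Backreferences to the first written group appear as \2 in these rules, so
# group numbering presumably starts one higher than usual.
# Hypothetical examples, commented out so they are not part of the filter set:
#   example-word (url blog) 2
#   /example-\w+/i (text)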
## Experimentals
# Leading space, sentence
/^\s[[:alnum:]"[:space:]]+\.\.$/ (content)
# Random text without links, ending in '...' with leading space.
/^\s[[:upper:]].+\.\.\.$/ (content)
# Repeated one and two word phrases (3 repeats)
/(\w+)(?:\s+\2){2}/ (text)
/(\w+\s+\w+)(?:\s+\2){2}/ (text)
# lower case words, dash, URL as entire comment.
/^(?:[[:lower:]]+\s+)+-\s+http://[^\s]+$/ (text)
# Random string of alternating digits and lower case letters in the email address or domain
/[[:digit:]]+[[:lower:]]+[[:digit:]]+[[:lower:]]+/ (mail)
# A nasty little junker with the form of lower case text, link with lower case
# text as the entire comment.
/^[[:lower:][:space:][:digit:]'-]+<a[^>]+>[[:lower:][:space:][:digit:]'+-]+</a>\s*$/ (content)
# Raw URLs separated by whitespace
/http://[[:alnum:]_./-]+[[:space:]]*<br />[[:space:]]*http:/// (content)
# Hex strings at the front and/or back to fool us. Haha, that's just the
# kind of non-human signature I like to see.
/^\s?[a-f0-9]{10,}\s/i (text) 3
/[a-f0-9]{10,}\s?$/i (text) 3
# Short intervening text, possibly with spaces
/]+>\s*[[:alnum:]]{1,2}\s*]+>[[:lower:]]+\s+[[:lower:]]+$/ (content)
# Excitable junker
/\s!!\s/ (text)
# The greeters
/^Hi[\s,!<]/i (text)
/^Hello/i (text)
/^Holla /i (text)
/^Hoh! /i (text)
## Standard structure checks
# The pipe guys. Trigger on a URL or an anchor followed by a pipe
/[^'"]http://[[:alnum:]/._-]+\s*[|]/ (text) 2
/<a[^>]+>[^<]*</a>\s*[|]/i (text) 2
# Empty field checks for trackbacks (somebody's now putting a single space in)
/^\s*$/ (source title blog) 3
# Now getting junk that is just two uncapitalized, unpunctuated words
# This forbids text that consists solely of lower case letters, digits, spaces, dashes and percent signs
/^[[:lower:][:digit:]\s%-]+$/ (text) 2
# Forbid single words without a sentence terminator.
/^\s*[[:alnum:]]+\s*$/ (text) 2
# Double dashes URL
-- (email url) 3
# Link text is the URL
/[^<]*\3[^<]*<\/a>/i (text) 3
# URL immediately after an anchor to that URL
/[^<]+<\/a>\s*\2/i (text) 3
# Keywords from URL repeated in link text.
/href="http://(?:www\.)?([^/">-]+)-([^/">-]+)\b[^>"]*">\s*\2\s+\3/i (text) 3
# Link text is domain name
/href="[^"]*\.([^."]+)\.[a-z]+">\s*\2\s*]+>\s*([^<]*)\s*\s*\2/i (text) 2
# Homepage URL looks like an archive
/[[:digit:]]{3,}\.(?:html|htm|shtml|php)$/i (home) 3
# file name not used by real people.
/\/[[:alpha:]]{1,2}[[:digit:]]+$/ (source) 3
# Empty link - normal people don't do that.
/<a[^>]+href[^>]+>\s*<\/a>/i (content) 3
# Content that is just links (plus white space & certain punctuation)
/^(?:<a[^>]+>[^<]+</a>\s*(?:<br />)?[\s,\\/]*)+$/i (text) 2
# Content that has 3 or more links in a row without any separating text.
/(?:<a[^>]+>[^<]+</a>\s*(?:<br />)?\s*){3}/i (text) 2
# Use of BB syntax is determinative.
[/url] (content) 2
# capitalized attributes are indicative (have seen legit capitalized tags)
/ 2
# No quotes is a bad sign
href=http (text) 2
# Real humans don't bother putting in empty titles
title="" (content) 2
# Purely numeric subdomains are bad.
/http://[[:digit:]]+\.[[:alpha:]]/ 2
# No links allowed in the source weblog or the commenter name
href (blog name) 3
http: (blog name) 2
# Redirects in the homepage are bogus
/=http/i (url) 2
# Generic phrase that's a bad sign.
/buy \w+ online/i 2
# This looks for a file name that is repeated in the link text
# and in the immediately following text.
/<a[^>]*href=['"][a-z:/.]+/([a-z_A-Z]+)s?.html['"][^>]*>[\s\w]*\2[\s\w]*</a>[\s\w]*\2/ (text) 2
# Raw URLs in a list with separators (pipe junkers caught in a different filter)
/http://[^\s|]+\s*[/]\s*http:/i (text) 2
## Standard URL fragment checks
# These are words that are too common to ban outright and we want
# to check for in the text. So we look for them as part of a URL
# which normal people don't do.
-betting 2
betting- 2
-buy 2
buy- 2
/-casino/i 3
/casinos?-/i 3
cell-phone 3
cell-phones 3
-cheap 2
cheap- 2
-credit 2
credit- 2
-dating 2
dating- 2
-debt 2
debt- 2
-discount 3
discount- 3
-download 3
download- 3
-enhancement 2
enhancement- 2
-generic 2
generic- 2
-holdem 3
holdem- 3
hold-em 3
-insurance 2
insurance- 2
-jpg 2
jpg- 2
-keno 2
keno- 2
-lending 2
lending- 2
-loan 2
loan- 2
-loans 2
loans- 2
-mortgage 2
mortgage- 2
-nude 2
nude- 2
-on-line 2
on-line- 2
-online 2
online- 2
-payday 2
payday- 2
-pharmacy 2
pharmacy- 2
-pill 2
-pills 2
pill- 2
pills- 2
-poker 2
poker- 2
-porn 2
porn- 2
-porno 2
porno- 2
preteens- 2
-preteens 2
-prescription 2
prescription- 2
-rape 2
rape- 2
-refinance 2
refinance- 2
replica- 2
-ringtone 2
ringtones- 2
-ringtones 2
ringtone- 2
-roulette 2
roulette- 2
-sex 2
sex- 2
-slot 2
-slots 2
slot- 2
slots- 2
-spyware 2
spyware- 2
## Things we don't want to see in links.
# In the future, I need to write a keyword filter that looks only inside
# link text.
>buy 2
>casino 2
casino< 2
>casinos 2
casinos< 2
>cheap
cheap<
>discount
discount<
>free 2
>mp3 2
mp3< 2
>online 2
online< 2
>order 2
>porn 2
porn< 2
>valium 2
valium< 2
>video
## Bad words
# These are bad everywhere. Some are bad even as parts of
# other words (those are bracketed by slashes)
ambien 2
bdsm
cash advance
cash loan
carisoprodol 2
/empirepoker/i 2
fones 3
femdom 2
fioricet 2
fuck
fucker
fucking
hoodia 2
hydrocodone 2
lipitor 2
milf
mortgage refinance
muchila 2
nude preteen 2
online casinos
online directory
online links
online slot 2
online slots 2
online poker 2
/onlinepoker/i 2
oxycodone 2
oxycontin 2
/partypoker/i 2
/phentermine/i 5
rulez
shemale 5
teen sex 3
tramadol 2
vicodin 2
xanax 2
zoloft 2
## Bad words for fields other than text
anal (email url blog name title) 3
bbs.php (url)
blackjack (email url blog name title) 2
black jack (email name blog) 2
bestiality (email name blog title) 2
casino (email url blog name title) 3
casinos (email url blog name title) 3
cgi-bin (url) 2
credit (email url) 2
/creditcard/i (email url) 2
diet (blog url title)
diets (blog url title)
gambling (email url blog name title)
google (url) 2
holdem (email url blog name title) 2
hold em (email url blog name title) 2
insurance (email url) 3
/mp3/i (url email blog) 2
mortgage (email url blog name title) 3
/pharmacy/i (email url blog name title) 3
poker (email url blog name title) 3
porn (email url blog name) 3
porno (email url blog name) 3
pussy (email url blog name) 3
/ringtones?/i (email url blog name) 3
sex (email url blog name) 3
slots (url blog name) 2
transsexual (email url blog name) 3
viagra (email url blog name) 3
wagering (url blog name) 3
## Impolite words for fields
# We do these at one each, since the scores inside the filter add, not average.
betting (url blog)
buy (url blog)
cellphone (url)
cellular (url name blog)
compare (url blog title)
discount (url blog title)
download (url blog title)
gay (url blog)
loan (url blog)
loans (url blog)
most popular (url blog title)
pagers (url blog title)
payday (url blog name)
phones (url blog title)
pissing (email url blog name)
prices (url blog title)
sale (url blog title)
spanking (url blog title name)
/xxx/i (url blog title email)
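# Worked example for the section above, assuming each matching rule adds its
# score once and an omitted score counts as one: a URL field containing both
# "discount" and "prices" matches two rules here and picks up 1 + 1 = 2.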
## Banned domains
# These are unusual enough that we can just ban them everywhere.
herbal-source 3
blogsharing 3
creekrugby 3
easyjournal 2
editme.com 3
free20 3
freewebs 2
hometown.aol 3
ifastnet 3
servik.com 3
xoomer 3
alice.it 3
# Top level domains
.ar/
.be/
.fr/
.it/
.pl/
.ro/
.to/
# And the Blogspot colonizers
/-\w+-[a-z0-9]{4}\.blogspot/ 2
/-.\.blogspot.com/ 2
## Banned commenter names
ssoboy (email) 3
hare.jp (email) 3
neo@hotmail (email) 3
/comehome/ (email) 2
web@ (email) 3
webmaster@ (email) 1
info@ (email) 3
reply@ (email) 3
test@ (email) 2
@email (email) 3
@freemail (email) 2
@inbox (email) 2
/@[[:digit:]]*mail\b/ (email) 3
@themail (email) 2
mailbox.com (email) 3
# all numeric email address is so wrong.
/^[[:digit:]]+@/ (email) 3
# So is a numeric subdomain
/@[[:digit:]]+\./ (email) 3
# No equal signs allowed in email
= (email) 3
## Other bannings
/Web Directory/ (blog title) 3
La Cocina (text) 2