|
Spam Conference 2005
Spam Conference 2005 took place at MIT in Cambridge, MA on
January 21, 2005.
This was the third such conference taking place at MIT. The
MIT spam conferences are unlike most spam conferences in that the presentations
at the MIT conferences have a more academic focus while most other
spam conferences have been more vendor-oriented.
This difference in focus has its advantages, but it has its drawbacks
as well: several of the presentations were a distinct demonstration of
ivory tower syndrome, with highly abstract discussions that showed
little real-world understanding of the spam problem or of how to
develop practical solutions. These more abstract presentations were
supplemented by several which may be described as anthropological,
such as the presentation which described in some depth how people
react to spam.
Unfortunately, with a few exceptions, little was presented at the
conference that would be useful to an email or systems administrator.
A brief description of each presentation follows.
-----------------------------------------------------------------------------------------------------
A Unified Model of Spam Filtration
Bill Yerazunis
Mitsubishi Electric Research Lab
Filtering spam is a hard problem, and the same filter may
produce different
results for different users. All filters use the same steps:
(1) Pre-processing. Text-to-text transformation.
(2) Tokenization. Processed text is segmented into token strings
using regular expressions and then converted into token IDs.
(3) Token stream converted into features which can be defined
mathematically. All features are unique.
(4) Features are given weights by referencing a look-up table.
These look-up tables are prebuilt by the learning side of
the filter.
(5) Weights are combined and a state vector is produced which
is then compared with a threshold value.
Better input for the learning side of filters will improve
efficacy, and the
insertion of outgoing email would be most useful in this regard.
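
The five steps above can be sketched in a few lines of Python. This is only an illustration of the general pipeline, not the speaker's implementation: the weight table, the threshold, and the choice of word pairs as features are all assumptions made for the example, and the conversion of tokens to numeric IDs is skipped for brevity.

```python
import re

# Hypothetical weight table; a real filter's learning side would build this.
WEIGHTS = {"viagra": 2.0, "free": 1.0, "free offer": 1.5, "meeting": -1.5}
THRESHOLD = 1.0  # arbitrary cut-off chosen for this illustration


def preprocess(text):
    """Step 1: text-to-text transformation (here, just lowercasing)."""
    return text.lower()


def tokenize(text):
    """Step 2: segment the text into tokens with a regular expression."""
    return re.findall(r"[a-z0-9$']+", text)


def features(tokens):
    """Step 3: turn the token stream into a set of unique features.

    Features here are single tokens plus adjacent token pairs.
    """
    pairs = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens) | set(pairs)


def score(feats):
    """Steps 4 and 5: look up each feature's weight and combine by summing."""
    return sum(WEIGHTS.get(f, 0.0) for f in feats)


def is_spam(text):
    """Run the whole pipeline and compare the score with the threshold."""
    return score(features(tokenize(preprocess(text)))) > THRESHOLD
```

With this toy table, is_spam("FREE offer: Viagra") is true and is_spam("Meeting at noon") is false.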
-----------------------------------------------------------------------------------------------------
Bayesian Spam Classification Applied to Phishing Fraud
Andrew Klein
MailFrontier, Inc.
A phish has two essential features in order to be effective:
(1) it must build credibility; and (2) it must establish
a reason for the user to take action.
Distinguishing phish messages from legitimate messages is
not easy; 31% of users
surveyed identified a phish as real, and 19% identified a
real message as a
phish. Fraud false positives are particularly bad.
Filters can be developed based on words financial institutions are not
likely to use, such as fraud, spoof, suspension, renew, etc. However,
filters cannot rely solely on content to identify phish because
perpetrators are no longer just ripping off corporate logos. Phishes
are now much more sophisticated and closely resemble legitimate
transactional messages.
In order to train filters, sufficient good email must be
provided in order to
offset fraudulent messages. False positives are more likely
without adequate
training. Therefore, one of the first steps is to identify
valid transactional
messages, and then factor in other tricks commonly used by
fraudsters such as
non-standard ports and look-alike domains. Identification
of valid
transactional messages is particularly important because it
allows for a
different set of weighting factors than regular spam filters.
-----------------------------------------------------------------------------------------------------
Bayesian Noise Reduction: Contextual Symmetry Logic
Utilizing Pattern
Consistency Analysis. Research from the DSPAM Project
Jonathan Zdziarski
Some data in an email message should not be taken at face
value because spammers
are injecting "noise" into their messages in order
to evade spam filters. A
token produced by a Bayesian filter can resolve to one disposition
in one
context, but a completely different disposition in another
context. There are
four noise injection tactics utilized by spammers: common
noise, recurring junk
words, arbitrary word lists, and directed word lists. Noise
reduction seeks to
eliminate this out-of-context text. The first step in noise
reduction involves
the production of machine-generated context patterns. The
second step, called
dubbing, removes inconsistencies from token values. Whitepaper
on topic is at
http://bnr.nuclearelephant.com.
-----------------------------------------------------------------------------------------------------
Using Lexigraphical Distancing to Block Spam
Jonathan Oliver
MailFrontier, Inc.
Generally speaking, there are two types of filters: personal and server-side.
Personal filters can be re-trained and can focus on terms
which reflect good email. Server-side filters are less focused
on terms which might reflect good email, and they are also
likely to be only a part of a layered approach. Server-side
filters catch less spam. Consequently, personal filters and
server-side filters represent two different problems.
Humans are good at reading jumbled text, and spammers know
this. There are over
600 trillion variants of "Viagra". Three approaches
to addressing this problem
exist: regular expressions, spell checking and lexigraphical
distancing.
Regular expressions are time-consuming to write, less robust, and not
intuitive, and they cannot catch all variants. Spell checking does
not do well with letters constructed from other elements, e.g., \ /,
and it is not good with split words or words run together.
With lexigraphical distancing, an algorithm estimates the
probability that the
content being inspected is an "edited" version of
the content in question. In
an experiment, the edit distance algorithm caught 27% of the
spam fed to it, and
incorrectly identified spam phrase variants at the rate of
0.19%, which is a
very low false positive rate. Lexigraphical distancing *potentially*
produces
an 11 to 13.5% improvement over "naive" Bayesian
filters, and is very useful in
server-side filtering. It scales, and is computationally feasible.
However,
claims are unverifiable because MailFrontier will not publish
its methodology or
data. MailFrontier uses lexigraphical distancing as a pre-processor.
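
Since MailFrontier's methodology is unpublished, the following is only a generic sketch of the idea, using the classic Levenshtein edit distance with a fixed cut-off rather than the probability estimate described in the talk. The function names and the max_distance value are assumptions for illustration.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character
                            curr[j - 1] + 1,      # insert a character
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def looks_like(word, target, max_distance=2):
    """Flag words within a small edit distance of a known spam term."""
    return edit_distance(word.lower(), target) <= max_distance
```

Under these assumptions, looks_like("V1agra", "viagra") is true because one substitution separates the two strings, while looks_like("Vienna", "viagra") is false because three are needed.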
-----------------------------------------------------------------------------------------------------
Classifier Aggregation
Richard Segal
IBM Spam Research Center
As its name suggests, classifier aggregation combines multiple
classifiers to
predict spam. This approach is more difficult to attack because
multiple
algorithms must be broken. Classifier aggregation catches
more spam with fewer
false positives. It emphasizes each algorithm's strengths
and de-emphasizes
each algorithm's weaknesses.
A robust classifier aggregation model may consist of nine
elements:
(1) SMTP path analysis; (2) user whitelists; (3) user blacklists;
(4) global whitelists; (5) global blacklists; (6) intelligent
renderers; (7) plagiarism detection; (8) Bayesian classifier;
(9) DNA pattern discovery. Plagiarism detection computes the
similarity of a message to documents in a corpus and assigns the
label of the most similar document.
This model is very conservative in order to reduce false positives.
A message is determined to be spam only if all classifiers
identify a message as spam.
Equal weight aggregation offers improvement over Bayesian
filtering, but
Bayesian filtering performs better than maximum or minimum
aggregation. Optimal
weighting can be determined through linear regression analysis.
Dynamic
aggregation adjusts weights in real time. This results in
reduced false
positives and improved performance. Equal weight aggregation
does not adapt
well to environment changes. Linear regression with quadratic
weights is better
than any single classifier and adapts well to environment
changes.
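
The aggregation rules named above can be compared with a small sketch. It assumes each classifier emits a spam score between 0 and 1; the function names are invented for this example, and any weights would in practice come from a fit such as the linear regression the speaker described.

```python
def aggregate_equal(scores):
    """Equal-weight aggregation: the plain average of the scores."""
    return sum(scores) / len(scores)


def aggregate_weighted(scores, weights):
    """Weighted aggregation, normalised so the result stays in [0, 1]."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def aggregate_min(scores):
    """Minimum aggregation: high only if every classifier scores high."""
    return min(scores)


def aggregate_max(scores):
    """Maximum aggregation: high if any single classifier scores high."""
    return max(scores)
```

For scores of 0.9, 0.2 and 0.7, the equal-weight result is about 0.6, the minimum is 0.2 and the maximum is 0.9. Minimum aggregation corresponds to the conservative rule in which a message is spam only when all classifiers agree.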
Copy of presentation and other information is available at
http://www.research.ibm.com/spam.
-----------------------------------------------------------------------------------------------------
Distinctions Between Message Authentication and User
Authentication
Jim Fenton
Cisco Systems, Inc.
Phishing and fraud are a problem, and message signatures
are the proposed
solution. Message signatures need to be easy to deploy; in
particular, they
need to overcome traditional public key infrastructure problems.
Message
signatures need the following features: moderate security; the
ability to quickly revoke authorization; validity for at least the
message transit time (approximately one week); the ability to survive
common mail transit behavior; and minimal dependence on third parties
for trust.
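Messages are signed using public-key cryptography: the signer uses a
private key, and recipients verify the signature with the
corresponding public key. Two steps are involved:
(1) authenticate message; (2) authorize
sender. Why not use existing protocols such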
as S/MIME or PGP? Signature semantics are different; they
indicate identity,
not authorization. Also, use of a domain name is not always
under that domain's
control. Message signatures need to bind only to some message
headers, not all.
Messages need to be transparent to non-signature aware recipients.
Finally, key
management does not scale.
A message signature is placed by the domain owner and is
revocable by the domain
owner. It has only moderate security. A message signature
demonstrates
authorization to send mail. User signatures are placed by
the user and are
revocable by the user. They potentially have very high security.
What about anonymity? Signatures should be optional. Domains
may sign messages
without identifying the individual responsible for the message.
Signing
policies facilitate selective use of signatures. Keys can
be kept in DNS. This
approach makes a conscious trade-off of security for deployability.
-----------------------------------------------------------------------------------------------------
Looking Beyond Blocking
Rui Dai
Georgia Institute of Technology
Can we do better than "Never buy from spammers!"?
There is no effective way to
refine email lists. Current practices include:
(1) spammers send greater quantities of email messages (flooding);
(2) mail service providers collect information about deliverability;
(3) bond systems for trusted senders have emerged; and (4)
spammers are purchasing "refined" mailing lists
from companies such as LeadPlex in order to improve deliverability.
Currently, a sender sends a message and the receiver may simply
drop it; the sender does not know what really happened
to the message.
To address this problem, establish arbiters. Arbiters refine
lists and send
messages. Arbiters maintain local cached filters which are
updated by input
from recipients. Arbiter provides refined list to sender.
This is not intended
as a final solution to the spam problem, but, rather a partial
solution.
-----------------------------------------------------------------------------------------------------
Leveraging Social Networks to Fight Spam, or, the
Physics of Email Messages
Oscar Boykin
University of Florida
Email headers identify social networks. Using the "from"
and "to" fields of
email messages, one can generate "subgraphs" illustrating
social networks. Spam
and ham subgraphs look different. Clustering coefficient is
derived from
triangles and wedges in subgraph. Social networks are formed
by social rules.
Because spammers do not follow the same rules, this technique
works. This
approach can be used for bootstrapping: classify fraction
of mail and train
filters. Bootstrapping results are okay, but there is room
for improvement.
Spammers could respond to this strategy either by using bcc
or by not using
multiple recipients. Email graphs can generate whitelists
and blacklists.
Future directions: consider more graph parameters; include
sent mail; look at
graphs of message IDs; better integration with content learning
based systems.
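
As a rough illustration of the clustering coefficient mentioned above, the sketch below builds a graph from sender and recipient addresses and computes three times the number of triangles divided by the number of wedges (paths of two edges). Linking only senders to recipients is a simplifying assumption, and the function names and addresses are invented for the example; the speaker's actual graph construction may differ.

```python
from itertools import combinations


def build_graph(messages):
    """Build an undirected graph from (sender, recipients) pairs."""
    graph = {}
    for sender, recipients in messages:
        for rcpt in recipients:
            if rcpt == sender:
                continue
            graph.setdefault(sender, set()).add(rcpt)
            graph.setdefault(rcpt, set()).add(sender)
    return graph


def clustering_coefficient(graph):
    """Return 3 * triangles / wedges, or 0.0 when there are no wedges."""
    wedges = 0
    closed = 0  # wedges whose two end points are also linked
    for neighbours in graph.values():
        for a, b in combinations(sorted(neighbours), 2):
            wedges += 1
            if b in graph[a]:
                closed += 1
    # Each triangle is counted once per corner, so closed == 3 * triangles.
    return closed / wedges if wedges else 0.0
```

Three colleagues who all write to one another form a triangle and score 1.0, while a single sender mailing three strangers forms a star and scores 0.0.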
-----------------------------------------------------------------------------------------------------
Spam Kings
Brian McWilliams
Filters will never stop spam. Speaker disagrees with Paul
Graham who has stated
that if filters are good enough spammers will stop committing
email abuse. He
declared that "furtive shoppers like spam." In a
survey, 41% in the US admitted
to making a purchase as a result of spam they had received;
66% in Brazil had
made a purchase as a result of receiving spam. Some consumers
will respond to
spam even when it has been sent to the spam folder. This was
determined using
Google search and Yahoo! referrer logs. Simply segregating
spam will not stop
spammers. AOL and Hotmail disable hyperlinks in messages in
spam folder.
How paternalistic do ISPs want to be? ISPs advise users to
check spam folder
for false positives. Spam folder may become like circular
inserts in
newspapers; consumers still peruse segregated content. Spammers
are successful
because they are adaptive and opportunistic, not because they
are smart.
-----------------------------------------------------------------------------------------------------
People and Spam
John Graham-Cumming
Electric Cloud
Survey included 4,691 participants. 94.5% were men and 4%
were women. 42% had
used email for more than ten years, and 50% had used email
between six and ten
years. Vast majority of participants were email and system
administrators or
programmers. 96% believe that the spam problem will never
go away. 9% believe
filtering makes problem worse. Average user received 413 email
messages per
week; average user received 318 spam messages per week. 1%
made a purchase as a
result of spam. 16% have made a purchase as a result of physical
direct mail.
The survey showed that even highly experienced users will
make a purchase as a
result of spam!
Users spent an average of nine minutes per day dealing with
spam. Without spam
filters, 23% would stop or consider stopping use of email.
61% would have a
serious problem. 10% feel that spam violates their private
space; women were
three times more likely to feel that spam violated their privacy
than men. 42%
say they have lost mail as a result of spam filters.
Human accuracy was tested. Only "from" and "subject"
fields were visible.
Sample size was 260 messages and consisted of 80% spam. Participants
achieved
99.46% accuracy in correctly identifying spam vs. ham. They
achieved 100%
accuracy when one message was presented at a time. Test subjects
were four
times as likely to think a message was spam than vice versa.
Problems with survey: data is skewed heavily towards men
18-45 who are
experienced computer users; no attempt was made to provide
statistically
meaningful data for the general population; and the survey
was not conducted by
a professional pollster.
-----------------------------------------------------------------------------------------------------
Report of the French Government's Approach to Fighting
Spam
Eric Walter, Constance Bommelaer
Services du Premier Ministre
The French government has adopted a multi-faceted approach
to fighting spam. In
order to initiate a public/private dialogue, a working group
was established
consisting of representatives from government, industry and
the public. One of
the objectives of this working group is to develop a definition
of spam. The
French government recognizes the need for international cooperation
in
addressing the spam problem. Other elements include: producing
technical
documents and increasing public awareness. Adoption of effective
legislation
following the EU directive which mandates an opt-in regime
is also part of the
French government's approach. Such legislation would also
prohibit the
concealing or disguising of the source, and would require
the inclusion of a
valid return address as well as an opt-out in each message.
Registration of
senders will also be required.
The French government has also launched a spam database project
which is
intended to be a source for international action against spammers.
Messages in
the database are analyzed, and if they are determined to be
in violation of the
law, they are referred to the appropriate law enforcement
agency. There have
been no prosecutions yet.
The speakers noted that approximately 80% of the spam received
by French
citizens is in English.
-----------------------------------------------------------------------------------------------------
Project Honeypot
Matthew Prince
John Marshall School of Law
According to a Pew study, harvesting is the primary way spammers
obtain email
addresses. Although CAN-SPAM prohibits harvesting, there is
no data regarding
enforcement. Project Honeypot has spam traps in thousands
of domains over 62
countries covering every continent, with millions of
legitimate-appearing user
names. Project participants publish terms of service which
prohibit harvesting.
These terms of service are intended to serve as a binding
contract with site
visitors.
One objective of Project Honeypot is to track spammer behavior
and force them to
play defense. The project seeks to use spammer tactics against
spammers.
Another primary objective is to develop chains of evidence
to be used in
prosecution. In addition, data is freely shared with researchers.
4% of honeypot traffic is from harvesters. Harvesting machines
are generally
unprotected; they are much closer to the source of spam messages.
Proxies
generally are not used. On average, approximately 11 days
elapse between the
time an address is harvested and the time the first spam message
is received.
The shortest time is 23 seconds. Phishes are fast, but spammers
are slow.
Harvesting is concentrated in North America and Europe. Project
Honeypot is
now developing software to stop harvesters.
Additional information about Project Honeypot is available
at
www.projecthoneypot.org.
-----------------------------------------------------------------------------------------------------
You've Got Jail! Some First Hand Observations from
the Jeremy Jaynes Spam Trial
Jon Praed
Internet Law Group
Jaynes, who commonly used Gavin Stubblefield as an alias,
is a 29-year-old
resident of Raleigh, NC. The trial lasted seven days. Jaynes
distributed spam
for Penny Stock Picker, Internet History Eraser, and FedEx
Refund Processor.
AOL reported receiving three million complaints. In just one
day AOL received
493,181 complaints concerning messages which originated from
1,862 different IP
addresses.
Virginia law prohibits falsification of routing information
*in any manner*.
Violation of law becomes a felony if more than 100,000 messages
are sent over a
one month period or one million messages are sent over a one
year period.
Jaynes was arrested on 12/11/03. Arresting officers found
lists of addresses on
computers as well as a list of anti-spammers. They also found
records of spam
deliveries. Handwritten notes were found in the trash. At
trial, Jaynes
stipulated that the notes were in his handwriting in order
to avoid detailed
discussion of notes by prosecution. Jaynes' merchant account
records showed
10,910 transactions per month, for a monthly income of approximately
$440,000.
Jaynes went offshore shortly after 7/1/03.
Information on ARIN and domain registration applications
was false, and this was
deemed falsification of routing information. UPS Stores were
helpful to
prosecution. Postal form 1583 required by law: merchant required
to verify
identity. No records were found for purported domain registrant
at that
address.
Prosecution faced difficult problem of proving that the email
messages sent by Jaynes were unsolicited. Relying on recipient
testimony was not viable because it was burdensome, indirect
and constituted hearsay because a single recipient could provide
no proof of multiple recipients. Therefore, prosecution turned
to expert testimony as used in drug trials. John Levine testified
that the messages sent by Jaynes were "not consistent
with solicited email practices." In support of this conclusion,
Levine cited (1) inconsistent from lines; (2) multiple IP
addresses; and (3) Belize domain names. Levine was untouchable
on cross examination.
The defense had no factual defense, and instead relied on
constitutional
challenges, citing the First Amendment and the Commerce Clause.
The defense
also raised jurisdictional issues and disputed the volume
of email messages sent
by Jaynes. After a day and a half of deliberation, the jury
found Jaynes guilty
of three felony counts. The Jaynes trial showed that juries
do understand the
technology, and demonstrated the importance of search and
seizure. The trial
also showed that proving a message was "unsolicited" was difficult,
but could be done through expert testimony.
-----------------------------------------------------------------------------------------------------
Standardized Spam Filter Evaluation
Gordon Cormack
University of Waterloo
Standardized evaluation will aid in answering questions such
as (1) is spam filtering a viable approach?; (2) what are
the costs, risks and benefits of filter use?; (3) which filter
should be used?; (4) how can a better filter be made? The
TREC project (Text REtrieval Conference) of the National Institute
of Standards and Technology sponsored a conference whose
goal is to increase the availability of appropriate spam filter
evaluation techniques.
Generally speaking, spam filter usage involves the following
steps:
(1) filter classifies mail; (2) human addressee performs
triage on ham file; (3) ham is
read; (4) occasionally search for misclassified ham in spam
folder; (5) report misclassified email to filter. The first
step of filter evaluation involves simulating, i.e., replaying
incoming mail stream. In order to standardize the evaluation,
the messages in this stream must have a single recipient,
must be in chronological order, and must consist of the full
message including all headers. Second, the behavior of an
idealized user is simulated. This assumes that the user immediately
reports all ham misclassified as spam and all spam misclassified
as ham.
The speaker proposed a standardized filter interface which consisted
of three commands: initialize, classify and train. Most filters
compute spamminess. If spamminess is greater than a threshold value,
then the message is classified as spam. Threshold value is arbitrary.
Some ham is more likely to be misclassified. Some ham is more likely
to be missed. Some ham is more valuable and consequences of
non-receipt may vary dramatically.
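
The three-command interface and the replay of a mail stream with an idealized user can be sketched as follows. The KeywordFilter class is a deliberately naive stand-in invented for this example, not a filter discussed by the speaker, and the method signatures are assumptions.

```python
class KeywordFilter:
    """Toy filter exposing the three commands: initialize, classify, train."""

    def initialize(self):
        self.spam_words = set()
        self.ham_words = set()

    def classify(self, message):
        words = set(message.lower().split())
        # "Spamminess" here is spam-word hits minus ham-word hits.
        spamminess = len(words & self.spam_words) - len(words & self.ham_words)
        return "spam" if spamminess > 0 else "ham"

    def train(self, message, label):
        words = set(message.lower().split())
        if label == "spam":
            self.spam_words |= words - self.ham_words
        else:
            self.ham_words |= words
            self.spam_words -= words


def evaluate(filter_, stream):
    """Replay (message, true_label) pairs in order and count the errors."""
    filter_.initialize()
    errors = {"ham_as_spam": 0, "spam_as_ham": 0}
    for message, truth in stream:
        if filter_.classify(message) != truth:
            key = "ham_as_spam" if truth == "ham" else "spam_as_ham"
            errors[key] += 1
            # The idealized user reports every misclassification at once.
            filter_.train(message, truth)
    return errors
```

Because the stream is replayed in chronological order and feedback arrives immediately, any filter that implements the same three commands can be run against the same stream and compared on the same error counts.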
For additional information on speaker's research see
http://plg.uwaterloo.ca/~gvcormac/spam.
For additional information on TREC see
http://trec.nist.gov.
-----------------------------------------------------------------------------------------------------
Mail Avenger
David Mazieres
New York University
As a consequence of email design and architecture, any random
host may send
email, but only well-established servers can receive email.
It is time to
revisit email's design goals. Email should still be reliable,
although filters
often interfere with this goal. However, servers should no
longer accept email
from weakly connected, ephemeral clients. Mail Avenger has
the following
features:
(1) email is transmitted using SMTP protocol; (2)
recipients are put in control of SMTP responses; (3) users
are given extension addresses; (4) new policies are easy to
implement; and (5) uses existing MTA. It is possible to set
up scripts so that only messages from certain mailing lists
are accepted at certain addresses. It is also possible to
graylist any mail coming from a host running a Windows operating
system.
Spam should be filtered before responsibility for message
is accepted. A server
should not accept mail if the sender will not accept bounces.
There are several
problems with adoption of Mail Avenger:
(1) there has been no scalability testing; (2) there
are no statistics on false positives; (3) the callback conducted
by Mail Avenger might be interpreted as an address harvesting
attempt.
For additional information see http://www.mailavenger.org.
|