Characterization of strange words for spam mail classification and development of application methods
Title
辞書にない語のスパムメール分類性能解析と応用手法の開発
Characterization of strange words for spam mail classification and development of application methods
Degree
Doctor of Science (博士(理学))
Dissertation Number
創科博甲第126号
(2023-10-11)
Degree Grantors
Yamaguchi University
[kakenhi]15501
grid.268397.1
Abstract
Many mail filtering methods have been proposed, but none has yet achieved perfect filtering. One reason is the influence of modified words that spammers create to slip through mail filters by inserting symbols, spaces, HTML tags, and so on. Examples include “price$ for be$t drug$!”, “priceC I A L I S”, and “<font>se</font>xu<font>al</font>”. Such words are frequently replaced with new strings by changing the combination of symbols, HTML tags, etc.
Mail filtering is a technique that captures trends in the words of training mails (mails received in the past) and applies these trends to the words of test mails (newly received mails). Some of the modified words above appear in both training and test mails and can therefore serve as features of spam mail without any processing. Others appear only in test mails: they have not been learned, so using them requires special processing (e.g., removal of symbols or a search for similar words). However, existing methods do not make this distinction and treat both kinds in the same way.
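The distinction drawn above can be illustrated with a minimal Python sketch. It assumes that mails have already been tokenized into lists of words. The function name `split_by_training_presence` and the sample tokens are hypothetical and are not taken from the thesis.

```python
def split_by_training_presence(train_tokens, test_tokens):
    """Split test-mail tokens into those seen in training and those not seen."""
    train_vocab = set(train_tokens)  # every word observed in training mails
    seen = [t for t in test_tokens if t in train_vocab]
    unseen = [t for t in test_tokens if t not in train_vocab]
    return seen, unseen


# Hypothetical tokens; real input would come from a mail tokenizer.
train = ["price$", "be$t", "meeting", "drug$"]
test = ["price$", "dr*g$", "meeting", "C1AL1S"]

seen, unseen = split_by_training_presence(train, test)
print("learned words:  ", seen)
print("unlearned words:", unseen)
```

Words in the first list can be scored directly from training statistics, while words in the second list have no statistics at all and need separate handling.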
Therefore, to bring the performance of existing methods closer to perfect filtering, we developed a method that separates the modified words above into words that appear in both training and test mails and words that appear only in test mails, and uses each group for mail filtering in its own way.
In this study, we treat the above modified words as “strange words”. Typical examples of such strange words include, in addition to the above, new words included in ham mails, proper nouns used in close relationships, and abbreviations.
The results of this study are as follows.
(1) In order to compare the filtering performance between strange words and other words, filtering experiments were conducted using existing methods with strange words, nouns, verbs, and adjectives. The results showed that the filtering performance of the strange words was the best. This means that strange words have a significant impact on the filtering performance, and we expect to improve the filtering performance of existing methods by developing a new method to utilize strange words.
(2) In order to examine the breakdown of strange words, we counted the number of words that appeared in both training and test mails, and the number of words that appeared only in test mails. The results were compared with those obtained for nouns, verbs and adjectives. We found that there are a significant number of strange words that appear in both training and test mails, but only in one of the groups, i.e., ham or spam mail. Words with this appearance pattern are most useful for mail filtering. On the other hand, we found that there are many strange words that appear only in test mails, i.e., words that cannot be learned. We expect to improve the filtering performance by separating these strange words and developing a new method to use each of them.
(3) For the use of strange words, we developed (A) a method for using words that appear in both training and test mails, and (B) a method for using words that appear only in test mails, respectively.
(A) To examine the breakdown of strange words that appear in both training and test mails, we divided them into two categories and examined their frequency of occurrence: words that appear only in ham mails or only in spam mails, i.e., words whose appearance pattern improves filtering performance, and words that appear in both. The results showed that words with appearance patterns that improve filtering performance tend to appear more frequently than those without such patterns. This means that restricting filtering to words with at least a certain number of occurrences raises the share of words that improve filtering performance. We developed a method to do this, conducted experiments with different threshold values to find the optimal value, and confirmed that setting the threshold around 7 improves filtering performance.
(B) We compared the number of strange words that appear only in the test mails between ham and spam mails, and found that the number tends to be higher in spam mail than in ham mail. In order to utilize this difference for filtering, we proposed a method to set a uniform spam probability for strange words that appear only in the test mails, and attempted to find the optimal spam probability. As a result, setting the spam probability to 0.7 improved the filtering accuracy from 98.2% to 98.9%.
By using (A) and (B) above together, both words that appear in both training and test mails and words that appear only in test mails can be used for mail filtering to increase accuracy.
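How (A) and (B) might fit together can be sketched in Python as follows. This is only an illustration under stated assumptions, not the thesis implementation and not the computation bsfilter actually performs. It assumes the occurrence count is the total number of occurrences in training mails, a Laplace-smoothed spam ratio per word, and a naive log-odds combination. All function names are hypothetical. Only the threshold of 7 and the probability of 0.7 come from the abstract.

```python
import math
from collections import Counter

MIN_COUNT = 7           # threshold reported for (A)
UNSEEN_SPAM_PROB = 0.7  # uniform spam probability reported for (B)


def train_counts(mails):
    """mails: iterable of (tokens, is_spam). Returns (ham, spam) counters."""
    ham, spam = Counter(), Counter()
    for tokens, is_spam in mails:
        (spam if is_spam else ham).update(tokens)
    return ham, spam


def token_spam_prob(token, ham, spam,
                    min_count=MIN_COUNT, unseen_prob=UNSEEN_SPAM_PROB):
    """Spam probability of one word, or None if the word is to be ignored."""
    h, s = ham[token], spam[token]
    total = h + s
    if total == 0:
        return unseen_prob   # (B): word appears only in test mails
    if total <= min_count:
        return None          # (A): count does not exceed the threshold
    return (s + 1) / (total + 2)  # assumed Laplace-smoothed ratio


def spam_score(tokens, ham, spam):
    """Combine per-word probabilities into one score in (0, 1)."""
    log_odds = 0.0
    for token in set(tokens):
        p = token_spam_prob(token, ham, spam)
        if p is None:
            continue
        log_odds += math.log(p) - math.log(1.0 - p)
    log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical training data: one frequent spam word, one frequent ham word.
mails = [(["price$"], True)] * 8 + [(["meeting"], False)] * 8
ham, spam = train_counts(mails)
print(spam_score(["price$", "C1AL1S"], ham, spam))
```

With this structure, a mail containing many unlearned strange words is pushed toward the spam side, while learned words contribute only if they occur often enough in training.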
Mail filtering has been improved and its performance has reached its limit. In order to further improve accuracy, i.e., to approach perfect filtering, a new perspective is needed, and this paper provides one such perspective: the use of strange words.
This paper is organized as follows.
In Chapter 1, we review the background of mail filtering methods and discuss how spammers use strange words to slip through such filters. The purpose and structure of this paper are then presented.
In Chapter 2, we discuss related research and give examples of filtering methods that have been proposed so far.
In Chapter 3, we describe the mail datasets, word handling, and strange words used in this paper. This is followed by an explanation of the ROC curve, which is the measure used to evaluate the filtering performance, and an explanation of scatter plots and box-and-whisker plots.
In Chapter 4, we compare the filtering performance between strange words and other words, and show that strange words have a significant impact on the filtering performance. Furthermore, based on the results of a breakdown of the number of strange words, we discuss the possibility of improving filtering performance by separating words that appear in both training and test mails from those that appear only in the test mails. We will work on this in the next chapters and report the results.
In Chapter 5, we develop a method to use (A) above, i.e., strange words that appear in both training and test mails. From the results of counting the number of words used in the subject and body of each email, we show that the number tends to be smaller for words that degrade the filtering performance. Based on these results, we propose a method that sets a threshold for the number of words used in the subject and body of mails, and uses only those words that exceed the threshold for classification. Experiments are conducted to find the optimal value by varying the threshold, and the effect of this method on performance is reported.
In Chapter 6, we develop a method to use (B) above, i.e., strange words that appear only in the test mails. We compare the number of types of these words in ham and spam mails, and show that the number tends to be larger in spam mails, and that this feature can be used as a bias for detecting spam mails. In this paper, we deal with experiments using bsfilter and develop a method to set spam probabilities uniformly for strange words that appear only in the test mails. After searching for the optimal spam probability, we report that a spam probability of 0.7 greatly improves the filtering performance.
In Chapter 7, we describe the processing flow that combines the methods developed in Chapter 5 and Chapter 6. The paper is then summarized, including future prospects.
Creators
Temma Seiya
Languages
jpn
Resource Type
doctoral thesis
File Version
Version of Record
Access Rights
open access