DBSCAN. Detecting spam emails using fuzzy hashing - python

I have a task: detecting spam emails using fuzzy hashes. After watching a bunch of videos and reading at least as many articles, I arrived at the following algorithm.
First I read the dataset, process it, and delete the NaN values and duplicates.
import pandas as pd
import numpy as np
import ppdeep as pp
from sklearn.cluster import DBSCAN
init_data = pd.read_csv('./spam2.csv')
data = init_data[['Label', 'Body']]
data.dropna(inplace=True)
data = data.rename(columns={'Body': 'message', 'Label': 'class'})
data = data.reindex(columns=['class', 'message'])
data = data.drop_duplicates(subset='message', keep="first")
In the next step I process the messages themselves; I'm not sure this is really necessary, although if I don't skip this stage, more emails end up with the same fuzzy hash, which makes sense.
import re
import html
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
# nltk.download('stopwords')
# nltk.download('punkt')
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URLS_RE = re.compile(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+")
PUNCTUATION_RE = re.compile(r'[!"\#\$%\&\'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~]')
NOT_LETTERS_OR_SPACE_RE = re.compile("[^A-Za-z ]")
REPEATING_LETTERS_RE = re.compile(r'([a-z])\1{2,}')
def prepare_message(message):
    # Convert to lower case
    text = message.lower()
    # Convert HTML codes into characters
    text = html.unescape(text)
    # Remove email addresses
    text = re.sub(EMAIL_RE, ' ', text)
    # Remove URLs
    text = re.sub(URLS_RE, ' ', text)
    # Remove all punctuation symbols
    text = re.sub(PUNCTUATION_RE, ' ', text)
    # Remove everything except letters and spaces
    text = re.sub(NOT_LETTERS_OR_SPACE_RE, '', text)
    # Collapse repeated letters
    text = re.sub(REPEATING_LETTERS_RE, r'\1', text)
    # Split by whitespace and stem with PorterStemmer
    ps = PorterStemmer()
    return ' '.join([ps.stem(word) for word in text.split()])

data['message'] = data['message'].apply(prepare_message)
After this stage I get the processed words separated by spaces, without numbers or any other characters.
Next, I add a column with a fuzzy hash of each message.
data['hash'] = data['message'].apply(pp.hash)
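For reference, pp.hash produces an ssdeep-style fuzzy hash string and pp.compare returns a similarity score between 0 and 100; a quick check on two made-up, near-identical strings (illustration only, not data from the dataset) shows this:
# strings are repeated so the input is long enough for a meaningful fuzzy hash
msg_a = "congratul you win a free prize click here to claim it now " * 10
msg_b = "congratul you win a free prize click there to claim it now " * 10
h_a, h_b = pp.hash(msg_a), pp.hash(msg_b)
print(h_a)                   # an ssdeep-style string: blocksize:chunk:double_chunk
print(pp.compare(h_a, h_b))  # high score for near-duplicates, 0 for unrelated text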
Next, I split the dataset into training and test samples (although, as I found out, DBSCAN cannot predict labels for new values; new points can only be added to the distance matrix and the model refit, because DBSCAN has no fixed cluster centers like, for example, KMeans does).
from sklearn.model_selection import train_test_split

x = data['hash']
y = data['class']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, stratify=y)
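Since DBSCAN cannot predict labels for unseen points, my current idea for the test set is only a sketch (the helper below and its similarity threshold are my own invention, not from any tutorial): after fitting, compare each held-out hash against the training core samples and take the cluster of the most similar one, falling back to noise.
def assign_to_cluster(test_hash, core_hashes, core_labels, min_similarity=70):
    # similarity of the unseen hash to every core sample from the training fit
    sims = [pp.compare(test_hash, h) for h in core_hashes]
    best = int(np.argmax(sims))
    # arbitrary threshold: anything less similar than this is treated as noise (-1)
    return core_labels[best] if sims[best] >= min_similarity else -1

# after fitting DBSCAN on the training hashes:
# core_hashes = [list(x_train)[i] for i in db.core_sample_indices_]
# core_labels = db.labels_[db.core_sample_indices_]
# y_pred = [assign_to_cluster(h, core_hashes, core_labels) for h in x_test]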
In the next step, I calculate the distance matrix between each pair of messages.
import numba
@numba.jit(parallel=True, cache=True, fastmath=True)
def calc_distances(x_train, x):
    count = 0
    n = len(x_train)
    max_count = (n**2 - n) // 2
    for i in range(n):
        for j in range(i):
            x[i, j] = pp.compare(x_train[i], x_train[j])
            x[j, i] = x[i, j]
            count += 1
            print(f"\r{count}/{max_count}", end='')

n = len(x_train)
distances = np.zeros((n, n))
calc_distances(list(x_train), distances)
np.fill_diagonal(distances, 100.0)
Here I couldn't come up with or find any way to speed up the calculation of the matrix, so I sit waiting for about 30 minutes, or about 70 with the entire dataset :(
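One speed-up I am considering, but have not benchmarked (sketch only): pp.compare is pure Python, so numba cannot really compile it, but each row of the matrix is independent and can be farmed out to worker processes with joblib.
from joblib import Parallel, delayed

def similarity_row(i, hashes):
    # similarities (0-100) of hash i against all earlier hashes (lower triangle)
    return [pp.compare(hashes[i], hashes[j]) for j in range(i)]

hashes = list(x_train)
n = len(hashes)
rows = Parallel(n_jobs=-1)(delayed(similarity_row)(i, hashes) for i in range(n))
distances = np.zeros((n, n))
for i, row in enumerate(rows):
    for j, value in enumerate(row):
        distances[i, j] = value
        distances[j, i] = value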
The final stage. Clustering.
db = DBSCAN(eps=0.5, min_samples=2, metric='precomputed').fit(distances)
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
At this stage I'm having problems: no matter how I change the eps and min_samples parameters, I always get 1 cluster and 0 noise points in the output.
Please tell me where exactly I am going wrong. What is the problem? Should I use some other algorithm besides DBSCAN, or something else entirely? There is very little information on the Internet about spam detection using fuzzy hashes specifically.
I have tried several options: leaving the messages unprocessed, exactly as they are in the dataset; using fuzz.ratio instead of pp.compare for the comparison; and changing the eps and min_samples parameters. I still get 1 cluster in the output.
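Update (just a guess on my part, not verified): pp.compare returns a similarity in [0, 100], while DBSCAN with metric='precomputed' expects distances where 0 means identical, so dissimilar pairs (similarity 0) all end up within any small eps. If that is the issue, converting the matrix first would look roughly like this (the eps value here is only a placeholder on the 0-100 distance scale):
dist_matrix = 100.0 - distances  # similarities -> distances; the diagonal (100 above) becomes 0
db = DBSCAN(eps=30, min_samples=2, metric='precomputed').fit(dist_matrix)
labels = db.labels_
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", list(labels).count(-1))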

Related

Clustering similar words with a pretrained word2vec and K means

I am quite new to programming and word2vec models, and I would really appreciate your help.
I have performed a BoW analysis on my data and have obtained the top 100 most predictive words. Now, I want to run those words through a pretrained word2vec model and organize them into different clusters using K means. The main goal is to establish the most predictive clusters. However, when I do the clustering, something goes wrong as my model gives me clusters containing single letters (though my 100 words are actual words and my data does not have single letters).
Below is a snippet of the code I am using (note that I have followed a tutorial with similar steps):
text = np.take(vectorizer2.get_feature_names(), pos_class_prob_sorted[:100]) #these are the extracted 100 predictive words from the BoW analysis
text = str(text)
tokenized_docs = word_tokenize(text)
tokenized_docs = list(tokenized_docs)
list_of_docs = text
#Trying to make clusters
def vectorize(list_of_docs, model=wv):
    """Generate vectors for a list of documents using a word embedding.

    Args:
        list_of_docs: List of documents
        model: Gensim's word embedding

    Returns:
        List of document vectors
    """
    features = []
    for tokens in list_of_docs:
        zero_vector = np.zeros(model.vector_size)
        vectors = []
        for token in tokens:
            if token in wv:
                try:
                    vectors.append(wv[token])
                except KeyError:
                    continue
        if vectors:
            vectors = np.asarray(vectors)
            avg_vec = vectors.mean(axis=0)
            features.append(avg_vec)
        else:
            features.append(zero_vector)
    return features
vectorized_docs = vectorize(tokenized_docs,model=wv)
len(vectorized_docs), len(vectorized_docs[0])
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_samples, silhouette_score

def mbkmeans_clusters(
    X,
    k,
    mb,
    print_silhouette_values,
):
    """Generate clusters and print Silhouette metrics using MBKmeans.

    Args:
        X: Matrix of features.
        k: Number of clusters.
        mb: Size of mini-batches.
        print_silhouette_values: Print silhouette values per cluster.

    Returns:
        Trained clustering model and labels based on X.
    """
    km = MiniBatchKMeans(n_clusters=k, batch_size=mb).fit(X)
    print(f"For n_clusters = {k}")
    print(f"Silhouette coefficient: {silhouette_score(X, km.labels_):0.2f}")
    print(f"Inertia: {km.inertia_}")
    if print_silhouette_values:
        sample_silhouette_values = silhouette_samples(X, km.labels_)
        print("Silhouette values:")
        silhouette_values = []
        for i in range(k):
            cluster_silhouette_values = sample_silhouette_values[km.labels_ == i]
            silhouette_values.append(
                (
                    i,
                    cluster_silhouette_values.shape[0],
                    cluster_silhouette_values.mean(),
                    cluster_silhouette_values.min(),
                    cluster_silhouette_values.max(),
                )
            )
        silhouette_values = sorted(
            silhouette_values, key=lambda tup: tup[2], reverse=True
        )
        for s in silhouette_values:
            print(
                f"    Cluster {s[0]}: Size:{s[1]} | Avg:{s[2]:.2f} | Min:{s[3]:.2f} | Max: {s[4]:.2f}"
            )
    return km, km.labels_

clustering, cluster_labels = mbkmeans_clusters(
    X=vectorized_docs,
    k=10,
    mb=500,
    print_silhouette_values=True,
)
df_clusters = pd.DataFrame({
    "tokens": [" ".join(text) for text in tokenized_docs],
    "cluster": cluster_labels
})
print("Most representative terms per cluster (based on centroids):")
for i in range(10):
    tokens_per_cluster = ""
    most_representative = wv.most_similar(positive=[clustering.cluster_centers_[i]], topn=5)
    for t in most_representative:
        tokens_per_cluster += f"{t[0]} "
    print(f"Cluster {i}: {tokens_per_cluster}")
Can you please help me determine where I am going wrong with this code? It seems to me that the model does not actually take those top 100 words into account, and that they are not going through the word2vec model at all.
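For comparison, here is a stripped-down version I am considering (my own sketch, assuming wv is the pretrained KeyedVectors and that vectorizer2 and pos_class_prob_sorted come from the BoW step): it feeds the 100 words to the embedding directly as a list of strings, instead of tokenizing the str() of the array.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# keep the 100 predictive words as a plain list of strings
words = list(np.take(vectorizer2.get_feature_names(), pos_class_prob_sorted[:100]))
# look each word up in the embedding directly; skip out-of-vocabulary words
in_vocab = [w for w in words if w in wv]
word_vectors = np.array([wv[w] for w in in_vocab])
km = MiniBatchKMeans(n_clusters=10, batch_size=500).fit(word_vectors)
for word, label in sorted(zip(in_vocab, km.labels_), key=lambda t: t[1]):
    print(label, word)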

NotFittedError: CountVectorizer - Vocabulary wasn't fitted. while performing sentiment analysis

While performing sentiment analysis using the data from
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
(the dataset contains 25K training and 25K test reviews, 12.5K positive and 12.5K negative in each), I keep getting:
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
Code (required libraries and variable names are initialized separately):
To create the training and testing data:
import glob
import os
import numpy as np
def load_texts_labels_from_folders(path, folders):
    texts, labels = [], []
    for idx, label in enumerate(folders):
        for fname in glob.glob(os.path.join(path, label, '*.*')):
            texts.append(open(fname, 'r', encoding="utf8").read())
            labels.append(idx)
    # stored as np.int8 to save space
    return texts, np.array(labels).astype(np.int8)

trn, trn_y = load_texts_labels_from_folders(f'{PATH}train', names)
val, val_y = load_texts_labels_from_folders(f'{PATH}test', names)
len(trn), len(trn_y), len(val), len(val_y)
len(trn_y[trn_y==1]), len(val_y[val_y==1])
np.unique(trn_y)
Count Vectorization -
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
# create term document matrix
veczr = CountVectorizer(tokenizer=tokenize)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
veczr = CountVectorizer(tokenizer=tokenize,ngram_range=(1,3), min_df=1,max_features=80000)
trn_term_doc
trn_term_doc[5] #83 stored elements
w0 = set([o.lower() for o in trn[5].split(' ')]); w0
len(w0)
vocab = loaded_vectorizer.get_feature_names()
print(len(vocab))
vocab[5000:5005]
Here I get the error:
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
vocab = loaded_vectorizer.get_feature_names()
loaded_vectorizer is not defined anywhere in this code, so it's not surprising that it isn't fitted.
Also, why do you initialize veczr twice? Apparently you don't use it the second time.
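A rough sketch of what was probably intended (an assumption on my side; the joblib file name is made up): either read the vocabulary from the vectorizer instance you actually fitted, or persist that instance and load it back under the name loaded_vectorizer.
import joblib

vocab = veczr.get_feature_names()  # must be the instance you called fit_transform on
# (on scikit-learn >= 1.0, get_feature_names_out() is the preferred spelling)
print(len(vocab))
print(vocab[5000:5005])

# or, if loading a persisted vectorizer was the intent:
joblib.dump(veczr, 'count_vectorizer.joblib')
loaded_vectorizer = joblib.load('count_vectorizer.joblib')
print(loaded_vectorizer.get_feature_names()[5000:5005])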

Get values from K-Means clusters using dataframe

I have this dataframe (text_df):
There are 10 different authors with 13834 rows of text.
I then created a bag of words and used a TfidfVectorizer like so:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True
                          )
X = tfidf_v.fit_transform(corpus).toarray()  # corpus --> bag of words
y = text_df.iloc[:,1].values
Shape of X is (13834,2701)
I decided to use 7 clusters for KMeans:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=7,random_state=42)
I'd like to extract the authors of the texts in each cluster to see if the authors are consistently grouped into the same cluster. Not sure about the best way to go about this. Thanks!
Update:
Trying to visualize the author count per cluster using a nested dictionary like so:
author_cluster = {}
for i in range(len(y_kmeans)):
    # check 20 random predictions
    j = np.random.randint(0, 13833, 1)[0]
    if y_kmeans[j] not in author_cluster:
        author_cluster[y_kmeans[j]] = {}
    if y[j] not in author_cluster[y_kmeans[j]]:
        author_cluster[y_kmeans[j]][y[j]] = 1
    else:
        author_cluster[y_kmeans[j]][y[j]] += 1
Output:
There should be a larger count per cluster, and probably more than one author per cluster. I'd like to use all of the predictions to get a more accurate count instead of using a subset, but I'm open to alternative solutions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True
                          )
X = tfidf_v.fit_transform(corpus)  # I removed .toarray() - not sure why it was there except maybe for print debugging?
y = text_df.iloc[:,1].values
km = KMeans(n_clusters=7, random_state=42)
model = km.fit(X)
result = model.predict(X)
for i in range(20):
    # check 20 random predictions
    container = np.random.randint(low=0, high=13833, size=1)
    j = container[0]
    print(f'Author {y[j]} wrote {X[j]} and was put in cluster {result[j]}')
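If you want counts over every document rather than a 20-row sample, a cross-tabulation of author against predicted cluster is probably the simplest option (sketch, reusing y and result from the snippet above):
import pandas as pd

author_per_cluster = pd.crosstab(pd.Series(y, name='author'),
                                 pd.Series(result, name='cluster'))
print(author_per_cluster)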

Pattern recognition with Pybrain

Is there a method for training pybrain to recognize multiple patterns within a single neural net? For example, I've added several permutations of two different patterns:
First pattern:
(200[1-9], 200[1-9]),(400[1-9],400[1-9])
Second pattern:
(900[1-9], 900[1-9]),(100[1-9],100[1-9])
Then for my unsupervised data set I added (90002, 90009), for which I was hoping it would return [100[1-9],100[1-9]] (the second pattern); however, it returns [25084, 25084]. I realize that it's trying to find the best value given ALL the inputs, but I'm trying to have it distinguish certain patterns within the set, if that makes sense.
This is the example I'm working from:
Request for example: Recurrent neural network for predicting next value in a sequence
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import SupervisedDataSet,UnsupervisedDataSet
from pybrain.structure import LinearLayer
from pybrain.datasets import ClassificationDataSet
from pybrain.structure.modules.sigmoidlayer import SigmoidLayer
import random
ds = ClassificationDataSet(2, 1)
tng_dataset_size = 1000
unseen_dataset_size = 100
print 'training dataset size is ', tng_dataset_size
print 'unseen dataset size is ', unseen_dataset_size
print 'adding data..'
for x in range(tng_dataset_size):
    rand1 = random.randint(1,9)
    rand2 = random.randint(1,9)
    pattern_one_0 = int('2000'+str(rand1))
    pattern_one_1 = int('2000'+str(rand2))
    pattern_two_0 = int('9000'+str(rand1))
    pattern_two_1 = int('9000'+str(rand2))
    ds.addSample((pattern_one_0,pattern_one_1),(0))  # pattern 1, maps to 0
    ds.addSample((pattern_two_0,pattern_two_1),(1))  # pattern 2, maps to 1
unsupervised_results = []
net = buildNetwork(2, 1, 1, outclass=LinearLayer,bias=True, recurrent=True)
print 'training ...'
trainer = BackpropTrainer(net, ds)
trainer.trainEpochs(500)
ts = UnsupervisedDataSet(2,)
print 'adding pattern 2 to unseen data'
for x in xrange(unseen_dataset_size):
    pattern_two_0 = int('9000'+str(rand1))
    pattern_two_1 = int('9000'+str(rand1))
    ts.addSample((pattern_two_0, pattern_two_1))  # adding first part of pattern 2 to unseen data
    a = [int(i) for i in net.activateOnDataset(ts)[0]]  # should map to 1
    unsupervised_results.append(a[0])
print 'total hits for pattern 1 ', unsupervised_results.count(0)
print 'total hits for pattern 2 ', unsupervised_results.count(1)
[[EDIT]] added categorical variable and ClassificationDataSet.
[[EDIT 1]] added larger training set and unseen set
Yes, there is. The problem here is the representation you are choosing. You are training the network to output real numbers, so your NN is a function that approximates to a certain degree the function you sampled and provided in the dataset. Hence the result of some value between 10000 and 40000.
It looks more like you are looking for a classifier.
Given your description, I am assuming you have a clearly defined set of patterns that you are looking for. Then you must map your patterns to a categorical variable. For instance, the pattern 1 you mention, (200[1-9], 200[1-9]),(400[1-9],400[1-9]), would be 0, pattern 2 would be 1, and so on.
Then you train the network to output the class (0, 1, ...) to which the input pattern belongs.
Arguably, given the structure of your patterns, rule-based classification is probably more appropriate than ANNs.
Concerning the amount of data, you need much more of it. Typically, the most basic approach is to split the dataset into two groups (70-30, for instance). You use 70% of the samples for training, and the remaining 30% as unseen data (test data) to assess the generalization/over-fitting of the model. You might want to read about cross-validation once you get the basics running.
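A minimal sketch of that 70/30 split using the ClassificationDataSet built above (I have not run this against your exact setup; splitWithProportion returns the first set with the given share of the samples):
trndata, tstdata = ds.splitWithProportion(0.70)  # 70% for training, 30% held out as unseen data
trainer = BackpropTrainer(net, trndata)
trainer.trainEpochs(500)
print 'training samples: ', len(trndata), ' test samples: ', len(tstdata)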

doc2vec How to cluster DocvecsArray

I've patched together the following code from examples I've found on the web:
# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
# random
from random import shuffle
# classifier
class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        shuffle(self.sentences)
        return self.sentences

sources = {'test.txt': 'DOCS'}
sentences = LabeledLineSentence(sources)

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(sentences.to_array())
for epoch in range(10):
    model.train(sentences.sentences_perm())

print(model.docvecs)
My test.txt file contains one paragraph per line.
The code runs fine and generates a DocvecsArray for each line of text.
My goal is to have output like this:
cluster 1: [DOC_5,DOC_100,...DOC_N]
cluster 2: [DOC_0,DOC_1,...DOC_N]
I have found the following Answer, but the output is:
cluster 1: [word,word...word]
cluster 2: [word,word...word]
How can I alter the code and get document clusters?
So it looks like you're almost there.
You are outputting a set of vectors. For the sklearn package, you have to put those into a NumPy array; converting with numpy.asarray() (or stacking the list of vectors with numpy.stack()) would probably be best. The documentation for KMeans is really stellar, and it's good across the whole library.
A note for you: I have had much better luck with DBSCAN than KMeans, both of which are contained in the same sklearn library. DBSCAN doesn't require you to specify how many clusters you want to have on the output.
There are well-commented code examples in both links.
In my case I used:
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# collect one inferred vector per document
doc_vecs = []
for doc in docs:
    doc_vecs.append(model.infer_vector(doc.split()))
# creating a matrix from the list of vectors
mat = np.stack(doc_vecs)

# Clustering with KMeans
km_model = KMeans(n_clusters=5)
km_model.fit(mat)
# Get cluster assignment labels
labels = km_model.labels_

# Clustering with DBSCAN
dbscan_model = DBSCAN()
labels = dbscan_model.fit_predict(mat)
Here model is the pre-trained Doc2Vec model. In my case I didn't need to cluster the same documents used in training, but new documents saved in the docs list.
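To get output in the "cluster N: [DOC_0, ...]" shape the question asks for, one way (a sketch, reusing labels and docs from the snippet above) is to group the document indices by their assigned cluster label:
from collections import defaultdict

clusters = defaultdict(list)
for i, label in enumerate(labels):
    clusters[label].append('DOC_%s' % i)  # tag format mirrors the question
for label, tags in sorted(clusters.items()):
    print('cluster %s: %s' % (label, tags))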
