I am trying to write a script that calculates the similarity of a few documents. I want to do it using LSA. I found the following code and changed it a bit. It takes 3 documents as input and outputs a 3x3 matrix with the similarities between them. I want to do the same similarity calculation, but only with the sklearn library. Is that possible?
from numpy import zeros
from scipy.linalg import svd
from math import log
from numpy import asarray, sum
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
titles = [doc1,doc2,doc3]
ignorechars = ''',:'!'''
class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords.words('english')
        self.ignorechars = ignorechars
        self.wdict = {}
        self.dcount = 0

    def parse(self, doc):
        words = doc.split()
        for w in words:
            w = w.lower()
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1

    def build(self):
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i, d] += 1

    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)
        return -1 * self.Vt

    def TFIDF(self):
        WordsPerDoc = sum(self.A, axis=0)
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i, j] = (self.A[i, j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])

mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
a = mylsa.calc()
cosine_similarity(a)
From @ogrisel's answer:
I ran the following code, but my mouth is still open :) When TF-IDF gives at most 80% similarity on two documents with the same subject, this code gives me 99.99%. That's why I think that something is wrong :P
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = [doc1, doc2, doc3]
vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(dataset)
lsa = TruncatedSVD()
X = lsa.fit_transform(X)
X = Normalizer(copy=False).fit_transform(X)
cosine_similarity(X)
You can use the TruncatedSVD transformer from sklearn 0.14+: call fit_transform on your database of documents, then call transform (on the same TruncatedSVD instance) on the query document. You can then compute the cosine similarity of the transformed query documents to the transformed database with sklearn.metrics.pairwise.cosine_similarity, and numpy.argsort the result to find the index of the most similar document.
Note that under the hood, scikit-learn also uses NumPy but in a more efficient way than the snippet you gave (by using the Randomized SVD trick by Halko, Martinsson and Tropp).
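A minimal sketch of that workflow, assuming docs is your document database and query_doc is a new document (both placeholder names). It chains TF-IDF, TruncatedSVD and normalization in a pipeline so the query goes through exactly the same transformations:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

docs = ["first document ...", "second document ...", "third document ..."]  # placeholder database
query_doc = "some new text ..."                                             # placeholder query

# TF-IDF -> truncated SVD (LSA) -> length normalization, as in the snippet above
lsa_pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    TruncatedSVD(n_components=2),  # use e.g. 100 components on a real corpus
    Normalizer(copy=False),
)

X = lsa_pipeline.fit_transform(docs)      # fit on the document database
q = lsa_pipeline.transform([query_doc])   # project the query into the same LSA space

similarities = cosine_similarity(q, X).ravel()
most_similar = np.argsort(similarities)[::-1]  # document indices, most similar first
print(most_similar[0], similarities[most_similar[0]])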
Related
I'm trying to get the cosine similarity between 2 sets of data (of unequal lengths).
The test set contains 4 random, similar images from Google.
The training set contains 1 image (from Google) similar to those in the test set.
Below is the code I'm using; it converts each image to a vector and calculates cosine similarity:
import os
from PIL import Image
from sklearn.metrics.pairwise import cosine_similarity
from img_to_vec import Img2Vec
import numpy as np
test_path = '/Users/Desktop/img_vec/test'
train_path = '/Users/Desktop/img_vec/train'
print("Getting vectors for test images...\n")
img2vec = Img2Vec()
# For each test image, we store the filename and vector as key, value in a dictionary
pics = {}
for file in os.listdir(test_path):
    filename = os.fsdecode(file)
    img = Image.open(os.path.join(test_path, filename))
    vec = img2vec.get_vec(img)
    pics[filename] = vec
# print (pics)

pic_name = {}
for file1 in os.listdir(train_path):
    filename1 = os.fsdecode(file1)
    img1 = Image.open(os.path.join(train_path, filename1))
    vec1 = img2vec.get_vec(img1)
    pic_name[filename1] = vec1
# print(pic_name)

vec1 = np.array([pics])
vec2 = np.array([pic_name])

sims = {}
for key in list(pics.keys()):
    print(key)
    sims[key] = cosine_similarity(vec1[vec2].reshape((1, -1)), vec1[key].reshape((1, -1)))[0][0]

d_view = [(v, k) for k, v in sims.items()]
d_view.sort(reverse=True)
for v, k in d_view:
    print(v, k)
However, I'm unable to resolve the following error:
sims[key] = cosine_similarity(vec1[vec2].reshape((1, -1)), vec1[key].reshape((1, -1)))[0][0]
IndexError: arrays used as indices must be of integer (or boolean) type
I tried to compute cosine similarity in Python manually (using numpy) and by using the specialised library. It doesn't work, and I believe it's an issue with the dtype.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# vectors
a = np.array([1,2,3])
b = np.array([1,1,4])
# manually compute cosine similarity
dot = np.dot(a, b)
norma = np.linalg.norm(a)
normb = np.linalg.norm(b)
cos = dot / (norma * normb)
# use library, operates on sets of vectors
aa = a.reshape(1,3)
ba = b.reshape(1,3)
cos_lib = cosine_similarity(aa, ba)
Any help / guidance / alternative is much appreciated.
vec1 = np.array([pics])
vec2 = np.array([pic_name])
I don't see the need to do this.
Also, on the line where the error occurs, the problem is here:
vec1[vec2].reshape((1, -1))
because you're indexing vec1 with vec2 (an array built from a dict). I suppose you mean to use key instead of vec2.
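One way to rewrite that part is to drop the np.array wrappers and index the dictionaries directly, comparing each test vector against each training vector (a sketch using the variable names from the question):
sims = {}
for test_name, test_vec in pics.items():
    for train_name, train_vec in pic_name.items():
        score = cosine_similarity(
            test_vec.reshape((1, -1)),
            train_vec.reshape((1, -1)),
        )[0][0]
        # keep only the best-matching training image for each test image
        if test_name not in sims or score > sims[test_name][1]:
            sims[test_name] = (train_name, score)

for test_name, (train_name, score) in sorted(sims.items()):
    print(test_name, '->', train_name, score)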
I want to use GSDMM to assign topics to some tweets in my data set. The only examples I found (1 and 2) are not detailed enough. I was wondering if you know of a source (or care enough to make a small example) that shows how GSDMM is implemented using python.
I finally compiled my code for GSDMM and will put it here from scratch for others' use. I have tried to comment on important parts:
# Imports
import random

import numpy as np
import spacy
from gensim.models.phrases import Phraser, Phrases
from gensim.utils import simple_preprocess
from gsdmm import MovieGroupProcess

# spaCy pipeline used for the lemmatization step further down
# (not shown in the original; assumes the en_core_web_sm model is installed)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# data
data = ...
# stop words
stop_words = ...
# turning sentences into words
data_words = []
for doc in data:
    doc = doc.split()
    data_words.append(doc)
# create vocabulary
vocabulary = ...
# Removing stop Words
stop_words.extend(['from', 'rt'])
def remove_stopwords(texts):
    return [
        [
            word
            for word in simple_preprocess(str(doc))
            if word not in stop_words
        ]
        for doc in texts
    ]

data_words_nostops = remove_stopwords(vocabulary)
# building bi-grams
bigram = Phrases(vocabulary, min_count=5, threshold=100)
bigram_mod = Phraser(bigram)
print('done!')
# Form Bigrams
data_words_bigrams = [bigram_mod[doc] for doc in data_words_nostops]
# lemmatization
pos_to_use = ['NOUN', 'ADJ', 'VERB', 'ADV']
data_lemmatized = []
for sent in data_words_bigrams:
    doc = nlp(" ".join(sent))
    data_lemmatized.append(
        [token.lemma_ for token in doc if token.pos_ in pos_to_use]
    )
docs = data_lemmatized
vocab = set(x for doc in docs for x in doc)
# Train a new model
random.seed(1000)
# Init of the Gibbs Sampling Dirichlet Mixture Model algorithm
mgp = MovieGroupProcess(K=10, alpha=0.1, beta=0.1, n_iters=30)
vocab = set(x for doc in docs for x in doc)
n_terms = len(vocab)
n_docs = len(docs)
# Fit the model on the data given the chosen seeds
y = mgp.fit(docs, n_terms)
def top_words(cluster_word_distribution, top_cluster, values):
    for cluster in top_cluster:
        sort_dicts = sorted(
            mgp.cluster_word_distribution[cluster].items(),
            key=lambda k: k[1],
            reverse=True,
        )[:values]
        print('Cluster %s : %s' % (cluster, sort_dicts))
        print('-' * 20)
doc_count = np.array(mgp.cluster_doc_count)
print('Number of documents per topic :', doc_count)
print('*'*20)
# Topics sorted by the number of documents they are allocated to
top_index = doc_count.argsort()[-10:][::-1]
print('Most important clusters (by number of docs inside):', top_index)
print('*'*20)
# Show the top 10 words in term frequency for each cluster
top_words(mgp.cluster_word_distribution, top_index, 10)
Links
gensim modules
https://radimrehurek.com/gensim/models/phrases.html#module-gensim.models.phrases
https://radimrehurek.com/gensim/utils.html#gensim.utils.simple_preprocess
Python library gsdmm
GSDMM (Gibbs Sampling Dirichlet Multinomial Mixture) is a short-text clustering model. It is essentially a modified LDA (Latent Dirichlet Allocation) which assumes that a document, such as a tweet or any other text, covers a single topic.
GSDMM
LDA
Address: github.com/da03/GSDMM
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse import find
import math
class GSDMM:
    def __init__(self, n_topics, n_iter, random_state=910820, alpha=0.1, beta=0.1):
        self.n_topics = n_topics
        self.n_iter = n_iter
        self.random_state = random_state
        np.random.seed(random_state)
        self.alpha = alpha
        self.beta = beta

    def fit(self, X):
        alpha = self.alpha
        beta = self.beta

        D, V = X.shape
        K = self.n_topics

        # flatten the sparse row sums to a 1-D integer array so entries work with range()
        N_d = np.asarray(X.sum(axis=1)).ravel().astype(int)
        words_d = {}
        for d in range(D):
            words_d[d] = find(X[d, :])[1]

        # initialization
        N_k = np.zeros(K)
        M_k = np.zeros(K)
        N_k_w = lil_matrix((K, V), dtype=np.int32)

        K_d = np.zeros(D, dtype=int)  # integer dtype so cluster labels can be used as indices

        for d in range(D):
            k = np.random.choice(K, 1, p=[1.0 / K] * K)[0]
            K_d[d] = k
            M_k[k] = M_k[k] + 1
            N_k[k] = N_k[k] + N_d[d]
            for w in words_d[d]:
                N_k_w[k, w] = N_k_w[k, w] + X[d, w]

        for it in range(self.n_iter):
            print('iter', it)
            for d in range(D):
                # remove document d from its current cluster
                k_old = K_d[d]
                M_k[k_old] -= 1
                N_k[k_old] -= N_d[d]
                for w in words_d[d]:
                    N_k_w[k_old, w] -= X[d, w]

                # sample k_new
                log_probs = [0] * K
                for k in range(K):
                    log_probs[k] += math.log(alpha + M_k[k])
                    for w in words_d[d]:
                        N_d_w = X[d, w]
                        for j in range(N_d_w):
                            log_probs[k] += math.log(N_k_w[k, w] + beta + j)
                    for i in range(N_d[d]):
                        log_probs[k] -= math.log(N_k[k] + beta * V + i)

                log_probs = np.array(log_probs) - max(log_probs)
                probs = np.exp(log_probs)
                probs = probs / np.sum(probs)
                k_new = np.random.choice(K, 1, p=probs)[0]

                K_d[d] = k_new
                M_k[k_new] += 1
                N_k[k_new] += N_d[d]
                for w in words_d[d]:
                    N_k_w[k_new, w] += X[d, w]

        self.topic_word_ = N_k_w.toarray()
As I understand it, you have the code (https://github.com/rwalk/gsdmm) but you need to decide how to apply it.
How does it work?
You can download the paper A dirichlet multinomial mixture model-based approach for short text clustering; it shows that the cluster search is equivalent to a game of table choosing. Imagine you have a group of students and want to group them onto tables by their movie interests. Every student (= item) switches in each round to a table (= cluster) that has students with similar movies and is popular. Alpha controls how easily a table gets removed when it is empty (low alpha = fewer tables). A small beta means a table is chosen more for its similarity to the student than for its popularity. For short-text clustering you take words instead of movies.
Alpha, beta, number of iterations
Therefore a low alpha results in many clusters with single words, while a high alpha results in fewer clusters with more words. A high beta results in popular clusters, while a low beta results in similar clusters (which are not strongly populated). Which parameters you need depends on the dataset. The number of clusters can mostly be controlled by beta, but alpha also has (as described) an influence. The number of iterations seems to be stable after 20 iterations, but 10 is also OK.
Data preparation process
Before you train the algorithm you will need to create a clean data set. For this you convert every text to lower case, remove non-ASCII characters and stop words, and apply stemming or lemmatisation. You will also need to apply this process when you score a new sample; see the sketch below.
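A minimal sketch of that preparation plus scoring of a new sample, assuming the rwalk/gsdmm package from above (clean(), the stop-word set and the example texts are placeholders; choose_best_label is, as far as I know, the method MovieGroupProcess provides for assigning a cluster to a new document):
import re
from gsdmm import MovieGroupProcess

def clean(text, stop_words):
    # lower case, strip non-ASCII, drop stop words; stemming/lemmatisation would go here too
    text = text.lower().encode('ascii', errors='ignore').decode()
    return [t for t in re.findall(r"[a-z']+", text) if t not in stop_words]

stop_words = {'the', 'a', 'rt'}                        # placeholder stop-word set
raw_docs = ["The first tweet ...", "Another tweet"]    # placeholder data
docs = [clean(d, stop_words) for d in raw_docs]
vocab = set(w for doc in docs for w in doc)

mgp = MovieGroupProcess(K=10, alpha=0.1, beta=0.1, n_iters=30)
labels = mgp.fit(docs, len(vocab))

# a new sample must go through the same cleaning before it is scored
new_doc = clean("Yet another tweet ...", stop_words)
best_cluster, probability = mgp.choose_best_label(new_doc)
print(best_cluster, probability)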
The following code tests KMeans for several values of n_clusters and tries to find the "best" n_clusters by the inertia criterion. However, it is not reproducible: even with random_state fixed, every time I call kmeans(df) on the same dataset it generates a different clustering, and even a different n_clusters. Am I missing something here?
import numpy as np
from sklearn.cluster import KMeans
from tqdm import tqdm_notebook

def kmeans(df):
    inertia = []
    models = {}
    start = 3
    end = 40
    for i in tqdm_notebook(range(start, end)):
        k = KMeans(n_clusters=i, init='k-means++', n_init=50, random_state=10, n_jobs=-1).fit(df.values)
        inertia.append(k.inertia_)
        models[i] = k
    ep = np.argmax(np.gradient(np.gradient(np.array(inertia)))) + start
    return models[ep]
I am having this same issue. I think the closest solution is to freeze the model into a file, import the model, and then cluster a new phrase with predict. If the vectorizer and the k-means clustering are re-initialized every time the program runs, the clusters come out in a different order each time, so the hashmap does not map correctly and the function returns a different number on every call.
import json
import sqlite3

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.utils import shuffle

# Sample array of string sentences
df = pd.read_csv('/workspaces/codespaces-flask//data/shuffled.csv')
df = shuffle(df)
sentences = df['text'].values

# Convert the sentences into TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])

# Perform K-Means clustering
kmeans = KMeans(n_clusters=8, random_state=42)
clusters = kmeans.fit_predict(X)
output = zip(sentences, clusters)

# Print the cluster assignments for each sentence
for sentence, cluster in zip(sentences, clusters):
    print("Sentence:", sentence, "Cluster:", cluster)
df = pd.DataFrame(output)

db_file_name = '/workspaces/codespaces-flask/ThrAive/data/database1.db'
conn = sqlite3.connect(db_file_name)
cursor = conn.cursor()
cursor.execute("SELECT journal_text FROM Journal JOIN User ON Journal.id = user.id")
rows = cursor.fetchall()
conn.commit()
conn.close()

df1 = pd.DataFrame(rows)
df1 = df1.applymap(lambda x: " ".join(x.split()) if isinstance(x, str) else x)
entry = df1
print(entry)

entry = entry[0].iloc[-1].lower()
entry = [entry]
new_X = vectorizer.transform(entry)

# Predict the cluster assignments for the new sentences
new_clusters = kmeans.predict(new_X)
for entry, new_cluster in zip(entry, new_clusters):
    print("Sentence:", entry, "Cluster:", new_cluster)
zipper = zip(entry, new_clusters)
df = pd.DataFrame(zipper)
df = df.applymap(lambda x: " ".join(x.split()) if isinstance(x, str) else x)
df = df.to_string(header=False, index=False)
entry = df
output = entry

numbers = ['0', '1', '2', '3', '4', '5', '6', '7', '8']
names = ...  # list of cluster names (omitted in the original)
# Create a dictionary that maps numbers to names
number_to_name = {number: name for number, name in zip(numbers, names)}
print(output[-1])
output = number_to_name[output[-1]]
json_string = json.dumps(str(output))
I think that the solution is saving the model to disk
import pickle
# Train a scikit-learn model
model = ///
# Save the model to disk
with open('model.pkl', 'wb') as file:
pickle.dump(model, file)
Then load the pickle file and run new phrases through the saved k-means without re-initializing the clustering; a sketch follows below.
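A sketch of that idea: persist both the fitted vectorizer and the fitted k-means once, then load them in the deployed program so the cluster ids keep the same meaning between runs (file names are arbitrary):
import pickle

# after fitting once
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('kmeans.pkl', 'wb') as f:
    pickle.dump(kmeans, f)

# later, in the deployed program: load instead of re-fitting
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
with open('kmeans.pkl', 'rb') as f:
    kmeans = pickle.load(f)

new_phrase = ["some new journal entry"]  # placeholder input
cluster_id = kmeans.predict(vectorizer.transform(new_phrase))[0]
print(cluster_id)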
I am practicing building an article summarizer. I built something using the script below. I would like to export the model and use it for deployment, but I can't find a way to do it.
Here is the script for the analyzer.
#import necessary libraries
import re
import gensim
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
file = open("somefile.txt","r")
data=file.readlines()
file.close()
#define preprocessing steps
#lower case
#remove everything inside []
#remove 's
#fetch only ascii characters
def preprocessor(text):
    newString = text.lower()
    newString = re.sub(r"[\(\[].*?[\)\]]", "", newString)
    newString = re.sub(r"'s", "", newString)
    newString = re.sub(r"[^'0-9.a-zA-Z]", " ", newString)
    tokens = newString.split()
    return (" ".join(tokens)).strip()

#call above function
text = []
for i in data:
    text.append(preprocessor(i))

all_sentences = []
for i in text:
    sentences = i.split(".")
    for i in sentences:
        if i != '':
            all_sentences.append(i.strip())

# tokenizing the sentences for training word2vec
tokenized_text = []
for i in all_sentences:
    tokenized_text.append(i.split())
#define word2vec model
model_w2v = gensim.models.Word2Vec(
tokenized_text,
size=200, # desired no. of features/independent variables
window=5, # context window size
min_count=2,
sg = 0, # 1 for cbow model
hs = 0,
negative = 10, # for negative sampling
workers= 2, # no.of cores
seed = 34)
#train word2vec
model_w2v.train(tokenized_text, total_examples= len(tokenized_text), epochs=model_w2v.epochs)
#define function to obtain sentence embedding
def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:  # handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    return vec

#call above function
wordvec_arrays = np.zeros((len(tokenized_text), 200))
for i in range(len(tokenized_text)):
    wordvec_arrays[i, :] = word_vector(tokenized_text[i], 200)

# similarity matrix
sim_mat = np.zeros([len(wordvec_arrays), len(wordvec_arrays)])

#compute similarity score
for i in range(len(wordvec_arrays)):
    for j in range(len(wordvec_arrays)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(wordvec_arrays[i].reshape(1, 200), wordvec_arrays[j].reshape(1, 200))[0, 0]
#Generate a graph
nx_graph = nx.from_numpy_array(sim_mat)
#compute pagerank scores
scores = nx.pagerank(nx_graph)
#sort the scores
sorted_x = sorted(scores.items(), key=lambda kv: kv[1],reverse=True)
sent_list = []
for i in sorted_x:
    sent_list.append(i[0])

#extract top 10 sentences
num = 10
summary = ''
for i in range(num):
    summary = summary + all_sentences[sent_list[i]] + '. '
print(summary)
I want to have an exported model that I can pass to a Flask API later. I need help with that.
Using pickle
To dump the model:
pickle.dump(model, open(filename, 'wb'))
To read the model:
loaded_model = pickle.load(open(filename, 'rb'))
Some other packages, like TensorFlow, have their own save/load methods; refer to their docs in those cases.
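For the summarizer above, the only trained artifact is the Word2Vec model, and gensim can also persist it with its own save/load instead of pickle (a small sketch; the file name is arbitrary). The summarization itself is just a function of the loaded model, so the rest of the script can be wrapped in a function that the Flask route calls:
from gensim.models import Word2Vec

# save once after training
model_w2v.save("word2vec.model")

# in the Flask app, load at startup and reuse
model_w2v = Word2Vec.load("word2vec.model")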
So I am trying to build a natural language processor in Python, and I was using some code I found online, then adapting my own stuff to it. But now it just doesn't want to work. It keeps giving me
ValueError: Found array with 0 sample(s) (shape=(0, 262)) while a minimum of 1 is required.
Here is my code. I apologize if it is messy; I just copied it straight off the internet:
from collections import Counter
import pandas
from nltk.corpus import stopwords
import pandas as pd
import numpy
headlines = []
apps = pd.read_csv('DataUse.csv')
for e in apps['title_lower']:
    headlines.append(e)
testdata = pd.read_csv('testdata.csv')
# Find all the unique words in the headlines.
unique_words = list(set(" ".join(headlines).split(" ")))
def make_matrix(headlines, vocab):
    matrix = []
    for headline in headlines:
        # Count each word in the headline, and make a dictionary.
        counter = Counter(headline)
        # Turn the dictionary into a matrix row using the vocab.
        row = [counter.get(w, 0) for w in vocab]
        matrix.append(row)
    df = pandas.DataFrame(matrix)
    df.columns = unique_words
    return df

print(make_matrix(headlines, unique_words))
import re
# Lowercase, then replace any non-letter, space, or digit character in the headlines.
new_headlines = [re.sub(r'[^\w\s\d]','',h.lower()) for h in headlines]
# Replace sequences of whitespace with a space character.
new_headlines = [re.sub("\s+", " ", h) for h in new_headlines]
unique_words = list(set(" ".join(new_headlines).split(" ")))
# We've reduced the number of columns in the matrix a bit.
print(make_matrix(new_headlines, unique_words))
stopwords = set(stopwords.words('english'))
stopwords = [re.sub(r'[^\w\s\d]','',s.lower()) for s in stopwords]
unique_words = list(set(" ".join(new_headlines).split(" ")))
# Remove stopwords from the vocabulary.
unique_words = [w for w in unique_words if w not in stopwords]
# We're down to 34 columns, which is way better!
print(make_matrix(new_headlines, unique_words))
##
##
##
##
from sklearn.feature_extraction.text import CountVectorizer
# Construct a bag of words matrix.
# This will lowercase everything, and ignore all punctuation by default.
# It will also remove stop words.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(headlines)
# We created our bag of words matrix with far fewer commands.
print(matrix.todense())
# Let's apply the same method to all the headlines in all 100000 submissions.
# We'll also add the url of the submission to the end of the headline so we can take it into account.
full_matrix = vectorizer.fit_transform(apps['title_lower'])
print(full_matrix.shape)
##
##
##
##
##
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Convert the upvotes variable to binary so it works with a chi-squared test.
col = apps["total_shares"].copy(deep=True)
col_mean = col.mean()
col[col < col_mean] = 0
col[(col > 0) & (col > col_mean)] = 1
print col
# Find the 1000 most informative columns
selector = SelectKBest(chi2, k='all')
selector.fit(full_matrix, col)
top_words = selector.get_support().nonzero()
# Pick only the most informative columns in the data.
chi_matrix = full_matrix[:,top_words[0]]
##
##
##
##
##
##
import numpy as numpy
transform_functions = [
lambda x: len(x),
lambda x: x.count(" "),
lambda x: x.count("."),
lambda x: x.count("!"),
lambda x: x.count("?"),
lambda x: len(x) / (x.count(" ") + 1),
lambda x: x.count(" ") / (x.count(".") + 1),
lambda x: len(re.findall("\d", x)),
lambda x: len(re.findall("[A-Z]", x)),
]
# Apply each function and put the results into a list.
columns = []
for func in transform_functions:
    columns.append(apps["title_lower"].apply(func))
# Convert the meta features to a numpy array.
meta = numpy.asarray(columns).T
##
##
##
##
##
##
##
features = numpy.hstack([chi_matrix.todense()])
from sklearn.linear_model import Ridge
import random
train_rows = 262
# Set a seed to get the same "random" shuffle every time.
random.seed(1)
# Shuffle the indices for the matrix.
indices = list(range(features.shape[0]))
random.shuffle(indices)
# Create train and test sets.
train = features[indices[:train_rows], :]
test = features[indices[train_rows:], :]
print test
train_upvotes = apps['total_shares'].iloc[indices[:train_rows]]
test_upvotes = apps['total_shares'].iloc[indices[train_rows:]]
train = numpy.nan_to_num(train)
print (test)
# Run the regression and generate predictions for the test set.
reg = Ridge(alpha=.1)
reg.fit(train, train_upvotes)
predictions = reg.predict(test)
##
##
##
##
##
### We're going to use mean absolute error as an error metric.
### Our error is about 13.6 upvotes, which means that, on average,
### our prediction is 13.6 upvotes away from the actual number of upvotes.
##print(sum(abs(predictions - test_upvotes)) / len(predictions))
##
### As a baseline, we'll use the average number of upvotes
### across all submissions.
### The error here is 17.2 -- our estimate is better, but not hugely so.
### There either isn't a ton of predictive value encoded in the
### data we have, or we aren't extracting it well.
##average_upvotes = sum(test_upvotes)/len(test_upvotes)
##print(sum(abs(average_upvotes - test_upvotes)) / len(predictions))
##
EDIT: Here is the error:
Traceback (most recent call last):
File "C:/Users/Tucker Siegel/Desktop/Machines/Test.py", line 156, in <module>
predictions = reg.predict(test)
File "C:\Python27\lib\site-packages\sklearn\linear_model\base.py", line 200, in predict
return self._decision_function(X)
File "C:\Python27\lib\site-packages\sklearn\linear_model\base.py", line 183, in _decision_function
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 407, in check_array
context))
ValueError: Found array with 0 sample(s) (shape=(0, 262)) while a minimum of 1 is required.