I am using scikit-learn to perform some analytics on a large dataset (roughly 34,000 files). Now I was wondering: the HashingVectorizer aims at low memory usage. Is it possible to first convert a bunch of files to HashingVectorizer outputs (saved with pickle.dump), then load all these files together and convert them to TfIdf features? These features can be calculated from the HashingVectorizer output, because counts are stored and the number of documents can be deduced. I currently have the following:
for text in texts:
    vectorizer = HashingVectorizer(norm=None, non_negative=True)
    features = vectorizer.fit_transform([text])
    with open(path, 'wb') as handle:
        pickle.dump(features, handle)
Then, loading the files is trivial:
data = []
for path in paths:
    with open(path, 'rb') as handle:
        data.append(pickle.load(handle))

tfidf = TfidfVectorizer()
tfidf.fit_transform(data)
But the magic does not happen. How can I make the magic happen?
It seems the problem is that you are trying to vectorize your text twice. Once you have built a matrix of counts, you should be able to transform the counts into tf-idf features using sklearn.feature_extraction.text.TfidfTransformer instead of TfidfVectorizer.
Also, it appears your saved data is a sparse matrix. You should stack the loaded matrices using scipy.sparse.vstack() instead of passing a list of matrices to TfidfTransformer.
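For example, a minimal sketch of that idea (assuming paths lists the pickle files written by the loop above):
import pickle

import scipy.sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Load the pickled hashed count matrices, one per document
matrices = []
for path in paths:
    with open(path, 'rb') as handle:
        matrices.append(pickle.load(handle))

# Stack the per-document rows into a single sparse count matrix
counts = scipy.sparse.vstack(matrices)

# Convert the raw counts into tf-idf features
transformer = TfidfTransformer()
tfidf_features = transformer.fit_transform(counts)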
I'm quite worried by your loop:
for text in texts:
    vectorizer = HashingVectorizer(norm=None, non_negative=True)
    features = vectorizer.fit_transform([text])
Each time through the loop you re-fit the vectorizer, so it may forget its vocabulary and the entries in each vector won't correspond to the same words (I'm not sure about this; I guess it depends on how the hashing is done). Why not just fit it on the whole corpus, i.e.
features = vectorizer.fit_transform(texts)
For your actual question: it sounds like you are just trying to normalise the columns of your data matrix by the IDF. You should be able to do this directly on the arrays (I've converted to numpy arrays since I can't work out how the indexing works on the scipy sparse matrices). The mask DF != 0 is necessary since you used the hashing vectorizer, which has 2^20 columns:
import numpy as np
X = np.array(features.todense())
DF = (X != 0).sum(axis=0)
X_TFIDF = X[:,DF != 0]/DF[DF != 0]
I have a model based on doc2vec trained on multiple documents. I would like to use that model to infer the vectors of another document, which I want to use as the corpus for comparison. So, when I look for the sentence most similar to one I introduce, it should use these new document vectors instead of the trained corpus.
Currently, I am using infer_vector() to compute the vector for each one of the sentences of the new document, but I can't use the most_similar() function with the list of vectors I obtain, since it has to be a KeyedVectors object.
I would like to know if there's any way I can compute these vectors for the new document that will allow the use of the most_similar() function, or if I have to compute the similarity between the sentence I introduce and each one of the sentences of the new document individually (in that case, is there any implementation in Gensim that allows me to compute the cosine similarity between two vectors?).
I am new to Gensim and NLP, and I'm open to your suggestions.
I cannot provide the complete code, since it is a project for the university, but here are the main parts where I'm having problems.
After doing some pre-processing of the data, this is how I train my model:
import multiprocessing

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(train_data)]
assert gensim.models.doc2vec.FAST_VERSION > -1
cores = multiprocessing.cpu_count()

doc2vec_model = Doc2Vec(vector_size=200, window=5, workers=cores)
doc2vec_model.build_vocab(documents)
doc2vec_model.train(documents, total_examples=doc2vec_model.corpus_count, epochs=30)
I try to compute the vectors for the new document this way:
questions = [doc2vec_model.infer_vector(line) for line in lines_4]
And then I try to compute the similarity between the new document vectors and an input phrase:
text = str(input('Me: '))
tokens = text.split()
new_vector = doc2vec_model.infer_vector(tokens)
index = questions[i].most_similar([new_vector])
A dirty solution I used about a month ago with gensim==3.2.0 (the syntax might have changed since): you can save your inferred vectors in the KeyedVectors text format.
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec

vectors = dict()
# y_names = doc2vec_model.docvecs.doctags.keys()
y_names = range(len(questions))

for name in y_names:
    # vectors[name] = doc2vec_model.docvecs[name]
    vectors[str(name)] = questions[name]

with open("question_vectors.txt", "w") as f:
    # word2vec text format header: <number of vectors> <vector size>
    f.write("{} {}\n".format(len(questions), doc2vec_model.vector_size))
    for name, vector in vectors.items():
        line = "{} {}\n".format(name, " ".join(vector.astype(str)))
        f.write(line)
Then you can load the file and use the most_similar() function:
keyed_model = KeyedVectors.load_word2vec_format("question_vectors.txt")
keyed_model.most_similar(str(list(y_names)[0]))
Another solution (especially if the number of questions is not that high) would be to just convert questions to a np.array and compute the cosine similarities yourself, e.g.
import numpy as np

questions = np.array(questions)
texts_norm = np.linalg.norm(questions, axis=1)[np.newaxis].T
norm = texts_norm * texts_norm.T
product = np.matmul(questions, questions.T)
product = product / norm

# Otherwise each item is the closest to itself
for j in range(len(questions)):
    product[j, j] = 0

# Indices of the 10 items most similar to the 0th question
np.argpartition(product[0], -10)[-10:]
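As for the side question about computing the cosine similarity between two vectors directly: this isn't gensim-specific; for instance, sklearn's cosine_similarity can do it. A hedged sketch, reusing new_vector and questions from above:
from sklearn.metrics.pairwise import cosine_similarity

# Similarity of the input phrase vector against every inferred sentence vector
sims = cosine_similarity(new_vector.reshape(1, -1), questions)[0]
top10 = sims.argsort()[::-1][:10]  # indices of the 10 most similar sentences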
I have a collection of 14784 text documents, which I am trying to vectorize so I can run some analysis. I used the CountVectorizer in sklearn to convert the documents to feature vectors. I did this by calling:
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(examples)
where examples is an array of all the text documents.
Now I am trying to use additional features. For this, I am storing the features in a pandas dataframe. At present, my pandas dataframe (without the text features) has the shape (14784, 5). The shape of my feature vector is (14784, 21343).
What would be a good way to insert the vectorized features into the pandas dataframe?
Return the term-document matrix after learning the vocabulary dictionary from the raw documents (here vect is your CountVectorizer and docs is your list of documents):
X = vect.fit_transform(docs)
Convert the sparse CSR matrix to dense format and label the columns with the mapping from feature integer indices to feature names:
count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out())
Concatenate the original df and count_vect_df column-wise:
pd.concat([df, count_vect_df], axis=1)
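With a 14784 x 21343 matrix, the dense conversion can be memory-hungry. A hedged alternative, assuming a reasonably recent pandas, is to keep the counts sparse:
import pandas as pd

# Build a DataFrame backed by sparse columns instead of densifying X
count_vect_df = pd.DataFrame.sparse.from_spmatrix(X, columns=vect.get_feature_names_out())
combined_df = pd.concat([df, count_vect_df], axis=1)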
If your base data frame is df, all you need to do is:
import pandas as pd
features_df = pd.DataFrame(features.toarray())
combined_df = pd.concat([df, features_df], axis=1)
I'd recommend some options to reduce the number of features, which could be useful depending on what type of analysis you're doing. For example, if you haven't already, I'd suggest looking into removing stop words and stemming. Additionally, you can set max_features when constructing the vectorizer, e.g. CountVectorizer(max_features=1000), to limit the number of features (note that max_features is a constructor argument, not an argument to fit_transform).
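For instance, a minimal sketch of those two options (the cap of 1000 is just an example value):
from sklearn.feature_extraction.text import CountVectorizer

# Drop English stop words and keep only the 1000 most frequent terms
vectorizer = CountVectorizer(stop_words='english', max_features=1000)
features = vectorizer.fit_transform(examples)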
I have used scikit-learn's CountVectorizer to convert a collection of documents into a matrix of token counts. I have also used its max_features parameter, which keeps the top max_features terms ordered by term frequency across the corpus.
Now I want to analyse my selected corpus; in particular, I want to know the frequency of the tokens in the selected vocabulary, but I am unable to find an easy way to do it. So kindly help me in this regard.
When you call fit_transform(), a sparse matrix is returned.
To display it, you simply have to call the toarray() method:
vec = CountVectorizer()
sparse_mat = vec.fit_transform(['toto titi', 'toto toto', 'titi tata'])
# you can observe the matrix in the interpreter by doing
sparse_mat.toarray()
With the help of @bernard's post, I was able to get the complete result, which is as follows:
import numpy as np

vec = CountVectorizer()
doc_term_matrix = vec.fit_transform(['toto titi', 'toto toto', 'titi tata'])
doc_term_matrix = doc_term_matrix.toarray()

# total frequency of each term across the corpus
term_freq_matrix = doc_term_matrix.sum(0)
min_freq = np.amin(term_freq_matrix)
indices_name_mapping = vec.get_feature_names()

# names of the features that occur with the minimum frequency
feature_names = [indices_name_mapping[i] for i, x in enumerate(term_freq_matrix) if x == min_freq]
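If you want the frequency of every token in the selected vocabulary rather than only the least frequent ones, a small sketch along the same lines:
# map every feature name in the vocabulary to its corpus frequency
token_freqs = dict(zip(indices_name_mapping, term_freq_matrix))

# e.g. sorted from most to least frequent
sorted_freqs = sorted(token_freqs.items(), key=lambda kv: kv[1], reverse=True)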
I am trying text classification using a naive Bayes text classifier.
My data is in the format below, and based on the question and excerpt I have to decide the topic of the question. The training data has more than 20K records. I know SVM would be a better option here, but I want to go with Naive Bayes using the sklearn library.
{[{"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit (see diagram: http://i.stack.imgur.com/BS85b.png). \n\nWhat is the effective capacitance of this circuit and will the ...\r\n "},
{"topic":"electronics","question":"Outlet Installation--more wires than my new outlet can use [on hold]","excerpt":"I am replacing a wall outlet with a Cooper Wiring USB outlet (TR7745). The new outlet has 3 wires coming out of it--a black, a white, and a green. Each one needs to be attached with a wire nut to ...\r\n "}]}
This is what I have tried so far:
import numpy as np
import json
from sklearn.naive_bayes import *
topic = []
question = []
excerpt = []
with open('training.json') as f:
    for line in f:
        data = json.loads(line)
        topic.append(data["topic"])
        question.append(data["question"])
        excerpt.append(data["excerpt"])
unique_topics = list(set(topic))
new_topic = [x.encode('UTF8') for x in topic]
numeric_topics = [name.replace('gis', '1').replace('security', '2').replace('photo', '3').replace('mathematica', '4').replace('unix', '5').replace('wordpress', '6').replace('scifi', '7').replace('electronics', '8').replace('android', '9').replace('apple', '10') for name in new_topic]
numeric_topics = [float(i) for i in numeric_topics]
x1 = np.array(question)
x2 = np.array(excerpt)
X = zip(*[x1,x2])
Y = np.array(numeric_topics)
print X[0]
clf = BernoulliNB()
clf.fit(X, Y)
print "Prediction:", clf.predict( ['hello'] )
But, as expected, I am getting ValueError: could not convert string to float. My question is: how can I create a simple classifier to classify the question and excerpt into the related topic?
All classifiers in sklearn require the input to be represented as vectors of some fixed dimensionality. For text there are CountVectorizer, HashingVectorizer and TfidfVectorizer, which can transform your strings into vectors of floating point numbers.
vect = TfidfVectorizer()
X = vect.fit_transform(X)
Obviously, you'll need to vectorize your test set in the same way:
clf.predict( vect.transform(['hello']) )
See a tutorial on using sklearn with textual data.
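A hedged end-to-end sketch of that idea for your data, reusing the topic, question and excerpt lists from your code (I use MultinomialNB here, which usually suits word counts better than BernoulliNB, and LabelEncoder in place of the manual replace calls):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Combine question and excerpt into one string per record
texts = [q + " " + e for q, e in zip(question, excerpt)]

vect = TfidfVectorizer()
X = vect.fit_transform(texts)

# Map topic names to integers instead of the manual replace() chain
le = LabelEncoder()
y = le.fit_transform(topic)

clf = MultinomialNB()
clf.fit(X, y)

pred = clf.predict(vect.transform(['What is the effective capacitance of this circuit']))
print(le.inverse_transform(pred))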
I have input data from an Excel file, which I have processed in the manner below using nltk:
from xlrd import open_workbook
from xlutils.copy import copy

rb = open_workbook('subjectcat.xlsx')  # C:/Users/5460/Desktop/
wb = copy(rb)  # making a copy
sheet = rb.sheet_by_index(0)

data = ()
for row_index in range(1, 500):  # train using 500
    temp, add = (), ()
    subject, cat = 0, 0  # trial
    for col_index in range(1, 3):
        if col_index == 1:
            subject = sheet.cell(row_index, col_index).value
            subject = "'" + subject
        elif col_index == 2:
            cat = sheet.cell(row_index, col_index).value
            cat = "'" + cat + "'"
            add = add + (subject, cat)
    data = data + (add,)
print 'done'

training_data = list(data)
training_data = training_data[1:][::2]  # removing the even items
I now have to proceed to use scikit-learn to train the classifier. I have read through many SVM tutorials online, but they all seem to use different ways of creating datasets. I would be grateful if anyone could give me tips on how to proceed, as I am stuck for now. I am training the classifier to classify emails into categories. Thanks in advance!
Wrap your input data as a 2D numpy array: one row per sample / instance / observation. The columns of the array should store the numerical descriptors (features) for the samples.
You need to store the output / target classes as another numpy array of integers. Each target class should be assigned an integer (e.g. 0 for "ham" and 1 for "spam").
The output / target classes array should have as many entries as there are rows in your input data (one label per sample).
Read the documentation of numpy if you don't know how to transform Python lists into numpy arrays. You can start here:
http://docs.scipy.org/doc/numpy/user/basics.creation.html
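For illustration, a toy sketch of the expected shapes (the values are made up):
import numpy as np

# Three samples with two numerical features each
X = np.array([[0.3, 1.0],
              [1.2, 0.0],
              [0.7, 2.5]])

# One integer class label per sample, e.g. 0 = "ham", 1 = "spam"
y = np.array([0, 1, 0])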
To get good predictive accuracy with an SVM you also need to make sure that your features are meaningful (for instance, don't use string or integer representations to encode a categorical input feature, but use a one-hot-encoding feature expansion) and standardize your data to zero mean and unit variance. In particular, have a look at:
http://scikit-learn.org/stable/modules/preprocessing.html
Edit: I had not seen your last statement: if your input data is raw email text, you have to extract features (numerical descriptors that statistically summarize the email content). You will need text feature extraction in that case:
http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction
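A hedged, minimal sketch of what that could look like here, assuming training_data holds the (subject, category) pairs built above (LinearSVC is one reasonable choice of SVM for sparse text features):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [subject for subject, cat in training_data]
labels = [cat for subject, cat in training_data]

# Vectorize the raw text and train a linear SVM in one pipeline
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["free offer, click now"]))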