Fast way to tokenize data in Python like TfidfVectorizer does

I used TfidfVectorizer to create a tf-idf matrix:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

transformer = TfidfVectorizer(smooth_idf=False, stop_words=stopwords.words('english'))
tfidf = transformer.fit_transform(raw_documents=sentences)
Now I want to transform each element of my sentences list into the list of tokens that the TfidfVectorizer actually used. I tried to extract it directly from the tfidf object via this function:
import numpy as np

def get_token(id, transformer_name, tfidf_obj):
    # Take the row for this sentence, keep its non-zero columns, map them back to feature names.
    mask = tfidf_obj[id].toarray().reshape((tfidf_obj.shape[1],)) != 0
    return np.array(transformer_name.get_feature_names())[mask]
where id is the index of a sentence. In this function I extract the given row from the tf-idf object, find its non-zero elements, and pick the corresponding entries of transformer_name.get_feature_names(). It looks too complex =) and it also runs very slowly =/
Is there any way to get the tokens using TfidfVectorizer's own preprocessing and tokenization functions?
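One route worth sketching (hedged: it reuses the vectorizer's own machinery rather than the row-masking approach above) is TfidfVectorizer.build_analyzer(), which returns the exact preprocessing-plus-tokenization callable the vectorizer applies internally; the sentences list here is stand-in data:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["The sun is bright.", "The sky is blue and the sun is bright."]  # stand-in data

transformer = TfidfVectorizer(smooth_idf=False, stop_words=stopwords.words('english'))
tfidf = transformer.fit_transform(raw_documents=sentences)

# build_analyzer() returns the callable the vectorizer itself uses
# (preprocessing, token pattern, stop-word removal), applied here per sentence.
analyzer = transformer.build_analyzer()
tokenized = [analyzer(sentence) for sentence in sentences]
# e.g. [['sun', 'bright'], ['sky', 'blue', 'sun', 'bright']]
Note that the analyzer returns every analyzed token, including ones that min_df/max_df or max_features may later prune from the vocabulary.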

Related

How to Compare Sentences with an idea of the positions of keywords?

I want to compare two sentences. As an example,
sentence1 = "football is good,cricket is bad"
sentence2 = "cricket is good,football is bad"
These sentences clearly have different meanings, but when I compare them with Python NLTK tools I get 100% similarity. How can I fix this issue?
Yes, wup_similarity internally uses synsets for single tokens to calculate similarity.
Wu-Palmer Similarity: a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
Since the ancestor node for cricket and football is the same, wup_similarity returns 1.
If you want to fix this issue, wup_similarity is not a good choice.
The simplest token-based approach is to fit a vectorizer and then compute the similarity, e.g.:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["football is good,cricket is bad", "cricket is good,football is bad"]

# Include unigrams, bigrams and trigrams so that word order matters.
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(corpus)

x1 = vectorizer.transform(["football is good,cricket is bad"])
x2 = vectorizer.transform(["cricket is good,football is bad"])
cosine_similarity(x1, x2)
There are more intelligent methods to measure semantic similarity, though. One that is easy to try is Google's Universal Sentence Encoder (USE).
Semantic similarity is a bit tricky this way, since even if you use context counts (which would require n-grams > 5) you cannot cope well enough with antonyms (e.g. black and white). Before switching methods, you could try using a shallow parser or dependency parser to extract subject-verb or subject-verb-object relations, which you can use as dimensions. If this does not give you the expected similarity (or values adequate for your application), use word embeddings trained on really large data.
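As one concrete embedding-based option (a sketch only, not the only choice): spaCy ships pretrained word vectors, and Doc.similarity computes cosine similarity over the averaged vectors. It assumes the en_core_web_md model has been downloaded, and averaged vectors still struggle with antonymy:
import spacy

# Assumes: python -m spacy download en_core_web_md  (a model that includes word vectors)
nlp = spacy.load("en_core_web_md")

doc1 = nlp("football is good, cricket is bad")
doc2 = nlp("cricket is good, football is bad")

# Cosine similarity of averaged word vectors; a rough semantic signal only.
print(doc1.similarity(doc2))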

TFIDF with previously preprocessed data

I am trying to use several information retrieval techniques one after another, and for each one I want the texts to be preprocessed in exactly the same way. My preprocessed texts are provided as a list of lists of words. Unfortunately, scikit-learn's TfidfVectorizer seems to only accept lists of strings. Currently I am doing it like this (which is of course very inefficient):
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = [["the", "sun", "is", "bright"], ["blue", "is", "the", "sky"]]

# Join the token lists back into strings, only to split them again inside the vectorizer.
tfidf = TfidfVectorizer(tokenizer=lambda i: i.split(","))
converted_train = map(lambda i: ",".join(i), train_data)
result_train = tfidf.fit_transform(converted_train)
Is there a way to use scikit-learn's TfidfVectorizer to perform information retrieval directly on this kind of preprocessed data?
If not, is it instead possible to let the TfidfVectorizer do the preprocessing and reuse its preprocessed data afterwards?
I found the answer myself. My problem was that I simply used None as the tokenizer of the TfidfVectorizer:
tfidf = TfidfVectorizer(tokenizer=None)
Instead, you have to use a tokenizer that just forwards the data. You also have to make sure the vectorizer does not try to convert the lists to lower case (which doesn't work on lists). A working example is:
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = [["the", "sun", "is", "bright"], ["blue", "is", "the", "sky"]]

# Identity tokenizer: each document is already a list of tokens.
tfidf = TfidfVectorizer(tokenizer=lambda i: i, lowercase=False)
result_train = tfidf.fit_transform(train_data)
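Depending on your scikit-learn version, passing a callable tokenizer while leaving the default token_pattern in place triggers a warning that the pattern will be ignored; setting token_pattern=None makes the intent explicit. A small sketch of the same identity-tokenizer idea with that tweak, plus a look at the learned vocabulary (get_feature_names_out() is the newer name; older versions use get_feature_names()):
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = [["the", "sun", "is", "bright"], ["blue", "is", "the", "sky"]]

# token_pattern=None: the callable tokenizer is used, the regex is explicitly ignored.
tfidf = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False, token_pattern=None)
result_train = tfidf.fit_transform(train_data)

print(tfidf.get_feature_names_out())  # ['blue' 'bright' 'is' 'sky' 'sun' 'the']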

Python: Creating Term Document Matrix from list

I want to train a Naive Bayes algorithm over some documents, and the code below runs fine if the documents are strings. The issue is that my strings go through a series of preprocessing steps that are more than stop-word removal, lemmatization, etc.; there are some custom conversions that return a list of n-grams, where n can be 1, 2 or 3 depending on the context of the text.
So now, since I have a list of n-grams instead of a string representing each document, I am confused about how to feed this as input to CountVectorizer.
Any suggestions?
Code that works fine when docs is an array of strings:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

count_vectorizer = CountVectorizer(binary=True)
data = count_vectorizer.fit_transform(docs)
tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
classifier = BernoulliNB().fit(tfidf_data, op)
You should combine all your pre-processing steps into preprocessor and maybe tokenizer functions; see section 4.2.3.10 and the CountVectorizer description in the scikit-learn docs. For examples of such tokenizers/transformers, see the related question or the scikit-learn source code itself. A concrete sketch follows below.
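One concrete way to apply that advice when each document is already a list of n-gram strings is to pass CountVectorizer an identity analyzer, which bypasses its own preprocessing, tokenization and n-gram extraction entirely. A minimal sketch; docs and op here are made-up stand-ins for the real documents and labels:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# Each "document" is already a list of n-grams produced by the custom pipeline.
docs = [["machine learning", "learning is", "fun"],
        ["machine", "learning", "machine learning is fun"]]
op = [1, 0]  # example labels

# analyzer=lambda doc: doc -> use the pre-computed n-grams as-is.
count_vectorizer = CountVectorizer(analyzer=lambda doc: doc, binary=True)
data = count_vectorizer.fit_transform(docs)
tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
classifier = BernoulliNB().fit(tfidf_data, op)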

CountVectorizer matrix varies with new test data for classification?

I have created a model for text classification using Python. I used CountVectorizer and it produced a document-term matrix of 2034 rows and 4063 columns (unique words). I saved the model to use on new test data. My new test data:
test_data = ['Love', 'python', 'every', 'time']
The problem is that when I convert the test data tokens into a feature vector, it differs in shape, because the model expects a 4063-dimensional vector. I know how to solve this by taking the vocabulary of the CountVectorizer, searching for each token of the test data and putting it at that index, but is there an easier way to handle this in scikit-learn itself?
You should not fit a new CountVectorizer on the test data; you should use the one you fit on the training data and call transform(test_data) on it.
You have two ways to solve this:
1. You can use the same CountVectorizer that you used for your train features, like this:
cv = CountVectorizer()  # with whatever parameters you want
X_train = cv.fit_transform(train_data)
X_test = cv.transform(test_data)
2. You can also create another CountVectorizer if you really want to (not advisable, since you would waste space and you would still want to use the same parameters for your CV), and reuse the same features:
cv_train = CountVectorizer()  # desired parameters
X_train = cv_train.fit_transform(train_data)

# get_feature_names() is called get_feature_names_out() in newer scikit-learn.
cv_test = CountVectorizer(vocabulary=cv_train.get_feature_names())  # plus the same parameters
X_test = cv_test.fit_transform(test_data)
Try using inverse_transform on the fitted vectorizer:
test_features = cv.inverse_transform(cv.transform(test_data))
This returns, for each document, the vocabulary terms that actually occur in it, which should give you what you are after.
I added .toarray() to the whole command in order to see the result as a dense matrix. So, with pipeline being the fitted Pipeline and 'count_vectorizer' the name of its vectorizer step, you would write:
X_test_analyst = pipeline.named_steps['count_vectorizer'].transform(X_test).toarray()
I'm late to this discussion, but I just want to leave something for people who come here from a search engine.
As mentioned by @Andreas Mueller, you shouldn't create a new CountVectorizer with your new data(set). You can think of what a count vectorizer does as building a 2D array (imagine an Excel table): every column is a unique word, every row represents a document (or sentence), and the value at (i, j) is the frequency of the j-th word in the i-th sentence.
If you make a new CountVectorizer using your new data, the unique words will probably (if not certainly) be different. When you call model.predict on that data, it will report an error telling you the dimensions are not correct.
What I did in my code is the following:
If you train your model in a different .py/.ipynb file, you can use pickle and its dump function to save your fitted count vectorizer. You can follow the details in this post.
If you train your model in the same .py/.ipynb file, you can directly follow what @Andreas Mueller said.
Code:
import pickle as pk

# Save the fitted vectorizer and PCA for later use.
pk.dump(vectorizer, open(r'/relative path', 'wb'))
pk.dump(pca, open(r'/relative path', 'wb'))
# ...

# When you want to use them again:
import pickle as pk

vectorizer = pk.load(open(r'/relative path', 'rb'))
pca = pk.load(open(r'/relative path', 'rb'))
# ...
Side note:
If I remember correctly, you can also export classes and other objects with pickle, but when you do so, make sure the class is already defined/imported when you load the object. I'm not sure whether that matters in this case, but I still import PCA and CountVectorizer before calling pk.load.
I'm just a beginner in coding, so please test my code before using it in your project.
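As a side alternative to pickle, the scikit-learn persistence docs also mention joblib, which handles the large NumPy arrays inside fitted estimators efficiently. A minimal sketch; the file names are placeholders and vectorizer/pca are assumed to be already fitted:
import joblib

# Persist the fitted vectorizer and model side by side.
joblib.dump(vectorizer, "count_vectorizer.joblib")
joblib.dump(pca, "pca.joblib")

# Later, in the prediction script:
vectorizer = joblib.load("count_vectorizer.joblib")
pca = joblib.load("pca.joblib")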

TfidfVectorizer does not use the whole set of words in all documents?

I am trying to build a TF-IDF model with TfidfVectorizer. The feature name list, i.e. the number of columns of the sparse matrix, is shorter than the number of distinct words in my documents, even though I set min_df to 1. What happened?
Did you check the stop_words and max_features parameters? If you provide values for either of these two, some words will be excluded.
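Besides stop_words and max_features, the default token_pattern drops single-character tokens, and lowercasing merges case variants, so the vocabulary can end up smaller than a naive word count even with min_df=1. A quick sketch (with stand-in documents) that diffs a raw word set against the fitted vocabulary to see what was excluded:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["I love Python", "Python is a snake", "I love snakes"]  # stand-in data

vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(documents)

# Naive whitespace word set vs. the vocabulary the vectorizer actually kept.
raw_words = {w.lower() for doc in documents for w in doc.split()}
kept = set(vectorizer.vocabulary_)
print(raw_words - kept)  # e.g. {'i', 'a'}: single-character tokens are dropped by default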
