CountVectorizer matrix varies with new test data for classification?

I have created a text classification model in Python. I use CountVectorizer, which produces a document-term matrix of 2034 rows and 4063 columns (unique words). I saved the model so I can use it on new test data. My new test data is:
test_data = ['Love', 'python', 'every','time']
The problem is that when I convert these test tokens into a feature vector, its shape differs from what the model expects, which is a vector of length 4063. I know I can solve this by taking the CountVectorizer vocabulary, looking up each test token in it, and placing the count at the matching index. But is there an easier way to handle this in scikit-learn itself?

You should not fit a new CountVectorizer on the test data. Use the one you fit on the training data and call transform(test_data) on it.

There are two ways to solve this:
1. Use the same CountVectorizer that you used for your training features, like this:
cv = CountVectorizer()  # plus any parameters you need
X_train = cv.fit_transform(train_data)
X_test = cv.transform(test_data)
2. Create another CountVectorizer that reuses the vocabulary of the first one. This is not advisable, since it wastes space and you still have to pass the same parameters to both vectorizers.
cv_train = CountVectorizer()  # plus any parameters you need
X_train = cv_train.fit_transform(train_data)
# get_feature_names() was removed in scikit-learn 1.2; the fitted vocabulary_ mapping works across versions
cv_test = CountVectorizer(vocabulary=cv_train.vocabulary_)  # plus the same parameters
X_test = cv_test.transform(test_data)
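As a quick check of both options, here is a minimal sketch on a made-up corpus (the documents and variable names are assumptions for illustration, not taken from the question). It shows that transforming with the fitted vectorizer keeps the training column count, and that a second vectorizer built from the fitted vocabulary_ gives the same matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, assumed for illustration only
train_data = ["I love python", "python every time", "time flies"]
test_data = ["Love python every time", "completely unseen words"]

# Option 1: fit once on the training data, then only transform the test data
cv = CountVectorizer()
X_train = cv.fit_transform(train_data)
X_test = cv.transform(test_data)

# Option 2: a second vectorizer that reuses the fitted vocabulary
cv_fixed = CountVectorizer(vocabulary=cv.vocabulary_)
X_test_fixed = cv_fixed.transform(test_data)

# Both test matrices have as many columns as the training matrix
print(X_train.shape, X_test.shape, X_test_fixed.shape)
```

Words that never appeared in the training data are simply ignored, so a test document made only of unseen words becomes an all-zero row.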

Try to use:
test_features = inverse_transform(test_data)
This should return what you are looking for.

I added .toarray() to the whole command in order to see the result as a matrix.
So you should write (here pipeline is your fitted Pipeline object):
X_test_analyst = pipeline.named_steps['count_vectorizer'].transform(X_test).toarray()
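For context, a minimal sketch of what that line does inside a complete pipeline. The step names, the MultinomialNB classifier, and the toy documents are assumptions for illustration; only the named_steps lookup followed by transform(...).toarray() comes from the answer above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data, assumed for illustration only
train_docs = ["good movie", "bad movie", "good plot", "bad acting"]
train_labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
pipeline.fit(train_docs, train_labels)

# Reuse the vectorizer fitted inside the pipeline to inspect new data as a dense matrix
X_test = ["good acting", "terrible movie"]
X_test_counts = pipeline.named_steps["count_vectorizer"].transform(X_test).toarray()
print(X_test_counts)
```

For prediction itself you would normally call pipeline.predict(X_test) directly; pulling out the vectorizer step is mainly useful for inspecting the features.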

I'm very late to this discussion, but I want to leave something for people who come here from a search engine.
Sorry for my bad English. ;)
As mentioned by @Andreas Mueller, you shouldn't create a new CountVectorizer with your new data(set). You can think of what CountVectorizer does as building a 2D array (or an Excel table): every column is a unique word, every row represents a document (or sentence), and the value at (i, j) is the frequency of the j-th word in the i-th sentence.
If you make a new CountVectorizer using your new data, the set of unique words will almost certainly be different. When you then call model.predict on this data, it will raise an error telling you the dimensions are not correct.
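To make the table picture concrete, here is a small sketch with two made-up sentences (the data is an assumption for illustration). It prints the column words and the count matrix, and then shows that a vectorizer fitted on different text ends up with a different number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "documents", assumed for illustration only
docs = ["the cat sat", "the cat saw the dog"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs).toarray()

# Column j of the matrix corresponds to the word whose vocabulary index is j
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)
print(matrix)  # entry (i, j) is how often word j occurs in document i

# A vectorizer fitted on other text learns a different vocabulary,
# so its output would not line up with the columns above
other = CountVectorizer().fit(["a totally different sentence"])
print(len(vectorizer.vocabulary_), len(other.vocabulary_))
```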
What I did in my code is the following:
If you train your model in a different .py / .ipynb file, you can import pickle and use its dump function to save your fitted CountVectorizer. You can follow the details in this post.
If you train your model in the same .py / .ipynb file, you can directly follow what @Andreas Mueller said.
code:
import pickle as pk
# Save the fitted objects (use a separate file for each one)
pk.dump(vectorizer, open(r'/relative path', 'wb'))
pk.dump(pca, open(r'/relative path', 'wb'))
# ...
# When you want to use them:
import pickle as pk
vectorizer = pk.load(open(r'/relative path', 'rb'))
pca = pk.load(open(r'/relative path', 'rb'))
# ...
Side note:
If I remember correctly, you can also export classes or other objects using pickle, but when you do so, make sure the class is already defined when you load the object. I'm not sure if this matters in this case, but I still import PCA and CountVectorizer before I call pk.load.
I'm just a beginner in coding, so please test my code before using it in your project.
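A self-contained sketch of the same save and load round trip, assuming a temporary directory and a hypothetical file name (vectorizer.pkl) in place of the paths above, and a toy corpus. The with blocks make sure the files are closed after writing and reading.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Fit on toy training text (assumed for illustration only)
vectorizer = CountVectorizer().fit(["I love python", "python every time"])

# Hypothetical location; in a real project pick a stable path instead
path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")

# Training script: save the fitted vectorizer
with open(path, "wb") as f:
    pickle.dump(vectorizer, f)

# Prediction script: load it back and only call transform
with open(path, "rb") as f:
    restored = pickle.load(f)

X_new = restored.transform(["Love python every time"])
print(X_new.shape)
```

The restored object carries the training vocabulary, so the new matrix has the same number of columns the model was trained on.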

Related

Converting a pandas Interval into a string (and back again)

I'm relatively new to Python and am trying to get some data prepped to train a RandomForest. For various reasons, we want the data to be discrete, so there are a few continuous variables that need to be discretized. I found qcut in pandas, which seems to do what I want - I can set a number of bins, and it will discretize the variable into that many bins, trying to keep the counts in each bin even.
However, the output of pandas.qcut is a list of Intervals, and the RandomForest classifier in scikit-learn needs a string. I found that I can convert an interval into a string by using .astype(str). Here's a quick example of what I'm doing:
import pandas as pd
from random import sample
vals = sample(range(0,100), 100)
cuts = pd.qcut(vals, q=5)
str_cuts = pd.qcut(vals, q=5).astype(str)
and then str_cuts is one of the variables passed into a random forest.
However, the intent of this system is to train a RandomForest, save it to a file, and then allow someone to load it at a later date and get a classification for a new test instance, that is not available at training time. And because the classifier was trained on discretized data, the new test instance will need to be discretized before it can be used. So what I want to be able to do is read in a new instance, apply the already-established discretization scheme to it, convert it to a string, and run it through the random forest. However, I'm getting hung up on the best way to 'apply the discretization scheme'.
Is there an easy way to handle this? I assume there's no straight-forward way to convert a string back into an Interval. I can get the list of all Interval values from the discretization (ex: cuts.unique()) and apply that at test-time, but that would require saving/loading a discretization dictionary alongside the random forest, which seems clunky, and I worry about running into issues trying to recreate a categorical variable (coming mostly from R, which is extremely particular about the format of categorical variables). Or is there another way around this that I'm not seeing?

Use the labels argument in qcut and use pandas Categorical.
Either of those can help you create categories instead of intervals for your variable. Then you can use a form of encoding, for example Label Encoding or Ordinal Encoding, to convert the categories (the factors, if you're used to R) to numerical values which the forest will be able to use.
Then the process goes:
cutting => categoricals => encoding
and you don't need to do it by hand anymore.
Lastly, some gradient boosted trees libraries have support for categorical variables though it's not a silver bullet and will depend on your goal. See catboost and lightgbm.
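One way to sketch the cutting step so that it can be replayed on unseen data: ask qcut for integer bin codes and the bin edges, then reuse those edges with pd.cut at prediction time. The values below are made up, and storing the edges next to the model is an assumption of this sketch rather than something the answer above prescribes.

```python
import pandas as pd

# Training values, assumed for illustration only
train_vals = list(range(100))

# labels=False returns integer bin codes; retbins=True also returns the bin edges
train_codes, bin_edges = pd.qcut(train_vals, q=5, labels=False, retbins=True)

# Later: apply the same edges to new values.
# include_lowest=True keeps the smallest training value inside the first bin.
new_vals = [5, 50, 95]
new_codes = pd.cut(new_vals, bins=bin_edges, labels=False, include_lowest=True)
print(list(new_codes))
```

Values that fall outside the training range come back as NaN from pd.cut, so they need separate handling (for example clipping) before they reach the classifier.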

For future searchers, there are benefits to using transformers from scikit-learn instead of pandas. In this case, KBinsDiscretizer is the scikit equivalent of qcut.
It can be used in a pipeline, which will handle applying the previously-learned discretization to unseen data without the need for storing the discretization dictionary separately or round trip string conversion. Here's an example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
pipeline = make_pipeline(
    KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile'),
    RandomForestClassifier(),
)
X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
If you really need to convert back and forth between pandas IntervalIndex and string, you'll probably need to do some parsing as described in this answer: https://stackoverflow.com/a/65296110/3945991 and either use FunctionTransformer or write your own Transformer for pipeline integration.

While it may not be the cleanest-looking method, converting a string back into an interval is indeed possible:
import pandas as pd
str_intervals = [i.replace("(","").replace("]", "").split(", ") for i in str_cuts]
original_cuts = [pd.Interval(float(i), float(j)) for i, j in str_intervals]

TFIDF with previously preprocessed data

I am trying to use several information retrieval techniques one after another. For each one I want the texts to be preprocessed in exactly the same way. My preprocessed texts are provided as a list of lists of words. Unfortunately, scikit-learn's TfidfVectorizer seems to only accept lists of strings. Currently I am doing it like this (which is of course very inefficient):
from sklearn.feature_extraction.text import TfidfVectorizer
train_data = [["the","sun","is","bright"],["blue","is","the","sky"]]
tfidf = TfidfVectorizer(tokenizer=lambda i:i.split(","))
converted_train = map(lambda i:",".join(i), train_data)
result_train = tfidf.fit_transform(converted_train)
Is there a way to use scikit-learn's TfidfVectorizer to perform information retrieval directly on this kind of preprocessed data?
If not, is it possible instead to let the TfidfVectorizer do the preprocessing and to reuse its preprocessed data afterwards?

I found the answer myself. My problem was that I simply used None as the tokenizer of the TfidfVectorizer:
tfidf = TfidfVectorizer(tokenizer=None)
Instead, you have to use a tokenizer which just forwards the data. You also have to make sure the vectorizer does not try to convert the lists to lower case (which doesn't work on lists). A working example is:
from sklearn.feature_extraction.text import TfidfVectorizer
train_data = [["the","sun","is","bright"],["blue","is","the","sky"]]
tfidf = TfidfVectorizer(tokenizer=lambda i:i, lowercase=False)
result_train = tfidf.fit_transform(train_data)
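A possible alternative, sketched under the assumption that no further preprocessing is wanted: pass a callable as analyzer, which replaces preprocessing and tokenization in one step, so lowercase does not need to be switched off. The second half touches on the other part of the question by using build_analyzer() to get scikit-learn's own preprocessing as a reusable function.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = [["the", "sun", "is", "bright"], ["blue", "is", "the", "sky"]]

# A callable analyzer receives each document as-is and returns its tokens
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
result_train = tfidf.fit_transform(train_data)
print(result_train.shape)

# Reusing scikit-learn's default preprocessing on a raw string
default_analyzer = TfidfVectorizer().build_analyzer()
tokens = default_analyzer("The sun is bright")
print(tokens)
```

Note that a vectorizer holding a lambda cannot be pickled; a named module-level function avoids that if the fitted vectorizer has to be saved.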

Python: Creating Term Document Matrix from list

I want to train a Naive Bayes algorithm over some documents, and the code below runs fine if I have documents in the form of strings. The issue is that my strings go through a series of pre-processing steps that are more than stopword removal, lemmatization, etc.; there are some custom conversions which return a list of n-grams, where n can be 1, 2, or 3 depending on the context of the text.
So now, since I have a list of n-grams instead of a string representing a document, I am confused about how to represent it as input to CountVectorizer.
Any suggestions?
Code that works fine with docs as an array of document strings:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
count_vectorizer = CountVectorizer(binary=True)  # binary expects a bool, not the string 'true'
data = count_vectorizer.fit_transform(docs)
tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
classifier = BernoulliNB().fit(tfidf_data, op)  # op holds the class labels

You should combine all your pre-processing steps into preprocessor and maybe tokenizer functions; see section 4.2.3.10 and the CountVectorizer description in the scikit-learn docs. For examples of such tokenizers/transformers, see the related question or the source code of scikit-learn itself.
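If the n-gram lists are already computed, one possible shortcut is a callable analyzer that passes each list through unchanged, so every n-gram becomes one feature. This is a sketch with made-up documents and labels; it is a different route from the preprocessor/tokenizer functions suggested above, and it leaves out the TfidfTransformer step from the question for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Each document is already a list of 1- and 2-grams (toy data, assumed)
docs = [
    ["great", "great movie", "movie"],
    ["terrible", "terrible plot", "plot"],
    ["great", "great plot", "plot"],
    ["terrible", "terrible movie", "movie"],
]
op = ["pos", "neg", "pos", "neg"]

# The analyzer returns the list untouched, so each n-gram is one vocabulary entry
count_vectorizer = CountVectorizer(analyzer=lambda ngrams: ngrams, binary=True)
data = count_vectorizer.fit_transform(docs)

classifier = BernoulliNB().fit(data, op)
new_doc = [["great", "great movie"]]
prediction = classifier.predict(count_vectorizer.transform(new_doc))
print(data.shape, prediction)
```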

Show label probability/confidence in NLTK

I'm using the MaxEnt classifier from the Python NLTK library. For my dataset, I have many possible labels, and as expected, MaxEnt returns just one label. I have trained my dataset and get about 80% accuracy. I've also tested my model on unknown data items, and the results are good. However, for any given unknown input, I want to be able to print/display a ranking of all the possible labels based on some internal criteria MaxEnt used to select the one, such as confidence/probability. For example, suppose I had a,b,c as possible labels and I use MaxEnt.classify(input), I get currently one label, let's say c. However, I want to be able to view something like a (0.9), b(0.7), c(0.92), so I can see why c was selected, and possibly choose multiple labels based on those parameters. Apologies for my fuzzy terminology, I'm fairly new to NLP and machine learning.
Solution
Based on the accepted answer, here's a skeleton code example to demonstrate what I wanted and how it can be achieved. More classifier examples on the NLTK website.
import nltk
contents = read_data('mydataset.csv')
data_set = [(feature_sets(input), label) for (label, input) in contents] # User-defined feature_sets() function
train_set, test_set = data_set[:1000], data_set[1000:]
labels = sorted({label for (input, label) in train_set}) # Unique labels seen in training
maxent = nltk.MaxentClassifier.train(train_set)
maxent.classify(feature_sets(new_input)) # Returns one label
multi_label = maxent.prob_classify(feature_sets(new_input)) # Returns a DictionaryProbDist object
for label in labels:
    print(label, multi_label.prob(label))

Try prob_classify(input).
It returns a probability distribution with the probability for each label; see the docs.
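A small runnable sketch of ranking all labels by probability. The feature sets and labels are invented for illustration, and the training options (GIS algorithm, no trace output, ten iterations) are choices made here to keep the example quick, not settings from the question.

```python
import nltk

# Toy labeled feature sets, assumed for illustration only
train_set = [
    ({"outlook": "sunny", "windy": False}, "a"),
    ({"outlook": "sunny", "windy": True}, "a"),
    ({"outlook": "rainy", "windy": True}, "b"),
    ({"outlook": "rainy", "windy": False}, "b"),
    ({"outlook": "cloudy", "windy": False}, "c"),
    ({"outlook": "cloudy", "windy": True}, "c"),
]

maxent = nltk.MaxentClassifier.train(train_set, algorithm="gis", trace=0, max_iter=10)

new_input = {"outlook": "sunny", "windy": False}
dist = maxent.prob_classify(new_input)  # probability distribution over labels

# Rank every label by its probability, highest first
ranking = sorted(
    ((label, dist.prob(label)) for label in dist.samples()),
    key=lambda pair: pair[1],
    reverse=True,
)
for label, prob in ranking:
    print(label, round(prob, 3))
```

The probabilities sum to one across labels, so a result like a (0.9), b (0.7), c (0.92) cannot occur; to select several labels you could keep every label above a threshold of your choosing.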

Python nltk ~ Handling many features with naivebayes without getting MemoryError or ValueError?

I have many short sentences with an associated label (that I want to predict). For example:
This is a short sentence, label1
For each of the short sentences, I am extracting all of the nouns and adjectives and using them as features for a naive bayes classifier. So for the sentence above:
short, sentence
Are the two words that would be extracted.
How many features?
In total I have 260,000 observations (50 MB csv file). From this I am able to extract about 7,103 unique nouns and adjectives. But using them all is just impractical. So I want to use the top N most frequently occurring words as features.
If I use the top 50 words my code runs just fine. Here is a snippet:
featuresets = [(document_features(i), i.label) for i in y]
train_set, test_set = featuresets[testTrainCutoff:], featuresets[:testTrainCutoff]
classifier = nltk.NaiveBayesClassifier.train(train_set)
Problem
If I use the top 100 words though, I get 'MemoryError' when trying to build 'featuresets'. From another question on stackoverflow, I found a potential solution which was to "use nltk.classify.apply_features which returns an object that acts like a list but does not store all the feature sets in memory" (source).
So I updated my code:
train_set, test_set = y[testTrainCutoff:], y[:testTrainCutoff]
train_set, test_set = apply_features(document_features,train_set), apply_features(document_features,test_set)
classifier = nltk.NaiveBayesClassifier.train(train_set)
However, now I get the following error:
File "C:\Python27\lib\site-packages\nltk\classify\naivebayes.py", line 191, in train
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
As a side note: I am running this on a 40 GB RAM machine, so I find it odd that I am getting these errors

I saw the same issue, and for me it was not related to system resources. Here is the solution for my case.
I was populating my feature object as a dict of features:
features = {<features>}
instead of as a tuple containing a dict and a classification value:
features = ({<features>},classification)
No matter how small I made the feature array, it always gave me
ValueError: too many values to unpack
with this incorrect data structure. Once I changed to a tuple, everything worked fine. Hope this helps.
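The same point applies when using apply_features: the trainer unpacks each item as a (featureset, label) pair, so the lazy list has to yield tuples. A minimal sketch, assuming the raw data can be arranged as (tokens, label) tuples; the sentences, labels, and the body of document_features are invented here for illustration.

```python
from nltk import NaiveBayesClassifier
from nltk.classify import apply_features


def document_features(words):
    """Hypothetical feature extractor: one boolean feature per word."""
    return {"contains({})".format(word): True for word in words}


# Raw data as (tokens, label) tuples, assumed for illustration only
labeled_docs = [
    (["short", "sentence"], "label1"),
    (["short", "note"], "label1"),
    (["long", "paragraph"], "label2"),
    (["long", "essay"], "label2"),
]

# labeled=True makes each lazily computed item a (featureset, label) tuple
train_set = apply_features(document_features, labeled_docs, labeled=True)
first_features, first_label = train_set[0]

classifier = NaiveBayesClassifier.train(train_set)
predicted = classifier.classify(document_features(["short"]))
print(first_features, first_label, predicted)
```

If the items are not (tokens, label) tuples, apply_features yields bare feature dicts, and unpacking a dict with more than two keys into featureset, label is one way to end up with "too many values to unpack".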
