HashingVectorizer vs TfidfVectorizer export file size

HashingVectorizer vs TfidfVectorizer export file size - python

I am generating a model using the below:
from sklearn.linear_model import SGDClassifier
text_clf = OnlinePipeline([('vect', HashingVectorizer()),
('clf-svm', SGDClassifier(loss='log', penalty='l2', alpha=1e-3, max_iter=5, random_state=None)),
])
When I export this model using the below:
from sklearn.externals import joblib
joblib.dump(text_clf, 'text_clf.joblib')
My text_clf.joblib is 45MB. When I replace HashingVectorizer() with TfidfVectorizer() and re-export my model is 9kb.
Why is there such a file difference and is there anyway to reduce the size of the HashingVectorizer export.

HashingVectorizer is stateless, so does not keep anything in memory. Its the number of features that are being passed from HashingVectorizer to the SGDClassifier.
By default the number of features calculated from the data is
n_features=1048576
So, SGDClassifier will have to save coef_, intercept_ etc variables for all these features. And this will increase if your problem is multi-class. For classes greater than 2, the storage will increase by number of classes times.
Need more details about TfidfVectorizer features. What is the size of TfidfVectorizer.vocabulary_ in that case where its size is just 9kb? You can access that by:
len(text_clf.named_steps['vect'].vocabulary_)

Related

How to pickle or otherwise save an RFECV model after fitting for rapid classification of novel data

I am generating a predictive model for cancer diagnosis from a moderately large dataset (>4500 features).
I have got the rfecv to work, providing me with a model that I can evaluate nicely using ROC curves, confusion matrices etc., and which is performing acceptably for classifying novel data.
please find a truncated version of my code below.
logo = LeaveOneGroupOut()
model = RFECV(LinearDiscriminantAnalysis(), step=1, cv=logo.split(X, y, groups=trial_number))
model.fit(X, y)
As I say, this works well and provides a model I'm happy with. The trouble is, I would like to be able to save this model, so that I don't need to do the lengthy retraining everytime I want to evaluate new data.
When I have tried to pickle a standard LDA or other model object, this has worked fine. When I try to pickle this RFECV object, however, I get the following error:
Traceback (most recent call last):
File "/rds/general/user/***/home/data_analysis/analysis_report_generator.py", line 56, in <module>
pickle.dump(key, file)
TypeError: cannot pickle 'generator' object
In trying to address this, I have spent a long time trying to RTFM, google extensively and dug as deep as I dared into Stack without any luck.
I would be grateful if anyone could identify what I could do to pickle this model successfully for future extraction and re-use, or whether there is an equivalent way to save the parameters of the feature-extracted LDA model for rapid analysis of new data.

This occurs because LeaveOneGroupOut().split(X, y, groups=groups) returns a generator object—which cannot be pickled for reasons previously discussed.
To pickle it, you'd have to cast it to a finite number of splits with something like the following, or replace it with StratifiedKFold which does not have this issue.
rfecv = RFECV(
# ...
cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
)
MRE putting all the pieces together (here I've assigned groups randomly):
import pickle
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut
from numpy.random import default_rng
rng = default_rng()
X, y = make_classification(n_samples=500, n_features=15, n_informative=3, n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1, class_sep=0.8, random_state=0)
groups = rng.integers(0, 5, size=len(y))
rfecv = RFECV(
estimator=LinearDiscriminantAnalysis(),
step=1,
cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
scoring="accuracy",
min_features_to_select=1,
n_jobs=4,
)
rfecv.fit(X, y)
with open("rfecv_lda.pickle", "wb") as fh:
pickle.dump(rfecv, fh)
Side note: A better method would be to avoid pickling the RFECV in the first place. rfecv.transform(X) masks feature columns that the search deemed unnecessary. If you have >4500 features and only need 10, you might want to simplify your data pipeline elsewhere.

Getting feature_importances_ after getting optimal TPOT pipeline?

I've read through a few pages but need someone to help explain how to make this work for.
I'm using TPOTRegressor() to get an optimal pipeline, but from there I would love to be able to plot the .feature_importances_ of the pipeline it returns:
best_model = TPOTRegressor(cv=folds, generations=2, population_size=10, verbosity=2, random_state=seed) #memory='./PipelineCache', memory='auto',
best_model.fit(X_train, Y_train)
feature_importance = best_model.fitted_pipeline_.steps[-1][1].feature_importances_
I saw this kind of set up from a now closed issue on Github, but currently I get the error:
Best pipeline: LassoLarsCV(input_matrix, normalize=True)
Traceback (most recent call last):
File "main2.py", line 313, in <module>
feature_importance = best_model.fitted_pipeline_.steps[-1][1].feature_importances_
AttributeError: 'LassoLarsCV' object has no attribute 'feature_importances_'
So, how would I get these feature importances from the optimal pipeline, regardless of which one it lands on? Or is this even possible? Or does someone have a better way of going about trying to plot feature importances from a TPOT run?
Thanks!
UPDATE
For clarification, what is meant by Feature Importance is the determination of how important each feature (X's) of your dataset is in determining the predicted (Y) label, using a barchart to plot each feature's level of importance in coming up with its predictions. TPOT doesn't do this directly (I don't think), so I was thinking I'd grab the pipeline it came up with, re-run it on the training data, and then somehow use a .feature_imprtances_ to then be able to graph the feature importances, as these are all sklearn regressor's I'm using?

Very nice question.
You just need to fit again the best model in order to get the feature importances.
best_model.fit(X_train, Y_train)
exctracted_best_model = best_model.fitted_pipeline_.steps[-1][1]
The last line returns the best model based on the CV.
You can then use:
exctracted_best_model.fit(X_train, Y_train)
to train it. If the best model has the desired attribure, then you will be able to access it after exctracted_best_model.fit(X_train, Y_train)
More details (in my comments) and a Toy example:
from tpot import TPOTRegressor
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
train_size=0.75, test_size=0.25)
# reduce training features for time sake
X_train = X_train[:100,:]
y_train = y_train[:100]
# Fit the TPOT pipeline
tpot = TPOTRegressor(cv=2, generations=5, population_size=50, verbosity=2)
# Fit the pipeline
tpot.fit(X_train, y_train)
# Get the best model
exctracted_best_model = tpot.fitted_pipeline_.steps[-1][1]
print(exctracted_best_model)
AdaBoostRegressor(base_estimator=None, learning_rate=0.5, loss='square',
n_estimators=100, random_state=None)
# Train the `exctracted_best_model` using THE WHOLE DATASET.
# You need to use the whole dataset in order to get feature importance for all the
# features in your dataset.
exctracted_best_model.fit(X, y) # X,y IMPORTNANT
# Access it's features
exctracted_best_model.feature_importances_
# Plot them using barplot
# Here I fitted the model on X_train, y_train and not on the whole dataset for TIME SAKE
# So I got importances only for the features in `X_train`
# If you use `exctracted_best_model.fit(X, y)` we will have importances for all the features !!!
positions= range(exctracted_best_model.feature_importances_.shape[0])
plt.bar(positions, exctracted_best_model.feature_importances_)
plt.show()
IMPORTNANT NOTE: *In the above example, the best model based on the pipeline was AdaBoostRegressor(base_estimator=None, learning_rate=0.5, loss='square'). This model indeed has the attribute feature_importances_.
In the case where the best model does not have an attribute feature_importances_, the exact same code will not work. You will need to read the docs and see the attributes of each returned best model. E.g. if the best model was LassoCV then you would use the coef_ attribute.
Output:

How to use the imbalanced library with sklearn pipeline?

I am trying to solve a text classification problem. I want to create baseline model using MultinomialNB
my data is highly imbalnced for few categories, hence decided to use the imbalanced library with sklearn pipeline and referring the tutorial.
The model is failing and giving error after introducing the two stages in pipeline as suggested in docs.
from imblearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import (EditedNearestNeighbours,
RepeatedEditedNearestNeighbours)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()
pipe = make_pipeline_imb([('vect', CountVectorizer(max_features=100000,\
ngram_range= (1, 2),tokenizer=tokenize_and_stem)),\
('tfidf', TfidfTransformer(use_idf= True)),\
('enn', EditedNearestNeighbours()),\
('renn', RepeatedEditedNearestNeighbours()),\
('clf-gnb', MultinomialNB()),])
Error:
TypeError: Last step of Pipeline should implement fit. '[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
Can someone please help here. I am also open to use different way of (Boosting/SMOTE) implementation as well ?

It seems that the pipeline from ìmblearn doesn't support naming like the one in sklearn. From imblearn documentation :
*steps : list of estimators.
You should modify your code to :
pipe = make_pipeline_imb( CountVectorizer(max_features=100000,\
ngram_range= (1, 2),tokenizer=tokenize_and_stem),\
TfidfTransformer(use_idf= True),\
EditedNearestNeighbours(),\
RepeatedEditedNearestNeighbours(),\
MultinomialNB())

All intermediate steps should be transformers and implement fit and transform

I am implementing a pipeline using important features selection and then using the same features to train my random forest classifier. Following is my code.
m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(train_cv_x)
clf = RandomForestClassifier(5000)
model = Pipeline([('m', m),('sel', sel),('X_new', X_new),('clf', clf),])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}
gs = GridSearchCV(model, params)
gs.fit(train_cv_x,train_cv_y)
So X_new are the new features selected via SelectFromModel and sel.transform. Then I want to train my RF using the new features selected.
I am getting the following error:
All intermediate steps should be transformers and implement fit and transform,
ExtraTreesClassifier ...

Like the traceback says: each step in your pipeline needs to have a fit() and transform() method (except the last, which just needs fit(). This is because a pipeline chains together transformations of your data at each step.
sel.transform(train_cv_x) is not an estimator and doesn't meet this criterion.
In fact, it looks like based on what you're trying to do, you can leave this step out. Internally, ('sel', sel) already does this transformation--that's why it's included in the pipeline.
Secondly, ExtraTreesClassifier (the first step in your pipeline), doesn't have a transform() method, either. You can verify that here, in the class docstring. Supervised learning models aren't made for transforming data; they're made for fitting on it and predicting based off that.
What type of classes are able to do transformations?
Ones that scale your data. See preprocessing and normalization.
Ones that transform your data (in some other way than the above). Decomposition and other unsupervised learning methods do this.
Without reading between the lines too much about what you're trying to do here, this would work for you:
First split x and y using train_test_split. The test dataset produced by this is held out for final testing, and the train dataset within GridSearchCV's cross-validation will be further broken out into smaller train and validation sets.
Build a pipeline that satisfies what your traceback is trying to tell you.
Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.
Roughly, that would look like this:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=444)
sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444),
threshold='mean')
clf = RandomForestClassifier(n_estimators=5000, random_state=444)
model = Pipeline([('sel', sel), ('clf', clf)])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}
gs = GridSearchCV(model, params)
gs.fit(X_train, y_train)
# How well do your hyperparameter optimizations generalize
# to unseen test data?
gs.score(X_test, y_test)
Two examples for further reading:
Pipelining: chaining a PCA and a logistic regression
Sample pipeline for text feature extraction and evaluation

You may also get the error in the title if you were oversampling or undersampling your data using imblearn module and fitting it into a model in a pipeline. If you got this message, then it means you have imported sklearn.pipeline.Pipeline. Import imblearn.pipeline.Pipeline instead and you're golden. For example,
from imblearn.pipeline import Pipeline
pipe = Pipeline([('o', SMOTE()), ('svc', SVC())])
The problem is, if you're sampling your data, the intermediate steps obviously need to sample the data as well, which is not supported by sklearn's Pipeline but is supported by imblearn's Pipeline.

This has happened because the first transformer you pass in a pipeline must have both a fit and transform method.
m = ExtraTreesClassifier(n_estimators = 10)
m.fit(train_cv_x,train_cv_y)
Here m does not have a transform method as ExtraTreesClassifier model does not have a transform method and so fails in the pipeline.
So change the order of the pipeline and add another transformer for the first step in the pipeline

Save classifier to disk in scikit-learn

How do I save a trained Naive Bayes classifier to disk and use it to predict data?
I have the following sample program from the scikit-learn website:
from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print "Number of mislabeled points : %d" % (iris.target != y_pred).sum()

Classifiers are just objects that can be pickled and dumped like any other. To continue your example:
import cPickle
# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
cPickle.dump(gnb, fid)
# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
gnb_loaded = cPickle.load(fid)
Edit: if you are using a sklearn Pipeline in which you have custom transformers that cannot be serialized by pickle (nor by joblib), then using Neuraxle's custom ML Pipeline saving is a solution where you can define your own custom step savers on a per-step basis. The savers are called for each step if defined upon saving, and otherwise joblib is used as default for steps without a saver.

You can also use joblib.dump and joblib.load which is much more efficient at handling numerical arrays than the default python pickler.
Joblib is included in scikit-learn:
>>> import joblib
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import SGDClassifier
>>> digits = load_digits()
>>> clf = SGDClassifier().fit(digits.data, digits.target)
>>> clf.score(digits.data, digits.target) # evaluate training error
0.9526989426822482
>>> filename = '/tmp/digits_classifier.joblib.pkl'
>>> _ = joblib.dump(clf, filename, compress=9)
>>> clf2 = joblib.load(filename)
>>> clf2
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
shuffle=False, verbose=0, warm_start=False)
>>> clf2.score(digits.data, digits.target)
0.9526989426822482
Edit: in Python 3.8+ it's now possible to use pickle for efficient pickling of object with large numerical arrays as attributes if you use pickle protocol 5 (which is not the default).

What you are looking for is called Model persistence in sklearn words and it is documented in introduction and in model persistence sections.
So you have initialized your classifier and trained it for a long time with
clf = some.classifier()
clf.fit(X, y)
After this you have two options:
1) Using Pickle
import pickle
# now you can save it to a file
with open('filename.pkl', 'wb') as f:
pickle.dump(clf, f)
# and later you can load it
with open('filename.pkl', 'rb') as f:
clf = pickle.load(f)
2) Using Joblib
from sklearn.externals import joblib
# now you can save it to a file
joblib.dump(clf, 'filename.pkl')
# and later you can load it
clf = joblib.load('filename.pkl')
One more time it is helpful to read the above-mentioned links

In many cases, particularly with text classification it is not enough just to store the classifier but you'll need to store the vectorizer as well so that you can vectorize your input in future.
import pickle
with open('model.pkl', 'wb') as fout:
pickle.dump((vectorizer, clf), fout)
future use case:
with open('model.pkl', 'rb') as fin:
vectorizer, clf = pickle.load(fin)
X_new = vectorizer.transform(new_samples)
X_new_preds = clf.predict(X_new)
Before dumping the vectorizer, one can delete the stop_words_ property of vectorizer by:
vectorizer.stop_words_ = None
to make dumping more efficient.
Also if your classifier parameters is sparse (as in most text classification examples) you can convert the parameters from dense to sparse which will make a huge difference in terms of memory consumption, loading and dumping. Sparsify the model by:
clf.sparsify()
Which will automatically work for SGDClassifier but in case you know your model is sparse (lots of zeros in clf.coef_) then you can manually convert clf.coef_ into a csr scipy sparse matrix by:
clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)
and then you can store it more efficiently.

sklearn estimators implement methods to make it easy for you to save relevant trained properties of an estimator. Some estimators implement __getstate__ methods themselves, but others, like the GMM just use the base implementation which simply saves the objects inner dictionary:
def __getstate__(self):
try:
state = super(BaseEstimator, self).__getstate__()
except AttributeError:
state = self.__dict__.copy()
if type(self).__module__.startswith('sklearn.'):
return dict(state.items(), _sklearn_version=__version__)
else:
return state
The recommended method to save your model to disc is to use the pickle module:
from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
X = iris.data[:100, :2]
y = iris.target[:100]
model = SVC()
model.fit(X,y)
import pickle
with open('mymodel','wb') as f:
pickle.dump(model,f)
However, you should save additional data so you can retrain your model in the future, or suffer dire consequences (such as being locked into an old version of sklearn).
From the documentation:
In order to rebuild a similar model with future versions of
scikit-learn, additional metadata should be saved along the pickled
model:
The training data, e.g. a reference to a immutable snapshot
The python source code used to generate the model
The versions of scikit-learn and its dependencies
The cross validation score obtained on the training data
This is especially true for Ensemble estimators that rely on the tree.pyx module written in Cython(such as IsolationForest), since it creates a coupling to the implementation, which is not guaranteed to be stable between versions of sklearn. It has seen backwards incompatible changes in the past.
If your models become very large and loading becomes a nuisance, you can also use the more efficient joblib. From the documentation:
In the specific case of the scikit, it may be more interesting to use
joblib’s replacement of pickle (joblib.dump & joblib.load), which is
more efficient on objects that carry large numpy arrays internally as
is often the case for fitted scikit-learn estimators, but can only
pickle to the disk and not to a string:

sklearn.externals.joblib has been deprecated since 0.21 and will be removed in v0.23:
/usr/local/lib/python3.7/site-packages/sklearn/externals/joblib/init.py:15:
FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will
be removed in 0.23. Please import this functionality directly from
joblib, which can be installed with: pip install joblib. If this
warning is raised when loading pickled models, you may need to
re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=FutureWarning)
Therefore, you need to install joblib:
pip install joblib
and finally write the model to disk:
import joblib
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
digits = load_digits()
clf = SGDClassifier().fit(digits.data, digits.target)
with open('myClassifier.joblib.pkl', 'wb') as f:
joblib.dump(clf, f, compress=9)
Now in order to read the dumped file all you need to run is:
with open('myClassifier.joblib.pkl', 'rb') as f:
my_clf = joblib.load(f)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

HashingVectorizer vs TfidfVectorizer export file size - python

Related

How to pickle or otherwise save an RFECV model after fitting for rapid classification of novel data

Getting feature_importances_ after getting optimal TPOT pipeline?

How to use the imbalanced library with sklearn pipeline?

All intermediate steps should be transformers and implement fit and transform

Save classifier to disk in scikit-learn

Categories

Resources