I am building a basic NLP program using nltk and sklearn. I have a large dataset in a database and I am wondering what the best way to train the classifier is.
Is it advisable to download the training data in chunks and pass each chunk to the classifier? Is that even possible, or would I be overwriting what was learned from the previous chunk?
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
while True:
    training_set, proceed = download_chunk()  # pseudo
    trained = SklearnClassifier(MultinomialNB()).train(training_set)
    if not proceed:
        break
How is this normally done? I want to avoid keeping the database connection open for too long.
The way you're doing it right now will just overwrite the classifier for each chunk of your training data, because you create a new SklearnClassifier object each time. Normally you would instantiate the SklearnClassifier before entering the training loop. However, looking at the code here, it appears that the NLTK SklearnClassifier uses the fit method of the underlying scikit-learn model, which means you can't update the model once it is trained. What you need to do instead is instantiate the scikit-learn model directly and use its partial_fit method. Something like this should work:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()  # instantiate the classifier outside the loop or it will just get overwritten
all_classes = [...]    # pseudo: the full set of labels, which partial_fit needs up front

while True:
    training_set, proceed = download_chunk()  # pseudo
    X_chunk, y_chunk = training_set            # partial_fit takes features and labels separately
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)
    if not proceed:
        break
At the end, you'll have a MultinomialNB() classifier that has been trained on each chunk of your data.
Typically, if the whole dataset will fit in memory, it is somewhat more performant to just download the whole thing and call fit once (in which case you could actually use the nltk SklearnClassifier). See the notes about the partial_fit method here. However, if you are unable to fit the entire set in memory, it is certainly common practice to train on chunks of the data. You can do this by making several calls to the database or by extracting all of the information from the database, placing it in a CSV on your hard drive, and reading chunks of it from there.
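For example, here is a minimal sketch of the CSV-chunk approach (assuming a hypothetical training_data.csv with a text column and a label column; the stateless HashingVectorizer is one way to keep the text features consistent across chunks, and alternate_sign=False keeps them non-negative, as MultinomialNB requires):

import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = HashingVectorizer(alternate_sign=False)  # stateless, so safe for streaming
clf = MultinomialNB()
all_classes = ['pos', 'neg']  # hypothetical label set, must be known before the first partial_fit

# read the exported CSV in chunks and update the classifier incrementally
for chunk in pd.read_csv('training_data.csv', chunksize=10000):
    X = vectorizer.transform(chunk['text'])
    clf.partial_fit(X, chunk['label'], classes=all_classes)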
Note
If you're using a shared database with other users, the DBAs may prefer that you extract everything at once, since this would (probably) take up fewer DB resources than making several separate, smaller calls to the database.
Related
I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:
# PREPROCESSING THE DATA
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x, y, test_size = 0.2, random_state = 69, stratify = y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRAIN DATA INTO A NESTED LIST, AS WORD2VEC EXPECTS A LIST OF TOKEN LISTS
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]
# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count = 1)
key_index = model.wv.key_to_index
# MAKE A DICT
we_dict = {word:model.wv[word] for word in key_index}
# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)
The new dataframe is the vectorized form of the train data. Now how do I do the same process for the test data? I can't pass the whole corpus (train+test) to the Word2Vec instance as it might lead to data leakage. Should I simply pass the test list to another instance of the model as:
model = Word2Vec(test_x3, min_count = 1)
I don't think this would be the correct way. Any help is appreciated!
PS: I am not using the pretrained word2vec in an LSTM model. What I am doing is training Word2Vec on the data that I have and then feeding it to an ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.
Note that because word2vec is an unsupervised algorithm, it can sometimes be defensible to use all available texts to train it. That includes texts with known labels that you're withholding from other supervised-classification steps as test/validation records.
You just make sure the labels themselves aren't in the training data, but still use the bulk unlabeled text for further unsupervised improvement of the raw word-vectors. Those vectors, influenced by all the input text (but none of the known-answer labels) are then used for enhanced feature-modeling of the texts, as input to later supervised label-aware steps.
(Whether this is OK for your project may depend on what future performance you want your various accuracy/etc. evaluation measures to reasonably estimate. Is it new situations where everything must always be trained from scratch, and where relevant raw text and labels as training data are both scarce? Or situations where the corpus always grows, & text is always plentiful even if labels are expensive to acquire, or where any actual deployed classifiers will be able to leverage other unlabeled texts before committing to a prediction?)
But note also: word-vectors are only comparison-compatible with each other when trained together, into a shared space. (Or made compatible via other, less-common post-training alignment steps.) There's no single right place for any word's vector, just a good relative position with regard to everything trained in the same session – which used randomization in both initialization & training, so even repeated runs on the same training data can yield end models of approximately-equivalent usefulness with wildly-different word-coordinates.
So, when withholding your test-set texts from initial word2vec training, you shouldn't train a separate word2vec model on just the test texts, but rather use the frozen word2vec model from the training data to vectorize them.
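As a minimal sketch of that frozen-model approach (doc_vector is a hypothetical helper, and averaging word vectors is just one common way to turn them into per-document features for RF/LGBM):

import numpy as np

def doc_vector(tokens, w2v_model):
    # mean of the vectors for the words the model knows; zeros if none are known
    vecs = [w2v_model.wv[w] for w in tokens if w in w2v_model.wv]
    if not vecs:
        return np.zeros(w2v_model.vector_size)
    return np.mean(vecs, axis=0)

# `model` is the Word2Vec fitted on train_x3 only; it is reused, frozen, for the test texts
train_features = np.vstack([doc_vector(doc, model) for doc in train_x3])
test_features = np.vstack([doc_vector(doc, model) for doc in test_x3])
# train_features / test_features can now be fed to a classifier such as RF or LGBM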
Separately: min_count=1 is almost always a bad idea for word2vec models, & if you're tempted to use it, you may have far too little data for such a data-hungry algorithm to show its true value. (On the datasets where it really shines, you should more often be raising that threshold above its default – discarding more rare words – than lowering it to save every rare, hard-to-model-well word.)
Usually people use scikit-learn to train a model this way:
from sklearn.ensemble import GradientBoostingClassifier as gbc
clf = gbc()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
It works fine as long as users' memory is large enough to accommodate the entire dataset. The dilemma for me is exactly this: the dataset is too big for my memory. My current solution is to enlarge the virtual memory of my machine, but I have already made the system extremely slow by having too much virtual memory. So I started to wonder whether or not it is possible to feed the fit() method with samples in batches like this (and the answer is no, please keep reading and stop reminding me that the answer is no):
clf = gbc()
for i in range(X_train.shape[0]):
    clf.fit(X_train[i], y_train[i])
so that I can read the training set from the hard drive only when needed. I read sklearn's manual and it seems to me that it does not support this:
Calling fit() more than once will overwrite what was learned by any previous fit()
So, is this possible?
This does not work in scikit-learn, as explained in the comment section as well as in the documentation. However, you can use river (a Python package for online/streaming machine learning). This package should be well suited to your problem.
Below is an example of training a LinearRegression using river.
from river import datasets
from river import linear_model
from river import metrics
from river import preprocessing
dataset = datasets.TrumpApproval()
model = (
    preprocessing.StandardScaler() |
    linear_model.LinearRegression(intercept_lr=.1)
)

metric = metrics.MAE()

for x, y in dataset:
    y_pred = model.predict_one(x)
    # Update the running metric with the prediction and ground truth value
    metric.update(y, y_pred)
    # Train the model with the new sample
    model.learn_one(x, y)
It is not clear from your question which steps in the machine learning process are slow for you. As also noted in the manual for riverml and this post on sklearn, there is an option to do a partial fit. You will be restricted in terms of the models you can use for this incremental learning.
So, using your example, let's say we use a stochastic gradient descent classifier:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(100000)
clf = SGDClassifier(loss='log')  # note: 'log' was renamed to 'log_loss' in newer scikit-learn releases
all_classes = list(set(y))

# feed the data in 100 batches; `classes` tells every batch about the full label set
for ix in np.split(np.arange(0, X.shape[0]), 100):
    clf.partial_fit(X[ix, :], y[ix], classes=all_classes)
After reading section 6, "Strategies to scale computationally: bigger data", of the official manual mentioned by @StupidWolf in this post, I realize there is more to this question than meets the eye.
The real difficulty lies in the design of many of the models.
Take Random Forest as an example: one of the most important techniques used to improve its performance over a simple Decision Tree is bagging, which means the algorithm has to draw random samples from the entire dataset to construct the several weak learners that form the forest. Feeding the model one sample after another won't work with this design.
Although scikit-learn could, in principle, define an interface for end users to implement, so that scikit-learn could request a random sample through it and the user's implementation would decide how to return the needed data by scanning the dataset on the hard drive, this is far more complicated than I initially thought, and the performance gain may not be significant given that the IO-heavy "full table scan" (in database terms) would be needed frequently.
I am relatively new to logistic regression using scikit-learn in Python. After reading some topics and viewing some demos, I decided to dive in myself.
So, basically, I am trying to predict the conversion rate of customers based on some features. The outcome is either Active (1) or Not active (0). I tried KNN and logistic regression. With KNN I get an average accuracy of 0.893 and with logistic regression 0.994. The latter seems very high; is that even realistic/possible?
Anyway: suppose that my model is indeed very accurate. I would now like to import a new dataset with the same feature columns and predict their conversions (they end this month). In the case above I used cross_val_score to get the accuracy scores.
Do I now need to import the new set and somehow fit that new set to this model? (Not training it again; now I just want to use it.)
Can someone please inform me how I can proceed? If additional info is needed, please comment on that.
Thanks in advance!
For the statistics question: of course it can happen, either because your data has little noise or because of the scenario Clock Slave mentioned in the comments.
For reusing the classifier, you could pickle it (save it as a binary with the pickle module), and then just load it whenever you need it and use the clf.predict() method on the new data:
import pickle

# Do the classification and name the fitted object clf
with open('clf.pickle', 'wb') as file:
    pickle.dump(clf, file, pickle.HIGHEST_PROTOCOL)
And then later you can load it
import pickle

with open('clf.pickle', 'rb') as file:
    clf = pickle.load(file)

# Now predict on the new dataframe df
pred = clf.predict(df.values)
Besides pickle, joblib can be used as well.
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib  # in recent scikit-learn versions, use `import joblib` instead

# assume X, Y are already defined
model = LogisticRegression()
model.fit(X, Y)

# save the model to disk
filename = 'finalized_model.sav'
joblib.dump(model, filename)

# load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
I'm trying to train an NLTK classifier for sentiment analysis and then save the classifier using pickle.
The freshly trained classifier works fine. However, if I load a saved classifier, it outputs either 'positive' or 'negative' for ALL examples.
I'm saving the classifier using
classifier = nltk.NaiveBayesClassifier.train(training_set)
classifier.classify(words_in_tweet)
f = open('classifier.pickle', 'wb')
pickle.dump(classifier, f)
f.close()
and loading the classifier using
f = open('classifier.pickle', 'rb')
classifier = pickle.load(f)
f.close()
classifier.classify(words_in_tweet)
I'm not getting any errors.
Any idea what the problem could be, or how to debug this correctly?
The most likely place a pickled classifier can go wrong is with the feature extraction function. This must be used to generate the feature vectors that the classifier works with.
The NaiveBayesClassifier expects feature vectors for both training and classification; your code looks as if you passed the raw words to the classifier instead (but presumably only after unpickling, otherwise you wouldn't get different behavior before and after unpickling). You should store the feature extraction code in a separate file, and import it in both the training and the classifying (or testing) script.
I doubt this applies to the OP, but some NLTK classifiers take the feature extraction function as an argument to the constructor. When you have separate scripts for training and classifying, it can be tricky to ensure that the unpickled classifier successfully finds the same function. This is because of the way pickle works: pickling only saves data, not code. To get it to work, just put the extraction function in a separate file (module) that your scripts import. If you put it in the "main" script, pickle.load will look for it in the wrong place.
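A minimal sketch of that layout (the file names, extract_features, labeled_tweets and words_in_tweet are illustrative, not taken from the OP's code):

# features.py -- shared by the training and classifying scripts
def extract_features(words):
    # turn a list of tokens into the feature dict NLTK classifiers expect
    return {word: True for word in words}

# train.py
import pickle
import nltk
from features import extract_features

# labeled_tweets is assumed to be a list of (token_list, label) pairs
training_set = [(extract_features(words), label) for words, label in labeled_tweets]
classifier = nltk.NaiveBayesClassifier.train(training_set)
with open('classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

# classify.py
import pickle
from features import extract_features  # same module, so the features match training

with open('classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)
print(classifier.classify(extract_features(words_in_tweet)))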
Is it possible to delete or insert a step in a sklearn.pipeline.Pipeline object?
I am trying to do a grid search with or without one step in the Pipeline object, and I am wondering whether I can insert or delete a step in the pipeline. I saw in the Pipeline source code that there is a self.steps object holding all the steps, and we can get the steps via named_steps. Before modifying it, I want to make sure I don't cause unexpected effects.
Here is some example code:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('svm', SVC())]
clf = Pipeline(estimators)
clf
Is it possible to do something like steps = clf.named_steps, then insert into or delete from it? Would this cause undesired effects on the clf object?
I see that everyone mentioned only the delete step. In case you want to also insert a step in the pipeline:
pipe.steps.append(['step name',transformer()])
pipe.steps works in the same way as lists do, so you can also insert an item into a specific location:
pipe.steps.insert(1,['estimator',transformer()]) #insert as second step
Based on rudimentary testing you can safely remove a step from a scikit-learn pipeline just like you would any list item, with a simple
clf_pipeline.steps.pop(n)
where n is the position of the individual estimator you are trying to remove.
Just chiming in because I feel like the other answers answered the question of adding steps to a pipeline really well, but didn't really cover how to delete a step from a pipeline.
Watch out with my approach though. Slicing lists in this instance is a bit weird.
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
estimators = [('reduce_dim', PCA()), ('poly', PolynomialFeatures()), ('svm', SVC())]
clf = Pipeline(estimators)
If you want to create a pipeline with just the PCA/Polynomial steps, you can just slice the step list by index and pass it to Pipeline:
clf1 = Pipeline(clf.steps[0:2])
Want to use just steps 2/3? Watch out, these slices don't always make the most sense:
clf2 = Pipeline(clf.steps[1:3])
Want to use just steps 1/3? I can't seem to do that with this approach:
clf3 = Pipeline(clf.steps[0] + clf.steps[2]) # errors
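As far as I can tell, that fails because each element of clf.steps is a (name, estimator) tuple, so + concatenates the two tuples into one flat 4-tuple rather than a list of two steps. Wrapping the two steps in a list seems to work:

clf3 = Pipeline([clf.steps[0], clf.steps[2]])  # PCA followed directly by SVC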
Yes, that's possible, but you must fulfill the same requirements that Pipeline enforces at initialization: you cannot insert a predictor in any step except the last, and you should call fit again after you update Pipeline.steps, because such an update invalidates all steps (even if they were learned in previous fit calls). Also, the last step of a Pipeline should always implement fit, and all previous steps should implement fit_transform.
So yes, it will work with the current codebase, but I don't think it's a good solution for your task: it makes your code more dependent on the current implementation of Pipeline. It's more convenient to create a new Pipeline with the modified steps, because Pipeline will at least validate all your steps at initialization, and creating a new Pipeline will not differ significantly in speed from modifying the steps of an existing one. As I've just said, creating a new Pipeline after each modification of the steps is also safer in case someone significantly changes the implementation of Pipeline.
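For example, a minimal sketch of that "build a new Pipeline" suggestion, reusing the clf from the question (dropping the 'reduce_dim' step is just an illustration):

from sklearn.pipeline import Pipeline

# copy the existing steps, drop the one you don't want, and let the new
# Pipeline validate everything at construction time
new_steps = [(name, est) for name, est in clf.steps if name != 'reduce_dim']
clf_without_pca = Pipeline(new_steps)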