Using scikit learn, I have trained my model but dont know how to use the model to predict new text passages. I have watched tons of tutorials but none of them go beyond training and testing. Below is the code Im using
data_source_url = "/path/to/file.csv"
airline_tweets = pd.read_csv(data_source_url)
features = airline_tweets.iloc[:, 10].values
labels = airline_tweets.iloc[:, 1].values
processed_features = []
# I do some text processing here and then append the text to processed_features
vectorizer = CountVectorizer(analyzer = 'word', lowercase = False)
features = vectorizer.fit_transform(processed_features)
features_nd = features.toarray() # for easy usage
X_train, X_test, y_train, y_test = train_test_split(features_nd, labels, train_size=0.80, random_state=1234)
log_model = LogisticRegression()
log_model = log_model.fit(X=X_train, y=y_train)
predictions = log_model.predict(X_test)
Basically, you just need to follow the same step to transform your new dataset. Then, use your trained model to predict. It looks something like that:
new_dataset = ... # read your new dataset
new_processed_features = []
# do the same text processing here
# Use the same vectorizer to transform your new dataset
new_features = vectorizer.transform(new_processed_features)
new_features_nd = new_features.toarray() # for easy usage
# Use your trained model to predict new dataset
new_predictions = log_model.predict(new_features_nd)
Related
I have several logs that are manipulated by two different TfIdfVectorizer objects.
The first one reads and splits the log in ngrams
with open("my/log/path.txt", "r") as test:
corpus = [test.read()]
tf = TfidfVectorizer(ngram_range=ngr)
corpus_transformed = tf.fit_transform(corpus)
infile.close()
The resulting data is written in a Pandas dataframe that has 4 columns
(score [float], review [ngrams of text], isbad [0/1], kfold [int]).
Initial kfold value is -1.
where I have:
my_df = pd.DataFrame(corpus_transformed.toarray(), index=['score'], columns=tf.get_feature_names()).transpose()
For cross-validation I split the dataset in test and train with StratifiedKFolds by doing a simple:
ngr=(1,2)
for fold_ in range(5):
# 'reviews' column has short sentences expressing an opinion
train_df = df[df.kfold != fold_].reset_index(drop=True)
test_df = df[df.kfold == fold_].reset_index(drop=True)
new_tf = TfidfVectorizer(ngram_range=ngr)
new_tf.fit(train_df.reviews)
xtrain = new_tf.transform(train_df.reviews)
xtest = new_tf.transform(test_df.reviews)
And only after this double tfidf transformation I fit my SVC model with:
model.fit(xtrain, train_df.isbad) # where 'isbad' column is 0 if negative and 1 if positive
preds = model.predict(xtest)
accuracy = metrics.accuracy_score(test_df.isbad, preds)
So at the end of the day I have my model that classifies reviews in both classes (negative-0 or positive-1), I dump my model and both tfidf vectorizers (tf and new_tf) but when it comes to new data, even if I do:
with open("never/seen/data.txt", "r") as unseen: # load REAL SAMPLE data
corpus = [unseen.read()]
# to transform the unseen data I use one of the dumped tfidf's obj
corpus_transformed = tf_dump.transform(corpus)
unseen.close()
my_unseen_df = pd.DataFrame(corpus_transformed.toarray(), index=['score'], columns=tf_dump.get_feature_names()).transpose()
my_unseen_df = my_unseen_df.sample(frac=1).reset_index(drop=True) # randomize rows
# to transform reviews' data that are going to be classified I use the new_tf dump, like before
X = new_tf_dump.transform(my_unseen_df.reviews)
# use the previously loaded model and make predictions
res = model_dump.predict(X)
#print(res)
I got ValueError: X has 604,969 features, but SVC is expecting 605,424 as input, but how is that possibile if I manipulate the data with the same objects? What am I doing wrong here?
I want to use my trained model as a classifier for new, unseen data. Isn't this the right way to go?
Thank you.
I am trying to do some bad case analysis on my product categorization model using SHAP. My data looks something like this:
corpus_train, corpus_test, y_train, y_test = train_test_split(data['Name_Description'],
data['Category_Target'],
test_size = 0.2,
random_state=8)
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3), min_df=3, analyzer='word')
X_train = vectorizer.fit_transform(corpus_train)
X_test = vectorizer.transform(corpus_test)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
X_train_sample = shap.sample(X_train, 100)
X_test_sample = shap.sample(X_test, 20)
masker = shap.maskers.Independent(data=X_test_sample)
explainer = shap.LinearExplainer(model, masker=masker)
shap_values = explainer.shap_values(X_test_sample)
X_test_array = X_test_sample.toarray()
shap.summary_plot(shap_values, X_test_array, feature_names=vectorizer.get_feature_names(), class_names=data['Category'].unique())
Now to save space I didn't include the actual summary plot, but it looks fine. My issue is that I want to be able to analyze a single prediction and get something more along these lines:
In other words, I want to know which specific words contribute the most to the prediction. But when I run the code in cell 36 in the image above I get an
AttributeError: 'numpy.ndarray' object has no attribute 'output_names'
I'm still confused on the indexing of shap_values. How can I solve this?
I was unable to find a solution with SHAP, but I found a solution using LIME. The following code displays a very similar output where its easy to see how the model made its prediction and how much certain words contributed.
c = make_pipeline(vectorizer, classifier)
# saving a list of strings version of the X_test object
ls_X_test= list(corpus_test)
# saving the class names in a dictionary to increase interpretability
class_names = list(data.Category.unique())
# Create the LIME explainer
# add the class names for interpretability
LIME_explainer = LimeTextExplainer(class_names=class_names)
# explain the chosen prediction
# use the probability results of the logistic regression
# can also add num_features parameter to reduce the number of features explained
LIME_exp = LIME_explainer.explain_instance(ls_X_test[idx], c.predict_proba)
LIME_exp.show_in_notebook(text=True, predict_proba=True)
Using the kernalSHAP, first you need to find the shaply value and then find the single instance, as following below;
#convert your training and testing data using the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_train = tfidf_vectorizer.fit_transform(IV_train)
tfidf_test = tfidf_vectorizer.transform(IV_test)
model=LogisticRegression()
model.fit(tfidf_train, DV_train)
#shap apply
#first shorten the data & convert to data frame
X_train_sample = tfidf_train[0:20]
sample_text = pd.DataFrame(X_test_sample)
SHAP_explainer = shap.KernelExplainer(model.predict, X_train_sample)
shap_vals = SHAP_explainer.shap_values(X_test_sample)
#print it.
print(df_test.iloc[7].Text , df_test.iloc[7].Label)
shap.initjs()
shap.force_plot(SHAP_explainer.expected_value, shap_vals[7,:],sample_text.iloc[7,:], feature_names=tfidf_vectorizer.get_feature_names_out())
as the original text is "good article interested natural alternatives treat ADHD" and Label is "1"
I am new to Machine Learning and am in the process of trying to run a simple classification model that I trained and saved using pickle, on another dataset of the same format. I have the following python code.
Code
#Training set
features = pd.read_csv('../Data/Train_sop_Computed.csv')
#Testing set
testFeatures = pd.read_csv('../Data/Test_sop_Computed.csv')
print(colored('\nThe shape of our features is:','green'), features.shape)
print(colored('\nThe shape of our Test features is:','green'), testFeatures.shape)
features = pd.get_dummies(features)
testFeatures = pd.get_dummies(testFeatures)
features.iloc[:,5:].head(5)
testFeatures.iloc[:,5].head(5)
labels = np.array(features['Truth'])
testlabels = np.array(testFeatures['Truth'])
features= features.drop('Truth', axis = 1)
testFeatures = testFeatures.drop('Truth', axis = 1)
feature_list = list(features.columns)
testFeature_list = list(testFeatures.columns)
def add_missing_dummy_columns(d, columns):
missing_cols = set(columns) - set(d.columns)
for c in missing_cols:
d[c] = 0
def fix_columns(d, columns):
add_missing_dummy_columns(d, columns)
# make sure we have all the columns we need
assert (set(columns) - set(d.columns) == set())
extra_cols = set(d.columns) - set(columns)
if extra_cols: print("extra columns:", extra_cols)
d = d[columns]
return d
testFeatures = fix_columns(testFeatures, features.columns)
features = np.array(features)
testFeatures = np.array(testFeatures)
train_samples = 100
X_train, X_test, y_train, y_test = model_selection.train_test_split(features, labels, test_size = 0.25, random_state = 42)
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)
print(colored('\n TRAINING SET','yellow'))
print(colored('\nTraining Features Shape:','magenta'), X_train.shape)
print(colored('Training Labels Shape:','magenta'), X_test.shape)
print(colored('Testing Features Shape:','magenta'), y_train.shape)
print(colored('Testing Labels Shape:','magenta'), y_test.shape)
print(colored('\n TESTING SETS','yellow'))
print(colored('\nTraining Features Shape:','magenta'), testX_train.shape)
print(colored('Training Labels Shape:','magenta'), textX_test.shape)
print(colored('Testing Features Shape:','magenta'), testy_train.shape)
print(colored('Testing Labels Shape:','magenta'), testy_test.shape)
from sklearn.metrics import precision_recall_fscore_support
import pickle
loaded_model_RFC = pickle.load(open('../other/SOPmodel_RFC', 'rb'))
result_RFC = loaded_model_RFC.score(textX_test, testy_test)
print(colored('Random Forest Classifier: ','magenta'),result_RFC)
loaded_model_SVC = pickle.load(open('../other/SOPmodel_SVC', 'rb'))
result_SVC = loaded_model_SVC.score(textX_test, testy_test)
print(colored('Support Vector Classifier: ','magenta'),result_SVC)
loaded_model_GPC = pickle.load(open('../other/SOPmodel_Gaussian', 'rb'))
result_GPC = loaded_model_GPC.score(textX_test, testy_test)
print(colored('Gaussian Process Classifier: ','magenta'),result_GPC)
loaded_model_SGD = pickle.load(open('../other/SOPmodel_SGD', 'rb'))
result_SGD = loaded_model_SGD.score(textX_test, testy_test)
print(colored('Stocastic Gradient Descent: ','magenta'),result_SGD)
I am able to get the results for the test set.
But the problem I am facing is that I need to run the model on the entire Test_sop_Computed.csv dataset. But it is only being run on the test dataset that I've split.
I would sincerely appreciate if anyone could provide any suggestions on how I can run the loaded model on the entire dataset. I know that I'm going wrong with the following line of code.
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)
Both the train and test dataset have the Subject, Predicate, Object, Computed and Truth and the features with the Truth being the predicted class. The testing dataset has the actual values for this Truth column and I dopr it usingtestFeatures = testFeatures.drop('Truth', axis = 1) and intend on using the various loaded models of classifiers to predict this Truth as 0 or 1 for the entire dataset and then get the predictions as an array.
I have done this so far. But I think that I am splitting my test dataset as well. Is there a way to pass the entire test dataset even if it is in another file?
This test dataset is in the same format as the training set. I have checked the shape of the two and I get the following.
Confirming the Features and Shape
Shape of the Train features is: (1860, 5)
Shape of the Test features is: (1386, 5)
TRAINING SET
Training Features Shape: (1395, 1045)
Training Labels Shape: (465, 1045)
Testing Features Shape: (1395,)
Testing Labels Shape: (465,)
TEST SETS
Training Features Shape: (1039, 1045)
Training Labels Shape: (347, 1045)
Testing Features Shape: (1039,)
Testing Labels Shape: (347,)
Any suggestions in this regard will be highly appreciated.
Your question is a bit unclear but as I understand, you want to run your model on testX_train and on testX_test (which is just testFeatures splitted into two sub datasets).
So, either you can run your model on testX_train the same way you did for testX_test, e.g. :
result_RFC_train = loaded_model_RFC.score(textX_train, testy_train)
or you can just remove the following line :
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)
So you just don't split you data and run it on the full dataset :
result_RFC_train = loaded_model_RFC.score(testFeatures, testlabels)
I have a saved logistic regression model which I trained with training data and saved using joblib. I am trying to load this model in a different script, pass it new data and make a prediction based on the new data.
I am getting the following error "sklearn.exceptions.NotFittedError: CountVectorizer - Vocabulary wasn't fitted." Do I need to fit the data again ? I would have thought that the point of being able to save the model would be to not have to do this.
The code I am using is below excluding the data cleaning section. Any help to get the prediction to work would be appreciated.
new_df = pd.DataFrame(latest_tweets,columns=['text'])
new_df.to_csv('new_tweet.csv',encoding='utf-8')
csv = 'new_tweet.csv'
latest_df = pd.read_csv(csv)
latest_df.dropna(inplace=True)
latest_df.reset_index(drop=True,inplace=True)
new_x = latest_df.text
loaded_model = joblib.load("finalized_mode.sav")
tfidf_transformer = TfidfTransformer()
cvec = CountVectorizer()
x_val_vec = cvec.transform(new_x)
X_val_tfidf = tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
print (result)
Your training part have 3 parts which are fitting the data:
CountVectorizer: Learns the vocabulary of the training data and returns counts
TfidfTransformer: Learns the counts of the vocabulary from previous part, and returns tfidf
LogisticRegression: Learns the coefficients for features for optimum classification performance.
Since each part is learning something about the data and using it to output the transformed data, you need to have all 3 parts while testing on new data. But you are only saving the lr with joblib, so the other two are lost and with it is lost the training data vocabulary and count.
Now in your testing part, you are initializing new CountVectorizer and TfidfTransformer, and calling fit() (fit_transform()), which will learn the vocabulary only from this new data. So the words will be less than the training words. But then you loaded the previously saved LR model, which expects the data according to features like training data. Hence this error:
ValueError: X has 130 features per sample; expecting 223086
What you need to do is this:
During training:
filename = 'finalized_model.sav'
joblib.dump(lr, filename)
filename = 'finalized_countvectorizer.sav'
joblib.dump(cvec, filename)
filename = 'finalized_tfidftransformer.sav'
joblib.dump(tfidf_transformer, filename)
During testing
loaded_model = joblib.load("finalized_model.sav")
loaded_cvec = joblib.load("finalized_countvectorizer.sav")
loaded_tfidf_transformer = joblib.load("finalized_tfidftransformer.sav")
# Observe that I only use transform(), not fit_transform()
x_val_vec = loaded_cvec.transform(new_x)
X_val_tfidf = loaded_tfidf_transformer.transform(x_val_vec)
result = loaded_model.predict(X_val_tfidf)
Now you wont get that error.
Recommendation:
You should use TfidfVectorizer in place of both CountVectorizer and TfidfTransformer, so that you dont have to use two objects all the time.
And along with that you should use Pipeline to combine the two steps:- TfidfVectorizer and LogisticRegression, so that you only have to use a single object (which is easier to save and load and generic handling).
So edit the training part like this:
tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
# Internally your X_train will be automatically converted to tfidf
# and that will be passed to lr
tfidf_lr_pipe.fit(X_train, y_train)
# Similarly here only transform() will be called internally for tfidfvectorizer
# And that data will be passed to lr.predict()
y_preds = tfidf_lr_pipe.predict(x_test)
# Now you can save this pipeline alone (which will save all its internal parts)
filename = 'finalized_model.sav'
joblib.dump(tfidf_lr_pipe, filename)
During testing, do this:
loaded_pipe = joblib.load("finalized_model.sav")
result = loaded_model.predict(new_x)
You have not fit the CountVectorizer.
You should do like this..
cvec = CountVectorizer()
x_val_vec = cvec.fit_transform(new_x)
Similarly, TfidTransformer must be used like this..
X_val_tfidf = tfidf_transformer.fit_transform(x_val_vec)
This is my first attempt of document classification with ML and Python.
I first query my database to extract 5000 articles related to money laundering and convert them to pandas df
Then I extract 500 articles not related to money laundering and also convert them to pandas df
I concatenate both dfs and label them either 'money-laundering' or 'other'
I do preprocessing (removing punctuation and stopwords, lower case etc)
and then feed the model based on bag of words principle as below:
vectorizer = CountVectorizer(analyzer = "word",
tokenizer = None,
preprocessor = None,
stop_words = None,
max_features = 5000)
text_features = vectorizer.fit_transform(full_df["processed full text"])
text_features = text_features.toarray()
labels = np.array(full_df['category'])
X_train, X_test, y_train, y_test = train_test_split(text_features, labels, test_size=0.33)
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
accuracy_score(y_pred=y_pred, y_true=y_test)
It works fine until now (even though gives me too high accuracy 99%). But I would like to test it on a completely new text document now. If I vectorize it and do forest.predict(test) it obviously says:
ValueError: Number of features of the model must match the input. Model n_features is 5000 and input n_features is 45
I am not sure how to overcome this to be able to classify totally new article.
First of all, even though my proposition may work, I strongly emphasize the fact that this solution has some statistical and computational consequences that you would need to understand before running this code.
Let assume you have an initial corpus of texts full_df["processed full text"] and test is the new text you would like to test.
Then, let define full_added the corpus of texts with full_df and test.
text_features = vectorizer.fit_transform(full_added)
text_features = text_features.toarray()
You could use full_df as your train set (X_train = full_df["processed full text"] and y_train = np.array(full_df['category'])).
And then you can run
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(X_train, y_train)
y_pred = forest.predict(test)
Of course, in this solution, you have already defined your parameters and you consider your model robust on new data.
Another remark is that if you have a stream of new texts as input that you would like to analyze, this solution would be dreadful since the computational time of computing a new vectorizer.fit_transform(full_added) would increase dramatically.
I hope it helps.
My first implementation of Naive Bayes was from Text Blob library. It was extremely slow and my machine eventually run out of memory.
The second try was based on this article http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html and used MultinomialNB from sklearn.naive_bayes library. And it worked liked a charm:
#initialize vectorizer
count_vectorizer = CountVectorizer(analyzer = "word",
tokenizer = None,
preprocessor = None,
stop_words = None,
max_features = 5000)
counts = count_vectorizer.fit_transform(df['processed full text'].values)
targets = df['category'].values
#divide into train and test sets
X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.33)
#create classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
#check accuracy
y_pred = classifier.predict(X_test)
accuracy_score(y_true=y_test, y_pred=y_pred)
#check on completely new example
new_counts = count_vectorizer.transform([processed_test_string])
prediction = classifier.predict(new_counts)
prediction
output:
array(['money laundering'],
dtype='<U16')
And the accuracy is around 91% so more realistic than 99.96%..
Exactly what I wanted. Would be also nice to see the most informative features, I will try to work it out. Thanks everyone.