classify new document - Random Forest, Bag of Words - python

This is my first attempt at document classification with ML and Python.
I first query my database to extract 5000 articles related to money laundering and convert them to pandas df
Then I extract 500 articles not related to money laundering and also convert them to pandas df
I concatenate both dfs and label them either 'money-laundering' or 'other'
I do preprocessing (removing punctuation and stopwords, lower case etc)
and then feed the model based on bag of words principle as below:
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)
text_features = vectorizer.fit_transform(full_df["processed full text"])
text_features = text_features.toarray()
labels = np.array(full_df['category'])
X_train, X_test, y_train, y_test = train_test_split(text_features, labels, test_size=0.33)
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
accuracy_score(y_pred=y_pred, y_true=y_test)
It works fine so far (even though it gives me a suspiciously high accuracy of 99%). But I would now like to test it on a completely new text document. If I vectorize it and call forest.predict(test), it obviously says:
ValueError: Number of features of the model must match the input. Model n_features is 5000 and input n_features is 45
I am not sure how to overcome this so that I can classify a totally new article.

First of all, even though my proposal may work, I want to stress that this solution has statistical and computational consequences that you need to understand before running this code.
Let's assume you have an initial corpus of texts full_df["processed full text"] and test is the new text you would like to classify.
Then, let's define full_added as the corpus of texts containing both full_df and test.
text_features = vectorizer.fit_transform(full_added)
text_features = text_features.toarray()
You could then use the rows of text_features that come from full_df as your train set (X_train = text_features[:-1] and y_train = np.array(full_df['category'])), assuming test was appended as the last row of full_added.
And then you can run
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(X_train, y_train)
y_pred = forest.predict(text_features[-1:])
Of course, in this solution, you have already defined your parameters and you consider your model robust on new data.
Another remark is that if you have a stream of new texts as input that you would like to analyze, this solution would be dreadful since the computational time of computing a new vectorizer.fit_transform(full_added) would increase dramatically.
I hope it helps.
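A cheaper alternative to refitting can be sketched as follows, under the assumption that the vocabulary learned from the training corpus is good enough for new articles: fit the vectorizer once, then call transform (not fit_transform) on the new text so that it is mapped onto the same columns the forest was trained on. The toy corpus and the variable names train_texts, train_labels and new_text are made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for full_df["processed full text"] and full_df["category"].
train_texts = [
    "bank transfer shell company offshore account",
    "cash smuggling offshore shell company",
    "suspicious transfer bank account cash",
    "football match final score goal",
    "new phone release camera battery",
    "weather forecast rain sunny weekend",
]
train_labels = ["money-laundering"] * 3 + ["other"] * 3

vectorizer = CountVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, train_labels)

# transform() reuses the learned vocabulary, so the column count matches.
new_text = ["offshore shell company moved cash through bank account"]
X_new = vectorizer.transform(new_text)
prediction = forest.predict(X_new)
print(X_train.shape[1], X_new.shape[1], prediction)
```

Words in the new text that never appeared in the training corpus are simply ignored, which is the trade-off of not refitting.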

My first implementation of Naive Bayes was from the TextBlob library. It was extremely slow and my machine eventually ran out of memory.
The second try was based on this article http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html and used MultinomialNB from the sklearn.naive_bayes module. And it worked like a charm:
#initialize vectorizer
count_vectorizer = CountVectorizer(analyzer = "word",
                                   tokenizer = None,
                                   preprocessor = None,
                                   stop_words = None,
                                   max_features = 5000)
counts = count_vectorizer.fit_transform(df['processed full text'].values)
targets = df['category'].values
#divide into train and test sets
X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.33)
#create classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
#check accuracy
y_pred = classifier.predict(X_test)
accuracy_score(y_true=y_test, y_pred=y_pred)
#check on completely new example
new_counts = count_vectorizer.transform([processed_test_string])
prediction = classifier.predict(new_counts)
prediction
output:
array(['money laundering'],
dtype='<U16')
And the accuracy is around 91%, so more realistic than 99.96%.
Exactly what I wanted. It would also be nice to see the most informative features; I will try to work that out. Thanks everyone.
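One possible way to look at the most informative features, sketched here on a made-up two-class corpus: MultinomialNB exposes feature_log_prob_ (log P(term | class)), and the difference between the two class rows gives a log-likelihood ratio per term. This is only one reasonable ranking, not the only definition of "informative", and the data below is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for df['processed full text'].
texts = [
    "bank transfer shell company offshore account",
    "cash smuggling offshore shell company",
    "suspicious transfer bank account cash",
    "football match final score goal",
    "football phone release camera battery",
    "weather forecast rain sunny weekend",
]
labels = ["money laundering"] * 3 + ["other"] * 3

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(texts)
classifier = MultinomialNB().fit(counts, labels)

# Column index -> term, built from vocabulary_ (term -> column index).
terms = np.empty(len(count_vectorizer.vocabulary_), dtype=object)
for term, idx in count_vectorizer.vocabulary_.items():
    terms[idx] = term

# Row k of feature_log_prob_ holds log P(term | classes_[k]).
log_ratio = classifier.feature_log_prob_[0] - classifier.feature_log_prob_[1]
order = np.argsort(log_ratio)
top_for_first = terms[order[::-1][:5]]   # terms favouring classes_[0]
top_for_second = terms[order[:5]]        # terms favouring classes_[1]
print(classifier.classes_[0], list(top_for_first))
print(classifier.classes_[1], list(top_for_second))
```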

Related

RandomForestRegressor: Found input variables with inconsistent numbers of samples

This is for a project that's due soon so help would be greatly appreciated, I've never done ML before so sorry if the mistake is an absolute smooth brain one.
I have a dataset that's a bunch of tweets along with personality scores, and I need to train a model to predict the scores.
This is what I've done so far by following a bunch of tutorials and stitching together what I learned.
train = pandas.read_csv('../dataset/cleaner_dataset.csv')
train['tweet'] = train['tweet'].str.lower()
train['tweet'] = train['tweet'].replace('[^a-zA-Z0-9]', ' ', regex = True)
X = train['tweet']
y = train['neuroticism']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)
vectorizer = TfidfVectorizer(min_df=5)
X_test_vec = vectorizer.fit_transform(X_train)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_vectorized, y_train)
model.score(X_test_vec, y_test)
However I'm getting an error on the last line of code when I run it in the notebook.
ValueError: Found input variables with inconsistent numbers of samples: [495, 1980]
Full error message: https://imgur.com/a/GS7jEi5
You are using X_train for both the training and the test features, which is why you are getting the error.
Try:
vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test) # use the same vectorizer, do not define a new one
As pointed out below, we don't fit on the test set.
But you still need to use X_test with y_test.
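One way to make this mistake harder to repeat is to wrap the vectorizer and the regressor in a single pipeline, so that fitting and scoring always receive matching raw text and targets. The sketch below assumes made-up tweets and scores; only the TfidfVectorizer(min_df=...) and RandomForestRegressor pieces come from the question.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical tweets and neuroticism scores, repeated to get 24 samples.
anxious = ["worried about the exam", "cannot sleep so stressed",
           "nervous about the interview", "anxious and tired again"]
calm = ["great day at the beach", "relaxing walk in the park",
        "happy dinner with friends", "calm morning with coffee"]
tweets = (anxious + calm) * 3
scores = ([4.5, 4.0, 4.2, 4.8] + [1.5, 2.0, 1.8, 1.2]) * 3

X_train, X_test, y_train, y_test = train_test_split(
    tweets, scores, test_size=0.25, random_state=0)

# The pipeline fits the vectorizer on X_train only and reuses it for X_test.
model = make_pipeline(TfidfVectorizer(min_df=2),
                      RandomForestRegressor(n_estimators=50, random_state=0))
model.fit(X_train, y_train)

preds = model.predict(X_test)
r2 = model.score(X_test, y_test)  # same number of rows as y_test
print(len(X_train), len(X_test), len(preds))
```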

Use SHAP values to explain LogisticRegression Classification

I am trying to do some bad case analysis on my product categorization model using SHAP. My data looks something like this:
corpus_train, corpus_test, y_train, y_test = train_test_split(data['Name_Description'],
data['Category_Target'],
test_size = 0.2,
random_state=8)
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3), min_df=3, analyzer='word')
X_train = vectorizer.fit_transform(corpus_train)
X_test = vectorizer.transform(corpus_test)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
X_train_sample = shap.sample(X_train, 100)
X_test_sample = shap.sample(X_test, 20)
masker = shap.maskers.Independent(data=X_test_sample)
explainer = shap.LinearExplainer(model, masker=masker)
shap_values = explainer.shap_values(X_test_sample)
X_test_array = X_test_sample.toarray()
shap.summary_plot(shap_values, X_test_array, feature_names=vectorizer.get_feature_names(), class_names=data['Category'].unique())
Now to save space I didn't include the actual summary plot, but it looks fine. My issue is that I want to be able to analyze a single prediction and get something more along these lines:
In other words, I want to know which specific words contribute the most to the prediction. But when I run the code in cell 36 in the image above I get an
AttributeError: 'numpy.ndarray' object has no attribute 'output_names'
I'm still confused on the indexing of shap_values. How can I solve this?
I was unable to find a solution with SHAP, but I found one using LIME. The following code displays a very similar output, where it's easy to see how the model made its prediction and how much certain words contributed.
c = make_pipeline(vectorizer, model)
# saving a list of strings version of the X_test object
ls_X_test= list(corpus_test)
# saving the class names in a dictionary to increase interpretability
class_names = list(data.Category.unique())
# Create the LIME explainer
# add the class names for interpretability
LIME_explainer = LimeTextExplainer(class_names=class_names)
# explain the chosen prediction
# use the probability results of the logistic regression
# can also add num_features parameter to reduce the number of features explained
LIME_exp = LIME_explainer.explain_instance(ls_X_test[idx], c.predict_proba)
LIME_exp.show_in_notebook(text=True, predict_proba=True)
Using KernelSHAP, you first compute the Shapley values and then pick out the single instance, as follows:
#convert your training and testing data using the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_train = tfidf_vectorizer.fit_transform(IV_train)
tfidf_test = tfidf_vectorizer.transform(IV_test)
model=LogisticRegression()
model.fit(tfidf_train, DV_train)
#shap apply
#first shorten the data & convert to data frame
X_train_sample = tfidf_train[0:20]
sample_text = pd.DataFrame(X_test_sample)
SHAP_explainer = shap.KernelExplainer(model.predict, X_train_sample)
shap_vals = SHAP_explainer.shap_values(X_test_sample)
#print it.
print(df_test.iloc[7].Text , df_test.iloc[7].Label)
shap.initjs()
shap.force_plot(SHAP_explainer.expected_value, shap_vals[7,:],sample_text.iloc[7,:], feature_names=tfidf_vectorizer.get_feature_names_out())
Here the original text is "good article interested natural alternatives treat ADHD" and the label is "1".
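For a linear model there is also a library-free way to see which words drive a single prediction: each word's contribution to the log-odds is its coefficient times its TF-IDF value. This is not SHAP itself, only a rough sketch of the same idea for the binary case, and the corpus and labels below are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical binary-labelled corpus.
corpus = [
    "good article about natural alternatives",
    "interested in natural treatment options",
    "good overview of alternatives to medication",
    "terrible service and rude staff",
    "bad product broke after one day",
    "rude reply and terrible support",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
model = LogisticRegression().fit(X, labels)

# Column index -> term, built from vocabulary_ (term -> column index).
terms = np.empty(len(vectorizer.vocabulary_), dtype=object)
for term, idx in vectorizer.vocabulary_.items():
    terms[idx] = term

doc = ["good article interested natural alternatives"]
x = vectorizer.transform(doc).toarray()[0]

# Per-word share of the log-odds for class 1.
contributions = model.coef_[0] * x
for j in np.argsort(np.abs(contributions))[::-1][:5]:
    if x[j] != 0:
        print(terms[j], round(float(contributions[j]), 3))

# The contributions plus the intercept add up to the decision function.
total = contributions.sum() + model.intercept_[0]
```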

I can't get my test accuracy to increase in a sentiment analysis

I'm not sure if this is the right place to ask, but my test accuracy is always about 0.40 while my training accuracy reaches 1.0. I'm trying to do sentiment analysis of tweets about Trump; I have annotated each tweet with a positive, negative, or neutral polarity. I want to be able to predict the polarity of new data with my model. I've tried different models, but the SVM seems to give me the highest test accuracy. I'm unsure why my test accuracy is so low and would appreciate any help or direction.
trump = pd.read_csv("trump_data.csv", delimiter = ";")
#drop all nan values
trump = trump.dropna()
trump = trump.rename(columns = {"polarity,,,":"polarity"})
#print(trump.columns)
def tokenize(text):
    ps = PorterStemmer()
    return [ps.stem(w.lower()) for w in word_tokenize(text)]
X = trump.text
y = trump.polarity
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .2, random_state = 42)
svm = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'),
tokenizer=tokenize)), ('svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3,
random_state=42,max_iter=5, tol=None))])
svm.fit(X_train, y_train)
model = svm.score(X_test, y_test)
print("The svm Test Classification Accuracy is:", model )
print("The svm training set accuracy is : {}".format(naive.score(X_train,y_train)))
y_pred = svm.predict(X)
This is an example of one of the strings in the text column of the dataset
".#repbilljohnson congress must step up and overturn president trump’s discriminatory #eo banning #immigrants & #refugees #oxfam4refugees"
Data set
Why are you using naive.score? I assume it's a copy-paste mistake. Here are a few steps you can follow.
Make sure you have enough data points and clean them. Cleaning the dataset is an unavoidable step in data science.
Make use of parameters like ngram_range, max_df, min_df, and max_features while featurizing the text with either TfidfVectorizer or CountVectorizer. You may also try embeddings using Word2Vec.
Do hyperparameter tuning on alpha, penalty, and other variables using GridSearchCV or RandomizedSearchCV. Make sure you are cross-validating correctly. Refer to the documentation for more info.
If the dataset is imbalanced, then try other metrics such as log-loss, precision, recall, and F1-score.
Make sure your model is neither overfitted nor underfitted by checking the train error and test error.
Other than SVM, also try traditional models like Logistic Regression, Naive Bayes (NB), and Random Forest (RF). If you have a large number of data points, then you may try deep learning models.
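The tuning step above can be sketched as a grid search over a vectorizer-plus-SGDClassifier pipeline. The corpus, the grid values, and the choice of macro F1 as the score are illustrative assumptions, not recommendations for the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical three-class corpus, 6 examples per class.
positive = ["love this great policy", "great speech really good",
            "good job strong economy"]
negative = ["terrible decision very bad", "bad policy awful speech",
            "awful and terrible idea"]
neutral = ["press conference held today", "statement released this morning",
           "meeting scheduled for monday"]
texts = (positive + negative + neutral) * 2
labels = (["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3) * 2

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("svm", SGDClassifier(loss="hinge", penalty="l2", random_state=42,
                          max_iter=1000, tol=1e-3)),
])

# Step names prefix the parameter names: <step>__<parameter>.
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "svm__alpha": [1e-4, 1e-3, 1e-2],
}

# The vectorizer is refit inside every fold, so no test text leaks in.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```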
It turns out I needed to clean the polarity column, as it had values such as "positive,", "positive,," and "positive,,," which were not being recognized as the same label, so I just removed all "," from the column.
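A minimal sketch of that cleanup with pandas, assuming a column named polarity and a handful of invented rows:

```python
import pandas as pd

# Hypothetical rows showing the stray trailing commas.
trump = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e"],
    "polarity": ["positive,", "positive,,", "positive,,,", "negative", "neutral,,"],
})

n_before = trump["polarity"].nunique()  # each comma variant counts as its own label

# Remove every literal comma, then trim any leftover whitespace.
trump["polarity"] = (trump["polarity"]
                     .str.replace(",", "", regex=False)
                     .str.strip())

n_after = trump["polarity"].nunique()
print(n_before, n_after)
print(trump["polarity"].value_counts())
```

Checking value_counts() on the label column before training is a quick way to catch this kind of problem.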

logistic regression classifier for prediction in Python

I am trying to make a script that takes a JSON file (pizza-train.json) from this Kaggle competition. I want to extract the request_text field from each dictionary in the list and construct a bag-of-words representation of the string (string to count list).
The next step is to train a logistic regression classifier to predict the variable “requester_received_pizza”. I want to train on 90% of the data and predict on the remaining 10%. The problem is that I don't know how to predict the 10%. Any advice would be really helpful!
import json
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
f_json = json.load(open('pizza-train.json'))
request_text = []
y = []
for item in f_json[:100]:
    request_text.append(item['request_text'])
    y.append(item['requester_received_pizza'])
vectorizer = CountVectorizer(min_df=1, lowercase=True, stop_words='english')
train_data_features = vectorizer.fit_transform(request_text)
train_data_features = train_data_features.toarray()
print('Shape = ')
print(train_data_features.shape)
vocab = vectorizer.get_feature_names_out()
print('\n')
print('Vocab = ')
print(vocab)
x_train, x_test, y_train, y_test = train_test_split(train_data_features, y, test_size=0.10)
You might do it like this:
from sklearn.linear_model import LogisticRegression
alg = LogisticRegression()
alg.fit(x_train, y_train)
test_score = alg.score(x_test, y_test)
You should read the sklearn docs on logistic regression and cross-validation, which are very good and provide more sophisticated methods for validating your models. This tutorial for the Kaggle Titanic competition might also be useful.
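To get the actual predictions for the held-out 10% rather than only a score, predict and predict_proba can be called on the test features. The sketch below uses invented request texts and, as an assumption that differs from the question's code, splits the raw text before vectorizing so the vocabulary comes from the training part only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical requests; True means the requester received a pizza.
got_pizza = ["lost my job and kids are hungry please help",
             "student with no money until payday will pay it forward"]
no_pizza = ["just want pizza", "pizza please"]
request_text = (got_pizza + no_pizza) * 5
y = ([True, True] + [False, False]) * 5

text_train, text_test, y_train, y_test = train_test_split(
    request_text, y, test_size=0.10, random_state=0)

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
x_train = vectorizer.fit_transform(text_train)  # vocabulary from the 90%
x_test = vectorizer.transform(text_test)        # same columns for the 10%

alg = LogisticRegression()
alg.fit(x_train, y_train)

predictions = alg.predict(x_test)          # one label per held-out request
probabilities = alg.predict_proba(x_test)  # columns follow alg.classes_
print(predictions, probabilities.shape)
```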

Unigram Analysis with Scikit Learn

I am trying to do some analysis on unigrams in scikit-learn. I created files in svmlight format and tried to run MultinomialNB(), KNeighborsClassifier(), and SVC(). When I first tried to do that with unigrams, I got an X training dimension error, presumably because the only unigrams included for a given example are the ones that actually occur in it. I tried creating svmlight-format training files that include placeholders for every unigram seen in the corpus, even those not in the given example.
The problem is that this inflated the training files from 3 MB to 300 MB, which caused memory errors when sklearn loaded the files. Is there a way to get around the dimension mismatches or memory overflows?
X_train, y_train = load_svmlight_file(trainFile)
x_test, y_test = load_svmlight_file(testFile)
try:
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    preds = clf.predict(x_test)
    print('Input data: ' + trainFile.split('.')[0])
    print('naive_bayes')
    print('accuracy: ' + str(accuracy_score(y_test, preds)))
    if 1 in preds:
        print('precision: ' + str(precision_score(y_test, preds)))
        print('recall: ' + str(recall_score(y_test, preds)))
except Exception as inst:
    print('fail in NB ' + 'Input data: ' + trainFile.split('.')[0])
    print(str(inst))
2828 training examples and 1212 test examples with 18000 distinct unigrams
EDIT I tried to use the sklearn CountVectorizer but I am still getting the memory issues. Is this the best way to do this?
def fileLoadForPipeline(trainSetFile, valSetFile):
    with open(trainSetFile) as json_file:
        tdata = json.load(json_file)
    with open(valSetFile) as json_file:
        vdata = json.load(json_file)
    x_train = []
    x_val = []
    y_train = []
    y_val = []
    for t in tdata:
        x_train.append(t['request_text'])
        y_train.append(t['requester_received_pizza'])
    for v in vdata:
        x_val.append(v['request_text'])
        y_val.append(v['requester_received_pizza'])
    return x_train, y_train, x_val, y_val
def buildPipeline(trainset, valset, norm):
    x_train, y_train, x_val, y_val = fileLoadForPipeline(trainset, valset)
    bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
    xT = bigram_vectorizer.fit_transform(x_train).toarray()
    xV = bigram_vectorizer.fit_transform(x_val).toarray()
    if norm:
        transformer = TfidfTransformer()
        xT = transformer.fit_transform(xT)
        xV = transformer.fit_transform(xV)
    results = []
    for clf, name in ((Perceptron(n_iter=50), "Perceptron"),
                      (KNeighborsClassifier(n_neighbors=40), "kNN"),
                      (MultinomialNB(alpha=.01), 'MultinomialNB'),
                      (BernoulliNB(alpha=.1), 'BernoulliNB'),
                      (svm.SVC(class_weight='auto'), 'svc')):
        print(80 * '=')
        print(name)
        results.append(benchmark(clf))
Try using scikit-learn's CountVectorizer, which will do the feature extraction on raw text for you. Most importantly, the method fit_transform called on a set of training examples will automatically do the bag-of-words unigram transformation: it keeps track of all n unique words found in the training corpus and converts each document into an array of length n whose features can be either discrete word counts or binary presence features (depending on the binary option). The great thing about CountVectorizer is that it stores data in SciPy sparse matrix format, which makes it very memory efficient and should solve any memory problems you're having, as long as you don't convert the result to a dense array.
You can then call transform on future testing examples, and it will do conversion like normal.
This should also help solve any dimensionality issues, as CountVectorizer maps every document onto the same fixed vocabulary. Specific information on usage here:
http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage
An added benefit of this is that you can combine this vectorizer with a classifier using a Pipeline to make fitting and testing more convenient.
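A small sketch of the points above, using invented documents: the fitted vectorizer returns a sparse matrix, and calling transform on validation text yields the same number of columns as the training matrix.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and validation documents.
train_docs = ["please send pizza",
              "pizza for my hungry family",
              "hungry student needs help"]
val_docs = ["help a hungry student", "never seen words here"]

vectorizer = CountVectorizer(ngram_range=(1, 1))
X_train = vectorizer.fit_transform(train_docs)  # vocabulary learned here only
X_val = vectorizer.transform(val_docs)          # unseen words are ignored

# Both matrices are sparse and share the same column layout.
print(sparse.issparse(X_train), X_train.shape, X_val.shape)

# Only the non-zero counts are stored, not every (document, word) cell.
total_cells = X_train.shape[0] * X_train.shape[1]
print("stored non-zeros:", X_train.nnz, "of", total_cells)
```

Classifiers such as MultinomialNB accept these sparse matrices directly, so there is usually no need to call toarray() before fitting.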
