Get accuracy of guess - python

I'm currently trying to find the pronounceability of a list of words using this SO question
The following code is as follows:
import random
def scramble(s):
return "".join(random.sample(s, len(s)))
words = [w.strip() for w in open('/usr/share/dict/words') if w == w.lower()]
scrambled = [scramble(w) for w in words]
X = words+scrambled
y = ['word']*len(words) + ['unpronounceable']*len(scrambled)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
text_clf = Pipeline([
('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
('clf', MultinomialNB())
])
text_clf = text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)
from sklearn import metrics
print(metrics.classification_report(y_test, predicted))
This outputs with random words this
>>> text_clf.predict("scaroly".split())
['word']
I've been checking the scikit documentation but I still can't seem to find out how I would have it print the score of the input word.

Try sklearn.pipeline.Pipeline.predict_proba:
>>> text_clf.predict_proba(["scaroly"])
array([[ 5.87363027e-04, 9.99412637e-01]])
It returns the likelihood a given input (in this case, "scaroly") belongs to the classes upon which you trained the model. So there's a 99.94% chance "scaroly" is pronounceable.
Conversely, the Welsh word for "new" is likely unpronounceable:
>>> text_clf.predict_proba(["newydd"])
array([[ 0.99666533, 0.00333467]])

Related

Using Sklearn, Category Predictions not working on the test data

Dataset: I created a very simple dataset of "Supplier", "Item description" columns . This dataset has a list of item descriptions and preferred supplier for that item
Requirement: I would like to write a program that will take an "Item Description" and predict the "Supplier". To keep it very simple, I just have only 5 Unique supplier-Item Description combinations out of the 950 rows in the .txt file
Issue: The accuracy shows up 1 and confusing matrix shows no false positives. But when I give a new data, the prediction is wrong.
Steps Done
Read .txt for "Supplier" and "Item Description"
Label Encoder applied on the "Item Description"
train test and split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
Created a Pipeline for applying the TfidfVectorizer and MultinomialNB
pipeline = Pipeline([('vect', vectorizer),
('clf', MultinomialNB())
])
model = pipeline.fit(X_train, y_train)
fit model and predict :
y_pred=model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
acc= accuracy_score(y_test,y_pred)
# acc is 1.0 and the cm shows no false positives/negatgives
So far, things look ok
dumped the pickle
pickle.dump(model, open(r'supplier_predictions.pkl','wb'))
Tried prediction on a Item Description= 'Lego, Barbie and other Toy Items' ; I was expecting "Toys R Us"
The prediction was wrong, it came up as "Office Depot".
loadedModel = pickle.load(open("supplier_predictions.pkl","rb"))
new_items = {'ITEM_DESCRIPTION': ['Lego, Barbie and other Toy Items']}
new_X = pd.DataFrame(new_items, columns = ['ITEM_DESCRIPTION'])
new_y_pred=loadedModel.predict(new_X)
Can you please let me know
what I am doing wrong here to get the wrong prediction, new_y_pred for the test item description passed in (new_X)
This is my first ML code. I have tried debugging this by looking at various articles, but no luck.
Thanks
== Complete Code, if it is helpful ==
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import re # librarie for cleaning data
import nltk # library for NLP
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import pickle
df=pd.read_csv('git_suppliers.txt', sep='\t')
# Prep the data - Item Description
from sklearn.feature_extraction.text import TfidfVectorizer
stemmer = PorterStemmer()
words = stopwords.words("english")
df['ITEM_DESCRIPTION'] = df['ITEM_DESCRIPTION'].apply(lambda x: " ".join([stemmer.stem(i) for i in re.sub("[^a-zA-Z0-9]", " ", x).split() if i not in words]).lower())
# Feature Generation using the TF-IDF
vectorizer = TfidfVectorizer(min_df= 3, stop_words="english", sublinear_tf=True, norm='l2', ngram_range=(1, 2))
final_features = vectorizer.fit_transform(df['ITEM_DESCRIPTION']).toarray()
final_features.shape
# final_features shows only 43 features - not going to use SelectKBest for such such less features count
#
# Split into training and test data
#
X = df['ITEM_DESCRIPTION']
y = df['SUPPLIER']
from sklearn.preprocessing import LabelEncoder
labelObj = LabelEncoder()
y=labelObj.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
y_test_decoded=labelObj.inverse_transform(y_test)
#
# Create a pipeline, fit the model, predict for test data and save in pickle
#
pipeline = Pipeline([('vect', vectorizer),
('clf', MultinomialNB())
])
model = pipeline.fit(X_train, y_train)
# Predict for test data
y_pred=model.predict(X_test)
# Accuracy shows up as 1.0 and the confusion matrix shows no false positives/negatives
from sklearn.metrics import confusion_matrix,accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
acc= accuracy_score(y_test,y_pred)
print(acc)
# Dump the model and lets predict for one item description,
# for which i expect Toys R Us as the supplier/Seller
pickle.dump(model, open(r'supplier_predictions.pkl','wb'))
loadedModel = pickle.load(open("supplier_predictions.pkl","rb"))
new_items = {'ITEM_DESCRIPTION': ['Lego, Barbie and other Toy Items']}
new_X = pd.DataFrame(new_items, columns = ['ITEM_DESCRIPTION'])
new_y_pred=loadedModel.predict(new_X)
labelObj.inverse_transform(new_y_pred)
### Shows Office Depot
My bad - the input to the predict was wrong type. Passed in a series and it worked fine.
new_items = pd.Series(new_items)
new_y_pred=loadedModel.predict(new_items)
labelObj.inverse_transform(new_y_pred)

Predicting with a trained model

I used Logistic regression to create a model ,later saved the model using joblib. Later i tried loading that model and predicting label in my test.csv . When ever i try this i get an error saying "X has 1433445 features per sample; expecting 3797015"
This is my initial code:-
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
#reading data
train=pd.read_csv('train_yesindia.csv')
test=pd.read_csv('test_yesindia.csv')
train=train.iloc[:,1:]
test=test.iloc[:,1:]
test.info()
train.info()
test['label']='t'
test=test.fillna(' ')
train=train.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']
train['total']=train['title']+' '+train['author']+train['text']
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)
targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#split in samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf, targets, random_state=0)
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
print('Accuracy of Lasso classifier on training set: {:.2f}'
.format(logreg.score(X_train, y_train)))
print('Accuracy of Lasso classifier on test set: {:.2f}'
.format(logreg.score(X_test, y_test)))
targets = train['label'].values
logreg = LogisticRegression()
logreg.fit(counts, targets)
example_counts = count_vectorizer.transform(test['total'].values)
predictions = logreg.predict(example_counts)
pred=pd.DataFrame(predictions,columns=['label'])
pred['id']=test['id']
pred.groupby('label').count()
#dumping models
from joblib import dump, load
dump(logreg,'mypredmodel1.joblib')
Later i loaded model in a different code that is :-
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from joblib import dump, load
test=pd.read_csv('test_yesindia.csv')
test=test.iloc[:,1:]
test['label']='t'
test=test.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']
#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check
#load_model
logreg = load('mypredmodel1.joblib')
example_counts = count_vectorizer.fit_transform(test['total'].values)
predictions = logreg.predict(example_counts)
When i run it, i get the error:
predictions = logreg.predict(example_counts)
Traceback (most recent call last):
File "<ipython-input-58-f28afd294d38>", line 1, in <module>
predictions = logreg.predict(example_counts)
File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 289, in predict
scores = self.decision_function(X)
File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 270, in decision_function
% (X.shape[1], n_features))
ValueError: X has 1433445 features per sample; expecting 3797015
Most probably, this is because you are re-fitting your transformers in the test set. This must not be done - you should also save them fitted in your training set, and use the test (or any other future) set only for transforming data.
This is easier done with pipelines.
So, remove the following code from your first block:
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)
targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
and replace it with:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('counts', CountVectorizer(ngram_range=(1, 2)),
('tf-idf', TfidfTransformer(smooth_idf=False))
])
pipeline.fit(train['total'].values)
tfidf = pipeline.transform(train['total'].values)
targets = train['label'].values
test_tfidf = pipeline.transform(test['total'].values)
dump(pipeline, 'transform_predict.joblib')
Now, in your second code block, remove this part:
#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check
and replace it with:
pipeline = load('transform_predict.joblib')
test_tfidf = pipeline.transform(test['total'].values)
And you should be fine, provided that you predict the test_tfidf variable, and not the example_counts which are not transfomed by TF-IDF:
predictions = logreg.predict(test_tfidf)

how to add confusion matrix and k-fold 10 fold in sentiment analysis

I want to add an evaluation model using the cross-validation and confusion matrix k-fold (k = 10) method, but I'm confused
dataset : https://github.com/fadholifh/dats/blob/master/cpas.txt
Using Pyhon 3.7
import sklearn.metrics
import sen
import csv
import os
import re
import nltk
import scipy
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.externals import joblib
from sklearn.pipeline import Pipeline
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
factorys = StemmerFactory()
stemmer = factorys.create_stemmer()
if __name__ == "__main__":
main()
the result is confusion matrix and for k-fold each fold has a percentage of F1-score, precission, and recall
df = pd.read_csv("cpas.txt", header=None, delimiter="\t")
X = df[1].values
y = df[0].values
stop_words = stopwords.words('english')
stemmer = PorterStemmer()
def clean_text(text, stop_words, stemmer):
return " ".join([stemmer.stem(word) for word in word_tokenize(text)
if word not in stop_words and not word.isnumeric()])
X = np.array([clean_text(text, stop_words, stemmer) for text in X])
kfold = KFold(3, shuffle=True, random_state=33)
i = 1
for train_idx, test_idx in kfold.split(X):
X_train = X[train_idx]
y_train = y[train_idx]
X_test = X[test_idx]
y_test = y[test_idx]
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
model = LinearSVC()
model.fit(X_train, y_train)
print ("Fold : {0}".format(i))
i += 1
print (classification_report(y_test, model.predict(X_test)))
The reason you use cross validation is for parameter tuning when the data is less. One can use grid search with CV to do this.
df = pd.read_csv("cpas.txt", header=None, delimiter="\t")
X = df[1].values
labels = df[0].values
text = np.array([clean_text(text, stop_words, stemmer) for text in X])
idx = np.arange(len(text))
np.random.shuffle(idx)
text = text[idx]
labels = labels[idx]
pipeline = Pipeline([
('vectorizer', TfidfVectorizer()),
('svm', LinearSVC())])
params = {
'vectorizer__ngram_range' : [(1,1),(1,2),(2,2)],
'vectorizer__lowercase' : [True, False],
'vectorizer__norm' : ['l1','l2']}
model = GridSearchCV(pipeline, params, cv=3, verbose=1)
model.fit(text, y)

Classification with one file with entirely the training and another file with entirely test

I am trying to make a classification in which one file is entirely the training and another file is entirely the test. It's possible? I tried:
import pandas
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import cross_validation
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
#csv file from train
df = pd.read_csv('data_train.csv', sep = ',')
#csv file from test
df_test = pd.read_csv('data_test.csv', sep = ',')
#Randomising the rows in the file
df = df.reindex(np.random.permutation(df.index))
df_test = df_test.reindex(np.random.permutation(df_test.index))
vect = CountVectorizer()
X = vect.fit_transform(df['data_train'])
y = df['label']
X_T = vect.fit_transform(df_test['data_test'])
y_t = df_test['label']
X_train, y_train = train_test_split(X, y, test_size = 0, random_state = 100)
X_test, y_test = train_test_split(X_T, y_t, test_size = 0, random_state = 100)
tf_transformer = TfidfTransformer(use_idf=False).fit(X)
X_train_tf = tf_transformer.transform(X)
X_train_tf.shape
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X)
X_train_tfidf.shape
tf_transformer = TfidfTransformer(use_idf=False).fit(X_T)
X_train_tf_teste = tf_transformer.transform(X_T)
X_train_tf_teste.shape
tfidf_transformer = TfidfTransformer()
X_train_tfidf_teste = tfidf_transformer.fit_transform(X_T)
X_train_tfidf_teste.shape
#RegLog
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("confusion matrix")
print(confusion_matrix(y_test, y_pred, labels = y))
print("F-score")
print(f1_score(y_test, y_pred, average=None))
print(precision_score(y_test, y_pred, average=None))
print(recall_score(y_test, y_pred, average=None))
print("cross validation")
scores = cross_validation.cross_val_score(clf, X, y, cv = 10)
print(scores)
print("Accuracy: {} +/- {}".format(scores.mean(), scores.std() * 2))
I have set test_size to zero because I do not want to have a partition in those files. And I also applied Count and TFIDF in the training and test file.
My output error:
Traceback (most recent call last):
File "classif.py", line 34, in
X_train, y_train = train_test_split(X, y, test_size = 0, random_state = 100)
ValueError: too many values to unpack (expected 2)
The error you are getting in train_test_split is clearly indicated and solved by #Alexis. And once again I also suggest to not use train_test_split as it will not do anything except shuffling, which you have already done.
But I want to highlight another important point, i.e., If you are keeping your train and test files separately, then just don't fit vectorizers separately. It will create different columns for train and test files. Example:
cv = CountVectorizer()
train=['Hi this is stack overflow']
cv.fit(train)
cv.get_feature_names()
Output:
['hi', 'is', 'overflow', 'stack', 'this']
test=['Hi that is not stack overflow']
cv.fit(test)
cv.get_feature_names()
Output:
['hi', 'is', 'not', 'overflow', 'stack', 'that']
Hence, fitting them separately will result in columns mismatch. So, you should merge train and test files firstly and then fit_transform vectorizer collectively, or if you don't have test data beforehand you could only transform the test data using vectorizer fitted on train data, which will ignore the words not present in train data.
So first, for the error you get , just write the code as follow, it should work.
X_train, y_train,_,_ = train_test_split(X, y, test_size = 0, random_state = 100)
X_test, y_test,_,_ = train_test_split(X_T, y_t, test_size = 0, random_state = 100)
the code is made to return 4 sets and it is expected that you have 4 variables to receive them. Putting _ is just to let everyone know that you don't care about those outputs.
Second, i don't really know why you are doing this manipulation. If you want to shuffle the data, it's not the best way to do it. And you have already done it before.

How to use save SVM model for prediction

Referring to the post at How to use save model for prediction in python
when I load and predict with the new data..I am getting the following error.
is there anything we can do to resolve it?
UnicodeEncodeError: 'decimal' codec can't encode character u'\u2019' in position 510: invalid decimal Unicode string
my Entire code....
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
X_train, X_test, y_train, y_test = train_test_split(df['IssueDetails'], df['CRST'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = LinearSVC().fit(X_train_tfidf, y_train)
cif_svm = Pipeline([('tfidf', tfidf_transformer), ('SVC', clf)])
from sklearn.externals import joblib
joblib.dump(cif_svm, 'modelsvm.pk1')
Fitmodel = joblib.load('modelsvm.pk1')
Fitmodel.predict(df_v)
I found the answer for my question above. I used the below code for prediction
datad['CRSTS']=datad['Detail'].apply(lambda x: unicode(clf.predict(count_vect.transform([x]))))

Categories

Resources