Referring to the post How to use save model for prediction in python: when I load the model and predict on new data, I get the following error. Is there anything we can do to resolve it?
UnicodeEncodeError: 'decimal' codec can't encode character u'\u2019' in position 510: invalid decimal Unicode string
My entire code:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
X_train, X_test, y_train, y_test = train_test_split(df['IssueDetails'], df['CRST'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = LinearSVC().fit(X_train_tfidf, y_train)
cif_svm = Pipeline([('tfidf', tfidf_transformer), ('SVC', clf)])
from sklearn.externals import joblib
joblib.dump(cif_svm, 'modelsvm.pk1')
Fitmodel = joblib.load('modelsvm.pk1')
Fitmodel.predict(df_v)
I found the answer to my question above. I used the code below for prediction:
datad['CRSTS']=datad['Detail'].apply(lambda x: unicode(clf.predict(count_vect.transform([x]))))
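For context: the error most likely occurs because the saved pipeline starts at TfidfTransformer, so calling predict on raw text feeds unicode strings into a step that expects a numeric count matrix (under Python 2, the failed numeric conversion surfaces as this 'decimal' codec error). A cleaner fix is to make the CountVectorizer the first pipeline step, so that predict() accepts raw strings directly. A minimal sketch, assuming the new data exposes its raw text in a 'Detail' column as in the self-answer; note also that sklearn.externals.joblib is deprecated in recent scikit-learn in favour of the standalone joblib package:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
import joblib

# With the vectorizer inside the pipeline, predict() takes raw text,
# so no manual count_vect.transform() call is needed after loading.
text_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svc', LinearSVC()),
])
text_svm.fit(X_train, y_train)          # X_train is the raw-text column
joblib.dump(text_svm, 'modelsvm.pkl')

loaded = joblib.load('modelsvm.pkl')
preds = loaded.predict(df_v['Detail'])  # raw strings go straight in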
I am just starting with naive Bayes and getting stuck here; please tell me what to do.
This is my code:
import pandas as pd
import sklearn
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
df = pd.read_csv("C:\\Users\\dhana\\Downloads\\Mechanics\\train.csv")
X = []
Y = []
for i in df.Age:
    X.append(str(i))
for j in df.Fare:
    Y.append(j)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=17)
model = GaussianNB()
model.fit(X_train, y_train)
print(model)
y_expect = y_test
y_pred = model.predict(X_test)
print(accuracy_score(y_expect, y_pred))
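For reference, a minimal working GaussianNB setup looks like the sketch below. The Age/Fare columns suggest the Titanic dataset, so a discrete 'Survived' label column is assumed here (substitute your actual target). The key points: X must be a 2-D numeric array, not a list of strings, and y must hold discrete class labels, since Fare is continuous and therefore not a valid classification target.
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("C:\\Users\\dhana\\Downloads\\Mechanics\\train.csv")
df = df.dropna(subset=["Age", "Fare"])   # GaussianNB cannot handle NaN

X = df[["Age", "Fare"]].values           # 2-D numeric feature matrix
y = df["Survived"].values                # discrete labels (assumed column)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17)

model = GaussianNB().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))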
My goal is to do text classification using supervised ML algorithms. I'm at the stage where I need to encode my words so that the computer can understand them. I'm trying the vectorize method, but I get the error 'Series' object has no attribute 'lower'. Is there another way to prepare data for sentiment analysis, or am I on the right path and just need to figure out how to vectorize the words? My data is shown in the picture and my code is below:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn import preprocessing
#tostr = svietimas_data['text'].astype(str).tolist()
print(svietimas_data)
tfidf = TfidfVectorizer(max_features=3000)
X = svietimas_data['text']
y = svietimas_data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X = tfidf.fit_transform([X])
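The AttributeError comes from tfidf.fit_transform([X]): wrapping the Series in a list makes the vectorizer treat the whole Series as a single document and call .lower() on it. Pass the Series (or a list of strings) directly, and fit the vectorizer on the training fold only. A sketch of the usual order:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Split first, then fit the vectorizer on the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tfidf = TfidfVectorizer(max_features=3000)
X_train_vec = tfidf.fit_transform(X_train)  # the Series itself, not [X]
X_test_vec = tfidf.transform(X_test)        # transform only, no refit

clf = LinearSVC().fit(X_train_vec, y_train)
print(classification_report(y_test, clf.predict(X_test_vec)))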
I used logistic regression to create a model and later saved it using joblib. Then I tried loading that model and predicting labels for my test.csv. Whenever I try this I get an error saying "X has 1433445 features per sample; expecting 3797015".
This is my initial code:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
#reading data
train=pd.read_csv('train_yesindia.csv')
test=pd.read_csv('test_yesindia.csv')
train=train.iloc[:,1:]
test=test.iloc[:,1:]
test.info()
train.info()
test['label']='t'
test=test.fillna(' ')
train=train.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']
train['total']=train['title']+' '+train['author']+train['text']
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)
targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#split in samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf, targets, random_state=0)
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
print('Accuracy of logistic regression classifier on training set: {:.2f}'
      .format(logreg.score(X_train, y_train)))
print('Accuracy of logistic regression classifier on test set: {:.2f}'
      .format(logreg.score(X_test, y_test)))
targets = train['label'].values
logreg = LogisticRegression()
logreg.fit(counts, targets)
example_counts = count_vectorizer.transform(test['total'].values)
predictions = logreg.predict(example_counts)
pred=pd.DataFrame(predictions,columns=['label'])
pred['id']=test['id']
pred.groupby('label').count()
#dumping models
from joblib import dump, load
dump(logreg,'mypredmodel1.joblib')
Later I loaded the model in a different script:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from joblib import dump, load
test=pd.read_csv('test_yesindia.csv')
test=test.iloc[:,1:]
test['label']='t'
test=test.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']
#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check
#load_model
logreg = load('mypredmodel1.joblib')
example_counts = count_vectorizer.fit_transform(test['total'].values)
predictions = logreg.predict(example_counts)
When I run it, I get the error:
predictions = logreg.predict(example_counts)
Traceback (most recent call last):
File "<ipython-input-58-f28afd294d38>", line 1, in <module>
predictions = logreg.predict(example_counts)
File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 289, in predict
scores = self.decision_function(X)
File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 270, in decision_function
% (X.shape[1], n_features))
ValueError: X has 1433445 features per sample; expecting 3797015
Most probably, this is because you are re-fitting your transformers on the test set. This must not be done: you should save them fitted on your training set, and use the test (or any other future) set only for transforming data.
This is easier done with pipelines.
So, remove the following code from your first block:
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)
targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
and replace it with:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('counts', CountVectorizer(ngram_range=(1, 2))),
    ('tf-idf', TfidfTransformer(smooth_idf=False))
])
pipeline.fit(train['total'].values)
tfidf = pipeline.transform(train['total'].values)
targets = train['label'].values
test_tfidf = pipeline.transform(test['total'].values)
dump(pipeline, 'transform_predict.joblib')
Now, in your second code block, remove this part:
#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check
and replace it with:
pipeline = load('transform_predict.joblib')
test_tfidf = pipeline.transform(test['total'].values)
And you should be fine, provided that you predict on the test_tfidf variable, and not on example_counts, which is not transformed by TF-IDF:
predictions = logreg.predict(test_tfidf)
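An alternative worth noting: the classifier itself can be the final pipeline step, so a single dump/load covers vectorizing, TF-IDF weighting, and prediction. A sketch along the same lines:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from joblib import dump, load

full_pipeline = Pipeline([
    ('counts', CountVectorizer(ngram_range=(1, 2))),
    ('tf-idf', TfidfTransformer(smooth_idf=False)),
    ('logreg', LogisticRegression(C=1e5)),
])
full_pipeline.fit(train['total'].values, train['label'].values)
dump(full_pipeline, 'full_model.joblib')

# In the second script there are no transformers to rebuild:
model = load('full_model.joblib')
predictions = model.predict(test['total'].values)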
I have a JSON file with preprocessed data; the data has already been converted into vectors. How do I train on this data using the SVM classification method?
Vector is the name of one column; the other one is values, which holds the genres for the Vector column.
import pickle
from nltk.corpus import stopwords
import string
from nltk.stem import SnowballStemmer
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn import metrics
stopwords=set(stopwords.words("english"))
exclude = set(string.punctuation)
snow=SnowballStemmer("english")
tvec = pickle.load(open("dataPackage/tfidf.pickle", 'rb'))
data=pd.read_json("dataPackage/finalData.json",orient = 'split')
inputLen = len(data["Vector"].iloc[0])
X = list(data["Vector"])
y = list(data.drop(["Vector"],axis = 1).values)
np.shape(X)
np.shape(y)
X_train, X_test, y_train, y_test = train_test_split(np.array(X), np.array(y), test_size=0.3,random_state=109)
model = svm.SVC(kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
I'm currently trying to find the pronounceability of a list of words, using this SO question as a starting point.
The code is as follows:
import random
def scramble(s):
    return "".join(random.sample(s, len(s)))
words = [w.strip() for w in open('/usr/share/dict/words') if w == w.lower()]
scrambled = [scramble(w) for w in words]
X = words+scrambled
y = ['word']*len(words) + ['unpronounceable']*len(scrambled)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
text_clf = Pipeline([
('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
('clf', MultinomialNB())
])
text_clf = text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)
from sklearn import metrics
print(metrics.classification_report(y_test, predicted))
With a made-up word, this outputs:
>>> text_clf.predict("scaroly".split())
['word']
I've been checking the scikit-learn documentation, but I still can't seem to find out how to print the score of the input word.
Try sklearn.pipeline.Pipeline.predict_proba:
>>> text_clf.predict_proba(["scaroly"])
array([[ 5.87363027e-04, 9.99412637e-01]])
It returns the likelihood that a given input (in this case, "scaroly") belongs to each of the classes on which you trained the model. So there's a 99.94% chance "scaroly" is pronounceable.
Conversely, the Welsh word for "new" is likely unpronounceable:
>>> text_clf.predict_proba(["newydd"])
array([[ 0.99666533, 0.00333467]])
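To see which column belongs to which class, pair the probabilities with the pipeline's classes_ attribute (scikit-learn sorts string labels, so 'unpronounceable' comes before 'word' here):
>>> for label, p in zip(text_clf.classes_, text_clf.predict_proba(["newydd"])[0]):
...     print(label, round(p, 4))
unpronounceable 0.9967
word 0.0033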