This is my first time posting here. For the past couple of days I have been trying to teach myself scikit-learn, but I have run into an error that has been nagging me for quite some time.
My goal is simply to train an NB classifier clf so that I can feed it an arbitrary list of strings called new_doc and it will predict which class each string is likely to belong to.
This is what my program looks like:
#Importing stuff
import numpy as np
import pylab
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
from sklearn import metrics
#Opening the csv file
df = pd.read_csv('data.csv', sep=',')
#Randomising the rows in the file
df = df.reindex(np.random.permutation(df.index))
#Extracting features from text, define target y and data X
vect = CountVectorizer()
X = vect.fit_transform(df['Features'])
y = df['Target']
#Partitioning the data into test and training set
SPLIT_PERC = 0.75
split_size = int(len(y)*SPLIT_PERC)
X_train = X[:split_size]
X_test = X[split_size:]
y_train = y[:split_size]
y_test = y[split_size:]
#Training the model
clf = MultinomialNB()
clf.fit(X_train, y_train)
#Evaluating the results
print "Accuracy on training set:"
print clf.score(X_train, y_train)
print "Accuracy on testing set:"
print clf.score(X_test, y_test)
y_pred = clf.predict(X_test)
print "Classification Report:"
print metrics.classification_report(y_test, y_pred)
#Predicting new data
new_doc = ["MacDonalds", "Walmart", "Target", "Starbucks"]
trans_doc = vect.transform(new_doc) #extracting features
y_pred = clf.predict(trans_doc) #predicting
But when I run the program I get the following error on the last row:
y_pred = clf.predict(trans_doc)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 62, in predict
jll = self._joint_log_likelihood(X)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 441, in _joint_log_likelihood
return (safe_sparse_dot(X, self.feature_log_prob_.T)
File "/Library/Python/2.7/site-packages/sklearn/utils/extmath.py", line 175, in safe_sparse_dot
ret = a * b
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/base.py", line 334, in __mul__
raise ValueError('dimension mismatch')
ValueError: dimension mismatch
So apparently it has something to do with the dimensions of the term-document matrices.
When I check the dimensions of trans_doc, X_train and X_test I get:
>>> trans_doc.shape
(4, 4)
>>> X_train.shape
(145314, 28750)
>>> X_test.shape
(48439, 28750)
In order for y_pred = clf.predict(trans_doc) to work I need to (from what I understand) transform new_doc into a term-document matrix with the dimensions (4, 28750). But I don't know of any method within CountVectorizer that lets me do this.
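For reference, the transform method of the already-fitted vectorizer does exactly this. A minimal sketch, assuming vect is still the same instance that was fit on df['Features'] (re-fitting it, or creating a new CountVectorizer, on new_doc would rebuild the vocabulary from the 4 new strings, which would explain the (4, 4) shape above):
#Reusing the vectorizer fitted on the training corpus
trans_doc = vect.transform(new_doc) #maps onto the 28750-term vocabulary
print trans_doc.shape #(4, 28750)
y_pred = clf.predict(trans_doc)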
Related
I used logistic regression to create a model and later saved it using joblib. Then I tried loading that model and predicting labels for my test.csv. Whenever I try this I get an error saying "X has 1433445 features per sample; expecting 3797015".
This is my initial code:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
#reading data
train=pd.read_csv('train_yesindia.csv')
test=pd.read_csv('test_yesindia.csv')
train=train.iloc[:,1:]
test=test.iloc[:,1:]
test.info()
train.info()
test['label']='t'
test=test.fillna(' ')
train=train.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']
train['total']=train['title']+' '+train['author']+train['text']
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)
targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#split in samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf, targets, random_state=0)
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
print('Accuracy of Lasso classifier on training set: {:.2f}'
      .format(logreg.score(X_train, y_train)))
print('Accuracy of Lasso classifier on test set: {:.2f}'
      .format(logreg.score(X_test, y_test)))
targets = train['label'].values
logreg = LogisticRegression()
logreg.fit(counts, targets)
example_counts = count_vectorizer.transform(test['total'].values)
predictions = logreg.predict(example_counts)
pred=pd.DataFrame(predictions,columns=['label'])
pred['id']=test['id']
pred.groupby('label').count()
#dumping models
from joblib import dump, load
dump(logreg,'mypredmodel1.joblib')
Later I loaded the model in a different script:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from joblib import dump, load
test=pd.read_csv('test_yesindia.csv')
test=test.iloc[:,1:]
test['label']='t'
test=test.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']
#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check
#load_model
logreg = load('mypredmodel1.joblib')
example_counts = count_vectorizer.fit_transform(test['total'].values)
predictions = logreg.predict(example_counts)
When I run it, I get this error:
predictions = logreg.predict(example_counts)
Traceback (most recent call last):
File "<ipython-input-58-f28afd294d38>", line 1, in <module>
predictions = logreg.predict(example_counts)
File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 289, in predict
scores = self.decision_function(X)
File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 270, in decision_function
% (X.shape[1], n_features))
ValueError: X has 1433445 features per sample; expecting 3797015
Most probably this is because you are re-fitting your transformers on the test set. This must not be done: you should save them fitted on your training set, and use the test (or any other future) set only for transforming data.
This is easier done with pipelines.
So, remove the following code from your first block:
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)
targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
and replace it with:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('counts', CountVectorizer(ngram_range=(1, 2))),
    ('tf-idf', TfidfTransformer(smooth_idf=False))
])
pipeline.fit(train['total'].values)
tfidf = pipeline.transform(train['total'].values)
targets = train['label'].values
test_tfidf = pipeline.transform(test['total'].values)
dump(pipeline, 'transform_predict.joblib')
Now, in your second code block, remove this part:
#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check
and replace it with:
pipeline = load('transform_predict.joblib')
test_tfidf = pipeline.transform(test['total'].values)
And you should be fine, provided that you predict on the test_tfidf variable, and not on example_counts, which is not transformed by TF-IDF:
predictions = logreg.predict(test_tfidf)
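Going one step further (a sketch, not part of the original answer): you can make the classifier the final step of the same pipeline, so that a single dump/load carries both the fitted vectorizers and the fitted model. The file name full_model.joblib is just an example:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from joblib import dump, load

#vectorization and classification in one object
full_pipeline = Pipeline([
    ('counts', CountVectorizer(ngram_range=(1, 2))),
    ('tf-idf', TfidfTransformer(smooth_idf=False)),
    ('logreg', LogisticRegression())
])
full_pipeline.fit(train['total'].values, train['label'].values)
dump(full_pipeline, 'full_model.joblib') #one artifact to save

#in the second script:
full_pipeline = load('full_model.joblib')
predictions = full_pipeline.predict(test['total'].values)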
I am not able to encode data using LabelEncoder in scikit-learn.
dataset.csv has two columns, text and label.
I am trying to read the text from the dataset into one list and the labels into another, and add these lists to a dataframe, but it doesn't seem to work.
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble
import pandas, xgboost, numpy, string
data = open('dataset.csv').read()
labels = []
texts = []
for i, line in enumerate(data.split("\n")):
    content = line.split("\",")
    texts.append(content[0])
    labels.append(content[1:])
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'],trainDF['label'],test_size = 0.2,random_state = 0)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['text'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
#train_model is a helper defined elsewhere (not shown in the post)
accuracy = train_model(svm.SVC(), xtrain_tfidf, train_y, xvalid_tfidf)
print(accuracy)
Error:
Traceback (most recent call last):
File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 105, in _encode
res = _encode_python(values, uniques, encode)
File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 59, in _encode_python
uniques = sorted(set(values))
TypeError: unhashable type: 'list'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Classifier.py", line 21, in <module>
train_y = encoder.fit_transform(train_y)
File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 236, in fit_transform
self.classes_, y = _encode(y, encode=True)
File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 107, in _encode
raise TypeError("argument must be a string or number")
TypeError: argument must be a string or number
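The root cause is in the labels: content[1:] is itself a list, and LabelEncoder cannot hash or sort lists, hence the "unhashable type" error that then surfaces as "argument must be a string or number". The working version below converts each label to a string before it goes into the dataframe. A slightly cleaner variant (an assumption about the CSV layout, not taken from the original post) would be to append the single label field directly:
#hypothetical variant, assuming each line holds exactly one label field
labels.append(content[1]) #a plain str, which LabelEncoder can encode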
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble
import pandas, xgboost, numpy, string
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import SVC
data = open('dataset.csv').read()
labels = []
texts = []
for i, line in enumerate(data.split("\n")):
    content = line.split("\",")
    texts.append(str(content[0]))
    labels.append(str(content[1:]))
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'],trainDF['label'],test_size = 0.2,random_state = 0)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y) #reuse the encoder already fitted on train_y so the label mapping stays consistent
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SVC(kernel='rbf'))])
text_clf.fit(train_x, train_y)
predicted = text_clf.predict(valid_x)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(valid_y,predicted))
print(classification_report(valid_y,predicted))
print(accuracy_score(valid_y,predicted))
In an attempt to classify text I want to use SVM.
I want to classify the test data into one of the labels (health/adult).
The training and test data are text files.
I am using Python's scikit-learn library.
I encoded the text in UTF-8 when saving it to the txt files, which is why I am decoding it in the snippet.
Here's my attempted code:
String = String.decode('utf-8')
String2 = String2.decode('utf-8')
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                    token_pattern=r'\b\w+\b', min_df=1)
X_2 = bigram_vectorizer.fit_transform(String2).toarray()
X_1 = bigram_vectorizer.fit_transform(String).toarray()
X_train = np.array([X_1,X_2])
print type(X_train)
y = np.array([1, 2])
clf = SVC()
clf.fit(X_train, y)
#prepare test data
print(clf.predict(X))
This is the error I am getting
File "/Users/guru/python_projects/implement_LDA/lda/apply.py", line 107, in <module>
clf.fit(X_train, y)
File "/Users/guru/python_projects/implement_LDA/lda/lib/python2.7/site-packages/sklearn/svm/base.py", line 150, in fit
X = check_array(X, accept_sparse='csr', dtype=np.float64, order='C')
File "/Users/guru/python_projects/implement_LDA/lda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 373, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
When I searched for the error I found some results, but they didn't help. I think I am applying the SVM model in a logically wrong way here. Can someone give me a hint?
Ref: [1][2]
You have to combine your samples, vectorize them and then fit the classifier. Like this:
String = String.decode('utf-8')
String2 = String2.decode('utf-8')
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                    token_pattern=r'\b\w+\b', min_df=1)
X_train = bigram_vectorizer.fit_transform(np.array([String, String2]))
print type(X_train)
y = np.array([1, 2])
clf = SVC()
clf.fit(X_train, y)
#prepare test data
print(clf.predict(bigram_vectorizer.transform(np.array([X1, X2, ...]))))
But two samples is a very small amount of data, so your predictions will likely not be accurate.
EDITED:
Also, you can combine the transformation and classification into one step using a Pipeline.
from sklearn.pipeline import Pipeline
print type(X_train) # Should be a list of texts length 100 in your case
y_train = ... # Should be also a list of length 100
clf = Pipeline([
    ('transformer', CountVectorizer(...)),
    ('estimator', SVC()),
])
clf.fit(X_train, y_train)
X_test = np.array(["sometext"]) # array of test texts length = 1
print(clf.predict(X_test))
I am practicing for contests like Kaggle; I have been trying to use XGBoost and to get familiar with Python third-party libraries like pandas and NumPy.
I have been reviewing scripts from the Santander Customer Satisfaction competition and modifying different forked scripts in order to experiment with them.
Here is one modified script through which I am trying to implement XGBoost:
import pandas as pd
from sklearn import cross_validation as cv
import xgboost as xgb
df_train = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/train.csv")
df_test = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/test.csv")
df_train = df_train.replace(-999999,2)
id_test = df_test['ID']
y_train = df_train['TARGET'].values
X_train = df_train.drop(['ID','TARGET'], axis=1).values
X_test = df_test.drop(['ID'], axis=1).values
X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)
clf = xgb.XGBClassifier(objective='binary:logistic',
                        missing=9999999999,
                        max_depth=7,
                        n_estimators=200,
                        learning_rate=0.1,
                        nthread=4,
                        subsample=1.0,
                        colsample_bytree=0.5,
                        min_child_weight=3,
                        reg_alpha=0.01,
                        seed=7)
clf.fit(X_train, y_train, early_stopping_rounds=50, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)])
y_pred = clf.predict_proba(X_test)
print("Cross validating and checking the score...")
scores = cv.cross_val_score(clf, X_train, y_train)
'''
test = []
result = []
for each in id_test:
test.append(each)
for each in y_pred[:,1]:
result.append(each)
print len(test)
print len(result)
'''
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
#submission = pd.DataFrame({"ID":test, "TARGET":result})
submission.to_csv("submission_XGB_Pavan.csv", index=False)
Here is the stack trace:
Traceback (most recent call last):
File "/Users/pavan7vasan/Documents/workspace/Machine_Learning_Project/Kaggle/XG_Boost.py", line 45, in <module>
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 214, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 341, in _init_dict
dtype=dtype)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4798, in _arrays_to_mgr
index = extract_index(arrays)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4856, in extract_index
raise ValueError(msg)
ValueError: array length 30408 does not match index length 75818
I have tried solutions based on my searches, but I am not able to figure out what the mistake is. Where have I gone wrong? Please let me know.
The problem is that you are defining X_test twice, as @maxymoo mentioned. First you define it as
X_test = df_test.drop(['ID'], axis=1).values
And then you redefine it with:
X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)
This means X_test now has a size equal to 0.4*len(X_train). Then, after
y_pred = clf.predict_proba(X_test)
you get predictions for that part of X_train, and you try to build a DataFrame from them together with the initial id_test, which has the length of the original X_test.
You could use names like X_fit and X_eval in train_test_split instead of shadowing the original X_train and X_test. As it stands, your cross_val_score also runs on the reduced X_train, so the score it reports will not correspond to the public/private leaderboard score.
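A minimal sketch of that renaming (the names X_fit and X_eval are just suggestions):
#keep X_train/X_test as the full matrices; give the early-stopping split its own names
X_fit, X_eval, y_fit, y_eval = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)
clf.fit(X_fit, y_fit, early_stopping_rounds=50, eval_metric="auc", eval_set=[(X_fit, y_fit), (X_eval, y_eval)])
#X_test still refers to df_test, so the lengths match id_test again
y_pred = clf.predict_proba(X_test)
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})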
I'm trying to use GridSearchCV to optimize the parameters for the classifier svm.SVC (both from sklearn).
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
import numpy as np
X_train = np.array([[1,2],[3,4],[5,6],[2,3],[9,4],[4,5],[2,7],[1,0],[4,7],[2,9]])
Y_train = np.array([0,1,0,1,0,0,1,1,0,1])
X_test = np.array([[2,4],[5,3],[7,1],[2,4],[6,4],[2,7],[9,2],[7,5],[1,6],[0,3]])
Y_test = np.array([1,0,0,0,1,0,1,1,0,0])
parameters = {'kernel':['rbf'],'C':np.linspace(10,100,10)}
clf1 = GridSearchCV(SVC(), parameters, verbose = 10)
clf1.fit(X_train, Y_train)
cm = confusion_matrix(Y_test, clf1.predict(X_test))
bp = clf1.best_params_
The output shows GridSearchCV completing, but then it throws this error:
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 479, in runfile
execfile(filename, namespace)
File "I:\setup\Desktop\Stats\FinalProject.py", line 112, in <module>
clf1 = GridSearchCV(SVC(), parameters, verbose = 10)
TypeError: 'dict' object is not callable
When I run the code you posted, reduced to the first three samples:
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
import numpy as np
X_train = np.array([[1,2],[3,4],[5,6]])
Y_train = np.array([0,1,0])
X_test = np.array([[2,4],[5,3],[7,1]])
Y_test = np.array([1,0,0])
parameters = {'kernel':['rbf'],'C':np.linspace(10,100,10)}
clf1 = GridSearchCV(SVC(), parameters, verbose = 10)
clf1.fit(X_train, Y_train)
cm = confusion_matrix(Y_test, clf1.predict(X_test))
bp = clf1.best_params_
I'm getting this error:
File "C:\Anaconda\lib\site-packages\sklearn\svm\base.py", line 447, in _validate_targets
% len(cls))
ValueError: The number of classes has to be greater than one; got 1
Since the training data consists of only 3 samples, GridSearchCV breaks it into 3 folds (by the way, you can control this with the parameter called cv), e.g.:
fold1 = [1,2], label1 = 0
fold2 = [3,4], label2 = 1
fold3 = [5,6], label3 = 0
Now, in some iteration, it takes the first and the third folds to train on, and the second fold is used for validation.
Please note that these training folds contain only one type of label (label 0), hence the error it prints.
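You can reproduce the same failure directly; an illustrative sketch, fitting SVC on a fold whose labels are all identical:
import numpy as np
from sklearn.svm import SVC

X_fold = np.array([[1, 2], [5, 6]]) #folds 1 and 3 from the example above
y_fold = np.array([0, 0]) #both carry label 0
SVC().fit(X_fold, y_fold) #ValueError: The number of classes has to be greater than one; got 1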
If I create the data in this manner:
from sklearn import datasets, cross_validation
X, Y = datasets.make_classification(n_samples=1000, n_features=4,
                                    n_informative=2, n_redundant=2,
                                    n_classes=2)
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(
    X, Y, test_size=0.2)
It runs just fine.
I suspect you have some other problem, but as far as the code you posted goes, this is the error it produces.