Classification using SVM - python

I want to use an SVM to classify text.
I want to classify test data into one of two labels (health/adult).
The training and test data are text files.
I am using Python's scikit-learn library.
When I saved the text to .txt files I encoded it as UTF-8,
which is why I am decoding it in the snippet.
Here is my attempted code:
String = String.decode('utf-8')
String2 = String2.decode('utf-8')
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
token_pattern=r'\b\w+\b', min_df=1)
X_2 = bigram_vectorizer.fit_transform(String2).toarray()
X_1 = bigram_vectorizer.fit_transform(String).toarray()
X_train = np.array([X_1,X_2])
print type(X_train)
y = np.array([1, 2])
clf = SVC()
clf.fit(X_train, y)
#prepare test data
print(clf.predict(X))
This is the error I am getting
File "/Users/guru/python_projects/implement_LDA/lda/apply.py", line 107, in <module>
clf.fit(X_train, y)
File "/Users/guru/python_projects/implement_LDA/lda/lib/python2.7/site-packages/sklearn/svm/base.py", line 150, in fit
X = check_array(X, accept_sparse='csr', dtype=np.float64, order='C')
File "/Users/guru/python_projects/implement_LDA/lda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 373, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
When I searched for the error, I found some results, but they didn't help. I think I am applying the SVM model incorrectly. Can someone give me a hint?
Ref: [1][2]

You have to combine your samples, vectorize them and then fit the classifier. Like this:
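import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
String = String.decode('utf-8')
String2 = String2.decode('utf-8')
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                    token_pattern=r'\b\w+\b', min_df=1)
# Fit the vectorizer once on both samples so they share one vocabulary
X_train = bigram_vectorizer.fit_transform(np.array([String, String2]))
print(type(X_train))
y = np.array([1, 2])
clf = SVC()
clf.fit(X_train, y)
# prepare test data: reuse the fitted vectorizer with transform(), not fit_transform()
print(clf.predict(bigram_vectorizer.transform(np.array([X1, X2, ...]))))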
But two samples are very little data, so your predictions will likely not be accurate.
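To see why the separately vectorized samples cannot be stacked into one array, here is a minimal sketch with two made-up documents (the texts and the make_vectorizer helper are my own, not from the question). Each separate fit_transform call learns its own vocabulary, so the two matrices end up with different widths. A single call on both texts gives one matrix with a shared vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents standing in for String and String2 (assumption).
doc_health = "regular exercise and a balanced diet improve health"
doc_adult = "this site is for adult visitors only"


def make_vectorizer():
    # Hypothetical helper: same settings as the vectorizer in the question.
    return CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b", min_df=1)


# Fitting separately: each vectorizer learns only its own document's n-grams.
# Note that each document is wrapped in a list, one entry per sample.
X_1 = make_vectorizer().fit_transform([doc_health])
X_2 = make_vectorizer().fit_transform([doc_adult])
print(X_1.shape, X_2.shape)  # the column counts differ

# Fitting once on both documents: one row per sample, one shared vocabulary.
X_train = make_vectorizer().fit_transform([doc_health, doc_adult])
print(X_train.shape)
```

Rows of different lengths cannot form a regular 2-D float array, which is consistent with the "setting an array element with a sequence" error in the traceback.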
EDITED:
Also you can combine transformation and classification in one step using Pipeline.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
print(type(X_train)) # Should be a list of texts length 100 in your case
y_train = ... # Should be also a list of length 100
clf = Pipeline([
    ('transformer', CountVectorizer(...)),
    ('estimator', SVC()),
])
clf.fit(X_train, y_train)
X_test = np.array(["sometext"]) # array of test texts length = 1
print(clf.predict(X_test))
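A runnable version of the same idea might look like the sketch below. The six training texts and their labels are invented for illustration, and I use a linear kernel here, which is my own choice (the snippet above uses the default SVC()).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up training texts and labels (assumption, not the asker's data).
X_train = [
    "eat vegetables and exercise daily for good health",
    "a doctor can advise on diet and sleep",
    "vitamins and regular checkups support health",
    "explicit adult content for mature viewers only",
    "adult site with mature explicit videos",
    "mature viewers only explicit adult material",
]
y_train = ["health", "health", "health", "adult", "adult", "adult"]

# The pipeline vectorizes and classifies in one object, so the same fitted
# vocabulary is reused automatically at prediction time.
clf = Pipeline([
    ("transformer", CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b")),
    ("estimator", SVC(kernel="linear")),  # linear kernel is an assumption
])
clf.fit(X_train, y_train)

X_test = ["explicit adult videos"]  # raw text goes straight into predict()
prediction = clf.predict(X_test)
print(prediction)
```

Because the vectorizer lives inside the pipeline, there is no way to accidentally call fit_transform on the test texts.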

Related

ValueError: Unknown label type: 'continuous' when using clustering + classification models together

I created a clustering model to try and find different groups of customers based on annual income and spending score using the KMeans algorithm from Scikit-Learn. Using the cluster value that it returned for each customer, I tried to create a classification model using Support Vector Classification from sklearn.svm. When I tried to fit the new model onto the dataset, however, I got an error message:
File "/Users/user/Documents/Machine Learning A-Z Template Folder/Part 4 - Clustering/Section 24 - K-Means Clustering/cluster_and_prediction.py", line 28, in <module>
classifier.fit(x_train, y_train)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/svm/_base.py", line 149, in fit
y = self._validate_targets(y)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/svm/_base.py", line 525, in _validate_targets
check_classification_targets(y)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 169, in check_classification_targets
raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
My code is as follows
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
# Using relevant columns from dataset
dataset = pd.read_csv('Mall_Customers.csv')
x = dataset.iloc[:, 3:5].values
# Creating model with ideal amount of clusters
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(x)
predictions = kmeans.predict(x)
# Creating numpy array for feature scaling
predictions = np.array(predictions, dtype=int)
predictions = predictions[:, None]
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
x = sc_x.fit_transform(x)
predictions = sc_y.fit_transform(predictions)
# Splitting dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, predictions, test_size=.25)
# Creating Support Vector Classification model
from sklearn.svm import SVC
classifier = SVC(kernel='rbf')
classifier.fit(x_train, y_train)
(Attachments in the original post: the elbow plot used for clustering, a clustering visualization, and a .zip file containing the dataset, 'Mall_Customers.csv'.)
How can I fix this?
Since you want to treat this as a classification problem with 5 classes, you should not scale your labels. Scaling converts the integer cluster labels into continuous values, which a classification model cannot accept as targets, hence the error.
Also, unrelated to the issue, the correct methodology is to fit your scaler on your training data only, and then use this fitted scaler to transform your test data.
So, here are the necessary changes (after you have finished setting up your predictions variable):
# initial (unscaled) x used here:
x_train, x_test, y_train, y_test = train_test_split(x, predictions, test_size=.25)
sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)
x_test_scaled = sc.transform(x_test)
classifier = SVC(kernel='rbf')
classifier.fit(x_train_scaled, y_train) # no scaling for predictions or y_train
Also unrelated to the issue: you should scale your x data before running k-means, i.e. scale x first and then perform the clustering (I leave this as an exercise, as it has nothing to do with the error).
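Here is a self-contained sketch of the fix on synthetic data. The random x array is a stand-in for the income and spending-score columns, and the labels variable name is mine. It uses type_of_target from sklearn.utils.multiclass to show what scaling does to the labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils.multiclass import type_of_target

# Synthetic stand-in for the two feature columns (assumption).
rng = np.random.RandomState(0)
x = rng.uniform(0, 100, size=(200, 2))

# Integer cluster labels 0..4, used as classification targets.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(x)

# Scaling the labels turns them into non-integer floats.
scaled_labels = StandardScaler().fit_transform(labels[:, None]).ravel()
label_type = type_of_target(labels)
scaled_label_type = type_of_target(scaled_labels)
print(label_type, scaled_label_type)

# Split first, then fit the scaler on the training features only.
x_train, x_test, y_train, y_test = train_test_split(
    x, labels, test_size=0.25, random_state=0)
sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)
x_test_scaled = sc.transform(x_test)

classifier = SVC(kernel="rbf").fit(x_train_scaled, y_train)  # labels left unscaled
accuracy = classifier.score(x_test_scaled, y_test)
print(accuracy)
```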

When predicting on a single sentence, receive the error "Number of features of the model must match the input."

I'm a data science newbie and I'm trying to use TfidfVectorizer with RandomForestClassifier to predict a binary "yes/no" outcome on a string like so:
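import pandas as pd
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split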
df = pd.read_csv('~/Downloads/New_Query_2019_12_04.csv', usecols=['statement', 'result'])
df = df.head(100)
# remove non-values
df = df.dropna()
tfidfconverter = TfidfVectorizer(
max_features=1500,
min_df=5,
max_df=0.7,
stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(df['statement']).toarray()
y = df['result'].values
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=0)
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
All of this appears to work great, but I'm stuck on how to predict a phrase against the model. When I do something like:
good_string = preprocess_string('This is a good sentence')
tfidfconverter = TfidfVectorizer()
X = tfidfconverter.fit_transform([good_string]).toarray()
y_pred = classifier.predict(X)
I get the error "Number of features of the model must match the input."
I also tried fitting the string with my previous TfidfVectorizer:
tfidfconverter = TfidfVectorizer(
max_features=1500,
min_df=5,
max_df=0.7,
stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform([good_string]).toarray()
but I got the error "max_df corresponds to < documents than min_df". I think I'm just confused about how to make the features of the single string match the number of features in my model. Any help would be greatly appreciated.
The issue was that I was running it through a different vectorizer with the same constructor params:
tfidfconverter = TfidfVectorizer(
max_features=1500,
min_df=5,
max_df=0.7,
stop_words=stopwords.words('english'))
instead of using the same vectorizer I used when fitting the documents here:
X = tfidfconverter.fit_transform(df['statement']).toarray()
I also should not have been fitting the vectorizer on the data I was trying to predict; it should ONLY be transformed:
X = tfidfconverter.transform([good_string]).toarray()
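For anyone hitting the same error, here is a minimal end-to-end sketch. The statements and results below are made up, and I use scikit-learn's built-in stop_words="english" and a smaller forest only to keep the example dependency-free and quick.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training data standing in for df['statement'] and df['result'].
statements = [
    "this product works great and I love it",
    "excellent service and a good experience",
    "a good sentence with a positive tone",
    "really great quality, very happy",
    "terrible product, it broke immediately",
    "awful service and a bad experience",
    "a bad sentence with a negative tone",
    "really poor quality, very unhappy",
]
results = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]

# Fit the vectorizer ONCE, on the training documents.
tfidfconverter = TfidfVectorizer(stop_words="english")
X = tfidfconverter.fit_transform(statements)

classifier = RandomForestClassifier(n_estimators=50, random_state=0)
classifier.fit(X, results)

# For new text, reuse the SAME fitted vectorizer and call transform() only.
new_X = tfidfconverter.transform(["This is a good sentence"])
print(new_X.shape[1] == X.shape[1])  # same number of features as in training

y_pred = classifier.predict(new_X)
print(y_pred)
```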

Python sklearn polynomial preprocessing and dimensional problems

I am experimenting with fitting polynomial transformations of degree 1 to 3 to the original data, using 100 predicted values for each. I 1) reshaped the original data, 2) applied fit_transform to the training set and the prediction space (of data features), 3) obtained linear predictions on the prediction space, and 4) collected them into an array, using the following code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 100
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+n/6 + np.random.randn(n)/10
x = x.reshape(-1, 1)
y = y.reshape(-1, 1)
pred_data = np.linspace(0,10,100).reshape(-1,1)
results = []
for i in [1, 2, 3]:
    poly = PolynomialFeatures(degree=i)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
    x_poly1 = poly.fit_transform(x_train)
    pred_data = poly.fit_transform(pred_data)
    linreg1 = LinearRegression().fit(x_poly1, y_train)
    pred = linreg1.predict(pred_data)
    results.append(pred)
results
However, I did not get what I wanted: Python did not return an array of shape (3, 100) as I was expecting. Instead, I received this error message:
ValueError: shapes (100,10) and (4,1) not aligned: 10 (dim 1) != 4 (dim 0)
It seems to be a dimension problem resulting from either the "reshape" or the "fit_transform" step. I am confused, as this was supposed to be a straightforward test. Could anyone enlighten me on this? It would be much appreciated.
Thank you.
First, as I suggested in a comment, you should always call just transform() on test data (pred_data in your case).
But even if you do that, a different error occurs. The error is due to this line:
pred_data = poly.fit_transform(pred_data)
Here you are replacing the original pred_data with its transformed version. The first iteration of the loop works, but in the second and third iterations the input is invalid, because the transform expects the original pred_data of shape (100,1) defined in this line above the for loop:
pred_data = np.linspace(0,10,100).reshape(-1,1)
Change the name of the variable inside the loop to something else and everything works.
for i in [1, 2, 3]:
    poly = PolynomialFeatures(degree=i)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
    x_poly1 = poly.fit_transform(x_train)
    # Changed here
    pred_data_poly1 = poly.transform(pred_data)
    linreg1 = LinearRegression().fit(x_poly1, y_train)
    pred = linreg1.predict(pred_data_poly1)
    results.append(pred)
results
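Putting it together, a complete sketch could look like this. Two things here are my own choices rather than part of the question: I keep y one-dimensional so that each prediction has shape (100,) and the stacked result is (3, 100), and I move train_test_split out of the loop since it gives the same split every time with a fixed random_state.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
n = 100
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + n / 6 + np.random.randn(n) / 10
x = x.reshape(-1, 1)  # features must be 2-D; y stays 1-D (my choice)

# The prediction grid is never overwritten inside the loop.
pred_data = np.linspace(0, 10, 100).reshape(-1, 1)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

results = []
for degree in [1, 2, 3]:
    poly = PolynomialFeatures(degree=degree)
    x_poly = poly.fit_transform(x_train)   # fit on training data only
    pred_poly = poly.transform(pred_data)  # transform only, new variable name
    linreg = LinearRegression().fit(x_poly, y_train)
    results.append(linreg.predict(pred_poly))

results = np.array(results)
print(results.shape)  # (3, 100)
```

With y reshaped to (-1, 1) as in the question, each prediction would have shape (100, 1) instead, so the stacked array would be (3, 100, 1).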

ValueError: array length does not match index length

I am practicing for contests like Kaggle. I have been trying to use XGBoost and to get familiar with third-party Python libraries like pandas and numpy.
I have been reviewing scripts from a competition called Santander Customer Satisfaction Classification, and I have been modifying different forked scripts in order to experiment with them.
Here is one modified script in which I am trying to use XGBoost:
import pandas as pd
from sklearn import cross_validation as cv
import xgboost as xgb
df_train = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/train.csv")
df_test = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/test.csv")
df_train = df_train.replace(-999999,2)
id_test = df_test['ID']
y_train = df_train['TARGET'].values
X_train = df_train.drop(['ID','TARGET'], axis=1).values
X_test = df_test.drop(['ID'], axis=1).values
X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)
clf = xgb.XGBClassifier(objective='binary:logistic',
missing=9999999999,
max_depth = 7,
n_estimators=200,
learning_rate=0.1,
nthread=4,
subsample=1.0,
colsample_bytree=0.5,
min_child_weight = 3,
reg_alpha=0.01,
seed=7)
clf.fit(X_train, y_train, early_stopping_rounds=50, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)])
y_pred = clf.predict_proba(X_test)
print("Cross validating and checking the score...")
scores = cv.cross_val_score(clf, X_train, y_train)
'''
test = []
result = []
for each in id_test:
test.append(each)
for each in y_pred[:,1]:
result.append(each)
print len(test)
print len(result)
'''
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
#submission = pd.DataFrame({"ID":test, "TARGET":result})
submission.to_csv("submission_XGB_Pavan.csv", index=False)
Here is the stack trace:
Traceback (most recent call last):
File "/Users/pavan7vasan/Documents/workspace/Machine_Learning_Project/Kaggle/XG_Boost.py", line 45, in <module>
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 214, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 341, in _init_dict
dtype=dtype)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4798, in _arrays_to_mgr
index = extract_index(arrays)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4856, in extract_index
raise ValueError(msg)
ValueError: array length 30408 does not match index length 75818
I have tried several solutions that I found by searching, but I am not able to figure out what the mistake is. Where have I gone wrong? Please let me know.
The problem is that you are defining X_test twice, as @maxymoo mentioned. First you define it as
X_test = df_test.drop(['ID'], axis=1).values
And then you redefine that with:
X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)
This means X_test now has a length equal to 0.4*len(X_train). Then, after:
y_pred = clf.predict_proba(X_test)
you have predictions for that part of X_train, and you are trying to create a dataframe from them together with the initial id_test, which has the length of the original X_test.
You could name the outputs of train_test_split X_fit and X_eval so that they do not overwrite the initial X_train and X_test. Note that your cross-validation also runs on the overwritten X_train, so its scores may not be accurate compared with the public/private leaderboard score.
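A small sketch of the renaming, on random data. The arrays are synthetic, and LogisticRegression is only a stand-in for XGBClassifier so that the example runs with scikit-learn alone. The point is just that X_test and id_test keep their original lengths.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the competition data (assumption).
rng = np.random.RandomState(0)
X_train = rng.rand(100, 3)
y_train = rng.randint(0, 2, size=100)
X_test = rng.rand(40, 3)
id_test = pd.Series(np.arange(40), name="ID")

# Hold out part of the training data under NEW names,
# leaving X_train and X_test untouched.
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_train, y_train, random_state=1301, test_size=0.4)

# Stand-in classifier; the question uses xgb.XGBClassifier here.
clf = LogisticRegression().fit(X_fit, y_fit)

# Predict on the real test set, whose length matches id_test.
y_pred = clf.predict_proba(X_test)
submission = pd.DataFrame({"ID": id_test, "TARGET": y_pred[:, 1]})
print(submission.shape)  # (40, 2)
```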

scikit-learn ValueError: dimension mismatch

This is my first time posting here. For the past couple of days I have been trying to teach myself scikit-learn. But recently I have encountered an error that has been nagging me for quite some time.
My goal is simply to train a NB classifier clf so that I can feed it an arbitrary list of strings called new_doc and it will predict which class each string is likely to belong to.
This is what my program looks like:
#Importing stuff
import numpy as np
import pylab
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
from sklearn import metrics
#Opening the csv file
df = pd.read_csv('data.csv', sep=',')
#Randomising the rows in the file
df = df.reindex(np.random.permutation(df.index))
#Extracting features from text, define target y and data X
vect = CountVectorizer()
X = vect.fit_transform(df['Features'])
y = df['Target']
#Partitioning the data into test and training set
SPLIT_PERC = 0.75
split_size = int(len(y)*SPLIT_PERC)
X_train = X[:split_size]
X_test = X[split_size:]
y_train = y[:split_size]
y_test = y[split_size:]
#Training the model
clf = MultinomialNB()
clf.fit(X_train, y_train)
#Evaluating the results
print "Accuracy on training set:"
print clf.score(X_train, y_train)
print "Accuracy on testing set:"
print clf.score(X_test, y_test)
y_pred = clf.predict(X_test)
print "Classification Report:"
print metrics.classification_report(y_test, y_pred)
#Predicting new data
new_doc = ["MacDonalds", "Walmart", "Target", "Starbucks"]
trans_doc = vect.transform(new_doc) #extracting features
y_pred = clf.predict(trans_doc) #predicting
But when I run the program I get the following error on the last row:
y_pred = clf.predict(trans_doc)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 62, in predict
jll = self._joint_log_likelihood(X)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 441, in _joint_log_likelihood
return (safe_sparse_dot(X, self.feature_log_prob_.T)
File "/Library/Python/2.7/site-packages/sklearn/utils/extmath.py", line 175, in safe_sparse_dot
ret = a * b
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/base.py", line 334, in __mul__
raise ValueError('dimension mismatch')
ValueError: dimension mismatch
So apparently it has something to do with the dimensions of the term-document matrices.
When I check the dimensions of trans_doc, X_train and X_test I get:
>>> trans_doc.shape
(4, 4)
>>> X_train.shape
(145314, 28750)
>>> X_test.shape
(48439, 28750)
In order for y_pred = clf.predict(trans_doc) to work I need (as far as I understand it) to transform new_doc into a term-document matrix with the dimensions (4, 28750). But I don't know of any method within CountVectorizer that lets me do this.
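For reference, here is a minimal sketch on a made-up corpus (the training documents and labels are invented). When transform() is called on the same fitted CountVectorizer instance, the result has one column per term in the fitted vocabulary, so its width matches the training matrix. A (4, 4) shape would be consistent with a vectorizer that was refit on new_doc somewhere along the way, though that is only a guess about the session above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data standing in for df['Features'] and df['Target'].
train_docs = [
    "Walmart grocery store",
    "Starbucks coffee shop",
    "Target retail store",
    "MacDonalds burger restaurant",
]
train_labels = ["retail", "food", "retail", "food"]

vect = CountVectorizer()
X = vect.fit_transform(train_docs)  # learns the vocabulary here, once
clf = MultinomialNB().fit(X, train_labels)

new_doc = ["MacDonalds", "Walmart", "Target", "Starbucks"]
trans_doc = vect.transform(new_doc)  # transform only, same fitted instance
print(trans_doc.shape[1] == X.shape[1])  # widths match

y_pred = clf.predict(trans_doc)
print(y_pred)
```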
