I am practicing for contests such as Kaggle. I have been trying out XGBoost and getting familiar with third-party Python libraries such as pandas and NumPy.
I have been reviewing scripts from the Santander Customer Satisfaction classification competition and modifying forked scripts to experiment with them.
Here is one modified script in which I am trying to use XGBoost:
import pandas as pd
from sklearn import cross_validation as cv
import xgboost as xgb
df_train = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/train.csv")
df_test = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/test.csv")
df_train = df_train.replace(-999999,2)
id_test = df_test['ID']
y_train = df_train['TARGET'].values
X_train = df_train.drop(['ID','TARGET'], axis=1).values
X_test = df_test.drop(['ID'], axis=1).values
X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)
clf = xgb.XGBClassifier(objective='binary:logistic',
missing=9999999999,
max_depth = 7,
n_estimators=200,
learning_rate=0.1,
nthread=4,
subsample=1.0,
colsample_bytree=0.5,
min_child_weight = 3,
reg_alpha=0.01,
seed=7)
clf.fit(X_train, y_train, early_stopping_rounds=50, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)])
y_pred = clf.predict_proba(X_test)
print("Cross validating and checking the score...")
scores = cv.cross_val_score(clf, X_train, y_train)
'''
test = []
result = []
for each in id_test:
test.append(each)
for each in y_pred[:,1]:
result.append(each)
print len(test)
print len(result)
'''
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
#submission = pd.DataFrame({"ID":test, "TARGET":result})
submission.to_csv("submission_XGB_Pavan.csv", index=False)
Here is the stack trace:
Traceback (most recent call last):
File "/Users/pavan7vasan/Documents/workspace/Machine_Learning_Project/Kaggle/XG_Boost.py", line 45, in <module>
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 214, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 341, in _init_dict
dtype=dtype)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4798, in _arrays_to_mgr
index = extract_index(arrays)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4856, in extract_index
raise ValueError(msg)
ValueError: array length 30408 does not match index length 75818
I have tried the solutions I found by searching, but I cannot figure out what the mistake is. What am I doing wrong? Please let me know.
The problem is that you are defining X_test twice, as @maxymoo mentioned. First you define it as
X_test = df_test.drop(['ID'], axis=1).values
And then you redefine that with:
X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)
This means X_test now has 0.4 times the length of the original X_train (the 30408 rows in the error message). Then, after:
y_pred = clf.predict_proba(X_test)
you have predictions for that held-out part of X_train, and you are trying to build a DataFrame from them together with the original id_test, which has the length of the original X_test (75818 rows).
You could name the train_test_split outputs X_fit and X_eval so that they do not shadow the original X_train and X_test. Your cross_validation call also sees a different (smaller) X_train, which means you will not get the right answer, or your CV score will be an inaccurate guide to the public/private leaderboard score.
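The renaming suggested above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (the real train.csv and test.csv are not used), and LogisticRegression stands in for XGBClassifier so that the snippet runs without xgboost installed. The point is that X_test and id_test keep the same length, so the submission frame builds without a length mismatch.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1301)

# Synthetic stand-ins for train.csv and test.csv (hypothetical data)
df_train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
df_train.insert(0, "ID", np.arange(200))
df_train["TARGET"] = (df_train["f1"] + rng.normal(scale=0.5, size=200) > 0).astype(int)
df_test = pd.DataFrame(rng.normal(size=(75, 3)), columns=["f1", "f2", "f3"])
df_test.insert(0, "ID", np.arange(200, 275))

id_test = df_test["ID"]
y_train = df_train["TARGET"].values
X_train = df_train.drop(["ID", "TARGET"], axis=1).values
X_test = df_test.drop(["ID"], axis=1).values

# Split under new names so X_test still refers to the real test set
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_train, y_train, random_state=1301, test_size=0.4
)

# Stand-in classifier (assumption: any estimator with predict_proba works here)
clf = LogisticRegression()
clf.fit(X_fit, y_fit)
eval_score = clf.score(X_eval, y_eval)  # held-out score from the training data

# Predict on the real test set, which matches id_test row for row
y_pred = clf.predict_proba(X_test)
submission = pd.DataFrame({"ID": id_test, "TARGET": y_pred[:, 1]})
```

With these names, X_eval is only used to check the model, and the submission is built from predictions on the untouched X_test.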
X_train, X_test, y_train, y_test = train_test_split(features, df['Label'], test_size=0.2, random_state=111)
print (X_train.shape) # (540, 4196)
print (X_test.shape) # (136, 4196)
print (y_train.shape) # (540,)
print (y_test.shape) # (136,)
Fitting gives an error:
from sklearn.svm import SVC
classifier = SVC(random_state = 0)
classifier.fit(features,y_train)
y_pred = classifier.predict(features)
Error:
ValueError: Found input variables with inconsistent numbers of samples: [676, 540]
I tried this.
You want to call the fit function with your X_train, not with features. The error occurs because features and y_train don't have the same number of samples.
X_train, X_test, y_train, y_test = train_test_split(features, df['Label'], test_size=0.2, random_state=111)
print (X_train.shape)
print (X_test.shape)
print (y_train.shape)
print (y_test.shape)
from sklearn.svm import SVC
classifier = SVC(random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
You'll likely also want to call predict with X_test (or X_train) rather than features. You may want to read up on train/test splits and why they are used.
Why are you passing features along with y_train to .fit()? I think you are supposed to use X_train instead.
Instead of
classifier.fit(features, y_train)
Use:
classifier.fit(X_train, y_train)
You are passing two sets of data with different numbers of samples: because you did the split earlier, features has more samples than y_train.
Also, your predict line should be:
classifier.predict(X_test)
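To see the mismatch and the fix side by side, here is a minimal sketch. The features array and labels are synthetic stand-ins (assumptions, not the asker's data), sized to 676 rows so the split produces the same 540/136 shapes as in the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(111)

# Hypothetical stand-ins for `features` and df['Label']
features = rng.normal(size=(676, 5))
labels = (features[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=111
)

classifier = SVC(random_state=0)

# Fitting on the full feature matrix with the split labels fails:
# 676 rows of features versus 540 training labels
try:
    classifier.fit(features, y_train)
except ValueError as exc:
    mismatch_message = str(exc)

# Fitting and predicting on the matching halves of the split works
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
```

After the fix, y_pred has one prediction per row of X_test, so it can be compared directly with y_test.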
I created a clustering model to try and find different groups of customers based on annual income and spending score using the KMeans algorithm from Scikit-Learn. Using the cluster value that it returned for each customer, I tried to create a classification model using Support Vector Classification from sklearn.svm. When I tried to fit the new model onto the dataset, however, I got an error message:
File "/Users/user/Documents/Machine Learning A-Z Template Folder/Part 4 - Clustering/Section 24 - K-Means Clustering/cluster_and_prediction.py", line 28, in <module>
classifier.fit(x_train, y_train)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/svm/_base.py", line 149, in fit
y = self._validate_targets(y)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/svm/_base.py", line 525, in _validate_targets
check_classification_targets(y)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 169, in check_classification_targets
raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
My code is as follows:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
# Using relevant columns from dataset
dataset = pd.read_csv('Mall_Customers.csv')
x = dataset.iloc[:, 3:5].values
# Creating model with ideal amount of clusters
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(x)
predictions = kmeans.predict(x)
# Creating numpy array for feature scaling
predictions = np.array(predictions, dtype=int)
predictions = predictions[:, None]
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
x = sc_x.fit_transform(x)
predictions = sc_y.fit_transform(predictions)
# Splitting dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, predictions, test_size=.25)
# Creating Support Vector Classification model
from sklearn.svm import SVC
classifier = SVC(kernel='rbf')
classifier.fit(x_train, y_train)
Elbow Model Used for Clustering
Clustering Visualization
.zip file with the dataset (the dataset is called 'Mall_Customers.csv')
How can I fix this?
Since you want to treat this as a classification problem with 5 classes, you should not scale your labels. Scaling converts the integer cluster labels into continuous values, which a classification model cannot accept as targets, hence the error.
Also, unrelated to the issue: the correct methodology is to fit your scaler on your training data only, and then use this fitted scaler to transform your test data.
So, here are the necessary changes (after you have finished setting your predictions variable):
# initial (unscaled) x used here:
x_train, x_test, y_train, y_test = train_test_split(x, predictions, test_size=.25)
sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)
x_test_scaled = sc.transform(x_test)
classifier = SVC(kernel='rbf')
classifier.fit(x_train_scaled, y_train) # no scaling for predictions or y_train
Also unrelated to the issue: you should scale your x data before using k-means, i.e. scale x first and then perform the clustering (left as an exercise, as it has nothing to do with the error).
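To illustrate why scaling the labels triggers "Unknown label type: 'continuous'", here is a small sketch using scikit-learn's type_of_target helper, which classifiers use when validating targets. The cluster_labels array is a made-up example, not the asker's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.utils.multiclass import type_of_target

# Hypothetical cluster labels: 5 clusters, 4 customers each
cluster_labels = np.array([0, 1, 2, 3, 4] * 4)

# Integer labels are recognised as discrete classes
raw_type = type_of_target(cluster_labels)

# Standardising the labels turns them into non-integer floats,
# which are treated as a regression-style (continuous) target
scaled_labels = StandardScaler().fit_transform(cluster_labels[:, None]).ravel()
scaled_type = type_of_target(scaled_labels)

print(raw_type)
print(scaled_type)
```

The first value should be a class-type target and the second a continuous one, which is the type the SVC refuses.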
I am running a Gaussian process regression in Python. My data set has shape (10000, 5). But when I try to fit the model, I get an error:
AttributeError: 'list' object has no attribute 'n_dims'
How do I resolve this?
I initially thought the error was caused by the dependent variable having a different dimension from the independent variables. But even after giving them the same dimension, I cannot find the problem in the code. Any help will be much appreciated.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, Matern, RationalQuadratic,
ExpSineSquared, DotProduct,
ConstantKernel)
data_set = pd.read_excel(r'XXXXX', sheet = 'Worksheet', header = 0)
data_set.head()
test_set = data_set
y = test_set.iloc[:,4]
test_set.drop(test_set.columns[4], axis = 1, inplace = True)
X = test_set
x=StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
y_train = np.array(y_train)
y_test = np.array(y_test)
y_train = np.reshape(y_train, (7000,1))
y_test = np.reshape(y_test, (3000,1))
kernels = [1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-1, 10.0))]
gp = GaussianProcessRegressor(kernel=kernels)
gp.fit(X_train, y_train)
File "<ipython-input-23-5a576449fdb6>", line 1, in <module>
gp.fit(X_train, y_train)
File "C:\Program Files\Anaconda\lib\site-packages\sklearn\gaussian_process\gpr.py", line 203, in fit
if self.optimizer is not None and self.kernel_.n_dims > 0:
AttributeError: 'list' object has no attribute 'n_dims'
When initializing GaussianProcessRegressor(kernel=kernels), the argument passed as kernel has to be a kernel object. You are passing a list.
More information is in the scikit-learn documentation for GaussianProcessRegressor.
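A minimal sketch of the fix, assuming a small synthetic data set in place of the Excel sheet (the arrays and their sizes are made up for illustration): the kernel expression is passed directly, without wrapping it in a list.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)

# Hypothetical stand-in for the scaled features and the target column
X_train = rng.normal(size=(60, 4))
y_train = np.sin(X_train[:, 0]) + 0.5 * X_train[:, 1]

# A single kernel object, not a list containing one
kernel = 1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-1, 10.0))

gp = GaussianProcessRegressor(kernel=kernel, random_state=0)
gp.fit(X_train, y_train)

# Predict the mean and standard deviation for a few new points
X_new = rng.normal(size=(5, 4))
y_mean, y_std = gp.predict(X_new, return_std=True)
```

If the list was meant to hold several candidate kernels, one option is to loop over it and fit one GaussianProcessRegressor per kernel.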
I have a set of documents and a set of labels.
Right now, I am using train_test_split to split my dataset in a 90:10 ratio. However, I wish to use K-fold cross-validation.
train=[]
with open("/Users/rte/Documents/Documents.txt") as f:
    for line in f:
        train.append(line.strip().split())
labels=[]
with open("/Users/rte/Documents/Labels.txt") as t:
    for line in t:
        labels.append(line.strip().split())
X_train, X_test, Y_train, Y_test= train_test_split(train, labels, test_size=0.1, random_state=42)
When I try the method provided in the scikit-learn documentation:
kf=KFold(len(train), n_folds=3)
for train_index, test_index in kf:
    X_train, X_test = train[train_index],train[test_index]
    y_train, y_test = labels[train_index],labels[test_index]
I receive the following error:
X_train, X_test = train[train_index],train[test_index]
TypeError: only integer arrays with one element can be converted to an index
How can I perform 10-fold cross-validation on my documents and labels?
There are two ways to solve this error:
First way:
Cast your data to a numpy array:
import numpy as np
[...]
train = np.array(train)
labels = np.array(labels)
then it should work with your current code.
Second way:
Use a list comprehension to index the train and labels lists with the train_index and test_index lists:
for train_index, test_index in kf:
    X_train, X_test = [train[i] for i in train_index],[train[j] for j in test_index]
    y_train, y_test = [labels[i] for i in train_index],[labels[j] for j in test_index]
(For this solution, see also the related question on indexing a list with another list.)
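For the 10-fold case asked about, here is a sketch of the list-comprehension approach. It assumes a newer scikit-learn than the one in the question, where KFold lives in sklearn.model_selection, takes n_splits, and yields indices from kf.split(...). The documents and labels are made-up stand-ins for the two text files. Because the tokenised documents have different lengths, the lists are indexed directly rather than converted to NumPy arrays (recent NumPy versions refuse to build an array from ragged lists unless dtype=object is given).

```python
from sklearn.model_selection import KFold

# Hypothetical stand-ins for Documents.txt and Labels.txt:
# 20 tokenised documents of varying length, one label list per document
train = [["doc", str(i), "token"] * (1 + i % 3) for i in range(20)]
labels = [[str(i % 2)] for i in range(20)]

kf = KFold(n_splits=10, shuffle=True, random_state=42)

fold_sizes = []
for train_index, test_index in kf.split(train):
    # Index the plain Python lists with the index arrays
    X_train = [train[i] for i in train_index]
    X_test = [train[i] for i in test_index]
    y_train = [labels[i] for i in train_index]
    y_test = [labels[i] for i in test_index]
    fold_sizes.append((len(X_train), len(X_test)))
```

Each pass through the loop gives one train/test partition; the model would be fitted and scored inside the loop body.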
This is my first time posting here. For the past couple of days I have been trying to teach myself scikit-learn. But recently I have encountered an error that has been nagging me for quite some time.
My goal is simply to train a Naive Bayes classifier clf so that I can feed it an arbitrary list of strings called new_doc and it will predict which class each string is likely to belong to.
This is what my program looks like:
#Importing stuff
import numpy as np
import pylab
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
from sklearn import metrics
#Opening the csv file
df = pd.read_csv('data.csv', sep=',')
#Randomising the rows in the file
df = df.reindex(np.random.permutation(df.index))
#Extracting features from text, define target y and data X
vect = CountVectorizer()
X = vect.fit_transform(df['Features'])
y = df['Target']
#Partitioning the data into test and training set
SPLIT_PERC = 0.75
split_size = int(len(y)*SPLIT_PERC)
X_train = X[:split_size]
X_test = X[split_size:]
y_train = y[:split_size]
y_test = y[split_size:]
#Training the model
clf = MultinomialNB()
clf.fit(X_train, y_train)
#Evaluating the results
print "Accuracy on training set:"
print clf.score(X_train, y_train)
print "Accuracy on testing set:"
print clf.score(X_test, y_test)
y_pred = clf.predict(X_test)
print "Classification Report:"
print metrics.classification_report(y_test, y_pred)
#Predicting new data
new_doc = ["MacDonalds", "Walmart", "Target", "Starbucks"]
trans_doc = vect.transform(new_doc) #extracting features
y_pred = clf.predict(trans_doc) #predicting
But when I run the program I get the following error on the last line:
y_pred = clf.predict(trans_doc)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 62, in predict
jll = self._joint_log_likelihood(X)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 441, in _joint_log_likelihood
return (safe_sparse_dot(X, self.feature_log_prob_.T)
File "/Library/Python/2.7/site-packages/sklearn/utils/extmath.py", line 175, in safe_sparse_dot
ret = a * b
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/base.py", line 334, in __mul__
raise ValueError('dimension mismatch')
ValueError: dimension mismatch
So apparently it has something to do with the dimensions of the term-document matrices.
When I check the dimensions of trans_doc, X_train and X_test I get:
>>> trans_doc.shape
(4, 4)
>>> X_train.shape
(145314, 28750)
>>> X_test.shape
(48439, 28750)
In order for y_pred = clf.predict(trans_doc) to work I need to (as I understand it) transform new_doc into a term-document matrix with the dimensions (4, 28750). But I don't know of any methods within CountVectorizer that let me do this.
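As a point of reference, here is a small sketch of how CountVectorizer behaves, using a made-up four-document corpus in place of data.csv. Calling transform on an already fitted vectorizer yields as many columns as the fitted vocabulary, whereas fitting a vectorizer on new_doc itself yields only 4 columns. A (4, 4) shape would be consistent with the vectorizer having been re-fitted on new_doc somewhere, though that is only a guess, since the post does not show it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for df['Features'] and df['Target']
train_docs = ["Walmart groceries", "Starbucks coffee", "Target store", "MacDonalds burgers"]
train_targets = ["retail", "food", "retail", "food"]

vect = CountVectorizer()
X = vect.fit_transform(train_docs)
clf = MultinomialNB().fit(X, train_targets)

new_doc = ["MacDonalds", "Walmart", "Target", "Starbucks"]

# transform() reuses the fitted vocabulary, so the column count matches X
trans_doc = vect.transform(new_doc)
y_pred = clf.predict(trans_doc)

# Fitting a fresh vectorizer on new_doc builds a new 4-word vocabulary
refit_doc = CountVectorizer().fit_transform(new_doc)

print(X.shape, trans_doc.shape, refit_doc.shape)
```

Only the matrix produced by vect.transform has the column count the classifier was trained on, so it is the one that can be passed to clf.predict.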