I'm working on the basics of machine learning with the iris dataset. I think I understand the idea of splitting data and making predictions on new data; however, I'm having trouble understanding the results I get for the code below:
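from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
iris = load_iris()
X = iris.data
y = iris.target
len(X)  # result: 150
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(y_pred)
print(metrics.accuracy_score(y_test, y_pred))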
Result: [1 2 2 0 2 1 0 2 0 1 1 2 2 2 0 0 2 2 0 0 1 2 0 2 1 2 1 1 1 2 0 1 1 0 1 0 0
2]
0.95 (about 95% accuracy)
I only get back 38 results. From what I understand, the data is split into 50 50 chunks, meaning I should get back 50 results for the data not part of the train and test data. Why do I get only 38?
I feel like my biggest question regarding Machine Learning is actually using the model.
By default, train_test_split sets test_size to 0.25. With 150 samples that is 37.5, which is rounded up to 38, so 38 values are correct. The remaining 112 samples form the training set.
sklearn.model_selection.train_test_split
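To check the arithmetic, here is a minimal sketch that prints the split sizes for the default setting and for an explicit test_size. The value test_size=50 is only an illustrative choice, not something from the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Default split: test_size=0.25, so ceil(0.25 * 150) = 38 test rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
print(len(X_train), len(X_test))  # 112 38

# Explicit split: an integer test_size is taken as a number of rows.
# The value 50 is an illustrative assumption.
X_train50, X_test50, y_train50, y_test50 = train_test_split(
    X, y, test_size=50, random_state=5
)
print(len(X_train50), len(X_test50))  # 100 50
```

The number of predictions always equals the number of rows in X_test, so passing test_size explicitly is the way to get 50 of them.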
So I am trying to explain a basic SVM model using SHAP. The inputs to the SVM model however are standardized (I used StandardScaler().fit() and then transformed the datapoints using StandardScaler so that they can be used on the SVM model).
My question is this: when using SHAP, I need to give it a background distribution. Usually the input for this background distribution looks like this:
background_distribution = KMeans(n_clusters=10,random_state=0).fit(xtrain).cluster_centers_
However, I wanted to use my own custom background distribution, which contains selected data points. Does this mean the data points need to be standardized as well? That is, instead of looking like
[ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
they look like this
[ 0.67028006 -0.18887347 0.90860212 -0.41342579 0.26204266 0.55080012
-0.85479154 0.13743146 -0.70749448 -0.42919754 1.21628074 -0.71418983
-0.26726124 -0.52247913 -0.34755864 0.31234752 -0.23208655 -0.63565412
-0.40904178 0. 4.89897949 -0.23473314 0.64082627 -0.46852129
-0.26726124 -0.44542354 1.15657353 0.53795751]
For clarity: I am asking whether, after retrieving my points, I need to standardize the background data set. My original data points are scaled for use in the model, but my background distribution contains unscaled data points.
The model training looks like this:
ss = StandardScaler().fit(X)
xtrain = ss.transform(xtrain) #Changes values to make them ML compatible -not needed for trees
xtest = ss.transform(xtest)
support_vector_classifier = SVC(kernel='rbf')
support_vector_classifier.fit(xtrain,ytrain)
y_pred_svc = support_vector_classifier.predict(xtest)
Option A:
background_distribution = np.array([[1, 0, 1, 31, 24, 4817, 2, 3, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1]])
shap.KernelExplainer(support_vector_classifier.predict, background_distribution)
Option B:
background_distribution = np.array([[1, 0, 1, 31, 24, 4817, 2, 3, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1]])
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
shap.KernelExplainer(support_vector_classifier.predict, background_distribution)
Option B is close. Your background should be preprocessed in the same way as your training data.
This holds in any ML situation where you preprocess data: whether you split your data into train, test, and validation sets, or feed new data to a trained model for prediction, you always apply the same transformations to all parts of your data, sometimes manually, sometimes through a pipeline. SHAP is no exception to this principle.
However, keep the following in mind as well: your scaler should be fitted on the training data before it is applied to test or background data. You can't fit it on test, validation, or background data, because that would be like asking to be shown the future before predicting it ("data leakage", as it is called in ML).
This means you can't do:
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
Rather:
ss = StandardScaler().fit(X_train)
background_distribution = ss.transform(background_distribution)
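As a rough end-to-end sketch of that principle, the snippet below fits the scaler once on training data and reuses it for a hand-picked background set. The synthetic data, the variable names, and the choice of the first 10 rows as background are all assumptions made for illustration. The SHAP call is left as a comment so the snippet only depends on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data (hypothetical: 200 rows, 4 features).
X_train_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y_train = (X_train_raw[:, 0] > 50.0).astype(int)

# Fit the scaler on the training data only.
ss = StandardScaler().fit(X_train_raw)
X_train_scaled = ss.transform(X_train_raw)

svc = SVC(kernel="rbf").fit(X_train_scaled, y_train)

# Hand-picked, unscaled background points (here simply the first 10 training rows).
background_raw = X_train_raw[:10]

# Reuse the already-fitted scaler; do not refit it on the background.
background_scaled = ss.transform(background_raw)

# background_scaled is what would then be handed to SHAP, e.g.
# shap.KernelExplainer(svc.predict, background_scaled)
print(background_scaled.shape)
```

Because the same fitted scaler is used everywhere, the background points live in the same feature space the SVM was trained in.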
I am doing a binary classification. May I know how to extract the real indexes of the misclassified or classified instances of the training data frame while doing K fold cross-validation? I found no answer to this question here.
I got the values in folds as described here:
skf=StratifiedKFold(n_splits=10,random_state=111,shuffle=False)
cv_results = cross_val_score(model, X_train, y_train, cv=skf, scoring='roc_auc')
fold_pred = [pred[j] for i, j in skf.split(X_train,y_train)]
fold_pred
Is there any way to get the indices of the misclassified (or correctly classified) instances? The output should be a dataframe that contains only the misclassified (or correctly classified) instances from cross-validation.
Desired output:
Misclassified instances in the dataframe with their real indices.
    col1  col2  col3  col4  target
13     0     1     0     0       0
14     0     1     0     0       0
18     0     1     0     0       1
22     0     1     0     0       0
where input has 100 instances, 4 are misclassified (index number 13,14,18 and 22) while doing CV
With cross_val_predict you get the out-of-fold prediction for every row directly. It is then a matter of subsetting your data frame to the rows where the prediction differs from the true label, for example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.datasets import load_breast_cancer
import pandas as pd
data = load_breast_cancer()
df = pd.DataFrame(data.data[:,:5],columns=data.feature_names[:5])
df['label'] = data.target
rfc = RandomForestClassifier()
skf = StratifiedKFold(n_splits=10,random_state=111,shuffle=True)
pred = cross_val_predict(rfc, df.iloc[:,:5], df['label'], cv=skf)
df[df['label']!=pred]
mean radius mean texture ... mean smoothness label
3 11.42 20.38 ... 0.14250 0
5 12.45 15.70 ... 0.12780 0
9 12.46 24.04 ... 0.11860 0
22 15.34 14.26 ... 0.10730 0
31 11.84 18.70 ... 0.11090 0
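The boolean mask keeps whatever index the data frame already has, so the "real" indices come along for free. The variation below is a sketch under assumptions: the string index, the scaled logistic regression model, and the extra predicted column are illustrative choices, not part of the answer above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
df = pd.DataFrame(data.data[:, :5], columns=data.feature_names[:5])
df["label"] = data.target
# Hypothetical non-default index, to show that the original labels are kept.
df.index = [f"row_{i}" for i in range(len(df))]

model = make_pipeline(StandardScaler(), LogisticRegression())
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=111)

# Predictions come back in the same row order as the input data.
pred = cross_val_predict(model, df.iloc[:, :5], df["label"], cv=skf)

wrong = df["label"].to_numpy() != pred
misclassified = df[wrong].assign(predicted=pred[wrong])
correct = df[~wrong]

print(misclassified.index.tolist()[:5])
print(len(misclassified), len(correct))
```

Using ~wrong gives the correctly classified rows with the same index handling.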
I have a dataframe ready for modelling. It contains continuous variables and one-hot-encoded variables:
ID Limit Bill_Sep Bill_Aug Payment_Sep Payment_Aug Gender_M Gender_F Edu_Uni DEFAULT_PAYMT
1 10000 2000 350 1000 350 1 0 1 1
2 30000 3000 5000 500 500 0 1 0 0
3 20000 8000 10000 8000 5000 1 0 1 1
4 45000 450 250 450 250 0 1 0 1
5 60000 700 1000 700 1000 1 0 1 1
6 8000 300 5000 300 2000 1 0 1 0
7 30000 3000 10000 1000 5000 0 1 1 1
8 15000 1000 1250 500 1750 0 1 1 1
All the numerical variables are 'int64' while the one-hot-encoded variables are 'uint8'. The binary outcome variable is DEFAULT_PAYMT.
I have done the usual train/test split here, but I wanted to see whether I could apply StandardScaler only to the int64 variables (i.e., the variables that were not one-hot-encoded).
featurelist = df.drop(['ID','DEFAULT_PAYMT'],axis = 1)
X = featurelist
y = df['DEFAULT_PAYMT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
I am attempting the following code and it seems to work. However, I am not sure how to merge the categorical variables (which were not scaled) back into the X_scaled_tr and X_scaled_t arrays. I would appreciate any help, thank you!
featurelist = df.drop(['ID','DEFAULT_PAYMT'],axis = 1)
X = featurelist
y = df['DEFAULT_PAYMT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
sc = StandardScaler()
X_scaled_tr = X_train.select_dtypes(include=['int64'])
X_scaled_t = X_test.select_dtypes(include=['int64'])
X_scaled_tr = sc.fit_transform(X_scaled_tr)
X_scaled_t = sc.transform(X_scaled_t)
I managed to solve this with the following code, where StandardScaler is applied only to the continuous variables and NOT to the one-hot-encoded variables:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('X_train', StandardScaler(), ['LIMIT','BILL_SEP','BILL_AUG','PAYMENT_SEP','PAYMENT_AUG'])], remainder ='passthrough')
X_train_scaled = ct.fit_transform(X_train)
X_test_scaled = ct.transform(X_test)
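If listing the column names by hand is inconvenient, the columns can also be picked by dtype, which is closer to the select_dtypes attempt in the question. This is only a sketch: the small data frame below is made up to mirror the question's layout, and make_column_selector is used as an alternative to the explicit list.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler

# Small hypothetical frame mirroring the question's layout.
X_train = pd.DataFrame(
    {
        "Limit": [10000, 30000, 20000, 45000],
        "Bill_Sep": [2000, 3000, 8000, 450],
        "Gender_M": [1, 0, 1, 0],
        "Gender_F": [0, 1, 0, 1],
    }
).astype({"Limit": "int64", "Bill_Sep": "int64", "Gender_M": "uint8", "Gender_F": "uint8"})

# Scale only the int64 columns; pass the uint8 dummy columns through untouched.
ct = ColumnTransformer(
    [("scale", StandardScaler(), make_column_selector(dtype_include="int64"))],
    remainder="passthrough",
)
scaled_values = ct.fit_transform(X_train)

# The output puts the scaled columns first, then the passthrough columns.
scaled_cols = list(X_train.select_dtypes(include="int64").columns)
other_cols = [c for c in X_train.columns if c not in scaled_cols]
X_train_scaled = pd.DataFrame(
    scaled_values, columns=scaled_cols + other_cols, index=X_train.index
)
print(X_train_scaled)
```

The test set would then go through ct.transform only, so it is scaled with the training statistics.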
I am trying to teach myself how to grid-search the number of neurons in a basic multi-layer neural network. I am using GridSearchCV and KerasClassifier in Python, together with Keras. The code below works very well for other data sets, but I could not make it work for the Iris dataset and I cannot find out why; I must be missing something. The result I get is:
Best: 0.000000 using {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 5}
from pandas import read_csv
import numpy
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
from sklearn.model_selection import GridSearchCV
dataframe=read_csv("iris.csv", header=None)
dataset=dataframe.values
X=dataset[:,0:4].astype(float)
Y=dataset[:,4]
seed=7
numpy.random.seed(seed)
#encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
#one-hot encoding
dummy_y = np_utils.to_categorical(encoded_Y)
#scale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)
def create_model(n_neurons=1):
    # create model
    model = Sequential()
    model.add(Dense(n_neurons, input_dim=X.shape[1], activation='relu'))  # hidden layer
    model.add(Dense(3, activation='softmax'))  # output layer
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, initial_epoch=0, verbose=0)
# define the grid search parameters
neurons=[3, 5]
#this does 3-fold classification. One can change k.
param_grid = dict(n_neurons=neurons)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, dummy_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
For the purpose of illustration and computational efficiency, I search over only two values. I sincerely apologize for asking such a simple question. I am new to Python; I switched from R because I realized that the deep learning community uses Python.
Haha, this is probably the funniest thing I ever experienced on Stack Overflow :) Check:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=5)
and you should see different behavior. The reason your model gets a score of 0 is that you haven't shuffled your data: because Iris consists of three balanced classes stored in order, each of your three folds had a single class as its target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 (first fold ends here) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (second fold ends here)2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
Each model is therefore trained on two classes and evaluated on a third class it has never seen, so none of its predictions can be correct and the accuracy is 0.
Try shuffling your data beforehand; this should give the expected behavior.
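The fold composition can be inspected without training anything. The sketch below assumes plain KFold (which is what an unshuffled split over ordered rows amounts to) and uses made-up placeholder features; only the label order matters. The random_state value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

# Iris-like labels stored in class order: 50 zeros, 50 ones, 50 twos.
y = np.repeat([0, 1, 2], 50)
X = np.zeros((len(y), 1))  # placeholder features; only the indices matter here


def fold_classes(cv):
    """Return (classes seen in training, classes in the test fold) per fold."""
    return [
        (set(y[train_idx].tolist()), set(y[test_idx].tolist()))
        for train_idx, test_idx in cv.split(X)
    ]


unshuffled = fold_classes(KFold(n_splits=3))
shuffled = fold_classes(KFold(n_splits=3, shuffle=True, random_state=7))

for train_classes, test_classes in unshuffled:
    print("unshuffled: train", sorted(train_classes), "test", sorted(test_classes))
for train_classes, test_classes in shuffled:
    print("shuffled:   train", sorted(train_classes), "test", sorted(test_classes))
```

Without shuffling, every test fold holds exactly one class that is missing from its training folds. With shuffle=True, the classes are mixed across folds.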
I am new to machine learning, but I am a veteran programmer.
I have a lot of data about customer/agent interactions, with each interaction rated as positive or negative from the customer's perspective. I also have many features describing the customer (age, gender, previous spend, products purchased, etc.).
I want to train a model that can learn from the customer features which agent is best suited to deal with a given customer, i.e., the agent who would potentially produce the highest rating. The assumption is that an agent would be able to serve similar customers (customers with similar features) in the same way.
Assume the following pandas Dataframe: dataset
AgentID Score Cust_F1 Cust_F2 Cust_F3 ..... Cust_Fn
0 1 10 1 0 1 2
1 1 0 0 1 2 0
2 1 9 1 2 1 2
3 2 10 0 1 1 1
4 2 9 0 1 2 1
5 2 0 1 0 2 2
X = dataset.drop(columns=['AgentID', 'Score']).values
y = dataset['AgentID'].values
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X,y)
I want a way to train the model to reject (negatively train on) all samples with Score = 0. I cannot find a way to do this with sklearn. Of course, I could remove the samples with Score = 0 from the training data, but I believe they carry very valuable information that would help the algorithm classify properly.
I also looked at the sample_weight parameter and thought that putting negative values there might help, but the documentation doesn't mention this.
Can someone please help me?