How do I standardize only int64 columns after train-test split?

I have a dataframe ready for modelling; it contains continuous variables and one-hot-encoded variables:
ID  Limit  Bill_Sep  Bill_Aug  Payment_Sep  Payment_Aug  Gender_M  Gender_F  Edu_Uni  DEFAULT_PAYMT
1   10000  2000      350       1000         350          1         0         1        1
2   30000  3000      5000      500          500          0         1         0        0
3   20000  8000      10000     8000         5000         1         0         1        1
4   45000  450       250       450          250          0         1         0        1
5   60000  700       1000      700          1000         1         0         1        1
6   8000   300       5000      300          2000         1         0         1        0
7   30000  3000      10000     1000         5000         0         1         1        1
8   15000  1000      1250      500          1750         0         1         1        1
All the numerical variables are int64, while the one-hot-encoded variables are uint8. The binary outcome variable is DEFAULT_PAYMT.
I have gone down the usual route of a train-test split here, but I wanted to see if I could apply StandardScaler only to the int64 variables (i.e., the variables that were not one-hot-encoded):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features: everything except the ID and the target
X = df.drop(['ID', 'DEFAULT_PAYMT'], axis=1)
y = df['DEFAULT_PAYMT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
I am attempting the following code, which seems to work; however, I am not sure how to merge the categorical variables (which were not scaled) back into the X_scaled_tr and X_scaled_t arrays. I'd appreciate any form of help, thank you!
X = df.drop(['ID', 'DEFAULT_PAYMT'], axis=1)
y = df['DEFAULT_PAYMT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

sc = StandardScaler()
# Scale only the int64 (continuous) columns; the uint8 dummies are left out
X_scaled_tr = sc.fit_transform(X_train.select_dtypes(include=['int64']))
X_scaled_t = sc.transform(X_test.select_dtypes(include=['int64']))
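One minimal way to merge the unscaled dummies back, assuming X_train and X_test are still pandas DataFrames, is to write the scaled values into copies of the original frames instead of keeping separate arrays (X_train_final and X_test_final are just illustrative names):
num_cols = X_train.select_dtypes(include=['int64']).columns
X_train_final = X_train.copy()
X_test_final = X_test.copy()
# Overwrite only the continuous columns; the uint8 dummies stay untouched
X_train_final[num_cols] = sc.fit_transform(X_train[num_cols])
X_test_final[num_cols] = sc.transform(X_test[num_cols])
This keeps the one-hot columns and the original column order intact.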

I managed to address the question with the following code, where StandardScaler is applied only to the continuous variables and NOT to the one-hot-encoded variables:
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [('scale', StandardScaler(), ['Limit', 'Bill_Sep', 'Bill_Aug', 'Payment_Sep', 'Payment_Aug'])],
    remainder='passthrough')  # the one-hot columns pass through unscaled
X_train_scaled = ct.fit_transform(X_train)
X_test_scaled = ct.transform(X_test)
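Note that ColumnTransformer returns a NumPy array with the transformed columns first and the passthrough columns appended after them, so the column order changes. A minimal sketch to get a labelled DataFrame back, assuming the column names shown in the example above:
import pandas as pd

num_cols = ['Limit', 'Bill_Sep', 'Bill_Aug', 'Payment_Sep', 'Payment_Aug']
# remainder='passthrough' appends the remaining columns in their original order
passthrough_cols = [c for c in X_train.columns if c not in num_cols]
X_train_scaled = pd.DataFrame(ct.fit_transform(X_train),
                              columns=num_cols + passthrough_cols,
                              index=X_train.index)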

How to get indices of instances during cross-validation

I am doing binary classification. How can I extract the real indices of the misclassified (or correctly classified) instances of the training dataframe while doing K-fold cross-validation? I found no answer to this question here.
I got the values in folds as described here:
skf = StratifiedKFold(n_splits=10, shuffle=False)  # random_state has no effect when shuffle=False
cv_results = cross_val_score(model, X_train, y_train, cv=skf, scoring='roc_auc')
# pred would come from cross_val_predict (see the answer below)
fold_pred = [pred[j] for i, j in skf.split(X_train, y_train)]
fold_pred
Is there any method to get the indices of the misclassified (or correctly classified) instances, so that the output is a dataframe containing only those instances from cross-validation?
Desired output:
The misclassified instances of the dataframe, with their real indices.
    col1  col2  col3  col4  target
13     0     1     0     0       0
14     0     1     0     0       0
18     0     1     0     0       1
22     0     1     0     0       0
where the input has 100 instances and 4 of them are misclassified (index numbers 13, 14, 18 and 22) during CV.
From cross_val_predict you already have the predictions; it's then a matter of subsetting your dataframe where the predictions differ from your true label, for example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data[:, :5], columns=data.feature_names[:5])
df['label'] = data.target

rfc = RandomForestClassifier()
skf = StratifiedKFold(n_splits=10, random_state=111, shuffle=True)
pred = cross_val_predict(rfc, df.iloc[:, :5], df['label'], cv=skf)

# Rows where the out-of-fold prediction disagrees with the true label,
# with their original indices preserved
df[df['label'] != pred]
    mean radius  mean texture  ...  mean smoothness  label
3         11.42         20.38  ...          0.14250      0
5         12.45         15.70  ...          0.12780      0
9         12.46         24.04  ...          0.11860      0
22        15.34         14.26  ...          0.10730      0
31        11.84         18.70  ...          0.11090      0
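If you only need the index labels rather than the full rows, a one-line variant (same df and pred as above):
# Index labels of the misclassified instances
misclassified_idx = df.index[df['label'] != pred]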

ValueError: Found array with 1 feature(s) while a minimum of 2 is required

I applied RFECV with a Random Forest, among other ML models, to a churn dataset.
While logistic regression, SVC, gradient boosting and decision trees all worked well with RFECV,
the Random Forest RFECV decided that only one feature was important and eliminated all the other features.
Code:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split

# Create feature matrix X and target variable y
y = churn_dataset['Churn']
X = churn_dataset.drop(['Churn'], axis=1)

# RFECV
rfecv = RFECV(RandomForestClassifier(), cv=10, scoring='f1')
rfecv = rfecv.fit(X, y)
print('Optimal number of features :', rfecv.n_features_)
print('Best features :', X.columns[rfecv.support_])
print(np.where(rfecv.support_ == False)[0])

# Drop the eliminated columns
X.drop(X.columns[np.where(rfecv.support_ == False)[0]], axis=1, inplace=True)
rfecv.estimator_.feature_importances_

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=8)

# Fit model (this re-runs RFECV on the reduced X_train)
random_forest = rfecv.fit(X_train, y_train)
The following error is returned:
ValueError: Found array with 1 feature(s) (shape=(1622, 1)) while a minimum of 2 is required.
Output of churn_dataset.head()
name gender churn last_purchase_in_days order_count purchase_quantity ...
2 ACKLE 0 1 0.317604 -0.453647 2 -0.368683 1.173058 0.291104 0 ... 0 0 0 0 0 0 1 0 0 1.00
4 ADNAN 1 1 0.250814 -0.453647 2 -0.368683 -0.431351 -0.418023 0 ... 0 0 0 0 0 0 1 0 0 1.00
5 ADY 0 1 -1.143415 -0.453647 2 -0.368683 0.190767 -0.117630 0 ... 0 0 0 0 0 0 1 0 0 1.00
6 ANDY 0 1 0.768432 -0.453647 2 -0.368683 -0.752232 -0.559952 0 ... 0 0 0 0 0 0 1 0 0 1.00
7 AGIE 0 0 -1.669381 3.048875 8 -0.368683 0.520653 4.251851 0 ... 0 0 0 0 0 0 1 0 0 0.16
churn_dataset.columns
Index(['name', 'gender', 'Churn', 'last_purchase_in_days',
'order_count', 'quantity', 'disc_code',
'AOV', 'sales',
'channel_Paid Advertising','channel_Recurring Payment',
'channel_Search Engine',
'channel_Social Media', 'country_Denmark', 'country_France',
'country_Germany', 'country_Italy',
'country_Luxembourg', 'country_Others', 'country_Switzerland',
'country_United Kingdom', 'city_Düsseldorf', 'city_Frankfurt',
'city_Hamburg', 'city_Hannover', 'city_Köln', 'city_Leipzig',
'city_Munich', 'city_Others', 'city_Stuttgart', 'city_Wien',
'Probability_of_Churn'],
dtype='object')
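The error comes from the last line of the code: after the unsupported columns are dropped, X_train has a single feature, and rfecv.fit(X_train, y_train) re-runs RFECV on that one-column matrix, which RFECV rejects because it needs at least 2 features to eliminate anything. A hedged sketch of two possible ways around it (assuming scikit-learn >= 0.20, where min_features_to_select was added):
# Option 1: force RFECV to keep at least two features
rfecv = RFECV(RandomForestClassifier(), cv=10, scoring='f1',
              min_features_to_select=2)

# Option 2: fit a plain classifier on the already-reduced data
# instead of re-running the feature selection on it
random_forest = RandomForestClassifier().fit(X_train, y_train)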

How to interpret iris dataset results?

I'm working on the basics of machine learning with the iris dataset. I think I understand the idea of splitting data and making predictions on new data; however, I'm having trouble understanding the results I get for the code below:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X = iris.data
y = iris.target
len(X)  # result: 150

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(y_pred)
print(metrics.accuracy_score(y_test, y_pred))
Result: [1 2 2 0 2 1 0 2 0 1 1 2 2 2 0 0 2 2 0 0 1 2 0 2 1 2 1 1 1 2 0 1 1 0 1 0 0 2]
Accuracy: 0.95 (i.e., 95%)
I only get back 38 results. From what I understand, the data is split into chunks of 50, meaning I should get back 50 results for the data that was not part of the training set. Why do I only get 38?
I feel like my biggest question about machine learning is how to actually use the model.
By default, train_test_split sets test_size to 0.25. With 150 samples, that gives 150 × 0.25 = 37.5, which is rounded up, so 38 values are correct.
sklearn.model_selection.train_test_split
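As a side note, test_size accepts either a fraction or an absolute number of samples; a small sketch of both, reusing X and y from the question:
# Default: 25% test split; 150 * 0.25 = 37.5 is rounded up to 38 samples
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
len(X_test)  # 38

# An explicit 50/50 split instead
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=5)
len(X_test)  # 75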

Anomaly detection with mean shift sklearn

I'm trying to use mean shift from sklearn to find anomalies and outliers in a dataset. The data are signal values from sensors. I have a training dataset to train the algorithm, and a test dataset containing dummy anomalies. My problem is that when I use the predict method on the test dataset, mean shift doesn't label the anomalies with -1 (or any other value that would indicate an anomaly or outlier) but instead assigns them to a valid cluster.
Here is the code:
import pandas as pd
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn import preprocessing

if __name__ == '__main__':
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    scaler = preprocessing.StandardScaler().fit(train)
    train_scaled = scaler.transform(train)

    # Estimate the bandwidth on the same (scaled) data the model is fit on
    bandwidth = estimate_bandwidth(train_scaled, n_jobs=-1)
    ms = MeanShift(bandwidth=bandwidth, n_jobs=-1)
    ms.fit(train_scaled)

    prediction = ms.predict(scaler.transform(test))
    test["cluster"] = prediction
    print(np.unique(prediction))
Here are the first 5 rows of the training dataset:
   A    B  C
0  300  0  200
1  300  0  200
2  300  0  350
3  300  1  350
4  400  1  350
Here are the first 5 rows of the test dataset, with a dummy anomaly:
   A          B          C
0  300        0          200
1  300        0          200
2  300        0          350
3  100000000  100000000  100000000
4  400        1          350
What can I do to detect the anomalies in the test dataset?
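MeanShift.predict always assigns each sample to the nearest cluster centre, so it will never return -1 (cluster_all=False only affects the labels produced during fit). A minimal sketch of a distance-based workaround, under the assumption that any test point farther than the estimated bandwidth from every centre counts as an anomaly:
from sklearn.metrics import pairwise_distances

# Re-scale the original test columns (before the "cluster" column was added)
test_scaled = scaler.transform(test[train.columns])
# Distance from each test point to its nearest cluster centre
nearest = pairwise_distances(test_scaled, ms.cluster_centers_).min(axis=1)

# Assumption: points farther than the bandwidth from every centre are anomalies
test.loc[nearest > bandwidth, "cluster"] = -1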

GridSearchCV for number of neurons

I am trying to teach myself how to grid-search the number of neurons in a basic multi-layer neural network, using GridSearchCV and KerasClassifier in Python, as well as Keras. The code below works very well for other datasets, but I could not make it work for the Iris dataset, and I cannot find out why; I am missing something here. The result I get is:
Best: 0.000000 using {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 5}
from pandas import read_csv
import numpy
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
from sklearn.model_selection import GridSearchCV

dataframe = read_csv("iris.csv", header=None)
dataset = dataframe.values
X = dataset[:, 0:4].astype(float)
Y = dataset[:, 4]

seed = 7
numpy.random.seed(seed)

# Encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# One-hot encoding
dummy_y = np_utils.to_categorical(encoded_Y)

# Scale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

def create_model(n_neurons=1):
    # Create model
    model = Sequential()
    model.add(Dense(n_neurons, input_dim=X.shape[1], activation='relu'))  # hidden layer
    model.add(Dense(3, activation='softmax'))  # output layer
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, initial_epoch=0, verbose=0)

# Define the grid search parameters
neurons = [3, 5]
param_grid = dict(n_neurons=neurons)
# This does 3-fold cross-validation by default; one can change k via cv
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, dummy_y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
For the purpose of illustration and computational efficiency I search over only two values. I sincerely apologize for asking such a simple question. I am new to Python, having switched from R because I realized that the deep learning community is using Python.
Haha, this is probably the funniest thing I ever experienced on Stack Overflow :) Check:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=5)
and you should see different behavior. The reason your model gets a score of exactly zero is that you haven't shuffled your data: Iris consists of three balanced classes stored in order, so with the default 3-fold split each test fold contained a single class as its target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0   (first fold ends here)
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1   (second fold ends here)
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Such folds are impossible for any model to get right, because the class in the test fold never appears in the training folds - that's why you got a score of zero.
Try shuffling your data beforehand - that should produce the expected behavior.
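A minimal sketch of that fix, reusing model, param_grid, X and dummy_y from the question (plain KFold rather than StratifiedKFold, because dummy_y is one-hot encoded):
from sklearn.model_selection import KFold, GridSearchCV

# Shuffled folds, so every training split contains all three classes
cv = KFold(n_splits=5, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv)
grid_result = grid.fit(X, dummy_y)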
