Why does sklearn KMeans change my dataset after fitting? - python

I am using KMeans from sklearn to cluster College.csv, but when I fit the KMeans model, my dataset changes afterwards! Before using KMeans, I standardize the numerical variables with StandardScaler and use OneHotEncoder to dummy-encode the categorical variable "Private".
My code is:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_vars = data.columns[1:]
scaler = StandardScaler()
data[num_vars] = scaler.fit_transform(data[num_vars])

ohe = OneHotEncoder()
data["Private"] = ohe.fit_transform(data.Private.values.reshape(-1, 1)).toarray()

km = KMeans(n_clusters=6)
km.fit(data)
The dataset before using the KMeans:
The dataset after using the KMeans:

It appears that when you run km.fit(data), the .fit method modifies data in place by inserting a column that is the opposite of your one-hot encoded column. Also confusing is that the "Terminal" column disappears.
For now, you can use this workaround that copies your data:
data1 = data.copy()
km = KMeans(n_clusters = 6, n_init = 'auto')
km.fit(data1)
Edit: When you run km.fit, the first method that runs is km._validate_data, a validation step that modifies the dataframe you pass in.
For example, if I add the following to the end of your code:
km._validate_data(
    data,
    accept_sparse="csr",
    dtype=[np.float64, np.float32],
    order="C",
    accept_large_sparse=False,
)
Running this changes your data, but I don't know exactly why this is happening. It may have to do with something about the data itself.

There's a subtle bug in the posted code. Let's demonstrate it:
new_df = pd.DataFrame({"Private": ["Yes", "Yes", "No"]})
OneHotEncoder returns something like this:
new_data = np.array(
    [[0, 1],
     [0, 1],
     [1, 0]])
What happens if we assign new_df["Private"] with our new (3, 2) array?
>>> new_df["Private"] = new_data
>>> print(new_df)
   Private
0        0
1        0
2        1
Wait, where'd the other column go?
Uh oh, it's still in there ...
... but it's invisible until we look at the actual values:
>>> print(new_df.values)
[[0 1]
 [0 1]
 [1 0]]
As @Derek hinted in his answer, KMeans has to validate the data, which usually converts pandas dataframes into the underlying arrays. When this happens, all your "columns" get shifted to the right by one, because there was an invisible extra column created by the OneHotEncoder.
Is there a better way?
Yep, use a pipeline!
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("ohe", OrdinalEncoder(categories=[["No", "Yes"]]), ["Private"]),
        ],
        remainder=StandardScaler(),
    ),
    KMeans(n_clusters=6),
)
out = pipe.fit(df)
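Once the pipeline is fitted, the cluster assignments live on the KMeans step. A small sketch of pulling them out (assuming the pipeline above; make_pipeline names each step after the lowercased class name, i.e. "kmeans"):
# The fitted KMeans step holds the cluster assignments
labels = pipe.named_steps["kmeans"].labels_

# Attach the assignments back onto the original dataframe
df_with_clusters = df.assign(cluster=labels)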

The data is the same but shifted over by one column: the values now sitting under "Apps" did not exist before, and everything else is pushed to the right.
It has something to do with your line
data[num_vars] = scaler.fit_transform(data[num_vars])
which is effectively a nested double selection, data[data.columns[1:]].
Basically, you can follow a method like this
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data[:, 1:] = sc.fit_transform(data[:, 1:])
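Note that positional slicing like data[:, 1:] only works if data is a NumPy array. If data is a pandas DataFrame, a rough equivalent (just a sketch, not the original poster's code) would use .iloc:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# .iloc selects columns by position, mirroring data[:, 1:] on an array
data.iloc[:, 1:] = sc.fit_transform(data.iloc[:, 1:])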

Related

Splitting dataset into train and test in python

I have a dataset whose label is 0 or 1.
I want to divide my data into test and train sets. For this, I first used the
train_test_split method from sklearn,
but I want to select the test data in such a way that 10% of it is from class 0 and 90% is from class 1.
How can I do this?
Refer to the official documentation sklearn.model_selection.train_test_split.
You want to specify the response variable with the stratify parameter when performing the split.
Stratification preserves the ratio of the class variable when the split is performed.
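A minimal sketch of a stratified split (assuming a Label column as in the question; the test_size value here is just an example):
from sklearn.model_selection import train_test_split

X = df.drop("Label", axis=1)
y = df["Label"]

# stratify=y keeps the class proportions of y in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)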
You could write your own function to do this. One way is to select the rows by class and shuffle them after taking them.
Split your dataset into class 0 and class 1, then split each as you want:
import pandas as pd
from sklearn.model_selection import train_test_split

df_0 = df.loc[df["class"] == 0]
df_1 = df.loc[df["class"] == 1]

train_0, test_0 = train_test_split(df_0, test_size=0.1)
train_1, test_1 = train_test_split(df_1, test_size=0.9)

test = pd.concat((test_0, test_1),
                 axis=0,
                 ignore_index=True).sample(frac=1)  # sample(frac=1) shuffles the rows
train = pd.concat((train_0, train_1),
                  axis=0,
                  ignore_index=True).sample(frac=1)

scikit preprocessing across entire dataframe

I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so I wanted to preprocess it first, using either standardization or normalization.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column, but I'm struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
#define scaler
scaler = MinMaxScaler() #or StandardScaler
X = df.loc[1].T
X = X.to_numpy()
#transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read from the linked documentation, you need to provide inside a tuple:
a name for the step
the chosen transformer (e.g. StandardScaler) or a Pipeline as well
a list of columns to which the selected transformations apply
Code example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']

# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])

# fit and transform the input dataframe
ct.fit_transform(df)

array([[ 0.86955718,  0.93177476,  0.96056682,  0.46493449],
       [ 0.53109031,  0.45544147,  0.41859563,  0.92419906],
       [-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer outputs a numpy array with the transformed values, fitted on the input dataset df. Even though there are no column names anymore, the array columns are still ordered the same way as in the input dataframe, so it's easy to convert the array back to a pandas dataframe if you need to.
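As a quick sketch (reusing the ct and columns defined above), the transformed array can be wrapped back into a dataframe like this:
# Wrap the transformed array in a dataframe with the original column names and index
scaled_df = pd.DataFrame(ct.fit_transform(df), columns=columns, index=df.index)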
In addition to @RicS's answer, note that what the scikit-learn function returns is a numpy array, not a dataframe anymore. Also, the Company column is not included. You may consider this to convert the results back into a dataframe:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company", axis=1))  # scale all columns except Company
y = pd.concat([df["Company"], pd.DataFrame(x, columns=df.columns[1:])], axis=1)  # add Company back to the scaled results
y.head()

How to get the predict probability in Machine Learning

I have an ML model trained and dumped so I can use it anywhere, and I need to get not just the score and predicted values, but also the predict_proba values.
I can get them, but the problem is that I was expecting the probabilities to be between 0 and 1, yet I get something else, like the output below.
array([[1.00000000e+00, 2.46920929e-12],
       [1.00000000e+00, 9.89834607e-11],
       [9.99993281e-01, 6.71853451e-06],
       ...,
       [1.22327143e-01, 8.77672857e-01],
       [9.99999653e-01, 3.47049875e-07],
       [1.00000000e+00, 3.79462343e-10]])
And this is the python code I am using.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# dataframe = pd.read_csv("hr_dataset.csv")
dataframe = pd.read_csv("formodel.csv")
dataframe.head(2)
# separate input and target variables
inputs = dataframe.drop('PerformanceRating', axis='columns')
target = dataframe['PerformanceRating']
MaritalStatus_ = LabelEncoder()
JobRole_ = LabelEncoder()
Gender_ = LabelEncoder()
EducationField_ = LabelEncoder()
Department_ = LabelEncoder()
BusinessTravel_ = LabelEncoder()
Attrition_ = LabelEncoder()
OverTime_ = LabelEncoder()
Over18_ = LabelEncoder()
inputs['MaritalStatus_'] = MaritalStatus_.fit_transform(inputs['MaritalStatus'])
inputs['JobRole_'] = JobRole_.fit_transform(inputs['JobRole'])
inputs['Gender_'] = Gender_.fit_transform(inputs['Gender'])
inputs['EducationField_'] = EducationField_.fit_transform(inputs['EducationField'])
inputs['Department_'] = Department_.fit_transform(inputs['Department'])
inputs['BusinessTravel_'] = BusinessTravel_.fit_transform(inputs['BusinessTravel'])
inputs['Attrition_'] = Attrition_.fit_transform(inputs['Attrition'])
inputs['OverTime_'] = OverTime_.fit_transform(inputs['OverTime'])
inputs['Over18_'] = Over18_.fit_transform(inputs['Over18'])
inputs.drop(['MaritalStatus', 'JobRole', 'Attrition' , 'OverTime' , 'EmployeeCount', 'EmployeeNumber',
'Gender', 'EducationField', 'Department', 'BusinessTravel', 'Over18'], axis='columns', inplace=True)
inputsNew = inputs
inputs.head(2)
# inputs = scaled_df
X_train, X_testt, y_train, y_testt = train_test_split(inputs, target, test_size=0.2)
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_testt, y_testt)
print(result)
loaded_model.predict_proba(inputs)  # this produces the result above; shown again below
Output produced by loaded_model.predict_proba(inputs):
array([[1.00000000e+00, 2.46920929e-12],
       [1.00000000e+00, 9.89834607e-11],
       [9.99993281e-01, 6.71853451e-06],
       ...,
       [1.22327143e-01, 8.77672857e-01],
       [9.99999653e-01, 3.47049875e-07],
       [1.00000000e+00, 3.79462343e-10]])
How can I convert these values or get an output like a percentage? (e.g. 12%, 50%, 96%)
loaded_model.predict_proba(inputs) outputs the probability of the 1st class as well as the 2nd class (since you have 2 classes). That's why you see 2 outputs for each row of the data; the probabilities for each row sum to 1.
If you just care about the probability of the second class, you can use the line below to fetch it.
loaded_model.predict_proba(inputs)[:,1]
I am not sure if this is what you are looking for, apologies if I misunderstood your question.
To convert the probability array from decimal to percentage, you can write (loaded_model.predict_proba(inputs)) * 100.
EDIT: The format that is outputted by loaded_model.predict_proba(inputs) is just scientific notation, i.e. all of those numbers are between 0 and 1, but many of them are extremely small probabilities and so are represented in scientific notation.
The reason that you see such small probabilities is that loaded_model.predict_proba(inputs)[:,0] (the first column of the probability array) represents the probabilities of the data belonging to one class, and loaded_model.predict_proba(inputs)[:,1] represents the probabilities of the data belonging to the other class.
In other words, this means that each row of the probability array should add up to 1.
I hope this helps!
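If the scientific notation itself is what bothers you, here is a small sketch (assuming proba holds the predict_proba output from the question) that prints plain decimals and percentages:
import numpy as np

proba = loaded_model.predict_proba(inputs)

# Show plain decimals instead of scientific notation
np.set_printoptions(suppress=True)
print(proba)

# Probability of the second class, as a percentage
print(proba[:, 1] * 100)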
Try this if the probability is spread across the classes and you only want the probability of the predicted class, as a percentage.
pred_prob = []
pred_labels = loaded_model.predict_proba(inputs)
for each_pred in pred_labels:
    pred_prob.append(max(each_pred))  # probability of the predicted class
probability_list = [item * 100 for item in pred_prob]  # convert to percentages
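The same result can also be computed without the loop, e.g. with NumPy (a sketch assuming pred_labels is the predict_proba output from above):
# Probability of the predicted (highest-probability) class, as a percentage
probability_list = pred_labels.max(axis=1) * 100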

Sklearn inverse_transform: return only one column when fit on many

Is there a way to inverse_transform one column with sklearn, when the initial transformer was fit on the whole data set? Below is an example of what I am after.
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
# Setting up a dummy pipeline
pipes = []
pipes.append(('scaler', MinMaxScaler()))
transformation_pipeline = Pipeline(pipes)
# Random data.
df = pd.DataFrame(
    {'data1': [1, 2, 3, 1, 2, 3],
     'data2': [1, 1, 1, 2, 2, 2],
     'Y': [1, 4, 1, 2, 2, 2]}
)
# Fitting the transformation pipeline
test = transformation_pipeline.fit_transform(df)
# Pulling the scaler function from the pipeline.
scaler = transformation_pipeline.named_steps['scaler']
# This is what I thought may work.
predicted_transformed = scaler.inverse_transform(test['Y'])
# The output would look something like this,
# essentially overlooking that the scaler was fit on 3 variables
# and inverting just the last one, or any column I need.
predicted_transformed = [1, 4, 1, 2, 2, 2]
I need to be able to fit the whole dataset as part of a data prep process. But then I am importing the scaler later into another instance with sklearn.externals.joblib. In this new instance the predicted values are the only thing that exists, so I need to extract just the inverse scaling for the Y column to get back the originals.
I am aware that I could fit one transformer for the X variables and another for the Y variable; however, I would like to avoid this, as it would add to the complexity of moving the scalers around and maintaining both of them in future projects.
A bit late but I think this code does what you are looking for:
# - scaler   = the scaler object (it needs an inverse_transform method)
# - data     = the data to be inverse transformed as a Series, ndarray, ...
#              (a 1d object you can assign to a df column)
# - colName  = the name of the column to which the data belongs
# - colNames = all column names of the data on which scaler was fit
#              (necessary because scaler will only accept a df of the same shape as the one it was fit on)
def invTransform(scaler, data, colName, colNames):
    dummy = pd.DataFrame(np.zeros((len(data), len(colNames))), columns=colNames)
    dummy[colName] = data
    dummy = pd.DataFrame(scaler.inverse_transform(dummy), columns=colNames)
    return dummy[colName].values
Note that you need to provide enough information to use the inverse_transform method of the scaler object behind the scenes.
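Applied to the example in the question, a usage sketch (reusing the scaler, test and df defined there) might look like this:
# Column index 2 of the transformed array corresponds to 'Y'
original_y = invTransform(scaler, test[:, 2], 'Y', df.columns)
# array([1., 4., 1., 2., 2., 2.])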
Similar problem here. I have a multidimensional time series as input (a quantity plus 'exogenous' variables), and one dimension (the quantity) as output. I am unable to invert the scaling to compare the forecast to the original test set, since the scaler expects a multidimensional input.
One solution I can think of is using separate scalers for the quantity and the exogenous columns.
Another solution is to give the scaler enough 'junk' columns just to fill out the dimensions of the array to be unscaled, then only look at the first column of the output.
Then, once I forecast, I can invert the scaling on the forecast to get values which I can compare to the test set.
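A rough sketch of the separate-scalers idea (the column names here are hypothetical, not from the original post):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame: 'qty' is the target quantity, the rest are exogenous variables
data = pd.DataFrame({"qty": [10.0, 12.0, 9.0], "exog1": [1.0, 2.0, 3.0]})

y_scaler = MinMaxScaler()
x_scaler = MinMaxScaler()

y_scaled = y_scaler.fit_transform(data[["qty"]])
X_scaled = x_scaler.fit_transform(data.drop(columns="qty"))

# ... fit and forecast on the scaled data, then invert only the target
forecast_original = y_scaler.inverse_transform(y_scaled)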
Improving on what Willem said, this will work with less input:
def invTransform(scaler, data):
    dummy = pd.DataFrame(np.zeros((len(data), scaler.n_features_in_)))
    dummy[0] = data
    dummy = pd.DataFrame(scaler.inverse_transform(dummy), columns=dummy.columns)
    return dummy[0].values

Reshaping data for Linear regression [duplicate]

On a fresh installation of Anaconda under Ubuntu... I am preprocessing my data in various ways prior to a classification task using Scikit-Learn.
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler().fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
This all works fine, but if I have a new sample (temp below) that I want to classify (and thus want to preprocess in the same way):
temp = [1,2,3,4,5,5,6,....................,7]
temp = scaler.transform(temp)
Then I get a deprecation warning...
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
So the question is how should I be rescaling a single sample like this?
I suppose an alternative (not very good one) would be...
temp = [temp, temp]
temp = scaler.transform(temp)
temp = temp[0]
But I'm sure there are better ways.
Just listen to what the warning is telling you:
Reshape your data using X.reshape(-1, 1) if your data has a single feature/column,
and X.reshape(1, -1) if it contains a single sample.
For your example (if you have more than one feature/column), type:
temp = temp.reshape(1,-1)
For one feature/column:
temp = temp.reshape(-1,1)
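For the single sample in the question, a minimal sketch (assuming temp is first converted to a NumPy array, and that it has the same number of features the scaler was fit on) would be:
import numpy as np

temp = np.array([1, 2, 3, 4, 5, 5, 6, 7])  # illustrative values only
temp = temp.reshape(1, -1)                  # shape (1, n_features): one sample
temp = scaler.transform(temp)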
Well, it actually looks like the warning is telling you what to do.
As part of sklearn.pipeline stages' uniform interfaces, as a rule of thumb:
when you see X, it should be an np.array with two dimensions
when you see y, it should be an np.array with a single dimension.
Here, therefore, you should consider the following:
temp = [1,2,3,4,5,5,6,....................,7]
# This makes it into a 2d array
temp = np.array(temp).reshape((len(temp), 1))
temp = scaler.transform(temp)
This might help
temp = ([[1,2,3,4,5,6,.....,7]])
.values.reshape(-1,1) will be accepted without alerts/warnings
.reshape(-1,1) will be accepted, but with a deprecation warning
I faced the same issue and got the same deprecation warning. I was using a numpy array of shape (23, 276) when I got the message. I tried reshaping it as per the warning and got nowhere. Then I selected each row from the numpy array (as I was iterating over it anyway) and assigned it to a list variable. It worked then without any warning.
array = []
array.append(temp[0])
Then you can use the python list object (here 'array') as an input to sk-learn functions. Not the most efficient solution, but worked for me.
You can always reshape like:
temp = np.array([1, 2, 3, 4, 5, 5, 6, 7])
temp = temp.reshape(len(temp), 1)
because the major issue is when your temp.shape is
(8,)
and you need
(8, 1)
-1 stands for an unknown dimension of the array. Read more about the "newshape" parameter in the numpy.reshape documentation:
# X is a 1-d ndarray
# If we want a COLUMN vector (many/one/unknown samples, 1 feature)
X = X.reshape(-1, 1)
# If we want a ROW vector (one sample, many/one/unknown features)
X = X.reshape(1, -1)
A complete example with LinearRegression, reshaping the feature column to 2-d:
from sklearn.linear_model import LinearRegression

X = df[['x_1']]
X_n = X.values.reshape(-1, 1)  # 2-d array of shape (n_samples, 1)
y = df['target']
y_n = y.values

model = LinearRegression()
model.fit(X_n, y_n)
y_pred = pd.Series(model.predict(X_n), index=X.index)
