Splitting dataset to train and test in python - python

I have a dataset whose label is 0 or 1.
I want to divide my data into test and train sets. For this, I first used the
train_test_split method from sklearn,
but I want to select the test data in such a way that 10% of it is from class 0 and 90% is from class 1.
How can I do this?

Refer to the official documentation for sklearn.model_selection.train_test_split.
You want to pass the response variable to the stratify parameter when performing the split.
Stratification preserves the ratio of the classes in both the train and test sets when the split is performed.
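As a minimal sketch of the stratify parameter (the toy arrays X and y below are made up for illustration): stratification reproduces the class ratio of the full dataset in both splits, so it does not by itself let you choose an arbitrary ratio such as 10/90 for the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 rows of class 0 and 20 rows of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Both splits keep the original 80/20 class ratio.
print(np.bincount(y_train), np.bincount(y_test))
```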

You could write your own function to do this.
One way is to select rows by index and shuffle them after taking them.

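Split your dataset into class 0 and class 1, then split each part as you want:
import pandas as pd
from sklearn.model_selection import train_test_split

# "class" is a Python keyword, so use bracket indexing instead of df.class
df_0 = df.loc[df["class"] == 0]
df_1 = df.loc[df["class"] == 1]
# train_test_split returns (train, test); the fraction must be passed as test_size
train_0, test_0 = train_test_split(df_0, test_size=0.1)
train_1, test_1 = train_test_split(df_1, test_size=0.9)
# stack the rows (axis=0, the default); sample(frac=1) shuffles the df
test = pd.concat((test_0, test_1), ignore_index=True).sample(frac=1)
train = pd.concat((train_0, train_1), ignore_index=True).sample(frac=1)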
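Taking 10% of class 0 and 90% of class 1 yields a 10/90 test set only when both classes have the same number of rows. A sketch of an alternative that fixes the test-set composition directly is shown here; the function name split_with_test_ratio, its parameters, and the toy DataFrame are hypothetical, and the approach assumes the DataFrame has a unique index.

```python
import pandas as pd


def split_with_test_ratio(df, label_col, n_test, class_fractions, seed=0):
    """Return (train, test); test holds n_test rows split per class_fractions."""
    parts = []
    for label, fraction in class_fractions.items():
        n_rows = round(n_test * fraction)
        pool = df[df[label_col] == label]
        # Raises ValueError if the class has fewer than n_rows rows.
        parts.append(pool.sample(n=n_rows, random_state=seed))
    test = pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle
    # Everything not drawn for the test set becomes training data;
    # this relies on df having a unique index.
    train = df.drop(index=test.index).sample(frac=1, random_state=seed)
    return train, test


# Toy data (hypothetical): 100 rows per class.
df = pd.DataFrame({"x": range(200), "label": [0] * 100 + [1] * 100})
train, test = split_with_test_ratio(
    df, "label", n_test=50, class_fractions={0: 0.1, 1: 0.9}
)
print(test["label"].value_counts())
```

Working with row counts rather than per-class fractions keeps the test-set ratio independent of how imbalanced the full dataset is.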

Related

Why does sklearn KMeans change my dataset after fitting?

I am using KMeans from sklearn to cluster College.csv. But when I fit the KMeans model, my dataset changes afterwards! Before using KMeans, I standardize the numerical variables with StandardScaler and I use OneHotEncoder to dummy-encode the categorical variable "Private".
My code is:
num_vars = data.columns[1:]
scaler = StandardScaler()
data[num_vars] = scaler.fit_transform(data[num_vars])
ohe = OneHotEncoder()
data["Private"] = ohe.fit_transform(data.Private.values.reshape(-1,1)).toarray()
km = KMeans(n_clusters = 6)
km.fit(data)
The dataset before using the KMeans:
The dataset after using the KMeans:
It appears that when you run km.fit(data), the .fit method modifies data in place by inserting a column that is the opposite of your one-hot encoded column. Also confusing is the fact that the "Terminal" column disappears.
For now, you can use this workaround that copies your data:
data1 = data.copy()
km = KMeans(n_clusters = 6, n_init = 'auto')
km.fit(data1)
Edit: When you run km.fit, the first method that is run is km._validate_data, which is a validation step that modifies the dataframe that you pass.
For example, if I add the following to the end of your code:
km._validate_data(
data,
accept_sparse="csr",
dtype=[np.float64, np.float32],
order="C",
accept_large_sparse=False,
)
Running this changes your data, but I don't know exactly why this is happening. It may have to do with something about the data itself.
There's a subtle bug in the posted code. Let's demonstrate it:
new_df = pd.DataFrame({"Private": ["Yes", "Yes", "No"]})
OneHotEncoder returns something like this:
new_data = np.array(
[[0, 1],
[0, 1],
[1, 0]])
What happens if we assign new_df["Private"] with our new (3, 2) array?
>>> new_df["Private"] = new_data
>>> print(new_df)
Private
0 0
1 0
2 1
Wait, where'd the other column go?
Uh oh, it's still in there ...
... but it's invisible until we look at the actual values:
>>> print(new_df.values)
[[0 1]
[0 1]
[1 0]]
As @Derek hinted in his answer, KMeans has to validate the data, which usually converts from pandas dataframes into the underlying arrays. When this happens, all your "columns" get shifted to the right by one because there was an invisible column created by the OneHotEncoder.
Is there a better way?
Yep, use a pipeline!
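from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            # a single 0/1 column is enough for a binary category
            ("ohe", OrdinalEncoder(categories=[["No", "Yes"]]), ["Private"]),
        ],
        # every other column is standardized
        remainder=StandardScaler(),
    ),
    KMeans(n_clusters=6),
)
out = pipe.fit(df)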
The data is the same but shifted over by one column. The Apps column never existed before and everything is shifted to the right.
It has something to do with your line
data[num_vars] = scaler.fit_transform(data[num_vars])
which is actually indexing with a nested array, data[data.columns[1:]].
Basically, you can follow a method like this
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data.iloc[:, 1:] = sc.fit_transform(data.iloc[:, 1:])

Python prepare training data set with evenly distributed response variable

I am working on a small machine learning project.
The dataset which I use has 56 input parameters and one categorical response variable (0/1). My problem is that the response values are not evenly distributed. I want to prepare the training data set so that the responses are evenly distributed. How can this be done?
This is what the data looks like:
-> the training dataset should have the same number of 1s and 0s in the response.
Thanks for your help; as you can imagine, I am really a beginner...
I am the same person as the one who asked the question. Sorry for that.
First, I load the data from a CSV file (not shown in the code here); this is stored as data. Next, I create a new column named "response_class" based on the value in the column "response": if it is below .045, response_class = 1, otherwise 0. Second, I randomly sample 10000 rows from the data (due to computation limits), and here I want to make sure that I get the same number of 1s and 0s from response_class. At the end, I split the data to make it ready for a correlation matrix and for test and train data.
Here is my code:
data = data[data.response != 0]
pd.DataFrame(data)
data['response_class'] = np.where(data['response'] <= 0.045, 1, 0)
#1=below .045 / 0=above 0.045
#reduce amount of data by picking random samples
data= data.sample(n=10000)
#split data
data.drop(['response'], axis=1, inplace=True)
y = data['response_class']
X = data.drop('response_class', axis=1)
X_names = X.columns
data.head()
Found a solution:
from sklearn.utils import resample

# separate based on the response variable in response_class
df_zero = pd.DataFrame(data[data.response_class == 0])
df_one = pd.DataFrame(data[data.response_class == 1])
# upsampling minority class (here class 0) with replacement
df_zero_min = resample(df_zero,
                       replace=True,
                       n_samples=len(df_one),
                       random_state=123)
df_upsampled = pd.concat([df_one, df_zero_min])
df_upsampled.response_class.value_counts()
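Since the question also mentions reducing the data to a fixed number of rows, another option is to draw the same number of rows from each class without replacement. This is only a sketch: the helper name balanced_sample and the toy frame are hypothetical, and it assumes pandas 1.1 or later for groupby(...).sample and that every class has at least n_per_class rows.

```python
import pandas as pd


def balanced_sample(df, label_col, n_per_class, seed=0):
    """Draw n_per_class rows from every class without replacement, then shuffle."""
    sampled = df.groupby(label_col).sample(n=n_per_class, random_state=seed)
    return sampled.sample(frac=1, random_state=seed)


# Toy data (hypothetical): 20 rows of class 0 and 5 rows of class 1.
data = pd.DataFrame(
    {"feature": range(25), "response_class": [0] * 20 + [1] * 5}
)
balanced = balanced_sample(data, "response_class", n_per_class=5)
print(balanced["response_class"].value_counts())
```

For the 10000-row sample described in the question, n_per_class would be 5000, provided both classes are large enough.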

How to get the predict probability in Machine Learning

I have this ML model trained and dumped so I can use it anywhere. I need to get not just the score and the predicted values, but also the predict_proba values.
I could get that, but the problem is that I was expecting the probabilities to be between 0 and 1, and I get something else, like below.
array([[1.00000000e+00, 2.46920929e-12],
[1.00000000e+00, 9.89834607e-11],
[9.99993281e-01, 6.71853451e-06],
...,
[1.22327143e-01, 8.77672857e-01],
[9.99999653e-01, 3.47049875e-07],
[1.00000000e+00, 3.79462343e-10]])
And this is the python code I am using.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# dataframe = pd.read_csv("hr_dataset.csv")
dataframe = pd.read_csv("formodel.csv")
dataframe.head(2)
# separate input and target variables
inputs = dataframe.drop('PerformanceRating', axis='columns')
target = dataframe['PerformanceRating']
MaritalStatus_ = LabelEncoder()
JobRole_ = LabelEncoder()
Gender_ = LabelEncoder()
EducationField_ = LabelEncoder()
Department_ = LabelEncoder()
BusinessTravel_ = LabelEncoder()
Attrition_ = LabelEncoder()
OverTime_ = LabelEncoder()
Over18_ = LabelEncoder()
inputs['MaritalStatus_'] = MaritalStatus_.fit_transform(inputs['MaritalStatus'])
inputs['JobRole_'] = JobRole_.fit_transform(inputs['JobRole'])
inputs['Gender_'] = Gender_.fit_transform(inputs['Gender'])
inputs['EducationField_'] = EducationField_.fit_transform(inputs['EducationField'])
inputs['Department_'] = Department_.fit_transform(inputs['Department'])
inputs['BusinessTravel_'] = BusinessTravel_.fit_transform(inputs['BusinessTravel'])
inputs['Attrition_'] = Attrition_.fit_transform(inputs['Attrition'])
inputs['OverTime_'] = OverTime_.fit_transform(inputs['OverTime'])
inputs['Over18_'] = Over18_.fit_transform(inputs['Over18'])
inputs.drop(['MaritalStatus', 'JobRole', 'Attrition' , 'OverTime' , 'EmployeeCount', 'EmployeeNumber',
'Gender', 'EducationField', 'Department', 'BusinessTravel', 'Over18'], axis='columns', inplace=True)
inputsNew = inputs
inputs.head(2)
# inputs = scaled_df
X_train, X_testt, y_train, y_testt = train_test_split(inputs, target, test_size=0.2)
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_testt, y_testt)
print(result)
loaded_model.predict_proba(inputs)  # this produces the above result, will put it below as well
Output produced by loaded_model.predict_proba(inputs):
array([[1.00000000e+00, 2.46920929e-12],
[1.00000000e+00, 9.89834607e-11],
[9.99993281e-01, 6.71853451e-06],
...,
[1.22327143e-01, 8.77672857e-01],
[9.99999653e-01, 3.47049875e-07],
[1.00000000e+00, 3.79462343e-10]])
How can I convert these values or get an output like a percentage? (eg: 12%, 50%, 96%)
loaded_model.predict_proba(inputs) outputs the probability of 1st class as well as 2nd class (as you have 2 classes). That's why you see 2 outputs for each occurrence of the data. The total probability for each occurrence sums up to 1.
Let's say if you just care about the probability of second class you can use the below line to fetch the probability of second class.
loaded_model.predict_proba(inputs)[:,1]
I am not sure if this is what you are looking for, apologies if I misunderstood your question.
To convert the probability array from decimal to percentage, you can write (loaded_model.predict_proba(inputs)) * 100.
EDIT: The format that is outputted by loaded_model.predict_proba(inputs) is just scientific notation, i.e. all of those numbers are between 0 and 1, but many of them are extremely small probabilities and so are represented in scientific notation.
The reason that you see such small probabilities is that loaded_model.predict_proba(inputs)[:,0] (the first column of the probability array) represents the probabilities of the data belonging to one class, and loaded_model.predict_proba(inputs)[:,1] represents the probabilities of the data belonging to the other class.
In other words, this means that each row of the probability array should add up to 1.
I hope this helps!
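To illustrate the points above, here is a small sketch using two rows copied from the output in the question (the variable names proba, percent and labels are made up for illustration). It checks that each row sums to 1 and formats the second-class probability as a percentage string.

```python
import numpy as np

# Two rows taken from the predict_proba output shown in the question.
proba = np.array([
    [1.00000000e+00, 2.46920929e-12],
    [1.22327143e-01, 8.77672857e-01],
])

# Each row holds one probability per class and sums to 1.
print(proba.sum(axis=1))

# Turn off scientific notation when printing, then convert to percent.
np.set_printoptions(suppress=True)
percent = np.round(proba * 100, 2)
print(percent)

# Percentage strings for the second class only.
labels = [f"{p:.2f}%" for p in percent[:, 1]]
print(labels)
```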
Try this if the probability is spread across several classes and you only want the probability of the predicted class as a percentage.
pred_prob = []
pred_labels = loaded_model.predict_proba(inputs)
for each_pred in pred_labels:
    # the highest probability in the row belongs to the predicted class
    each_pred_max = max(each_pred)
    pred_prob.append(each_pred_max)
# convert to percent once, at the end
probability_list = [item * 100 for item in pred_prob]

Python pandas.dataframe.isin returning unexpected results

I have encountered this several times where I'm trying to filter a dataframe using a column from another dataframe. isin incorrectly returns true for every row. It is probably just a misunderstanding on my part as to how it should work. Why is it doing this, and is there a better way to code it?
#Read the data into a pandas dataframe
ar_data = pd.read_excel('~/data/Accounts-Receivable.xlsx')
ar_data.set_index('customerID', inplace=True)
#randomly select records for 70/30 train/test split
train = ar_data.sample(frac=.7, random_state = 1)
mask = ~ar_data.index.isin(list(train.index)) #why does this return False for every value?
test = ar_data[mask]
ar_data.shape #returns (2466, 11)
train.shape #(1726, 11)
test.shape #returns (0, 11). Should return 740 rows!
Example
I tried to execute your code with a sample DataFrame and it works:
import pandas as pd
ar_data = [[10,20],[11,2],[9,3]]
df = pd.DataFrame(ar_data,columns=["1","2"])
df.set_index("1", inplace=True)
train = df.sample(frac=.7, random_state = 1)
mask = ~df.index.isin(list(train.index))
test = df[mask]
train.shape #shape = (2,1)
test.shape #shape = (1,1)
The problem you probably have is that the index you used is not a key, hence there are multiple rows with the same customerID.
In fact executing your code with duplicated indexes leads to the bug you encountered.
import pandas as pd
ar_data = [[10,20],[10,2],[10,3]]
df = pd.DataFrame(ar_data,columns=["1","2"])
df.set_index("1", inplace=True)
train = df.sample(frac=.7, random_state = 1)
mask = ~df.index.isin(list(train.index))
test = df[mask]
train.shape #shape = (2,1)
test.shape #shape = (0,1)
Anyway, an easier and faster way to split your dataset would be:
from sklearn.model_selection import train_test_split
X = ar_data
y = ar_data
train, test, _, _ = train_test_split(X,y,test_size=0.3,random_state=1)
With that approach, you can also split the features and the targets with a single function call, and it doesn't rely on the index.
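If you prefer to stay with pandas only, a sketch along these lines checks whether the index is unique and, if not, splits on a fresh positional index. The toy frame and its amount column are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): customerID 10 appears three times.
df = pd.DataFrame(
    {"customerID": [10, 10, 10, 11], "amount": [20, 2, 3, 7]}
).set_index("customerID")

# False here means index-based masks such as isin() will match too many rows.
print(df.index.is_unique)

# Split on a fresh 0..n-1 index so each row is identified exactly once.
flat = df.reset_index()
train = flat.sample(frac=0.5, random_state=1)
test = flat.drop(train.index)
print(train.shape, test.shape)
```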

Sklearn inverse_transform return only one column when fit to many

Is there a way to inverse_transform one column with sklearn, when the initial transformer was fit on the whole data set? Below is an example of what I am trying to get after.
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
# Setting up a dummy pipeline
pipes = []
pipes.append(('scaler', MinMaxScaler()))
transformation_pipeline = Pipeline(pipes)
# Random data.
df = pd.DataFrame(
{'data1': [1, 2, 3, 1, 2, 3],
'data2': [1, 1, 1, 2, 2, 2],
'Y': [1, 4, 1, 2, 2, 2]
}
)
# Fitting the transformation pipeline
test = transformation_pipeline.fit_transform(df)
# Pulling the scaler function from the pipeline.
scaler = transformation_pipeline.named_steps['scaler']
# This is what I thought may work.
predicted_transformed = scaler.inverse_transform(test['Y'])
# The output would look something like this
# Essentially overlooking that scaler was fit on 3 variables and fitting
# the last one, or any I need.
predicted_transformed = [1, 4, 1, 2, 2, 2]
I need to be able to fit the whole dataset as part of a data prep process. But then I am importing the scaler later into another instance with sklearn.externals joblib. In this new instance the predicted values are the only thing that exists. So I need to extract just the inverse scaler for the Y column to get back the originals.
I am aware that I could fit one transformer for the X variables and another for the Y variable. However, I would like to avoid this. This method would add to the complexity of moving the scalers around and maintaining both of them in future projects.
A bit late but I think this code does what you are looking for:
# - scaler = the scaler object (it needs an inverse_transform method)
# - data = the data to be inverse transformed as a Series, ndarray, ...
# (a 1d object you can assign to a df column)
# - colName = the name of the column to which the data belongs
# - colNames = all column names of the data on which scaler was fit
# (necessary because scaler will only accept a df of the same shape as the one it was fit on)
def invTransform(scaler, data, colName, colNames):
    dummy = pd.DataFrame(np.zeros((len(data), len(colNames))), columns=colNames)
    dummy[colName] = data
    dummy = pd.DataFrame(scaler.inverse_transform(dummy), columns=colNames)
    return dummy[colName].values
Note that you need to provide enough information to use the inverse_transform method of the scaler object behind the scenes.
Similar problems. I have a multidimensional timeseries as input (a quantity and 'exogenous' variables), and one dimension (a quantity) as output. I am unable to invert the scaling to compare the forecast to the original test set, since the scaler expects a multidimensional input.
One solution I can think of is using separate scalers for the quantity and the exogenous columns.
Another solution I can think of is to give the scaler sufficient 'junk' columns just to fill out the dimensions of the array to be unscaled, then only look at the first column of the output.
Then, once I forecast, I can invert the scaling on the forecast to get values which I can compare to the test set.
Improving on what Willem said. This will work with less input.
def invTransform(scaler, data):
    # assumes the column to invert was the first column the scaler was fit on
    dummy = pd.DataFrame(np.zeros((len(data), scaler.n_features_in_)))
    dummy[0] = data
    dummy = pd.DataFrame(scaler.inverse_transform(dummy), columns=dummy.columns)
    return dummy[0].values
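For MinMaxScaler specifically, the dummy columns can be avoided, because the fitted scaler stores per-column min_ and scale_ attributes and applies X_scaled = X * scale_ + min_. The sketch below inverts a single column with that formula. The helper name inverse_one_column is hypothetical, and the approach would need adapting for other scalers, which store different attributes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same toy data as in the question.
df = pd.DataFrame(
    {"data1": [1, 2, 3, 1, 2, 3],
     "data2": [1, 1, 1, 2, 2, 2],
     "Y": [1, 4, 1, 2, 2, 2]}
)
scaler = MinMaxScaler().fit(df)
scaled = scaler.transform(df)


def inverse_one_column(scaler, values, col):
    """Undo MinMaxScaler for the column at position col only."""
    # MinMaxScaler computes X_scaled = X * scale_ + min_, so invert that.
    return (np.asarray(values) - scaler.min_[col]) / scaler.scale_[col]


y_pos = list(df.columns).index("Y")
restored = inverse_one_column(scaler, scaled[:, y_pos], y_pos)
print(restored)
```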
