How to get the prediction probability in machine learning - Python

I have an ML model trained and pickled so I can use it anywhere. I need not just the score and predicted values, but also the predict_proba values. I can get them, but the problem is that I was expecting the probabilities to be between 0 and 1, and instead I get something like the output below.
array([[1.00000000e+00, 2.46920929e-12],
[1.00000000e+00, 9.89834607e-11],
[9.99993281e-01, 6.71853451e-06],
...,
[1.22327143e-01, 8.77672857e-01],
[9.99999653e-01, 3.47049875e-07],
[1.00000000e+00, 3.79462343e-10]])
This is the Python code I am using.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# dataframe = pd.read_csv("hr_dataset.csv")
dataframe = pd.read_csv("formodel.csv")
dataframe.head(2)

# separate input and target variables
inputs = dataframe.drop('PerformanceRating', axis='columns')
target = dataframe['PerformanceRating']

# label-encode the categorical columns
MaritalStatus_ = LabelEncoder()
JobRole_ = LabelEncoder()
Gender_ = LabelEncoder()
EducationField_ = LabelEncoder()
Department_ = LabelEncoder()
BusinessTravel_ = LabelEncoder()
Attrition_ = LabelEncoder()
OverTime_ = LabelEncoder()
Over18_ = LabelEncoder()

inputs['MaritalStatus_'] = MaritalStatus_.fit_transform(inputs['MaritalStatus'])
inputs['JobRole_'] = JobRole_.fit_transform(inputs['JobRole'])
inputs['Gender_'] = Gender_.fit_transform(inputs['Gender'])
inputs['EducationField_'] = EducationField_.fit_transform(inputs['EducationField'])
inputs['Department_'] = Department_.fit_transform(inputs['Department'])
inputs['BusinessTravel_'] = BusinessTravel_.fit_transform(inputs['BusinessTravel'])
inputs['Attrition_'] = Attrition_.fit_transform(inputs['Attrition'])
inputs['OverTime_'] = OverTime_.fit_transform(inputs['OverTime'])
inputs['Over18_'] = Over18_.fit_transform(inputs['Over18'])

# drop the original (non-encoded) columns and identifiers
inputs.drop(['MaritalStatus', 'JobRole', 'Attrition', 'OverTime', 'EmployeeCount', 'EmployeeNumber',
             'Gender', 'EducationField', 'Department', 'BusinessTravel', 'Over18'], axis='columns', inplace=True)
inputsNew = inputs
inputs.head(2)

# inputs = scaled_df
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2)

# 'filename' is the path to the previously pickled model
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_test, y_test)
print(result)

loaded_model.predict_proba(inputs)  # this produces the result above; shown again below
Output produced by loaded_model.predict_proba(inputs):
array([[1.00000000e+00, 2.46920929e-12],
[1.00000000e+00, 9.89834607e-11],
[9.99993281e-01, 6.71853451e-06],
...,
[1.22327143e-01, 8.77672857e-01],
[9.99999653e-01, 3.47049875e-07],
[1.00000000e+00, 3.79462343e-10]])
How can I convert these values, or get the output, as percentages? (e.g. 12%, 50%, 96%)

loaded_model.predict_proba(inputs) outputs the probability of the 1st class as well as the 2nd class (as you have 2 classes); that's why you see two values for each row of the data. The two probabilities for each row sum to 1.
If you just care about the probability of the second class, you can use the line below to fetch it.
loaded_model.predict_proba(inputs)[:,1]
I am not sure if this is what you are looking for; apologies if I misunderstood your question.

To convert the probability array from decimals to percentages, you can write (loaded_model.predict_proba(inputs)) * 100.
EDIT: The format output by loaded_model.predict_proba(inputs) is just scientific notation, i.e. all of those numbers are between 0 and 1, but many of them are extremely small probabilities and so are represented in scientific notation.
The reason you see such small probabilities is that loaded_model.predict_proba(inputs)[:,0] (the first column of the probability array) represents the probabilities of the data belonging to one class, and loaded_model.predict_proba(inputs)[:,1] represents the probabilities of the data belonging to the other class.
In other words, each row of the probability array should add up to 1.
I hope this helps!
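As a minimal sketch of both points (reusing loaded_model and inputs from the question), you can suppress the scientific notation with NumPy's print options and format the second-class probabilities as percentage strings:
import numpy as np

proba = loaded_model.predict_proba(inputs)

# show plain decimals instead of scientific notation
np.set_printoptions(suppress=True)
print(proba)

# format the probability of the second class as a percentage, e.g. '12.23%'
percentages = ["{:.2%}".format(p) for p in proba[:, 1]]
print(percentages[:5])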

Try this if the probability is spread over the classes and you want the probability of the predicted class only, as a percentage.
pred_prob = []
pred_labels = loaded_model.predict_proba(inputs)
for each_pred in pred_labels:
    # keep the probability of the most likely class for each row
    pred_prob.append(max(each_pred))
# convert to percentages
probability_list = [item * 100 for item in pred_prob]


Why does sklearn KMeans change my dataset after fitting?

I am using KMeans from sklearn to cluster College.csv, but when I fit the KMeans model, my dataset changes! Before using KMeans, I standardize the numerical variables with StandardScaler and use OneHotEncoder to dummy the categorical variable "Private".
My code is:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans

num_vars = data.columns[1:]
scaler = StandardScaler()
data[num_vars] = scaler.fit_transform(data[num_vars])

ohe = OneHotEncoder()
data["Private"] = ohe.fit_transform(data.Private.values.reshape(-1, 1)).toarray()

km = KMeans(n_clusters=6)
km.fit(data)
(The original post included screenshots of the dataset before and after running KMeans.)
It appears that when you run km.fit(data), the .fit method modifies data in place by inserting a column that is the opposite of your one-hot encoded column. Also confusing is the fact that the "Terminal" column disappears.
For now, you can use this workaround that copies your data:
data1 = data.copy()
km = KMeans(n_clusters = 6, n_init = 'auto')
km.fit(data1)
Edit: When you run km.fit, the first method that is run is km._validate_data – which is a validation step that modifies the dataframe that you pass (see here and here)
For example, if I add the following to the end of your code:
km._validate_data(
    data,
    accept_sparse="csr",
    dtype=[np.float64, np.float32],
    order="C",
    accept_large_sparse=False,
)
Running this changes your data, but I don't know exactly why it happens. It may have something to do with the data itself.
There's a subtle bug in the posted code. Let's demonstrate it:
new_df = pd.DataFrame({"Private": ["Yes", "Yes", "No"]})
OneHotEncoder returns something like this:
new_data = np.array(
    [[0, 1],
     [0, 1],
     [1, 0]])
What happens if we assign new_df["Private"] with our new (3, 2) array?
>>> new_df["Private"] = new_data
>>> print(new_df)
Private
0 0
1 0
2 1
Wait, where'd the other column go?
Uh oh, it's still in there ...
... but it's invisible until we look at the actual values:
>>> print(new_df.values)
[[0 1]
[0 1]
[1 0]]
As @Derek hinted in his answer, KMeans has to validate the data, which usually converts pandas dataframes into the underlying arrays. When this happens, all your "columns" get shifted to the right by one, because there was an invisible column created by the OneHotEncoder.
Is there a better way?
Yep, use a pipeline!
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.cluster import KMeans

pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            # binary "Private" column -> single 0/1 column via OrdinalEncoder
            ("ohe", OrdinalEncoder(categories=[["No", "Yes"]]), ["Private"]),
        ],
        remainder=StandardScaler(),
    ),
    KMeans(n_clusters=6),
)
out = pipe.fit(df)
(This is what the screenshots show: the data is the same but shifted over by one column. The values that appear under Apps never existed before; everything has been shifted to the right.)
It has something to do with your line
data[num_vars] = scaler.fit_transform(data[num_vars])
which is effectively doing a nested selection, data[data.columns[1:]].
Basically, you can follow a method like this instead (using positional indexing on the frame):
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# use .iloc for positional slicing on a DataFrame
data.iloc[:, 1:] = sc.fit_transform(data.iloc[:, 1:])
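If you prefer to keep the preprocessing outside a pipeline, here is a minimal sketch of a pattern that avoids the hidden-column bug entirely (assuming data is a dataframe whose only non-numeric column is the binary Private column, as above): let pandas create one properly named 0/1 column per category instead of assigning a 2D array to a single column.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

data = data.copy()  # work on a copy so the original frame is untouched

# one named 0/1 column per category of "Private"
data = pd.get_dummies(data, columns=["Private"])

# scale everything except the dummy columns
num_vars = [c for c in data.columns if not c.startswith("Private_")]
data[num_vars] = StandardScaler().fit_transform(data[num_vars])

km = KMeans(n_clusters=6)
km.fit(data)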

Splitting a dataset into train and test sets in Python

I have a dataset whose label is 0 or 1.
I want to divide my data into test and train sets. For this I first used sklearn's train_test_split method,
but I want to select the test data in such a way that 10% of it is from class 0 and 90% is from class 1.
How can I do this?
Refer to the official documentation for sklearn.model_selection.train_test_split.
You want to specify the response variable with the stratify parameter when performing the split.
Stratification preserves the ratio of the class variable when the split is performed.
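A minimal sketch (assuming a feature matrix X and binary labels y); note that stratify preserves the dataset's existing class ratio in both splits rather than imposing a new one:
from sklearn.model_selection import train_test_split

# keep the class ratio of y identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)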
You could write your own function to do this.
One way is to select rows by index and shuffle them after taking them.
Split your dataset into class 1 and class 0, then split each part as you want:
df_0 = df.loc[df['class'] == 0]
df_1 = df.loc[df['class'] == 1]

# train_test_split returns (train, test), in that order
train_0, test_0 = train_test_split(df_0, test_size=0.1)
train_1, test_1 = train_test_split(df_1, test_size=0.9)

test = pd.concat((test_0, test_1),
                 axis=0,
                 ignore_index=True).sample(frac=1)  # sample(frac=1) shuffles the rows
train = pd.concat((train_0, train_1),
                  axis=0,
                  ignore_index=True).sample(frac=1)
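To sanity-check the composition (assuming a class column as above), you can inspect the class proportions of the resulting test set; if the two classes are similarly sized in df, this should come out at roughly 10% class 0 and 90% class 1:
# fraction of each class in the test set
print(test['class'].value_counts(normalize=True))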

Inverse Transform with FunctionTransformer from sklearn

I wanted to create my own transformer using scikit-learn's FunctionTransformer and followed their example as a dry run. It worked, but then I wanted to take the inverse of that transformation just to see the end result. However, when I tried inverse_transform, it returned the same thing as the transformation. How do I get the original values back? I ask because I plan on using this transformation on a target variable before making predictions; those predictions will then need to be inverse-transformed.
As a sidebar, should I fit on y_train and transform y_test, or can I transform y all at once?
My transformer:
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
import random

randomlist = []
for i in range(0, 100):
    n = random.randint(1, 100)
    randomlist.append(n)
y = pd.Series(randomlist)
y_train = y[:80]
y_test = y[80:]

target_trans = FunctionTransformer(np.log, validate=True, check_inverse=True)
logy_train = target_trans.fit_transform(y_train.values.reshape(-1, 1))
logy_test = target_trans.transform(y_test.values.reshape(-1, 1))
target_trans.inverse_transform(y_train.values.reshape(-1, 1))
Within FunctionTransformer() you not only need to set check_inverse=True but also define the actual inverse function itself.
So, for the above:
target_trans = FunctionTransformer(np.log, inverse_func=np.exp,
                                   validate=True, check_inverse=True)
which yields the desired result.
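A minimal round-trip check, reusing y_train from the question (note that inverse_transform should be applied to the transformed values, not to the raw ones as in the original snippet):
import numpy as np

logy_train = target_trans.fit_transform(y_train.values.reshape(-1, 1))
recovered = target_trans.inverse_transform(logy_train)  # exp(log(y)) == y

# the recovered values match the originals up to floating-point error
print(np.allclose(recovered, y_train.values.reshape(-1, 1)))  # True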

Outlier detection with Local Outlier Factor (LOF)

I am working with healthcare insurance claims data and would like to identify fraudulent claims. I have been reading online to try and find a better method, and came across the following code on scikit-learn.org.
Does anyone know how to select the outliers? The code plots them in a graph, but I would like to select them if possible.
I have tried appending the y predictions to the X dataframe, but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code I tried.
X['outliers'] = y_pred
The first 200 data points are inliers and the last 20 are outliers. When you call fit_predict on X, you get either outlier (-1) or inlier (1) for each point in y_pred. So to get the predicted outliers, you need to take those with y_pred = -1 and get the corresponding values in X. The script below will give you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
I combine y_pred and X into pairs and, wherever y is -1, collect the corresponding X values.
However, there are eight errors in the predictions (8 out of 220): -1 values in y_pred[:200] and 1 values in y_pred[201:220]. Please be aware of these errors as well.
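Since X is a NumPy array here, a boolean mask gives the same selection more directly; a minimal sketch:
# rows of X that the model flagged as outliers (y_pred == -1)
X_pred_outliers = X[y_pred == -1]
print(X_pred_outliers.shape)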

Ordinal Ridge and Lasso Regression in Python

Thank you for taking the time to read my problem.
I have to run Ordinal Ridge and Lasso regression on my dataset. The values I want to predict are ordinal (5 levels), and I have many predictors (over 60) that are continuous, but not all of them are logically significant. So I would like to run the ordinal regression using Lasso and Ridge to find the significant ones.
I am very new to Python and don't really know what to do; I would appreciate any help from the community.
I have found the mord module, but (even if I am using it right) it doesn't provide an ordinal Lasso.
Could anyone help me with this, please?
Thanks in advance.
Update:
I have written the following code. I don't get any errors, but I get an accuracy lower than in previous analyses, so I assume I am making a mistake somewhere in how I am doing this. I would appreciate it if someone could help me with it. I guess the problem could be in the scaling, but I don't know where.
"rel" has five values, 1, 2, 3, 4, 5, which are my predicted values.
import numpy as np
import pandas as pd
import mord
from sklearn.preprocessing import scale, StandardScaler
from sklearn.metrics import mean_squared_error
import csv

# defining a function to rotate numbers in an array
def leftRotatebyOne(arr, n):
    temp = arr[0]
    for i in range(n-1):
        arr[i] = arr[i+1]
    arr[n-1] = temp

# defining OR to do Ordinal Ridge Regression
OR = mord.OrdinalRidge()

# defining the loop to go through all participants
for s in range(17):
    # reading the data for each participant
    df = pd.read_csv("Complete{0}.csv".format(s+1), index_col=0, header=None).dropna()
    df.index.name = 'subject{0}'.format(s+1)
    df.columns = ["ch{0}".format(i+1) for i in range(64)] + ["irrel", "rel"]
    # defining output and predictors
    y = df.rel
    X = df.drop(['rel', 'irrel'], axis=1).astype('float64')
    # an array containing trial numbers
    T = np.array(range(480))
    # a matrix to hold the models of all runs (480 leave-one-out) for each participant
    out = np.empty((67, 480))
    # running the model for all trials (each time keeping one out)
    for t in range(480):
        T1 = T[:479]
        T2 = T[479:]  # the last one, which is going to be left out
        # The last trial is always the one left out; rotating T below changes which trial that is.
        # train samples
        X_train = X.iloc[T1, :]
        y_train = np.array(y.iloc[T1])
        scaler = StandardScaler().fit(X_train)
        # test sample
        X_test = X.iloc[T2, :]
        y_test = np.array(y.iloc[T2])
        # rotating T
        leftRotatebyOne(T, 480)
        # running ordinal ridge regression from the module mord
        OR.fit(scaler.transform(X_train), y_train)
        predicted = OR.predict(scaler.transform(X_test))
        error = mean_squared_error(y_test, predicted)
        coeff = pd.Series(OR.coef_, index=X.columns)
        # getting the accuracy of each prediction
        # (predicted and y_test each hold a single value here)
        if predicted == y_test:
            accuracy = 1
        else:
            accuracy = 0
        # collecting all results in a matrix (each column is for leaving out one of the trials)
        out[:, t] = np.hstack((coeff, predicted, error, accuracy))
    # saving the results for each participant
    np.savetxt("reg{0}.csv".format(s+1), out, delimiter=',')

# saving all results in one file
filenames = ["reg{0}.csv".format(i+1) for i in range(17)]
dataframes = [pd.read_csv(p) for p in filenames]
merged_dataframe = pd.concat(dataframes, axis=1)
merged_dataframe.to_csv("merged.csv", index=False)

# reading the file that contains all the models for all the participants
cl = pd.read_csv("merged.csv", header=None).dropna()
# naming the rows
cl.index = ["ch{0}".format(i+1) for i in range(64)] + ["predicted", "error", "accuracy"]
# calculating the mean of each row
print(pd.Series.mean(cl, axis=1))
# getting the mean of accuracy for each participant
for s in range(17):
    regg = pd.read_csv("reg{0}.csv".format(s+1), header=None).dropna()
    regg.index = ["ch{0}".format(i+1) for i in range(64)] + ["predicted", "error", "accuracy"]
    print(pd.Series.mean(regg, axis=1)[66])
I didn't find anything other than the mord module.
I want to do a leave-one-out cross-validation, so I have to keep just one of the samples for the test.
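For reference, the manual rotation above can also be expressed with sklearn's LeaveOneOut splitter; a minimal sketch under the same X and y definitions as in the code above:
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler
import mord

OR = mord.OrdinalRidge()
loo = LeaveOneOut()

# one iteration per trial, holding that trial out as the test sample
for train_idx, test_idx in loo.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    scaler = StandardScaler().fit(X_train)
    OR.fit(scaler.transform(X_train), y_train)
    predicted = OR.predict(scaler.transform(X_test))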
PS: I am following the instructions in this link:
http://nbviewer.jupyter.org/github/JWarmenhoven/ISL-python/blob/master/Notebooks/Chapter%206.ipynb
I get the following error when doing exactly as they have done:
module 'glmnet' has no attribute 'ElasticNet'
However, they do not cover ordinal regression.
You can use sklearn for this:
from sklearn import linear_model
regr_lasso = linear_model.Lasso(alpha=0.1)
regr_ridge = linear_model.Ridge(alpha=1.0)
regr_elasticnet = linear_model.ElasticNet(random_state=0)
Refer to the link below for more details:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_coordinate_descent_path.html
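Note that these are plain (non-ordinal) regressors, which treat the 1-5 levels as continuous numbers. As a minimal sketch of using the Lasso to spot the significant predictors (assuming X and y as defined in the question's code), the coefficients driven exactly to zero are the ones the Lasso considers uninformative:
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
regr_lasso = Lasso(alpha=0.1).fit(X_scaled, y)  # alpha=0.1 is a hypothetical starting point; tune it

# nonzero coefficients mark the predictors the Lasso kept
coefs = pd.Series(regr_lasso.coef_, index=X.columns)
print(coefs[coefs != 0].sort_values())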
