Inverse Transform with FunctionTransformer from sklearn - python

I wanted to create my own Transformer using scikit-learn's FunctionTransformer and followed their example as a dry run. It worked, but then I wanted to take the inverse of that transformation just to see the end result. However, when I tried inverse_transform, it returned its input unchanged instead of undoing the transformation. How do I get the original values? I ask because I plan on using this transformation on a target variable and then making predictions. Those predictions will need to be inversely transformed after I predict.
As a side bar, should I fit on y_train and transform on my y_test? Or can I transform y all at once?
My transformer:
import random
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
randomlist = []
for i in range(0, 100):
    n = random.randint(1, 100)
    randomlist.append(n)
y = pd.Series(randomlist)
y_train = y[:80]
y_test = y[80:]
target_trans = FunctionTransformer(np.log, validate=True, check_inverse=True)
logy_train = target_trans.fit_transform(y_train.values.reshape(-1, 1))
logy_test = target_trans.transform(y_test.values.reshape(-1, 1))
target_trans.inverse_transform(y_train.values.reshape(-1, 1))

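In FunctionTransformer() it is not enough to set check_inverse=True; you also need to pass the actual inverse function via inverse_func. When inverse_func is left as None, inverse_transform is the identity and returns its input unchanged.
So for the above,
target_trans = FunctionTransformer(np.log, inverse_func=np.exp,
                                   validate=True, check_inverse=True)
which yields the desired result when inverse_transform is called on the transformed values (for example logy_train) rather than on the original y_train.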
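A minimal round-trip sketch of the log/exp pairing described above. The four target values are made up for illustration; any strictly positive values would do, since the log of zero or a negative number is undefined. Because np.log learns nothing from the data, fitting on the training targets versus all targets should not change the transformed values.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical, strictly positive target values shaped as a single column.
y = np.array([1.0, 5.0, 20.0, 100.0]).reshape(-1, 1)

# Supplying inverse_func is what makes inverse_transform undo the forward step.
target_trans = FunctionTransformer(
    np.log, inverse_func=np.exp, validate=True, check_inverse=True
)

log_y = target_trans.fit_transform(y)
# Invert the transformed values, not the original ones.
recovered = target_trans.inverse_transform(log_y)

print(np.allclose(recovered, y))
```

In the prediction setting from the question, the same inverse_transform call would be applied to the model's predictions, which live on the log scale.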

Related

Different output while using fit_transform vs fit and transform from sklearn

The following code snippet illustrates the issue:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
(nrows, ncolumns) = (1912392, 131)
X = np.random.random((nrows, ncolumns))
pca = PCA(n_components=28, random_state=0)
transformed_X1 = pca.fit_transform(X)
pca1 = pca.fit(X)
transformed_X2 = pca1.transform(X)
print((transformed_X1 != transformed_X2).sum()) # Gives output as 53546976
scalar = StandardScaler()
scaled_X1 = scalar.fit_transform(X)
scalar2 = scalar.fit(X)
scaled_X2 = scalar2.transform(X)
(scaled_X1 != scaled_X2).sum() # Gives output as 0
Can someone explain as to why the first output is not zero and the second output is?
Using this works:
pca = PCA(n_components=28, svd_solver = 'full')
transformed_X1 = pca.fit_transform(X)
pca1 = pca.fit(X)
transformed_X2 = pca1.transform(X)
print(np.allclose(transformed_X1, transformed_X2))
True
Apparently svd_solver='randomized' (which is what 'auto' resolved to for an input of this size) differs enough between .fit(X).transform(X) and fit_transform(X) to give different results even with the same seed. Also remember that floating point errors make == and != unreliable judges of equality between different computations, so use np.allclose().
It seems like StandardScaler.fit_transform() just directly uses .fit(X).transform(X) under the hood, so there were no floating point errors there to trip you up.
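A scaled-down sketch of the comparison above, assuming a much smaller random matrix than the one in the question so that it runs quickly. It contrasts exact element-wise equality with np.allclose for the 'full' solver; the variable names are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((500, 20))  # far smaller than the matrix in the question

pca = PCA(n_components=5, svd_solver="full")
via_fit_transform = pca.fit_transform(X)
via_fit_then_transform = pca.fit(X).transform(X)

# Exact equality can fail because of rounding in the last bits;
# np.allclose compares within a small tolerance instead.
exactly_equal = np.array_equal(via_fit_transform, via_fit_then_transform)
close_enough = np.allclose(via_fit_transform, via_fit_then_transform)
print(exactly_equal, close_enough)
```

Whether the exact comparison passes can depend on the platform and library versions, which is the reason to prefer the tolerance-based check.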

How to loop through multiple polynomial fits changing the degree

My code functions properly but I am repeating a block several times to vary the polynomial variable, degree. I assume this can and should be looped to allow quicker iterations, but I'm not sure how to do it. Prior to the code below I generate the train_test split which I keep for plotting.
After several iterations, I use np.vstack on the y_predictions to create a single array.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
### degree 1 ####
poly1 = PolynomialFeatures(degree=1)
x1_poly = poly1.fit_transform(X_train)
linreg1 = LinearRegression().fit(x1_poly, y_train)
pred_1= poly1.transform(x_prediction_data)
y1_poly_pred=linreg1.predict(pred_1)
### degree 3 #####
poly3 = PolynomialFeatures(degree=3)
x3_poly = poly3.fit_transform(X_train)
linreg3 = LinearRegression().fit(x3_poly, y_train)
pred_3= poly3.transform(x_prediction_data)
y3_poly_pred=linreg3.predict(pred_3)
#### etc. (the same block is repeated for degree = 6, 9, ...)
I would recommend collecting your answers in a dictionary, but I used a list here for simplicity.
The code iterates over i, which is the degree of your polynomial. On each pass it builds the features, trains the model, predicts, and then collects the predictions.
predictions_collector = []
for i in [1, 3, 6, 9]:
    poly = PolynomialFeatures(degree=i)
    x_poly = poly.fit_transform(X_train)
    linreg = LinearRegression().fit(x_poly, y_train)
    pred = poly.transform(x_prediction_data)
    y_poly_pred = linreg.predict(pred)
    # collect the predictions after each iteration/increase of degree
    predictions_collector.append(y_poly_pred)
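The dictionary variant mentioned above could look roughly like this. The data here is synthetic and only stands in for X_train, y_train and x_prediction_data from the question; predictions_by_degree is a name chosen for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-ins for the arrays used in the question.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(50, 1))
y_train = X_train[:, 0] ** 3 + rng.normal(scale=0.05, size=50)
x_prediction_data = np.linspace(-1, 1, 25).reshape(-1, 1)

# Map each polynomial degree to its predictions.
predictions_by_degree = {}
for degree in [1, 3, 6, 9]:
    poly = PolynomialFeatures(degree=degree)
    x_poly = poly.fit_transform(X_train)
    linreg = LinearRegression().fit(x_poly, y_train)
    predictions_by_degree[degree] = linreg.predict(poly.transform(x_prediction_data))

# One row per degree, matching the np.vstack step described in the question.
stacked = np.vstack(list(predictions_by_degree.values()))
print(stacked.shape)
```

Keying by degree keeps each prediction array tied to the degree that produced it, which is convenient when labeling plots.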

scaling data to specific range in python

I would like to scale an array of size [192,4000] to a specific range. I would like each row (1:192) to be rescaled to a specific range e.g. (-840,840). I run a very simple code:
import numpy as np
from sklearn import preprocessing as sp
sample_mat = np.random.randint(-840,840, size=(192, 4000))
scaler = sp.MinMaxScaler(feature_range=(-840,840))
scaler = scaler.fit(sample_mat)
scaled_mat= scaler.transform(sample_mat)
This messes up my matrix range, even when the max and min of my original matrix are exactly the same as the target range. I can't figure out what is wrong. Any ideas?
You can do this manually.
It is a linear transformation of the min-max normalized data. Note that this version uses the global minimum and maximum of the whole matrix.
interval_min = -840
interval_max = 840
scaled_mat = (sample_mat - np.min(sample_mat)) / (np.max(sample_mat) - np.min(sample_mat)) * (interval_max - interval_min) + interval_min
MinMaxScaler supports a feature_range argument at initialization that sets the output range. Keep in mind that it scales each column (feature) independently, not each row.
scaler = MinMaxScaler(feature_range=(1, 2)) will yield output in the (1, 2) range.
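Since the question asks for each row to be rescaled, here is a sketch of a per-row version of the manual formula, together with the same result obtained from MinMaxScaler by transposing so that rows become columns. It assumes no row is constant, because a constant row would cause a division by zero.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
sample_mat = rng.integers(-840, 840, size=(192, 4000)).astype(float)

interval_min, interval_max = -840, 840

# keepdims=True keeps shape (192, 1) so the row statistics broadcast across columns.
row_min = sample_mat.min(axis=1, keepdims=True)
row_max = sample_mat.max(axis=1, keepdims=True)

scaled_mat = (
    (sample_mat - row_min) / (row_max - row_min) * (interval_max - interval_min)
    + interval_min
)

# MinMaxScaler works column by column, so transpose, scale, and transpose back.
scaler = MinMaxScaler(feature_range=(interval_min, interval_max))
scaled_via_sklearn = scaler.fit_transform(sample_mat.T).T

print(np.allclose(scaled_mat, scaled_via_sklearn))
```

After either approach, every row has minimum -840 and maximum 840.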

How to get the predict probability in Machine Learning

I have an ML model that I trained and dumped so I can use it anywhere. I need to get not just the score and the predicted values, but also the predict_proba values.
I can get them, but the problem is that I was expecting the probabilities to be between 0 and 1, and instead I get something like the output below.
array([[1.00000000e+00, 2.46920929e-12],
[1.00000000e+00, 9.89834607e-11],
[9.99993281e-01, 6.71853451e-06],
...,
[1.22327143e-01, 8.77672857e-01],
[9.99999653e-01, 3.47049875e-07],
[1.00000000e+00, 3.79462343e-10]])
And this is the python code I am using.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# dataframe = pd.read_csv("hr_dataset.csv")
dataframe = pd.read_csv("formodel.csv")
dataframe.head(2)
# separate input and target variables
inputs = dataframe.drop('PerformanceRating', axis='columns')
target = dataframe['PerformanceRating']
MaritalStatus_ = LabelEncoder()
JobRole_ = LabelEncoder()
Gender_ = LabelEncoder()
EducationField_ = LabelEncoder()
Department_ = LabelEncoder()
BusinessTravel_ = LabelEncoder()
Attrition_ = LabelEncoder()
OverTime_ = LabelEncoder()
Over18_ = LabelEncoder()
inputs['MaritalStatus_'] = MaritalStatus_.fit_transform(inputs['MaritalStatus'])
inputs['JobRole_'] = JobRole_.fit_transform(inputs['JobRole'])
inputs['Gender_'] = Gender_.fit_transform(inputs['Gender'])
inputs['EducationField_'] = EducationField_.fit_transform(inputs['EducationField'])
inputs['Department_'] = Department_.fit_transform(inputs['Department'])
inputs['BusinessTravel_'] = BusinessTravel_.fit_transform(inputs['BusinessTravel'])
inputs['Attrition_'] = Attrition_.fit_transform(inputs['Attrition'])
inputs['OverTime_'] = OverTime_.fit_transform(inputs['OverTime'])
inputs['Over18_'] = Over18_.fit_transform(inputs['Over18'])
inputs.drop(['MaritalStatus', 'JobRole', 'Attrition' , 'OverTime' , 'EmployeeCount', 'EmployeeNumber',
'Gender', 'EducationField', 'Department', 'BusinessTravel', 'Over18'], axis='columns', inplace=True)
inputsNew = inputs
inputs.head(2)
# inputs = scaled_df
X_train, X_testt, y_train, y_testt = train_test_split(inputs, target, test_size=0.2)
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_testt, y_testt)
print(result)
loaded_model.predict_proba(inputs)  # this produces the result above; repeated below as well
Output produced by loaded_model.predict_proba(inputs):
array([[1.00000000e+00, 2.46920929e-12],
[1.00000000e+00, 9.89834607e-11],
[9.99993281e-01, 6.71853451e-06],
...,
[1.22327143e-01, 8.77672857e-01],
[9.99999653e-01, 3.47049875e-07],
[1.00000000e+00, 3.79462343e-10]])
How can I convert these values or get an output like a percentage? (eg: 12%, 50%, 96%)
loaded_model.predict_proba(inputs) outputs the probability of 1st class as well as 2nd class (as you have 2 classes). That's why you see 2 outputs for each occurrence of the data. The total probability for each occurrence sums up to 1.
If you only care about the probability of the second class, you can use the line below to fetch it.
loaded_model.predict_proba(inputs)[:,1]
I am not sure if this is what you are looking for, apologies if I misunderstood your question.
To convert the probability array from decimal to percentage, you can write (loaded_model.predict_proba(inputs)) * 100.
EDIT: The format that is outputted by loaded_model.predict_proba(inputs) is just scientific notation, i.e. all of those numbers are between 0 and 1, but many of them are extremely small probabilities and so are represented in scientific notation.
The reason that you see such small probabilities is that loaded_model.predict_proba(inputs)[:,0] (the first column of the probability array) represents the probabilities of the data belonging to one class, and loaded_model.predict_proba(inputs)[:,1] represents the probabilities of the data belonging to the other class.
In other words, this means that each row of the probability array should add up to 1.
I hope this helps!
Try this if you only want, for each row, the probability of the predicted (most likely) class as a percentage.
pred_prob = []
pred_labels = loaded_model.predict_proba(inputs)
for each_pred in pred_labels:
    each_pred_max = max(each_pred)  # probability of the most likely class
    pred_prob.append(each_pred_max)
probability_list = [item * 100 for item in pred_prob]
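To see these points on something runnable, here is a small sketch with a LogisticRegression fitted on synthetic data. That model and the data are assumptions made for illustration; the pickled model from the question is not available here. It checks that each row sums to 1, turns probabilities into percentages, and switches off scientific notation when printing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data standing in for the HR dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)               # shape (n_samples, 2)
row_sums = proba.sum(axis=1)                 # each row sums to 1
percent_class_1 = proba[:, 1] * 100          # second class, as a percentage
percent_predicted = proba.max(axis=1) * 100  # most likely class, as a percentage

# suppress=True prints small values as plain decimals instead of 2.4e-12.
np.set_printoptions(suppress=True, precision=4)
print(proba[:3])
print([f"{p:.1f}%" for p in percent_class_1[:3]])
```

The values themselves do not change; only the printed representation does.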

Numpy Array for SVM model rather than a DataFrame

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
# Read the data.
data = np.asarray(pd.read_csv('data.csv', header=None))
# Assign the features to the variable X, and the labels to the variable y.
X = data[:,0:2]
y = data[:,2]
# TODO: Create the model and assign it to the variable model.
# Find the right parameters for this model to achieve 100% accuracy on the dataset.
model = SVC()
model.fit(X,y)
Two questions:
1. The data goes into a NumPy array from a pandas DataFrame (via pd.read_csv). Is that better? Is there a good reason for that? Why not stay with the DataFrame?
2. I do not understand this notation:
X = data[:,0:2]
y = data[:,2]
What does it do?
Thank you.
The data consists of a CSV file with many rows like this:
0.28917,0.65643,0
It has three columns: the first two hold the coordinates of the points, and the third holds the label.
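As a sketch of what the slicing notation does, take a tiny array in the same three-column layout. Only the first row comes from the question; the other two are made up for illustration.

```python
import numpy as np

# Three rows in the layout described above: x coordinate, y coordinate, label.
data = np.array([
    [0.28917, 0.65643, 0],
    [0.29705, 0.86184, 1],  # made-up row
    [0.50000, 0.25000, 1],  # made-up row
])

# The part before the comma selects rows (":" means all of them);
# the part after the comma selects columns.
X = data[:, 0:2]  # every row, columns 0 and 1 (the stop index 2 is excluded)
y = data[:, 2]    # every row, column 2 only, giving a 1-D array

print(X.shape, y.shape)  # (3, 2) (3,)
```

So X holds the two coordinate columns as a 2-D feature matrix and y holds the labels as a flat vector, which is the input shape model.fit(X, y) expects.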
