Scikit learn for predicting likelihood based on two values - python

I'm new to Python and building a classifier that predicts willingness to be vaccinated when trust in government (trustngov) and trust in public health (poptrusthealth) from the dataset together exceed a certain percentage. I'm not sure how to turn both into classes.
UPDATE: I combined the two trust columns into a single score, but why is the accuracy of the model 1.0?
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
df = pd.read_csv("covidpopulation2.csv")
print(df.head())
99853 8254 219 0.649999976 0.80763793
0 99853 8254 219 0.65 0.807638
1 48490 4007 227 0.49 0.357625
2 190179 8927 107 0.54 0.853186
3 190179 8927 107 0.54 0.853186
4 190179 8927 107 0.54 0.853186
print(df.describe())
99853 8254 219 0.649999976 0.80763793
count 1.342500e+04 13425.000000 13425.000000 13425.000000 13425.000000
mean 3.095292e+05 20555.570056 225.864655 0.473157 0.684484
std 5.070872e+05 28547.608184 218.078176 0.184501 0.167985
min 1.225700e+04 26.000000 2.000000 0.000000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.370000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.490000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.630000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.770000 0.983146
df = pd.read_csv("covidpopulation2.csv", na_values = ['?'], names = ['covidcases','coviddeaths','mortalityperm','trustngov','poptrusthealth'])
print(df.head())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
0 99853 8254 219 0.65 0.807638
1 99853 8254 219 0.65 0.807638
2 48490 4007 227 0.49 0.357625
3 190179 8927 107 0.54 0.853186
4 190179 8927 107 0.54 0.853186
print(df.describe())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
count 1.342600e+04 13426.000000 13426.000000 13426.00000 13426.000000
mean 3.095136e+05 20554.653806 225.864144 0.47317 0.684493
std 5.070715e+05 28546.742358 218.070062 0.18450 0.167982
min 1.225700e+04 26.000000 2.000000 0.00000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.37000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.49000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.63000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.77000 0.983146
df.dropna(inplace=True)
print(df.describe())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
count 1.342600e+04 13426.000000 13426.000000 13426.00000 13426.000000
mean 3.095136e+05 20554.653806 225.864144 0.47317 0.684493
std 5.070715e+05 28546.742358 218.070062 0.18450 0.167982
min 1.225700e+04 26.000000 2.000000 0.00000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.37000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.49000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.63000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.77000 0.983146
all_features = df[['covidcases',
'coviddeaths',
'mortalityperm',
'trustngov',
'poptrusthealth',]].values
all_classes = (df['poptrusthealth'].values + df['trustngov'].values)
willing = 0
unwilling = 0
label = [None] * 13426
for i in range(len(all_classes)):
    if all_classes[i] > 0.70:
        willing += 1
        label[i] = 1
    else:
        unwilling = unwilling + 1
        label[i] = 0
print(willing)
print(unwilling)
all_classes = label
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
all_features_scaled = scaler.fit_transform(all_features)
from sklearn.model_selection import train_test_split
np.random.seed(1234)
(training_inputs,testing_inputs,training_classes,testing_classes) = train_test_split(all_features_scaled,all_classes,train_size = 0.8,test_size = 0.2,random_state = 1)
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier(random_state=1)
clf.fit(training_inputs, training_classes)
DecisionTreeClassifier(random_state=1)
print(clf)
DecisionTreeClassifier(random_state=1)
print('the accuracy of the decision tree is:',clf.score(testing_inputs, testing_classes))
the accuracy of the decision tree is: 1.0
import pydotplus
from sklearn import tree
import collections
import graphviz
feature_names = ['covidcases','coviddeaths', 'mortalityperm','trustngov',
'poptrusthealth']
dot_data = tree.export_graphviz(clf, feature_names = feature_names, out_file =None, filled = True, rounded = True)
graph = pydotplus.graph_from_dot_data(dot_data)
colors = ('turquoise','orange')
edges = collections.defaultdict(list)
for edge in graph.get_edge_list():
    edges[edge.get_source()].append(int(edge.get_destination()))
for edge in edges:
    edges[edge].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[edge][i]))[0]
        dest.set_fillcolor(colors[i])
graph.write_png('tree.png')
Any help or ideas would be appreciated.

Sorry, but this makes no sense from a machine learning point of view. Your label is created directly from the input features, which is why the model accuracy is 100%.
Here is your final classifier (no machine learning needed):
if trustngov + poptrusthealth > 0.7, predict 1; otherwise predict 0.
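If you want the model to actually learn something, a minimal sketch (reusing the same df as above) would drop the two trust columns from the feature matrix, since the label is computed from them, and keep only the remaining variables as predictors:
# Sketch: exclude the columns the label is derived from to avoid leakage.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X = df[['covidcases', 'coviddeaths', 'mortalityperm']].values
y = (df['trustngov'] + df['poptrusthealth'] > 0.70).astype(int).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)
print('test accuracy:', clf.score(X_test, y_test))  # now reflects what the other variables can predict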

It is perfectly possible to get 100% accuracy on training data, since the ML algorithm has already seen it.
You have to apply your model to data that was not used during the learning phase. This is usually done by splitting the data into a training set and a test set.
Then you train/fit the model on the training data only, test it on the test data, and compute accuracy there. The test accuracy tells you whether your model is well trained and actually works.
Keeping the test data unused during training is important for a fair evaluation: it gives you an unbiased estimate of the model's accuracy.
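As a rough sketch reusing all_features_scaled and the labels built above, cross-validation is another way to estimate accuracy on data the model has not seen during fitting:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=1)
# 5-fold CV: each fold is scored on data the model was not fitted on.
# With the leaking trust features still included, this will stay near 1.0.
scores = cross_val_score(clf, all_features_scaled, all_classes, cv=5)
print('mean cross-validated accuracy:', scores.mean())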

Related

Erasing outliers from a dataframe in python

For an assignment I have to remove the outliers from a CSV file using different methods.
I tried working with the 'height' column after loading the CSV into a pandas DataFrame, but it keeps giving me errors or not touching the outliers at all. I'm trying to do this with a KNN-based method in Python.
The code that I wrote is the following
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs
df = pd.read_csv("data.csv")
print(df.describe())
print(df.columns)
df['height'].plot(kind='hist')
print(df['height'].value_counts())
data= pd.DataFrame(df['height'],df['active'])
k=1
knn = NearestNeighbors(n_neighbors=k)
knn.fit([df['height']])
neighbors_and_distances = knn.kneighbors([df['height']])
knn_distances = neighbors_and_distances[0]
tnn_distance = np.mean(knn_distances, axis=1)
print(knn_distances)
PCM = df.plot(kind='scatter', x='x', y='y', c=tnn_distance, colormap='viridis')
plt.show()
And the data is something like this:
id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,18857,1,50,64.0,130,70,3,1,0,0,0,1
3,17623,2,250,82.0,150,100,1,1,0,0,1,1
I don't know what I'm missing or doing wrong.
df = pd.read_csv("data.csv")
X = df[['height', 'weight']]
X.plot(kind='scatter', x='weight', y='height', colormap='viridis')
plt.show()
knn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = knn.kneighbors(X)
X['distances'] = distances[:,1]
X.distances
0 1.000000
1 1.000000
2 1.000000
3 3.000000
4 1.000000
5 1.000000
6 133.958949
7 100.344407
...
X.plot(kind='scatter', x='weight', y='height', c='distances', colormap='viridis')
plt.show()
MAX_DIST = 10
X[X.distances < MAX_DIST]
height weight
0 162 78.0
1 162 78.0
2 151 76.0
3 151 76.0
4 171 84.0
...
And finally to filter out all the outliers:
MAX_DIST = 10
X = X[X.distances < MAX_DIST]
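Putting the pieces together, a minimal end-to-end version might look like this (a sketch; it assumes data.csv contains the height and weight columns shown above, and MAX_DIST is a threshold you tune for your data):
import pandas as pd
from sklearn.neighbors import NearestNeighbors
df = pd.read_csv("data.csv")
X = df[['height', 'weight']].copy()
# Distance to the nearest other point; large values suggest outliers.
knn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = knn.kneighbors(X)
X['distances'] = distances[:, 1]
MAX_DIST = 10
X_clean = X[X.distances < MAX_DIST]
print(X_clean.describe())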

How to select columns of a data base to call a linear regression (OLS and lasso) in sklearn

I am not comfortable with Python; I am far less intimidated by, and much more at ease with, R. So indulge me on a silly question that has taken a ton of searching without success.
I want to fit in a regression model with sklearn both with OLS and lasso. In particular, I like the mtcars dataset that is so easy to call in R, and, as it turns out, also very accessible in Python:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
It looks like this:
mpg cyl disp hp drat ... qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 ... 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 ... 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 ... 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 ... 19.44 1 0 3 1
In trying to use LinearRegression() the usual structure found is
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x, y)
but to do so, I need to select several columns of df as the regressors x, and one column as the dependent variable y. For example, I'd like an x matrix that includes a column of 1's (for the intercept), disp and qsec (numerical variables), and cyl (a categorical variable). For the dependent variable, I'd like to use mpg.
If it were possible to word it this way, it would look like
model = LinearRegression().fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
But how do I go about the syntax for it?
Similarly, how can I do the same with lasso:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.001)
lasso.fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
but again this is not the right syntax.
I did find that you can run the actual regression (OLS or lasso) by turning the dataframe into a matrix. However, the column names are gone, and it is hard to tell which variable corresponds to each coefficient. And I still haven't found a simple way to get diagnostics such as p-values, or the R-squared to begin with.
You can maybe try patsy which is used by statsmodels:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
from patsy import dmatrix
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
mat = dmatrix("disp + qsec + C(cyl)", mtcars)
It looks like this; we can omit the first (intercept) column, since sklearn includes the intercept itself:
mat
DesignMatrix with shape (32, 5)
Intercept C(cyl)[T.6] C(cyl)[T.8] disp qsec
1 1 0 160.0 16.46
1 1 0 160.0 17.02
1 0 0 108.0 18.61
1 1 0 258.0 19.44
1 0 1 360.0 17.02
X = pd.DataFrame(mat[:,1:],columns = mat.design_info.column_names[1:])
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X,mtcars['mpg'])
But the parameters in model.coef_ will not be named. You can put them into a Series to read them:
pd.Series(model.coef_,index = X.columns)
C(cyl)[T.6] -5.087564
C(cyl)[T.8] -5.535554
disp -0.025860
qsec -0.162425
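The same design matrix can be reused for the lasso from the question; a sketch:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.001)
lasso.fit(X, mtcars['mpg'])
print(pd.Series(lasso.coef_, index=X.columns))  # named coefficients, as above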
As for p-values from a sklearn linear regression, there is no ready-made method; you can check out these answers, maybe one of them is what you are looking for.
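If what you mainly want are p-values and R-squared with readable variable names, one option outside sklearn is the statsmodels formula interface already imported above; a sketch:
# Formula mirrors the one from the question: mpg ~ disp + qsec + C(cyl)
ols = smf.ols("mpg ~ disp + qsec + C(cyl)", data=mtcars).fit()
print(ols.params)     # named coefficients
print(ols.pvalues)    # p-value for each term
print(ols.rsquared)   # R-squared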
Here are two ways, both unsatisfactory, especially because the variable labels seem to be gone once the regression gets going:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
import numpy as np
from sklearn.linear_model import LinearRegression
Single-variable regression mpg (d.v.) ~ hp (i.v.):
lm = LinearRegression()
mat = np.matrix(df)
lmFit = lm.fit(mat[:,3], mat[:,0])
print(lmFit.coef_)
print(lmFit.intercept_)
For multiple regression drat ~ wt + cyl + carb:
lmm = LinearRegression()
wt = np.array(df['wt'])
cyl = np.array(df['cyl'])
carb = np.array(df['carb'])
stack = np.column_stack((cyl,wt,carb))
stackmat = np.matrix(stack)
lmFit2 = lmm.fit(stackmat,mat[:,4])
print(lmFit2.coef_)
print(lmFit2.intercept_)
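A possible tweak to keep the labels (a sketch): fit on the DataFrame columns directly instead of converting to a matrix, then index the coefficients by name:
features = ['cyl', 'wt', 'carb']
lmFit3 = LinearRegression().fit(df[features], df['drat'])  # a DataFrame keeps the column names
print(pd.Series(lmFit3.coef_, index=features))
print(lmFit3.intercept_)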

Error using the 'predict' function for a logistic regression

I am trying to fit a multinomial logistic regression and then predict the outcome for samples.
### RZS_TC is my dataframe
RZS_TC.loc[RZS_TC['Mean_Treecover'] <= 50, 'Mean_Treecover' ] = 0
RZS_TC.loc[RZS_TC['Mean_Treecover'] > 50, 'Mean_Treecover' ] = 1
RZS_TC[['MAP']+['Sr']+['delTC']+['Mean_Treecover']].head()
[Output]:
MAP Sr delTC Mean_Treecover
302993741 2159.297363 452.975647 2.666672 1.0
217364332 3242.351807 65.615341 8.000000 1.0
390863334 1617.215454 493.124054 5.666666 0.0
446559668 1095.183105 498.373383 -8.000000 0.0
246078364 2804.615234 98.981110 -4.000000 1.0
1000000 rows × 7 columns
#Fitting a logistic regression
from statsmodels.formula.api import mnlogit
model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()
print(model.summary2())
[Output]:
Results: MNLogit
====================================================================
Model: MNLogit Pseudo R-squared: 0.364
Dependent Variable: Mean_Treecover AIC: 831092.4595
Date: 2021-04-02 13:51 BIC: 831139.7215
No. Observations: 1000000 Log-Likelihood: -4.1554e+05
Df Model: 3 LL-Null: -6.5347e+05
Df Residuals: 999996 LLR p-value: 0.0000
Converged: 1.0000 Scale: 1.0000
No. Iterations: 7.0000
--------------------------------------------------------------------
Mean_Treecover = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
--------------------------------------------------------------------
Intercept -5.2200 0.0119 -438.4468 0.0000 -5.2434 -5.1967
MAP 0.0023 0.0000 491.0859 0.0000 0.0023 0.0023
Sr 0.0016 0.0000 90.6805 0.0000 0.0015 0.0016
delTC -0.0093 0.0002 -39.9022 0.0000 -0.0098 -0.0089
However, whenever I try to predict using the model.predict() function, I get the following error.
prediction = model.predict(np.array(RZS_TC[['MAP']+['Sr']+['delTC']]))
[Output]: ERROR! Session/line number was not unique in database. History logging moved to new session 2627
Does anyone know how to troubleshoot this? Is there something that I might be doing wrong?
The model adds an intercept, so you need to include that. Using some example data:
from statsmodels.formula.api import mnlogit
import pandas as pd
import numpy as np
RZS_TC = pd.DataFrame(np.random.uniform(0,1,(20,4)),
columns=['MAP','Sr','delTC','Mean_Treecover'])
RZS_TC['Mean_Treecover'] = round(RZS_TC['Mean_Treecover'])
model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()
You can see the dimensions of your fitted data:
model.model.exog[:5,]
Out[16]:
array([[1. , 0.33914763, 0.79358056, 0.3103758 ],
[1. , 0.45915785, 0.94991271, 0.27203524],
[1. , 0.55527662, 0.15122108, 0.80675951],
[1. , 0.18493681, 0.89854583, 0.66760684],
[1. , 0.38300074, 0.6945397 , 0.28128137]])
Which is the same as if you add a constant:
import statsmodels.api as sm
sm.add_constant(RZS_TC[['MAP','Sr','delTC']])
const MAP Sr delTC
0 1.0 0.339148 0.793581 0.310376
1 1.0 0.459158 0.949913 0.272035
2 1.0 0.555277 0.151221 0.806760
3 1.0 0.184937 0.898546 0.667607
If you have a DataFrame with the same column names, it will just be:
prediction = model.predict(RZS_TC[['MAP','Sr','delTC']])
Or if you just need the fitted values, do:
model.fittedvalues
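As a small usage sketch on the example data above (an addition, not part of the original answer): the multinomial predict returns one column of probabilities per class, so the predicted class is the column with the highest probability:
import numpy as np
probs = model.predict(RZS_TC[['MAP', 'Sr', 'delTC']])  # class probabilities
predicted_class = np.asarray(probs).argmax(axis=1)     # most likely class per row
print(predicted_class[:5])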

SVC sigmoid kernel is not working properly

I am testing an SVM with a sigmoid kernel on the iris data using sklearn and SVC. Its performance is extremely poor, with an accuracy of 25%. I'm using exactly the same code and normalizing the features as in https://towardsdatascience.com/a-guide-to-svm-parameter-tuning-8bfe6b8a452c (sigmoid section), which should increase performance substantially. However, I am not able to reproduce those results, and the accuracy only increases to 33%.
Using other kernels (e.g linear kernel) produces good results (accuracy of 82 %).
Could there be an issue within the SVC(kernel = 'sigmoid') function?
Python code to reproduce problem:
##sigmoid iris example
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import classification_report
iris = datasets.load_iris()
sepal_length = iris.data[:,0]
sepal_width = iris.data[:,1]
#assessing performance of sigmoid SVM
clf = SVC(kernel='sigmoid')
clf.fit(np.c_[sepal_length, sepal_width], iris.target)
pr=clf.predict(np.c_[sepal_length, sepal_width])
pd.DataFrame(classification_report(iris.target, pr, output_dict=True))
from sklearn.metrics.pairwise import sigmoid_kernel
sigmoid_kernel(np.c_[sepal_length, sepal_width])
#normalizing features
from sklearn.preprocessing import normalize
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
sigmoid_kernel(np.c_[sepal_length_norm, sepal_width_norm])
#assessing perfomance of sigmoid SVM with normalized features
pr_norm=clf.predict(np.c_[sepal_length_norm, sepal_width_norm])
pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True))
I see what's happening. In sklearn releases pre 0.22 the default gamma parameter passed to the SVC was "auto", and in subsequent releases this was changed to "scale". The author of the article seems to have been using a previous version and therefore implicitly passing gamma="auto" (he mentions that the "current default setting for gamma is ‘auto’"). So if you're on the latest release of sklearn (0.23.2), you'll want to explicitly pass gamma='auto' when instantiating the SVC:
clf = SVC(kernel='sigmoid',gamma='auto')
#normalizing features
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
So now when you print the classification report:
pr_norm=clf.predict(np.c_[sepal_length_norm, sepal_width_norm])
print(pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True)))
# 0 1 2 accuracy macro avg weighted avg
# precision 0.907407 0.650000 0.750000 0.766667 0.769136 0.769136
# recall 0.980000 0.780000 0.540000 0.766667 0.766667 0.766667
# f1-score 0.942308 0.709091 0.627907 0.766667 0.759769 0.759769
# support 50.000000 50.000000 50.000000 0.766667 150.000000 150.000000
What would explain the 33% accuracy you were seeing is the fact that the default gamma is "scale", which then places all predictions in a single region of the decision plane, and as the targets are split into thirds you get a maximum accuracy of 33.3%:
clf = SVC(kernel='sigmoid')
#normalizing features
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
X = np.c_[sepal_length_norm, sepal_width_norm]
pr_norm=clf.predict(np.c_[sepal_length_norm, sepal_width_norm])
print(pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True)))
# 0 1 2 accuracy macro avg weighted avg
# precision 0.0 0.0 0.333333 0.333333 0.111111 0.111111
# recall 0.0 0.0 1.000000 0.333333 0.333333 0.333333
# f1-score 0.0 0.0 0.500000 0.333333 0.166667 0.166667
# support 50.0 50.0 50.000000 0.333333 150.000000 150.000000
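Rather than hard-coding gamma, a hedged alternative (not from the article) is to tune gamma and C with a small grid search on the same two normalized features:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {'gamma': ['scale', 'auto', 0.01, 0.1, 1.0],
              'C': [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel='sigmoid'), param_grid, cv=5)
search.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
print(search.best_params_, search.best_score_)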

How to normalize the data in a dataframe in the range [0,1]?

I'm trying to implement a paper that uses the Pima Indians Diabetes dataset. This is the dataset after imputing missing values:
Preg Glucose BP SkinThickness Insulin BMI Pedigree Age Outcome
0 1 148.0 72.000000 35.00000 155.548223 33.600000 0.627 50 1
1 1 85.0 66.000000 29.00000 155.548223 26.600000 0.351 31 0
2 1 183.0 64.000000 29.15342 155.548223 23.300000 0.672 32 1
3 1 89.0 66.000000 23.00000 94.000000 28.100000 0.167 21 0
4 0 137.0 40.000000 35.00000 168.000000 43.100000 2.288 33 1
5 1 116.0 74.000000 29.15342 155.548223 25.600000 0.201 30 0
The description:
df.describe()
Preg Glucose BP SkinThickness Insulin BMI Pedigree Age
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 0.855469 121.686763 72.405184 29.153420 155.548223 32.457464 0.471876 33.240885
std 0.351857 30.435949 12.096346 8.790942 85.021108 6.875151 0.331329 11.760232
min 0.000000 44.000000 24.000000 7.000000 14.000000 18.200000 0.078000 21.000000
25% 1.000000 99.750000 64.000000 25.000000 121.500000 27.500000 0.243750 24.000000
50% 1.000000 117.000000 72.202592 29.153420 155.548223 32.400000 0.372500 29.000000
75% 1.000000 140.250000 80.000000 32.000000 155.548223 36.600000 0.626250 41.000000
max 1.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000
The description of normalization from the paper is as follows:
As part of our data preprocessing, the original data values are scaled so as to fall within a small specified range of [0,1] values by performing normalization of the dataset. This will improve speed and reduce runtime complexity. Using the Z-Score we normalize our value set V to obtain a new set of normalized values V’ with the equation below:
V' = (V - Y) / Z
where V' is the new normalized value, V is the previous value, Y is the mean, and Z is the standard deviation.
import scipy.stats
z = scipy.stats.zscore(df)
But when I run the code above, I get negative values and values greater than one, i.e., not in the range [0,1].
There are several points to note here.
Firstly, z-score normalisation will not result in features in the range [0, 1] unless the input data has very specific characteristics.
Secondly, as others have noted, two of the most common ways of normalising data are standardisation and min-max scaling.
Set up data
import string
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv')
# For the purposes of this exercise, we'll just use the alphabet as column names
df.columns = list(string.ascii_lowercase)[:len(df.columns)]
print(df.head())
a b c d e f g h i
0 1 85 66 29 0 26.6 0.351 31 0
1 8 183 64 0 0 23.3 0.672 32 1
2 1 89 66 23 94 28.1 0.167 21 0
3 0 137 40 35 168 43.1 2.288 33 1
4 5 116 74 0 0 25.6 0.201 30 0
Standardisation
# print the minimum and maximum values in the entire dataset with a little formatting
$ print(f"Min: {standardised.min().min():4.3f} Max: {standardised.max().max():4.3f}")
Min: -4.055 Max: 845.307
As you can see, the values are far from being in [0, 1]. Note the range of the resulting data from z-score normalisation will vary depending on the distribution of the input data.
Min-max scaling
min_max = (df - df.values.min()) / (df.values.max() - df.values.min())
# print the minimum and maximum values in the entire dataset with a little formatting
$ print(f"Min: {min_max.min().min():4.3f} Max: {min_max.max().max():4.3f}")
Min: 0.000 Max: 1.000
Here we do indeed get values in [0, 1].
Discussion
These and a number of other scalers exist in the sklearn preprocessing module. I recommend reading the sklearn documentation and using these instead of doing it manually, for various reasons (a short sketch follows this list):
There are fewer chances of making a mistake as you have to do less typing.
sklearn will be at least as computationally efficient and often more so.
You should use the same scaling parameters from training on the test data to avoid leakage of test data information. (In most real world uses, this is unlikely to be significant but it is good practice.) By using sklearn you don't need to store the min/max/mean/SD etc. from scaling training data to reuse subsequently on test data. Instead, you can just use scaler.fit_transform(X_train) and scaler.transform(X_test).
If you want to reverse the scaling later on, you can use scaler.inverse_transform(data).
I'm sure there are other reasons, but these are the main ones that come to mind.
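For example, a minimal sketch with MinMaxScaler (note that it scales each column to [0, 1] independently, whereas the formula above uses the global min and max of the whole frame):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.min().min(), scaled.max().max())  # 0.0 1.0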
Your standardization formula is not meant to put values in the range [0, 1].
If you want to normalize data into that range, you can use the following formula:
z = (actual_value - min_value_in_database)/(max_value_in_database - min_value_in_database)
And you're not obliged to do it manually; just use the sklearn library, where you'll find different standardization and normalization methods in the preprocessing module.
Assuming your original dataframe is df and it has no invalid float values, this should work
df2 = (df - df.values.min()) / (df.values.max()-df.values.min())
