Very Large Values Predicted for Linear Regression

Very Large Values Predicted for Linear Regression - python

I'm trying to run a linear regression in python to determine house prices given many features. Some of these are numeric and some are non-numeric. I'm attempting to do one hot encoding for the non-numeric columns and attach the new, numeric, columns to the old dataframe and drop the non-numeric columns. This is done on both the training data and test data.
I then took the intersection of the two columns features (since I had some encodings that were only located in the testing data). Afterwards, it goes into a linear regression. The code is the following:
non_numeric = list(set(list(train)) - set(list(train._get_numeric_data())))
train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)
train.drop(non_numeric, axis=1, inplace=True)
train = train._get_numeric_data()
train.fillna(0, inplace = True)
non_numeric = list(set(list(test)) - set(list(test._get_numeric_data())))
test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)
test.drop(non_numeric, axis=1, inplace=True)
test = test._get_numeric_data()
test.fillna(0, inplace = True)
feature_columns = list(set(train) & set(test))
#feature_columns.remove('SalePrice')
X = train[feature_columns]
y = train['SalePrice']
lm = LinearRegression(normalize = False)
lm.fit(X, y)
import numpy
predictions = numpy.absolute(lm.predict(test).round(decimals = 2))
The issue that I'm having is that I get these absurdly high Sale Prices as output, somewhere in the hundreds of millions of dollars. Before I tried one hot encoding I got reasonable numbers in the hundreds of thousands of dollars. I'm having trouble figuring out what changed.
Also, if there is a better way to do this I'd be eager to hear about it.

You seem to encounter collinearity due to introduction of categorical variables in feature column, since sum of the feature columns of "one-hot" encoded variables is always 1.
If you have one categorical variable , you need to set "fit_intercept=False" in your linear Regression (or drop one of the feature column of one-hot coded variable)
If you have more than one categorical variables, you need to drop one feature column for each of the category to break collinearity.
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
In [72]:
df = pd.read_csv('/home/siva/anaconda3/data.csv')
df
Out[72]:
C1 C2 C3 y
0 1 0 0 12.4
1 1 0 0 11.9
2 0 1 0 8.3
3 0 1 0 3.1
4 0 0 1 5.4
5 0 0 1 6.2
In [73]:
y
X = df.iloc[:,0:3]
y = df.iloc[:,-1]
In [74]:
reg = LinearRegression()
reg.fit(X,y)
Out[74]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [75]:
_
reg.coef_,reg.intercept_
Out[75]:
(array([ 4.26666667, -2.18333333, -2.08333333]), 7.8833333333333346)
we find that co_efficients for C1, C2 , C3 do not make sense according to given X.
In [76]:
reg1 = LinearRegression(fit_intercept=False)
reg1.fit(X,y)
Out[76]:
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)
In [77]:
reg1.coef_
Out[77]:
array([ 12.15, 5.7 , 5.8 ])
we find that co_efficients makes much more sense when the fit_intercept was set to False
A detailed explanation for a similar question at below.
https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn

I posted this at the stats site and Ami Tavory pointed out that the get_dummies should be run on the merged train and test dataframe to ensure that the same dummy variables were set up in both dataframes. This solved the issue.

Related

How to run model on new data that requires pd.get_dummies

I have a model that runs the following:
import pandas as pd
import numpy as np
# initialize list of lists
data = [['tom', 10,1,'a'], ['tom', 15,5,'a'], ['tom', 14,1,'a'], ['tom', 15,4,'b'], ['tom', 18,1,'b'], ['tom', 15,6,'a'], ['tom', 17,3,'a']
, ['tom', 14,7,'b'], ['tom',16 ,6,'a'], ['tom', 22,2,'a'],['matt', 10,1,'c'], ['matt', 15,5,'b'], ['matt', 14,1,'b'], ['matt', 15,4,'a'], ['matt', 18,1,'a'], ['matt', 15,6,'a'], ['matt', 17,3,'a']
, ['matt', 14,7,'c'], ['matt',16 ,6,'b'], ['matt', 10,2,'b']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Attempts','Score','Category'])
print(df.head(2))
Name Attempts Score Category
0 tom 10 1 a
1 tom 15 5 a
Then I have created a dummy df to use in the model using the following code:
from sklearn.linear_model import LogisticRegression
df_dum = pd.get_dummies(df)
print(df_dum.head(2))
Attempts Score Name_matt Name_tom Category_a Category_b Category_c
0 10 1 0 1 1 0 0
1 15 5 0 1 1 0 0
Then I have created the following model:
#Model
X = df_dum.drop(('Score'),axis=1)
y = df_dum['Score'].values
#Training Size
train_size = int(X.shape[0]*.7)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]
#Fit Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)
#Send predictions back to dataframe
Z = model.predict(X_test)
zz = model.predict_proba(X_test)
df.loc[train_size:,'predictions']=Z
dfpredictions = df.dropna(subset=['predictions'])
print(dfpredictions)
Name Attempts Score Category predictions
14 matt 18 1 a 1.0
15 matt 15 6 a 1.0
16 matt 17 3 a 1.0
17 matt 14 7 c 1.0
18 matt 16 6 b 1.0
19 matt 10 2 b 1.0
Now I have new data which i would like to predict:
newdata = [['tom', 10,'a'], ['tom', 15,'a'], ['tom', 14,'a']]
newdf = pd.DataFrame(newdata, columns = ['Name', 'Attempts','Category'])
print(newdf)
Name Attempts Category
0 tom 10 a
1 tom 15 a
2 tom 14 a
Then create dummies and run prediction
newpredict = pd.get_dummies(newdf)
predict = model.predict(newpredict)
Output:
ValueError: X has 3 features per sample; expecting 6
Which makes sense because there are no categories b and c and no name called matt.
My question is how is the best way to set this model up given my new data wont always have the full set of columns used in the original data. Each day i have new data so I'm not quite sure of the most efficient and error free way.
This is an example data - my dataset has 2000 columns when running pd.get_dummies. Thanks very much!

Let me explain Nicolas and BlueSkyz's recommendation a bit more in detail.
pd.get_dummies is useful when you are sure that there will not be any new categories for a specific categorical variable in production/new data set, e.g. Gender, Products, etc. based on your Company or Database's internal data classification/consistency rules.
However, for the majority of machine learning tasks where you can expect to have new categories in the future which were not used in model training, sklearn.OneHotEncoder should be the standard choice. handle_unknown parameter of sklearn.OneHotEncoder can be set to 'ignore' to do just that: ignore new categories when applying the encoder in future. From the documentation:
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None
The full flow based on LabelEncoding and OneHotEncoding for your example is as below:
# Create a categorical boolean mask
categorical_feature_mask = df.dtypes == object
# Filter out the categorical columns into a list for easy reference later on in case you have more than a couple categorical columns
categorical_cols = df.columns[categorical_feature_mask].tolist()
# Instantiate the OneHotEncoder Object
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse = False)
# Apply ohe on data
ohe.fit(df[categorical_cols])
cat_ohe = ohe.transform(df[categorical_cols])
#Create a Pandas DataFrame of the hot encoded column
ohe_df = pd.DataFrame(cat_ohe, columns = ohe.get_feature_names(input_features = categorical_cols))
#concat with original data and drop original columns
df_ohe = pd.concat([df, ohe_df], axis=1).drop(columns = categorical_cols, axis=1)
# The following code is for your newdf after training and testing on original df
# Apply ohe on newdf
cat_ohe_new = ohe.transform(newdf[categorical_cols])
#Create a Pandas DataFrame of the hot encoded column
ohe_df_new = pd.DataFrame(cat_ohe_new, columns = ohe.get_feature_names(input_features = categorical_cols))
#concat with original data and drop original columns
df_ohe_new = pd.concat([newdf, ohe_df_new], axis=1).drop(columns = categorical_cols, axis=1)
# predict on df_ohe_new
predict = model.predict(df_ohe_new)
Output (that you can assign back to newdf):
array([1, 1, 1])
However, if you really want to use pd.get_dummies only, then the following can work as well:
newpredict = newpredict.reindex(labels = df_dum.columns, axis = 1, fill_value = 0).drop(columns = ['Score'])
predict = model.predict(newpredict)
The above code snippet will make sure that you have the same columns in your new dummies df (newpredict) as the original df_dum (with 0 values) and drop the 'Score' column. Output here is same as above. This code will ensure that any categorical values present in the new data set but now in the original trained data will be removed while keeping the order of the columns same as that in the original df.
EDIT:
One thing I forgot to add is that pd.get_dummies is usually much faster to execute than sklearn.OneHotEncoder

How to fill missing value using pre-trained model?

I have a time series index with few variables and humidity reading. I have already trained an ML model to predict Humidity values based on X, Y and Z. Now, when I load the saved model using pickle, I would like to fill the Humidity missing values using X, Y and Z. However, it should consider the fact that X, Y and Z themselves shouldnt be missing.
Time X Y Z Humidity
1/2/2017 13:00 31 22 21 48
1/2/2017 14:00 NaN 12 NaN NaN
1/2/2017 15:00 25 55 33 NaN
In this example, the last row of humidity will be filled using the model. Whereas the 2nd row should not be predicted by the model since X and Z is also missing.
I have tried this so far:
with open('model_pickle','rb') as f:
mp = pickle.load(f)
for i, value in enumerate(df['Humidity'].values):
if np.isnan(value):
df['Humidity'][i] = mp.predict(df['X'][i],df['Y'][i],df['Z'][i])
This gave me an error 'predict() takes from 2 to 5 positional arguments but 6 were given' and also I did not consider X, Y and Z column values. Below is the code I used to train the model and save it to a file:
df = df.dropna()
dfTest = df.loc['2017-01-01':'2019-02-28']
dfTrain = df.loc['2019-03-01':'2019-03-18']
features = [ 'X', 'Y', 'Z']
train_X = dfTrain[features]
train_y = dfTrain.Humidity
test_X = dfTest[features]
test_y = dfTest.Humidity
model = xgb.XGBRegressor(max_depth=10,learning_rate=0.07)
model.fit(train_X,train_y)
predXGB = model.predict(test_X)
mae = mean_absolute_error(predXGB,test_y)
import pickle
with open('model_pickle','wb') as f:
pickle.dump(model,f)
I had no errors during training and saving the model.

For prediction, since you want to make sure you have all the X, Y, Z values, you can do,
df = df.dropna(subset = ["X", "Y", "Z"])
And now you can predict the values for the remaining valid examples as,
# where features = ["X", "Y", "Z"]
df['Humidity'] = mp.predict(df[features])
mp.predict will return prediction for all the rows, so there is no need to predict iteratively.
Edit:.
For inference, say you have a dataframe df, you can do,
# Get rows with missing Humidity where it can be predicted.
df_inference = df[df.Humidity.isnull()]
# remaining rows
df = df[df.Humidity.notnull()]
# This might still have rows with missing features.
# Since you cannot infer with missing features, Remove them too and add them to remaining rows
df = df.append(df_inference[df_inference[features].isnull().any(1)])
# and remove them from df_inference
df_inference = df_inference[~df_inference[features].isnull().any(1)]
#Now you can infer on these rows
df_inference['Humidity'] = mp.predict(df_inference[features])
# Now you can merge this back to the remaining rows to get the original number of rows and sort the rows by index
df = df.append(df_inference)
df.sort_index()

Can you report the error?
Anyway, if you have missing values you have different options to deal with them. You can either discard the datapoint entirely or try and infer the missing parts with a method of choice: mean, interpolation, etc.
Pandas documentation has a nice guide on how to deal with them:
https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

Try
df['Humidity'][i] = mp.predict(df[['X', 'Y', 'Z']][i])
This way you the data is passed as a single argument, as function expects. The way you wrote it, you split your data to 3 arguments.

Shuffle rows of a dataframe in pandas python brings about different regression results?

I am trying to randomise my rows in the dataframe - data before applying linear regression, but i realised the regression results differs after the rows are randomised which shouldn't be the case? Codes which i have tried using:
Without row randomisation:
data
X = data[feature_col]
y = data['median_price']
lr = LinearRegression()
lr.fit(X, y)
With row randomisation:
Method 1:
data = data.sample(frac=1)
Method 2:
data = data.sample(frac=1, axis=1)
Method 3:
from sklearn.utils import shuffle
data = shuffle(data)
Method 4:
data = data.sample(frac=1, axis=1).reset_index(drop=True)
Out of the 4 row randomisation methods i have tried, only Method 4 gives the same results as the one where no randomisation is applied. I thought row randomisation does not affects the regression results in any case?

Methods 2 and 4 are identical?
Regression results should not differ if you are applying the same type of regression to the same data (randomized or not). You should be using axis = 0 to randomize rows of dataframes, axis = 1 randomizes the columns.

How can I increase the accuracy of my Linear Regression model?(machine learning with python)

I have a machine learning project with python by using scikit-learn library. I have two seperated datasets for training and testing and I try to doing linear regression. I use this codeblock shown below:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import LinearRegression
df =pd.read_csv("TrainingData.csv")
df2=pd.read_csv("TestingData.csv")
df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]
df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]
X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']
X_test=df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test=df2['Effort']
lr = LinearRegression().fit(X_train, Y_train)
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.7f}".format(lr.score(X_test, Y_test)))
My results are:
lr.coef_: [ 2.32088001e+00 2.07441948e-12 -4.73338567e-05 6.79658129e+02]
lr.intercept_: 2166.186033098048
Training set score: 0.63
Test set score: 0.5732999
What do you suggest me? How can I increase my accuracy? (adding code,parameter etc.)
My datasets is here: https://yadi.sk/d/JJmhzfj-3QCV4V

I'll elaborate a bit on #GeorgiKaradjov's answer with some examples. Your question is very broad, and there's multiple ways to gain improvements. In the end, having domain knowledge (context) will give you the best possible chance of getting improvements.
Normalise your data, i.e., shift it to have a mean of zero, and a spread of 1 standard deviation
Turn categorical data into variables via, e.g., OneHotEncoding
Do feature engineering:
Are my features collinear?
Do any of my features have cross terms/higher-order terms?
Regularisation of the features to reduce possible overfitting
Look at alternative models given the underlying features and the aim of the project
1) Normalise data
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
afp = np.append(X_train['AFP'].values, X_test['AFP'].values)
std.fit(afp)
X_train[['AFP']] = std.transform(X_train['AFP'])
X_test[['AFP']] = std.transform(X_test['AFP'])
Gives
0 0.752395
1 0.008489
2 -0.381637
3 -0.020588
4 0.171446
Name: AFP, dtype: float64
2) Categorical Feature Encoding
def feature_engineering(df):
dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
df[dev_plat.columns] = dev_plat
df = df.drop('Development_platform', axis=1)
lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
df[lang_type.columns] = lang_type
df = df.drop('Language_Type', axis=1)
resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
df[resource_level.columns] = resource_level
df = df.drop('Resource_Level', axis=1)
return df
X_train = feature_engineering(X_train)
X_train.head(5)
Gives
AFP dev_plat_077070 dev_plat_077082 dev_plat_077117108116105 dev_plat_080067 lang_type_051071076 lang_type_052071076 lang_type_065112071 resource_level_1 resource_level_2 resource_level_4
0 0.752395 1 0 0 0 1 0 0 1 0 0
1 0.008489 0 0 1 0 0 1 0 1 0 0
2 -0.381637 0 0 1 0 0 1 0 1 0 0
3 -0.020588 0 0 1 0 1 0 0 1 0 0
3) Feature Engineering; collinearity
import seaborn as sns
corr = X_train.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=np.bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True)
You want the red line for y=x because values should be correlated with themselves. However, any red or blue columns show there's a strong correlation/anti-correlation that requires more investigation. For example, Resource=1, Resource=4, might be highly correlated in the sense if people have 1 there is a less chance to have 4, etc. Regression assumes that the parameters used are independent from one another.
3) Feature engineering; higher-order terms
Maybe your model is too simple, you could consider adding higher order and cross terms:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2, interaction_only=True)
output_nparray = poly.fit_transform(df)
target_feature_names = ['x'.join(['{}^{}'.format(pair[0],pair[1]) for pair in tuple if pair[1]!=0]) for tuple in [zip(df.columns, p) for p in poly.powers_]]
output_df = pd.DataFrame(output_nparray, columns=target_feature_names)
I had a quick try at this, I don't think the higher order terms help out much. It's also possible your data is non-linear, a quick logarithm or the Y-output gives a worse fit, suggesting it's linear. You could also look at the actuals, but I was too lazy....
4) Regularisation
Try using sklearn's RidgeRegressor and playing with alpha:
lr = RidgeCV(alphas=np.arange(70,100,0.1), fit_intercept=True)
5) Alternative models
Sometimes linear regression is not always suited. For example, Random Forest Regressors can perform very well, and are usually insensitive to data being standardised, and being categorical/continuous. Other models include XGBoost, and Lasso (Linear regression with L1 regularisation).
lr = RandomForestRegressor(n_estimators=100)
Putting it all together
I got carried away and started looking at your problem, but couldn't improve it too much without knowing all the context of the features:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import RidgeCV, LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import GridSearchCV
def feature_engineering(df):
dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
df[dev_plat.columns] = dev_plat
df = df.drop('Development_platform', axis=1)
lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
df[lang_type.columns] = lang_type
df = df.drop('Language_Type', axis=1)
resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
df[resource_level.columns] = resource_level
df = df.drop('Resource_Level', axis=1)
return df
df = pd.read_csv("TrainingData.csv")
df2 = pd.read_csv("TestingData.csv")
df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]
df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]
X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']
X_test = df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test = df2['Effort']
std = StandardScaler()
afp = np.append(X_train['AFP'].values, X_test['AFP'].values)
std.fit(afp)
X_train[['AFP']] = std.transform(X_train['AFP'])
X_test[['AFP']] = std.transform(X_test['AFP'])
X_train = feature_engineering(X_train)
X_test = feature_engineering(X_test)
lr = RandomForestRegressor(n_estimators=50)
lr.fit(X_train, Y_train)
print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, Y_test)))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(Y_test, y_pred, fmt='o')
ax.errorbar([1, Y_test.max()], [1, Y_test.max()])
Resulting in:
Training set score: 0.90
Test set score: 0.61
You can look at the importance of the variables (higher value, more important).
Importance
AFP 0.882295
dev_plat_077070 0.020817
dev_plat_077082 0.001162
dev_plat_077117108116105 0.016334
dev_plat_080067 0.004077
lang_type_051071076 0.012458
lang_type_052071076 0.021195
lang_type_065112071 0.001118
resource_level_1 0.012644
resource_level_2 0.006673
resource_level_4 0.021227
You could start looking at the hyperparameters to get improvements on this also: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV

here are some tips :
Data preparation(exploration) is one of the most important steps in a machine learning project, you need to start with it.
did you clean your data? if not start with that step!
As said in this tutorial :
There are no shortcuts for data exploration. If you are in a state of
mind, that machine learning can sail you away from every data storm,
trust me, it won’t.After some point of time, you’ll realize that you
are struggling at improving model’s accuracy. In such situation, data
exploration techniques will come to your rescue.
here is some step for data exploration :
missing values treatment,
outlier removal
feature engineering
After that try to perform univariate and bivariate analysis with your features.
use one hot encoding to transform you categorical features into numerics ones.
this is what you need according to what we have talked about in the comments.
here is a tutorial on how to deal with categorical variables, one-hot encoding from sklearn learn is the best technic for your problem.
Using ASCII representation is not the best practice for handling categorical features
You can find more about data exploration in here
follow the suggestions I gave to you and thank me later.

normalize your data
Depending on the type of input features you can extract different features from them (feature combinations are possible too)
If your data is not linearly separable, you won't be able to predict it well. You may need to use another model - Logistic regression, SVR, NN / whatever

Normalize columns of a dataframe

I have a dataframe in pandas where each column has different value range. For example:
df:
A B C
1000 10 0.5
765 5 0.35
800 7 0.09
Any idea how I can normalize the columns of this dataframe where each value is between 0 and 1?
My desired output is:
A B C
1 1 1
0.765 0.5 0.7
0.8 0.7 0.18(which is 0.09/0.5)

one easy way by using Pandas: (here I want to use mean normalization)
normalized_df=(df-df.mean())/df.std()
to use min-max normalization:
normalized_df=(df-df.min())/(df.max()-df.min())
Edit: To address some concerns, need to say that Pandas automatically applies colomn-wise function in the code above.

You can use the package sklearn and its associated preprocessing utilities to normalize the data.
import pandas as pd
from sklearn import preprocessing
x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.

Detailed Example of Normalization Methods
Pandas normalization (unbiased)
Sklearn normalization (biased)
Does biased-vs-unbiased affect Machine Learning?
Mix-max scaling
References:
Wikipedia: Unbiased Estimation of Standard Deviation
Example Data
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
Normalization using pandas (Gives unbiased estimates)
When normalizing we simply subtract the mean and divide by standard deviation.
df.iloc[:,0:-1] = df.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(df)
A B C
0 -1.0 -1.0 a
1 0.0 0.0 b
2 1.0 1.0 c
Normalization using sklearn (Gives biased estimates, different from pandas)
If you do the same thing with sklearn you will get DIFFERENT output!
import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
df.iloc[:,0:-1] = scaler.fit_transform(df.iloc[:,0:-1].to_numpy())
print(df)
A B C
0 -1.224745 -1.224745 a
1 0.000000 0.000000 b
2 1.224745 1.224745 c
Does Biased estimates of sklearn makes Machine Learning Less Powerful?
NO.
The official documentation of sklearn.preprocessing.scale states that using biased estimator is UNLIKELY to affect the performance of machine learning algorithms and we can safely use them.
From official documentation:
We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
What about MinMax Scaling?
There is no Standard Deviation calculation in MinMax scaling. So the result is same in both pandas and scikit-learn.
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
})
(df - df.min()) / (df.max() - df.min())
A B
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
# Using sklearn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
arr_scaled = scaler.fit_transform(df)
print(arr_scaled)
[[0. 0. ]
[0.5 0.5]
[1. 1. ]]
df_scaled = pd.DataFrame(arr_scaled, columns=df.columns,index=df.index)
print(df_scaled)
A B
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0

Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range
You can do the following:
def normalize(df):
result = df.copy()
for feature_name in df.columns:
max_value = df[feature_name].max()
min_value = df[feature_name].min()
result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
return result
You don't need to stay worrying about whether your values are negative or positive. And the values should be nicely spread out between 0 and 1.

Your problem is actually a simple transform acting on the columns:
def f(s):
return s/s.max()
frame.apply(f, axis=0)
Or even more terse:
frame.apply(lambda x: x/x.max(), axis=0)

If you like using the sklearn package, you can keep the column and index names by using pandas loc like so:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(df)
df.loc[:,:] = scaled_values

Take care with this answer, as it ONLY works for data that ranges [0, n]. This does not work for any range of data.
Simple is Beautiful:
df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()

You can create a list of columns that you want to normalize
column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
x = df[column_names_to_normalize].values
x_scaled = min_max_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = df.index)
df[column_names_to_normalize] = df_temp
Your Pandas Dataframe is now normalized only at the columns you want
However, if you want the opposite, select a list of columns that you DON'T want to normalize, you can simply create a list of all columns and remove that non desired ones
column_names_to_not_normalize = ['B', 'J', 'K']
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize ]

I think that a better way to do that in pandas is just
df = df/df.max().astype(np.float64)
Edit If in your data frame negative numbers are present you should use instead
df = df/df.loc[df.abs().idxmax()].astype(np.float64)

The solution given by Sandman and Praveen is very well. The only problem with that if you have categorical variables in other columns of your data frame this method will need some adjustments.
My solution to this type of issue is following:
from sklearn import preprocesing
x = pd.concat([df.Numerical1, df.Numerical2,df.Numerical3])
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
x_new = pd.DataFrame(x_scaled)
df = pd.concat([df.Categoricals,x_new])

You might want to have some of columns being normalized and the others be unchanged like some of regression tasks which data labels or categorical columns are unchanged So I suggest you this pythonic way (It's a combination of #shg and #Cina answers ):
features_to_normalize = ['A', 'B', 'C']
# could be ['A','B']
df[features_to_normalize] = df[features_to_normalize].apply(lambda x:(x-x.min()) / (x.max()-x.min()))

Normalize
You can use minmax_scale to transform each column to a scale from 0-1.
from sklearn.preprocessing import minmax_scale
df[:] = minmax_scale(df)
Standardize
You can use scale to center each column to the mean and scale to unit variance.
from sklearn.preprocessing import scale
df[:] = scale(df)
Column Subsets
Normalize single column
from sklearn.preprocessing import minmax_scale
df['a'] = minmax_scale(df['a'])
Normalize only numerical columns
import numpy as np
from sklearn.preprocessing import minmax_scale
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])
Full Example
# Prep
import pandas as pd
import numpy as np
from sklearn.preprocessing import minmax_scale
# Sample data
df = pd.DataFrame({'a':[0,1,2], 'b':[-10,-30,-50], 'c':['x', 'y', 'z']})
# MinMax normalize all numeric columns
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])
# Result
print(df)
# a b c
# 0 0.0 1.0 x
# 2 0.5 0.5 y
# 3 1.0 0.0 z
Notes:
In all examples scale can be used instead of minmax_scale. Keeps index, column names or non-numerical variables unchanged. Function is applied for each column.
Caution:
For machine learning, use minmax_scale or scale after train_test_split to avoid data leakage.
Info
More info on standardization and normalization:
https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
https://en.wikipedia.org/wiki/Normalization_(statistics)
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

It is only simple mathematics. The answer should as simple as below.
normed_df = (df - df.min()) / (df.max() - df.min())

df_normalized = df / df.max(axis=0)

This is how you do it column-wise using list comprehension:
[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]

You can simply use the pandas.DataFrame.transform1 function in this way:
df.transform(lambda x: x/x.max())

def normalize(x):
try:
x = x/np.linalg.norm(x,ord=1)
return x
except :
raise
data = pd.DataFrame.apply(data,normalize)
From the document of pandas,DataFrame structure can apply an operation (function) to itself .
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Applies function along input axis of DataFrame.
Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.
You can apply a custom function to operate the DataFrame .

The following function calculates the Z score:
def standardization(dataset):
""" Standardization of numeric fields, where all values will have mean of zero
and standard deviation of one. (z-score)
Args:
dataset: A `Pandas.Dataframe`
"""
dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
# Normalize numeric columns.
for column, dtype in dtypes:
if dtype == 'float32':
dataset[column] -= dataset[column].mean()
dataset[column] /= dataset[column].std()
return dataset

You can do this in one line
DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0)
it takes mean for each of the column and then subtracts it(mean) from every row(mean of particular column subtracts from its row only) and divide by mean only. Finally, we what we get is the normalized data set.

Pandas does column wise normalization by default. Try the code below.
X= pd.read_csv('.\\data.csv')
X = (X-X.min())/(X.max()-X.min())
The output values will be in range of 0 and 1.

Hey use the apply function with lambda which speeds up the process:
def normalize(df_col):
# Condition to exclude 'ID' and 'Class' feature
if (str(df_col.name) != str('ID') and str(df_col.name)!=str('Class')):
max_value = df_col.max()
min_value = df_col.min()
#It avoids NaN and return 0 instead
if max_value == min_value:
return 0
sub_value = max_value - min_value
return np.divide(np.subtract(df_col,min_value),sub_value)
else:
return df_col
df_normalize = df.apply(lambda x :normalize(x))

To normalise a DataFrame column, using only native Python.
Different values influence processes, e.g. plot colours.
Between 0 and 1:
min_val = min(list(df['col']))
max_val = max(list(df['col']))
df['col'] = [(x - min_val) / max_val for x in df['col']]
Between -1 to 1:
df['col'] = [float(i)/sum(df['col']) for i in df['col']]
OR
df['col'] = [float(tp) / max(abs(df['col'])) for tp in df['col']]

If your data is positively skewed, the best way to normalize is to use the log transformation:
df = np.log10(df)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Very Large Values Predicted for Linear Regression - python

I posted this at the stats site and Ami Tavory pointed out that the get_dummies should be run on the merged train and test dataframe to ensure that the same dummy variables were set up in both dataframes. This solved the issue.

Related

How to run model on new data that requires pd.get_dummies

How to fill missing value using pre-trained model?

Shuffle rows of a dataframe in pandas python brings about different regression results?

How can I increase the accuracy of my Linear Regression model?(machine learning with python)

Normalize columns of a dataframe

Categories

Resources