How to scale all columns except last column? - python

I'm using Python 3.7.6.
I'm working on a classification problem.
I want to scale my dataframe's (df) feature columns.
The dataframe contains 56 columns (55 feature columns and the last column is the target column).
I want to scale the feature columns.
I'm doing it as follows:
y = df.iloc[:,-1]
target_name = df.columns[-1]
from FeatureScaling import feature_scaling
df = feature_scaling.scale(df.iloc[:,0:-1], standardize=False)
df[target_name] = y
but it doesn't seem efficient, because I need to rebuild the dataframe (add the target column back to the scaling result).
Is there a way to scale just some columns without changing the others, in an efficient way?
(i.e. the result from scale would contain the scaled columns plus the one column that is not scaled)

Using column indices for scaling or other pre-processing operations is not a very good idea, as the code breaks every time you add a new feature. Use column names instead, e.g.
using scikit-learn:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

features = [<features to standardize>]
scaler = StandardScaler()
# fit_transform returns a 2d numpy.array; we cast it to a pd.DataFrame
standardized_features = pd.DataFrame(scaler.fit_transform(df[features].copy()), columns=features)
old_shape = df.shape
# drop the unstandardized features from the dataframe
df.drop(features, axis=1, inplace=True)
# join back the standardized features
df = pd.concat([df, standardized_features], axis=1)
assert old_shape == df.shape, "something went wrong!"
Or you can use a function like this if you prefer not to split and join the data back:
import numpy as np

def normalize(x):
    if np.std(x) == 0:
        raise ValueError('Constant column')
    return (x - np.mean(x)) / np.std(x)

for col in features:
    df[col] = normalize(df[col])
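For the original question (scale everything except the last, target column), a minimal sketch of the same column-name idea might look like this; it assumes the target really is the last column, as described in the question:
# derive the feature list from the column names instead of positional slicing
target_name = df.columns[-1]
features = [c for c in df.columns if c != target_name]
for col in features:
    df[col] = normalize(df[col])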

You can slice the columns you want:
df.iloc[:, :-1] = feature_scaling.scale(df.iloc[:, :-1], standardize=False)
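If you'd rather not depend on the custom FeatureScaling module, the same in-place slice assignment works with scikit-learn. A sketch, assuming min-max scaling is what standardize=False was meant to do:
from sklearn.preprocessing import MinMaxScaler

# scale every column except the last (target) one, writing back in place
df.iloc[:, :-1] = MinMaxScaler().fit_transform(df.iloc[:, :-1])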

Related

Concat created NaN values even after index_reset

I want to create a CSV file that combines the train and test data and labels, to use it for a project. The problem is that in the concat function, even after resetting the index, the labels keep being NaN and I don't understand what is wrong. The datasets are at this link: https://wetransfer.com/downloads/9f0562b7ec341ebb663262af78971b8020211228154538/84d58d
import pandas as pd
from sklearn.utils import shuffle
# remove first col from training dataset
data = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_data.csv')
first_column = data.columns[0]
data = data.drop([first_column], axis=1)
data.to_csv('new1.csv', index=False)
# remove first col from testing dataset
data2 = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_data.csv')
first_column = data2.columns[0]
data2 = data2.drop([first_column], axis=1)
data2.to_csv('new2.csv', index=False)
#read training labels
data_labels = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_label.csv')
#read testing labels
data2_labels = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_label.csv')
train = pd.concat([data_labels, data], axis=1, join='inner')
print(train.shape)
test = pd.concat([data2_labels, data2], axis=1, join='inner')
print(test.shape)
test.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)
frame = pd.concat([train, test], axis=0)
print(frame)
I suspect what's happening is that you have duplicate index values before the concat(). (They're possibly only duplicated between the train and test sets, not necessarily within each set separately.) That might throw off concat(), since index values are assumed to be unique, and it might compensate by setting some to NaN. The calls to reset_index() give each of them, separately, index values starting from 0.
To fix this: Set ignore_index=True in pd.concat(). From the docs:
ignore_index: bool, default False If True, do not use the index values
along the concatenation axis. The resulting axis will be labeled 0, …,
n - 1. This is useful if you are concatenating objects where the
concatenation axis does not have meaningful indexing information. Note
the index values on the other axes are still respected in the join.
If that doesn't work, check: Do test & train have NaNs in the index before concatenation and after reset_index()? They shouldn't, but check. If they do, those will carry over into the concat.
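A minimal sketch of the suggested fix, reusing the variable names from the question:
# stack train on top of test and rebuild a fresh 0..n-1 index
frame = pd.concat([train, test], axis=0, ignore_index=True)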
I just did the concats in a different order and it worked.
The NaNs were the result of not merging the labels correctly. Instead of creating one single column with labels, I had created two that were each half empty, one with the train labels and one with the test labels.
import pandas as pd
from sklearn.utils import shuffle
# remove first col from training dataset
data = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_data.csv')
first_column = data.columns[0]
data = data.drop([first_column], axis=1)
print(data.shape)
# remove first col from testing dataset
data2 = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_data.csv')
first_column = data2.columns[0]
data2 = data2.drop([first_column], axis=1)
print(data2.shape)
#read training labels
data_labels = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_label.csv')
print(data_labels.shape)
#read testing labels
data2_labels = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_label.csv')
print(data2_labels.shape)
#concat data without labels
frames = [data, data2]
d = pd.concat(frames)
#concat labels
l = pd.concat([data_labels, data2_labels])
#create the original dataset
print(d.shape, l.shape)
dataset = pd.concat([l, d], axis=1)
dataset = shuffle(dataset)
dataset
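Since the stated goal was a single CSV file, a final write might look like this; the file name is just an illustration:
# hypothetical output path
dataset.to_csv('combined_dataset.csv', index=False)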

scikit preprocessing across entire dataframe

I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so I wanted to preprocess it first, using either standardization or normalization.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a row or column, but I'm struggling to do it across the dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
#define scaler
scaler = MinMaxScaler() #or StandardScaler
X = df.loc[1].T
X = X.to_numpy()
#transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) to a chosen set of columns.
As you can read from the linked documentation, you need to provide, inside a tuple:
a name for the step
the chosen transformer (e.g. StandardScaler) or a Pipeline as well
a list of columns to which to apply the selected transformation
Code example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']

# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])

# fit and transform the input dataframe
ct.fit_transform(df)

array([[ 0.86955718,  0.93177476,  0.96056682,  0.46493449],
       [ 0.53109031,  0.45544147,  0.41859563,  0.92419906],
       [-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer will output a numpy array with the transformed values, fitted on the input dataset df. Even though there are no column names now, the array columns are still ordered in the same way as the input dataframe, so it's easy to convert the array back to a pandas dataframe if you need to.
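A minimal sketch of that conversion, reusing the columns list above (and assuming pandas is imported as pd):
# wrap the scaled array back into a dataframe with the original column names
scaled_df = pd.DataFrame(ct.fit_transform(df), columns=columns, index=df.index)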
In addition to @RicS's answer, note that what the scikit-learn function returns is a numpy array, not a dataframe anymore. Also, the Company column is not included. You may consider this to convert the result back into a dataframe:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company",axis=1)) # scale all columns except Company
y = pd.concat([df["Company"],pd.DataFrame(x, columns=df.columns[1:])],axis=1) # adds results and company into dataframe again
y.head()
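If by "across the entire dataframe" you mean one single scaling over all quarterly values (rather than per column), a sketch of that, assuming MinMaxScaler and the df from the question, could flatten, scale and reshape:
from sklearn.preprocessing import MinMaxScaler

values = df.drop(columns='Company').to_numpy()
# scale with one common min/max over all values instead of per column
scaled = MinMaxScaler().fit_transform(values.reshape(-1, 1)).reshape(values.shape)
df_scaled = pd.DataFrame(scaled, columns=df.columns[1:], index=df.index)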

Using inverse_transform MinMaxScaler from scikit_learn to force a dataframe be in a range of another

I was following this answer to apply an inverse transformation over a scaled dataframe. My question is: how can I transform a new dataframe into the range of values of the original dataframe?
So far, I did this:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
cols = ['A', 'B']
data = pd.DataFrame(np.array([[2,3],[1.02,1.2],[0.5,0.3]]),columns=cols)
scaler = MinMaxScaler() # default min and max values are 0 and 1, respectively
scaled_data = scaler.fit_transform(data)
orig_data = scaler.inverse_transform(scaled_data) # obtain same as `data`
new_data = pd.DataFrame(np.array([[8,20],[11,2],[5,3]]),columns=cols)
inver_new_data = scaler.inverse_transform(new_data)
I want inver_new_data to be a dataframe with its columns in the same range of values as the data columns, for instance column A between 0.5 and 2, and so on. However, for column A I get values between 8 and 17.
Any ideas?
MinMaxScaler applies to each column the following transformation:
Subtract column minimum;
Divide by column range (i.e. column max - column min).
The inverse transform applies the "inverse" operation in "inverse" order:
Multiply by column range before the transformation;
Add the column min.
Therefore for column A it is doing
(df['A'] - df['A'].min()) / (df['A'].max() - df['A'].min())
In particular, the scaler stores the min 0.5 and the range 1.5.
When you apply the inverse_transform to [8, 11, 5] this becomes:
[8*1.5 + 0.5, 11*1.5 + 0.5, 5*1.5 + 0.5] = [12.5, 17, 8]
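You can check what the fitted scaler stored by inspecting its attributes; a quick sketch using the scaler fitted in the question:
print(scaler.data_min_)    # per-column minimum, e.g. [0.5 0.3]
print(scaler.data_range_)  # per-column range (max - min), e.g. [1.5 2.7]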
Now, this is not generally suggested for doing any machine learning; however, to map the ranges of the new columns onto those of the original ones, you can do something like the following:
data = pd.DataFrame(np.array([[2,3],[1.02,1.2],[0.5,0.3]]),columns=cols)
# Create a Scaler for the initial data
scaler_data = MinMaxScaler()
# Fit the scaler with these data, but there is no need to transform them.
scaler_data.fit(data)
#Create new data
new_data = pd.DataFrame(np.array([[8,20],[11,2],[5,3]]),columns=cols)
# Create a Scaler for the new data
scaler_new_data = MinMaxScaler()
# Transform new data into the [0-1] range
scaled_new_data = scaler_new_data.fit_transform(new_data)
# Inverse transform new data from [0-1] to [min, max] of data
inver_new_data = scaler_data.inverse_transform(scaled_new_data)
For example this will always map the min and max of new dataframe columns to the min and max of initial dataframe columns respectively.
To explain what MinMaxScaler is doing:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
So basically every feature of your data will be between 0 and 1.
The moment you run fit_transform(data), the scaler is trained.
For transformation you have:
X_scaled = scale * X + min - X.min(axis=0) * scale
where scale = (max - min) / (X.max(axis=0) - X.min(axis=0))
The scale was learned during the fitting step.
So running inverse_transform(new_data) does not help you at all.
Also inver_new_data = scaler.transform(new_data) will not help you.
You need to specify what "the same range" means to you. The approach with MinMaxScaler will not help you right now; you could only limit the columns to the min and max of the original dataframe. So, for example:
dataA = new_data[['A']]
scalerA = MinMaxScaler(feature_range=(data['A'].min(), data['A'].max()))
inver_new_data_A = scalerA.fit_transform(dataA)
But note that MinMaxScaler only rescales linearly; it preserves the relative distances between the points.

Replacing Manual Standardization with Standard Scaler Function

I want to replace the manual calculation for standardizing the monthly data with StandardScaler from sklearn. I tried the line of code below the commented-out line, but I am receiving the following error.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
arr = pd.DataFrame(np.arange(1,21), columns=['Output'])
arr2 = pd.DataFrame(np.arange(10, 210, 10), columns=['Output2'])
index2 = pd.date_range('20180928 10:00am', periods=20, freq="W")
# index3 = pd.DataFrame(index2, columns=['Date'])
df2 = pd.concat([pd.DataFrame(index2, columns=['Date']), arr, arr2], axis=1)
print(df2)
cols = df2.columns[1:]
# df2_grouped = df2.groupby(['Date'])
df2.set_index('Date', inplace=True)
df2_grouped = df2.groupby(pd.Grouper(freq='M'))
for c in cols:
    # df2[c] = df2_grouped[c].apply(lambda x: (x - x.mean()) / (x.std()))
    df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
print(df2)
ValueError: Expected 2D array, got 1D array instead:
array=[1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The error message says that StandardScaler().fit_transform only accepts a 2-D argument.
So you could replace:
df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
with:
from sklearn.preprocessing import scale
df2[c] = df2_grouped[c].transform(lambda x: scale(x.astype(float)))
as a workaround.
From sklearn.preprocessing.scale:
Standardize a dataset along any axis
Center to the mean and component wise scale to unit variance.
So it should work as a standard scaler.
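Alternatively, if you want to keep StandardScaler itself, the reshape suggested by the error message also works; a sketch of the line inside the same for c in cols loop, under that assumption:
df2[c] = df2_grouped[c].transform(
    lambda x: StandardScaler().fit_transform(x.to_numpy().reshape(-1, 1)).ravel()
)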

How can I normalize the data in a range of columns in my pandas dataframe

Suppose I have a pandas data frame surveyData:
I want to normalize the data in each column by performing:
surveyData_norm = (surveyData - surveyData.mean()) / (surveyData.max() - surveyData.min())
This would work fine if my data table only contained the columns I wanted to normalize. However, some leading columns contain string data, like:
Name  State  Gender  Age  Income  Height
Sam   CA     M       13   10000   70
Bob   AZ     M       21   25000   55
Tom   FL     M       30   100000  45
I only want to normalize the Age, Income, and Height columns, but my method above does not work because of the string data in the Name, State and Gender columns.
You can perform operations on a subset of rows or columns in pandas in a number of ways. One useful way is indexing:
# Assuming same lines from your example
cols_to_norm = ['Age','Height']
survey_data[cols_to_norm] = survey_data[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
This will apply it to only the columns you desire and assign the result back to those columns. Alternatively you could set them to new, normalized columns and keep the originals if you want.
I think it's better to use sklearn.preprocessing in this case, which gives us many more scaling options.
The way of doing that in your case, using StandardScaler, would be:
from sklearn.preprocessing import StandardScaler
cols_to_norm = ['Age','Height']
surveyData[cols_to_norm] = StandardScaler().fit_transform(surveyData[cols_to_norm])
A simple and more efficient way:
Pre-calculate the mean; dropna() avoids missing data.
mean_age = survey_data.Age.dropna().mean()
max_age = survey_data.Age.dropna().max()
min_age = survey_data.Age.dropna().min()
survey_data['Age'] = survey_data['Age'].apply(lambda x: (x - mean_age) / (max_age - min_age))
this way will work...
I think it's really nice to use built-in functions
# Assuming same lines from your example
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
cols_to_norm = ['Age','Height']
survey_data[cols_to_norm] = scaler.fit_transform(survey_data[cols_to_norm])
MinMax normalize all numeric columns with minmax_scale
import numpy as np
from sklearn.preprocessing import minmax_scale
# cols = ['Age', 'Height']
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])
Note: Keeps index, column names or non-numerical variables unchanged.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# let dataset here be your data
minmax = MinMaxScaler()
for x in dataset.columns[dataset.dtypes == 'int64']:
    dataset[x] = minmax.fit_transform(np.array(dataset[x]).reshape(-1, 1))
