Normalizing a dataset row by row - Python

I'm trying to normalize a dataset to [-1, +1], and the code I wrote normalizes column by column. Could you tell me how to normalize row by row?
import pandas as pd

df = pd.read_csv('/-----.csv')
df_max_scaled = df.copy()
# Scale each column by its largest absolute value
for column in df.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()

You could use apply with axis=1, which processes the DataFrame row by row:
df.apply(lambda x: x/x.abs().max(), axis=1)
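For example (a minimal sketch with made-up values), each row ends up divided by its own largest absolute value, so every entry lands in [-1, +1]:
import pandas as pd

df = pd.DataFrame({'a': [1, -2], 'b': [5, 4], 'c': [10, -8]})

# Divide each row by its own largest absolute value
normalized = df.apply(lambda x: x / x.abs().max(), axis=1)
print(normalized)
# Row 0 -> [0.10, 0.50, 1.00]; row 1 -> [-0.25, 0.50, -1.00]

# An equivalent vectorized form: divide by the row-wise max, aligned on the index
normalized2 = df.div(df.abs().max(axis=1), axis=0)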

Related

What is the best way to apply several percentage differences to a pandas DataFrame?

Let's consider dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([np.random.randn(1000)]).transpose()
I want to apply percentage-change transformations with periods 1 through 10 and append the results to df. My primitive solution is:
df_copy = df.copy()
for i in range(1, 11):
    to_add = df_copy.pct_change(i)
    df = pd.concat([df, to_add], axis=1)
However, I'm not sure this is the most efficient way to do it. Do you know of a better option?
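One more compact variant (a sketch, not benchmarked) is to build all ten shifted frames first and concatenate once instead of growing df inside the loop; the _pct_ suffix is only an illustrative naming choice, and it also avoids duplicate column labels:
changes = [df.pct_change(i).add_suffix(f'_pct_{i}') for i in range(1, 11)]
df = pd.concat([df] + changes, axis=1)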

Problems implementing Dask MinMaxScaler

I am having problems normalizing a dask.dataframe.core.DataFrame using dask_ml.preprocessing.MinMaxScaler. I am able to use sklearn.preprocessing.MinMaxScaler, but I want to use Dask to scale up.
Minimal, Reproducible Example:
import dask.dataframe as dd

# Get data
ddf = dd.read_csv('test.csv')  # See below
ddf = ddf.set_index('index')

# Pivot
ddf = ddf.categorize(columns=['item', 'name'])
ddf_p = ddf.pivot_table(index='item', columns='name', values='value', aggfunc='mean')
col = ddf_p.columns.to_list()

# sklearn version
from sklearn.preprocessing import MinMaxScaler
scaler_s = MinMaxScaler()
scaled_ddf_s = scaler_s.fit_transform(ddf_p[col])  # Works!

# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])  # Doesn't work
Error message:
TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one
I'm not sure what the 'Categorical' in the pivoted table is, but I tried calling .as_ordered() on the index:
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p = ddf_p.index.cat.as_ordered()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])
But I get the error message:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
Additional information
test.csv:
index,item,name,value
2015-01-01,item_1,A,1
2015-01-01,item_1,B,2
2015-01-01,item_1,C,3
2015-01-01,item_1,D,4
2015-01-01,item_1,E,5
2015-01-02,item_2,A,10
2015-01-02,item_2,B,20
2015-01-02,item_2,C,30
2015-01-02,item_2,D,40
2015-01-02,item_2,E,50
Looking at this answer:
pivot_table produces a column index which is categorical because you made the original column "Field" categorical. Writing the index to parquet calls reset_index on the data-frame, and pandas cannot add a new value to the columns index, because it is categorical. You can avoid this using ddf.columns = list(ddf.columns).
Therefore adding ddf_p.columns = list(ddf_p.columns) solved the problem:
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p.columns = list(ddf_p.columns)  # Convert the categorical column index to plain labels
scaled_values_d = scaler_d.fit_transform(ddf_p[col])  # Works!
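For reference, the categorical involved here is the pivoted frame's column index. A quick way to check (assuming the setup above):
print(type(ddf_p.columns))  # CategoricalIndex before the fix, a plain Index after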

Extract smaller table from pivot table pandas

I want to split the following pivot table into training and testing sets (to evaluate a recommendation system), and was thinking of extracting two tables with non-overlapping indices (userID) and column values (ISBN). How can I split it properly? Thank you.
As suggested by @moys, you can use train_test_split from scikit-learn after first splitting your dataframe's columns to get the non-overlapping column names.
Example:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
Generate data:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
Split the df columns in some way, e.g. in half:
cols = int(len(df.columns)/2)
df_A = df.iloc[:, 0:cols]
df_B = df.iloc[:, cols:]
Use train_test_split:
train_A, test_A = train_test_split(df_A, test_size=0.33)
train_B, test_B = train_test_split(df_B, test_size=0.33)
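Note that the two calls shuffle independently. If you instead want the same rows to land in the train/test sets for both halves (an assumption about the desired behaviour), fixing random_state works, since both frames have the same length and the same seed yields the same shuffle:
train_A, test_A = train_test_split(df_A, test_size=0.33, random_state=0)
train_B, test_B = train_test_split(df_B, test_size=0.33, random_state=0)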

Replacing Manual Standardization with Standard Scaler Function

I want to replace the manual standardization of the monthly data with StandardScaler from sklearn. I tried the line of code below the commented-out one, but I am receiving the following error.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

arr = pd.DataFrame(np.arange(1, 21), columns=['Output'])
arr2 = pd.DataFrame(np.arange(10, 210, 10), columns=['Output2'])
index2 = pd.date_range('20180928 10:00am', periods=20, freq="W")
# index3 = pd.DataFrame(index2, columns=['Date'])
df2 = pd.concat([pd.DataFrame(index2, columns=['Date']), arr, arr2], axis=1)
print(df2)

cols = df2.columns[1:]
# df2_grouped = df2.groupby(['Date'])
df2.set_index('Date', inplace=True)
df2_grouped = df2.groupby(pd.Grouper(freq='M'))

for c in cols:
    # df2[c] = df2_grouped[c].apply(lambda x: (x - x.mean()) / x.std())
    df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))

print(df2)
ValueError: Expected 2D array, got 1D array instead:
array=[1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The error message says that StandardScaler().fit_transform only accepts a 2-D array.
So you could replace:
df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
with:
from sklearn.preprocessing import scale
df2[c] = df2_grouped[c].transform(lambda x: scale(x.astype(float)))
as a workaround.
From sklearn.preprocessing.scale:
Standardize a dataset along any axis
Center to the mean and component wise scale to unit variance.
So it should work as a standard scaler.
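Alternatively, if you'd rather keep StandardScaler itself, the reshape hint from the error message can be applied inside the lambda (a sketch; note that StandardScaler, like scale, divides by the population standard deviation, so the result differs slightly from the manual x.std() version, which uses the sample standard deviation):
df2[c] = df2_grouped[c].transform(
    lambda x: StandardScaler().fit_transform(x.values.reshape(-1, 1)).ravel()
)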

Quantile values for each column in dataframe

I have a dataframe which consists of columns of numbers. I am trying to calculate the decile rank values for each column. The following code gives me the values for the dataframe as a whole. How can I do it column by column?
pd.qcut(df, 10, labels=False)
Thanks.
If you apply qcut down each column (axis=0), you get a dataframe where each entry is the decile rank of the corresponding value within its column.
import numpy as np
import pandas as pd
data_a = np.random.random(100)
data_b = 100*np.random.random(100)
df = pd.DataFrame(columns=['A','B'], data=list(zip(data_a, data_b)))
rank = df.apply(pd.qcut, axis=0, q=10, labels=False)
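As a quick sanity check, with 100 rows each decile label should appear ten times in every column:
print(rank['A'].value_counts().sort_index())  # labels 0-9, ten rows each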
