I'm trying to normalize a dataset to the range [-1, +1]. The code I wrote below normalizes it column by column. Could you tell me how to normalize it row by row instead?
from sklearn import preprocessing
import pandas as pd
df = pd.read_csv('/-----.csv')
df_max_scaled = df.copy()
for column in df.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()
You could use apply with axis=1, which processes the DataFrame row by row:
df.apply(lambda x: x/x.abs().max(), axis=1)
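A vectorized alternative may be faster on large frames, since it avoids calling a Python function once per row. The sketch below uses a small made-up frame (the names frame, row_max and scaled are illustrative, not from the question) and divides every row by that row's largest absolute value:

```python
import pandas as pd

# Hypothetical example data; any all-numeric DataFrame would do.
frame = pd.DataFrame(
    [[1.0, -4.0, 2.0], [10.0, 5.0, -20.0]],
    columns=["a", "b", "c"],
)

# Largest absolute value in each row (one value per row).
row_max = frame.abs().max(axis=1)

# axis=0 aligns row_max with the row index, so each row is divided by its own maximum.
scaled = frame.div(row_max, axis=0)

# Same result as the row-wise apply shown above.
via_apply = frame.apply(lambda x: x / x.abs().max(), axis=1)
```

Note that this is max-abs scaling: every value ends up in [-1, +1], but a row only reaches both ends if it contains values of equal magnitude and opposite sign. A row consisting only of zeros would produce NaN.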
Consider the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame([np.random.randn(1000)]).transpose()
I want to compute percentage changes over 1, ..., 10 periods and add each result to df as a new column. My primitive solution is:
df_copy = df.copy()
for i in range(1, 11):
    to_add = df_copy.pct_change(i)
    df = pd.concat([df, to_add], axis=1)
However, I'm not sure this is the most efficient way to do it. Is there a better option?
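One possible improvement, offered as a sketch rather than a benchmarked answer: build all the percentage-change columns first and concatenate once, instead of growing df inside the loop. The column names (x, pct_1, ..., pct_10) are made up here so that the new columns do not all share the same label:

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same shape as in the question (1000 rows, one column).
rng = np.random.default_rng(0)
base = pd.DataFrame({"x": rng.standard_normal(1000)})

# One Series per lag, keyed by an illustrative column name.
changes = {f"pct_{i}": base["x"].pct_change(i) for i in range(1, 11)}

# A single concat instead of ten incremental ones.
result = pd.concat([base, pd.DataFrame(changes)], axis=1)
```

The first i rows of each pct_i column are NaN, because there is no earlier value to compare against.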
I am having problems normalizing a dask.dataframe.core.DataFrame using dask_ml.preprocessing.MinMaxScaler. sklearn.preprocessing.MinMaxScaler works, but I want to use Dask in order to scale up.
Minimal, Reproducible Example:
# Get data
import dask.dataframe as dd
ddf = dd.read_csv('test.csv')  # See below
ddf = ddf.set_index('index')
# Pivot
ddf = ddf.categorize(columns=['item', 'name'])
ddf_p = ddf.pivot_table(index='item', columns='name', values='value', aggfunc='mean')
col = ddf_p.columns.to_list()
# sklearn version
from sklearn.preprocessing import MinMaxScaler
scaler_s = MinMaxScaler()
scaled_ddf_s = scaler_s.fit_transform(ddf_p[col]) # Works!
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
scaled_values_d = scaler_d.fit_transform(ddf_p[col]) # Doesn't work
Error message:
TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one
I'm not sure which part of the pivoted table is the 'Categorical', but I tried calling .as_ordered() on the index:
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p = ddf_p.index.cat.as_ordered()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])
But I get the error message:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
Additional information
test.csv:
index,item,name,value
2015-01-01,item_1,A,1
2015-01-01,item_1,B,2
2015-01-01,item_1,C,3
2015-01-01,item_1,D,4
2015-01-01,item_1,E,5
2015-01-02,item_2,A,10
2015-01-02,item_2,B,20
2015-01-02,item_2,C,30
2015-01-02,item_2,D,40
2015-01-02,item_2,E,50
Looking at this answer:
pivot_table produces a column index which is categorical because you
made the original column "Field" categorical. Writing the index to
parquet calls reset_index on the data-frame, and pandas cannot add a
new value to the columns index, because it is categorical. You can
avoid this using ddf.columns = list(ddf.columns).
Therefore, adding ddf_p.columns = list(ddf_p.columns) solved the problem:
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p.columns = list(ddf_p.columns)
scaled_values_d = scaler_d.fit_transform(ddf_p[col]) # Works!
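To illustrate why this helps, here is a small sketch in plain pandas (not Dask, so it runs without a cluster). The frame named pivoted is a hand-built stand-in for the pivoted table, assumed to have categorical column labels as the quoted answer describes. It shows that an unordered Categorical cannot be reduced with min, and that reassigning the columns from a plain list removes the categorical type:

```python
import pandas as pd

# An unordered Categorical has no defined minimum, which is what the TypeError reports.
labels = pd.Categorical(["A", "B", "C"])
try:
    labels.min()
    min_failed = False
except TypeError:
    min_failed = True

# Hypothetical stand-in for the pivoted table: the column labels are categorical.
pivoted = pd.DataFrame(
    [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]],
    index=["item_1", "item_2"],
    columns=pd.CategoricalIndex(["A", "B", "C"], name="name"),
)
was_categorical = isinstance(pivoted.columns, pd.CategoricalIndex)

# Same fix as above: replace the column index with plain labels.
pivoted.columns = list(pivoted.columns)
is_categorical = isinstance(pivoted.columns, pd.CategoricalIndex)
```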
I want to split the following pivot table into training and testing sets (to evaluate a recommendation system), and was thinking of extracting two tables with non-overlapping indices (userID) and column values (ISBN). How can I split it properly? Thank you.
As suggested by #moys, you can use train_test_split from scikit-learn after first splitting your DataFrame's columns into non-overlapping groups.
Example:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
Generate data:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
Split the df columns in some way, e.g. in half:
cols = int(len(df.columns)/2)
df_A = df.iloc[:, 0:cols]
df_B = df.iloc[:, cols:]
Use train_test_split:
train_A, test_A = train_test_split(df_A, test_size=0.33)
train_B, test_B = train_test_split(df_B, test_size=0.33)
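One caveat: the two train_test_split calls above shuffle the rows independently, so the rows of train_A can overlap with those of test_B. If the rows (userID) must also be disjoint between the two tables, a possible variant is to split the row labels and the column labels once each and then select with .loc. This is a sketch on made-up data; the names ratings, train and test are illustrative, and random_state is fixed only to make the example repeatable:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the pivot table (rows: users, columns: items).
rng = np.random.default_rng(0)
ratings = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

# Split the row labels and the column labels separately, once each.
train_rows, test_rows = train_test_split(
    ratings.index.to_numpy(), test_size=0.33, random_state=0
)
train_cols, test_cols = train_test_split(
    ratings.columns.to_numpy(), test_size=0.5, random_state=0
)

# The two tables share neither rows nor columns.
train = ratings.loc[train_rows, train_cols]
test = ratings.loc[test_rows, test_cols]
```

Whether fully disjoint rows and columns is the right evaluation design for a recommender depends on the model, so treat this only as one way to get the split described in the question.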
I want to replace the manual standardization of the monthly data with StandardScaler from sklearn. I tried the line of code below the commented-out code, but I get the following error.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
arr = pd.DataFrame(np.arange(1,21), columns=['Output'])
arr2 = pd.DataFrame(np.arange(10, 210, 10), columns=['Output2'])
index2 = pd.date_range('20180928 10:00am', periods=20, freq="W")
# index3 = pd.DataFrame(index2, columns=['Date'])
df2 = pd.concat([pd.DataFrame(index2, columns=['Date']), arr, arr2], axis=1)
print(df2)
cols = df2.columns[1:]
# df2_grouped = df2.groupby(['Date'])
df2.set_index('Date', inplace=True)
df2_grouped = df2.groupby(pd.Grouper(freq='M'))
for c in cols:
    # df2[c] = df2_grouped[c].apply(lambda x: (x-x.mean()) / (x.std()))
    df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
print(df2)
ValueError: Expected 2D array, got 1D array instead:
array=[1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The error message says that StandardScaler().fit_transform only accepts a 2-D argument, whereas each group passed in by apply is a 1-D Series.
So you could replace:
df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
with:
from sklearn.preprocessing import scale
df2[c] = df2_grouped[c].transform(lambda x: scale(x.astype(float)))
as a workaround.
From sklearn.preprocessing.scale:
Standardize a dataset along any axis
Center to the mean and component wise scale to unit variance.
So it should work as a standard scaler.
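If you would rather keep StandardScaler itself, another option is to reshape each group to a single 2-D column and flatten the result again. The sketch below uses made-up weekly data; the helper name standardize is hypothetical, and grouping by index.to_period("M") is used here in place of pd.Grouper as an assumption that monthly calendar groups are what is wanted:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical weekly data, chosen so that every month has at least two rows.
dates = pd.date_range("2018-10-07", periods=20, freq="W")
frame = pd.DataFrame({"Output": np.arange(1.0, 21.0)}, index=dates)

def standardize(values: pd.Series) -> np.ndarray:
    """Standardize one group: reshape to (n, 1) for sklearn, then flatten back."""
    column = values.to_numpy(dtype=float).reshape(-1, 1)
    return StandardScaler().fit_transform(column).ravel()

# Group rows by calendar month.
month_key = frame.index.to_period("M")
by_scaler = frame.groupby(month_key)["Output"].transform(standardize)

# Manual equivalent; note ddof=0 (population standard deviation).
by_hand = frame.groupby(month_key)["Output"].transform(
    lambda x: (x - x.mean()) / x.std(ddof=0)
)
```

Both StandardScaler and scale divide by the population standard deviation (ddof=0), while pandas' x.std() defaults to ddof=1, so the results differ slightly from the commented-out manual formula unless ddof=0 is passed as above.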
I have a DataFrame consisting of numeric columns. I am trying to calculate the decile rank of the values in each column. The following code is my attempt for the DataFrame as a whole. How can I do it per column?
pd.qcut(df, 10, labels=False)
Thanks.
If you apply qcut to each column, you get a DataFrame in which each entry is the decile rank of that value within its own column.
import numpy as np
import pandas as pd
data_a = np.random.random(100)
data_b = 100*np.random.random(100)
df = pd.DataFrame(columns=['A','B'], data=list(zip(data_a, data_b)))
rank = df.apply(pd.qcut, axis=0, q=10, labels=False)
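As a quick sanity check, here is a sketch with seeded random data (the names frame, deciles and counts are illustrative). It counts how many values land in each decile per column; with 100 distinct values per column, each of the ten labels 0 to 9 should appear ten times, regardless of the column's scale:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two columns on very different scales.
rng = np.random.default_rng(0)
frame = pd.DataFrame({"A": rng.random(100), "B": 100 * rng.random(100)})

# Column-wise decile labels, as in the answer above.
deciles = frame.apply(pd.qcut, axis=0, q=10, labels=False)

# Number of values per decile label, computed separately for each column.
counts = deciles.apply(lambda col: col.value_counts().sort_index())
```

If a column contains many repeated values, the quantile edges can coincide and pd.qcut raises a ValueError; its duplicates="drop" option is one way to handle that, at the cost of fewer than ten bins.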