Problems implementing Dask MinMaxScaler - python

I am having problems normalizing a dask.dataframe.core.DataFrame using dask_ml.preprocessing.MinMaxScaler. I am able to use sklearn.preprocessing.MinMaxScaler; however, I wish to use Dask to scale up.
Minimal, Reproducible Example:
import dask.dataframe as dd

# Get data
ddf = dd.read_csv('test.csv')  # See below
ddf = ddf.set_index('index')
# Pivot (pivot_table requires the pivot columns to be categorized first)
ddf = ddf.categorize(columns=['item', 'name'])
ddf_p = ddf.pivot_table(index='item', columns='name', values='value', aggfunc='mean')
col = ddf_p.columns.to_list()
# sklearn version
from sklearn.preprocessing import MinMaxScaler
scaler_s = MinMaxScaler()
scaled_ddf_s = scaler_s.fit_transform(ddf_p[col])  # Works!
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])  # Doesn't work
Error message:
TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one
I am not sure what the 'Categorical' in the pivoted table refers to, but I have tried to .as_ordered() the index:
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p = ddf_p.index.cat.as_ordered()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])
But I get the error message:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
Additional information
test.csv:
index,item,name,value
2015-01-01,item_1,A,1
2015-01-01,item_1,B,2
2015-01-01,item_1,C,3
2015-01-01,item_1,D,4
2015-01-01,item_1,E,5
2015-01-02,item_2,A,10
2015-01-02,item_2,B,20
2015-01-02,item_2,C,30
2015-01-02,item_2,D,40
2015-01-02,item_2,E,50

Looking at this answer:
pivot_table produces a column index which is categorical because you
made the original column "Field" categorical. Writing the index to
parquet calls reset_index on the data-frame, and pandas cannot add a
new value to the columns index, because it is categorical. You can
avoid this using ddf.columns = list(ddf.columns).
Therefore adding ddf_p.columns = list(ddf_p.columns) solved the problem:
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p.columns = list(ddf_p.columns)  # convert the categorical column index to a plain list
scaled_values_d = scaler_d.fit_transform(ddf_p[col]) # Works!
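As a follow-up not covered in the original answer: when dask_ml's MinMaxScaler is given a Dask DataFrame, the scaled result is itself a lazy Dask collection, so calling .compute() materializes it for inspection (a small sketch reusing scaled_values_d from above):
# materialize the lazy Dask result into pandas to inspect the scaled values
print(scaled_values_d.compute().head())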

Related

scikit preprocessing across entire dataframe

I have a dataframe:
import pandas as pd

df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so I wanted to preprocess it first, either by standardizing or by normalizing it.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column with the code below, but I am struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# define scaler
scaler = MinMaxScaler()  # or StandardScaler

# take the numeric values of a single row and reshape to the 2D array sklearn expects
X = df.drop('Company', axis=1).loc[1].to_numpy().reshape(-1, 1)

# transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read from the linked documentation, you need to provide, inside a tuple:
a name for the step
the chosen transformer (e.g. StandardScaler) or a Pipeline
a list of columns to which to apply the selected transformation
Code example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']
# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])
# fit and transform the input dataframe
ct.fit_transform(df)
array([[ 0.86955718, 0.93177476, 0.96056682, 0.46493449],
[ 0.53109031, 0.45544147, 0.41859563, 0.92419906],
[-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer outputs a numpy array with the transformed values, fitted on the input dataset df. Even though there are no column names anymore, the array columns are still ordered in the same way as the input dataframe, so it is easy to convert the array back to a pandas dataframe if you need to.
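For example, a minimal sketch of putting the transformed array back into a dataframe with the original column names (reusing the ct and columns objects defined above):
import pandas as pd

# wrap the scaled array in a DataFrame, restoring the original column names and index
scaled_df = pd.DataFrame(ct.fit_transform(df), columns=columns, index=df.index)
print(scaled_df)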
In addition to @RicS's answer, note that what the scikit-learn function returns is a numpy array, not a dataframe anymore. Also, the Company column is not included. You may consider this to convert the results back into a dataframe:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company", axis=1))  # scale all columns except Company
y = pd.concat([df["Company"], pd.DataFrame(x, columns=df.columns[1:])], axis=1)  # put the results and Company back into a dataframe
y.head()

Applying scikitlearn preprocessing to pandas without causing warnings

I'm trying to use scikit-learn's preprocessing to min-max scale a column in pandas. My solution works but gives me two warnings, and I am wondering if there is a better way to do it.
Here is my function, which does the min-max scaling given a dataframe and columns:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
This is where I use it
df.loc[:,'pct_mm'] = minMaxScale(df,['pct'])
Where the column 'pct' exists and 'pct_mm' does not exist.
I get the following warning 2 times:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
How should I do this the way pandas wants me to?
Cannot reproduce the warnings:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])

df = sns.load_dataset('iris')
df.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
However, if I do this:
df = sns.load_dataset('iris')
df2 = df[:]
df2.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
then I get two warnings as well.
Probably you derived df from another dataframe somewhere in your code. I recommend finding the lines where you used df and making sure to take an explicit copy, e.g. df = old_df.copy().
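A minimal sketch of that suggestion applied to the iris example above (the only change is the explicit .copy(), which makes df2 an independent dataframe rather than a view of df):
df = sns.load_dataset('iris')  # reuses seaborn and the minMaxScale function from the snippet above
df2 = df.copy()  # explicit copy instead of df2 = df[:]
# no SettingWithCopyWarning is raised here
df2.loc[:, 'newcolumn'] = minMaxScale(df2, ['sepal_length'])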

Extract smaller table from pivot table pandas

I want to split the following pivot table into training and testing sets (to evaluate a recommendation system), and was thinking of extracting two tables with non-overlapping indices (userID) and column values (ISBN). How can I split it properly? Thank you.
As suggested by @moys, you can use train_test_split from scikit-learn after first splitting your dataframe columns into non-overlapping sets of column names.
Example:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
Generate data:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
Split the df columns in some way, e.g. in half:
cols = int(len(df.columns)/2)
df_A = df.iloc[:, 0:cols]
df_B = df.iloc[:, cols:]
Use train_test_split:
train_A, test_A = train_test_split(df_A, test_size=0.33)
train_B, test_B = train_test_split(df_B, test_size=0.33)

How to set pandas dataframe column as index when generating loading matrix for PCA

I am using sklearn in Python to perform principal component analysis (PCA) on gene expression data. My data is loaded as a pandas dataframe, for which I can call df.head() and the df looks good. I am using sklearn to generate a loading matrix, but the matrix only displays a generic index, and will not accept a column name for an index. I have 1722 genes, so it is important that I obtain the loading score for each gene computationally.
Here is my code for PCA:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import preprocessing
# Load the data as pandas dataframe
cols = ['gene', 'FC_TSWV', 'FC_WFT', 'FC_TSWV_WFT']
df = pd.read_csv('./PCA.txt', names = cols, header = None, index_col = 'gene')
# preprocess data:
scaled_df = preprocessing.scale(df.T)
# perform PCA
pca = PCA()
pca.fit(scaled_df)
pca_data = pca.transform(scaled_df)
# Generate loading matrix. HERE IS WHERE THE TROUBLE IS:
loading_scores = pd.Series(pca.components_[0], index = df.gene)
# Print loading matrix
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)
print(loading_scores)
I have tried:
loading_scores = pd.Series(pca.components_[0], index = df.gene)
loading_scores = pd.Series(pca.components_[0], index = df['gene'])
loading_scores = pd.Series(pca.components_[0], index = df.loc['gene'])
AttributeError: 'DataFrame' object has no attribute 'gene'.
If I do not specify an index at all, the loading scores are designated with the generic 0-based index.
Anyone know how to fix this?
Use df.index instead of df.gene or df['gene']
Once you set a certain column to be the index, the way to access it is through the .index attribute, and not through the column's name anymore.
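A minimal sketch of that fix applied to the question's code (assuming the pca and df objects defined above):
# 'gene' was consumed as the index by read_csv(..., index_col='gene'), so use df.index
loading_scores = pd.Series(pca.components_[0], index=df.index)
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)
print(sorted_loading_scores)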

python sklearn.compose.columntransformer handling missing columns

I am using ColumnTransformer for my machine learning data. When I use fit_transform, the dataset contains the tag columns, but when I use transform for prediction I get an error due to the missing tag columns (even though the transformer applies no transformation to those columns at all):
'positional indexers are out-of-bounds'
Is there any way to handle this more gracefully, e.g. just not transforming the missing column that has no transformations applied to it to begin with? Alternatively, is it safe to just create a dummy column? How can I check which columns, and in what order, the transformer is expecting?
example case:
from sklearn.compose import ColumnTransformer
import pandas as pd
from sklearn.preprocessing import Normalizer
df = pd.DataFrame.from_records([[1,2,3], [4,5,6]], columns=[0, 1, 2])
c = ColumnTransformer(transformers=[("norm1", Normalizer(norm='l1'), [0, 1])], remainder="passthrough")
df1 = c.fit_transform(df)
df2 = df.drop(2, axis=1)
data = c.transform(df2)
gives
raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
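One way to inspect which columns the fitted transformer expects is its transformers_ attribute, which lists the fitted (name, transformer, columns) triples, including the columns routed to the remainder. A hedged sketch, reusing the c fitted above:
# columns each fitted transformer (and the passthrough remainder) was fitted on
print(c.transformers_)
# e.g. [('norm1', Normalizer(norm='l1'), [0, 1]), ('remainder', 'passthrough', [2])]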
