I am using ColumnTransformer for my machine learning data. When I use fit_transform, I have a dataset containing the tag columns, but when I use transform for prediction I get an error due to the missing tag columns (even though the transformer applies no transformation to those columns at all):
'positional indexers are out-of-bounds'
Is there any way to handle this more gracefully, e.g. just not transforming the missing column that has no transformations over it to begin with? Alternatively, is it safe to just create a dummy column? How can I check which columns, and in which order, the transformer is expecting?
example case:
from sklearn.compose import ColumnTransformer
import pandas as pd
from sklearn.preprocessing import Normalizer
df = pd.DataFrame.from_records([[1,2,3], [4,5,6]], columns=[0, 1, 2])
c = ColumnTransformer(transformers=[("norm1", Normalizer(norm='l1'), [0, 1])], remainder="passthrough")
df1 = c.fit_transform(df)
df2 = df.drop(2, axis=1)  # drop column 2, which has no transformer attached
data = c.transform(df2)   # raises IndexError
gives
raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
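For reference, a fitted transformer records what it saw during fit; a minimal sketch for inspecting it (feature_names_in_ assumes scikit-learn >= 1.0 and string column names, so it would not be set for the integer-named columns above):
# number of columns seen during fit, which transform will also expect
print(c.n_features_in_)        # 3
# with string column names on scikit-learn >= 1.0, the names and their order:
# print(c.feature_names_in_)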
I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so I wanted to preprocess it first, using either standardization or normalization.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column, but I am struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
# define scaler
scaler = MinMaxScaler()  # or StandardScaler
# a single row (numeric columns only), reshaped to the 2D input the scaler expects
X = df.loc[1, 'Q1-2019':'Q4-2019'].to_numpy().reshape(-1, 1)
# transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read in the linked documentation, you need to provide a tuple containing:
a name for the step
the chosen transformer (e.g. StandardScaler) or a Pipeline
a list of columns to which the transformation should be applied
Code example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']
# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])
# fit and transform the input dataframe
ct.fit_transform(df)
array([[ 0.86955718,  0.93177476,  0.96056682,  0.46493449],
       [ 0.53109031,  0.45544147,  0.41859563,  0.92419906],
       [-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer outputs a numpy array with the transformed values, fitted on the input dataset df. Even though there are no column names any more, the array columns are still ordered the same way as in the input dataframe, so it's easy to convert the array back to a pandas dataframe if you need to.
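For example, a small follow-up sketch reusing the ct and columns defined above:
scaled_df = pd.DataFrame(ct.fit_transform(df), columns=columns, index=df.index)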
In addition to #RicS's answer, note that what the scikit-learn function returns is a numpy array, not a dataframe any more. Also, the Company column is not included. You may consider this to convert the results back into a dataframe:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company",axis=1)) # scale all columns except Company
y = pd.concat([df["Company"],pd.DataFrame(x, columns=df.columns[1:])],axis=1) # adds results and company into dataframe again
y.head()
I am implementing a classification machine learning algorithm. The dataset contains numerical as well as nominal values in no specific order: the numeric and nominal columns are scattered throughout the dataset, and there are exactly 279 columns/attributes. How can I reorder or sort the dataset to have the numerical columns on one side and the nominal columns on the other side of the dataset?
P.S. I need to sort the columns so I can implement standardization in this way:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 49: ] = sc.fit_transform(x_train[:, 49: ])
x_test[:, 49: ] = sc.transform(x_test[:, 49: ])
The link to the dataset is provided below for more information.
https://archive.ics.uci.edu/ml/datasets/Arrhythmia
Select number columns and object (string) columns and then concat them side by side:
df1 = pd.concat((df.select_dtypes('number'), df.select_dtypes('object')), axis=1)
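For example, on a toy frame (hypothetical data, just to illustrate the reordering):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [3.0, 4.0]})
df1 = pd.concat((df.select_dtypes('number'), df.select_dtypes('object')), axis=1)
print(df1.columns.tolist())  # ['a', 'c', 'b'] -- numeric columns first, object columns last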
If I understand correctly, you can first determine the dtype of the columns with
import pandas as pd
import itertools
g = df.columns.to_series().groupby(df.dtypes).groups
And then simply get a list of this new order
columns_list = []
for key in g.keys():
    columns_list.append(g[key].values.tolist())
columns_list = list(itertools.chain(*columns_list))
And finally reindex
df = df.reindex(columns_list, axis=1)
I am having problems normalizing a dask.dataframe.core.DataFrame using dask_ml.preprocessing.MinMaxScaler. I am able to use sklearn.preprocessing.MinMaxScaler, but I wish to use Dask to scale up.
Minimal, Reproducible Example:
# Get data
import dask.dataframe as dd

ddf = dd.read_csv('test.csv')  # see below
ddf = ddf.set_index('index')
# Pivot
ddf = ddf.categorize(columns=['item', 'name'])
ddf_p = ddf.pivot_table(index='item', columns='name', values='value', aggfunc='mean')
col = ddf_p.columns.to_list()
# sklearn version
from sklearn.preprocessing import MinMaxScaler
scaler_s = MinMaxScaler()
scaled_ddf_s = scaler_s.fit_transform(ddf_p[col])  # Works!
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])  # Doesn't work
Error message:
TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one
Not sure what the 'Categorical' is in the pivoted table, but I have tried calling .as_ordered() on the index:
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p = ddf_p.index.cat.as_ordered()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])
But I get the error message:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
Additional information
test.csv:
index,item,name,value
2015-01-01,item_1,A,1
2015-01-01,item_1,B,2
2015-01-01,item_1,C,3
2015-01-01,item_1,D,4
2015-01-01,item_1,E,5
2015-01-02,item_2,A,10
2015-01-02,item_2,B,20
2015-01-02,item_2,C,30
2015-01-02,item_2,D,40
2015-01-02,item_2,E,50
Looking at this answer:
pivot_table produces a column index which is categorical because you made the original column "Field" categorical. Writing the index to parquet calls reset_index on the data-frame, and pandas cannot add a new value to the columns index, because it is categorical. You can avoid this using ddf.columns = list(ddf.columns).
Therefore adding ddf_p.columns = list(ddf_p.columns) solved the problem:
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p.columns = list(ddf_p.columns)  # de-categorize the column index
scaled_values_d = scaler_d.fit_transform(ddf_p[col])  # Works!
I have a data set with some string columns. I'm developing a neural network using this data set, but since it contains string values I can't train the network. What is the best way to convert these string values to a format the neural network can read?
This is the data set that I have
type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,1,0
PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,0,1
I want to convert those type, nameOrig and nameDest fields to a neural-network-readable format.
I have used the method below, but I don't know whether it's right or wrong.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
test_set = pd.read_csv('cs.csv')
test_set['type'] = enc.fit(test_set['type'])
I have gone through the questions below, but most of them did not work for me:
How to convert string based data frame to numeric
converting non-numeric to numeric value using Panda libraries
In this case you can use the pandas datatype category to map strings to indices (see categorical data), so it's not necessary to use scikit-learn's LabelEncoder or OneHotEncoder.
import pandas as pd
df = pd.read_csv('54055554.csv', header=0, dtype={
'type': 'category', # <--
'amount': float,
'nameOrig': str,
'oldbalanceOrg': float,
'newbalanceOrig': float,
'nameDest': str,
'oldbalanceDest': float,
'newbalanceDest': float,
'isFraud': bool,
'isFlaggedFraud': bool
})
print(dict(enumerate(df['type'].cat.categories)))
# {0: 'PAYMENT', 1: 'TRANSFER'}
print(list(df['type'].cat.codes))
# [0, 0, 1]
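To actually feed the network, you can replace the strings with their integer codes (one possible follow-up, using the same df):
df['type'] = df['type'].cat.codes  # PAYMENT -> 0, TRANSFER -> 1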
The data from the CSV:
type, ...
PAYMENT, ...
PAYMENT, ...
TRANSFER, ...
Transformation
First you need to transform the three columns using the LabelEncoder class.
Encoding Categorical Data
Here you have type as a categorical value. For this you can use the OneHotEncoder class available in sklearn.preprocessing.
Avoiding the Dummy Variable Trap
Then you need to avoid the dummy variable trap by removing one of the columns that are used to represent type.
Code
Here is sample code for your reference.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
dataset = pd.read_csv('cs.csv')
X = dataset.iloc[:].values
# label-encode the three string columns (type, nameOrig, nameDest)
labelencoder = LabelEncoder()
X[:, 0] = labelencoder.fit_transform(X[:, 0])
X[:, 2] = labelencoder.fit_transform(X[:, 2])
X[:, 5] = labelencoder.fit_transform(X[:, 5])
# one-hot encode the type column
# (categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22)
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
# Avoiding the Dummy Variable Trap
X = X[:, 1:]
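On recent scikit-learn versions the same idea can be expressed with ColumnTransformer, since OneHotEncoder now handles string columns directly; a sketch under that assumption (drop='first' replaces the manual dummy-variable-trap slice):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

dataset = pd.read_csv('cs.csv')
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(drop='first'), ['type'])],  # drop='first' avoids the dummy variable trap
    remainder='passthrough'  # other columns are passed through unchanged
)
X = ct.fit_transform(dataset)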
You need to encode the string values into numeric ones. What I usually do in this case is create a table for each non-numeric feature that contains all of that feature's possible values; the index of a value in the corresponding feature table is then used when training the model.
Example:
type_values = ['PAYMENT', 'TRANSFER']
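A minimal sketch of that lookup idea, assuming the data sits in a pandas DataFrame df (type_index and the map call are illustrative, not from the original answer):
type_values = ['PAYMENT', 'TRANSFER']
type_index = {v: i for i, v in enumerate(type_values)}  # {'PAYMENT': 0, 'TRANSFER': 1}
df['type'] = df['type'].map(type_index)  # replace each string with its index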
I am trying to use Imputer on a single column called Age to replace missing values, but I get the error "Expected 2D array, got 1D array instead".
Following is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
dataset = pd.read_csv("titanic_train.csv")
dataset.drop('Cabin',axis = 1,inplace = True)
x = dataset.drop('Survived',axis = 1)
y = dataset['Survived']
imputer = Imputer(missing_values ="nan",strategy = "mean",axis = 1)
imputer=imputer.fit(x['Age'])
x['Age']=imputer.transform(x['Age'])
The Imputer is expecting a 2-dimensional array as input, even if one of those dimensions is of length 1. This can be achieved with reshape:
imputer = Imputer(missing_values='NaN', strategy='mean')
imputer.fit(x['Age'].values.reshape(-1, 1))
x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
That said, if you are not doing anything more complicated than filling in missing values with the mean, you might find it easier to skip the Imputer altogether and just use Pandas fillna instead:
x['Age'].fillna(x['Age'].mean(), inplace=True)
Although #thesilkworkm beat me to it, it may be useful to know why exactly your own code doesn't work.
So, apart from the reshape issue, there are two more mistakes in your code. The first is that you erroneously ask for axis=1 in your imputer, while you should ask for axis=0 (which is the default value, and that's why it works when omitted completely, as in #thesilkworkm's answer); from the docs:
axis : integer, optional (default=0)
The axis along which to impute.
If axis=0, then impute along columns.
If axis=1, then impute along rows.
The second mistake is your missing_values argument, which should be 'NaN', and not 'nan'; from the docs again:
missing_values : integer or “NaN”, optional (default=”NaN”)
The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan,
use the string value “NaN”.
So, just to offer an alternative but equivalent solution (beyond the one already provided by #thesilkworkm), you can also fit & transform in one line:
imp = Imputer(missing_values="NaN", strategy="mean", axis=0)
x['Age'] = imp.fit_transform(x['Age'].values.reshape(-1, 1))
When you are fit-transforming, use reshape(-1, 1), because the method expects a 2D array as input but you are giving it a 1D array.
Ex: x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
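As a side note applicable to all the answers above: in scikit-learn 0.22+ the Imputer class was removed in favour of SimpleImputer; a sketch of the equivalent (same mean strategy):
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x['Age'] = imputer.fit_transform(x[['Age']]).ravel()  # 2D in, flattened back to a column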