Delete rows in pandas frame that have other shape - python

I am trying to delete rows in a pandas data frame whose array in the 'MEL' column has a shape other than (99, 13).
path MEL word
0 8d37d10e7f97ddea2eca9d39a4cf821b4457b041.wav [[-10.160675, -13.804866, 0.9188097, 4.415375,... one
1 9a8f761be3fa0d0a963f5612ba73e68cc0ad11ba.wav [[-10.482644, -13.339122, -3.4994812, -5.29343... one
2 314cdc39f628bc68d216498b2080bcc7a549a45f.wav [[-11.076196, -13.980294, -17.289637, -41.0668... one
3 cc499e63eee4a3bcca48b5b452df04990df83570.wav [[-13.830213, -12.64104, -3.7780707, -10.76490... one
4 38cdcc4d9432ce4a2fe63e0998dbca91e64b954a.wav [[-11.967776, -23.27864, -10.3656, -8.786977, ... one
I have tried the following:
indexNames = merged[ merged['MEL'].shape != (99,13) ].index
merged.drop(indexNames , inplace=True)
The first line of code, however, gives me KeyError: True. Does anyone have an idea how to make this work?

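The condition
merged['MEL'].shape != (99,13)
compares the shape of the whole column (a one-element tuple holding the number of rows) with (99, 13), so it evaluates to a single True or False rather than to one value per row. Indexing merged with that single True is what raises KeyError: True.
Please note that you may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame). More here: https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing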
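To make the difference concrete, here is a minimal sketch. The three-row frame is a hypothetical stand-in for merged (the real data is not available), and the names keep and filtered are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `merged`: two arrays of shape (99, 13), one of another shape.
merged = pd.DataFrame({
    "path": ["a.wav", "b.wav", "c.wav"],
    "MEL": [np.zeros((99, 13)), np.zeros((80, 13)), np.zeros((99, 13))],
    "word": ["one", "one", "one"],
})

# Shape of the column itself: one entry per row, unrelated to the arrays inside.
print(merged["MEL"].shape)  # (3,)

# One boolean per row, computed from the array stored in each cell.
keep = merged["MEL"].apply(lambda arr: arr.shape == (99, 13))

# Boolean indexing keeps the rows where the mask is True.
filtered = merged[keep]
print(filtered.index.tolist())  # [0, 2]
```

Because keep has the same length as the index, it can be passed straight to merged[...] or merged.loc[...].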
EDIT: This code might help
import numpy as np
import pandas as pd
# generate sample dataset
df = pd.DataFrame(data = {'col1': [np.random.rand(3,2),np.random.rand(5,2),np.random.rand(7,8),np.random.rand(5,2)],
                          'col2': ['b','a','b','q'],
                          'col3': ['c','c','c','q'],
                          'col4': ['d','d','d','q'],
                          'col5': ['e','e','a','q'] })
# drop every row whose array in 'col1' does not have shape (5, 2)
for index in df.index:
    if df.loc[index, 'col1'].shape != (5, 2):
        df.drop(index, inplace=True)
EDIT2: Without a loop:
df = pd.DataFrame(data = {'col1': [np.random.rand(3,2),np.random.rand(5,2),np.random.rand(7,8),np.random.rand(5,2)],
'col2': ['b','a','b','q'],
'col3': ['c','c','c','q'],
'col4': ['d','d','d','q'],
'col5': ['e','e','a','q'] })
# keep only the rows whose array has shape (5, 2); the mask holds one boolean per row
keep = [x.shape == (5, 2) for x in df['col1']]
df = df[keep]

... In other words, you want all the rows where the array in the 'MEL' column has the shape (99, 13). I would do
my_desired_df = merged[merged['MEL'].apply(lambda x: x.shape == (99, 13))]

You need to get a series of the shapes
df['MEL'].apply(lambda x: x.shape)
Then you can test this to get a boolean series
df['MEL'].apply(lambda x: x.shape) == (99, 13)
And then index with the boolean series
new_df = df.loc[df['MEL'].apply(lambda x: x.shape) == (99, 13), :]
This will give you everything that matches your shape. It's probably easier to do it this way than to play with df.drop(), but you could do that too with
correct = df['MEL'].apply(lambda x: x.shape) == (99, 13)
new_df = df.drop(correct[~correct].index)

Related

My dataframe is not changed when I run for loop(comparing two dataframes)

I have two datasets, one with over 100,000 rows and 300 columns and the other with 200 rows and 6 columns.
I'm comparing these two datasets and updating df1 from df2 using a for loop.
Here is the sample dataset:
df1:
KEY MAIN_METHOD DRUG_ETCDTL
0 100944 1 unknown
1 67488 20 unknown
2 101476 20 unknown
3 102549 1 sleepingpill_plunitrazeparm
4 103227 1 some drug
df2:
5. 방법/수단 Unnamed: 4
0 100944 sleepingpill_unknown
1 100984 others_green material
2 101476 others_anorexia
3 102549 sleepingpill_plunitrazeparm
4 103227 sleepingpill_pentobarbytal
and here is the code that I tried:
for i in range(0,4):
    index_key = df2['5. 방법/수단'][i]
    index_rawdata = df1.loc[df1['KEY']==index_key,'DRUG_ETCDTL'].index[0]
    method1 = df1['DRUG_ETCDTL'][index_rawdata]
    method2 = df1['METHOD_ETCDTL'][index_rawdata]
    # split df2
    mainmethod = df2['Unnamed: 4'].str.split('_',expand=False)
    mainmethod[i][0] = mainmethod[i][0].replace('sleepingpill','1').replace('others','20')
    # change the type so we can compare it with df1
    mainmethod[i][0] = int(mainmethod[i][0])
    if (mainmethod[i][1] == 1) & (df1['MAIN_METHOD'][index_rawdata] == 1):
        method1 = mainmethod[i][1]
    elif (mainmethod[i][1] == 20) & (df1['MAIN_METHOD'][index_rawdata] == 20):
        method2 = mainmethod[i][1]
So df1 should be changed, but when I print df1 it is not changed.
The desired output is:
KEY MAIN_METHOD DRUG_ETCDTL
0 100944 1 unknown
1 67488 20 unknown
2 101476 20 anorexia
3 102549 1 plunitrazeparm
4 103227 1 pentobarbytal
NOTE: I approached this for loop method since I didn't want to manipulate df2
To address the issue of the different column sizes, this solution manipulates the indexes of the two data frames before performing an update of df1 using the pandas.DataFrame.update() method. The update method aligns the data frames using the index values and updates the values in columns with matching names.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'KEY': [100944, 67488, 101476, 102549, 103227, 123456],
'MAIN_METHOD': [1, 20, 20, 1, 1, 20],
'DRUG_ETCDTL': ['unknown', 'unknown', 'unknown', 'sleepingpill_plunitrazeparm', 'some drug', 'something extra']
}, index=np.arange(111,117))
df2 = pd.DataFrame({
'5. 방법/수단': [100944, 100984, 101476, 102549, 103227],
'Unnamed: 4': ['sleepingpill_unknown', 'others_green material', 'others_anorexia', 'sleepingpill_plunitrazeparm', 'sleepingpill_pentobarbytal']
})
# make a temporary copy of 'df2'
tmp_df = df2[['5. 방법/수단', 'Unnamed: 4']].copy()
# rename columns
tmp_df.columns = ['KEY', 'METHOD_DRUG']
# split the string to get 'METHOD' and 'DRUG_ETCDTL' information
tmp_df[['METHOD','DRUG_ETCDTL']] = tmp_df['METHOD_DRUG'].str.split('_', expand=True)
# use a map to create 'MAIN_METHOD' column
method_map = { 'sleepingpill': 1, 'others': 20 }
tmp_df['MAIN_METHOD'] = tmp_df['METHOD'].map(method_map)
# drop all unwanted DataFrame columns
tmp_df.drop(['METHOD_DRUG', 'METHOD'], inplace=True, axis=1)
# make a copy of the index of df1
index_copy = df1.index.copy()
# make 'KEY' and 'MAIN_METHOD' columns the new index
df1.set_index(['KEY', 'MAIN_METHOD'], inplace=True, append=False, drop=True)
# create the same index for tmp_df
tmp_df.set_index(['KEY', 'MAIN_METHOD'], inplace=True, append=False, drop=True)
# update df1 with the values in df2
df1.update(tmp_df)
# restore the 'KEY' and 'MAIN_METHOD' columns in df1
df1.reset_index(inplace=True)
# restore the original index
df1.set_index(index_copy, inplace=True, append=False, drop=True)
# delete the temporary data frame
del tmp_df
# delete the copy of the df1 index
del index_copy
ORIGINAL SOLUTION: This works when there are the same number of columns in both data frames.
This solution avoids for loops and instead uses a temporary data frame to perform the task. The strings in the Unnamed: 4 column are split using the str.split() function provided by Pandas. The MAIN_METHOD information is transformed using a mapping. The df1 data frame is conditionally updated using numpy.where() before the temporary data frame is deleted.
EDIT: The code has been modified to convert the temporary data frame column series to a numpy array using .values to avoid the error:
ValueError: Can only compare identically-labeled Series objects
Modified np.where() conditions:
df1['DRUG_ETCDTL'] = np.where(((df1['KEY']==tmp_df['KEY'].values) &
(df1['MAIN_METHOD']==tmp_df['MAIN_METHOD'].values)),
tmp_df['DRUG_ETCDTL'],
df1['DRUG_ETCDTL'])
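An alternative that avoids the error is to use .equals() instead of == when performing the comparison. Note that .equals() returns a single boolean for the whole Series rather than one value per row, so the column is replaced only when both comparisons hold for the entire column.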
df1['DRUG_ETCDTL'] = np.where(((df1['KEY'].equals(tmp_df['KEY'])) &
(df1['MAIN_METHOD'].equals(tmp_df['MAIN_METHOD']))),
tmp_df['DRUG_ETCDTL'],
df1['DRUG_ETCDTL'])
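The two comparisons behave quite differently, which a small sketch may make clearer. The frames left and right below are hypothetical miniatures (not the df1 and df2 of the question), chosen so that the index labels differ as they do in the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frames with different index labels.
left = pd.DataFrame({"KEY": [1, 2, 3], "VAL": ["a", "b", "c"]}, index=[11, 12, 13])
right = pd.DataFrame({"KEY": [1, 9, 3], "VAL": ["x", "y", "z"]})

# Elementwise: compare against the raw values so index labels are ignored.
rowwise = left["KEY"] == right["KEY"].to_numpy()
print(rowwise.tolist())  # [True, False, True]

# Whole-Series: a single boolean, False here because the values differ.
whole = left["KEY"].equals(right["KEY"])
print(whole)  # False

# np.where applies the elementwise mask row by row.
updated = np.where(rowwise, right["VAL"].to_numpy(), left["VAL"].to_numpy())
print(updated.tolist())  # ['x', 'b', 'z']
```

With the single boolean from .equals(), np.where would pick one branch for every row at once, which is rarely what a row-by-row update needs.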
Original code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'KEY': [100944, 67488, 101476, 102549, 103227],
'MAIN_METHOD': [1, 20, 20, 1, 1],
'DRUG_ETCDTL': ['unknown', 'unknown', 'unknown', 'sleepingpill_plunitrazeparm', 'some drug']
}, index=np.arange(11,16))
df2 = pd.DataFrame({
'5. 방법/수단': [100944, 100984, 101476, 102549, 103227],
'Unnamed: 4': ['sleepingpill_unknown', 'others_green material', 'others_anorexia', 'sleepingpill_plunitrazeparm', 'sleepingpill_pentobarbytal']
})
# make a temporary copy of 'df2'
tmp_df = df2[['5. 방법/수단', 'Unnamed: 4']].copy()
# rename columns
tmp_df.columns = ['KEY', 'METHOD_DRUG']
# split the string to get 'METHOD' and 'DRUG_ETCDTL' information
tmp_df[['METHOD', 'DRUG_ETCDTL']] = tmp_df['METHOD_DRUG'].str.split('_', expand=True)
# use a mapping to create 'MAIN_METHOD' column
method_map = { 'sleepingpill': 1, 'others': 20 }
tmp_df['MAIN_METHOD'] = tmp_df['METHOD'].map(method_map)
# drop unwanted columns (This step is optional)
tmp_df.drop(['METHOD_DRUG', 'METHOD'], inplace=True, axis=1)
# update 'df1'
df1['DRUG_ETCDTL'] = np.where(((df1['KEY']==tmp_df['KEY'].values) &
(df1['MAIN_METHOD']==tmp_df['MAIN_METHOD'].values)),
tmp_df['DRUG_ETCDTL'],
df1['DRUG_ETCDTL'])
# delete temporary copy of 'df2'
del tmp_df

Python Pandas How to combine columns according to the condition of checking another column

I have a DataFrame:
value,combined,value_shifted,Sequence_shifted,long,short
12834.0,2.0,12836.0,3.0,2.0,-2.0
12813.0,-2.0,12781.0,-3.0,-32.0,32.0
12830.0,2.0,12831.0,3.0,1.0,-1.0
12809.0,-2.0,12803.0,-3.0,-6.0,6.0
12822.0,2.0,12805.0,3.0,-17.0,17.0
12800.0,-2.0,12807.0,-3.0,7.0,-7.0
12773.0,2.0,12772.0,3.0,-1.0,1.0
12786.0,-2.0,12787.0,1.0,1.0,-1.0
12790.0,2.0,12784.0,3.0,-6.0,6.0
I want to combine the long and short columns according to the value of the combined column
If df.combined == 2 then we leave the value long
If df.combined == -2 then we leave the value short
Expected result:
value,combined,value_shifted,Sequence_shifted,calc
12834.0,2.0,12836.0,3.0,2.0
12813.0,-2.0,12781.0,-3.0,32
12830.0,2.0,12831.0,3.0,1.0
12809.0,-2.0,12803.0,-3.0,6.0
12822.0,2.0,12805.0,3.0,-17
12800.0,-2.0,12807.0,-3.0,-1.0
12773.0,2.0,12772.0,3.0,-1.0
12786.0,-2.0,12787.0,1.0,-6.0
12790.0,2.0,12784.0,3.0,20.0
If the combined column can contain values other than 2 and -2, use numpy.select:
df['calc'] = np.select([df['combined'].eq(2), df['combined'].eq(-2)],
[df['long'], df['short']])
Or, if it contains only 2 and -2, use numpy.where:
df['calc'] = np.where(df['combined'].eq(2), df['long'], df['short'])
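As a small illustration of the difference, the sketch below uses a made-up four-row frame whose last row has combined equal to 0. numpy.select fills rows that match no condition with its default, which is 0 unless stated otherwise; passing default=np.nan here is an assumption about what is wanted for such rows.

```python
import numpy as np
import pandas as pd

# Made-up sample: the last row matches neither 2 nor -2.
df = pd.DataFrame({
    "combined": [2.0, -2.0, 2.0, 0.0],
    "long": [2.0, -32.0, 1.0, 5.0],
    "short": [-2.0, 32.0, -1.0, -5.0],
})

# Conditions are checked in order; rows matching none get the default.
df["calc"] = np.select(
    [df["combined"].eq(2), df["combined"].eq(-2)],
    [df["long"], df["short"]],
    default=np.nan,
)
print(df["calc"].tolist())  # [2.0, 32.0, 1.0, nan]
```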
Try this:
df['calc'] = df['long'].where(df['combined'] == 2, df['short'])
df['calc'] = np.nan
mask_2 = df['combined'] == 2
df.loc[mask_2, 'calc'] = df.loc[mask_2, 'long']
mask_minus_2 = df['combined'] == -2
df.loc[mask_minus_2, 'calc'] = df.loc[mask_minus_2, 'short']
then you can drop the long and short columns:
df.drop(columns=['long', 'short'], inplace=True)

How do I check if a value is of, or promotable to, a column type in pandas?

For example, suppose I have the following DataFrame.
import pandas as pd
df = pd.DataFrame([['a', 1.3, 10], ['b', 2, 20]], columns=['id', 'v1', 'v2'])
df = df.astype({col: 'category' for col in df.columns[df.dtypes == object]})
print(df)
print()
print(df.dtypes)
id v1 v2
0 a 1.3 10
1 b 2.0 20
id category
v1 float64
v2 int64
Given a value and a column identifier, I need to know whether the type of the value is compatible with the column. (Of the same type or promotable.)
For category fields, I'd like to know if a value is in the category. I can do something like
'x' in df['id'].unique()
but there may be a more efficient way.
Thanks.
Suppose you have a value x = 4.3.
df.v1.dtype gives the data type of the column you want to compare against (v1 in this case), so you can simply compare:
type(x) == df.v1.dtype
# output: True
I think 'x' in df['id'].unique() will check if the value 'x' is in the column—which is not the same as checking if it is one of the categories.
Based on this answer, it looks like you can check whether a value is one of the categories as follows:
'x' in df["id"].cat.categories
Test:
ids = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
assert(("c" in ids) is False)
assert(("c" in ids.categories) is True)
UPDATE:
Is this what you wanted?
def check_type(x, df, name):
    try:
        return x in df[name].cat.categories
    except AttributeError:
        return type(x) == df[name].dtype
Test:
assert(check_type('a', df, "ids") is True)
assert(check_type('c', df, "ids") is True)
assert(check_type(3.4, df, "ids") is False)
assert(check_type(3.4, df, "v1") is True)
assert(check_type(3.4, df, "v2") is False)
assert(check_type(3, df, "v1") is False)
assert(check_type(3, df, "v2") is True)
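The question also asks about values that are promotable to the column type. One possible sketch, assuming that "promotable" means a safe NumPy cast, is shown below. The helper is_compatible is hypothetical (it is not a pandas API), and it handles only categorical and plain NumPy-typed columns.

```python
import numpy as np
import pandas as pd

def is_compatible(value, column):
    """Return True if `value` fits `column` (hypothetical helper, not a pandas API)."""
    if isinstance(column.dtype, pd.CategoricalDtype):
        # Categorical column: the value must be one of the declared categories.
        return value in column.cat.categories
    try:
        # Other columns: allow only safe casts, e.g. int -> float64.
        return bool(np.can_cast(np.dtype(type(value)), column.dtype, casting="safe"))
    except TypeError:
        # Dtypes that NumPy cannot interpret are treated as incompatible.
        return False

df = pd.DataFrame([["a", 1.3, 10], ["b", 2, 20]], columns=["id", "v1", "v2"])
df["id"] = df["id"].astype("category")

print(is_compatible(3, df["v1"]))    # True: an int can be promoted to float64
print(is_compatible(3.4, df["v2"]))  # False: a float cannot be safely cast to int64
```

The check looks only at the Python type of the value, so a float such as 3.0 is still rejected for an integer column even though it holds a whole number.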

Finding index of a closest value from another dataframe

I have two dataframes measuring two properties from an instrument, where the depths are offset for a certain dz. Note that the example below is extremely simplified.
df1 = pd.DataFrame({'depth_1': [0.936250, 0.959990, 0.978864, 0.991288, 1.023876, 1.045801, 1.062768, 1.077090, 1.101248, 1.129754, 1.147458, 1.160193, 1.191206, 1.218595, 1.256964] })
df2 = pd.DataFrame({'depth_2': [0.620250, 0.643990, 0.662864, 0.675288, 0.707876, 0.729801, 0.746768, 0.761090, 0.785248, 0.813754, 0.831458, 0.844193, 0.875206, 0.902595, 0.940964 ] })
How do I get the index of df2.depth_2 that is closest to the first element of df1.depth_1?
Using reindex with method nearest
df2.reset_index().set_index('depth_2').reindex(df1.depth_1,method = 'nearest')['index'].unique()
Out[265]: array([14], dtype=int64)
You can use the pandas merge_asof function (you will need to sort your data first if it isn't already sorted)
df1 = df1.sort_values(by='depth_1')
df2 = df2.sort_values(by='depth_2')
pd.merge_asof(df1, df2.reset_index(), left_on="depth_1", right_on="depth_2", direction="nearest")
If you just want that for the first value in df1, you could do the join on the top row:
df2 = df2.sort_values(by='depth_2')
pd.merge_asof(df1.head(1), df2.reset_index(), left_on="depth_1", right_on="depth_2", direction="nearest")
Get the absolute difference between all elements of df2 and the first element of df1, and then get its index:
import pandas as pd
import numpy as np
def get_closest(df1, df2, idx):
    # absolute distance from df1['depth_1'][idx] to every value in df2['depth_2']
    abs_diff = np.array([abs(df1['depth_1'][idx]-item) for item in df2['depth_2']])
    # position of the smallest distance
    return abs_diff.argmin()
df1 = pd.DataFrame({'depth_1': [0.936250, 0.959990, 0.978864, 0.991288, 1.023876, 1.045801, 1.062768, 1.077090, 1.101248, 1.129754, 1.147458, 1.160193, 1.191206, 1.218595, 1.256964] })
df2 = pd.DataFrame({'depth_2': [0.620250, 0.643990, 0.662864, 0.675288, 0.707876, 0.729801, 0.746768, 0.761090, 0.785248, 0.813754, 0.831458, 0.844193, 0.875206, 0.902595, 0.940964 ] })
get_closest(df1,df2,0)
Output:
14
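The same idea can be written without the Python-level loop. This is a minimal sketch on shortened, illustrative versions of the two frames; target and closest_label are names introduced here.

```python
import pandas as pd

# Shortened versions of the frames from the question.
df1 = pd.DataFrame({"depth_1": [0.936250, 0.959990, 0.978864]})
df2 = pd.DataFrame({"depth_2": [0.620250, 0.785248, 0.902595, 0.940964]})

# First depth of df1, then the index label of the nearest depth in df2.
target = df1["depth_1"].iloc[0]
closest_label = (df2["depth_2"] - target).abs().idxmin()
print(closest_label)  # 3
```

idxmin returns an index label, whereas argmin in the function above returns a position; the two coincide only for a default RangeIndex.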

Pandas - Convert categorical values to number scale and create new column with replacements (not labelencode) [duplicate]

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather just have one big LabelEncoder object that works across all my columns of data.
Throwing the entire DataFrame into LabelEncoder creates the below error. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string labeled data, so need a solution that doesn't reference any columns by name.
import pandas
from sklearn import preprocessing
df = pandas.DataFrame({
'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
'New_York']
})
le = preprocessing.LabelEncoder()
le.fit(df)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
y = column_or_1d(y, warn=True)
File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)
Any thoughts on how to get around this problem?
You can easily do this though,
df.apply(LabelEncoder().fit_transform)
EDIT2:
In scikit-learn 0.20, the recommended way is
OneHotEncoder().fit_transform(df)
as the OneHotEncoder now supports string input.
Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.
EDIT:
Since this original answer was written over a year ago and has generated many upvotes (including a bounty), I should probably extend it further.
For inverse_transform and transform, you have to use a bit of a hack.
from collections import defaultdict
d = defaultdict(LabelEncoder)
With this, you now retain the LabelEncoder of every column in a dictionary.
# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))
# Inverse the encoded
fit.apply(lambda x: d[x.name].inverse_transform(x))
# Using the dictionary to label future data
df.apply(lambda x: d[x.name].transform(x))
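Put together, the round trip might look like the sketch below. The two-column frame is a shortened version of the dummy data from the question, and the names encoders, encoded and decoded are introduced here for illustration.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey"],
    "owner": ["Champ", "Ron", "Brick", "Champ"],
})

# One LabelEncoder per column, created on first access by column name.
encoders = defaultdict(LabelEncoder)

# Fit and encode every column, then map the integers back to the labels.
encoded = df.apply(lambda col: encoders[col.name].fit_transform(col))
decoded = encoded.apply(lambda col: encoders[col.name].inverse_transform(col))

print(encoded["pets"].tolist())   # [0, 1, 0, 2]
print(decoded["owner"].tolist())  # ['Champ', 'Ron', 'Brick', 'Champ']
```

Keying the dictionary by column name is what lets the same encoder be found again for transform and inverse_transform.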
MOAR EDIT:
Using Neuraxle's FlattenForEach step, it's possible to do this as well to use the same LabelEncoder on all the flattened data at once:
FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df)
For using separate LabelEncoders depending for your columns of data, or if only some of your columns of data needs to be label-encoded and not others, then using a ColumnTransformer is a solution that allows for more control on your column selection and your LabelEncoder instances.
As mentioned by larsmans, LabelEncoder() only takes a 1-d array as an argument. That said, it is quite easy to roll your own label encoder that operates on multiple columns of your choosing, and returns a transformed dataframe. My code here is based in part on Zac Stewart's excellent blog post.
Creating a custom encoder involves simply creating a class that responds to the fit(), transform(), and fit_transform() methods. In your case, a good start might be something like this:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
# Create some toy data in a Pandas dataframe
fruit_data = pd.DataFrame({
'fruit': ['apple','orange','pear','orange'],
'color': ['red','orange','green','green'],
'weight': [5,6,3,4]
})
class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode
    def fit(self,X,y=None):
        return self # not relevant here
    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            # DataFrame.items() replaces iteritems(), which newer pandas no longer provides
            for colname,col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output
    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)
Suppose we want to encode our two categorical attributes (fruit and color), while leaving the numeric attribute weight alone. We could do this as follows:
MultiColumnLabelEncoder(columns = ['fruit','color']).fit_transform(fruit_data)
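This encodes the fruit and color columns of fruit_data as integer labels and leaves weight unchanged (the before and after tables were images in the original post).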
Passing it a dataframe consisting entirely of categorical variables and omitting the columns parameter will result in every column being encoded (which I believe is what you were originally looking for):
MultiColumnLabelEncoder().fit_transform(fruit_data.drop('weight',axis=1))
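Here every remaining column (fruit and color) is encoded (the before and after tables were images in the original post).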
Note that it'll probably choke when it tries to encode attributes that are already numeric (add some code to handle this if you like).
Another nice feature about this is that we can use this custom transformer in a pipeline:
encoding_pipeline = Pipeline([
('encoding',MultiColumnLabelEncoder(columns=['fruit','color']))
# add more pipeline steps as needed
])
encoding_pipeline.fit_transform(fruit_data)
Since scikit-learn 0.20 you can use sklearn.compose.ColumnTransformer and sklearn.preprocessing.OneHotEncoder:
If you only have categorical variables, use OneHotEncoder directly:
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder(handle_unknown='ignore').fit_transform(df)
If you have heterogeneously typed features:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder
categorical_columns = ['pets', 'owner', 'location']
numerical_columns = ['age', 'weight', 'height']
# each tuple is (transformer, columns); early releases used (columns, transformer)
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_columns),
    (RobustScaler(), numerical_columns))
column_trans.fit_transform(df)
More options in the documentation: http://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
We don't need a LabelEncoder.
You can convert the columns to categoricals and then get their codes. I used a dictionary comprehension below to apply this process to every column and wrap the result back into a dataframe of the same shape with identical indices and column names.
>>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
location owner pets
0 1 1 0
1 0 2 1
2 0 0 0
3 1 1 2
4 1 3 1
5 0 2 1
To create a mapping dictionary, you can just enumerate the categories using a dictionary comprehension:
>>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)}
for col in df}
{'location': {0: 'New_York', 1: 'San_Diego'},
'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
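For a single column, the codes and the categories are enough to go back and forth. The sketch below uses a one-column illustrative frame and assumes there are no missing values; pandas assigns the code -1 to missing entries, and indexing the categories with -1 would silently pick the last category.

```python
import pandas as pd

df = pd.DataFrame({"pets": ["cat", "dog", "cat", "monkey"]})

# Convert once, then read the integer codes off the categorical.
as_cat = df["pets"].astype("category")
codes = as_cat.cat.codes
print(codes.tolist())  # [0, 1, 0, 2]

# Map the codes back to labels by indexing the categories (no missing values assumed).
restored = as_cat.cat.categories[codes.to_numpy()]
print(restored.tolist())  # ['cat', 'dog', 'cat', 'monkey']
```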
This does not directly answer your question (for which Naputipulu Jon and PriceHardman have fantastic replies).
However, for the purpose of a few classification tasks etc. you could use
pandas.get_dummies(input_df)
This accepts a dataframe with categorical data and returns a dataframe with binary values. Variable values are encoded into column names in the resulting dataframe.
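A short sketch of what that looks like on a two-column illustrative frame (input_df here is made up for the example):

```python
import pandas as pd

input_df = pd.DataFrame({
    "pets": ["cat", "dog", "cat"],
    "location": ["San_Diego", "New_York", "New_York"],
})

# Each distinct value becomes its own indicator column named <column>_<value>.
dummies = pd.get_dummies(input_df)
print(sorted(dummies.columns))
# ['location_New_York', 'location_San_Diego', 'pets_cat', 'pets_dog']
```

Unlike label encoding, this produces one column per distinct value, so the width of the result grows with the number of categories.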
It is possible to do this all in pandas directly, and the task is well suited to a distinctive feature of the replace method.
First, let's make a dictionary of dictionaries mapping the columns and their values to their new replacement values.
transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d
transform_dict
{'location': {'New_York': 0, 'San_Diego': 1},
'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
'pets': {'cat': 0, 'dog': 1, 'monkey': 2}}
Since this will always be a one to one mapping, we can invert the inner dictionary to get a mapping of the new values back to the original.
inverse_transform_dict = {}
for col, d in transform_dict.items():
    inverse_transform_dict[col] = {v:k for k, v in d.items()}
inverse_transform_dict
{'location': {0: 'New_York', 1: 'San_Diego'},
'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
Now, we can use the ability of the replace method to take a nested dictionary, using the outer keys as the columns and the inner keys as the values we would like to replace.
df.replace(transform_dict)
location owner pets
0 1 1 0
1 0 2 1
2 0 0 0
3 1 1 2
4 1 3 1
5 0 2 1
We can easily go back to the original by again chaining the replace method
df.replace(transform_dict).replace(inverse_transform_dict)
location owner pets
0 San_Diego Champ cat
1 New_York Ron dog
2 New_York Brick cat
3 San_Diego Champ monkey
4 San_Diego Veronica dog
5 New_York Ron dog
This is a year and a half after the fact, but I too needed to be able to .transform() multiple pandas dataframe columns at once (and be able to .inverse_transform() them as well). This expands upon the excellent suggestion of @PriceHardman above:
import numpy as np
from sklearn.preprocessing import LabelEncoder
class MultiColumnLabelEncoder(LabelEncoder):
    """
    Wraps sklearn LabelEncoder functionality for use on multiple columns of a
    pandas dataframe.
    """
    def __init__(self, columns=None):
        self.columns = columns
    def fit(self, dframe):
        """
        Fit label encoder to pandas columns.
        Access individual column classes via indexing `self.all_classes_`
        Access individual column encoders via indexing
        `self.all_encoders_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                # fit LabelEncoder to get `classes_` for the column
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                # append this column's encoder
                self.all_encoders_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            # the encoder container has to be created in this branch too
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
        return self
    def fit_transform(self, dframe):
        """
        Fit label encoder and return encoded labels.
        Access individual column classes via indexing
        `self.all_classes_`
        Access individual column encoders via indexing
        `self.all_encoders_`
        Access individual column encoded labels via indexing
        `self.all_labels_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                          dtype=object)
            for idx, column in enumerate(self.columns):
                # instantiate LabelEncoder
                le = LabelEncoder()
                # fit and transform labels in the column
                dframe.loc[:, column] =\
                    le.fit_transform(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
                # store the encoded labels (the original stored the encoder here)
                self.all_labels_[idx] = dframe.loc[:, column].values
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            # the encoder container has to be created in this branch too
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                dframe.loc[:, column] = le.fit_transform(
                    dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
        return dframe.loc[:, self.columns].values
    def transform(self, dframe):
        """
        Transform labels to normalized encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[
                    idx].transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values
    def inverse_transform(self, dframe):
        """
        Transform labels back to original encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values
Example:
If df and df_copy are mixed-type pandas dataframes, you can apply the MultiColumnLabelEncoder() to the dtype=object columns in the following way:
# get `object` columns
df_object_columns = df.iloc[:, :].select_dtypes(include=['object']).columns
df_copy_object_columns = df_copy.iloc[:, :].select_dtypes(include=['object']).columns
# instantiate `MultiColumnLabelEncoder`
mcle = MultiColumnLabelEncoder(columns=df_object_columns)
# fit to `df` data
mcle.fit(df)
# transform the `df` data
mcle.transform(df)
# returns output like below
array([[1, 0, 0, ..., 1, 1, 0],
[0, 5, 1, ..., 1, 1, 2],
[1, 1, 1, ..., 1, 1, 2],
...,
[3, 5, 1, ..., 1, 1, 2],
# transform `df_copy` data
mcle.transform(df_copy)
# returns output like below (assuming the respective columns
# of `df_copy` contain the same unique values as that particular
# column in `df`
array([[1, 0, 0, ..., 1, 1, 0],
[0, 5, 1, ..., 1, 1, 2],
[1, 1, 1, ..., 1, 1, 2],
...,
[3, 5, 1, ..., 1, 1, 2],
# inverse `df` data
mcle.inverse_transform(df)
# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
...,
['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)
# inverse `df_copy` data
mcle.inverse_transform(df_copy)
# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
...,
['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)
You can access individual column classes, column labels, and column encoders used to fit each column via indexing:
mcle.all_classes_
mcle.all_encoders_
mcle.all_labels_
No, LabelEncoder does not do this. It takes 1-d arrays of class labels and produces 1-d arrays. It's designed to handle class labels in classification problems, not arbitrary data, and any attempt to force it into other uses will require code to transform the actual problem to the problem it solves (and the solution back to the original space).
Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder() object that can be used to represent your columns, all you have to do is:
le.fit(df.columns)
In the above code you will have a unique number corresponding to each column.
More precisely, you will have a 1:1 mapping of df.columns to le.transform(df.columns.to_numpy()). To get a column's encoding, simply pass it to le.transform(...). As an example, the following will get the encoding for each column:
le.transform(df.columns.to_numpy())
Assuming you want to create a sklearn.preprocessing.LabelEncoder() object for all of your row labels you can do the following:
le.fit([y for x in df.to_numpy() for y in x])
In this case, you most likely have non-unique row labels (as shown in your question). To see what classes the encoder created you can do le.classes_. You'll note that this should have the same elements as in set(y for x in df.to_numpy() for y in x). Once again, to convert a row label to an encoded label use le.transform(...). As an example, if you want to retrieve the label for the first column in the df.columns array and the first row, you could do this:
le.transform([df.at[0, df.columns[0]]])
The question you had in your comment is a bit more complicated, but can still
be accomplished:
le.fit([str(z) for z in set((x[0], y) for x in df.items() for y in x[1])])
The above code does the following:
Make a unique combination of all of the pairs of (column, row)
Represent each pair as a string version of the tuple. This is a workaround to overcome the LabelEncoder class not supporting tuples as a class name.
Fits the new items to the LabelEncoder.
Now to use this new model it's a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:
le.transform([str((df.columns[0], df.at[0, df.columns[0]]))])
Remember that each lookup is now a string representation of a tuple that
contains the (column, row).
I checked the source code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py) of LabelEncoder. It is based on a set of numpy transformations, one of which is np.unique(), and this function only takes 1-d array input (correct me if I am wrong).
Very rough ideas...
First, identify which columns need a LabelEncoder, then loop through each column.
import pandas as pd

def cat_var(df):
    """Identify categorical features.

    Parameters
    ----------
    df: original df after missing operations

    Returns
    -------
    cat_var_df: summary df with col index and col name for all categorical vars
    """
    col_type = df.dtypes
    col_names = list(df)
    cat_var_index = [i for i, x in enumerate(col_type) if x=='object']
    cat_var_name = [x for i, x in enumerate(col_names) if i in cat_var_index]
    cat_var_df = pd.DataFrame({'cat_ind': cat_var_index,
                               'cat_name': cat_var_name})
    return cat_var_df
from sklearn.preprocessing import LabelEncoder

def column_encoder(df, cat_var_list):
    """Encoding categorical feature in the dataframe

    Parameters
    ----------
    df: input dataframe
    cat_var_list: categorical feature index and name, from cat_var function

    Return
    ------
    df: new dataframe where categorical features are encoded
    label_list: classes_ attribute for all encoded features
    """
    label_list = []
    cat_var_df = cat_var(df)
    cat_list = cat_var_df.loc[:, 'cat_name']
    for index, cat_feature in enumerate(cat_list):
        le = LabelEncoder()
        le.fit(df.loc[:, cat_feature])
        label_list.append(list(le.classes_))
        df.loc[:, cat_feature] = le.transform(df.loc[:, cat_feature])
    return df, label_list
The returned df is the encoded one, and label_list shows what each encoded value means in the corresponding column.
This is a snippet from a data processing script I wrote for work. Let me know if you think there could be any further improvement.
EDIT:
Just want to mention that the methods above work best with data frames that have no missing values. I am not sure how they behave on a data frame that contains missing data (I ran a missing-data procedure before executing the methods above).
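One possible way to deal with missing values before encoding, as a sketch only: fill them with an explicit placeholder so they get their own class. The placeholder string "__missing__" and the toy column are assumptions made for this example, not part of the script above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up column with one missing entry.
df = pd.DataFrame({"color": ["red", np.nan, "blue", "red"]})

# Replace NaN with a placeholder label (an arbitrary choice for this sketch).
filled = df["color"].fillna("__missing__")

le = LabelEncoder()
df["color_code"] = le.fit_transform(filled)
print(list(le.classes_))
print(df)
```

Whether a placeholder class is appropriate depends on the model; imputing or dropping the rows are alternatives.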
A short way to LabelEncoder() multiple columns with a dict():
from sklearn.preprocessing import LabelEncoder
le_dict = {col: LabelEncoder() for col in columns}
for col in columns:
    df[col] = le_dict[col].fit_transform(df[col])
and you can use this le_dict to label-encode the same columns in another dataframe:
le_dict[col].transform(df_another[col])
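Keeping one encoder per column in a dict also makes the round trip back to the original labels straightforward. A small sketch with made-up data (the frame and column names are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data for illustration.
df = pd.DataFrame({"size": ["S", "M", "L", "M"],
                   "color": ["red", "blue", "red", "red"]})
columns = ["size", "color"]

# One fitted encoder per column, kept for later reuse.
le_dict = {col: LabelEncoder() for col in columns}
encoded = df.copy()
for col in columns:
    encoded[col] = le_dict[col].fit_transform(df[col])

# Decode with the matching encoder for each column.
decoded = encoded.copy()
for col in columns:
    decoded[col] = le_dict[col].inverse_transform(encoded[col])

print(encoded)
print(decoded)
```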
If you have both numerical and categorical data in the dataframe, you can use the following. Here X is my dataframe containing both categorical and numerical variables:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for i in range(0, X.shape[1]):
    if X.dtypes.iloc[i] == 'object':
        X[X.columns[i]] = le.fit_transform(X[X.columns[i]])
Note: This technique is good if you are not interested in converting them back.
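To illustrate why converting back is a problem with a single shared encoder, here is a sketch on a made-up frame: every call to fit_transform refits the encoder, so afterwards it only remembers the classes of the last column it saw. The column selection via select_dtypes is my own substitution for the dtype check above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame with two text columns and one numeric column.
X = pd.DataFrame({"a": ["x", "y"], "b": ["p", "q"], "n": [1, 2]})

le = LabelEncoder()
for col in X.select_dtypes(exclude="number").columns:
    X[col] = le.fit_transform(X[col])

# Only the classes of the last encoded column ("b") survive.
print(list(le.classes_))
```

The labels of column "a" can no longer be recovered from le, which is why the per-column dict shown in other answers is handy when an inverse transform is needed.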
After a lot of searching and experimenting with some answers here and elsewhere, I think your answer is here:
pd.DataFrame(columns=df.columns,
             data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))
This will preserve category names across columns:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame([['A','B','C','D','E','F','G','I','K','H'],
                   ['A','E','H','F','G','I','K','','',''],
                   ['A','C','I','F','H','G','','','','']],
                  columns=['A1', 'A2', 'A3','A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'])
pd.DataFrame(columns=df.columns, data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))
   A1  A2  A3  A4  A5  A6  A7  A8  A9  A10
0   1   2   3   4   5   6   7   9  10    8
1   1   5   8   6   7   9  10   0   0    0
2   1   3   9   6   8   7   0   0   0    0
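If the codes need to be mapped back later, one option (a sketch, not part of the answer above) is to keep the fitted encoder in a variable instead of creating it inline. The small frame here is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame where the same labels appear in several columns.
df = pd.DataFrame([["A", "B"], ["B", "C"]], columns=["c1", "c2"])

# Fit one shared encoder on all values, then restore the 2-D shape.
le = LabelEncoder()
codes = le.fit_transform(df.values.ravel()).reshape(df.shape)
encoded = pd.DataFrame(codes, columns=df.columns, index=df.index)

# Because the encoder was kept, the original labels can be recovered.
labels = le.inverse_transform(codes.ravel()).reshape(df.shape)
decoded = pd.DataFrame(labels, columns=df.columns, index=df.index)

print(encoded)
print(decoded)
```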
Instead of LabelEncoder we can use OrdinalEncoder from scikit learn, which allows multi-column encoding.
Encode categorical features as an integer array.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.
>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OrdinalEncoder()
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 3], ['Male', 1]])
array([[0., 2.],
[1., 0.]])
Both the description and example were copied from its documentation page which you can find here:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder
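Applied to a DataFrame, that could look roughly like the sketch below. The frame and the cat_cols list are made up for this example; only the categorical columns are passed to the encoder and the numeric column is left alone.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame with two categorical columns and one numeric column.
df = pd.DataFrame({"pet": ["cat", "dog", "cat"],
                   "city": ["NY", "LA", "NY"],
                   "age": [3, 5, 2]})
cat_cols = ["pet", "city"]

# A single encoder handles all categorical columns, one category list each.
enc = OrdinalEncoder()
codes = enc.fit_transform(df[cat_cols])

encoded = df.copy()
for i, col in enumerate(cat_cols):
    encoded[col] = codes[:, i]

# inverse_transform maps the codes back to the original labels.
restored = enc.inverse_transform(codes)
print(encoded)
print(restored)
```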
Using Neuraxle
TL;DR: Here you can use the FlattenForEach wrapper class to simply transform your df like this: FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df).
With this method, your label encoder will be able to fit and transform within a regular scikit-learn Pipeline. Let's simply import:
from sklearn.preprocessing import LabelEncoder
from neuraxle.steps.column_transformer import ColumnTransformer
from neuraxle.steps.loop import FlattenForEach
Same shared encoder for columns:
Here is how one shared LabelEncoder will be applied on all the data to encode it:
p = FlattenForEach(LabelEncoder(), then_unflatten=True)
Result:
p, predicted_output = p.fit_transform(df.values)
expected_output = np.array([
    [6, 7, 6, 8, 7, 7],
    [1, 3, 0, 1, 5, 3],
    [4, 2, 2, 4, 4, 2]
]).transpose()
assert np.array_equal(predicted_output, expected_output)
Different encoders per column:
And here is how a first standalone LabelEncoder will be applied to the pets, and a second will be shared for the columns owner and location. To be precise, we have here a mix of separate and shared label encoders:
p = ColumnTransformer([
    # A different encoder will be used for column 0 with name "pets":
    (0, FlattenForEach(LabelEncoder(), then_unflatten=True)),
    # A shared encoder will be used for column 1 and 2, "owner" and "location":
    ([1, 2], FlattenForEach(LabelEncoder(), then_unflatten=True)),
], n_dimension=2)
Result:
p, predicted_output = p.fit_transform(df.values)
expected_output = np.array([
    [0, 1, 0, 2, 1, 1],
    [1, 3, 0, 1, 5, 3],
    [4, 2, 2, 4, 4, 2]
]).transpose()
assert np.array_equal(predicted_output, expected_output)
Following up on the comments raised on the solution of @PriceHardman, I would propose the following version of the class:
class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
    def __init__(self, cols=None):
        pdu._is_cols_input_valid(cols)
        self.cols = cols
        self.les = {col: LabelEncoder() for col in cols}
        self._is_fitted = False

    def transform(self, df, **transform_params):
        """
        Scaling ``cols`` of ``df`` using the fitting

        Parameters
        ----------
        df : DataFrame
            DataFrame to be preprocessed
        """
        if not self._is_fitted:
            raise NotFittedError("Fitting was not performed")
        pdu._is_cols_subset_of_df_cols(self.cols, df)

        df = df.copy()

        label_enc_dict = {}
        for col in self.cols:
            label_enc_dict[col] = self.les[col].transform(df[col])

        labelenc_cols = pd.DataFrame(label_enc_dict,
            # The index of the resulting DataFrame should be assigned and
            # equal to the one of the original DataFrame. Otherwise, upon
            # concatenation NaNs will be introduced.
            index=df.index
        )

        for col in self.cols:
            df[col] = labelenc_cols[col]

        return df

    def fit(self, df, y=None, **fit_params):
        """
        Fitting the preprocessing

        Parameters
        ----------
        df : DataFrame
            Data to use for fitting.
            In many cases, should be ``X_train``.
        """
        pdu._is_cols_subset_of_df_cols(self.cols, df)
        for col in self.cols:
            self.les[col].fit(df[col])
        self._is_fitted = True
        return self
This class fits the encoder on the training set and uses the fitted version when transforming. Initial version of the code can be found here.
Here is the script
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
col_list = df.select_dtypes(include = "object").columns
for colsn in col_list:
    df[colsn] = le.fit_transform(df[colsn].astype(str))
If we have a single column, label encoding and its inverse transform are easy. Here is how to do it when there are multiple columns in Python:
from sklearn import preprocessing

def stringtocategory(dataset):
    '''
    @author puja.sharma
    @see The function label encodes the object type columns and gives label encoded and inverse transform of the label encoded data
    @param dataset dataframe on whose columns the label encoding has to be done
    @return label encoded and inverse transform of the label encoded data.
    '''
    # work on real copies so the input dataframe is not modified
    data_original = dataset.copy()
    data_tranformed = dataset.copy()
    for y in dataset.columns:
        # check the dtype of the column; object type contains strings or chars
        if (dataset[y].dtype == object):
            print("The string type features are : " + y)
            le = preprocessing.LabelEncoder()
            le.fit(dataset[y].unique())
            # label encoded data
            data_tranformed[y] = le.transform(dataset[y])
            # inverse label transform data
            data_original[y] = le.inverse_transform(data_tranformed[y])
    return data_tranformed, data_original
I mainly used @Alexander's answer but had to make some changes:
cols_need_mapped = ['col1', 'col2']
mapper = {col: {cat: n for n, cat in enumerate(df[col].astype('category').cat.categories)}
          for col in df[cols_need_mapped]}
for c in cols_need_mapped:
    df[c] = df[c].map(mapper[c])
Then to re-use in the future you can just save the output to a json document and when you need it you read it in and use the .map() function like I did above.
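A rough sketch of that save-and-reuse step, using made-up data. It serializes to a string with json.dumps to stay self-contained; writing to a file with json.dump and reading it back with json.load should work the same way. One caveat: JSON object keys are always strings, so this only round-trips cleanly when the categories are strings.

```python
import json
import pandas as pd

# Hypothetical training data.
df = pd.DataFrame({"col1": ["b", "a", "b"], "col2": ["x", "y", "x"]})
cols_need_mapped = ["col1", "col2"]

# Build the category -> integer mapping per column.
mapper = {col: {cat: n for n, cat in enumerate(df[col].astype("category").cat.categories)}
          for col in cols_need_mapped}

# Serialize the mapping, then load it back as if in a later session.
saved = json.dumps(mapper)
loaded = json.loads(saved)

# Apply the reloaded mapping to new data with the same columns.
new_df = pd.DataFrame({"col1": ["a", "b"], "col2": ["y", "y"]})
for c in cols_need_mapped:
    new_df[c] = new_df[c].map(loaded[c])
print(new_df)
```

Values in the new data that were not seen when the mapping was built come out as NaN from .map(), so they may need separate handling.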
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv('.../train.csv')
#X=train.loc[:,['waterpoint_type_group','status','waterpoint_type','source_class']].values

# Encode each listed column in place, with a fresh label encoder per column
def MultiLabelEncoder(columnlist, dataframe):
    for i in columnlist:
        labelencoder_X = LabelEncoder()
        dataframe[i] = labelencoder_X.fit_transform(dataframe[i])

columnlist = ['waterpoint_type_group','status','waterpoint_type','source_class','source_type']
MultiLabelEncoder(columnlist, train)
Here I am reading a CSV from a location, and I pass to the function the list of columns I want to label-encode and the dataframe I want to apply it to.
If all the features are of type object, then the first answer written above works well: https://stackoverflow.com/a/31939145/5840973.
But suppose we have mixed-type columns. Then we can fetch the list of feature names of type object programmatically and then label-encode them.
# Fetch features of type object
objFeatures = dataframe.select_dtypes(include="object").columns
# Loop over the features of type object
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for feat in objFeatures:
    dataframe[feat] = le.fit_transform(dataframe[feat].astype(str))
dataframe.info()
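The problem is the shape of the data (a pandas DataFrame) you are passing to the fit function.
You have to pass a 1-D list.
How about this?
from sklearn.preprocessing import LabelEncoder

def MultiColumnLabelEncode(choice, columns, X, encoders=None):
    # X is a 2-D NumPy array and columns is a list of column indices.
    # The fitted encoders are returned by 'encode' and must be passed
    # back in through the encoders argument for 'decode'.
    if encoders is None:
        encoders = {}
    if choice == 'encode':
        for cols in columns:
            encoders[cols] = LabelEncoder()
            X[:, cols] = encoders[cols].fit_transform(X[:, cols])
    elif choice == 'decode':
        for cols in columns:
            # codes stored in an object array need an integer dtype for the lookup
            X[:, cols] = encoders[cols].inverse_transform(X[:, cols].astype(int))
    else:
        print('Please select correct parameter "choice". Available parameters: encode/decode')
    return encoders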
It is not the most efficient, however it works and it is super simple.
Here is my solution to your problem. To convert the text columns of your dataframe to encoded values, just use my function text_to_numbers. It returns the encoded dataframe and a dictionary of label encoders, where each key is a column name and each value is that column's LabelEncoder().
from sklearn import preprocessing

def text_to_numbers(df):
    le_dict = dict()
    for i in df.columns:
        if df[i].dtype not in ["float64", "bool", "int64"]:
            le_dict[i] = preprocessing.LabelEncoder()
            df[i] = le_dict[i].fit_transform(df[i])
    return df, le_dict
The function below makes it possible to recover the original unencoded dataframe.
def numbers_to_text(df, le_dict):
    for i in le_dict.keys():
        df[i] = le_dict[i].inverse_transform(df[i])
    return df
Here is my solution to transform multiple columns in one go, along with an accurate inverse transformation:
from sklearn import preprocessing
columns = ['buying','maint','lug_boot','safety','cls'] # column names where the transform is required
for X in columns:
    exec(f'le_{X} = preprocessing.LabelEncoder()') # create a label encoder named "le_X", where X is the column name
    exec(f'df.{X} = le_{X}.fit_transform(df.{X})') # run fit_transform for column X with its label encoder "le_X"
df.head() # display the transformed results
for X in columns:
    exec(f'df.{X} = le_{X}.inverse_transform(df.{X})') # run inverse_transform for column X with its label encoder "le_X"
df.head() # display the inverse-transformed results of df
