AIF360 AI Fairness Library - Converting Pandas DataFrame with StandardDataset()

In the code below, I have included 5 records for reproducibility. Most of the parameters I am using come directly from the source code example that instantiates the COMPAS dataset, but I cannot convert the DataFrame into a StandardDataset: it raises a KeyError on 'sex' in protected_attribute_names. If I use any other column name in that parameter list ('race', for example), I end up with the same KeyError. I also tried an integer in case it was looking at row information; that still raised the KeyError.
Python: 3.8.10
Pandas: 1.3.5
AIF360: 0.5.0
WARNING:root:Missing Data: 2 rows removed from StandardDataset.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
5 frames
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'race'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-103-af5cab624a37> in <module>
1 from aif360.datasets import StandardDataset
2
----> 3 aif = StandardDataset(test_df,
4 label_name='jail7',
5 favorable_classes=[0],
/usr/local/lib/python3.8/dist-packages/aif360/datasets/standard_dataset.py in __init__(self, df, label_name, favorable_classes, protected_attribute_names, privileged_classes, instance_weights_name, scores_name, categorical_features, features_to_keep, features_to_drop, na_values, custom_preprocessing, metadata)
113 if callable(vals):
114 df[attr] = df[attr].apply(vals)
--> 115 elif np.issubdtype(df[attr].dtype, np.number):
116 # this attribute is numeric; no remapping needed
117 privileged_values = vals
/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'race'
import pandas as pd
from numpy import nan  # the dict below uses a bare nan
from aif360.datasets import StandardDataset
data = {'age': {0: 69, 1: 34, 2: 24, 3: 23, 4: 43},
'age_cat': {0: 'Greater than 45', 1: '25 - 45', 2: 'Less than 25', 3: 'Less than 25', 4: '25 - 45'},
'sex': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'race': {0: 'Other', 1: 'African-American', 2: 'African-American', 3: 'African-American', 4: 'Other'},
'c_charge_degree': {0: 'Felony', 1: 'Felony', 2: 'Felony', 3: 'Felony', 4: 'Felony'},
'priors_count': {0: 0, 1: 0, 2: 4, 3: 1, 4: 2},
'days_b_screening_arrest': {0: -1.0, 1: -1.0, 2: -1.0, 3: nan, 4: nan},
'decile_score': {0: 1, 1: 3, 2: 4, 3: 8, 4: 1},
'score_text': {0: 'Low', 1: 'Low', 2: 'Low', 3: 'High', 4: 'Low'},
'is_recid': {0: 0, 1: 1, 2: 1, 3: 0, 4: 0},
'two_year_recid': {0: 0, 1: 1, 2: 1, 3: 0, 4: 0},
'hours_in_jail': {0: 23.627222222222223, 1: 241.85722222222222, 2: 26.058333333333334, 3: nan, 4: nan},
'jail7': {0: False, 1: False, 2: False, 3: True, 4: False}}
df = pd.DataFrame.from_dict(data)
aif = StandardDataset(df,
label_name='jail7',
favorable_classes=[0],
protected_attribute_names=['sex', 'race'],
privileged_classes=[['Female'], ['Caucasian']],
categorical_features=['age_cat', 'sex', 'c_charge_degree', 'score_text', 'race'],
features_to_keep=['age', 'age_cat', 'sex', 'race', 'c_charge_degree', 'priors_count', 'days_b_screening_arrest', 'decile_score', 'score_text', 'is_recid', 'two_year_recid', 'hours_in_jail', 'jail7'])
I changed the values in protected_attribute_names and tried reducing the list from two entries to one. I also tried passing it without values, but they are required.
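One possible cause, offered as an assumption rather than a confirmed diagnosis: if StandardDataset one-hot encodes the columns listed in categorical_features before it looks up each protected attribute by name, then 'sex' and 'race' no longer exist under their original names when df[attr] is evaluated. The pandas-only sketch below illustrates that effect with pd.get_dummies; the three-row frame is made up for illustration and the call is not taken from the AIF360 source.

```python
import pandas as pd

# Hypothetical miniature of the question's frame.
df = pd.DataFrame({
    "age": [69, 34, 24],
    "sex": [1, 1, 0],
    "race": ["Other", "African-American", "Caucasian"],
})

# Assumption: the categorical columns are one-hot encoded roughly like this
# before the protected attributes are looked up.
encoded = pd.get_dummies(df, columns=["sex", "race"], prefix_sep="=")

print(list(encoded.columns))
print("race" in encoded.columns)  # False: the original column was replaced

try:
    encoded["race"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)
```

If that is what happens here, it would be worth trying the call with 'sex' and 'race' left out of categorical_features, and checking that privileged_classes matches the values actually stored in those columns (the sample data stores sex as 1, not 'Female').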

Related

Write a function to perform calculations on multiple columns in a Pandas dataframe

I have the following dataframe (the real one has a lot more columns and rows, so just using this as an example):
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
I'd like to write a function to perform calculations on the dataframe, for specific columns. The calculation is in the code below.
As I'd only want to apply the code to specific columns, I've set up a list of columns, and as there is a pre-defined 'factor' we need to take into account in the calculation, I set this up too:
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row):
    return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
Then, I apply the function to the dataframe, and I want to overwrite the original column values with the new ones, so I do this:
for cols in df.columns:
    df[cols] = df[cols].apply(multiply_columns)
But I get the following error:
~\AppData\Local\Temp/ipykernel_8544/3939806184.py in multiply_columns(row)
3
4 def multiply_columns(row):
----> 5 return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
6
7
TypeError: string indices must be integers
But the values I'm using in the calculation aren't strings:
sample object
sample id int64
replicate int64
taste float64
smell float64
shape float64
volume int64
weight float64
dtype: object
The desired output would be:
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 0.0074, 1: 0.028366667, 2: 0.2183, 3: 3.08333e-05},
'smell': {0: 0.123333333, 1: 0.141833333, 2: 0.01295, 3: 0.032683333},
'shape': {0: 2.46667e-05, 1: 0.001233333, 2: 0.00074, 3: 0.067833333},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
Can anyone kindly show me the error of my ways?

This has a few issues.
If you want to index the elements of row by position, note that the index you're using is a string (the column name) rather than an integer. To get integer positions for the columns you're interested in, you could use this:
cols = ['taste', 'smell', 'shape']
cols_idx = [df.columns.get_loc(col) for col in cols]
However, if I understand your question, you could perform this operation on columns directly with the understanding that the operation will be performed on each row. See a test case that worked for me:
import pandas as pd
df = pd.DataFrame({'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}})
cols = ['taste', 'smell', 'shape']
factor = 72
for col in cols:
    df[col] = ((df[col] / df['volume']) * (factor * df['volume'] / df['weight']) / 1000)
Note that your line
for cols in df.columns:
rebinds cols to each column name in turn, so the operation ran on every column and cols was no longer your list.
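To make the source of the TypeError concrete, here is a small sketch (the two-row frame is made up for illustration). Series.apply hands the function one element at a time, so for a text column the function receives a plain str, and indexing that str with a column name fails regardless of the dtypes of the numeric columns.

```python
import pandas as pd

# Hypothetical two-row frame with one text column and one numeric column.
df = pd.DataFrame({"sample": ["orange", "banana"], "taste": [1.2, 35.4]})

# Record the type of every value that Series.apply passes to the function.
seen = []
df["sample"].apply(lambda value: seen.append(type(value).__name__))
print(seen)  # element types, not rows

# Indexing one of those elements with a column name raises TypeError.
try:
    df["sample"].apply(lambda value: value["taste"])
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```

Row-wise access by column name needs DataFrame.apply with axis=1, or plain column arithmetic as in the loop above.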

You have to pass the column as well to the function.
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row, col):
    return ((row[col] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
for col in cols:
    df[col] = df.apply(lambda x: multiply_columns(x, col), axis=1)
Also, the output I'm getting is a bit different from your desired output, even though I used the same formula.
   sample  sample id  replicate          taste          smell          shape  volume          weight
0  orange          1          1  0.00720000000  0.12000000000  0.00002400000      23  12.00000000000
1  orange          1          2  0.25476923077  1.27384615385  0.01107692308      23   1.30000000000
2  banana          5          1  1.06200000000  0.06300000000  0.00360000000      23   2.40000000000
3  banana          5          2  0.00011250000  0.11925000000  0.24750000000      23   3.20000000000
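As a further sketch, the same formula can be applied to all selected columns at once, without apply or a loop. This is one possible vectorized variant rather than the approach either answer uses; the name scale is introduced here for illustration, and only the columns the formula needs are included.

```python
import pandas as pd

df = pd.DataFrame({
    "taste": [1.2, 4.6, 35.4, 0.005],
    "smell": [20.0, 23.0, 2.1, 5.3],
    "shape": [0.004, 0.2, 0.12, 11.0],
    "volume": [23, 23, 23, 23],
    "weight": [12.0, 1.3, 2.4, 3.2],
})
cols = ["taste", "smell", "shape"]
factor = 72

# Per-row multiplier: (1 / volume) * (factor * volume / weight) / 1000
scale = (factor * df["volume"] / df["weight"]) / df["volume"] / 1000

# axis=0 aligns the multiplier with the rows, so every selected column
# is scaled by the value belonging to its own row.
df[cols] = df[cols].mul(scale, axis=0)
print(df[cols])
```

The results agree with the table above (for example 0.0072 for taste in row 0), which suggests the desired output in the question was computed with a slightly different formula.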

Divide dataframe column by column from another dataframe

I have two dataframes that look similar and I want to divide one column from df1 by a column from df2.
Some sample data is below:
dict1 = {'category': {0: 0.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0},
'Id': {0: 24108, 1: 24307, 2: 24307, 3: 24411, 4: 24411},
'count': {0: 3, 1: 2, 2: 33, 3: 98, 4: 33}}
df1 = pd.DataFrame(dict1)
dict2 = {'Id': {0: 24108, 1: 24307, 2: 24411},
'count': {0: 3, 1: 35, 2: 131}}
df2 = pd.DataFrame(dict2)
I am trying to create a new column in the first dataframe (df1) called weights by dividing df1['count'] by df2['count']. Apart from category and count, the values in the other columns are identical in both dataframes.
I have the following piece of code, but I cannot seem to understand where the error is:
df1['weights'] = (df1['count']
.div(df1.merge(df2, on = 'Id', how = 'left')
['count'].to_numpy())
)
I get the following error when I run the code:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
/opt/conda/lib/python3.8/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/opt/conda/lib/python3.8/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'count'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/tmp/ipykernel_354/1318629977.py in <module>
1 complete['weights'] = (complete['count']
----> 2 .div(complete.merge(totals, on = 'companyId', how = 'left')['count'].to_numpy())
3 )
Any ideas why this is happening?

Since both dataframes have a count column, the merge gives you count_x and count_y instead of count, so you need to specify which one you want:
df1['weights'] = (df1['count'].div(df1.merge(df2, on = 'Id', how = 'left')['count_y'].to_numpy()))
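A short sketch of where the suffixes come from, using the sample data from the question. The second merge passes explicit suffixes so the right-hand column gets a descriptive name; the name count_total is chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "category": [0.0, 1.0, 0.0, 0.0, 1.0],
    "Id": [24108, 24307, 24307, 24411, 24411],
    "count": [3, 2, 33, 98, 33],
})
df2 = pd.DataFrame({"Id": [24108, 24307, 24411], "count": [3, 35, 131]})

# Both frames have a non-key column called "count", so merge renames them.
merged = df1.merge(df2, on="Id", how="left")
print(list(merged.columns))

# Explicit suffixes: keep the left name, rename only the right-hand column.
named = df1.merge(df2, on="Id", how="left", suffixes=("", "_total"))
df1["weights"] = df1["count"] / named["count_total"].to_numpy()
print(df1)
```

A left merge keeps the row order of df1, which is what makes the positional division with to_numpy() line up.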

df.apply() raises IndexingError: Unalignable boolean Series provided as indexer

I am performing df.apply() on a dataframe and I am getting the following error:
IndexingError: ('Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).', 'occurred at index 4061')
This error comes from the following line of my df (at index 4061)
The relevant code is:
i = pd.DataFrame()
i = df1.apply(
    lambda row: i.append(
        df.loc[
            (df1["ID"] == row["ID"])
            & (df1["Date"] >= (row["Date"] + timedelta(-5)))
            & (df1["Date"] <= (row["Date"] + timedelta(20)))
        ],
        ignore_index=True,
        inplace=True,
    )
    if row["Flag"] == 1
    else None,
    axis=1,
)
And an example of the first 5 rows of the df on which I am using the function:
{'ID': {1: 'A US Equity',
2: 'A US Equity',
3: 'A US Equity',
4: 'A US Equity',
5: 'A US Equity'},
'Date': {1: Timestamp('2020-12-22 00:00:00'),
2: Timestamp('2020-12-23 00:00:00'),
3: Timestamp('2020-12-24 00:00:00'),
4: Timestamp('2020-12-28 00:00:00'),
5: Timestamp('2020-12-29 00:00:00')},
'PX_Last': {1: 117.37, 2: 117.3, 3: 117.31, 4: 117.83, 5: 117.23},
'Short_Int': {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0},
'Total_Call_Volume': {1: 187.0, 2: 353.0, 3: 141.0, 4: 467.0, 5: 329.0},
'Total_Put_Volume': {1: 54.0, 2: 30.0, 3: 218.0, 4: 282.0, 5: 173.0},
'Put_OI': {1: 13354.0, 2: 13350.0, 3: 13522.0, 4: 13678.0, 5: 13785.0},
'Call_OI': {1: 8923.0, 2: 8943.0, 3: 8973.0, 4: 9075.0, 5: 9040.0},
'pct_chng': {1: -0.34810663949736975,
2: -0.059640453267451043,
3: 0.008525149190119485,
4: 0.4432699684596253,
5: -0.5092081812781091},
'Short_Int_Category': {1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Put/Call': {1: 0.2887700534759358,
2: 0.08498583569405099,
3: 1.5460992907801419,
4: 0.6038543897216274,
5: 0.5258358662613982},
'10% + Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'10%-20% Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'20%-30% Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'30% + Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'Time_to_pop': {1: nan, 2: nan, 3: nan, 4: nan, 5: nan}}
The row at index 4061 that is causing the error is:
ID                  ADI US Equity
Date                2021-02-24 00:00:00
PX_Last             161.76
Short_Int           15.1847
Total_Call_Volume   52502
Total_Put_Volume    1929
Put_OI              32219
Call_OI             45557
pct_chng            2.57451
Short_Int_Category  15-20
Put/Call            0.0367415
10% + Pop Flag      0
10%-20% Pop Flag    0
20%-30% Pop Flag    0
30% + Pop Flag      0
Flag                1
Time_to_pop         NaN
Name: 4061, dtype: object
How do I perform the function without getting the error mentioned above?
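One thing that stands out in the code above is that the boolean mask is built from df1 but passed to df.loc. If the two frames do not share the same index, pandas cannot align the mask, which would be consistent with this error; whether df is a different frame or a typo is an assumption here. The sketch below builds the mask from the same frame it indexes and collects the matching windows in a list before concatenating. The sample data and the helper name window_around_flags are made up for illustration.

```python
from datetime import timedelta

import pandas as pd

# Hypothetical miniature of the question's data.
df1 = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "B"],
    "Date": pd.to_datetime(
        ["2021-01-06", "2021-01-10", "2021-01-25", "2021-03-01", "2021-01-08"]
    ),
    "Flag": [0, 1, 0, 0, 0],
})


def window_around_flags(frame, days_before=5, days_after=20):
    """Collect, for every flagged row, the rows with the same ID in the date window."""
    pieces = []
    for _, row in frame[frame["Flag"] == 1].iterrows():
        # The mask comes from the same frame that is indexed below,
        # so its index always matches.
        mask = (
            (frame["ID"] == row["ID"])
            & (frame["Date"] >= row["Date"] - timedelta(days=days_before))
            & (frame["Date"] <= row["Date"] + timedelta(days=days_after))
        )
        pieces.append(frame.loc[mask])
    if not pieces:
        return frame.iloc[0:0]  # empty frame with the same columns
    return pd.concat(pieces, ignore_index=True)


result = window_around_flags(df1)
print(result)
```

Collecting the pieces in a list also sidesteps DataFrame.append, which has no inplace parameter and was removed in pandas 2.0.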

How to convert multiple columns of a df from hexadecimal to decimal

The df has multiple columns, of which only selected columns have to be converted from hexadecimal to decimal.
The selected column names are stored in a list: A = ["Type 2", "Type 4"]
{'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'CC',
3: '55',
4: '88',
5: '96',
6: 'FF',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}

Say, you have the string "AA" in hex.
You can convert hex to decimal like this:
str(int("AA", 16))
Similarly, for a dataframe column that has hexadecimal values, you can use a lambda function:
df['Type 2'] = df['Type 2'].apply(lambda x: str(int(str(x), 16)))
This assumes df is the name of the imported dataframe.

You can use pandas.DataFrame.applymap to cast element-wise:
>>> df[["Type 2", "Type 4"]].applymap(lambda n: int(n, 16))
     Type 2  Type 4
0       170      35
1       187   65278
2       204   43981
3        85   56797
4       136    3501
5       150    3326
6       255   53058
7  16777215     801
8     65262       0
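To write the converted values back into only the columns named in the list A, a sketch along these lines should work. It uses a three-row subset of the question's data and goes column by column with Series.map, so it does not depend on applymap, which newer pandas releases deprecate in favour of DataFrame.map.

```python
import pandas as pd

# Subset of the question's data, for illustration.
df = pd.DataFrame({
    "Type 1": [1, 3, 5],
    "Type 2": ["AA", "FFFFFF", "FEEE"],
    "Type 4": ["23", "fefe", "0"],
})
A = ["Type 2", "Type 4"]

# For each selected column, parse every value as a base-16 integer.
df[A] = df[A].apply(lambda column: column.map(lambda value: int(str(value), 16)))
print(df)
```

The columns not listed in A are left untouched, and the converted columns hold integers rather than strings.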

What's the cause of the Pandas error on read_excel with header=None and multiple index_cols

This is related to this SO question: read_excel in pandas giving error for no header and multiple index_col's
But instead of a workaround, I would like to know why this is happening.
The data:
{0: {0: nan, 1: nan, 2: nan, 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'C', 8: 'C'},
1: {0: nan, 1: nan, 2: nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 2.0, 7: 1.0, 8: 2.0},
2: {0: 'AA1', 1: 'a', 2: 'ng/mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
3: {0: 'AA2', 1: 'a', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
4: {0: 'BB1', 1: 'b', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
5: {0: 'BB2', 1: 'b', 2: 'mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
6: {0: 'CC1', 1: 'c', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
7: {0: 'CC2', 1: 'c', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}}
Reading the data like:
pd.read_excel(file_path, skiprows=3, index_col=[0, 1], header=None)
Does not work:
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
Why?

The explanation is given in the full traceback:
....
File "D:\Programme\Python36\lib\site-packages\pandas\io\excel\_base.py", line 473, in parse
offset = 1 + header
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
header is set to None and in calculating the offset it tries to add 1 to None which results in a TypeError. I think it's simply a bug.
The following is absolutely without any warranty:
In line 473 of ...\Lib\site-packages\pandas\io\excel\_base.py, change
offset = 1 + header
to
offset = 1 + header if header is not None else -1
to make multiple index columns work with header=None.
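The arithmetic behind the error and the proposed guard can be checked in isolation. The two helper functions below are hypothetical stand-ins for the single expression quoted from the traceback; they are not pandas code.

```python
def offset_unguarded(header):
    # Same expression as in the traceback: fails when header is None.
    return 1 + header


def offset_guarded(header):
    # The proposed replacement; Python parses it as
    # (1 + header) if header is not None else -1
    return 1 + header if header is not None else -1


try:
    offset_unguarded(None)
    unguarded_failed = False
except TypeError as exc:
    unguarded_failed = True
    print("TypeError:", exc)

print(offset_guarded(None))  # -1
print(offset_guarded(0))     # 1
```

Whether -1 is the right offset for the header=None case inside pandas is the answerer's suggestion and is not verified here.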
