In the code below, I have included 5 records for reproducibility. Most of the parameters I am using come directly from the source code example that instantiates the COMPAS dataset, but I cannot convert the DataFrame into a StandardDataset because it raises a KeyError on 'sex' in protected_attribute_names. Any other column name in that parameter list produces the same KeyError ('race', for example). I also tried an integer in case it was looking at row information; that still raised the KeyError.
Python: 3.8.10
Pandas: 1.3.5
AIF360: 0.5.0
WARNING:root:Missing Data: 2 rows removed from StandardDataset.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
5 frames
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'race'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-103-af5cab624a37> in <module>
1 from aif360.datasets import StandardDataset
2
----> 3 aif = StandardDataset(test_df,
4 label_name='jail7',
5 favorable_classes=[0],
/usr/local/lib/python3.8/dist-packages/aif360/datasets/standard_dataset.py in __init__(self, df, label_name, favorable_classes, protected_attribute_names, privileged_classes, instance_weights_name, scores_name, categorical_features, features_to_keep, features_to_drop, na_values, custom_preprocessing, metadata)
113 if callable(vals):
114 df[attr] = df[attr].apply(vals)
--> 115 elif np.issubdtype(df[attr].dtype, np.number):
116 # this attribute is numeric; no remapping needed
117 privileged_values = vals
/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'race'
from numpy import nan  # the sample data below uses a bare nan
import pandas as pd
from aif360.datasets import StandardDataset
data = {'age': {0: 69, 1: 34, 2: 24, 3: 23, 4: 43},
'age_cat': {0: 'Greater than 45', 1: '25 - 45', 2: 'Less than 25', 3: 'Less than 25', 4: '25 - 45'},
'sex': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'race': {0: 'Other', 1: 'African-American', 2: 'African-American', 3: 'African-American', 4: 'Other'},
'c_charge_degree': {0: 'Felony', 1: 'Felony', 2: 'Felony', 3: 'Felony', 4: 'Felony'},
'priors_count': {0: 0, 1: 0, 2: 4, 3: 1, 4: 2},
'days_b_screening_arrest': {0: -1.0, 1: -1.0, 2: -1.0, 3: nan, 4: nan},
'decile_score': {0: 1, 1: 3, 2: 4, 3: 8, 4: 1},
'score_text': {0: 'Low', 1: 'Low', 2: 'Low', 3: 'High', 4: 'Low'},
'is_recid': {0: 0, 1: 1, 2: 1, 3: 0, 4: 0},
'two_year_recid': {0: 0, 1: 1, 2: 1, 3: 0, 4: 0},
'hours_in_jail': {0: 23.627222222222223, 1: 241.85722222222222, 2: 26.058333333333334, 3: nan, 4: nan},
'jail7': {0: False, 1: False, 2: False, 3: True, 4: False}}
df = pd.DataFrame.from_dict(data)
aif = StandardDataset(df,
label_name='jail7',
favorable_classes=[0],
protected_attribute_names=['sex', 'race'],
privileged_classes=[['Female'], ['Caucasian']],
categorical_features=['age_cat', 'sex', 'c_charge_degree', 'score_text', 'race'],
features_to_keep=['age', 'age_cat', 'sex', 'race', 'c_charge_degree', 'priors_count', 'days_b_screening_arrest', 'decile_score', 'score_text', 'is_recid', 'two_year_recid', 'hours_in_jail', 'jail7'])
I changed the values within protected_attribute_names and tried reducing the list from two entries to one. I also tried passing it without values, but they are required.
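One possible explanation, stated as an assumption rather than a confirmed diagnosis: the traceback fails at df[attr] inside StandardDataset.__init__, after the missing-data warning has already been logged. If the columns listed in categorical_features have been one-hot encoded by that point, 'sex' and 'race' no longer exist under those names. The sketch below reproduces that effect with plain pandas; the small frame is made up for illustration, and pd.get_dummies with a '=' separator is used as a stand-in for whatever encoding the library applies.

```python
import pandas as pd

# Hypothetical miniature of the COMPAS-style frame from the question.
df = pd.DataFrame({
    "sex": [1, 1, 0],
    "race": ["Other", "African-American", "Caucasian"],
    "priors_count": [0, 4, 2],
})

# One-hot encode the columns that were listed as categorical.
encoded = pd.get_dummies(df, columns=["sex", "race"], prefix_sep="=")

# The original column names are gone; only the dummy columns remain.
print(list(encoded.columns))
print("race" in encoded.columns)

# Looking up the original name now fails the same way the traceback does.
try:
    encoded["race"]
except KeyError as err:
    missing = err.args[0]
    print("KeyError:", missing)
```

If that is what is happening here, leaving the protected attributes out of categorical_features should keep those columns available for the lookup. This has not been verified against AIF360 0.5.0.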
I have two dataframes that look similar and I want to divide one column from df1 by a column from df2.
Some sample data is below:
dict1 = {'category': {0: 0.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0},
'Id': {0: 24108, 1: 24307, 2: 24307, 3: 24411, 4: 24411},
'count': {0: 3, 1: 2, 2: 33, 3: 98, 4: 33}}
df1 = pd.DataFrame(dict1)
dict2 = {'Id': {0: 24108, 1: 24307, 2: 24411},
'count': {0: 3, 1: 35, 2: 131}}
df2 = pd.DataFrame(dict2)
I am trying to create a new column in the first dataframe (df1) called weights by dividing df1['count'] by df2['count']. Except for the column category and count in both dfs, the values are identical in the other columns.
I have the following piece of code, but I cannot seem to understand where the error is:
df1['weights'] = (df1['count']
.div(df1.merge(df2, on = 'Id', how = 'left')
['count'].to_numpy())
)
I get the following error when I run the code:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
/opt/conda/lib/python3.8/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/opt/conda/lib/python3.8/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'count'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/tmp/ipykernel_354/1318629977.py in <module>
1 complete['weights'] = (complete['count']
----> 2 .div(complete.merge(totals, on = 'companyId', how = 'left')['count'].to_numpy())
3 )
Any ideas why this is happening?
Since you end up with count_x and count_y after your merge, you need to specify which one you want:
df1['weights'] = (df1['count'].div(df1.merge(df2, on = 'Id', how = 'left')['count_y'].to_numpy()))
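An alternative that avoids the merge suffixes altogether, offered as a sketch: look up each row's total with Series.map instead of merging. It assumes Id is unique in df2, and the name totals is introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "category": [0.0, 1.0, 0.0, 0.0, 1.0],
    "Id": [24108, 24307, 24307, 24411, 24411],
    "count": [3, 2, 33, 98, 33],
})
df2 = pd.DataFrame({"Id": [24108, 24307, 24411], "count": [3, 35, 131]})

# Series of per-Id totals, indexed by Id (assumes Id is unique in df2).
totals = df2.set_index("Id")["count"]

# Map each row's Id to its total, then divide element-wise.
df1["weights"] = df1["count"] / df1["Id"].map(totals)
print(df1)
```

Because the division is aligned on df1's own index, no .to_numpy() call is needed, and an Id missing from df2 would yield NaN rather than an error.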
I am desperately trying to figure out how to print out the row index and column name for specific values in my df.
I have the following df:
raw_data = {'first_name': [NaN, 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, NaN, 24, 73],
'preTestScore': [4, 24, 31, 33, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age',
'preTestScore','postTestScore'])
I now want to print out the index and column name for the NaN:
There is a missing value in row 0 for first_name.
There is a missing value in row 2 for age.
I have searched a lot and always found how to do something for one row.
My idea is to first create a df with False and True
na = df.isnull()
Then I want to apply some function that prints the row number and col_name for every NaN value.
I just can't figure out how to do this.
Thanks in advance for any help!
I had to change the df a bit because NaN is not defined as written; I replaced it with np.nan.
import numpy as np
import pandas as pd
raw_data = {'first_name': [np.nan, 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, np.nan, 24, 73],
'preTestScore': [4, 24, 31, 33, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data)
Then you can do this:
dfs = df.stack(dropna = False)
[f'There is a missing value in row {i[0]} for {i[1]}' for i in dfs[dfs.isna()].index]
This produces a list:
['There is a missing value in row 0 for first_name',
'There is a missing value in row 2 for age']
As simple as:
np.where(df.isnull())
It returns a tuple of two arrays: the row positions and the column positions of the NA values, respectively.
Example:
na_idx = np.where(df.isnull())
for i,j in zip(*na_idx):
print(f'Row {i} and column {j} ({df.columns[j]}) is NA.')
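Note that np.where returns integer positions, not index labels. A self-contained sketch that translates the positions back to labels through df.index and df.columns, so it also works when the index is not 0..n-1 (the three-column frame and the name messages are introduced here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first_name": [np.nan, "Molly", "Tina", "Jake", "Amy"],
    "last_name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
    "age": [42, 52, np.nan, 24, 73],
})

# Positions of the missing cells, scanned row by row.
rows, cols = np.where(df.isna().to_numpy())

# Translate positions into row labels and column names.
messages = [
    f"There is a missing value in row {df.index[r]} for {df.columns[c]}."
    for r, c in zip(rows, cols)
]
print("\n".join(messages))
```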
You could do something like the below:
for i, row in df.iterrows():
nans = row[row.isna()].index
for n in nans:
print('row: %s, col: %s' % (i, n))
I think melting is the way to go.
I'd start by creating a dataframe with columns: index, column_name, value.
Then filter column value by not null.
And dump the result to dict.
df = pd.melt(df.reset_index(), id_vars=['index'], value_vars=df.columns)
selected = df[df['value'].isnull()].drop('value', axis=1).set_index('index')
resp = selected.T.to_dict(orient='records')[0]
s = "There is a missing value in row {idx} for {col_name}."
for record in resp.items():
idx, col_name = record
print(s.format(idx=idx, col_name=col_name))
you can just create a variable
NaN = "null"
to indicate empty cell
import pandas as pd
NaN = "null"
raw_data = {'first_name': [NaN, 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, NaN, 24, 73],
'preTestScore': [4, 24, 31, 33, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age',
'preTestScore','postTestScore'])
print(df)
output:
first_name last_name age preTestScore postTestScore
0 null Miller 42 4 25
1 Molly Jacobson 52 24 94
2 Tina Ali null 31 57
3 Jake Milner 24 33 62
4 Amy Cooze 73 3 70
This question is similar to this one, but with a crucial difference - the solution to the linked question does not solve the issue when the dataframe is grouped into bins.
The following code to boxplot the relative distribution of the bins of the 2 variables produces an error:
import pandas as pd
import seaborn as sns
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df1 = df.groupby(['regiment'])['preTestScore'].value_counts().unstack()
df1.fillna(0, inplace=True)
sns.boxplot(x='regiment', y='preTestScore', data=df1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-241-fc8036eb7d0b> in <module>()
----> 1 sns.boxplot(x='regiment', y='preTestScore', data=df1)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in boxplot(x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, dodge, fliersize, linewidth, whis, notch, ax, **kwargs)
2209 plotter = _BoxPlotter(x, y, hue, data, order, hue_order,
2210 orient, color, palette, saturation,
-> 2211 width, dodge, fliersize, linewidth)
2212
2213 if ax is None:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in __init__(self, x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, dodge, fliersize, linewidth)
439 width, dodge, fliersize, linewidth):
440
--> 441 self.establish_variables(x, y, hue, data, orient, order, hue_order)
442 self.establish_colors(color, palette, saturation)
443
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in establish_variables(self, x, y, hue, data, orient, order, hue_order, units)
149 if isinstance(input, string_types):
150 err = "Could not interpret input '{}'".format(input)
--> 151 raise ValueError(err)
152
153 # Figure out the plotting orientation
ValueError: Could not interpret input 'regiment'
If I remove the x and y parameters, it produces a boxplot, but it's not the one I want:
How do I fix this? I tried the following:
df1 = df.groupby(['regiment'])['preTestScore'].value_counts().unstack()
df1.fillna(0, inplace=True)
df1 = df1.reset_index()
df1
It now looks like a dataframe, so I thought of extracting the column names of this dataframe and plotting for each one sequentially:
cols = df1.columns[1:len(df1.columns)]
for i in range(len(cols)):
sns.boxplot(x='regiment', y=cols[i], data=df1)
This doesn't look right. In fact, this is not a normal dataframe; if we print out its columns, it does not show regiment as a column, which is why boxplot gives the error ValueError: Could not interpret input 'regiment':
df1.columns
>>> Index(['regiment', 2, 3, 4, 24, 31], dtype='object', name='preTestScore')
So, if I could just somehow make regiment a column of the dataframe, I think I should be able to plot the boxplot of preTestScore vs regiment. Am I wrong?
EDIT: What I want is something like this:
df1 = df.groupby(['regiment'])['preTestScore'].value_counts().unstack()
df1.fillna(0, inplace=True)
# This df2 dataframe is the one I'm trying to construct using groupby
data = {'regiment':['Dragoons', 'Nighthawks', 'Scouts'], 'preTestScore 2':[0.0, 1.0, 2.0], 'preTestScore 3':[1.0, 0.0, 2.0],
'preTestScore 4':[1.0, 1.0, 0.0], 'preTestScore 24':[1.0, 1.0, 0.0], 'preTestScore 31':[1.0, 1.0, 0.0]}
cols = ['regiment', 'preTestScore 2', 'preTestScore 3', 'preTestScore 4', 'preTestScore 24', 'preTestScore 31']
df2 = pd.DataFrame(data, columns=cols)
df2
fig = plt.figure(figsize=(20,3))
count = 1
for col in cols[1:]:
plt.subplot(1, len(cols)-1, count)
sns.boxplot(x='regiment', y=col, data=df2)
count+=1
If you call reset_index() on your dataframe df1, you should get the dataframe you want.
The problem was that one of your desired columns (regiment) was the index, so you needed to reset the index to turn it back into a regular column.
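A quick way to check this without plotting, sketched on a smaller made-up frame (the names counts and flat are introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "regiment": ["Nighthawks", "Nighthawks", "Dragoons", "Scouts"],
    "preTestScore": [4, 24, 4, 2],
})

# After groupby + value_counts + unstack, regiment is the index, not a column.
counts = df.groupby("regiment")["preTestScore"].value_counts().unstack().fillna(0)
print("regiment" in counts.columns, counts.index.name)

# reset_index moves the index back into an ordinary column.
flat = counts.reset_index()
flat.columns.name = None  # drop the leftover 'preTestScore' label on the columns axis
print("regiment" in flat.columns)
```

Only the second frame can be passed to sns.boxplot with x='regiment', because seaborn looks the name up among the columns.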
Edit: added add_prefix for proper column names in the resulting dataframe
Sample code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df1 = df.groupby(['regiment'])['preTestScore'].value_counts().unstack()
df1.fillna(0, inplace=True)
df1 = df1.add_prefix('preTestScore ') # <- add_prefix for proper column names
df2 = df1.reset_index() # <- Here is reset_index()
cols = df2.columns
fig = plt.figure(figsize=(20,3))
count = 1
for col in cols[1:]:
plt.subplot(1, len(cols)-1, count)
sns.boxplot(x='regiment', y=col, data=df2)
count+=1
Output:
I've used the .set_index() function to set the first column as my index to the rows in my dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([['x', 1,2,3,4,5], ['y', 6,7,8,9,10], ['z', 11,12,13,14,15]])
>>> df.columns = ['index', 'a', 'b', 'c', 'd', 'e']
>>> df = df.set_index(['index'])
>>> df
a b c d e
index
x 1 2 3 4 5
y 6 7 8 9 10
z 11 12 13 14 15
How should the dataframe be manipulated so that it can be accessed like a nested list? For example, are the following possible:
>>> df['x']
[1, 2, 3, 4, 5]
>>> df['x']['a']
1
>>> df['x']['a', 'b']
(1, 2)
>>> df['x']['a', 'd', 'c']
(1, 4, 3)
I've tried accessing df['x'] after setting the index but it throws an error, is that the correct way to access the x row?
>>> df['x']
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 2393, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)
File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)
File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)
File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)
KeyError: 'x'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/pandas/core/frame.py", line 2062, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python3.5/site-packages/pandas/core/frame.py", line 2069, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python3.5/site-packages/pandas/core/generic.py", line 1534, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python3.5/site-packages/pandas/core/internals.py", line 3590, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 2395, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)
File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)
File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)
File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)
KeyError: 'x'
You can use loc.
From your examples:
df['x'] should be df.loc['x']
df['x']['a'] should be df.loc['x', 'a'], and
df['x']['a', 'd', 'c'] should be df.loc['x', ['a', 'd', 'c']]
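A runnable sketch of those three lookups on the frame from the question. The .tolist() and tuple() conversions are additions here, used only to get the plain list and tuple shapes the question asked for; .loc itself returns a Series or a scalar.

```python
import pandas as pd

df = pd.DataFrame(
    [["x", 1, 2, 3, 4, 5], ["y", 6, 7, 8, 9, 10], ["z", 11, 12, 13, 14, 15]],
    columns=["index", "a", "b", "c", "d", "e"],
).set_index("index")

row = df.loc["x"].tolist()                     # whole row as a list
cell = df.loc["x", "a"]                        # single value
picked = tuple(df.loc["x", ["a", "d", "c"]])   # several columns, in the order given
print(row, cell, picked)
```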
Rows should be accessed using loc:
df.loc['x']
To select a row and specific columns, use:
df.loc['x', ['a', 'c']]
Alternatively, transpose the dataframe so that the row labels become columns:
df.T.x