Related
In the code below, I have included 5 records for reproducibility. Most of the parameters that I am using are directly from the source code example of instantiated the COMPAS dataset, but I cannot convert the DataFrame into a StandardDataset as it raises a KeyError on 'sex' in the protected_attribute_names. I can use any other column name in that parameter list and I still will end up with the same KeyError (race, for example. I also tried an integer in case it was looking at row information. Still raised the key error).
Python: 3.8.10
Pandas: 1.3.5
AIF360: 0.5.0
WARNING:root:Missing Data: 2 rows removed from StandardDataset.
-
KeyError Traceback (most recent call last)
[/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py](https://localhost:8080/#) in get_loc(self, key, method, tolerance)
3360 try:
\-\> 3361 return self.\_engine.get_loc(casted_key)
3362 except KeyError as err:
5 frames
/usr/local/lib/python3.8/dist-packages/pandas/\_libs/index.pyx in pandas.\_libs.index.IndexEngine.get_loc()
/usr/local/lib/python3.8/dist-packages/pandas/\_libs/index.pyx in pandas.\_libs.index.IndexEngine.get_loc()
pandas/\_libs/hashtable_class_helper.pxi in pandas.\_libs.hashtable.PyObjectHashTable.get_item()
pandas/\_libs/hashtable_class_helper.pxi in pandas.\_libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'race'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
[\<ipython-input-103-af5cab624a37\>](https://localhost:8080/#) in \<module\>
1 from aif360.datasets import StandardDataset
2
\----\> 3 aif = StandardDataset(test_df,
4 label_name='jail7',
5 favorable_classes=\[0\],
[/usr/local/lib/python3.8/dist-packages/aif360/datasets/standard_dataset.py](https://localhost:8080/#) in __init__(self, df, label_name, favorable_classes, protected_attribute_names, privileged_classes, instance_weights_name, scores_name, categorical_features, features_to_keep, features_to_drop, na_values, custom_preprocessing, metadata)
113 if callable(vals):
114 df\[attr\] = df\[attr\].apply(vals)
\--\> 115 elif np.issubdtype(df\[attr\].dtype, np.number):
116 # this attribute is numeric; no remapping needed
117 privileged_values = vals
[/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py](https://localhost:8080/#) in __getitem__(self, key)
3456 if self.columns.nlevels \> 1:
3457 return self.\_getitem_multilevel(key)
\-\> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = \[indexer\]
[/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py](https://localhost:8080/#) in get_loc(self, key, method, tolerance)
3361 return self.\_engine.get_loc(casted_key)
3362 except KeyError as err:
\-\> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'race'
import pandas as pd
from aif360.datasets import StandardDataset
data = {'age': {0: 69, 1: 34, 2: 24, 3: 23, 4: 43},
'age_cat': {0: 'Greater than 45', 1: '25 - 45', 2: 'Less than 25', 3: 'Less than 25', 4: '25 - 45'},
'sex': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'race': {0: 'Other', 1: 'African-American', 2: 'African-American', 3: 'African-American', 4: 'Other'},
'c_charge_degree': {0: 'Felony', 1: 'Felony', 2: 'Felony', 3: 'Felony', 4: 'Felony'},
'priors_count': {0: 0, 1: 0, 2: 4, 3: 1, 4: 2},
'days_b_screening_arrest': {0: -1.0, 1: -1.0, 2: -1.0, 3: nan, 4: nan},
'decile_score': {0: 1, 1: 3, 2: 4, 3: 8, 4: 1},
'score_text': {0: 'Low', 1: 'Low', 2: 'Low', 3: 'High', 4: 'Low'},
'is_recid': {0: 0, 1: 1, 2: 1, 3: 0, 4: 0},
'two_year_recid': {0: 0, 1: 1, 2: 1, 3: 0, 4: 0},
'hours_in_jail': {0: 23.627222222222223, 1: 241.85722222222222, 2: 26.058333333333334, 3: nan, 4: nan},
'jail7': {0: False, 1: False, 2: False, 3: True, 4: False}}
df = pd.DataFrame.from_dict(data)
aif = StandardDataset(df,
label_name='jail7',
favorable_classes=[0],
protected_attribute_names=['sex', 'race'],
privileged_classes=[['Female'], ['Caucasian']],
categorical_features=['age_cat', 'sex', 'c_charge_degree', 'score_text', 'race'],
features_to_keep=['age', 'age_cat', 'sex', 'race', 'c_charge_degree', 'priors_count', 'days_b_screening_arrest', 'decile_score', 'score_text', 'is_recid', 'two_year_recid', 'hours_in_jail', 'jail7'])
I changed the values within the protected_attribute names, tried to reduce the length of the list from 2 down to 1. Tried to parse it without values (they're required).
I have the following dataframe (the real one has a lot more columns and rows, so just using this as an example):
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
I'd like to write a function to perform calculations on the dataframe, for specific columns. The calculation is in the code below.
As I'd only want to apply the code to specific columns, I've set up a list of columns, and as there is a pre-defined 'factor' we need to take into account in the calculation, I set this up too:
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row):
return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
Then, I apply the function to the dataframe, and I want to overwrite the original column values with the new ones, so I do this:
for cols in df.columns:
df[cols] = df[cols].apply(multiply_columns)
But I get the following error:
~\AppData\Local\Temp/ipykernel_8544/3939806184.py in multiply_columns(row)
3
4 def multiply_columns(row):
----> 5 return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
6
7
TypeError: string indices must be integers
But the values I'm using in the calculation aren't strings:
sample object
sample id int64
replicate int64
taste float64
smell float64
shape float64
volume int64
weight float64
dtype: object
The desired output would be:
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 0.0074, 1: 0.028366667, 2: 0.2183, 3: 3.08333e-05},
'smell': {0: 0.123333333, 1: 0.141833333, 2: 0.01295, 3: 0.032683333},
'shape': {0: 2.46667e-05, 1: 0.001233333, 2: 0.00074, 3: 0.067833333},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
Can anyone kindly show me the errors of my ways
This has a few issues.
If you wanted to index elements in row, the index you're using is a string (the column name) rather than an integer (like an index). To get an index for the column names you're interested in, you could use this:
cols = ['taste', 'smell', 'shape']
cols_idx = [df.columns.get_loc(col) for col in cols]
However, if I understand your question, you could perform this operation on columns directly with the understanding that the operation will be performed on each row. See a test case that worked for me:
import pandas as pd
df = pd.DataFrame({'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}})
cols = ['taste', 'smell', 'shape']
factor = 72
for col in cols:
df[col] = ((df[col] / df['volume']) * (factor * df['volume'] / df['weight']) / 1000)
Note that your line
for cols in df.columns:
indicated you should run this operation on every column (cols became the index and was no longer your list).
You have to pass the column as well to the function.
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row,col):
return ((row[col]/ row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
for col in cols:
df[col] = df.apply(lambda x:multiply_columns(x,col),axis=1)
Also the output I'm getting is bit different from your desired output even though I used the same formula.
sample sample id replicate taste smell shape volume weight
0 orange 1 1 0.00720000000 0.12000000000 0.00002400000 23 12.00000000000
1 orange 1 2 0.25476923077 1.27384615385 0.01107692308 23 1.30000000000
2 banana 5 1 1.06200000000 0.06300000000 0.00360000000 23 2.40000000000
3 banana 5 2 0.00011250000 0.11925000000 0.24750000000 23 3.20000000000
I'm new to Stack Overflow and I have this Data Set :
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4, 3: 88, 4: 88, 5: 323, 6: 323},
'Step': {0: 'A', 1: 'Bar', 2: 'F', 3: 'F', 4: 'Bar', 5: 'F', 6: 'A'},
'Num': {0: 38, 1: 38, 2: 38, 3: 320, 4: 320, 5: 433, 6: 432},
'Date': {0: '2018-08-02',
1: '2018-12-02',
2: '2019-03-02',
3: '2017-03-02',
4: '2018-03-02',
5: '2020-03-04',
6: '2020-02-03'},
'Occurence': {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}})
The variables 'ID' and 'Step' are Multi-index.
I would like to do two things :
FIRST :
If 'Num' is different for the same 'ID', then delete the rows of this ID.
SECONDLY :
For a same ID, the step 'F' should be the last one (with the most recent date). If not, then delete the rows of this ID.
I have some difficulties because the commands df['Step'] and df['ID'] are NOT WORKING ('ID' and 'Step' are Multi-Index cause of a recent groupby() ).
I've tried groupby(level=0) that I found on Multi index dataframe delete row with maximum value per group
But I still have some difficulties.
Could someone please help me?
Expected Output :
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4},
'Step': {0: 'A', 1: 'Bar', 2: 'F'},
'Num': {0: 38, 1: 38, 2: 38},
'Date': {0: '2018-08-02',
1: '2018-12-02',
2: '2019-03-02',
'Occurence': {0: 3, 1: 3, 2: 3}})
The ID 88 has been removed because the step 'F' was not the last one step (with the most recent date). The ID 323 has been removed because Num 433!=Num 432.
Since you stated that ID and Step are in the index, we can do it this way:
df1[df1.sort_values('Date').groupby('ID')['Num']\
.transform(lambda x: (x.nunique() == 1) &
(x.index.get_level_values(1)[-1] == 'F'))]
Output:
Num Date Occurence
ID Step
4 A 38 2018-08-02 3
Bar 38 2018-12-02 3
F 38 2019-03-02 3
How?
First sort the dataframe by 'Date'
Then group the dataframe by ID
Taking each group of the dataframe and using the 'Num' column to transform in a boolean series, we
first get the number of unique elements of 'Num' in that
group, if that number is equal to 1, then you know that in that group
all 'Num's are the same and that is True
Secondly, and we get the inner level of the MultiIndex (level=1) and
we check the last value using indexing with [-1], if that value is
equal to 'F' then have a True also
Group the dataframe by column ID
Transform the Num column using nunique to identify the unique values
Transform the Step column using last to check whether the last value per group is F
Combine the boolean masks using logical and and filter the rows
g = df.groupby('ID')
m = g['Num'].transform('nunique').eq(1) & g['Step'].transform('last').eq('F')
print(df[m])
ID Step Num Date Occurence
0 4 A 38 2018-08-02 3
1 4 Bar 38 2018-12-02 3
2 4 F 38 2019-03-02 3
Alternative approach with groupby and filter but could be less efficient than the above approach
df.groupby('ID').filter(lambda g: g['Step'].iloc[-1] == 'F' and g['Num'].nunique() == 1)
ID Step Num Date Occurence
0 4 A 38 2018-08-02 3
1 4 Bar 38 2018-12-02 3
2 4 F 38 2019-03-02 3
Note: In case ID and Step are MultiIndex you have to reset the index before using the above proposed solutions.
I don't know if I understood correctly.
But you can try this
import os
import pandas as pd
sheet = pd.read_excel(io="you_file", sheet_name='sheet_name', na_filter=False, header=0 )
list_objects = []
for index,row in sheet.iterrows():
if (row['ID'] != index):
list_objects.append(row)
list_objects will be a list of dict
use groupby to find the rows with 1 occurrence. I drop the rows in the dataframe based on the ID return by the groupby results. I exclude IDs with one occurrence and not include those in the deletion.
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4, 3: 88, 4: 88, 5: 323, 6: 323},
'Step': {0: 'A', 1: 'Bar', 2: 'F', 3: 'F', 4: 'Bar', 5: 'F', 6: 'A'},
'Num': {0: 38, 1: 38, 2: 38, 3: 320, 4: 320, 5: 433, 6: 432},
'Date': {0: '2018-08-02',
1: '2018-12-02',
2: '2019-03-02',
3: '2017-03-02',
4: '2018-03-02',
5: '2020-03-04',
6: '2020-02-03'},
'Occurence': {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}})
df.set_index(['ID','Step'],inplace=True)
print(df)
print("If 'Num' is different for the same 'ID', then delete the rows of this ID.")
#exclude id with single occurrences
grouped=df.groupby([df.index.get_level_values(0)]).size().eq(1)
labels=set([x for x,y in (grouped[grouped.values==True].index)])
filter=[x for x in df.index.get_level_values(0) if x not in labels]
grouped = df[df.index.get_level_values(0).isin(filter)].groupby([df.index.get_level_values(0),'Num']).size().eq(1)
labels=set([x for x,y in (grouped[grouped.values==True].index)])
if len(labels)>0:
df = df.drop(labels=labels, axis=0,level=0)
print(df)
output:
Num Date Occurence
ID Step
4 A 38 2018-08-02 3
Bar 38 2018-12-02 3
F 38 2019-03-02 3
88 F 320 2017-03-02 2
Bar 320 2018-03-02 2
I have a pandas.dataframe of SEC reports for multiple tickers & periods.
Reproducible dict for DF:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'field': {0: 'taxonomyid',
1: 'cik',
2: 'companyname',
3: 'entityid',
4: 'primaryexchange'},
'value': {0: '50',
1: '0000023217',
2: 'CONAGRA BRANDS INC.',
3: '6976',
4: 'NYSE'},
'ticker': {0: 'CAG', 1: 'CAG', 2: 'CAG', 3: 'CAG', 4: 'CAG'},
'cik': {0: 23217, 1: 23217, 2: 23217, 3: 23217, 4: 23217},
'dcn': {0: '0000023217-18-000009',
1: '0000023217-18-000009',
2: '0000023217-18-000009',
3: '0000023217-18-000009',
4: '0000023217-18-000009'},
'fiscalyear': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},
'fiscalquarter': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'receiveddate': {0: '10/2/2018',
1: '10/2/2018',
2: '10/2/2018',
3: '10/2/2018',
4: '10/2/2018'},
'periodenddate': {0: '8/26/2018',
1: '8/26/2018',
2: '8/26/2018',
3: '8/26/2018',
4: '8/26/2018'}}
The column 'field' contains the name of the reporting field (e.g. Indicator), column 'value' contains value for that indicator. Other columns are description for the SEC filing (ticker+date+fiscal_periods = unique set of features to describe certain filing). There are about 60-70 indicators per filing (number varies).
With the code below I've managed to create a pivot dataframe with columns = features (let say total number of N for 1 submission). But the length of this dataframe also equals the number of indicators = N, with NaN in non-diagonal places.
# Adf - Initial dataframe
c = Adf.pivot(columns='field', values='value')
d = Adf[['ticker','cik','fiscalyear','fiscalquarter','dcn','receiveddate','periodenddate']]
e = pd.concat([d, c], sort=False, axis=1)
I want to use an Indicator names from the 'field' as new columns (going from narrow to wide format). At the end I want to have a dataframe with 1 row for each of SEC reports.
So the expected output for provided example is a 1-row dataframe with N new columns, where N = number of unique indicators from the 'field' column of initial dataframe:
{'ticker': {0: 'CAG'},
'cik': {0: 23217},
'dcn': {0: '0000023217-18-000009'},
'fiscalyear': {0: 2019},
'fiscalquarter': {0: 1},
'receiveddate': {0: '10/2/2018'},
'periodenddate': {0: '8/26/2018'},
'taxonomyid':{0:'50'},
'cik': {0: '0000023217}',
'companyname':{0: 'CONAGRA BRANDS INC.'},
'entityid':{0:'6976'},
'primaryexchange': {0:'NYSE'},
}
What is the proper way to create such columns from or what is the proper way to clean-up resulting dataframe from multiple NaN?
What worked for me is setting new index to DF and unstacking 'field' and 'value' columns
aa = Adf.set_index(['ticker','cik', 'fiscalyear','fiscalquarter', 'dcn','receiveddate', 'periodenddate', 'field']).unstack()
aa = aa.reset_index()
Here is the data:
raw_data = {'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
'age': [42, 42, 36, 24, 73],
'preTestScore': [4, 4, 31, 2, 3],
'postTestScore': [25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
Now I want to search through the dataframes and for two column values I want to get the corresponding two column values.
Here is what I tried-
a,b=df['first_name','last_name'].where(df['age','preTestScore']==42,4)
but its getting the error
KeyError Traceback (most recent call last)
E:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3063 try:
-> 3064 return self._engine.get_loc(key)
3065 except KeyError:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('first_name', 'last_name')
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-6-07dc94043416> in <module>()
----> 1 a,b=df['first_name','last_name'].where(df['age','preTestScore']==42,4)
E:\anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2686 return self._getitem_multilevel(key)
2687 else:
-> 2688 return self._getitem_column(key)
2689
2690 def _getitem_column(self, key):
E:\anaconda\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
2693 # get column
2694 if self.columns.is_unique:
-> 2695 return self._get_item_cache(key)
2696
2697 # duplicate columns & possible reduce dimensionality
E:\anaconda\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
2484 res = cache.get(item)
2485 if res is None:
-> 2486 values = self._data.get(item)
2487 res = self._box_item_values(item, values)
2488 cache[item] = res
E:\anaconda\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]
E:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3064 return self._engine.get_loc(key)
3065 except KeyError:
-> 3066 return self._engine.get_loc(self._maybe_cast_indexer(key))
3067
3068 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('first_name', 'last_name')
You can use set_index to convert your index to a MultiIndex ('age', 'preTestScore'). Then use pd.DataFrame.loc with row and column labels:
df2 = df.set_index(['age', 'preTestScore'])
cols = ['first_name', 'last_name']
res = df2.loc[(42, 4), cols].values.tolist()
print(res)
[['Jason', 'Miller']]
I think need compare both values and add all for all Trues per rows:
m = (df[['age','preTestScore']].values == np.array([42, 4])).all(axis=1)
print (m)
[ True True False False False]
a = df.loc[m, ['first_name','last_name']].drop_duplicates()
print (a)
first_name last_name
0 Jason Miller
Detail:
print ((df[['age','preTestScore']].values == np.array([42, 4])))
[[ True True]
[ True True]
[False False]
[False False]
[False False]]