Delete rows with conditions (Multi-Index case) - python

I'm new to Stack Overflow and I have this Data Set :
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4, 3: 88, 4: 88, 5: 323, 6: 323},
'Step': {0: 'A', 1: 'Bar', 2: 'F', 3: 'F', 4: 'Bar', 5: 'F', 6: 'A'},
'Num': {0: 38, 1: 38, 2: 38, 3: 320, 4: 320, 5: 433, 6: 432},
'Date': {0: '2018-08-02',
1: '2018-12-02',
2: '2019-03-02',
3: '2017-03-02',
4: '2018-03-02',
5: '2020-03-04',
6: '2020-02-03'},
'Occurence': {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}})
'ID' and 'Step' together form a MultiIndex.
I would like to do two things:
FIRST:
If 'Num' differs within the same 'ID', delete all rows of that ID.
SECONDLY:
For a given ID, the step 'F' should be the last one (the one with the most recent date). If it is not, delete all rows of that ID.
I'm having difficulties because df['Step'] and df['ID'] do not work: 'ID' and 'Step' are index levels, not columns, as the result of a recent groupby().
I've tried groupby(level=0), which I found in "Multi index dataframe delete row with maximum value per group",
but I still have some difficulties.
Could someone please help me?
Expected Output :
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4},
'Step': {0: 'A', 1: 'Bar', 2: 'F'},
'Num': {0: 38, 1: 38, 2: 38},
'Date': {0: '2018-08-02',
1: '2018-12-02',
2: '2019-03-02'},
'Occurence': {0: 3, 1: 3, 2: 3}})
The ID 88 has been removed because step 'F' was not its last step (the one with the most recent date). The ID 323 has been removed because its Num values differ (433 != 432).

Since you stated that ID and Step are in the index, we can do it this way (df1 is your dataframe with ID and Step as the MultiIndex):
df1[df1.sort_values('Date').groupby('ID')['Num']\
.transform(lambda x: (x.nunique() == 1) &
(x.index.get_level_values(1)[-1] == 'F'))]
Output:
Num Date Occurence
ID Step
4 A 38 2018-08-02 3
Bar 38 2018-12-02 3
F 38 2019-03-02 3
How?
First, sort the dataframe by 'Date'.
Then group the dataframe by ID.
For each group, transform the 'Num' column into a boolean series built from two checks:
First, we count the unique values of 'Num' in the group. If that count equals 1, all 'Num' values in the group are the same and the check is True.
Secondly, we take the inner level of the MultiIndex (level=1) and look at its last value with [-1]. If that value equals 'F', this check is True as well.
Only groups where both checks are True are kept.

Group the dataframe by column ID
Transform the Num column using nunique to identify the unique values
Transform the Step column using last to check whether the last value per group is F (this relies on the rows of each ID already being in Date order)
Combine the boolean masks using logical and and filter the rows
g = df.groupby('ID')
m = g['Num'].transform('nunique').eq(1) & g['Step'].transform('last').eq('F')
print(df[m])
ID Step Num Date Occurence
0 4 A 38 2018-08-02 3
1 4 Bar 38 2018-12-02 3
2 4 F 38 2019-03-02 3
An alternative approach uses groupby and filter, but it could be less efficient than the one above:
df.groupby('ID').filter(lambda g: g['Step'].iloc[-1] == 'F' and g['Num'].nunique() == 1)
ID Step Num Date Occurence
0 4 A 38 2018-08-02 3
1 4 Bar 38 2018-12-02 3
2 4 F 38 2019-03-02 3
Note: In case ID and Step are MultiIndex you have to reset the index before using the above proposed solutions.
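A minimal sketch of that note, assuming the frame arrives with ID and Step as index levels, as in the question. It also sorts by Date before taking 'last', because transform('last') only looks at row order. The names indexed, flat and result are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [4, 4, 4, 88, 88, 323, 323],
    'Step': ['A', 'Bar', 'F', 'F', 'Bar', 'F', 'A'],
    'Num': [38, 38, 38, 320, 320, 433, 432],
    'Date': ['2018-08-02', '2018-12-02', '2019-03-02', '2017-03-02',
             '2018-03-02', '2020-03-04', '2020-02-03'],
    'Occurence': [3, 3, 3, 2, 2, 2, 2],
})
indexed = df.set_index(['ID', 'Step'])  # the asker's starting point

# Move the index levels back to columns and order each ID by date.
flat = indexed.reset_index()
flat['Date'] = pd.to_datetime(flat['Date'])
flat = flat.sort_values(['ID', 'Date'])

g = flat.groupby('ID')
same_num = g['Num'].transform('nunique').eq(1)     # one distinct Num per ID
ends_with_f = g['Step'].transform('last').eq('F')  # latest step is 'F'

# Keep the rows passing both checks and restore the MultiIndex.
result = flat[same_num & ends_with_f].set_index(['ID', 'Step'])
print(result)
```

Only the three rows of ID 4 remain: ID 88 ends with 'Bar' and ID 323 has two different Num values.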

I don't know if I understood correctly,
but you can try this:
import pandas as pd
sheet = pd.read_excel(io="your_file", sheet_name='sheet_name', na_filter=False, header=0)
list_objects = []
# collect every row whose 'ID' differs from its index label
for index, row in sheet.iterrows():
    if row['ID'] != index:
        list_objects.append(row)
list_objects will be a list of rows (each one a pandas Series).

Use groupby to find the IDs to delete. IDs with a single row are excluded first so they are not dropped; the remaining rows are grouped by ID and Num, and every ID that has an (ID, Num) pair occurring only once is dropped. This covers only the first condition.
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4, 3: 88, 4: 88, 5: 323, 6: 323},
'Step': {0: 'A', 1: 'Bar', 2: 'F', 3: 'F', 4: 'Bar', 5: 'F', 6: 'A'},
'Num': {0: 38, 1: 38, 2: 38, 3: 320, 4: 320, 5: 433, 6: 432},
'Date': {0: '2018-08-02',
1: '2018-12-02',
2: '2019-03-02',
3: '2017-03-02',
4: '2018-03-02',
5: '2020-03-04',
6: '2020-02-03'},
'Occurence': {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}})
df.set_index(['ID','Step'],inplace=True)
print(df)
print("If 'Num' is different for the same 'ID', then delete the rows of this ID.")
ids = df.index.get_level_values(0)
# IDs with a single row cannot have differing 'Num' values, so keep them
single = df.groupby(ids).size().eq(1)
single_ids = set(single[single].index)
multi = df[~ids.isin(single_ids)]
# an (ID, Num) pair that occurs only once marks an ID to delete
grouped = multi.groupby([multi.index.get_level_values(0), 'Num']).size().eq(1)
labels = {x for x, y in grouped[grouped].index}
if labels:
    df = df.drop(labels=sorted(labels), axis=0, level=0)
print(df)
output:
Num Date Occurence
ID Step
4 A 38 2018-08-02 3
Bar 38 2018-12-02 3
F 38 2019-03-02 3
88 F 320 2017-03-02 2
Bar 320 2018-03-02 2

Related

how to compare two dataframes by multiple columns and only append new entries in pandas?

I want to add new files to a historical table (both are CSV files, not in a database). Before that, I need to check the new file against the historical table by comparing two columns in particular: the state column and the date column. First, I need to pick the rows with the max yyyy_mm per state, then check whether those rows are in the historical table; if they are not, append them, otherwise do nothing.
So far I am able to pick the rows with the max yyyy_mm per state, but when I compare those rows with the historical table, I do not get the expected output. I tried pandas.merge and pandas.concat, but the output is not what I expect. Can anyone point out how to do this in pandas? Any thoughts?
Input data:
>>> src_df.to_dict()
{'yyyy_mm': {0: 202001,
1: 202002,
2: 202003,
3: 202002,
4: 202107,
5: 202108,
6: 202109},
'state': {0: 'CA', 1: 'NJ', 2: 'NY', 3: 'NY', 4: 'PA', 5: 'PA', 6: 'PA'},
'col1': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3},
'col2': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4},
'col3': {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7}}
>>> hist_df.to_dict()
{'yyyy_mm': {0: 202101,
1: 202002,
2: 202001,
3: 201901,
4: 201907,
5: 201908,
6: 201901,
7: 201907,
8: 201908},
'state': {0: 'CA',
1: 'NJ',
2: 'NY',
3: 'NY',
4: 'NY',
5: 'NY',
6: 'PA',
7: 'PA',
8: 'PA'},
'col1': {0: 1, 1: 3, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4},
'col2': {0: 1, 1: 3, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
'col3': {0: 1, 1: 7, 2: 8, 3: 8, 4: 8, 5: 8, 6: 8, 7: 8, 8: 8}}
My current attempt:
picked_rows = src_df.loc[src_df.groupby('state')['yyyy_mm'].idxmax()]
>>> picked_rows.to_dict()
{'yyyy_mm': {0: 202001, 1: 202002, 2: 202003, 6: 202109},
'state': {0: 'CA', 1: 'NJ', 2: 'NY', 6: 'PA'},
'col1': {0: 3, 1: 3, 2: 3, 6: 3},
'col2': {0: 3, 1: 3, 2: 3, 6: 4},
'col3': {0: 7, 1: 7, 2: 7, 6: 7}}
Then I tried the following, but the output is not the same as my expected output:
output_df = pd.concat(picked_rows, hist_df, keys=['state', 'yyyy_mm'], axis=1) # first attempt
output_df = pd.merge(picked_rows, hist_df, how='outer') # second attempt
Neither attempt gives me my expected output. How can I compare the two dataframes so that picked_rows is appended to hist_df conditionally, based on the max yyyy_mm per state? How should we do this in pandas?
Objective
I want to look up picked_rows in hist_df by the state and yyyy_mm columns and only add the entries from picked_rows whose yyyy_mm is more recent for that state. I created the desired output below. I tried an inner join and pandas.concat, but neither gives me the correct output. Does anyone have any ideas on this?
Here is my desired output that I want to get:
yyyy_mm state col1 col2 col3
0 202101 CA 1 1 1
1 202002 NJ 3 3 7
2 202001 NY 4 5 8
3 201901 NY 4 5 8
4 201907 NY 4 5 8
5 201908 NY 4 5 8
6 201901 PA 4 5 8
7 201907 PA 4 5 8
8 201908 PA 4 5 8
9 202003 NY 3 3 7
10 202109 PA 3 4 7
You should change your picked_rows DataFrame to only include rows whose date is greater than the latest hist_df date for the same state:
#keep only rows that are newer than in hist_df
new_data = src_df[src_df["yyyy_mm"].gt(src_df["state"].map(hist_df.groupby("state")["yyyy_mm"].max()))]
#of the new rows, keep the latest updated values
picked_rows = new_data.loc[new_data.groupby("state")["yyyy_mm"].idxmax()]
#concat to hist_df
output_df = pd.concat([hist_df, picked_rows], ignore_index=True)
>>> output_df
yyyy_mm state col1 col2 col3
0 202101 CA 1 1 1
1 202002 NJ 3 3 7
2 202001 NY 4 5 8
3 201901 NY 4 5 8
4 201907 NY 4 5 8
5 201908 NY 4 5 8
6 201901 PA 4 5 8
7 201907 PA 4 5 8
8 201908 PA 4 5 8
9 202003 NY 3 3 7
10 202109 PA 3 4 7
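One edge case is worth a hedged sketch: a state that appears in src_df but not in hist_df maps to NaN, the gt comparison is then False, and the state is silently skipped. Assuming such states should be appended, filling the missing cutoff with a sentinel is one option. The small frames below and the sentinel value -1 are assumptions made for this illustration, not data from the question.

```python
import pandas as pd

# Toy data: 'TX' exists only in the new file.
hist_df = pd.DataFrame({'yyyy_mm': [202101, 202002],
                        'state': ['CA', 'NJ'],
                        'col1': [1, 3]})
src_df = pd.DataFrame({'yyyy_mm': [202001, 202003, 202105, 202104],
                       'state': ['CA', 'NJ', 'TX', 'TX'],
                       'col1': [3, 3, 3, 3]})

# Latest known period per state; unseen states get -1 so any period is newer.
latest_hist = hist_df.groupby('state')['yyyy_mm'].max()
cutoff = src_df['state'].map(latest_hist).fillna(-1)

# Keep rows newer than the cutoff, then the latest such row per state.
new_data = src_df[src_df['yyyy_mm'].gt(cutoff)]
picked_rows = new_data.loc[new_data.groupby('state')['yyyy_mm'].idxmax()]

output_df = pd.concat([hist_df, picked_rows], ignore_index=True)
print(output_df)
```

Here CA is skipped because 202001 is older than 202101, NJ gains 202003, and TX gains its latest row, 202105.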

Merge DataFrame with many-to-many

I have 2 DataFrames containing examples, and I would like to see if an example of DataFrame 1 is present in DataFrame 2.
Normally I would aggregate the rows per example and simply merge the DataFrames. Unfortunately the merging has to be done with a "matching table" which has a many-to-many relationship between the keys (id_low vs. id_high).
Simplified example [images in the original post: matching table, input DataFrames, how they match, expected output]
Simplified example (for Python)
import pandas as pd
# Dataframe 1 - containing 1 Example
d1 = pd.DataFrame.from_dict({'Example': {0: 'Example 1', 1: 'Example 1', 2: 'Example 1'},
'id_low': {0: 1, 1: 2, 2: 3}})
# DataFrame 2 - containing 1 Example
d2 = pd.DataFrame.from_dict({'Example': {0: 'Example 2', 1: 'Example 2', 2: 'Example 2'},
'id_low': {0: 1, 1: 4, 2: 6}})
# DataFrame 3 - matching table
dm = pd.DataFrame.from_dict({'id_low': {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 4, 6: 5, 7: 6, 8: 6},
'id_high': {0: 'A',
1: 'B',
2: 'C',
3: 'D',
4: 'E',
5: 'B',
6: 'B',
7: 'E',
8: 'F'}})
d1 and d2 are matchable as you can see above.
Expected Output (or similar):
df_output = pd.DataFrame.from_dict({'Example': {0: 'Example 1'}, 'Example_2': {0: 'Example 2'}})
Failed attempts
Aggregating the values translated through the matching table and then merging. I also considered using regex with the OR operator.
IIUC:
d2.merge(dm)\
.merge(d1.merge(dm), on='id_high')\
.groupby(['Example_x','Example_y'])['id_high'].agg(list)\
.reset_index()
Output:
Example_x Example_y id_high
0 Example 2 Example 1 [A, B, E]
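To get closer to the two-column output asked for in the question, the same double merge can be reduced to distinct pairs of examples. This is a minimal sketch under the question's data; the names left, right and pairs and the '_2' suffix are choices made for this illustration.

```python
import pandas as pd

d1 = pd.DataFrame({'Example': ['Example 1'] * 3, 'id_low': [1, 2, 3]})
d2 = pd.DataFrame({'Example': ['Example 2'] * 3, 'id_low': [1, 4, 6]})
dm = pd.DataFrame({'id_low': [1, 2, 2, 3, 3, 4, 5, 6, 6],
                   'id_high': ['A', 'B', 'C', 'D', 'E', 'B', 'B', 'E', 'F']})

# Translate each frame's id_low values into id_high values.
left = d1.merge(dm, on='id_low')
right = d2.merge(dm, on='id_low')

# Two examples match when they share at least one id_high value.
pairs = (left.merge(right, on='id_high', suffixes=('', '_2'))
             [['Example', 'Example_2']]
             .drop_duplicates()
             .reset_index(drop=True))
print(pairs)
```

The shared id_high values A, B and E each produce one joined row, and drop_duplicates collapses them into a single Example 1 / Example 2 pair.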

Pandas: population new columns from other column's values

I have a pandas.dataframe of SEC reports for multiple tickers & periods.
Reproducible dict for DF:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'field': {0: 'taxonomyid',
1: 'cik',
2: 'companyname',
3: 'entityid',
4: 'primaryexchange'},
'value': {0: '50',
1: '0000023217',
2: 'CONAGRA BRANDS INC.',
3: '6976',
4: 'NYSE'},
'ticker': {0: 'CAG', 1: 'CAG', 2: 'CAG', 3: 'CAG', 4: 'CAG'},
'cik': {0: 23217, 1: 23217, 2: 23217, 3: 23217, 4: 23217},
'dcn': {0: '0000023217-18-000009',
1: '0000023217-18-000009',
2: '0000023217-18-000009',
3: '0000023217-18-000009',
4: '0000023217-18-000009'},
'fiscalyear': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},
'fiscalquarter': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'receiveddate': {0: '10/2/2018',
1: '10/2/2018',
2: '10/2/2018',
3: '10/2/2018',
4: '10/2/2018'},
'periodenddate': {0: '8/26/2018',
1: '8/26/2018',
2: '8/26/2018',
3: '8/26/2018',
4: '8/26/2018'}}
The column 'field' contains the name of the reporting field (e.g. Indicator), column 'value' contains value for that indicator. Other columns are description for the SEC filing (ticker+date+fiscal_periods = unique set of features to describe certain filing). There are about 60-70 indicators per filing (number varies).
With the code below I've managed to create a pivoted dataframe whose columns are the indicators (say N of them for one submission). But the number of rows of this dataframe also equals N, with NaN in all off-diagonal positions.
# Adf - Initial dataframe
c = Adf.pivot(columns='field', values='value')
d = Adf[['ticker','cik','fiscalyear','fiscalquarter','dcn','receiveddate','periodenddate']]
e = pd.concat([d, c], sort=False, axis=1)
I want to use the indicator names from 'field' as new columns (going from narrow to wide format). In the end I want a dataframe with one row per SEC report.
So the expected output for provided example is a 1-row dataframe with N new columns, where N = number of unique indicators from the 'field' column of initial dataframe:
{'ticker': {0: 'CAG'},
'cik': {0: 23217},
'dcn': {0: '0000023217-18-000009'},
'fiscalyear': {0: 2019},
'fiscalquarter': {0: 1},
'receiveddate': {0: '10/2/2018'},
'periodenddate': {0: '8/26/2018'},
'taxonomyid':{0:'50'},
'cik': {0: '0000023217'},
'companyname':{0: 'CONAGRA BRANDS INC.'},
'entityid':{0:'6976'},
'primaryexchange': {0:'NYSE'},
}
What is the proper way to create such columns, or to clean up the resulting dataframe from the multiple NaN?
What worked for me was setting a new index on the dataframe, with 'field' as its last level, and unstacking it:
aa = Adf.set_index(['ticker','cik', 'fiscalyear','fiscalquarter', 'dcn','receiveddate', 'periodenddate', 'field']).unstack()
aa = aa.reset_index()
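A hedged variant of the same idea on the sample data: selecting only 'value' before unstacking keeps the leftover 'Unnamed: 0' column out of the result and gives flat column names. Because the indicator 'cik' clashes with the existing 'cik' column, the sketch prefixes the new columns; the prefix 'field_' and the names id_cols and wide are assumptions made for this illustration.

```python
import pandas as pd

Adf = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2, 3, 4],
    'field': ['taxonomyid', 'cik', 'companyname', 'entityid', 'primaryexchange'],
    'value': ['50', '0000023217', 'CONAGRA BRANDS INC.', '6976', 'NYSE'],
    'ticker': ['CAG'] * 5,
    'cik': [23217] * 5,
    'dcn': ['0000023217-18-000009'] * 5,
    'fiscalyear': [2019] * 5,
    'fiscalquarter': [1] * 5,
    'receiveddate': ['10/2/2018'] * 5,
    'periodenddate': ['8/26/2018'] * 5,
})

# Columns that together identify one filing.
id_cols = ['ticker', 'cik', 'fiscalyear', 'fiscalquarter',
           'dcn', 'receiveddate', 'periodenddate']

# One row per filing, one column per indicator.
wide = (Adf.set_index(id_cols + ['field'])['value']
           .unstack('field')
           .add_prefix('field_')   # avoids the clash with the 'cik' column
           .reset_index())
wide.columns.name = None
print(wide)
```

For the sample this yields a single row with the 7 identifying columns plus 5 indicator columns.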

Python: Pandas Dataframe Column Headers Look Strange After Groupby

I implemented the following groupby statement in my code. The purpose of the code below is to provide the minimum date from the "DTIN" column by unique EVENTID.
df_EVENT5_future_2 = df_EVENT5_future.groupby('EVENTID').agg({'DTIN': [np.min]})
df_EVENT5_future_3 = df_EVENT5_future_2.reset_index()
The output table is as follows:
EVENTID DTIN
amin
A 1/3/2019
B 1/19/2019
C 2/10/2019
I would like the table to output like this. I don't want the amin to be in the column header.
EVENTID DTIN
A 1/3/2019
B 1/19/2019
C 2/10/2019
Any help is greatly appreciated.
This is as per @Wen's suggestion. You don't need to use agg for this. Simply use groupby.min() and set as_index=False:
result = df.groupby('EVENTID', as_index=False)['DTIN'].min()
Please do not upvote or accept this answer, as this is a duplicate.
Example
df = pd.DataFrame({'DTIN': {0: 4, 1: 3, 2: 9, 3: 1, 4: 2, 5: 5, 6: 6, 7: 5},
'EVENTID': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'C', 5: 'B', 6: 'B', 7: 'C'}})
result = df.groupby('EVENTID', as_index=False)['DTIN'].min()
# EVENTID DTIN
# 0 A 3
# 1 B 1
# 2 C 2

Splitting a DataFrame based on duplicate values in columns

Here's my starting dataframe:
StartDF = pd.DataFrame({'A': {0: 1, 1: 1, 2: 2, 3: 4, 4: 5, 5: 5, 6: 5, 7: 5}, 'B': {0: 2, 1: 2, 2: 4, 3: 2, 4: 2, 5: 4, 6: 4, 7: 5}, 'C': {0: 10, 1: 1000, 2: 250, 3: 100, 4: 550, 5: 100, 6: 3000, 7: 250}})
I need to create a list of individual dataframes based on duplicate values in columns A and B, so it should look like this:
df1 = pd.DataFrame({'A': {0: 1, 1: 1}, 'B': {0: 2, 1: 2}, 'C': {0: 10, 1: 1000}})
df2 = pd.DataFrame({'A': {0: 2}, 'B': {0: 4}, 'C': {0: 250}})
df3 = pd.DataFrame({'A': {0: 4}, 'B': {0: 2}, 'C': {0: 100}})
df4 = pd.DataFrame({'A': {0: 5}, 'B': {0: 2}, 'C': {0: 550}})
df5 = pd.DataFrame({'A': {0: 5, 1: 5}, 'B': {0: 4, 1: 4}, 'C': {0: 100, 1: 3000}})
df6 = pd.DataFrame({'A': {0: 5}, 'B': {0: 5}, 'C': {0: 250}})
I've seen a lot of answers that explain how to DROP duplicates, but I need to keep the duplicate values because the information in column C will usually be different between rows regardless of duplicates in columns A and B. All of the row data needs to be preserved in the new dataframes.
Additional note, the starting dataframe (StartDF) will change in length, so each time this is run, the number of individual dataframes created will be variable. Ultimately, I need to print the newly created dataframes to their own csv files (I know how to do this part). Just need to know how to break out the data from the original dataframe in an elegant way.
You can use a groupby, iterate over each group and build a list using a list comprehension.
df_list = [g for _, g in StartDF.groupby(['A', 'B'])]
print(*df_list, sep='\n\n')
A B C
0 1 2 10
1 1 2 1000
A B C
2 2 4 250
A B C
3 4 2 100
A B C
4 5 2 550
A B C
5 5 4 100
6 5 4 3000
A B C
7 5 5 250
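If the pieces should look like df1 to df6 in the question (each with a fresh 0-based index) and stay addressable by their (A, B) values, a dict comprehension is one option. A minimal sketch; frames is a hypothetical name.

```python
import pandas as pd

StartDF = pd.DataFrame({'A': [1, 1, 2, 4, 5, 5, 5, 5],
                        'B': [2, 2, 4, 2, 2, 4, 4, 5],
                        'C': [10, 1000, 250, 100, 550, 100, 3000, 250]})

# Map each (A, B) combination to its own frame, renumbering rows from 0.
frames = {key: g.reset_index(drop=True)
          for key, g in StartDF.groupby(['A', 'B'])}

print(frames[(5, 4)])
```

The dict keys can then be used to build one file name per group when writing the CSV files.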
