Merge DataFrame with many-to-many - python

I have 2 DataFrames containing examples, and I would like to see if an example from DataFrame 1 is present in DataFrame 2.
Normally I would aggregate the rows per example and simply merge the DataFrames. Unfortunately the merging has to be done with a "matching table" which has a many-to-many relationship between the keys (id_low vs. id_high).
Simplified example
Matching Table:
Input DataFrames
They are therefore matchable like this:
Expected Output:
Simplified example (for Python)
import pandas as pd
# Dataframe 1 - containing 1 Example
d1 = pd.DataFrame.from_dict({'Example': {0: 'Example 1', 1: 'Example 1', 2: 'Example 1'},
'id_low': {0: 1, 1: 2, 2: 3}})
# DataFrame 2 - containing 1 Example
d2 = pd.DataFrame.from_dict({'Example': {0: 'Example 2', 1: 'Example 2', 2: 'Example 2'},
'id_low': {0: 1, 1: 4, 2: 6}})
# DataFrame 3 - matching table
dm = pd.DataFrame.from_dict({'id_low': {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 4, 6: 5, 7: 6, 8: 6},
'id_high': {0: 'A',
1: 'B',
2: 'C',
3: 'D',
4: 'E',
5: 'B',
6: 'B',
7: 'E',
8: 'F'}})
d1 and d2 are matchable as you can see above.
Expected Output (or similar):
df_output = pd.DataFrame.from_dict({'Example': {0: 'Example 1'}, 'Example_2': {0: 'Example 2'}})
Failed attempts
Aggregating the values translated through the matching table, then merging. I also considered using a regex with the OR operator.

IIUC:
d2.merge(dm)\
  .merge(d1.merge(dm), on='id_high')\
  .groupby(['Example_x','Example_y'])['id_high'].agg(list)\
  .reset_index()
Output:
Example_x Example_y id_high
0 Example 2 Example 1 [A, B, E]
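A self-contained sketch of the same idea, assuming the goal is the Example / Example_2 layout from the question. The names left, right and pairs, and the suffixes=("", "_2") choice, are illustrative and not part of the original answer.

```python
import pandas as pd

# Data from the question, written as plain lists
d1 = pd.DataFrame({"Example": ["Example 1"] * 3, "id_low": [1, 2, 3]})
d2 = pd.DataFrame({"Example": ["Example 2"] * 3, "id_low": [1, 4, 6]})
dm = pd.DataFrame({
    "id_low": [1, 2, 2, 3, 3, 4, 5, 6, 6],
    "id_high": ["A", "B", "C", "D", "E", "B", "B", "E", "F"],
})

# Translate each frame's id_low into id_high through the matching table
left = d1.merge(dm, on="id_low")
right = d2.merge(dm, on="id_low")

# Join on the shared id_high and keep one row per matching pair of examples
pairs = (
    left.merge(right, on="id_high", suffixes=("", "_2"))
        [["Example", "Example_2"]]
        .drop_duplicates()
        .reset_index(drop=True)
)
print(pairs)
```

Dropping duplicates collapses the several id_high values shared by a pair into a single row per pair.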

Related

Write a function to perform calculations on multiple columns in a Pandas dataframe

I have the following dataframe (the real one has a lot more columns and rows, so just using this as an example):
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
I'd like to write a function to perform calculations on the dataframe, for specific columns. The calculation is in the code below.
As I'd only want to apply the code to specific columns, I've set up a list of columns, and as there is a pre-defined 'factor' we need to take into account in the calculation, I set this up too:
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row):
    return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
Then, I apply the function to the dataframe, and I want to overwrite the original column values with the new ones, so I do this:
for cols in df.columns:
    df[cols] = df[cols].apply(multiply_columns)
But I get the following error:
~\AppData\Local\Temp/ipykernel_8544/3939806184.py in multiply_columns(row)
3
4 def multiply_columns(row):
----> 5 return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
6
7
TypeError: string indices must be integers
But the values I'm using in the calculation aren't strings:
sample object
sample id int64
replicate int64
taste float64
smell float64
shape float64
volume int64
weight float64
dtype: object
The desired output would be:
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 0.0074, 1: 0.028366667, 2: 0.2183, 3: 3.08333e-05},
'smell': {0: 0.123333333, 1: 0.141833333, 2: 0.01295, 3: 0.032683333},
'shape': {0: 2.46667e-05, 1: 0.001233333, 2: 0.00074, 3: 0.067833333},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
Can anyone kindly show me the error of my ways?
This has a few issues.
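Series.apply calls your function once per cell, not once per row, so row is a single value. For the 'sample' column that value is a string such as 'orange', and indexing a string with a column name (row[cols]) raises the TypeError. If you wanted to index elements by position, the index would have to be an integer rather than a string (the column name). To get the integer positions of the column names you're interested in, you could use this: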
cols = ['taste', 'smell', 'shape']
cols_idx = [df.columns.get_loc(col) for col in cols]
However, if I understand your question, you could perform this operation on columns directly with the understanding that the operation will be performed on each row. See a test case that worked for me:
import pandas as pd
df = pd.DataFrame({'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}})
cols = ['taste', 'smell', 'shape']
factor = 72
for col in cols:
    df[col] = ((df[col] / df['volume']) * (factor * df['volume'] / df['weight']) / 1000)
Note that your line
for cols in df.columns:
runs the loop over every column in the DataFrame: cols is rebound to a single column name on each pass and is no longer your list.
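As a possible variation, the loop can be avoided by operating on all selected columns at once. This is a minimal sketch using only the numeric columns from the question; the helper name scale is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "taste": [1.2, 4.6, 35.4, 0.005],
    "smell": [20.0, 23.0, 2.1, 5.3],
    "shape": [0.004, 0.2, 0.12, 11.0],
    "volume": [23, 23, 23, 23],
    "weight": [12.0, 1.3, 2.4, 3.2],
})
cols = ["taste", "smell", "shape"]
factor = 72

# Per-row multiplier: factor * volume / weight / 1000
scale = factor * df["volume"] / df["weight"] / 1000

# Divide each selected column by volume, then multiply by the per-row scale;
# axis=0 aligns the Series with the rows of the selected block
df[cols] = df[cols].div(df["volume"], axis=0).mul(scale, axis=0)
print(df)
```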
You have to pass the column as well to the function.
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row, col):
    return ((row[col] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
for col in cols:
    df[col] = df.apply(lambda x: multiply_columns(x, col), axis=1)
Also, the output I'm getting is a bit different from your desired output even though I used the same formula.
sample sample id replicate taste smell shape volume weight
0 orange 1 1 0.00720000000 0.12000000000 0.00002400000 23 12.00000000000
1 orange 1 2 0.25476923077 1.27384615385 0.01107692308 23 1.30000000000
2 banana 5 1 1.06200000000 0.06300000000 0.00360000000 23 2.40000000000
3 banana 5 2 0.00011250000 0.11925000000 0.24750000000 23 3.20000000000

Delete rows with conditions (Multi-Index case)

I'm new to Stack Overflow and I have this Data Set :
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4, 3: 88, 4: 88, 5: 323, 6: 323},
'Step': {0: 'A', 1: 'Bar', 2: 'F', 3: 'F', 4: 'Bar', 5: 'F', 6: 'A'},
'Num': {0: 38, 1: 38, 2: 38, 3: 320, 4: 320, 5: 433, 6: 432},
'Date': {0: '2018-08-02',
1: '2018-12-02',
2: '2019-03-02',
3: '2017-03-02',
4: '2018-03-02',
5: '2020-03-04',
6: '2020-02-03'},
'Occurence': {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}})
The variables 'ID' and 'Step' are Multi-index.
I would like to do two things :
FIRST :
If 'Num' is different for the same 'ID', then delete the rows of this ID.
SECONDLY :
For a same ID, the step 'F' should be the last one (with the most recent date). If not, then delete the rows of this ID.
I am having difficulties because the commands df['Step'] and df['ID'] are NOT WORKING ('ID' and 'Step' are a MultiIndex because of a recent groupby()).
I've tried groupby(level=0), which I found in "Multi index dataframe delete row with maximum value per group",
but I am still having difficulties.
Could someone please help me?
Expected Output :
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4},
'Step': {0: 'A', 1: 'Bar', 2: 'F'},
'Num': {0: 38, 1: 38, 2: 38},
'Date': {0: '2018-08-02',
1: '2018-12-02',
2: '2019-03-02'},
'Occurence': {0: 3, 1: 3, 2: 3}})
The ID 88 has been removed because the step 'F' was not the last step (the one with the most recent date). The ID 323 has been removed because Num 433 != Num 432.
Since you stated that ID and Step are in the index, we can do it this way:
df1[df1.sort_values('Date').groupby('ID')['Num']\
       .transform(lambda x: (x.nunique() == 1) &
                            (x.index.get_level_values(1)[-1] == 'F'))]
Output:
Num Date Occurence
ID Step
4 A 38 2018-08-02 3
Bar 38 2018-12-02 3
F 38 2019-03-02 3
How?
First, sort the dataframe by 'Date'.
Then group the dataframe by ID.
For each group, transform the 'Num' column into a boolean series using two checks:
first, get the number of unique values of 'Num' in the group; if that number equals 1, all 'Num' values in the group are the same and the check is True;
second, take the inner level of the MultiIndex (level=1) and check its last value using indexing with [-1]; if that value equals 'F', this check is True as well.
Group the dataframe by column ID
Transform the Num column using nunique to identify the unique values
Transform the Step column using last to check whether the last value per group is F
Combine the boolean masks using logical and and filter the rows
g = df.groupby('ID')
m = g['Num'].transform('nunique').eq(1) & g['Step'].transform('last').eq('F')
print(df[m])
ID Step Num Date Occurence
0 4 A 38 2018-08-02 3
1 4 Bar 38 2018-12-02 3
2 4 F 38 2019-03-02 3
An alternative approach uses groupby and filter, but it could be less efficient than the approach above:
df.groupby('ID').filter(lambda g: g['Step'].iloc[-1] == 'F' and g['Num'].nunique() == 1)
ID Step Num Date Occurence
0 4 A 38 2018-08-02 3
1 4 Bar 38 2018-12-02 3
2 4 F 38 2019-03-02 3
Note: In case ID and Step are MultiIndex you have to reset the index before using the above proposed solutions.
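For the MultiIndex case, the mask approach might be sketched as follows. This assumes the index levels are named ID and Step as in the question, and it adds an explicit sort by date so that "last" means "most recent"; the names flat, mask and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [4, 4, 4, 88, 88, 323, 323],
    "Step": ["A", "Bar", "F", "F", "Bar", "F", "A"],
    "Num": [38, 38, 38, 320, 320, 433, 432],
    "Date": ["2018-08-02", "2018-12-02", "2019-03-02", "2017-03-02",
             "2018-03-02", "2020-03-04", "2020-02-03"],
    "Occurence": [3, 3, 3, 2, 2, 2, 2],
}).set_index(["ID", "Step"])

# Move the index levels back to columns and order each ID chronologically
flat = df.reset_index()
flat["Date"] = pd.to_datetime(flat["Date"])
flat = flat.sort_values(["ID", "Date"])

# Keep IDs with a single Num value whose most recent step is 'F'
g = flat.groupby("ID")
mask = g["Num"].transform("nunique").eq(1) & g["Step"].transform("last").eq("F")
result = flat[mask].set_index(["ID", "Step"])
print(result)
```

Sorting first matters for ID 323: its most recent step is 'F', so it is removed only because its Num values differ.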
I don't know if I understood correctly.
But you can try this
import os
import pandas as pd
sheet = pd.read_excel(io="you_file", sheet_name='sheet_name', na_filter=False, header=0 )
list_objects = []
for index, row in sheet.iterrows():
    if row['ID'] != index:
        list_objects.append(row)
list_objects will be a list of pandas Series, one per row.
Use groupby to find the rows with one occurrence, then drop rows from the dataframe based on the IDs returned by the groupby results. IDs that occur only once are excluded first so that they are not included in the deletion.
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4, 3: 88, 4: 88, 5: 323, 6: 323},
'Step': {0: 'A', 1: 'Bar', 2: 'F', 3: 'F', 4: 'Bar', 5: 'F', 6: 'A'},
'Num': {0: 38, 1: 38, 2: 38, 3: 320, 4: 320, 5: 433, 6: 432},
'Date': {0: '2018-08-02',
1: '2018-12-02',
2: '2019-03-02',
3: '2017-03-02',
4: '2018-03-02',
5: '2020-03-04',
6: '2020-02-03'},
'Occurence': {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}})
df.set_index(['ID','Step'],inplace=True)
print(df)
print("If 'Num' is different for the same 'ID', then delete the rows of this ID.")
# exclude IDs with a single occurrence (the result has a flat index of IDs)
grouped = df.groupby([df.index.get_level_values(0)]).size().eq(1)
labels = set(grouped[grouped].index)
filter = [x for x in df.index.get_level_values(0) if x not in labels]
# an (ID, Num) pair that occurs only once means Num differs within that ID
grouped = df[df.index.get_level_values(0).isin(filter)].groupby([df.index.get_level_values(0),'Num']).size().eq(1)
labels = set([x for x, y in grouped[grouped.values == True].index])
if len(labels) > 0:
    df = df.drop(labels=labels, axis=0, level=0)
print(df)
output:
Num Date Occurence
ID Step
4 A 38 2018-08-02 3
Bar 38 2018-12-02 3
F 38 2019-03-02 3
88 F 320 2017-03-02 2
Bar 320 2018-03-02 2

Group By Having Count in Pandas

Here is my data:
{'SystemID': {0: '95EE8B57',
1: '5F891F03',
2: '5F891F03',
3: '5F891F03'},
'Day': {0: '06/08/2018', 1: '05/08/2018', 2: '04/08/2018', 3: '05/08/2018'},
'AlarmClass-S': {0: 4, 1: 2, 2: 4, 3: 0},
'AlarmClass-ELM': {0: 0, 1: 0, 2: 0, 3: 2}}
I would like to perform an aggregation and filtering which in SQL would be formulated as
SELECT SystemID, COUNT(*) as count FROM table GROUP BY SystemID HAVING COUNT(*) > 2
Thus the result shall be
{'SystemID': {0: '5F891F03'},
'count': {0: '3'}}
How to do this in pandas?
You can use groupby and count, then filter at the end.
(df.groupby('SystemID', as_index=False)['SystemID']
.agg({'count': 'count'})
.query('count > 2'))
SystemID count
0 5F891F03 3
(df.groupby('SystemID', as_index=False)['SystemID']
.agg({'count': 'count'})
.query('count > 2')
.to_dict())
# {'SystemID': {0: '5F891F03'}, 'count': {0: 3}}
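As far as I know, recent pandas versions (1.0 and later) reject a renaming dict passed to agg on a single selected column, so the snippet above may raise a SpecificationError there. A hedged alternative that avoids the renaming dict is to count rows with size() and name the result column explicitly; the variable names counts and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "SystemID": ["95EE8B57", "5F891F03", "5F891F03", "5F891F03"],
    "Day": ["06/08/2018", "05/08/2018", "04/08/2018", "05/08/2018"],
    "AlarmClass-S": [4, 2, 4, 0],
    "AlarmClass-ELM": [0, 0, 0, 2],
})

# GROUP BY SystemID with COUNT(*) AS count
counts = df.groupby("SystemID").size().reset_index(name="count")

# HAVING COUNT(*) > 2
result = counts[counts["count"] > 2].reset_index(drop=True)
print(result.to_dict())
```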

Pandas: populating new columns from other column's values

I have a pandas.dataframe of SEC reports for multiple tickers & periods.
Reproducible dict for DF:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'field': {0: 'taxonomyid',
1: 'cik',
2: 'companyname',
3: 'entityid',
4: 'primaryexchange'},
'value': {0: '50',
1: '0000023217',
2: 'CONAGRA BRANDS INC.',
3: '6976',
4: 'NYSE'},
'ticker': {0: 'CAG', 1: 'CAG', 2: 'CAG', 3: 'CAG', 4: 'CAG'},
'cik': {0: 23217, 1: 23217, 2: 23217, 3: 23217, 4: 23217},
'dcn': {0: '0000023217-18-000009',
1: '0000023217-18-000009',
2: '0000023217-18-000009',
3: '0000023217-18-000009',
4: '0000023217-18-000009'},
'fiscalyear': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4: 2019},
'fiscalquarter': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'receiveddate': {0: '10/2/2018',
1: '10/2/2018',
2: '10/2/2018',
3: '10/2/2018',
4: '10/2/2018'},
'periodenddate': {0: '8/26/2018',
1: '8/26/2018',
2: '8/26/2018',
3: '8/26/2018',
4: '8/26/2018'}}
The column 'field' contains the name of the reporting field (i.e. the indicator), and the column 'value' contains the value for that indicator. The other columns describe the SEC filing (ticker + date + fiscal periods form a unique set of features identifying a filing). There are about 60-70 indicators per filing (the number varies).
With the code below I've managed to create a pivoted dataframe whose columns are the features (say N in total for 1 submission). But the length of this dataframe also equals the number of indicators N, with NaN in the off-diagonal places.
# Adf - Initial dataframe
c = Adf.pivot(columns='field', values='value')
d = Adf[['ticker','cik','fiscalyear','fiscalquarter','dcn','receiveddate','periodenddate']]
e = pd.concat([d, c], sort=False, axis=1)
I want to use the indicator names from 'field' as new columns (going from narrow to wide format). In the end I want to have a dataframe with 1 row for each SEC report.
So the expected output for provided example is a 1-row dataframe with N new columns, where N = number of unique indicators from the 'field' column of initial dataframe:
{'ticker': {0: 'CAG'},
'cik': {0: 23217},
'dcn': {0: '0000023217-18-000009'},
'fiscalyear': {0: 2019},
'fiscalquarter': {0: 1},
'receiveddate': {0: '10/2/2018'},
'periodenddate': {0: '8/26/2018'},
'taxonomyid':{0:'50'},
'cik': {0: '0000023217'},
'companyname':{0: 'CONAGRA BRANDS INC.'},
'entityid':{0:'6976'},
'primaryexchange': {0:'NYSE'},
}
What is the proper way to create such columns, or what is the proper way to clean up the resulting dataframe of its many NaN values?
What worked for me was setting a new index on the DataFrame (with 'field' as its last level) and unstacking it:
aa = Adf.set_index(['ticker','cik', 'fiscalyear','fiscalquarter', 'dcn','receiveddate', 'periodenddate', 'field']).unstack()
aa = aa.reset_index()
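A variation on the same idea that yields flat column names could look like this. It is a sketch under two assumptions: only the 'value' column needs to be spread out, and the 'cik' indicator is renamed to the hypothetical name cik_reported so that it does not clash with the existing 'cik' column when the index is reset.

```python
import pandas as pd

Adf = pd.DataFrame({
    "field": ["taxonomyid", "cik", "companyname", "entityid", "primaryexchange"],
    "value": ["50", "0000023217", "CONAGRA BRANDS INC.", "6976", "NYSE"],
    "ticker": ["CAG"] * 5,
    "cik": [23217] * 5,
    "dcn": ["0000023217-18-000009"] * 5,
    "fiscalyear": [2019] * 5,
    "fiscalquarter": [1] * 5,
    "receiveddate": ["10/2/2018"] * 5,
    "periodenddate": ["8/26/2018"] * 5,
})

# Columns that together identify one filing
keys = ["ticker", "cik", "dcn", "fiscalyear", "fiscalquarter",
        "receiveddate", "periodenddate"]

wide = (
    Adf.set_index(keys + ["field"])["value"]   # Series indexed by filing keys + field
       .unstack("field")                       # one column per indicator
       .rename(columns={"cik": "cik_reported"})  # hypothetical name, avoids the clash
       .rename_axis(None, axis="columns")      # drop the leftover 'field' label
       .reset_index()
)
print(wide)
```

Selecting the 'value' column before unstacking keeps the result's columns single-level, which is why the rename is needed here.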

Python: Pandas Dataframe Column Headers Look Strange After Groupby

I implemented the following groupby statement in my code. The purpose of the code below is to provide the minimum date from the "DTIN" column by unique EVENTID.
df_EVENT5_future_2 = df_EVENT5_future.groupby('EVENTID').agg({'DTIN': [np.min]})
df_EVENT5_future_3 = df_EVENT5_future_2.reset_index()
The output table is as follows:
EVENTID DTIN
amin
A 1/3/2019
B 1/19/2019
C 2/10/2019
I would like the table to output like this. I don't want the amin to be in the column header.
EVENTID DTIN
A 1/3/2019
B 1/19/2019
C 2/10/2019
Any help is greatly appreciated.
This is as per @Wen's suggestion. You don't need to use agg for this. Simply use groupby.min() and set as_index=False:
result = df.groupby('EVENTID', as_index=False)['DTIN'].min()
Please do not upvote or accept this answer, as this is a duplicate.
Example
df = pd.DataFrame({'DTIN': {0: 4, 1: 3, 2: 9, 3: 1, 4: 2, 5: 5, 6: 6, 7: 5},
'EVENTID': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'C', 5: 'B', 6: 'B', 7: 'C'}})
result = df.groupby('EVENTID', as_index=False)['DTIN'].min()
# EVENTID DTIN
# 0 A 3
# 1 B 1
# 2 C 2
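To see where the extra header row comes from, the following sketch contrasts a list of functions in agg (which produces two-level column labels) with a single function name (which keeps the columns flat). It reuses the example data above; the names nested and flat are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "DTIN": [4, 3, 9, 1, 2, 5, 6, 5],
    "EVENTID": ["A", "A", "A", "B", "C", "B", "B", "C"],
})

# A list of functions gives MultiIndex columns: ('DTIN', 'min')
nested = df.groupby("EVENTID").agg({"DTIN": ["min"]})
print(nested.columns.tolist())

# A single function name keeps one level of column labels
flat = df.groupby("EVENTID").agg({"DTIN": "min"}).reset_index()
print(flat.columns.tolist())
```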
