Merge Pandas Dataframe on a column with structured data - python

Scenario: Following up from a previous question on how to read an Excel file from a server into a dataframe (How to read an excel file directly from a Server with Python), I am trying to merge the contents of multiple dataframes (which contain data from Excel worksheets).
Issue: Even after searching for similar issues here in SO, I still was not able to solve the problem.
Format of data (each sheet is read into a dataframe):
Sheet 1 (db1)
Name CUSIP Date Price
A XXX 01/01/2001 100
B AAA 02/05/2005 90
C ZZZ 03/07/2006 95
Sheet 2 (db2)
Ident CUSIP Value Class
123 XXX 0.5 AA
444 AAA 1.3 AB
555 ZZZ 2.8 AC
Wanted output (fnl):
Name CUSIP Date Price Ident Value Class
A XXX 01/01/2001 100 123 0.5 AA
B AAA 02/05/2005 90 444 1.3 AB
C ZZZ 03/07/2006 95 555 2.8 AC
What I already tried: I am trying to use the merge function to match each dataframe, but I am getting an error on the "how" part.
fnl = db1
fnl = fnl.merge(db2, how='outer', on=['CUSIP'])
fnl = fnl.merge(db3, how='outer', on=['CUSIP'])
fnl = fnl.merge(bte, how='outer', on=['CUSIP'])
I also tried pd.concat, but it just puts the dataframes side by side instead of producing a single merged output.
wsframes = [db1 ,db2, db3]
fnl = pd.concat(wsframes, axis=1)
Question: What is the proper way to do this operation?

It seems you need:
from functools import reduce
#many dataframes
dfs = [df1,df2]
df = reduce(lambda x, y: x.merge(y, on='CUSIP', how='outer'), dfs)
print (df)
Name CUSIP Date Price Ident Value Class
0 A XXX 01/01/2001 100 123 0.5 AA
1 B AAA 02/05/2005 90 444 1.3 AB
2 C ZZZ 03/07/2006 95 555 2.8 AC
But the column names in each dataframe have to be different (apart from the merge key, CUSIP here), otherwise the overlapping columns get _x and _y suffixes:
dfs = [df1,df1, df2]
df = reduce(lambda x, y: x.merge(y, on='CUSIP', how='outer'), dfs)
print (df)
Name_x CUSIP Date_x Price_x Name_y Date_y Price_y Ident Value \
0 A XXX 01/01/2001 100 A 01/01/2001 100 123 0.5
1 B AAA 02/05/2005 90 B 02/05/2005 90 444 1.3
2 C ZZZ 03/07/2006 95 C 03/07/2006 95 555 2.8
Class
0 AA
1 AB
2 AC
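If the dataframes share non-key columns you do not need twice, one option (a sketch assuming db1 and db2 from the question) is to drop the overlapping columns from all but the first dataframe before reducing:
from functools import reduce
def merge_on_cusip(dfs, key='CUSIP'):
    # keep overlapping non-key columns only from the first dataframe they appear in
    base, *rest = dfs
    seen = set(base.columns)
    cleaned = [base]
    for d in rest:
        keep = [c for c in d.columns if c == key or c not in seen]
        seen.update(keep)
        cleaned.append(d[keep])
    return reduce(lambda x, y: x.merge(y, on=key, how='outer'), cleaned)
fnl = merge_on_cusip([db1, db2])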

Related

Efficiently count occurrences of a value with groupby and within a date range

I have code that can do this, but I am iterating through each row of the dataframe with iterrows(). It takes quite a long time to process, since it checks over 6M rows, and I want to use vectorisation to speed it up.
I've looked at using pd.Grouper and freq, but have gotten stuck on how to use the 2 dataframes to do this check with that.
Given the 2 dataframes below:
I want to look at all rows in df1 (grouped by 'sid' and 'modtype'):
df1:
sid servid date modtype service
0 123 881 2022-07-05 A1 z
1 456 879 2022-07-02 A2 z
Then find those groups in df2 and count their occurrences within 3 days of the date in df1: one count of how many times the group occurs within 3 days before, and one count of how many times it occurs within 3 days after.
df2:
sid servid date modtype
0 123 1234 2022-07-03 A1
1 123 881 2022-07-05 A1
2 123 65781 2022-07-06 A1
3 123 8552 2022-07-30 A1
4 123 3453 2022-07-04 A2
5 123 5681 2022-07-07 A2
6 456 78 2022-07-01 A1
7 456 26744 2022-05-05 A2
8 456 56166 2022-06-29 A2
9 456 56717 2022-06-30 A2
10 456 879 2022-07-02 A2
11 456 56 2022-07-25 A2
So, essentially, in the sample set which I provide below, my output would end up with:
sid servid date modtype service cnt_3day_before cnt_3day_after
0 123 881 2022-07-05 A1 z 1 1
1 456 879 2022-07-02 A2 z 2 0
Sample set:
import pandas as pd
data1 = {
'sid':['123','456'],
'servid':['881','879'],
'date':['2022-07-05','2022-07-02'],
'modtype':['A1','A2'],
'service':['z','z']}
df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.sort_values(by=['sid','modtype','date'], ascending=[True, True, True]).reset_index(drop=True)
data2 = {
'sid':['123','123','123','123','123','123',
'456','456','456','456','456','456'],
'servid':['1234','3453','881','65781','5681','8552',
'26744','56717','879','56166','56','78'],
'date':['2022-07-03','2022-07-04','2022-07-05','2022-07-06','2022-07-07','2022-07-30',
'2022-05-05','2022-06-30','2022-07-02','2022-06-29','2022-07-25','2022-07-01'],
'modtype':['A1','A2','A1','A1','A2','A1',
'A2','A2','A2','A2','A2','A1']}
df2 = pd.DataFrame(data2)
df2['date'] = pd.to_datetime(df2['date'])
df2 = df2.sort_values(by=['sid','modtype','date'], ascending=[True, True, True]).reset_index(drop=True)
Annotated code
# Merge the dataframes on sid and modtype
keys = ['sid', 'modtype']
s = df2.merge(df1[[*keys, 'date']], on=keys, suffixes=['', '_'])
# Create boolean conditions as per requirements
s['cnt_3day_after'] = s['date'].between(s['date_'], s['date_'] + pd.DateOffset(days=3), inclusive='right')
s['cnt_3day_before'] = s['date'].between(s['date_'] - pd.DateOffset(days=3), s['date_'], inclusive='left' )
# group the boolean conditions by sid and modtype
# and aggregate with sum to count the number of True values
s = s.groupby(keys)[['cnt_3day_after', 'cnt_3day_before']].sum()
# Join the aggregated counts back with df1
df_out = df1.join(s, on=keys)
Result
print(df_out)
sid servid date modtype service cnt_3day_after cnt_3day_before
0 123 881 2022-07-05 A1 z 1 1
1 456 879 2022-07-02 A2 z 0 2
I think faster solutions definitely exist, but you can try this one. It iterates over the "queries" in df1 and for each query computes the number of events in df2 that happened before and after within 3 days. To calculate the number of such events we first set sid and modtype as the index, then select the matching events by index, calculate the time difference between the selected events and the query, and count the ones that happened within +/- 3 days. That step can be optimized with binary search, giving O(log N) instead of O(N) per query when the date column is sorted; a sketch of that idea follows the code below.
df2 = df2.set_index(['sid', 'modtype'])
seconds_in_3days = 3*24*60*60

def before_and_after_3days(query):
    # select all df2 events with the same sid/modtype and compute the offset in seconds
    dates = df2.loc[tuple(query[['sid', 'modtype']]), 'date']
    secs = (dates - query['date']).dt.total_seconds().astype(int)
    before = ((-seconds_in_3days <= secs) & (secs < 0)).sum()
    after = ((0 < secs) & (secs < seconds_in_3days)).sum()
    return before, after

before_after = df1.apply(before_and_after_3days, axis=1)
df1[['cnt_3day_before', 'cnt_3day_after']] = before_after.tolist()
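A possible sketch of that binary-search idea, assuming df2 is still the original frame from the sample set (already sorted by date within each sid/modtype group) and using the windows [date - 3 days, date) and (date, date + 3 days], which reproduce the expected counts:
import numpy as np
import pandas as pd
# pre-extract the sorted date array for every (sid, modtype) group
grouped_dates = {k: g['date'].to_numpy() for k, g in df2.groupby(['sid', 'modtype'])}
def counts_3day(row, window=np.timedelta64(3, 'D')):
    dates = grouped_dates.get((row['sid'], row['modtype']),
                              np.array([], dtype='datetime64[ns]'))
    d = row['date'].to_datetime64()
    # events in [d - 3 days, d): strictly before the query date
    before = np.searchsorted(dates, d, 'left') - np.searchsorted(dates, d - window, 'left')
    # events in (d, d + 3 days]: strictly after the query date
    after = np.searchsorted(dates, d + window, 'right') - np.searchsorted(dates, d, 'right')
    return pd.Series([before, after], index=['cnt_3day_before', 'cnt_3day_after'])
df1[['cnt_3day_before', 'cnt_3day_after']] = df1.apply(counts_3day, axis=1)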
Use a groupby to give a processing function access to each unique sid and modtype sub-frame. In that function, do some date arithmetic to build and sum boolean masks that count the applicable days before and after. Then merge the result back into the original target frame (df1).
Processing Function
def a(x):
    s = x['sid_y'].isna()
    if s.all():
        rc = [0,0]
    else:
        bdate = x.at[(~s).idxmax(),'date']
        td = pd.Timedelta('3D')
        nb_before = ((bdate > x['date']) & (bdate - x['date'] <= td)).sum()
        nb_after = ((bdate < x['date']) & (x['date'] - bdate < td)).sum()
        rc = [nb_before,nb_after]
    return pd.Series(rc, index=['3before','3after'])
Execution
mk = ['sid','modtype']
dfc = df1.merge(df2.merge(df1, how='left', on='date', suffixes=['', '_y'])
.groupby(mk).apply(a).reset_index(), on=mk)
print(dfc)
Result
sid servid date modtype service 3before 3after
0 123 881 2022-07-05 A1 z 1 1
1 456 879 2022-07-02 A2 z 2 0

Pandas group by rows selection based on condition

I need to select the row within pandas group by based on condition.
Condition 1: for a given group R1, R2, W, if the TYPE A row's amount2 equals the TYPE B row's amount2, we need to bring the complete TYPE A row through as output.
Condition 2: for a given group R1, R2, W, if the TYPE A row's amount2 is not equal to the TYPE B row's amount2, we need to sum up amount1 and amount2 of both the TYPE A and TYPE B rows and bring the remaining columns from the TYPE A row as output.
Input dataframe
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 1 B 111 222 D 2.5
2 123 12 2 A 222 222 A 1.5
3 123 12 2 B 333 333 D 2.5
4 123 12 3 A 444 444 D 2.5
5 123 12 3 B 333 333 E 3.5
Expected output
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 2 A 555 555 A 1.5
2 123 12 3 A 777 777 D 2.5
First it is necessary to get all groups where amount2 of the TYPE A row equals amount2 of the TYPE B row: reshape with DataFrame.set_index and Series.unstack, compare the A and B columns with Series.eq, and then use DataFrame.join to broadcast the mask back to the same length as the original:
df1 = df.set_index(['R1','R2','W','TYPE'])['amount2'].unstack()
m = df1['A'].eq(df1['B']).rename('m')
m = df.join(m, on=['R1','R2','W'])['m']
Then for the matched rows (here the first group) filter only the A rows by boolean indexing, chaining the masks with & (bitwise AND):
df2 = df[m & df['TYPE'].eq('A')]
print (df2)
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
Then filter all the other groups with the inverted mask ~ and aggregate with GroupBy.agg: the amount columns with sum and all remaining columns with GroupBy.first:
cols = df.columns.difference(['R1','R2','W','amount1','amount2'])
d1 = dict.fromkeys(['amount1','amount2'], 'sum')
d2 = dict.fromkeys(cols, 'first')
df3 = df[~m].groupby(['R1','R2','W'], as_index=False).agg({**d1, **d2}).assign(TYPE='A')
print (df3)
R1 R2 W amount1 amount2 Exchange Status TYPE
0 123 12 2 555 555 1.5 A A
1 123 12 3 777 777 2.5 D A
Last, join the pieces together with concat and, if necessary, sort with DataFrame.sort_values:
df4 = pd.concat([df2, df3], ignore_index=True, sort=False).sort_values(['R1','R2','W'])
print (df4)
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 2 A 555 555 A 1.5
2 123 12 3 A 777 777 D 2.5
Another solution:
#get the rows for A for each grouping
#assumption is TYPE is already sorted with A always ahead of B
core = ['R1','R2','W']
A = df.groupby(core).first()
#get rows for B for each grouping
B = df.groupby(core).last()
#first condition
cond1 = (A.amount1.eq(B.amount1)) & (A.amount2.eq(B.amount2))
#extract outcome from A to get the first part
part1 = A.loc[cond1]
#second condition
cond2 = A.amount2.ne(B.amount2)
#add the 'amount1' and 'amount 2' columns based on the second condition
part2 = (B.loc[cond2].filter(['amount1','amount2'])
         + A.loc[cond2].filter(['amount1','amount2']))
#merge with A to get the remaining columns
part2 = part2.join(A[['TYPE','Status','Exchange']])
#merge part1 and 2 to get final result
pd.concat([part1,part2]).reset_index()
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 2 A 555 555 A 1.5
2 123 12 3 A 777 777 D 2.5
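If that ordering is not guaranteed in the data, one option (a sketch using the column names from the question) is to sort before grouping so that the TYPE A row always comes first in each group:
# make sure TYPE A precedes TYPE B inside every (R1, R2, W) group
df = df.sort_values(['R1', 'R2', 'W', 'TYPE']).reset_index(drop=True)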

Applying math to columns where rows hold the same value in pandas

I have 2 dataframes which look like this:
df1
A B
AAA 50
BBB 100
CCC 200
df2
C D
CCC 500
AAA 10
EEE 2100
I am trying to output the dataset where column E would be B - D if A = C. Since the A values are not aligned with the C values, I can't seem to find the appropriate method to apply the calculation and compare the right numbers.
There are also values which are not shared between the two datasets; in those cases I want to add the text value 'Not found', so that the output would look like this:
output
A B C D E
AAA 50 AAA 10 B-D
BBB 100 Not found Not found Not found
CCC 200 CCC 500 B-D
Not found Not found EEE 2100 Not found
Thank you for your suggestions.
Use an outer join with the left_on and right_on parameters of DataFrame.merge and then subtract the columns; to keep the subtraction numeric it is better to leave missing values as NaN:
df = (df1.merge(df2, left_on='A', right_on='C', how='outer')
.fillna({'A':'Not found', 'C':'Not found'})
.assign(E = lambda x: x.B - x.D))
print (df)
A B C D E
0 AAA 50.0 AAA 10.0 40.0
1 BBB 100.0 Not found NaN NaN
2 CCC 200.0 CCC 500.0 -300.0
3 Not found NaN EEE 2100.0 NaN
Last, it is possible to replace all missing values, but then the numeric columns contain a mix of strings and numbers, so further processing such as arithmetic operations becomes problematic:
df = (df1.merge(df2, left_on='A', right_on='C', how='outer')
.assign(E = lambda x: x.B - x.D)
.fillna('Not found'))
print (df)
A B C D E
0 AAA 50 AAA 10 40
1 BBB 100 Not found Not found Not found
2 CCC 200 CCC 500 -300
3 Not found Not found EEE 2100 Not found

add values of two columns from 2 different dataframes pandas

I want to add 2 columns of 2 different dataframes based on the condition that the name is the same:
import pandas as pd
df1 = pd.DataFrame([("Apple",2),("Litchi",4),("Orange",6)], columns=['a','b'])
df2 = pd.DataFrame([("Apple",200),("Orange",400),("Litchi",600)], columns=['a','c'])
Now I want to add columns b and c if the name in a is the same.
I tried df1['b+c'] = df1['b'] + df2['c'], but it simply adds columns b and c by position, so the result comes out as:
a b b+c
0 Apple 2 202
1 Litchi 4 404
2 Orange 6 606
but I want:
a b+c
0 Apple 202
1 Litchi 604
2 Orange 406
I guess I have to use isin, but I am not getting how.
Columns b and c are aligned by index values in the sum operation, so it is necessary to create the index from column a with DataFrame.set_index:
s1 = df1.set_index('a')['b']
s2 = df2.set_index('a')['c']
df1 = (s1+s2).reset_index(name='b+c')
print (df1)
a b+c
0 Apple 202
1 Litchi 604
2 Orange 406
EDIT: If you need the original value for non-matched values, use Series.add with the parameter fill_value=0:
df2 = pd.DataFrame([("Apple",200),("Apple",400),("Litchi",600)], columns=['a','c'])
print (df2)
a c
0 Apple 200
1 Apple 400
2 Litchi 600
s1 = df1.set_index('a')['b']
s2 = df2.set_index('a')['c']
df1 = s1.add(s2, fill_value=0).reset_index(name='b+c')
print (df1)
a b+c
0 Apple 202.0
1 Apple 402.0
2 Litchi 604.0
3 Orange 6.0

Difference between rows in pandas

I have data in a CSV which I am reading with pandas. The data is of this format:
name company income saving
A AA 100 10
B AA 200 20
I wish to create a new row with name A, company AA, and income and saving being the difference between A and B.
Expected output:
name company income saving
A AA -100 -10
I believe you need DataFrame.diff with -1:
print (df)
name company income saving
0 A AA 100 10
1 B AA 200 20
2 C AA 300 40
#for select columns by names
df1 = df[['name','company']].join(df[['income','saving']].diff(-1))
#for select columns by positions
#df1 = df.iloc[:, :2].join(df.iloc[:, 2:].diff(-1))
print (df1)
name company income saving
0 A AA -100.0 -10.0
1 B AA -100.0 -20.0
2 C AA NaN NaN
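If only the single row from the expected output is needed, a minimal follow-up (assuming the first row is the target, as in the question's example) is:
# keep only the first row to match the expected single-row output
out = df1.head(1)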
