Compare two dataframes, one column, and add certain values on match? - python

So I have two dataframes
eqdf
symbol qty
0 DABIND 1
1 INFTEC 6
2 DISHTV 8
3 HINDAL 40
4 NATMIN 5
5 POWGRI 40
6 CHEPET 6
premdf
share strike lprice premperc d_strike
0 HINDAL 250.0 237.90 1.975620 5.086171
1 RELIND 1280.0 1254.30 1.642350 2.048952
2 POWGRI 205.0 201.15 1.118568 1.913995
I want to compare columns premdf['share'] and eqdf['symbol'] and if there is a match premperc,d_strike,strike value is to be added to the end of the eqdf row in which there is a match.
I have tried
eqdf.loc[eqdf['symbol']==premdf['share'],eqdf['premperc'] == premdf['premperc']]
I keep getting errors
ValueError: Can only compare identically-labeled Series objects
Expected Output:
eqdf
symbol qty premperc d_strike strike
0 DABIND 1 NaN NaN NaN
1 INFTEC 6 NaN NaN NaN
2 DISHTV 8 NaN NaN NaN
3 HINDAL 40 1.975620 5.086171 250.0
4 NATMIN 5 NaN NaN NaN
5 POWGRI 40 1.118568 1.913995 205.0
6 CHEPET 6 NaN NaN NaN
What is the correct way to do this?
Thanks

rename and merge
eqdf.merge(premdf.rename(columns={'share': 'symbol'}), 'left')
symbol qty strike lprice premperc d_strike
0 DABIND 1 NaN NaN NaN NaN
1 INFTEC 6 NaN NaN NaN NaN
2 DISHTV 8 NaN NaN NaN NaN
3 HINDAL 40 250.0 237.90 1.975620 5.086171
4 NATMIN 5 NaN NaN NaN NaN
5 POWGRI 40 205.0 201.15 1.118568 1.913995
6 CHEPET 6 NaN NaN NaN NaN

Related

merge two rows in one row and convert to NA

Dataframe:
0 1 2 3 4 slicing
0 NaN Object 1 NaN NaN 0
6 NaN Object 2 NaN NaN 6
12 NaN Object 3 NaN NaN 12
18 NaN Object 4 NaN NaN 18
23 NaN Object 5 NaN NaN 23
desired output:
0 1 2 3 4 slicing
0 NaN Object1 NaN NaN NaN 0
6 NaN Object2 NaN NaN NaN 6
12 NaN Object3 NaN NaN NaN 12
18 NaN Object4 NaN NaN NaN 18
23 NaN Object5 NAN NaN NaN 23
library pandas
iterate through each row in the dataset (since there are only NA's and str'Object' with its corresponding str'1-10' number)
replace str numbers with Na and concatenate data in the same row
Code for now:
df= df[df.apply(lambda row: row.astype(str).str.contains('Desk').any().df[row]+df[row], axis=1)]
Index 0 1 2 3 4
0 NaN Desk 1 NaN NaN
5 NaN Desk 2 NaN NaN
10 NaN Desk 3 NaN NaN
15 NaN Desk 4 NaN NaN
20 NaN Desk 5 NaN NaN
Here's what I did:
Using the following dataframe as an example:
0 1 2 3 4 slicing
index
0 NaN Object 1 NaN NaN 0
6 NaN Object 2 NaN A 6
12 NaN Object 3 NaN NaN 12
18 NaN NaN 4 NaN NaN 18
23 Stuff Object NaN 5 NaN 23
I perform 4 steps in the below 4 lines of code, when 'Object' exists in column 1: 1) replace nans with nothing; 2) set everything to string type; 3) join the row, to column 1, 4) replace all the other columns with nan
df.loc[df['1']=='Object',['0', '2', '3','4']] = df.loc[df['1']=='Object',['0', '2', '3','4']].fillna('')
df.loc[df['1']=='Object',['0','1', '2', '3','4']] = df.loc[df['1']=='Object',['0','1', '2', '3','4']].astype(str)
df.loc[df['1']=='Object', ['1','0', '2', '3','4']] = df.loc[df['1']=='Object', ['1', '0', '2', '3','4']].agg(''.join, axis=1)
df.loc[df['1'].str.contains('Object', na = False), ['0', '2', '3','4']] = np.nan
df
0 1 2 3 4 slicing
index
0 NaN Object1 NaN NaN NaN 0
6 NaN Object2A NaN NaN NaN 6
12 NaN Object3 NaN NaN NaN 12
18 NaN NaN 4 NaN NaN 18
23 NaN ObjectStuff5 NaN NaN NaN 23
If I understand what you are trying to achieve, you should really try to wok with columns instead of iterating. It is way faster. You can try something like this :
import numpy as np
columns = df.columns.tolist()
ix = df[df[columns[1]].str.contains('Object')].index
df.loc[ix:columns[1]] = df.loc[ix:columns[1]]+df.loc[ix:columns[2]]
df.loc[ix:columns[2]] = np.nan

Update multiple columns per row with loop through pandas dataframe

I've reviewed several posts on here about better ways to loop through dataframes, but can't seem to figure out how to apply them to my specific situation.
I have a dataframe of about 2M rows and I need to calculate six statistics for each row, one per column. There are 3 columns so 18 total. However, the issue is that I need to update those stats using a sample of the dataframe so that the mean/median, etc is different per row.
Here's what I have so far:
r = 0
for i in imputed_df.iterrows():
t = imputed_df.sample(n=10)
for (columnName) in cols:
imputed_df.loc[r,columnName + '_mean'] = t[columnName].mean()
imputed_df.loc[r,columnName + '_var'] = t[columnName].var()
imputed_df.loc[r,columnName + '_std'] = t[columnName].std()
imputed_df.loc[r,columnName + '_skew'] = t[columnName].skew()
imputed_df.loc[r,columnName + '_kurt'] = t[columnName].kurt()
imputed_df.loc[r,columnName + '_med'] = t[columnName].median()
But this has been running for two days without finishing. I tried to take a subset of 2000 rows from the original dataframe and even that one has been running for hours.
Is there a better way to do this?
EDIT: Added a sample dataset of what it should look like. each suffixed column should have the calculated value of the subset of 10 rows.
timestamp activityID w2 w3 w4
0 41.21 1.0 -1.34587 9.57245 2.83571
1 41.22 1.0 -1.76211 10.63590 2.59496
2 41.23 1.0 -2.45116 11.09340 2.23671
3 41.24 1.0 -2.42381 11.88590 1.77260
4 41.25 1.0 -2.31581 12.45170 1.50289
The problem is that you do the operation for each column using unnecessary loops.
We could use
DataFrame.agg with DataFrame.unstack and Series.set_axis to get correct names of columns.
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, (10, 100))).add_prefix('col')
new_serie = df.agg(['sum', 'mean',
'var', 'std',
'skew', 'kurt', 'median']).unstack()
new_df = pd.concat([df, new_serie.set_axis([f'{x}_{y}'
for x, y in new_serie.index])
.to_frame().T], axis=1)
# if new_df already exist:
#new_df.loc[0, :] = new_serie.set_axis([f'{x}_{y}' for x, y in new_serie.index])
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9 ... \
0 8 7 6 7 6 5 8 7 8 4 ...
1 8 1 8 7 0 8 8 4 6 1 ...
2 5 6 3 5 4 9 3 0 2 5 ...
3 3 3 3 3 5 4 5 1 3 5 ...
4 7 9 4 5 6 7 0 3 4 6 ...
5 0 5 2 0 8 0 3 7 6 5 ...
6 7 0 1 4 8 9 4 9 2 9 ...
7 0 6 1 0 6 1 3 0 3 4 ...
8 3 6 1 8 3 0 7 6 8 6 ...
9 2 5 8 5 8 4 9 1 9 9 ...
col98_skew col98_kurt col98_median col99_sum col99_mean col99_var \
0 0.456435 -0.939607 3.0 39.0 3.9 6.322222
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
col99_std col99_skew col99_kurt col99_median
0 2.514403 0.402601 1.099343 4.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN

How To Map Column Values where two others match? "Reindexing only valid with uniquely valued Index objects"?

I have one DataFrame, df, I have four columns shown below:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 NaN
3 110 2 150
5 120 3 NaN
7 140 4 160
9 150 5 190
NaN NaN 6 130
NaN NaN 7 NaN
NaN NaN 8 200
NaN NaN 9 90
NaN NaN 10 NaN
I want instead to map values from df.IDP1Number to IDP2Number using IDP1 to IDP2. I want to replace existing values if IDP1 and IDP2 both exist with IDP1Number. Otherwise leave values in IDP2Number alone.
The error message that appears reads, " Reindexing only valid with uniquely value Index objects
The Dataframe below is what I wish to have:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 100
3 110 2 150
5 120 3 110
7 140 4 160
9 150 5 120
NaN NaN 6 130
NaN NaN 7 140
NaN NaN 8 200
NaN NaN 9 150
NaN NaN 10 NaN
Here's a way to do:
# filter the data and create a mapping dict
maps = df.query("IDP1.notna()")[['IDP1', 'IDP1Number']].set_index('IDP1')['IDP1Number'].to_dict()
# create new column using ifelse condition
df['IDP2Number'] = df.apply(lambda x: maps.get(x['IDP2'], None) if (pd.isna(x['IDP2Number']) or x['IDP2'] in maps) else x['IDP2Number'], axis=1)
print(df)
IDP1 IDP1Number IDP2 IDP2Number
0 1.0 100.0 1 100.0
1 3.0 110.0 2 150.0
2 5.0 120.0 3 110.0
3 7.0 140.0 4 160.0
4 9.0 150.0 5 120.0
5 NaN NaN 6 130.0
6 NaN NaN 7 140.0
7 NaN NaN 8 200.0
8 NaN NaN 9 150.0
9 NaN NaN 10 NaN

Find first N non null values in each row

If I have a pandas dataframe like this:
NaN NaN NaN 0 5 7 2 2 3 7 8
NaN NaN 0 1 2 3 5 8 8 NaN 4
NaN 0 3 6 9 NaN 4 6 1 5 1
NaN NaN 0 1 2 3 5 8 8 NaN 2
NaN NaN NaN 0 5 7 2 2 3 7 8
NaN NaN 0 1 2 3 5 8 8 NaN 4
How do I only keep the first five non null values in each row and set the rest to nan such that I get a dataframe that looks like this:
NaN NaN NaN 0 5 7 2 2 NaN NaN NaN
NaN NaN 0 1 2 3 5 NaN NaN NaN NaN
NaN 0 3 6 9 NaN 4 NaN NaN NaN NaN
NaN NaN 0 1 2 3 5 NaN NaN NaN NaN
NaN NaN NaN 0 5 7 2 2 NaN NaN Nan
NaN NaN 0 1 2 3 5 NaN NaN NaN NaN
You can use:
df.mask(df.notna().cumsum(axis=1).gt(5))

Problem with merging Pandas Dataframes with Columns that don't line up

I am attempting to transpose and merge two pandas dataframes, one containing accounts, the segment which they received their deposit, their deposit information, and what day they received the deposit; the other has the accounts, and withdrawal information. The issue is, for indexing purposes, the segment information from one dataframe should line up with the information of the other, regardless of there being a withdrawal or not.
Notes:
There will always be an account for every person
There will not always be a withdrawal for every person
The accounts and data for the withdrawal dataframe only exist if a withdrawal occurs
Account Dataframe Code
accounts = DataFrame({'person':[1,1,1,1,1,2,2,2,2,2],
'segment':[1,2,3,4,5,1,2,3,4,5],
'date_received':[10,20,30,40,50,11,21,31,41,51],
'amount_received':[1,2,3,4,5,6,7,8,9,10]})
accounts = accounts.pivot_table(index=["person"], columns=["segment"])
Account Dataframe
amount_received date_received
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
Withdrawal Dataframe Code
withdrawals = DataFrame({'person':[1,1,1,2,2],
'withdrawal_segment':[1,1,5,2,3],
'withdraw_date':[1,2,3,4,5],
'withdraw_amount':[10,20,30,40,50]})
withdrawals = withdrawals.reset_index().pivot_table(index = ['index', 'person'], columns = ['withdrawal_segment'])
Since there can only be unique segments for a person it is required that my column only consists of a unique number once, while still holding all of the data, which is why this dataframe looks so much different.
Withdrawal Dataframe
withdraw_date withdraw_amount
withdrawal_segment 1 2 3 5 1 2 3 5
index person
0 1 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2.0 NaN NaN NaN 20.0 NaN NaN NaN
2 1 NaN NaN NaN 3.0 NaN NaN NaN 30.0
3 2 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
4 2 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
Merge
merge = accounts.merge(withdrawals, on='person', how='left')
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 5 1 2 3 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN 20.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN 3.0 NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
The problem with the merged dataframe is that segments from the withdrawal dataframe aren't lined up with the accounts segments.
The desired dataframe should look something like:
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN NaN 10.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN NaN 20.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN NaN 3.0 NaN NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN NaN 40.0 NaN NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN NaN 50.0 NaN NaN
My problem is that I can't seem to merge across both person and segments. I've thought about inserting a row and column, but because I don't know which segments are and aren't going to have a withdrawal this gets difficult. Is it possible to merge the dataframes so that they line up across both people and segments? Thanks!
Method 1 , using reindex
withdrawals=withdrawals.reindex(pd.MultiIndex.from_product([withdrawals.columns.levels[0],accounts.columns.levels[1]]),axis=1)
merge = accounts.merge(withdrawals, on='person', how='left')
merge
Out[79]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN
Method 2 , using unstack and stack
merge = accounts.merge(withdrawals, on='person', how='left')
merge.stack(dropna=False).unstack()
Out[82]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN

Categories

Resources