Problem merging pandas DataFrames with columns that don't line up

I am attempting to transpose and merge two pandas dataframes. One contains accounts, the segment in which they received their deposit, their deposit information, and the day they received the deposit; the other contains the accounts and their withdrawal information. The issue is that, for indexing purposes, the segment information from one dataframe should line up with that of the other, regardless of whether there was a withdrawal.
Notes:
There will always be an account for every person
There will not always be a withdrawal for every person
The accounts and data for the withdrawal dataframe only exist if a withdrawal occurs
Account Dataframe Code
import pandas as pd
from pandas import DataFrame
accounts = DataFrame({'person':[1,1,1,1,1,2,2,2,2,2],
'segment':[1,2,3,4,5,1,2,3,4,5],
'date_received':[10,20,30,40,50,11,21,31,41,51],
'amount_received':[1,2,3,4,5,6,7,8,9,10]})
accounts = accounts.pivot_table(index=["person"], columns=["segment"])
Account Dataframe
amount_received date_received
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
Withdrawal Dataframe Code
withdrawals = DataFrame({'person':[1,1,1,2,2],
'withdrawal_segment':[1,1,5,2,3],
'withdraw_date':[1,2,3,4,5],
'withdraw_amount':[10,20,30,40,50]})
withdrawals = withdrawals.reset_index().pivot_table(index = ['index', 'person'], columns = ['withdrawal_segment'])
Each segment number may appear only once in the columns, yet all of the data must be kept, so every withdrawal gets its own row. That is why this dataframe looks so different from the accounts one.
Withdrawal Dataframe
withdraw_date withdraw_amount
withdrawal_segment 1 2 3 5 1 2 3 5
index person
0 1 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2.0 NaN NaN NaN 20.0 NaN NaN NaN
2 1 NaN NaN NaN 3.0 NaN NaN NaN 30.0
3 2 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
4 2 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
Merge
merge = accounts.merge(withdrawals, on='person', how='left')
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 5 1 2 3 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN 20.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN 3.0 NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
The problem with the merged dataframe is that segments from the withdrawal dataframe aren't lined up with the accounts segments.
The desired dataframe should look something like:
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN NaN 10.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN NaN 20.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN NaN 3.0 NaN NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN NaN 40.0 NaN NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN NaN 50.0 NaN NaN
My problem is that I can't seem to merge across both person and segments. I've thought about inserting a row and column, but because I don't know which segments are and aren't going to have a withdrawal this gets difficult. Is it possible to merge the dataframes so that they line up across both people and segments? Thanks!

Method 1, using reindex
withdrawals=withdrawals.reindex(pd.MultiIndex.from_product([withdrawals.columns.levels[0],accounts.columns.levels[1]]),axis=1)
merge = accounts.merge(withdrawals, on='person', how='left')
merge
Out[79]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN
Method 2, using stack and unstack
merge = accounts.merge(withdrawals, on='person', how='left')
merge.stack(dropna=False).unstack()
Out[82]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN
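For reference, Method 1 can be run end to end on the question's sample data. This is a minimal sketch rather than the answerer's exact code: it assumes a recent pandas, and instead of `merge(on='person')` it drops the helper `index` level and calls `join`, which aligns directly on the `person` index. The names `full_cols` and `merged` are illustrative.

```python
import pandas as pd

accounts = pd.DataFrame({
    'person': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'segment': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'date_received': [10, 20, 30, 40, 50, 11, 21, 31, 41, 51],
    'amount_received': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}).pivot_table(index='person', columns='segment')

withdrawals = pd.DataFrame({
    'person': [1, 1, 1, 2, 2],
    'withdrawal_segment': [1, 1, 5, 2, 3],
    'withdraw_date': [1, 2, 3, 4, 5],
    'withdraw_amount': [10, 20, 30, 40, 50],
})
# Pivot on the row number as well, so each withdrawal keeps its own row.
withdrawals = withdrawals.reset_index().pivot_table(
    index=['index', 'person'], columns='withdrawal_segment')

# Every (value name, segment) pair, using the segments seen in accounts.
full_cols = pd.MultiIndex.from_product(
    [withdrawals.columns.levels[0], accounts.columns.levels[1]])
# Segments with no withdrawal (here segment 4) become all-NaN columns.
withdrawals = withdrawals.reindex(columns=full_cols)

# Assumption: join on the person index instead of merge(on='person').
merged = accounts.join(withdrawals.droplevel('index'), how='left')
print(merged['withdraw_amount'])
```

The key step is the `reindex` over the full product of columns: once both frames carry the same segment labels, the join itself needs no special handling.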

Related

how to detect when a price higher than previous high

I am trying to detect when the price crosses above the previous high. I can find the highs, but when I compare them with the current price, every row comes out as 1.
My code:
peak = df[(df['price'] > df['price'].shift(-1)) & (df['price'] > df['price'].shift(1))]
df['peak'] = peak
df['breakout'] = df['price'] > df['peak']
print(df)
out:
    price   peak  breakout
1       2    NaN         1
2       2    NaN         1
3       4    NaN         1
4       5    NaN         1
5       6    6.0         1
6       5    NaN         1
7       4    NaN         1
8       3    NaN         1
9      12   12.0         1
10     10    NaN         1
11     50    NaN         1
12    100    NaN         1
13    110    110         1
14     84    NaN         1
expect:
    price   peak  high  breakout
1       2    NaN     0         0
2       2    NaN     0         0
3       4    NaN     0         0
4       5    NaN     0         0
5       6    6.0     1         1
6       5    NaN     0         0
7       4    NaN     0         0
8       3    NaN     0         0
9      12   12.0     1         1
10     10    NaN     0         0
11     50    NaN     0         1
12    100    NaN     0         1
13    110    110     1         1
14     84    NaN     0         0
with fillna :
price peak look breakout
0 2 NaN NaN False
1 4 NaN NaN False
2 5 NaN NaN False
3 6 6.0 6.0 False
4 5 NaN 6.0 False
5 4 NaN 6.0 False
6 3 NaN 6.0 False
7 12 12.0 12.0 False ----> this should be True because it is higher than 6 and is also the high relative to shift(-1) and shift(1)
8 10 NaN 12.0 False
9 50 NaN 12.0 True
10 100 100.0 100.0 False
11 40 NaN 100.0 False
12 45 45.0 45.0 False
13 30 NaN 45.0 False
14 200 NaN 45.0 True
Try with pandas.DataFrame.fillna:
df["breakout"] = df["price"] >= df["peak"].fillna(method = "ffill")
If you want it with 1s and 0s add the line:
df["breakout"] = df["breakout"].replace([True, False],[1,0])
Note that df["peak"].fillna(method = "ffill") returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 6.0
5 6.0
6 6.0
7 6.0
8 12.0
9 12.0
10 12.0
11 12.0
12 110.0
13 110.0
Name: peak, dtype: float64
So you can compare it easily with the price column.
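The forward-fill comparison above can be checked against the question's prices. This is a minimal sketch, assuming the 14 prices from the expected table. It uses `Series.ffill()` (the `method=` argument of `fillna` is deprecated in recent pandas releases) and `astype(int)` in place of `replace` to get 1s and 0s. The name `is_peak` is illustrative.

```python
import pandas as pd

# Prices taken from the question's expected table (rows 1 to 14).
df = pd.DataFrame(
    {'price': [2, 2, 4, 5, 6, 5, 4, 3, 12, 10, 50, 100, 110, 84]},
    index=range(1, 15),
)

# A peak is strictly higher than both of its neighbours.
is_peak = (df['price'] > df['price'].shift(1)) & (df['price'] > df['price'].shift(-1))
df['peak'] = df['price'].where(is_peak)

# Carry the most recent peak forward and compare each price with it.
# Rows before the first peak compare against NaN and come out as 0.
df['breakout'] = (df['price'] >= df['peak'].ffill()).astype(int)
print(df)
```

On this data the `breakout` column is 1 at rows 5, 9, 11, 12 and 13, which is the pattern in the question's expected output.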

Pandas: grab positions in dataframe which indexes are listed in another dataframe

Suppose that I have 2 dataframes, with indexes populated so that elements in columns are unique, because in real data they are:
vals = pd.DataFrame(np.random.randint(0,10,(10, 3)), columns=list('ABC'))
indexes = pd.DataFrame(np.argsort(np.random.randint(0,10,(10, 3)), axis=0)[:5], columns=list('ABC'))
>>> vals
A B C
0 64 20 48
1 28 60 81
2 5 73 77
3 74 66 86
4 41 39 21
5 65 37 98
6 10 20 73
7 6 70 3
8 36 29 28
9 43 13 12
>>> indexes
A B C
0 4 2 3
1 3 3 8
2 5 1 7
3 9 8 9
4 2 4 0
I would like to retain only those values in vals whose indexes are listed in indexes. I don't care about row integrity or NAs, as I'll use the columns as Series later.
This is what I came up with:
vals_indexes = pd.DataFrame()
for i in range(vals.shape[1]):
    vals_indexes = pd.concat([vals_indexes, vals.iloc[[e for e in indexes.iloc[:, i] if e in vals.index], i]], axis=1)
>>> vals_indexes
A B C
0 NaN NaN 48.0
1 NaN 60.0 NaN
2 5.0 73.0 NaN
3 74.0 66.0 86.0
4 41.0 39.0 NaN
5 65.0 NaN NaN
7 NaN NaN 3.0
8 NaN 29.0 28.0
9 43.0 NaN 12.0
Which is a bit ugly, but works for me. Question: is there a more effective way to do this?
Use .loc within a loop to replace non-existing indexes with NaN:
for i in vals.columns:
    vals.loc[vals[i].isin(list(indexes[i].unique())),i]=np.nan
print(vals)
A B C
0 NaN 2.0 NaN
1 NaN 5.0 NaN
2 2.0 3.0 NaN
3 NaN NaN NaN
4 NaN NaN 6.0
5 9.0 NaN NaN
6 NaN NaN 4.0
7 NaN 7.0 NaN
8 2.0 NaN NaN
9 NaN NaN NaN
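As a cross-check, here is a hedged sketch of a mask-based variant run on the question's printed data. It tests the row labels (`vals.index`) against each column of `indexes`, not the cell values, and keeps the matching cells. The names `keep` and `result` are illustrative, and the frames are typed in from the question rather than generated randomly.

```python
import pandas as pd

vals = pd.DataFrame({
    'A': [64, 28, 5, 74, 41, 65, 10, 6, 36, 43],
    'B': [20, 60, 73, 66, 39, 37, 20, 70, 29, 13],
    'C': [48, 81, 77, 86, 21, 98, 73, 3, 28, 12],
})
indexes = pd.DataFrame({
    'A': [4, 3, 5, 9, 2],
    'B': [2, 3, 1, 8, 4],
    'C': [3, 8, 7, 9, 0],
})

# Boolean frame: True where the row label is listed for that column.
keep = pd.DataFrame(
    {col: vals.index.isin(indexes[col]) for col in vals.columns},
    index=vals.index,
)

# Blank out everything else, then drop rows that kept nothing.
result = vals.where(keep).dropna(how='all')
print(result)
```

On this data the result has the same rows and values as the `vals_indexes` frame printed in the question (row 6 is dropped because no column lists it).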

Find first N non null values in each row

If I have a pandas dataframe like this:
NaN NaN NaN 0 5 7 2 2 3 7 8
NaN NaN 0 1 2 3 5 8 8 NaN 4
NaN 0 3 6 9 NaN 4 6 1 5 1
NaN NaN 0 1 2 3 5 8 8 NaN 2
NaN NaN NaN 0 5 7 2 2 3 7 8
NaN NaN 0 1 2 3 5 8 8 NaN 4
How do I only keep the first five non null values in each row and set the rest to nan such that I get a dataframe that looks like this:
NaN NaN NaN 0 5 7 2 2 NaN NaN NaN
NaN NaN 0 1 2 3 5 NaN NaN NaN NaN
NaN 0 3 6 9 NaN 4 NaN NaN NaN NaN
NaN NaN 0 1 2 3 5 NaN NaN NaN NaN
NaN NaN NaN 0 5 7 2 2 NaN NaN NaN
NaN NaN 0 1 2 3 5 NaN NaN NaN NaN
You can use:
df.mask(df.notna().cumsum(axis=1).gt(5))
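To see why this works, a small sketch on the first three rows of the question's data may help. The running count of non-null cells along each row exceeds 5 exactly from the sixth non-null value onward, so `mask` blanks those cells and leaves earlier ones, including interior NaNs, untouched. The variable `n` is an illustrative parameter, not part of the original answer.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([
    [nan, nan, nan, 0, 5, 7, 2, 2, 3, 7, 8],
    [nan, nan, 0, 1, 2, 3, 5, 8, 8, nan, 4],
    [nan, 0, 3, 6, 9, nan, 4, 6, 1, 5, 1],
])

n = 5  # how many non-null values to keep per row

# Running count of non-null cells, left to right within each row.
seen = df.notna().cumsum(axis=1)

# Cells whose running count is above n are replaced with NaN.
result = df.mask(seen.gt(n))
print(result)
```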

Compare two dataframes, one column, and add certain values on match?

So I have two dataframes
eqdf
symbol qty
0 DABIND 1
1 INFTEC 6
2 DISHTV 8
3 HINDAL 40
4 NATMIN 5
5 POWGRI 40
6 CHEPET 6
premdf
share strike lprice premperc d_strike
0 HINDAL 250.0 237.90 1.975620 5.086171
1 RELIND 1280.0 1254.30 1.642350 2.048952
2 POWGRI 205.0 201.15 1.118568 1.913995
I want to compare premdf['share'] with eqdf['symbol'] and, where they match, append the premperc, d_strike and strike values to the end of the matching eqdf row.
I have tried
eqdf.loc[eqdf['symbol']==premdf['share'],eqdf['premperc'] == premdf['premperc']]
I keep getting errors
ValueError: Can only compare identically-labeled Series objects
Expected Output:
eqdf
symbol qty premperc d_strike strike
0 DABIND 1 NaN NaN NaN
1 INFTEC 6 NaN NaN NaN
2 DISHTV 8 NaN NaN NaN
3 HINDAL 40 1.975620 5.086171 250.0
4 NATMIN 5 NaN NaN NaN
5 POWGRI 40 1.118568 1.913995 205.0
6 CHEPET 6 NaN NaN NaN
What is the correct way to do this?
Thanks
rename and merge
eqdf.merge(premdf.rename(columns={'share': 'symbol'}), 'left')
symbol qty strike lprice premperc d_strike
0 DABIND 1 NaN NaN NaN NaN
1 INFTEC 6 NaN NaN NaN NaN
2 DISHTV 8 NaN NaN NaN NaN
3 HINDAL 40 250.0 237.90 1.975620 5.086171
4 NATMIN 5 NaN NaN NaN NaN
5 POWGRI 40 205.0 201.15 1.118568 1.913995
6 CHEPET 6 NaN NaN NaN NaN
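If only the three requested columns are wanted, in the order shown in the expected output, one option is to select them before merging. The following is a sketch under that assumption, using `left_on`/`right_on` instead of renaming; the names `cols` and `out` are illustrative.

```python
import pandas as pd

eqdf = pd.DataFrame({
    'symbol': ['DABIND', 'INFTEC', 'DISHTV', 'HINDAL', 'NATMIN', 'POWGRI', 'CHEPET'],
    'qty': [1, 6, 8, 40, 5, 40, 6],
})
premdf = pd.DataFrame({
    'share': ['HINDAL', 'RELIND', 'POWGRI'],
    'strike': [250.0, 1280.0, 205.0],
    'lprice': [237.90, 1254.30, 201.15],
    'premperc': [1.975620, 1.642350, 1.118568],
    'd_strike': [5.086171, 2.048952, 1.913995],
})

# Keep the key plus the three columns to append, in the desired order.
cols = ['share', 'premperc', 'd_strike', 'strike']

# A left merge keeps every eqdf row; unmatched rows get NaN.
out = (
    eqdf.merge(premdf[cols], how='left', left_on='symbol', right_on='share')
        .drop(columns='share')
)
print(out)
```

A left merge preserves the row order of `eqdf`, so `RELIND`, which has no counterpart there, simply does not appear.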

How to use pandas rolling_sum with sliding windows

I would like to calculate a sum (or another aggregate) over a sliding window.
For example, I would like to sum the values where A is True over the last 10 data points counting back from the current position.
Is there a way to do this?
The attempt below did not return the values I expect.
I have put the expected value and its calculation on the right-hand side.
Thank you
In [63]: dt['As'] = pd.rolling_sum( dt.Val[ dt.A == True ], window=10, min_periods=1)
In [64]: dt
Out[64]:
Val A B As
0 1 NaN NaN NaN
1 1 NaN NaN NaN
2 1 NaN NaN NaN
3 1 NaN NaN NaN
4 6 NaN True NaN
5 1 NaN NaN NaN
6 2 True NaN 1 pos 6 = 2
7 1 NaN NaN NaN
8 3 NaN NaN NaN
9 9 True NaN 2 pos 9 + pos 6 = 11
10 1 NaN NaN NaN
11 9 NaN NaN NaN
12 1 NaN NaN NaN
13 1 NaN True NaN
14 1 NaN NaN NaN
15 2 True NaN 3 pos 15 + pos 9 + pos 6 = 13
16 1 NaN NaN NaN
17 8 NaN NaN NaN
18 1 NaN NaN NaN
19 5 True NaN 4 pos 19 + pos 15 = 7
20 1 NaN NaN NaN
21 1 NaN NaN NaN
22 2 NaN NaN NaN
23 1 NaN NaN NaN
24 7 NaN True NaN
25 1 NaN NaN NaN
26 1 NaN NaN NaN
27 1 NaN NaN NaN
28 3 True NaN 5 pos 28 + pos 19 = 8
This almost does it:
import numpy as np
import pandas as pd
dt = pd.read_csv('test2.csv')
dt['AVal'] = dt.Val[dt.A == True]
dt['ASum'] = pd.rolling_sum( dt.AVal, window=10, min_periods=1)
dt['ACnt'] = pd.rolling_count( dt.AVal, window=10)
In [4]: dt
Out[4]:
Val A B AVal ASum ACnt
0 1 NaN NaN NaN NaN 0
1 1 NaN NaN NaN NaN 0
2 1 NaN NaN NaN NaN 0
3 1 NaN NaN NaN NaN 0
4 6 NaN True NaN NaN 0
5 1 NaN NaN NaN NaN 0
6 2 True NaN 2 2 1
7 1 NaN NaN NaN 2 1
8 3 NaN NaN NaN 2 1
9 9 True NaN 9 11 2
10 1 NaN NaN NaN 11 2
11 9 NaN NaN NaN 11 2
12 1 NaN NaN NaN 11 2
13 1 NaN True NaN 11 2
14 1 NaN NaN NaN 11 2
15 2 True NaN 2 13 3
16 1 NaN NaN NaN 11 2
17 8 NaN NaN NaN 11 2
18 1 NaN NaN NaN 11 2
19 5 True NaN 5 7 2
20 1 NaN NaN NaN 7 2
21 1 NaN NaN NaN 7 2
22 2 NaN NaN NaN 7 2
23 1 NaN NaN NaN 7 2
24 7 NaN True NaN 7 2
25 1 NaN NaN NaN 5 1
26 1 NaN NaN NaN 5 1
27 1 NaN NaN NaN 5 1
28 3 True NaN 3 8 2
But I need NaN for all the values in ASum and ACnt where A is NaN.
Is this the way to do it?
Are you just doing a sum, or is this a simplified example for a more complex problem?
If it's just a sum then you can use a mix of fillna() and the fact that True and False act like 1 and 0 in np.sum:
In [8]: pd.rolling_sum(dt['A'].fillna(False), window=10,
min_periods=1)[dt['A'].fillna(False)]
Out[8]:
6 1
9 2
15 3
19 2
28 2
dtype: float64
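`pd.rolling_sum` and `pd.rolling_count` were removed from later pandas releases in favor of the `.rolling()` method. The sketch below is one way to write the same idea with that API, and it also blanks the rows where A is not True, as the question asks. It assumes the `Val` column and the True positions of `A` as printed in the question; `is_a`, `aval` and `window` are illustrative names.

```python
import numpy as np
import pandas as pd

val = [1, 1, 1, 1, 6, 1, 2, 1, 3, 9, 1, 9, 1, 1, 1,
       2, 1, 8, 1, 5, 1, 1, 2, 1, 7, 1, 1, 1, 3]
dt = pd.DataFrame({'Val': val})

# Rows where A is True in the question's table.
is_a = dt.index.isin([6, 9, 15, 19, 28])

# Keep Val only on those rows; everything else becomes NaN.
aval = dt['Val'].where(is_a)

# Window over the last 10 rows; NaN cells are skipped by sum and count.
window = aval.rolling(window=10, min_periods=1)

# Report the result only on rows where A is True.
dt['ASum'] = window.sum().where(is_a)
dt['ACnt'] = window.count().where(is_a)
print(dt[is_a])
```

On this data `ASum` comes out as 2, 11, 13, 7 and 8 at rows 6, 9, 15, 19 and 28, matching the values annotated in the question.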
