Merging Two Columns in DataFrame With Variable Column Names - python

Editing my original post to hopefully simplify my question... I'm merging multiple DataFrames into one, SomeData.DataFrame, which gives me the following:
Key 2019-02-17 2019-02-24_x 2019-02-24_y 2019-03-03
0 A 80 NaN NaN 80
1 B NaN NaN 45 36
2 C 44 NaN 39 NaN
3 D 80 NaN NaN 12
4 E 49 2 NaN NaN
What I'm trying to do now is efficiently merge the columns ending in "_x" and "_y" while keeping everything else in place so that I get:
Key 2019-02-17 2019-02-24 2019-03-03
0 A 80 NaN 80
1 B NaN 45 36
2 C 44 39 NaN
3 D 80 NaN 12
4 E 49 2 NaN
The other issue I'm trying to account for is that the data contained in SomeData.DataFrame changes weekly, so my column headers are unpredictable. Some weeks I may not have the above issue at all, and other weeks there may be multiple instances, for example:
Key 2019-02-17 2019-02-24_x 2019-02-24_y 2019-03-10_x 2019-03-10_y
0 A 80 NaN NaN 80 NaN
1 B NaN NaN 45 36 NaN
2 C 44 NaN 39 NaN 12
3 D 80 NaN NaN 12 NaN
4 E 49 2 NaN NaN 17
So that again the desired result would be:
Key 2019-02-17 2019-02-24 2019-03-10
0 A 80 NaN 80
1 B NaN 45 36
2 C 44 39 12
3 D 80 NaN 12
4 E 49 2 17
Is what I'm asking reasonable or am I venturing outside the bounds of Pandas' limits? I can't find anyone trying to do anything similar so I'm not sure anymore. Thank you in advance!

Edited answer to updated question:
df = df.set_index('Key')
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
2019-02-17 2019-02-24 2019-03-03
Key
A 80.0 0.0 80.0
B 0.0 45.0 36.0
C 44.0 39.0 0.0
D 80.0 0.0 12.0
E 49.0 2.0 0.0
For the second dataframe:
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
2019-02-17 2019-02-24 2019-03-10
Key
A 80.0 0.0 80.0
B 0.0 45.0 36.0
C 44.0 39.0 12.0
D 80.0 0.0 12.0
E 49.0 2.0 17.0
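Two caveats about the groupby approach are worth sketching. Newer pandas releases deprecate the axis=1 argument of groupby, and a plain sum() turns an all-NaN pair into 0.0 rather than NaN. A minimal sketch, assuming the first sample frame from the question is rebuilt by hand and that suffixes always follow an underscore; it groups the transposed frame instead and uses sum(min_count=1) so that missing pairs stay NaN (the names base_names and merged are only illustrative):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "Key": ["A", "B", "C", "D", "E"],
    "2019-02-17": [80, nan, 44, 80, 49],
    "2019-02-24_x": [nan, nan, nan, nan, 2],
    "2019-02-24_y": [nan, 45, 39, nan, nan],
    "2019-03-03": [80, 36, nan, 12, nan],
}).set_index("Key")

# Strip the "_x" / "_y" suffix to get one label per original column.
base_names = df.columns.str.split("_").str[0]

# Group the transposed frame by those labels (avoids groupby(axis=1)).
# min_count=1 keeps NaN when every value in a group is missing.
merged = df.T.groupby(base_names).sum(min_count=1).T

print(merged)
```

If both the "_x" and "_y" columns can hold a value in the same row, sum adds them together; replacing sum(min_count=1) with first() would instead keep the first non-null value.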
You could try something like this:
df_t = df.T
df_t.set_index(df_t.groupby(level=0).cumcount(), append=True)\
.unstack().T\
.sort_values(df.columns[0])[df.columns.unique()]\
.reset_index(drop=True)
Output:
val03-20 03-20 val03-24 03-24
0 a 1 d 5
1 b 6 e 7
2 c 4 f 10
3 NaN NaN g 5
4 NaN NaN h 6
5 NaN NaN i 1

Related

Pandas: grab positions in dataframe which indexes are listed in another dataframe

Suppose that I have 2 dataframes, with indexes populated so that elements in columns are unique, because in real data they are:
import numpy as np
import pandas as pd
vals = pd.DataFrame(np.random.randint(0,10,(10, 3)), columns=list('ABC'))
indexes = pd.DataFrame(np.argsort(np.random.randint(0,10,(10, 3)), axis=0)[:5], columns=list('ABC'))
>>> vals
A B C
0 64 20 48
1 28 60 81
2 5 73 77
3 74 66 86
4 41 39 21
5 65 37 98
6 10 20 73
7 6 70 3
8 36 29 28
9 43 13 12
>>> indexes
A B C
0 4 2 3
1 3 3 8
2 5 1 7
3 9 8 9
4 2 4 0
I would like to retain only those values in vals whose indexes are listed in indexes. I don't care about row integrity or NAs, as I'll use the columns as Series later.
This is what I came up with:
vals_indexes = pd.DataFrame()
for i in range(vals.shape[1]):
    vals_indexes = pd.concat([vals_indexes, vals.iloc[[e for e in indexes.iloc[:, i] if e in vals.index], i]], axis=1)
>>> vals_indexes
A B C
0 NaN NaN 48.0
1 NaN 60.0 NaN
2 5.0 73.0 NaN
3 74.0 66.0 86.0
4 41.0 39.0 NaN
5 65.0 NaN NaN
7 NaN NaN 3.0
8 NaN 29.0 28.0
9 43.0 NaN 12.0
Which is a bit ugly, but works for me. Question: is there a more effective way to do this?
Use .loc within a loop to replace non-existing indexes with NaN:
for i in vals.columns:
    vals.loc[vals[i].isin(list(indexes[i].unique())), i] = np.nan
print(vals)
A B C
0 NaN 2.0 NaN
1 NaN 5.0 NaN
2 2.0 3.0 NaN
3 NaN NaN NaN
4 NaN NaN 6.0
5 9.0 NaN NaN
6 NaN NaN 4.0
7 NaN 7.0 NaN
8 2.0 NaN NaN
9 NaN NaN NaN
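The table in the question keeps a value when its row label appears in the matching column of indexes, so the membership test belongs on the row labels rather than on the cell values. A minimal sketch of that label-based selection, assuming the question's sample frames are rebuilt by hand (the names keep and vals_indexes are only illustrative):

```python
import pandas as pd

vals = pd.DataFrame({
    "A": [64, 28, 5, 74, 41, 65, 10, 6, 36, 43],
    "B": [20, 60, 73, 66, 39, 37, 20, 70, 29, 13],
    "C": [48, 81, 77, 86, 21, 98, 73, 3, 28, 12],
})
indexes = pd.DataFrame({
    "A": [4, 3, 5, 9, 2],
    "B": [2, 3, 1, 8, 4],
    "C": [3, 8, 7, 9, 0],
})

# One boolean column per column of vals: True where the row label
# is listed in the same-named column of indexes.
keep = pd.DataFrame(
    {col: vals.index.isin(indexes[col]) for col in vals.columns},
    index=vals.index,
)

# Blank out everything else, then drop rows that ended up entirely NaN.
vals_indexes = vals.where(keep).dropna(how="all")
print(vals_indexes)
```

Labels in indexes that do not exist in vals are simply ignored by isin, which mirrors the "if e in vals.index" guard in the original loop.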

How To Map Column Values where two others match? "Reindexing only valid with uniquely valued Index objects"?

I have one DataFrame, df, with the four columns shown below:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 NaN
3 110 2 150
5 120 3 NaN
7 140 4 160
9 150 5 190
NaN NaN 6 130
NaN NaN 7 NaN
NaN NaN 8 200
NaN NaN 9 90
NaN NaN 10 NaN
I want instead to map values from df.IDP1Number to IDP2Number using IDP1 to IDP2. I want to replace existing values if IDP1 and IDP2 both exist with IDP1Number. Otherwise leave values in IDP2Number alone.
The error message that appears reads, "Reindexing only valid with uniquely valued Index objects".
The Dataframe below is what I wish to have:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 100
3 110 2 150
5 120 3 110
7 140 4 160
9 150 5 120
NaN NaN 6 130
NaN NaN 7 140
NaN NaN 8 200
NaN NaN 9 150
NaN NaN 10 NaN
Here's a way to do it:
# filter the data and create a mapping dict
maps = df.query("IDP1.notna()")[['IDP1', 'IDP1Number']].set_index('IDP1')['IDP1Number'].to_dict()
# create new column using ifelse condition
df['IDP2Number'] = df.apply(lambda x: maps.get(x['IDP2'], None) if (pd.isna(x['IDP2Number']) or x['IDP2'] in maps) else x['IDP2Number'], axis=1)
print(df)
IDP1 IDP1Number IDP2 IDP2Number
0 1.0 100.0 1 100.0
1 3.0 110.0 2 150.0
2 5.0 120.0 3 110.0
3 7.0 140.0 4 160.0
4 9.0 150.0 5 120.0
5 NaN NaN 6 130.0
6 NaN NaN 7 140.0
7 NaN NaN 8 200.0
8 NaN NaN 9 150.0
9 NaN NaN 10 NaN
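The same mapping can also be written without a row-wise apply. A minimal sketch, assuming the question's sample frame is rebuilt by hand; it uses Series.map for the lookup and fillna to fall back to the existing IDP2Number (the name lookup is only illustrative). A lookup Series with repeated index values may be one way to run into the "Reindexing only valid with uniquely valued Index objects" error, so the sketch drops duplicates first, on the assumption that keeping the first number per IDP1 is acceptable:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "IDP1": [1, 3, 5, 7, 9, nan, nan, nan, nan, nan],
    "IDP1Number": [100, 110, 120, 140, 150, nan, nan, nan, nan, nan],
    "IDP2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "IDP2Number": [nan, 150, nan, 160, 190, 130, nan, 200, 90, nan],
})

# Build an IDP1 -> IDP1Number lookup from the rows where IDP1 exists.
lookup = df.dropna(subset=["IDP1"]).set_index("IDP1")["IDP1Number"]

# Assumption: if an IDP1 value repeats, keep its first number.
lookup = lookup[~lookup.index.duplicated(keep="first")]

# Mapped values win; where IDP2 has no match, keep the old IDP2Number.
df["IDP2Number"] = df["IDP2"].map(lookup).fillna(df["IDP2Number"])
print(df)
```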

Is there a way to replace a whole pandas dataframe row using ffill, if one value of a specific column is NaN?

I am trying to sort a dataframe where some rows are all NaN. I want to fill these using ffill. I'm currently trying this, although I feel like it's a mishmash of a few commands:
df.loc[df['A'].isna(), :] = df.fillna(method='ffill')
This gives an error:
AttributeError: 'NoneType' object has no attribute 'fillna'
What I want is to forward-fill a row only if one specific column in it is NaN, i.e.:
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 NaN NaN NaN NaN NaN
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 NaN NaN NaN NaN NaN
So I would like to fill a row if and only if the value of A is NaN, whilst leaving C,0 and D,0 as NaN, giving the dataframe below:
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 85 65 11 31 5
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 10 32 5 15 11
So, just to clarify, the ONLY rows that get replaced with ffill are 3 and 8, because the value of column A in those rows is NaN.
Thanks
---Update---
When I'm debugging and evaluate the expression : df.loc[df['A'].isna(), :]
I get
3 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
So I assume what's happening here is that I then attempt ffill on this new dataframe containing only rows 3 and 8, and obviously I can't ffill NaNs with NaNs.
Change values only in those rows where A is NaN:
df.loc[df['A'].isna(), :] = df.ffill().loc[df['A'].isna(), :]
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
Try using a mask to identify the relevant rows where column A is null. Then take those same rows from the forward-filled dataframe.
mask = df['A'].isnull()
df.loc[mask, :] = df.ffill().loc[mask, :]
>>> df
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
You just want to forward-fill (DataFrame.ffill) where (DataFrame.where) df['A'] is NaN and leave the rest as before (df):
df=df.ffill().where(df['A'].isna(),df)
print(df)
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
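For anyone who wants to reproduce the answers end to end, here is a minimal self-contained sketch of the mask-based variant, assuming the question's sample frame is rebuilt by hand (the name filled is only illustrative). It also checks that NaNs in rows where A is present are left untouched:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "A": [45, 62, 85, nan, 90, 0, 58, 10, nan],
    "B": [88, 34, 65, nan, 38, 94, nan, 32, nan],
    "C": [nan, 2, 11, nan, 34, 45, 23, 5, nan],
    "D": [nan, 86, 31, nan, 93, 10, 60, 15, nan],
    "E": [3, nan, 5, nan, 8, 10, 11, 11, nan],
})

# Rows to replace: those where column A is missing.
mask = df["A"].isna()

# Forward-fill the whole frame, but copy back only the masked rows.
filled = df.copy()
filled.loc[mask] = df.ffill().loc[mask]
print(filled)
```

Note that ffill works column by column, so if the row directly above a masked row has a NaN in some column, that column is filled from the nearest earlier non-null value rather than from the row directly above.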

How to find last index in Pandas Data Frame row and count backwards using column information?

For example:
If I have a data frame like this:
20 40 60 80 100 120 140
1 1 1 1 NaN NaN NaN NaN
2 1 1 1 1 1 NaN NaN
3 1 1 1 1 NaN NaN NaN
4 1 1 NaN NaN 1 1 1
How do I find the last index in each row and then count the difference in columns elapsed so I get something like this?
20 40 60 80 100 120 140
1 40 20 0 NaN NaN NaN NaN
2 80 60 40 20 0 NaN NaN
3 60 40 20 0 NaN NaN NaN
4 20 0 NaN NaN 40 20 0
You can try reversing each row, counting the consecutive non-null values cumulatively, and then scaling by the column spacing (20 here):
import numpy as np
# A somewhat involved procedure; it assumes evenly spaced column labels (a step of 20).
def fill_values(row):
    row = row[::-1]  # walk the row from right to left
    a = row == 1     # True where a value is present
    b = a.cumsum()
    # Position of each cell within its run of consecutive non-null values
    return (b - b.mask(a).ffill().fillna(0).astype(int))[::-1] * 20
df.apply(fill_values, axis=1).replace(0, np.nan) - 20
Out:
20 40 60 80 100 120 140
1 40.0 20.0 0.0 NaN NaN NaN NaN
2 80.0 60.0 40.0 20.0 0.0 NaN NaN
3 60.0 40.0 20.0 0.0 NaN NaN NaN
4 20.0 0.0 NaN NaN 40.0 20.0 0.0
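If the column labels are not evenly spaced, the distance can be computed from the labels themselves instead of multiplying by a fixed step. A minimal sketch, assuming numeric column labels and the question's sample frame rebuilt by hand (the function name distance_to_run_end is hypothetical):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[1, 1, 1, nan, nan, nan, nan],
     [1, 1, 1, 1, 1, nan, nan],
     [1, 1, 1, 1, nan, nan, nan],
     [1, 1, nan, nan, 1, 1, 1]],
    index=[1, 2, 3, 4],
    columns=[20, 40, 60, 80, 100, 120, 140],
)

def distance_to_run_end(row):
    valid = row.notna()
    # Give every run of consecutive valid (or invalid) cells its own id.
    run_id = (valid != valid.shift()).cumsum()
    # Column labels as values, blanked out where the row has no data.
    labels = pd.Series(row.index, index=row.index, dtype=float)
    last_label = labels.where(valid).groupby(run_id).transform("max")
    # Distance from each cell to the last column of its run (NaN for gaps).
    return last_label - labels

result = df.apply(distance_to_run_end, axis=1)
print(result)
```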

Problem with merging Pandas Dataframes with Columns that don't line up

I am attempting to transpose and merge two pandas dataframes. One contains accounts, the segment in which they received their deposit, their deposit information, and the day they received the deposit; the other has the accounts and withdrawal information. The issue is that, for indexing purposes, the segment information from one dataframe should line up with that of the other, regardless of whether there was a withdrawal.
Notes:
There will always be an account for every person
There will not always be a withdrawal for every person
The accounts and data for the withdrawal dataframe only exist if a withdrawal occurs
Account Dataframe Code
from pandas import DataFrame
accounts = DataFrame({'person':[1,1,1,1,1,2,2,2,2,2],
                      'segment':[1,2,3,4,5,1,2,3,4,5],
                      'date_received':[10,20,30,40,50,11,21,31,41,51],
                      'amount_received':[1,2,3,4,5,6,7,8,9,10]})
accounts = accounts.pivot_table(index=["person"], columns=["segment"])
Account Dataframe
amount_received date_received
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
Withdrawal Dataframe Code
withdrawals = DataFrame({'person':[1,1,1,2,2],
'withdrawal_segment':[1,1,5,2,3],
'withdraw_date':[1,2,3,4,5],
'withdraw_amount':[10,20,30,40,50]})
withdrawals = withdrawals.reset_index().pivot_table(index = ['index', 'person'], columns = ['withdrawal_segment'])
Since a column can hold each segment number only once while still having to hold all of the data, this dataframe looks quite different.
Withdrawal Dataframe
withdraw_date withdraw_amount
withdrawal_segment 1 2 3 5 1 2 3 5
index person
0 1 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2.0 NaN NaN NaN 20.0 NaN NaN NaN
2 1 NaN NaN NaN 3.0 NaN NaN NaN 30.0
3 2 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
4 2 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
Merge
merge = accounts.merge(withdrawals, on='person', how='left')
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 5 1 2 3 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN 20.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN 3.0 NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
The problem with the merged dataframe is that segments from the withdrawal dataframe aren't lined up with the accounts segments.
The desired dataframe should look something like:
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN NaN 10.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN NaN 20.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN NaN 3.0 NaN NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN NaN 40.0 NaN NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN NaN 50.0 NaN NaN
My problem is that I can't seem to merge across both person and segments. I've thought about inserting a row and column, but because I don't know which segments are and aren't going to have a withdrawal this gets difficult. Is it possible to merge the dataframes so that they line up across both people and segments? Thanks!
Method 1, using reindex
withdrawals=withdrawals.reindex(pd.MultiIndex.from_product([withdrawals.columns.levels[0],accounts.columns.levels[1]]),axis=1)
merge = accounts.merge(withdrawals, on='person', how='left')
merge
Out[79]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN
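The key step in Method 1 is the column reindex, which can be checked on its own before merging. A minimal sketch, assuming the two frames are built exactly as in the question (the name aligned is only illustrative); it confirms that the missing segment 4 appears as an all-NaN column under each withdrawal field:

```python
import pandas as pd

accounts = pd.DataFrame({
    "person": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "segment": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "date_received": [10, 20, 30, 40, 50, 11, 21, 31, 41, 51],
    "amount_received": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}).pivot_table(index=["person"], columns=["segment"])

withdrawals = pd.DataFrame({
    "person": [1, 1, 1, 2, 2],
    "withdrawal_segment": [1, 1, 5, 2, 3],
    "withdraw_date": [1, 2, 3, 4, 5],
    "withdraw_amount": [10, 20, 30, 40, 50],
}).reset_index().pivot_table(index=["index", "person"],
                             columns=["withdrawal_segment"])

# Full grid: every withdrawal field crossed with every account segment.
full_columns = pd.MultiIndex.from_product(
    [withdrawals.columns.levels[0], accounts.columns.levels[1]]
)

# Segments with no withdrawals (here, 4) become all-NaN columns.
aligned = withdrawals.reindex(columns=full_columns)
print(aligned)
```

With the columns aligned this way, the merge on person can then proceed as in the question.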
Method 2, using unstack and stack
merge = accounts.merge(withdrawals, on='person', how='left')
merge.stack(dropna=False).unstack()
Out[82]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN
