Merging two DataFrames based on indexes from two other DataFrames - python

I'm new to pandas and have tried going through the docs and experimenting with various examples, but the problem I'm tackling has really stumped me.
I have the following two dataframes (DataA/DataB) which I would like to merge on a per global_index/item/values basis.
DataA
row item_id valueA
0   x       A1
1   y       A2
2   z       A3
3   x       A4
4   z       A5
5   x       A6
6   y       A7
7   z       A8

DataB
row item_id valueB
0   x       B1
1   y       B2
2   x       B3
3   y       B4
4   z       B5
5   x       B6
6   y       B7
7   z       B8
The list of items (item_ids) is finite, and each of the two dataframes represents the value of a trait (trait A, trait B) for an item at a given global_index value.
The global_index could roughly be thought of as a unit of "time"
The mapping between each data frame (DataA/DataB) and the global_index is done via the following two mapper DFs:
DataA_mapper
global_index start_row num_rows
0 0 3
1 3 2
3 5 3
DataB_mapper
global_index start_row num_rows
0 0 2
2 2 3
4 5 3
Simply put, for a given global_index (e.g. 1), the mapper defines the range of rows in the respective DF (DataA or DataB) that are associated with that global_index.
For example, for a global_index value of 0:
In DF DataA rows 0..2 are associated with global_index 0
In DF DataB rows 0..1 are associated with global_index 0
Another example, for a global_index value of 2:
In DF DataB rows 2..4 are associated with global_index 2
In DF DataA there are no rows associated with global_index 2
The ranges [start_row, start_row + num_rows) do not overlap each other, and each represents a unique sequence/range of rows in its respective dataframe (DataA, DataB).
In short, no row in either DataA or DataB will be found in more than one range.
I would like to merge the DFs so that I get the following dataframe:
row global_index item_id valueA valueB
0 0 x A1 B1
1 0 y A2 B2
2 0 z A3 NaN
3 1 x A4 B1
4 1 z A5 NaN
5 2 x A4 B3
6 2 y A2 B4
7 2 z A5 NaN
8 3 x A6 B3
9 3 y A7 B4
10 3 z A8 B5
11 4 x A6 B6
12 4 y A7 B7
13 4 z A8 B8
In the final dataframe, for any pair of global_index/item_id there will only ever be either:
a value for both valueA and valueB
a value only for valueA
a value only for valueB
The requirement is that if only one value exists for a given global_index/item (e.g. valueA but no valueB), the last known value of the missing one should be carried forward and used.

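Assuming the example data is reconstructed as follows (a minimal sketch; the names df_A, df_B, map_A and map_B correspond to DataA, DataB and their mappers and are used in the code below):
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'row': range(8),
                     'item_id': ['x', 'y', 'z', 'x', 'z', 'x', 'y', 'z'],
                     'valueA': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']})
df_B = pd.DataFrame({'row': range(8),
                     'item_id': ['x', 'y', 'x', 'y', 'z', 'x', 'y', 'z'],
                     'valueB': ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8']})
map_A = pd.DataFrame({'global_index': [0, 1, 3], 'start_row': [0, 3, 5], 'num_rows': [3, 2, 3]})
map_B = pd.DataFrame({'global_index': [0, 2, 4], 'start_row': [0, 2, 5], 'num_rows': [2, 3, 3]})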
First, you can create the 'global_index' column using the function pd.cut:
for df, m in [(df_A, map_A), (df_B, map_B)]:
    # create bins from the cumulative row counts and add zero at the beginning
    # (for map_A this gives bins [0, 3, 5, 8] with labels [0, 1, 3]; this relies on the
    #  mapper ranges being contiguous, as in the example data)
    bins = np.insert(m['num_rows'].cumsum().values, 0, 0)
    df['global_index'] = pd.cut(df['row'], bins=bins, labels=m['global_index'], right=False)
Next, you can use outer join to merge both data frames:
df = df_A.merge(df_B, on=['global_index', 'item_id'], how='outer')
And finally you can use functions groupby and ffill to fill missing values:
for val in ['valueA', 'valueB']:
    df[val] = df.groupby('item_id')[val].ffill()
Output:
item_id global_index valueA valueB
0 x 0 A1 B1
1 y 0 A2 B2
2 z 0 A3 NaN
3 x 1 A4 B1
4 z 1 A5 NaN
5 x 3 A6 B1
6 y 3 A7 B2
7 z 3 A8 NaN
8 x 2 A6 B3
9 y 2 A7 B4
10 z 2 A8 B5
11 x 4 A6 B6
12 y 4 A7 B7
13 z 4 A8 B8

I haven't tested this out, since I don't have any good test data, but I think something like this should work. Basically, rather than trying to pull off some sort of complicated join, it builds a series of lists to hold your data, which you can then put back together into a final dataframe at the end.
DataA = DataA.set_index('row')
DataB = DataB.set_index('row')
#we're going to create the new dataframe from scratch, creating a list for each column we want
global_index = []
AValues = []
AIndex = []
BValues = []
BIndex = []
#the global_index values to walk through (note: the .values[0] lookups below assume
#every global_index appears in both mappers)
totalIndexes = sorted(set(DataA_mapper['global_index']) | set(DataB_mapper['global_index']))
for indexNum in totalIndexes:
    #for each global index, we get the total number of rows to extract from DataA and DataB
    AStart = DataA_mapper.loc[DataA_mapper['global_index']==indexNum, 'start_row'].values[0]
    ARows = DataA_mapper.loc[DataA_mapper['global_index']==indexNum, 'num_rows'].values[0]
    AStop = AStart + ARows
    BStart = DataB_mapper.loc[DataB_mapper['global_index']==indexNum, 'start_row'].values[0]
    BRows = DataB_mapper.loc[DataB_mapper['global_index']==indexNum, 'num_rows'].values[0]
    BStop = BStart + BRows
    #Next we extract values from DataA and DataB, turn them into lists, and add them to our data
    AValues = AValues + list(DataA.iloc[AStart:AStop, 1].values)
    AIndex = AIndex + list(DataA.iloc[AStart:AStop, 0].values)
    BValues = BValues + list(DataB.iloc[BStart:BStop, 1].values)
    BIndex = BIndex + list(DataB.iloc[BStart:BStop, 0].values)
    #Create a temporary list of the current global_index, and add it to our data
    global_index_temp = []
    for row in range(max(ARows, BRows)):
        global_index_temp.append(indexNum)
    global_index = global_index + global_index_temp
#combine all these individual lists into a dataframe
#(note: zip stops at the shortest list, so the A and B lists should end up the same length)
finalData = list(zip(global_index, AIndex, BIndex, AValues, BValues))
df = pd.DataFrame(data=finalData, columns=['global_index', 'item1', 'item2', 'valueA', 'valueB'])
#lastly you just need to merge item1 and item2 to get your item_id column
I've tried to comment it nicely so that hopefully the general plan makes sense and you can follow along and correct my mistakes or rewrite it your own way.
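For that last step, combining item1 and item2 into a single item_id column could look like this (a hedged sketch, assuming the two columns agree wherever both are present):
df['item_id'] = df['item1'].combine_first(df['item2'])
df = df.drop(columns=['item1', 'item2'])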

Related

How to slice/chop a string using multiple indexes in a panda DataFrame

I'm in need of some advice on the following issue:
I have a DataFrame that looks like this:
ID SEQ LEN BEG_GAP END_GAP
0 A1 AABBCCDDEEFFGG 14 2 4
1 A1 AABBCCDDEEFFGG 14 10 12
2 B1 YYUUUUAAAAMMNN 14 4 6
3 B1 YYUUUUAAAAMMNN 14 8 12
4 C1 LLKKHHUUTTYYYYYYYYAA 20 7 9
5 C1 LLKKHHUUTTYYYYYYYYAA 20 12 15
6 C1 LLKKHHUUTTYYYYYYYYAA 20 17 18
What I need to get is the SEQ split at the different BEG_GAP and END_GAP positions. I have already worked this out (thanks to a previous question) for sequences that have only one pair of gaps, but here they have multiple.
This is what the sequences should look like:
ID SEQ
0 A1 AA---CDDEE---GG
1 B1 YYUU---A-----NN
2 C1 LLKKHHU---YY----Y--A
Or in an exploded DF:
ID Seq_slice
0 A1 AA
1 A1 CDDEE
2 A1 GG
3 B1 YYUU
4 B1 A
5 B1 NN
6 C1 LLKKHHU
7 C1 YY
8 C1 Y
9 C1 A
At the moment, I'm using a piece of code (that I got thanks to a previous question) that works only if there's one gap, and it looks like this:
import pandas as pd
df = pd.read_csv("..\path_to_the_csv.csv")
df["BEG_GAP"] = df["BEG_GAP"].astype(int)
df["END_GAP"]= df["END_GAP"].astype(int)
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
output = df.explode('SEQ').query('SEQ!=""')
But this has the problem that it generates a bunch of sequences that don't really exist because they actually have another gap in the middle.
I.e what it would generate:
ID Seq_slice
0 A1 AA
1 A1 CDDEEFFG #<- this one shouldn't exist! Because there's another gap in 10-12
2 A1 AABBCCDDEE #<- Also, this one shouldn't exist, it's missing the previous gap.
3 A1 GG
And so on, with the other sequences. As you can see, there are some slices that are not being generated and some that are wrong, because I don't know how to tell the code to take all the gaps into account while analyzing the sequence.
All advice is appreciated, I hope I was clear!
Let's try defining a function and apply:
def truncate(data):
    seq = data.SEQ.iloc[0]
    ll = data.LEN.iloc[0]
    return [seq[x:y] for x, y in zip([0] + list(data.END_GAP),
                                     list(data.BEG_GAP) + [ll])]

(df.groupby('ID').apply(truncate)
   .explode().reset_index(name='Seq_slice')
)
Output:
ID Seq_slice
0 A1 AA
1 A1 CCDDEE
2 A1 GG
3 B1 YYUU
4 B1 AA
5 B1 NN
6 C1 LLKKHHU
7 C1 TYY
8 C1 YY
9 C1 AA
In one line:
df.groupby('ID').agg({'BEG_GAP': list, 'END_GAP': list, 'SEQ': max, 'LEN': max}).apply(lambda x: [x['SEQ'][b: e] for b, e in zip([0] + x['END_GAP'], x['BEG_GAP'] + [x['LEN']])], axis=1).explode()
ID
A1 AA
A1 CCDDEE
A1 GG
B1 YYUU
B1 AA
B1 NN
C1 LLKKHHU
C1 TYY
C1 YY
C1 AA

Determining when order of a set of columns changes in pandas dataframe

I have a very large csv file with following structure:
a1 b1 c1 a2 b2 c2 a3 b3 c3 ..... a999 b999 c999
0 5 4 2 3 2 2 6 7 9 ....................
1 2 1 4 4 6 9 3 5 9 ....................
.
.
What I want to do is to group the columns in sets of N, for a, b and c, and check when the index of maximum value (argmax) of the set changes, in each row.
So in the above example, for N = 3, a1, b1, c1 is the first set in row 0 and its argmax is 0; the 2nd set is a2, b2, c2 and the argmax is still 0; the 3rd set is a3, b3, c3, but now the argmax is 2. Ideally I am looking for a script that parses the whole csv file and returns [c3, c1]: c3 because that's where the argmax changes in row 0, and c1 because the argmax doesn't change in row 1 but c1 is the largest value in that set.
I am doing this right now using two for loops, and it's slow and looks very ugly. Is there a better, pandas-pythonic way of doing this? I feel there must be.
I tried to keep the code as simple as possible. You can transpose your dataframe and group by the sliced column name:
df = df.T.reset_index()
idx = df.groupby(df['index'].str.slice(1,2)).idxmax()
Output:
0 1
index
1 0 2
2 3 5
3 8 8
That means that for row 0 the max for group 1 is at index 0, the max for group 2 is at index 3 (or 0 if you take the mod 3), and the max for group 3 is at index 8 (or 2 if you take the mod 3). Same reading for row 1 :)
If you need the actual column name:
df.columns[idx.values.flatten(order='F')]
Output:
['a1', 'a2', 'c3', 'c1', 'c2', 'c3']
You can groupby sets of columns and use .idxmax to find the column where the maximum occurs within each set. You can find where the first letter changes (if it ever does) to get your list.
n = 3
df2 = df.groupby([x//n for x in range(len(df.columns))], axis=1).idxmax(1)
mask = df2.applymap(lambda x: x[0]) # Case of 1-letter column prefix
## If possibility of words with different length ending in digits try
# import string
# mask = df2.applymap(lambda x: x.strip(string.digits))
df2.lookup(df2.index,
           (mask.ne(mask.shift(-1, axis=1)).idxmax(1) + 1) % (len(mask.columns))).tolist()
Sample Data
print(df)
a1 b1 c1 a2 b2 c2 a3 b3 c3
0 5 4 2 3 2 2 6 7 9
1 2 1 4 4 6 9 3 5 9
2 2 1 4 10 6 9 3 5 9
3 2 1 4 1 6 9 3 10 9
n = 3
df2 = df.groupby([x//n for x in range(len(df.columns))], axis=1).idxmax(1)
print(df2)
# 0 1 2
#0 a1 a2 c3
#1 c1 c2 c3
#2 c1 a2 c3
#3 c1 c2 b3
mask = df2.applymap(lambda x: x[0])
df2.lookup(df2.index, (mask.ne(mask.shift(-1, axis=1)).idxmax(1)+1) % (len(mask.columns))).tolist()
#['c3', 'c1', 'a2', 'b3']
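Note that DataFrame.lookup was deprecated and later removed in pandas 2.0. An equivalent of the last line using plain numpy indexing could look like this (a sketch under that assumption):
import numpy as np
cols = (mask.ne(mask.shift(-1, axis=1)).idxmax(1) + 1) % len(mask.columns)
df2.to_numpy()[np.arange(len(df2)), df2.columns.get_indexer(cols)].tolist()
#same result as the lookup call above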

How to group by two column with swapped values in pandas?

I want to group by columns where the commutative rule applies.
For example
if column 1 and column 2 contain the values (a,b) in one row and (b,a) in another row, then I want to group these two records together when performing the group by operation.
Input:
From To Count
a1 b1 4
b1 a1 3
a1 b2 2
b3 a1 12
a1 b3 6
Output:
From To Count(+)
a1 b1 7
a1 b2 2
b3 a1 18
I tried to apply group by after swapping the elements. But I don't have any approach to solve this problem. Help me to solve this problem.
Thanks in advance.
Use numpy.sort for sorting each row:
cols = ['From','To']
df[cols] = pd.DataFrame(np.sort(df[cols], axis=1))
print (df)
From To Count
0 a1 b1 4
1 a1 b1 3
2 a1 b2 2
3 a1 b3 12
4 a1 b3 6
df1 = df.groupby(cols, as_index=False)['Count'].sum()
print (df1)
From To Count
0 a1 b1 7
1 a1 b2 2
2 a1 b3 18
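If you would rather keep the orientation of the first occurrence of each pair (so b3/a1 stays as b3 a1, as in the desired output), an alternative sketch starting from the original, unsorted frame is to group on a frozenset key:
key = df[['From', 'To']].apply(frozenset, axis=1)
df1 = (df.groupby(key, sort=False)
         .agg({'From': 'first', 'To': 'first', 'Count': 'sum'})
         .reset_index(drop=True))
print(df1)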

Summing columns from different dataframe according to some column names

Suppose I have a main dataframe
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
What I want is to add the value from each dataframe to the total in main_df, matching rows according to the Cri columns
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it using a for loop, but in the end I want to apply the method to a large amount of data, say 50000 rows in each dataframe.
Are there other ways to solve it?
Thank you!
First you should align your numeric column names. In this case:
df_main = df_main.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = pd.concat([df_main, df_1, df_2, df_3])\
        .groupby(['Cri1', 'Cri2', 'Cri3']).sum()\
        .reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [df_main, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0
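Because add with fill_value promotes the column to floats (hence the 14.0 above), you can cast the result back to integers afterwards if that matters (a small optional step, assuming no missing keys remain):
res['value'] = res['value'].astype(int)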

Pandas Multiindex Groupby aggregate column with value from another column

I have a pandas dataframe with multiindex where I want to aggregate the duplicate key rows as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'S': [0, 5, 0, 5, 0, 3, 5, 0],
                   'Q': [6, 4, 10, 6, 2, 5, 17, 4],
                   'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
                   'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2']})
df.set_index(['A', 'B'])
Q S
A B
A1 B1 6 0
B1 4 5
B2 10 0
B2 6 5
A2 B1 2 0
B1 5 3
B1 17 5
B2 4 0
and I would like to group this dataframe to aggregate the Q values (sum) and keep the S value from the row with the maximal Q value, yielding this:
df2 = pd.DataFrame({'S': [0, 0, 5, 0],
                    'Q': [10, 16, 24, 4],
                    'A': ['A1', 'A1', 'A2', 'A2'],
                    'B': ['B1', 'B2', 'B1', 'B2']})
df2.set_index(['A', 'B'])
Q S
A B
A1 B1 10 0
B2 16 0
A2 B1 24 5
B2 4 0
I tried the following, but it didn't work:
df.groupby(by=['A','B']).agg({'Q':'sum','S':df.S[df.Q.idxmax()]})
any hints?
One way is to use agg, apply, and join:
g = df.groupby(['A','B'], group_keys=False)
g.apply(lambda x: x.loc[x.Q == x.Q.max(),['S']]).join(g.agg({'Q':'sum'}))
Output:
S Q
A B
A1 B1 0 10
B2 0 16
A2 B1 5 24
B2 0 4
Here's one way
In [1800]: def agg(x):
      ...:     m = x.S.iloc[np.argmax(x.Q.values)]
      ...:     return pd.Series({'Q': x.Q.sum(), 'S': m})
      ...:
In [1801]: df.groupby(['A', 'B']).apply(agg)
Out[1801]:
Q S
A B
A1 B1 10 0
B2 16 0
A2 B1 24 5
B2 4 0
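Another compact option, not from the original answers but along the same lines (a sketch): sort so that the row with the largest Q comes last within each group, then sum Q and take the last S:
out = (df.sort_values('Q')
         .groupby(['A', 'B'])
         .agg({'Q': 'sum', 'S': 'last'}))
print(out)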
