Joining dataframes whose columns have the same name - python

I would like to ask how to join (or merge) multiple dataframes (an arbitrary number) whose columns may have the same name. I know this has been asked several times, but I could not find a clear answer in any of the questions I have looked at.
import numpy as np
import pandas as pd
np.random.seed(1)
n_cols = 3
col_names = ["Ci"] + ["C"+str(i) for i in range(n_cols)]
def get_random_df():
    values = np.random.randint(0, 10, size=(4, n_cols))
    index = np.arange(4).reshape([4, -1])
    return pd.DataFrame(np.concatenate([index, values], axis=1), columns=col_names).set_index("Ci")
dfs = []
for i in range(3):
    dfs.append(get_random_df())
print(dfs[0])
print(dfs[1])
with output:
C0 C1 C2
Ci
0 5 8 9
1 5 0 0
2 1 7 6
3 9 2 4
C0 C1 C2
Ci
0 5 2 4
1 2 4 7
2 7 9 1
3 7 0 6
If I try to join two dataframes per iteration:
# join two dataframes per iteration
df = dfs[0]
for df_ in dfs[1:]:
    df = df.join(df_, how="outer", rsuffix="_r")
print("** 1 **")
print(df)
the final dataframe has columns with the same name: for example, C0_r is repeated for each joined dataframe.
** 1 **
C0 C1 C2 C0_r C1_r C2_r C0_r C1_r C2_r
Ci
0 5 8 9 5 2 4 9 9 7
1 5 0 0 2 4 7 6 9 1
2 1 7 6 7 9 1 0 1 8
3 9 2 4 7 0 6 8 3 9
This could easily be solved by providing a different suffix per iteration, for example:
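# a sketch of that workaround: a unique suffix per joined frame
df = dfs[0]
for i, df_ in enumerate(dfs[1:], start=1):
    df = df.join(df_, how="outer", rsuffix=f"_r{i}")
However, the documentation on join says: "Efficiently join multiple DataFrame objects by index at once by passing a list." If I try what follows: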
# join all at once
df = dfs[0].join(dfs[1:], how="outer")
# fails
# join all at once, with a suffix
df = dfs[0].join(dfs[1:], how="outer", rsuffix="_r")
# fails
Both attempts fail due to duplicate columns:
Indexes have overlapping values: Index(['C0', 'C1', 'C2'], dtype='object')
Question: is there a way to automatically join multiple dataframes without explicitly providing a different suffix every time?

Instead of join, concatenate along columns:
# concatenate along columns;
# use keys to tag the columns coming from each dataframe
res = pd.concat(dfs, keys=range(len(dfs)), axis=1)
# flatten the resulting MultiIndex column names
res.columns = [f"{j}_{i}" for i, j in res.columns]
res
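Each column name now carries the position of its source frame, so all names are unique. For the data above, res looks like this:
C0_0 C1_0 C2_0 C0_1 C1_1 C2_1 C0_2 C1_2 C2_2
Ci
0 5 8 9 5 2 4 9 9 7
1 5 0 0 2 4 7 6 9 1
2 1 7 6 7 9 1 0 1 8
3 9 2 4 7 0 6 8 3 9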

Wouldn't it be more readable to display your data like this? Add this line of code at the end:
pd.concat(dfs, axis=1, keys=[f'DF{i+1}' for i in range(len(dfs))])
#output
DF1 DF2 DF3
C0 C1 C2 C0 C1 C2 C0 C1 C2
Ci
0 5 8 9 5 2 4 9 9 7
1 5 0 0 2 4 7 6 9 1
2 1 7 6 7 9 1 0 1 8
3 9 2 4 7 0 6 8 3 9

Related

Combine 2 dataframes of different length to form a new dataframe that has a length equal to the max length of the 2 dataframes

I have a dataframe:
s = pd.Series([1, 2, 2, 3, 3, 6])
t = pd.Series([2, 4, 6, 8, 10, 12])
df1 = pd.DataFrame(s, columns=["MUL1"])
df1["MUL2"] = t
MUL1 MUL2
0 1 2
1 2 4
2 2 6
3 3 8
4 3 10
5 6 12
and another dataframe:
u = pd.Series([1, 2, 3, 6])
v = pd.Series([2, 8, 10, 12])
df2 = pd.DataFrame(u, columns=["MUL3"])
df2["MUL4"] = v
Now I want a new dataframe which looks like the following:
MUL6 MUL7
0 1 2
1 2 8
2 2 8
3 3 10
4 3 10
5 6 12
By combining the first 2 dataframes.
I have tried the following:
X1 = df1.to_numpy()
X2 = df2.to_numpy()
result = []
for i in range(X1.shape[0]):
    for j in range(X2.shape[0]):
        if X1[i, -1] == X2[j, -1]:
            result.append(X2[X1[i, -1] == X2[j, -1], -1])
I was trying to convert the dataframes to numpy arrays so I could iterate through them and build a new array to convert back to a dataframe. But the size of the new dataframe is not equal to the size of the first dataframe. I would appreciate any help. Thanks.
Although the details of the logic are cryptic, I believe that you want a merge:
(df1[['MUL1']].rename(columns={'MUL1': 'MUL6'})
     .merge(df2.rename(columns={'MUL3': 'MUL6', 'MUL4': 'MUL7'}),
            on='MUL6', how='left')
)
output:
MUL6 MUL7
0 1 2
1 2 8
2 2 8
3 3 10
4 3 10
5 6 12
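A map-based alternative gives the same result; this is a sketch assuming the MUL3 values are unique, so each MUL6 value has exactly one MUL7 match:
# look up MUL7 through a MUL3 -> MUL4 mapping
out = pd.DataFrame({'MUL6': df1['MUL1'],
                    'MUL7': df1['MUL1'].map(df2.set_index('MUL3')['MUL4'])})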

Is it possible to combine agg and value_counts in a single line with Pandas

Given a df
a b ngroup
0 1 3 0
1 1 4 0
2 1 1 0
3 3 7 2
4 4 4 2
5 1 1 4
6 2 2 4
7 1 1 4
8 6 6 5
I would like to compute the sum of multiple columns (i.e., a and b) grouped by the column ngroup.
In addition, I would like to count the number of elements in each group.
Based on these two conditions, the expected output is as below:
a b nrow_same_group ngroup
3 8 3 0
7 11 2 2
4 4 3 4
6 6 1 5
The following code does the job:
import pandas as pd

df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6,10],
                           [3,4,1,7,4,1,2,1,6,1],
                           [0,0,0,2,2,4,4,4,5])),
                  columns=['a','b','ngroup'])
grouped_df = df.groupby(['ngroup'])
df1 = grouped_df[['a','b']].agg('sum').reset_index()
df2 = df['ngroup'].value_counts().reset_index()
df2.sort_values('index', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={'index':'ngroup','ngroup':'nrow_same_group'},inplace=True)
df= pd.merge(df1, df2, on=['ngroup'])
However, I wonder whether pandas has a built-in way to achieve something similar in a single line.
You can do it using only groupby + agg.
import pandas as pd

df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6,10],
                           [3,4,1,7,4,1,2,1,6,1],
                           [0,0,0,2,2,4,4,4,5])),
                  columns=['a','b','ngroup'])
res = (
    df.groupby('ngroup', as_index=False)
      .agg(a=('a', 'sum'), b=('b', 'sum'),
           nrow_same_group=('a', 'size'))
)
Here the parameters passed to agg are tuples whose first element is the column to aggregate and the second element is the aggregation function to apply to that column. The parameter names are the labels for the resulting columns.
Output:
>>> res
ngroup a b nrow_same_group
0 0 3 8 3
1 2 7 11 2
2 4 4 4 3
3 5 6 6 1
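Note that this named-aggregation syntax (keyword arguments of the form label=(column, function)) was introduced in pandas 0.25, so it requires at least that version.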
First aggregate a and b with sum, then compute the size of each group and assign it to the nrow_same_group column:
g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size())
a b nrow_same_group
ngroup
0 3 8 3
2 7 11 2
4 4 4 3
5 6 6 1

Pandas - How to swap column contents leaving label sequence intact?

I am using pandas v0.25.3 and am inexperienced but learning.
I have a dataframe and would like to swap the contents of two columns leaving the columns labels and sequence intact.
df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [5, 6, 7, 8],
                   "C": [9, 10, 11, 12]})
This yields a dataframe,
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want to swap column contents B and C to get
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
I have tried looking at pd.DataFrame.values, which sent me to numpy arrays and advanced slicing, and I got lost.
What's the simplest way to do this?
You can assign a numpy array:
# pandas 0.24+
df[['B','C']] = df[['C','B']].to_numpy()
# older pandas versions
df[['B','C']] = df[['C','B']].values
Or use DataFrame.assign:
df = df.assign(B=df.C, C=df.B)
print(df)
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
Or just use:
df['B'], df['C'] = df['C'], df['B'].copy()
print(df)
Output:
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
You can also swap the labels:
df.columns = ['A','C','B']
If your DataFrame is very large, I believe this would require less from your computer than copying all the data.
If the order of the columns is important, you can then reorder them:
df = df.reindex(['A','B','C'], axis=1)
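Equivalently, starting from the original df, a one-line sketch with rename (the mapping swaps both labels at once, and reindex restores the order):
df = df.rename(columns={'B': 'C', 'C': 'B'}).reindex(['A', 'B', 'C'], axis=1)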

Rename every two columns with a prefix taken from another dataframe in pandas

I have 2 sample data frames:
df1 =
a_1 b_1 a_2 b_2
1 2 3 4
5 6 7 8
and
df2 =
c
12
14
I want to add the values of c as a prefix, in order:
df3 =
12_a_1 12_b_1 14_a_2 14_b_2
1 2 3 4
5 6 7 8
One option is a list comprehension:
import itertools
# use itertools to repeat values of df2
prefix = list(itertools.chain.from_iterable(itertools.repeat(str(x), 2) for x in df2['c'].values))
# list comprehension to create new column names
df1.columns = [p + '_' + c for c, p in zip(df1.columns, prefix)]
print(df1)
12_a_1 12_b_1 14_a_2 14_b_2
0 1 2 3 4
1 5 6 7 8
Use str.split and map:
s = (df1.columns.str.split('_').str[-1].astype(int) - 1).map(df2.c)
df1.columns = s.astype(str) + '_' + df1.columns
print(df1)
12_a_1 12_b_1 14_a_2 14_b_2
0 1 2 3 4
1 5 6 7 8
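A third sketch, assuming each value of c labels exactly two consecutive columns, repeats c and prepends it:
# repeat each value of c twice, then prepend it to the matching column name
prefixes = df2['c'].astype(str).repeat(2).to_numpy()
df1.columns = [p + '_' + c for p, c in zip(prefixes, df1.columns)]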

speeding up dataframe processing by concomitant filtering on pandas joins

I have two dataframes; one is pretty big, the other is really huge.
df1: "classid"(text), "c1" (numeric), "c2"(numeric)
df2: "classid"(text), "c3" (numeric), "c4"(numeric)
I want to filter df2 based on values on df1. In pseudocode one would formulate it like this:
df2[(df2.classid == df1.classid) & (df2.c3 < df1.c1) & (df2.c4 < df1.c2)]
Right now I do this by iterating over the rows of df1 and making some 40k filter calls on df2, which is a 3-million-row table. Obviously this is too slow.
# slow: one filter call on df2 per row of df1
parts = []
for row in df1.itertuples():
    dft = df2[(df2.classid == row.classid) & (df2.c3 < row.c1) & (df2.c4 < row.c2)]
    parts.append(dft)
df = pd.concat(parts)
I guess the best option is to make an inner join and then apply the (df2.c3 < df1.c1) & (df2.c4 < df1.c2) filtering, but the problem is that the inner join would create a huge table, since classid values are neither indexes nor unique row identifiers. If the filtering could be applied concomitantly, that might just work. Any ideas?
Iterating should be a last resort; I'd merge the other dataframe's columns c1 and c2 onto df:
df = df.merge(df1, on='classid', how='left')
Then I would group by classid and filter the rows, as in the following example:
In [95]:
df = pd.DataFrame({'classid':[0,0,1,1,1,2,2], 'c1':np.arange(7), 'c2':np.arange(7), 'c3':3, 'c4':4})
df
Out[95]:
c1 c2 c3 c4 classid
0 0 0 3 4 0
1 1 1 3 4 0
2 2 2 3 4 1
3 3 3 3 4 1
4 4 4 3 4 1
5 5 5 3 4 2
6 6 6 3 4 2
In [100]:
df.groupby('classid').filter(lambda x: len( x[x['c3'] < x['c1']] & x[x['c4'] < x['c2']] ) > 0)
Out[100]:
c1 c2 c3 c4 classid
2 2 2 3 4 1
3 3 3 3 4 1
4 4 4 3 4 1
5 5 5 3 4 2
6 6 6 3 4 2
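Putting the merge-then-filter idea from the question into a minimal sketch (column names as described above); this stays fully vectorized at the cost of materializing the merged table:
# merge the bounds from df1 onto df2, then apply both conditions in one vectorized mask
merged = df2.merge(df1, on='classid', how='inner')
result = merged[(merged['c3'] < merged['c1']) & (merged['c4'] < merged['c2'])]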
