I have 2 sample data frames:
df1 =
a_1 b_1 a_2 b_2
1 2 3 4
5 6 7 8
and
df2 =
c
12
14
I want to prepend the values of c to the column names, in order:
df3 =
12_a_1 12_b_1 14_a_2 14_b_2
1 2 3 4
5 6 7 8
One option is a list comprehension:
import itertools
# use itertools to repeat values of df2
prefix = list(itertools.chain.from_iterable(itertools.repeat(str(x), 2) for x in df2['c'].values))
# list comprehension to create new column names
df1.columns = [p+'_'+c for c,p in zip(df1.columns, prefix)]
print(df1)
12_a_1 12_b_1 14_a_2 14_b_2
0 1 2 3 4
1 5 6 7 8
Another option: use str.split and map to look up each column's c value by its numeric suffix:
s = (df1.columns.str.split('_').str[-1].astype(int) - 1).map(df2.c)
df1.columns = s.astype(str) + '_' + df1.columns
print(df1)
12_a_1 12_b_1 14_a_2 14_b_2
0 1 2 3 4
1 5 6 7 8
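For completeness, here is a minimal, self-contained sketch of the same idea using Series.repeat instead of itertools (frame contents and names taken from the example above):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                   columns=['a_1', 'b_1', 'a_2', 'b_2'])
df2 = pd.DataFrame({'c': [12, 14]})

# repeat each value of c twice (one per column pair), then prepend
prefix = df2['c'].repeat(2).astype(str).tolist()
df1.columns = [f"{p}_{c}" for p, c in zip(prefix, df1.columns)]
print(df1.columns.tolist())  # ['12_a_1', '12_b_1', '14_a_2', '14_b_2']
```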
I would like to ask how to join (or merge) an arbitrary number of dataframes whose columns may have the same name. I know this has been asked several times, but I could not find a clear answer in any of the questions I have looked at.
import numpy as np
import pandas as pd
np.random.seed(1)
n_cols = 3
col_names = ["Ci"] + ["C"+str(i) for i in range(n_cols)]
def get_random_df():
    values = np.random.randint(0, 10, size=(4, n_cols))
    index = np.arange(4).reshape([4, -1])
    return pd.DataFrame(np.concatenate([index, values], axis=1), columns=col_names).set_index("Ci")

dfs = []
for i in range(3):
    dfs.append(get_random_df())

print(dfs[0])
print(dfs[1])
with output:
C0 C1 C2
Ci
0 5 8 9
1 5 0 0
2 1 7 6
3 9 2 4
C0 C1 C2
Ci
0 5 2 4
1 2 4 7
2 7 9 1
3 7 0 6
If I try and join two dataframes per iteration:
# concatenate two frames per iteration
df = dfs[0]
for df_ in dfs[1:]:
    df = df.join(df_, how="outer", rsuffix="_r")
print("** 1 **")
print(df)
the final dataframe has columns with the same name: for example, C0_r is repeated for each joined dataframe.
** 1 **
C0 C1 C2 C0_r C1_r C2_r C0_r C1_r C2_r
Ci
0 5 8 9 5 2 4 9 9 7
1 5 0 0 2 4 7 6 9 1
2 1 7 6 7 9 1 0 1 8
3 9 2 4 7 0 6 8 3 9
This could easily be solved by providing a different suffix per iteration. However, the documentation for join says: "Efficiently join multiple DataFrame objects by index at once by passing a list." If I try what follows:
# concatenate all at once
df = dfs[0].join(dfs[1:], how="outer")
# fails
# concatenate all at once
df = dfs[0].join(dfs[1:], how="outer", rsuffix="_r")
# fails
Both attempts fail due to duplicate columns:
Indexes have overlapping values: Index(['C0', 'C1', 'C2'], dtype='object')
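For reference, the per-iteration-suffix workaround mentioned above can be sketched like this (a toy version with single-column frames; the loop counter makes each suffix unique):

```python
import pandas as pd

# three toy frames that all share the column name 'C0'
dfs = [pd.DataFrame({'C0': [i, i + 1]}, index=[0, 1]) for i in range(3)]

df = dfs[0]
for i, d in enumerate(dfs[1:], start=1):
    df = df.join(d, how="outer", rsuffix=f"_r{i}")  # unique suffix per iteration

print(df.columns.tolist())  # ['C0', 'C0_r1', 'C0_r2']
```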
Question: is there a way to automatically join multiple dataframes without explicitly providing a different suffix every time?
Instead of join, concatenate along columns
# concatenate along columns
# use keys to differentiate different dfs
res = pd.concat(dfs, keys=range(len(dfs)), axis=1)
# flatten column names
res.columns = [f"{j}_{i}" for i,j in res.columns]
res
Wouldn't it be more readable to display your data like this? Just add this line of code at the end:
pd.concat(dfs, axis=1, keys=[f'DF{i+1}' for i in range(len(dfs))])
#output
DF1 DF2 DF3
C0 C1 C2 C0 C1 C2 C0 C1 C2
Ci
0 5 8 9 5 2 4 9 9 7
1 5 0 0 2 4 7 6 9 1
2 1 7 6 7 9 1 0 1 8
3 9 2 4 7 0 6 8 3 9
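The concat-with-keys approach boils down to a few lines; a self-contained sketch with two small frames that share column names:

```python
import pandas as pd

a = pd.DataFrame({'C0': [1, 2], 'C1': [3, 4]})
b = pd.DataFrame({'C0': [5, 6], 'C1': [7, 8]})

# keys label each source frame, producing a (key, column) MultiIndex
res = pd.concat([a, b], axis=1, keys=['DF1', 'DF2'])
print(res.columns.tolist())  # [('DF1', 'C0'), ('DF1', 'C1'), ('DF2', 'C0'), ('DF2', 'C1')]

# flatten to unique single-level names
res.columns = [f"{k}_{c}" for k, c in res.columns]
print(res.columns.tolist())  # ['DF1_C0', 'DF1_C1', 'DF2_C0', 'DF2_C1']
```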
Given a df
a b ngroup
0 1 3 0
1 1 4 0
2 1 1 0
3 3 7 2
4 4 4 2
5 1 1 4
6 2 2 4
7 1 1 4
8 6 6 5
I would like to compute the sum of multiple columns (i.e., a and b) grouped by the column ngroup.
In addition, I would like to count the number of elements in each group.
Based on these two conditions, the expected output is as below:
a b nrow_same_group ngroup
3 8 3 0
7 11 2 2
4 4 3 4
6 6 1 5
The following code does the job:
import pandas as pd
df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6],
                           [3,4,1,7,4,1,2,1,6],
                           [0,0,0,2,2,4,4,4,5])), columns=['a','b','ngroup'])
grouped_df = df.groupby(['ngroup'])
df1 = grouped_df[['a','b']].agg('sum').reset_index()
df2 = df['ngroup'].value_counts().reset_index()
df2.sort_values('index', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={'index':'ngroup','ngroup':'nrow_same_group'},inplace=True)
df= pd.merge(df1, df2, on=['ngroup'])
However, I wonder whether pandas has a built-in way to achieve this in a single line.
You can do it using only groupby + agg.
import pandas as pd
df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6],
                           [3,4,1,7,4,1,2,1,6],
                           [0,0,0,2,2,4,4,4,5])), columns=['a','b','ngroup'])
res = (
    df.groupby('ngroup', as_index=False)
      .agg(a=('a', 'sum'), b=('b', 'sum'),
           nrow_same_group=('a', 'size'))
)
Here the parameters passed to agg are tuples whose first element is the column to aggregate and the second element is the aggregation function to apply to that column. The parameter names are the labels for the resulting columns.
Output:
>>> res
ngroup a b nrow_same_group
0 0 3 8 3
1 2 7 11 2
2 4 4 4 3
3 5 6 6 1
First aggregate a and b with sum, then calculate the size of each group and assign it to the nrow_same_group column:
g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size())
a b nrow_same_group
ngroup
0 3 8 3
2 7 11 2
4 4 4 3
5 6 6 1
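The tuple form of agg used above is shorthand for pd.NamedAgg; an equivalent, more explicit sketch on a tiny frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 3], 'b': [3, 4, 7], 'ngroup': [0, 0, 2]})

# pd.NamedAgg spells out the (column, aggfunc) pairs explicitly
res = df.groupby('ngroup', as_index=False).agg(
    a=pd.NamedAgg(column='a', aggfunc='sum'),
    b=pd.NamedAgg(column='b', aggfunc='sum'),
    nrow_same_group=pd.NamedAgg(column='a', aggfunc='size'),
)
print(res.to_dict('list'))
# {'ngroup': [0, 2], 'a': [2, 3], 'b': [7, 7], 'nrow_same_group': [2, 1]}
```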
Is there a way to pass columns to the arguments by index position, rather than by column name?
Every example I see uses column names for value_vars; I need to use the column index.
For instance, instead of:
df2 = pd.melt(df,value_vars=['asset1','asset2'])
Using something similar to:
df2 = pd.melt(df,value_vars=[0,1])
Select column names by indexing:
df = pd.DataFrame({
'asset1':list('acacac'),
'asset2':[4]*6,
'A':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4]
})
df2 = pd.melt(df,
              id_vars=df.columns[[0,1]],
              value_vars=df.columns[[2,3]],
              var_name='c_name',
              value_name='Value')
print (df2)
asset1 asset2 c_name Value
0 a 4 A 7
1 c 4 A 8
2 a 4 A 9
3 c 4 A 4
4 a 4 A 2
5 c 4 A 3
6 a 4 D 1
7 c 4 D 3
8 a 4 D 5
9 c 4 D 7
10 a 4 D 1
11 c 4 D 0
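Since df.columns supports ordinary indexing, slices work too; a small sketch assuming the same frame shape:

```python
import pandas as pd

df = pd.DataFrame({'asset1': ['a', 'c'], 'asset2': [4, 4],
                   'A': [7, 8], 'D': [1, 3]})

# first two columns as id_vars, the rest as value_vars
df2 = pd.melt(df, id_vars=df.columns[:2], value_vars=df.columns[2:])
print(df2.shape)  # (4, 4): 2 rows x 2 melted columns
```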
I am using pandas v0.25.3 and am inexperienced but learning.
I have a dataframe and would like to swap the contents of two columns leaving the columns labels and sequence intact.
df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [5, 6, 7, 8],
                   "C": [9, 10, 11, 12]})
This yields a dataframe,
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want to swap column contents B and C to get
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
I have tried looking at pd.DataFrame.values, which sent me to numpy arrays and advanced slicing, and I got lost.
What's the simplest way to do this?
You can assign numpy array:
# pandas 0.24+
df[['B','C']] = df[['C','B']].to_numpy()
# older pandas versions
df[['B','C']] = df[['C','B']].values
Or use DataFrame.assign:
df = df.assign(B = df.C, C = df.B)
print (df)
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
Or just use:
df['B'], df['C'] = df['C'], df['B'].copy()
print(df)
Output:
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
You can also swap the labels:
df.columns = ['A','C','B']
If your DataFrame is very large, I believe this would require less from your computer than copying all the data.
If the order of the columns is important, you can then reorder them:
df = df.reindex(['A','B','C'], axis=1)
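Putting the label swap and the reorder together (a minimal sketch; data from the example above, truncated to two rows):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [5, 6], 'C': [9, 10]})

df.columns = ['A', 'C', 'B']               # relabel: no data is copied
df = df.reindex(['A', 'B', 'C'], axis=1)   # restore the original column order

print(df['B'].tolist(), df['C'].tolist())  # [9, 10] [5, 6]
```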
I have a table
I want to sum the values of the columns belonging to the same class h.*. So, my final table will look like this:
Is it possible to aggregate by string column name?
Thank you for any suggestions!
Use a lambda function that selects the first 3 characters of each column name, with the parameter axis=1 (or index the column names the same way), and aggregate with sum:
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
Or:
df1 = df.set_index('object')
df2 = df1.groupby(df1.columns.str[:3], axis=1).sum().reset_index()
Sample:
np.random.seed(123)
cols = ['object', 'h.1.1','h.1.2','h.1.3','h.1.4','h.1.5',
'h.2.1','h.2.2','h.2.3','h.2.4','h.3.1','h.3.2','h.3.3']
df = pd.DataFrame(np.random.randint(10, size=(4, 13)), columns=cols)
print (df)
object h.1.1 h.1.2 h.1.3 h.1.4 h.1.5 h.2.1 h.2.2 h.2.3 h.2.4 \
0 2 2 6 1 3 9 6 1 0 1
1 9 3 4 0 0 4 1 7 3 2
2 4 8 0 7 9 3 4 6 1 5
3 8 3 5 0 2 6 2 4 4 6
h.3.1 h.3.2 h.3.3
0 9 0 0
1 4 7 2
2 6 2 1
3 3 0 6
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
print (df2)
object h.1 h.2 h.3
0 2 21 8 9
1 9 11 13 13
2 4 27 16 9
3 8 16 16 9
The solution above works great, but is vulnerable if the h.X part goes beyond single digits. I'd recommend the following:
Sample Data:
cols = ['h.%d.%d' %(i, j) for i in range(1, 11) for j in range(1, 11)]
df = pd.DataFrame(np.random.randint(10, size=(4, len(cols))), columns=cols, index=['p_%d'%p for p in range(4)])
Proposed Solution:
new_df = df.groupby(df.columns.str.split('.').str[1], axis=1).sum()
new_df.columns = 'h.' + new_df.columns # the columns are originally numbered 1, 2, 3; this brings them back to h.1, h.2, h.3
Alternative Solution:
Going through multiindices might be more convoluted, but may be useful while manipulating this data elsewhere.
df.columns = df.columns.str.split('.', expand=True) # Transform into a multiindex
new_df = df.sum(axis = 1, level=[0,1])
new_df.columns = new_df.columns.get_level_values(0) + '.' + new_df.columns.get_level_values(1) # Rename columns
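Note that newer pandas versions deprecate groupby(..., axis=1) (and the level argument to sum was removed); one alternative, sketched on a toy frame, is to group the transpose by the same key:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['h.1.1', 'h.1.2', 'h.2.1', 'h.2.2'])

# group the transposed frame by the middle token of each column name
key = df.columns.str.split('.').str[1]
out = df.T.groupby(key.values).sum().T
out.columns = 'h.' + out.columns

print(out.to_dict('list'))  # {'h.1': [3], 'h.2': [7]}
```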