Pandas - Merge multiple columns and sum

I have a main df like so:
index A B C
5 1 5 8
6 2 4 1
7 8 3 4
8 3 9 5
and an auxiliary df2 that I want to add to the main df like so:
index A B
5 4 2
6 4 3
7 7 1
8 6 2
Columns A and B have the same names in both frames; however, the main df contains many columns that the secondary df2 does not. I want to sum the columns that are common and leave the others as they are.
Output:
index A B C
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
I have tried variations of df.join, pd.merge and groupby, but I am having no luck at the moment.
Last attempt:
df.groupby('index').sum().add(df2.groupby('index').sum())
But this does not keep common columns.
With pd.merge I get the suffixes _x and _y.

Use add only on the columns the two frames share, found with intersection:
c = df.columns.intersection(df2.columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C
index
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
If you use add on the whole frames instead, integer columns that exist in only one frame are converted to floats:
df = df.add(df2, fill_value=0)
print (df)
A B C
index
5 5 7 8.0
6 6 7 1.0
7 15 4 4.0
8 9 11 5.0
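For a self-contained check of the dtype difference described above, the sketch below rebuilds the question's two frames and runs both approaches. The constructor calls and the names `common`, `summed` and `summed_all` are assumptions made for illustration; only the values come from the question.

```python
import pandas as pd

# Rebuild the example frames (assumed constructors, values from the question).
idx = pd.Index([5, 6, 7, 8], name="index")
df = pd.DataFrame({"A": [1, 2, 8, 3], "B": [5, 4, 3, 9], "C": [8, 1, 4, 5]}, index=idx)
df2 = pd.DataFrame({"A": [4, 4, 7, 6], "B": [2, 3, 1, 2]}, index=idx)

# Approach 1: add only the shared columns and leave the rest untouched.
common = df.columns.intersection(df2.columns)
summed = df.copy()
summed[common] = df[common].add(df2[common], fill_value=0)

# Approach 2: add the whole frames; column C has no partner in df2.
summed_all = df.add(df2, fill_value=0)

print(summed.dtypes)      # C keeps its integer dtype
print(summed_all.dtypes)  # C is upcast to float
```

Working on a copy keeps the original df intact, which makes the two results easy to compare side by side.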
EDIT:
If the common columns may include string columns:
print (df)
A B C D
index
5 1 5 8 a
6 2 4 1 e
7 8 3 4 r
8 3 9 5 w
print (df2)
A B C D
index
5 1 5 8 a
6 2 4 1 e
7 8 3 4 r
8 3 9 5 w
The solution is similar; first restrict both frames to their numeric columns with select_dtypes:
c = df.select_dtypes(np.number).columns.intersection(df2.select_dtypes(np.number).columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C D
index
5 5 7 8 a
6 6 7 1 e
7 15 4 4 r
8 9 11 5 w

Not the cleanest way but it might work.
df_new = pd.DataFrame()
df_new['A'] = df['A'] + df2['A']
df_new['B'] = df['B'] + df2['B']
df_new['C'] = df['C']
print(df_new)
A B C
0 5 7 8
1 6 7 1
2 15 4 4
3 9 11 5

Related

How to calculate totals of all possible combinations of columns

I have the following df:
df = pd.DataFrame({'a': [1,2,3,4,2], 'b': [3,4,1,0,4], 'c':[1,2,3,1,0], 'd':[3,2,4,1,4]})
I want to generate a combination of totals from these 4 columns, which equals 4 x 3 x 2 = 24 total combinations minus duplicates. I want the results in the same df.
I want something that looks like this (partial results shown):
A combo of a_b is the same as b_a, so I wouldn't want that calculation since it's a duplicate.
Is there a way to calculate all combinations and exclude duplicate totals?

import itertools as it
orig_cols = df.columns
for r in range(2, df.shape[1] + 1):
    for cols in it.combinations(orig_cols, r):
        df["_".join(cols)] = df.loc[:, cols].sum(axis=1)
This needs some looping, though not over the DataFrame itself but over the combinations. We take the 2-, 3-, ..., N-column combinations of the column names, where N is the number of columns, and store each sum in a new column whose name joins the source names with _.
In [11]: df
Out[11]:
a b c d a_b a_c a_d b_c b_d c_d a_b_c a_b_d a_c_d b_c_d a_b_c_d
0 1 3 1 3 4 2 4 4 6 4 5 7 5 7 8
1 2 4 2 2 6 4 4 6 6 4 8 8 6 8 10
2 3 1 3 4 4 6 7 4 5 7 7 8 10 8 11
3 4 0 1 1 4 5 5 1 1 2 5 5 6 2 6
4 2 4 0 4 6 2 6 4 8 4 6 10 6 8 10
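As a variation on the loop above, the sketch below collects all combination totals in a dict and attaches them with a single pd.concat instead of inserting columns one at a time. The names `totals` and `result` are illustrative, and this is one possible arrangement rather than the answer's own code.

```python
import itertools as it
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 2], 'b': [3, 4, 1, 0, 4],
                   'c': [1, 2, 3, 1, 0], 'd': [3, 2, 4, 1, 4]})

# it.combinations yields each unordered group of columns once,
# so a_b appears but b_a does not.
totals = {
    "_".join(cols): df[list(cols)].sum(axis=1)
    for r in range(2, df.shape[1] + 1)
    for cols in it.combinations(df.columns, r)
}

# Attach all new columns at once; the original df is left unchanged.
result = pd.concat([df, pd.DataFrame(totals)], axis=1)
print(result.shape)
```

With 4 source columns this yields 6 pairs, 4 triples and 1 four-way total, so `result` has 15 columns.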

Pandas - replicate rows with new column value from a list for each replication

So I have a data frame that has two columns, State and Cost, and a separate list of new "what-if" costs
State Cost
A 2
B 9
C 8
D 4
New_Cost_List = [1, 5, 10]
I'd like to replicate all the rows in my data set for each value of New_Cost, adding a new column for each New_Cost for each state.
State Cost New_Cost
A 2 1
B 9 1
C 8 1
D 4 1
A 2 5
B 9 5
C 8 5
D 4 5
A 2 10
B 9 10
C 8 10
D 4 10
I thought a for loop might be appropriate to iterate through, replicating my dataset for the length of the list and adding the values of the list as a new column:
for v in New_Cost_List:
    df_new = pd.DataFrame(np.repeat(df.values, len(New_Cost_List), axis=0))
    df_new.columns = df.columns
    df_new['New_Cost'] = v
The output of this gives me the correct replication of State and Cost but the New_Cost value is 10 for each row. Clearly I'm not connecting how to get it to run through the list for each replicated set, so any suggestions? Or is there a better way to approach this?
EDIT 1
Reducing the number of values in the New_Cost_List from 4 to 3 so there's a difference in row count and length of the list.

Here is a way using the keys parameter of pd.concat():
(pd.concat([df]*len(New_Cost_List),
keys = New_Cost_List,
names = ['New_Cost',None])
.reset_index(level=0))
Output:
New_Cost State Cost
0 1 A 2
1 1 B 9
2 1 C 8
3 1 D 4
0 5 A 2
1 5 B 9
2 5 C 8
3 5 D 4
0 10 A 2
1 10 B 9
2 10 C 8
3 10 D 4

If I understand your question correctly, this should solve your problem.
df['New Cost'] = new_cost_list
df = pd.concat([df]*len(new_cost_list), ignore_index=True)
Output:
State Cost New Cost
0 A 2 1
1 B 9 5
2 C 8 10
3 D 4 15
4 A 2 1
5 B 9 5
6 C 8 10
7 D 4 15
8 A 2 1
9 B 9 5
10 C 8 10
11 D 4 15
12 A 2 1
13 B 9 5
14 C 8 10
15 D 4 15

You can use index.repeat and numpy.tile:
df2 = (df
    .loc[df.index.repeat(len(New_Cost_List))]
    .assign(**{'New_Cost': np.tile(New_Cost_List, len(df))})
)
or, simply, with a cross merge:
df2 = df.merge(pd.Series(New_Cost_List, name='New_Cost'), how='cross')
output:
State Cost New_Cost
0 A 2 1
0 A 2 5
0 A 2 10
1 B 9 1
1 B 9 5
1 B 9 10
2 C 8 1
2 C 8 5
2 C 8 10
3 D 4 1
3 D 4 5
3 D 4 10
For the provided order:
(df
.merge(pd.Series(New_Cost_List, name='New_Cost'), how='cross')
.sort_values(by='New_Cost', kind='stable')
.reset_index(drop=True)
)
output:
State Cost New_Cost
0 A 2 1
1 B 9 1
2 C 8 1
3 D 4 1
4 A 2 5
5 B 9 5
6 C 8 5
7 D 4 5
8 A 2 10
9 B 9 10
10 C 8 10
11 D 4 10
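The loop in the question rebuilds df_new on every pass, so only the last value (10) survives. A minimal sketch of a loop-style fix, assuming the frame is built as shown here (the constructor and the name `replicated` are illustrative), tags one copy of df per cost and stacks the copies:

```python
import pandas as pd

df = pd.DataFrame({"State": list("ABCD"), "Cost": [2, 9, 8, 4]})
New_Cost_List = [1, 5, 10]

# One copy of df per new cost, each tagged with its own value,
# stacked in list order with a fresh 0..n-1 index.
replicated = pd.concat(
    [df.assign(New_Cost=cost) for cost in New_Cost_List],
    ignore_index=True,
)
print(replicated)
```

Because the copies are concatenated in list order, the rows come out grouped by New_Cost without a separate sort step.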

how to join two dataframe by picking couple of column from each if one of the column has same data

There are two dataframes, df_one and df_two. I want to create a new dataframe from selected columns of each of them.
df_one
e b c d
1 2 3 4
5 6 7 8
6 2 4 8
9 2 5 6
and
df_two
e f g h
1 8 7 6
5 6 6 4
6 6 2 4
9 5 3 2
I want to create a new dataframe new_df
e b g h d
1 6 7 6 4
5 2 6 4 8
6 2 2 4 8
9 2 3 2 6

result = pd.merge(df_one, df_two, on='e')
result=result.loc[:,["e","b","g","h","d"]]

Use:
pd.merge(df_one[["e", "b", "d"]], df_two[["e", "g", "h"]], on="e")
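Merging the two column subsets returns the columns in the order e, b, d, g, h, so a final reindexing step is needed to get the layout asked for. The sketch below assumes the frames are constructed from the values printed in the question:

```python
import pandas as pd

# Frames rebuilt from the question's printed values (assumed constructors).
df_one = pd.DataFrame({"e": [1, 5, 6, 9], "b": [2, 6, 2, 2],
                       "c": [3, 7, 4, 5], "d": [4, 8, 8, 6]})
df_two = pd.DataFrame({"e": [1, 5, 6, 9], "f": [8, 6, 6, 5],
                       "g": [7, 6, 2, 3], "h": [6, 4, 4, 2]})

# Keep only the wanted columns from each side, join on the shared key e,
# then put the columns in the requested order.
new_df = pd.merge(df_one[["e", "b", "d"]], df_two[["e", "g", "h"]], on="e")
new_df = new_df[["e", "b", "g", "h", "d"]]
print(new_df)
```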

How to groupby with nlargest and keep all columns?

I want to group a DataFrame and get the nlargest rows by column 'C' in each group.
However, what I get back is a Series, not a DataFrame.
dftest = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10],
'B':['A','B','A','B','A','B','A','B','B','B'],
'C':[0,0,1,1,2,2,3,3,4,4]})
dfn=dftest.groupby('B',group_keys=False)\
.apply(lambda grp:grp['C'].nlargest(int(grp['C'].count()*0.8))).sort_index()
The result is a Series:
2 1
4 2
5 2
6 3
7 3
8 4
9 4
Name: C, dtype: int64
I hope the result is DataFrame, like
A B C
2 3 A 1
4 5 A 2
5 6 B 2
6 7 A 3
7 8 B 3
8 9 B 4
9 10 B 4
UPDATE:
Sorry, column 'A' is not actually a run of consecutive integers; dftest looks more like this:
dftest = pd.DataFrame({'A':['Feb','Flow','Air','Flow','Feb','Beta','Cat','Feb','Beta','Air'],
'B':['A','B','A','B','A','B','A','B','B','B'],
'C':[0,0,1,1,2,2,3,3,4,4]})
and the result should be
A B C
2 Air A 1
4 Feb A 2
5 Beta B 2
6 Cat A 3
7 Feb B 3
8 Beta B 4
9 Air B 4

It may be a bit clumsy, but it does what you asked:
dfn = (dftest.groupby('B')
       .apply(lambda grp: grp['C'].nlargest(int(grp['C'].count()*0.8)))
       .reset_index()
       .rename(columns={'level_1': 'A'}))
dfn.A = dfn.A + 1
dfn = dfn[['A', 'B', 'C']].sort_values(by='A')

Thanks to my friends, the following code works for me.
dfn=dftest.groupby('B',group_keys=False)\
.apply(lambda grp:grp.nlargest(n=int(grp['C'].count()*0.8),columns='C').sort_index())
the dfn is
In [8]:dfn
Out[8]:
A B C
2 3 A 1
4 5 A 2
6 7 A 3
5 6 B 2
7 8 B 3
8 9 B 4
9 10 B 4
My previous code called nlargest on a Series (grp['C']), whereas this version calls it on the group DataFrame itself, so all columns are kept.
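An alternative that avoids apply on whole groups is to build a boolean mask from per-group ranks and filter the original frame with it. This is a sketch of a different technique, not the method from the answers above; `keep`, `rank` and `dfn` are illustrative names.

```python
import pandas as pd

dftest = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                       'B': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'B', 'B'],
                       'C': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]})

# Number of rows to keep in each group: 80% of the group size, rounded down.
keep = dftest.groupby('B')['C'].transform(lambda s: int(s.count() * 0.8))

# Rank C within each group, largest first; method='first' breaks ties
# so that every row gets a distinct rank.
rank = dftest.groupby('B')['C'].rank(method='first', ascending=False)

# Filtering the original frame keeps all columns and the original row order.
dfn = dftest[rank <= keep]
print(dfn)
```

Since the mask is applied to dftest directly, the result stays in original index order, matching the layout requested in the question.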

Adding duplicate rows to a DataFrame

I could not figure out how to solve the following problem.
Consider the following data set:
df = pd.DataFrame(data=np.array([['a',1, 2, 3], ['a',4, 5, 6],
['b',7, 8, 9], ['b',10, 11 , 12]]),
columns=['id','A', 'B', 'C'])
id A B C
a 1 2 3
a 4 5 6
b 7 8 9
b 10 11 12
I need to group the data by id and in each group duplicate the first row and add it to the dataset like the following data set:
id A B C A B C
a 1 2 3 1 2 3
a 4 5 6 1 2 3
b 7 8 9 7 8 9
b 10 11 12 7 8 9
I would really appreciate your help.
I tried the following steps, but I could not extend them to every group:
df1 = df.loc[0:0, 'A':'C']
df3 = pd.concat([df,df1],axis=1)

Use groupby + first, and then concatenate df with this result:
v = df.groupby('id').transform('first')
pd.concat([df, v], axis=1)
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
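The concatenated result above has two columns each named A, B and C, which makes later selection by name ambiguous. A small sketch that adds a suffix to the repeated block follows; the `_first` suffix and the direct integer constructor (instead of the question's np.array, which stores every value as a string) are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "a", "b", "b"],
                   "A": [1, 4, 7, 10],
                   "B": [2, 5, 8, 11],
                   "C": [3, 6, 9, 12]})

# First row of each id group, broadcast back to every row of that group.
first = df.groupby("id").transform("first").add_suffix("_first")

# Side-by-side concatenation; axis must be passed by keyword in recent pandas.
out = pd.concat([df, first], axis=1)
print(out)
```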

cumcount + where + ffill
v = df.groupby('id').cumcount() == 0
pd.concat([df, df.iloc[:, 1:].where(v).ffill()], axis=1)
Out[57]:
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9

One can also try drop_duplicates and merge.
df_unique = df.drop_duplicates("id")
df.merge(df_unique, on="id", how="left")
id A_x B_x C_x A_y B_y C_y
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
