I want to write Python code where I have, for example, 2 different DataFrames (the number of DataFrames can be more than 2) as follows:
df1 =
Index Name Age Height
0 Tom 20 166
1 Bill 27 170
2 Jacob 39 180
3 Vivian 26 155
df2 =
Index Name Age Height
0 Mary 20 166
1 Tom 27 170
2 Bill 39 180
3 Jack 26 155
I want to check the names in both DataFrames and, if they match, add the corresponding entries in the other columns, so that the final result looks like a third DataFrame:
result =
Index Name Age Height
0 Tom 47 336
1 Bill 66 350
2 Jacob 39 180
3 Vivian 26 155
4 Mary 20 166
5 Jack 26 155
Tom and Bill have entries in both DataFrames, so their Age and Height get added; the others have a single entry, so the original numbers are displayed. Thank you in advance.
You can concatenate both DataFrames using pd.concat, then group by name using DataFrame.groupby and sum:
# Assuming `Index` is not a column. If it is a column,
# set it as the index first using `df.set_index("Index")`.
out = (
    pd.concat([df1, df2], ignore_index=True)
      .groupby("Name", as_index=False, sort=False)
      .sum()
)
out
# Name Age Height
# 0 Tom 47 336
# 1 Bill 66 350
# 2 Jacob 39 180
# 3 Vivian 26 155
# 4 Mary 20 166
# 5 Jack 26 155
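An alternative sketch that avoids groupby, under the same assumption that Name is a regular column: set Name as the index and use DataFrame.add with fill_value=0. With functools.reduce this scales to any number of frames, though the row order comes out alphabetical rather than in order of first appearance:

```python
from functools import reduce
import pandas as pd

df1 = pd.DataFrame({"Name": ["Tom", "Bill", "Jacob", "Vivian"],
                    "Age": [20, 27, 39, 26],
                    "Height": [166, 170, 180, 155]})
df2 = pd.DataFrame({"Name": ["Mary", "Tom", "Bill", "Jack"],
                    "Age": [20, 27, 39, 26],
                    "Height": [166, 170, 180, 155]})

# Align the frames on Name and add element-wise; fill_value=0
# keeps names that appear in only one frame.
frames = [df1, df2]
out = reduce(lambda a, b: a.add(b, fill_value=0),
             (f.set_index("Name") for f in frames)).reset_index()
```

Note that fill_value alignment can promote the numeric columns to float; cast back with astype(int) if that matters.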
I want to group a DataFrame by one column, then apply a cumsum over the other, ordered by the first column descending.
df1:
id PRICE DEMAND
0 120 10
1 232 2
2 120 3
3 232 8
4 323 5
5 323 6
6 323 2
df2:
id PRICE DEMAND
0 323 13
1 232 23
2 120 36
I do it in two instructions, but I feel it can be done with only one:
data = data.groupby('PRICE',as_index=False).agg({'DEMAND': 'sum'}).sort_values(by='PRICE', ascending=False)
data['DEMAND'] = data['DEMAND'].cumsum()
What you have seems perfectly fine to me. But if you want to chain everything together: first sort, then groupby with sort=False so it doesn't change the order. Then you can sum within each group and cumsum the resulting Series:
(df.sort_values('PRICE', ascending=False)
.groupby('PRICE', sort=False)['DEMAND'].sum()
.cumsum()
.reset_index())
PRICE DEMAND
0 323 13
1 232 23
2 120 36
Another option would be to sort then cumsum and then drop_duplicates:
(df.sort_values('PRICE', ascending=False)
.set_index('PRICE')
.DEMAND.cumsum()
.reset_index()
.drop_duplicates('PRICE', keep='last'))
PRICE DEMAND
2 323 13
4 232 23
6 120 36
I have seen a lot of SUMIFS questions answered here, but mine is very different from those.
The 1st (Trade) data frame contains transaction and C_ID:
transaction C_ID
1 101
2 103
3 104
4 101
5 102
6 104
The 2nd (Customer) data frame contains C_ID, On/Off, and Amount:
C_ID On/Off Amount
102 On 320
101 On 400
101 On 200
103 On 60
104 Off 80
104 On 100
So I want to calculate the Amount based on the C_ID, with a condition on the 'On/Off' column in the Customer data frame. The resulting Trade data frame should be:
transaction C_ID Amount
1 101 600
2 103 60
3 104 100
4 101 600
5 102 320
6 104 100
Here is the Excel formula for how Amount is calculated:
=SUMIFS(Customer.Amount, Customer.C_ID = Trade.C_ID, Customer.On/Off = On)
I want to replicate this particular formula in Python using pandas.
You can use groupby() on the filtered data to compute the sum, then map to assign the new column to the transaction data:
s = df2[df2['On/Off']=='On'].groupby('C_ID')['Amount'].sum()
df1['Amount'] = df1['C_ID'].map(s)
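As a self-contained sketch with the data from the question (column names as shown there); any C_ID with no 'On' rows would simply map to NaN:

```python
import pandas as pd

df1 = pd.DataFrame({"transaction": [1, 2, 3, 4, 5, 6],
                    "C_ID": [101, 103, 104, 101, 102, 104]})
df2 = pd.DataFrame({"C_ID": [102, 101, 101, 103, 104, 104],
                    "On/Off": ["On", "On", "On", "On", "Off", "On"],
                    "Amount": [320, 400, 200, 60, 80, 100]})

# Sum Amount per C_ID over the 'On' rows only, then map the
# per-customer totals onto each transaction.
s = df2[df2["On/Off"] == "On"].groupby("C_ID")["Amount"].sum()
df1["Amount"] = df1["C_ID"].map(s)
```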
Another option: filter, then groupby, then reindex and assign:
df1['Amount'] = (df2.loc[df2['On/Off'] == 'On']
                    .groupby('C_ID')['Amount'].sum()
                    .reindex(df1.C_ID)
                    .tolist())
df1
transaction C_ID Amount
0 1 101 600
1 2 103 60
2 3 104 100
3 4 101 600
4 5 102 320
5 6 104 100
I have a DataFrame that I want to group using multiple columns and then add a calculated (rolling mean) column based on the grouping. Can someone give me a hand?
I have tried the grouping and it works fine, but adding the calculated rolling-mean column is proving to be a hassle.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': list(range(1, 17)),
    'Station': list('AAAAAAAABBBBBBBB'),
    'Train': ['RED', 'BLUE', 'GREEN', 'YELLOW'] * 4,
    'month_code': (['1'] * 4 + ['2'] * 4) * 2,
    'total': [100, 112, 99, 120, 105, 114, 100, 150,
              200, 134, 167, 150, 134, 189, 172, 179],
})
df2 = df.groupby(['Station','Train','month_code','total']).size().reset_index().groupby(['Station','Train','month_code'])['total'].max()
Looking at getting an outcome similar to this below
Station Train month_code total average
A BLUE 1 112
2 114 113
GREEN 1 99 106.5
2 100 99.5
RED 1 100 100
2 105 102.5
YELLOW 1 120 112.5
2 150 135
B BLUE 1 134 142
2 189 161.5
GREEN 1 167 178
2 172 169.5
RED 1 200 186
2 134 167
YELLOW 1 150 142
2 179 164.5
How about changing your initial groupby so it keeps the column name 'total':
df3 = df.groupby(['Station','Train','month_code']).sum()
>>> df3.head()
id total
Station Train month_code
A BLUE 1 2 112
2 6 114
GREEN 1 3 99
2 7 100
RED 1 1 100
Then do a rolling mean on the total column.
df3['average'] = df3['total'].rolling(2).mean()
>>> df3.head()
id total average
Station Train month_code
A BLUE 1 2 112 NaN
2 6 114 113.0
GREEN 1 3 99 106.5
2 7 100 99.5
RED 1 1 100 100.0
You can then still remove the id column if you don't want it.
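Putting it together as a runnable sketch (reconstructing the DataFrame from the question, and dropping id up front). Note that rolling(2) runs straight down the sorted total column, crossing group boundaries, which is exactly what the expected output shows:

```python
import pandas as pd

df = pd.DataFrame({
    'Station': list('AAAAAAAABBBBBBBB'),
    'Train': ['RED', 'BLUE', 'GREEN', 'YELLOW'] * 4,
    'month_code': (['1'] * 4 + ['2'] * 4) * 2,
    'total': [100, 112, 99, 120, 105, 114, 100, 150,
              200, 134, 167, 150, 134, 189, 172, 179],
})

# Sum per (Station, Train, month_code) group, then take a rolling
# mean of window 2 over the resulting sorted 'total' column.
df3 = df.groupby(['Station', 'Train', 'month_code'])['total'].sum().to_frame()
df3['average'] = df3['total'].rolling(2).mean()
```

The first row's average is NaN, since a window of 2 has no earlier value to pair with.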
I have the below pivot table which I created from a dataframe using the following code:
table = pd.pivot_table(df, values='count', index=['days'], columns=['movements'], aggfunc=np.sum)
movements 0 1 2 3 4 5 6 7
days
0 2777 51 2
1 6279 200 7 3
2 5609 110 32 4
3 4109 118 101 8 3
4 3034 129 109 6 2 2
5 2288 131 131 9 2 1
6 1918 139 109 13 1 1
7 1442 109 153 13 10 1
8 1085 76 111 13 7 1
9 845 81 86 8 8
10 646 70 83 1 2 1 1
As you can see from the pivot table, it has 8 columns, 0-7, and now I want to plot some specific columns instead of all of them. I could not manage to select the columns. Let's say I want to plot column 0 and column 2 against the index. What should I use for y to select column 0 and column 2?
plt.plot(x=table.index, y=??)
I tried y = table.values['0', '2'] and y = table['0', '2'], but nothing works.
You cannot pass an ndarray for y like that. If you need those two columns' values in a single plot, you can use:
plt.plot(table['0'])
plt.plot(table['2'])
If the column names are integers, then:
plt.plot(table[0])
plt.plot(table[2])
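Another option, sketched here with a small stand-in for table: select both columns at once and let DataFrame.plot draw them on one Axes (the Agg backend is only so the sketch runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import pandas as pd

# Stand-in for the pivot table: integer column labels, 'days' index.
table = pd.DataFrame({0: [2777, 6279, 5609], 2: [2, 7, 32]},
                     index=pd.Index([0, 1, 2], name="days"))

# Selecting a list of columns returns a DataFrame; .plot() draws
# one line per column against the index, with a legend per column.
ax = table[[0, 2]].plot()
```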