I have 2 dataframes. For every row, I want to get the weighted sum of the previous 3 rows within each unique_id group, where each of those previous values is multiplied by the corresponding weight from the other dataframe.
for example:

dataframe A                         dataframe B
   unique_id  value  out_value         num_values
1          1     45                 0        0.15
2          1     33                 1        0.30
3          1     18                 2        0.18
4          1     26      20.70
5          2     66
6          2     44
7          2     22
8          2     19      28.38
expected output_value column:
4th row = 18*0.15 + 33*0.30 + 45*0.18 = 2.7 + 9.9 + 8.1 = 20.70
8th row = 22*0.15 + 44*0.30 + 66*0.18 = 3.3 + 13.2 + 11.88 = 28.38
Within each unique_id group, every row's value should be calculated from its previous 3 values.
For every row, the previous 3 rows will be available.
import pandas as pd
import numpy as np
df_a = pd.DataFrame({
    'uni_id': [1, 1, 1, 1, 2, 2, 2, 2, 152, 152, 152, 152, 152],
    'value': [45, 33, 18, 26, 66, 44, 22, 19, 36, 27, 45, 81, 90]
}, index=range(1, 14))

df_b = pd.DataFrame({
    'num_values': [0.15, 0.30, 0.18]
})
df_a
###
uni_id value
1 1 45
2 1 33
3 1 18
4 1 26
5 2 66
6 2 44
7 2 22
8 2 19
9 152 36
10 152 27
11 152 45
12 152 81
13 152 90
df_b
###
num_values
0 0.15
1 0.30
2 0.18
# main calculation
# for each row, collect the previous three values, most recent first
arr = [df_a['value'].shift(x+1).values[::-1][:3] for x in range(len(df_a['value']))[::-1]]
# weighted sum of every window; build the Series on df_a's index so it aligns
# with the original rows (a bare pd.Series gets a 0-based index and would
# shift every result by one row when masked or assigned)
arr_b = pd.Series(np.inner(arr, df_b['num_values']), index=df_a.index)
# filter and clean: keep only rows with at least 3 predecessors in their
# uni_id group, so the window never crosses a group boundary
mask = df_a.groupby('uni_id').cumcount() + 1 > 3
output = arr_b.where(mask)
# concat result to df_a
df_a['out_value'] = output
df_a
###
uni_id value out_value
1 1 45 NaN
2 1 33 NaN
3 1 18 NaN
4 1 26 20.70
5 2 66 NaN
6 2 44 NaN
7 2 22 NaN
8 2 19 28.38
9 152 36 NaN
10 152 27 NaN
11 152 45 NaN
12 152 81 21.33
13 152 90 30.51
If you want to keep only the non-null rows, filter with query:
df_a.query('out_value.notnull()')
###
uni_id value out_value
4 1 26 20.70
8 2 19 28.38
12 152 81 21.33
13 152 90 30.51
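As an aside, the same weighted window can be written more compactly with groupby plus rolling. A sketch, assuming (as in the worked example) that the first weight in df_b applies to the most recent prior value:

w = df_b['num_values'].to_numpy()   # w[0] weights the most recent prior value

df_a['out_value'] = df_a.groupby('uni_id')['value'].transform(
    lambda s: s.shift(1)            # look only at prior rows
               .rolling(3)          # window of the previous three values
               .apply(lambda win: win[::-1] @ w, raw=True)  # most recent first
)

Rows with fewer than three predecessors in their group come out as NaN automatically, so no explicit mask is needed.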
Grouping with the keys uni_id, Year_Month
Data preparation:
# create a date-range series (27 periods at 5-day frequency), converted to monthly periods
import pandas as pd
import numpy as np
rng = np.random.default_rng(42)
rng.integers(10, 100, 26)   # throwaway draw; df_a below uses the generator's second draw
date_range = pd.Series(pd.date_range(start='01.30.2020', periods=27, freq='5D')).dt.to_period('M')
df_a = pd.DataFrame({
    'uni_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
               152, 152, 152, 152, 152, 152, 152, 152, 152, 152],
    # date_range has 27 entries; index alignment keeps labels 1..26
    'Year_Month': date_range,
    'value': rng.integers(10, 100, 26)
}, index=range(1, 27))
df_b = pd.DataFrame({
    'num_values': [0.15, 0.30, 0.18]
})
df_a
###
uni_id Year_Month value
1 1 2020-02 46
2 1 2020-02 84
3 1 2020-02 59
4 1 2020-02 49
5 1 2020-02 50
6 1 2020-02 30
7 1 2020-03 18
8 1 2020-03 59
9 2 2020-03 89
10 2 2020-03 15
11 2 2020-03 87
12 2 2020-03 84
13 2 2020-04 34
14 2 2020-04 66
15 2 2020-04 24
16 2 2020-04 78
17 152 2020-04 73
18 152 2020-04 41
19 152 2020-05 16
20 152 2020-05 97
21 152 2020-05 50
22 152 2020-05 90
23 152 2020-05 71
24 152 2020-05 80
25 152 2020-06 78
26 152 2020-06 27
Processing
arr = [df_a['value'].shift(x+1).values[::-1][:3] for x in range(len(df_a['value']))[::-1]]
# as before, anchor the Series on df_a's index so results align with the right rows
arr_b = pd.Series(np.inner(arr, df_b['num_values']), index=df_a.index)
# filter and clean: require at least 3 prior rows within each (uni_id, Year_Month) group
mask = df_a.groupby(['uni_id', 'Year_Month']).cumcount() + 1 > 3
output = arr_b.where(mask)
# concat result to df_a
df_a['out_value'] = output
df_a
###
    uni_id Year_Month  value  out_value
1        1    2020-02     46        NaN
2        1    2020-02     84        NaN
3        1    2020-02     59        NaN
4        1    2020-02     49      42.33
5        1    2020-02     50      40.17
6        1    2020-02     30      32.82
7        1    2020-03     18        NaN
8        1    2020-03     59        NaN
9        2    2020-03     89        NaN
10       2    2020-03     15        NaN
11       2    2020-03     87        NaN
12       2    2020-03     84      33.57
13       2    2020-04     34        NaN
14       2    2020-04     66        NaN
15       2    2020-04     24        NaN
16       2    2020-04     78      29.52
17     152    2020-04     73        NaN
18     152    2020-04     41        NaN
19     152    2020-05     16        NaN
20     152    2020-05     97        NaN
21     152    2020-05     50        NaN
22     152    2020-05     90      39.48
23     152    2020-05     71      45.96
24     152    2020-05     80      46.65
25     152    2020-06     78        NaN
26     152    2020-06     27        NaN
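The rolling sketch from above carries over to the two-key grouping as well; because shift and rolling run inside each (uni_id, Year_Month) group, no explicit mask is needed (this reuses the weight array w from the earlier sketch):

df_a['out_value'] = df_a.groupby(['uni_id', 'Year_Month'])['value'].transform(
    lambda s: s.shift(1).rolling(3).apply(lambda win: win[::-1] @ w, raw=True)
)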
My data is related to "Cricket", a sport (like Baseball). Each innings has at most 20 overs, and each over has approximately 6 balls.
data:
season match_id inning sum_total_runs sum_total_wickets over/ball innings_score
32 2008 60 1 61 0 5.1 0
33 2008 60 1 61 1 5.2 0
34 2008 60 1 61 1 5.3 0
35 2008 60 1 61 1 5.4 0
36 2008 60 1 61 1 5.5 0
... ... ... ... ... ... ... ...
179073 2019 11415 2 152 5 19.2 0
179074 2019 11415 2 154 5 19.3 0
179075 2019 11415 2 155 6 19.4 0
179076 2019 11415 2 157 6 19.5 0
179077 2019 11415 2 157 7 19.6 0
111972 rows × 7 columns
innings_score is a new column created by me (given a default value of 0). I want to update it.
The values I want to enter into it are the results of the df.groupby below.
In[]:
df.groupby(['season', 'match_id', 'inning'])['sum_total_runs'].max()
Out[]:
season match_id inning
2008 60 1 222
2 82
61 1 240
2 207
62 1 129
...
2019 11413 2 170
11414 1 155
2 162
11415 1 152
2 157
Name: sum_total_runs, Length: 1276, dtype: int64
I want innings_score to be like:
season match_id inning sum_total_runs sum_total_wickets over/ball innings_score
32 2008 60 1 61 0 5.1 222
33 2008 60 1 61 1 5.2 222
34 2008 60 1 61 1 5.3 222
35 2008 60 1 61 1 5.4 222
36 2008 60 1 61 1 5.5 222
... ... ... ... ... ... ... ...
179073 2019 11415 2 152 5 19.2 157
179074 2019 11415 2 154 5 19.3 157
179075 2019 11415 2 155 6 19.4 157
179076 2019 11415 2 157 6 19.5 157
179077 2019 11415 2 157 7 19.6 157
111972 rows × 7 columns
I would use assign. Starting from a simple example:
import pandas as pd

dt = pd.DataFrame({
    "name1": ["A", "A", "B", "B", "C", "C"],
    "name2": ["C", "C", "C", "D", "D", "D"],
    "value": [1, 2, 3, 4, 5, 6],
})
grouping_variables = ["name1", "name2"]
dt = dt.set_index(grouping_variables)
dt = dt.assign(new_column=dt.groupby(grouping_variables)["value"].max())
As you can see, you set your grouping_variables as indices before running the assignment.
You can always reset the index at the end if you don't want to keep the grouping_variables as the dataframe's index:
dt.reset_index()
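Applied to the question's frame, the same pattern would look like this (a sketch reusing the column names from the question):

keys = ['season', 'match_id', 'inning']
df = df.set_index(keys)
df = df.assign(innings_score=df.groupby(keys)['sum_total_runs'].max()).reset_index()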
One way is to set those 3 columns as the index, assign the groupby result as a new column, and reset the index after that.
While those columns form the index, the groupby result and the dataframe share the same index, so pandas will automatically match and insert the correct values in the correct positions. Resetting the index then turns them back into normal columns.
Something like this:
In [46]: df
Out[46]:
season match_id inning sum_total_runs sum_total_wickets over/ball
0 2008 60 1 61 0 5.1
1 2008 60 1 61 1 5.2
2 2008 60 1 61 1 5.3
3 2008 60 1 61 1 5.4
4 2008 60 1 61 1 5.5
5 2019 11415 2 152 5 19.2
6 2019 11415 2 154 5 19.3
7 2019 11415 2 155 6 19.4
8 2019 11415 2 157 6 19.5
9 2019 11415 2 157 7 19.6
In [47]: df.set_index(['season', 'match_id', 'inning']).assign(innings_score=df.groupby(['season', 'match_id', 'inning'])['sum_total_runs'].max()).reset_index()
Out[47]:
season match_id inning sum_total_runs sum_total_wickets over/ball innings_score
0 2008 60 1 61 0 5.1 61
1 2008 60 1 61 1 5.2 61
2 2008 60 1 61 1 5.3 61
3 2008 60 1 61 1 5.4 61
4 2008 60 1 61 1 5.5 61
5 2019 11415 2 152 5 19.2 157
6 2019 11415 2 154 5 19.3 157
7 2019 11415 2 155 6 19.4 157
8 2019 11415 2 157 6 19.5 157
9 2019 11415 2 157 7 19.6 157
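For what it's worth, transform gives the same result in one line, broadcasting each group's maximum back onto the original rows without touching the index:

df['innings_score'] = df.groupby(['season', 'match_id', 'inning'])['sum_total_runs'].transform('max')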
I have a DataFrame like this:
Temp_in_C Temp_in_F Date Year Month Day
23 65 2011-12-12 2011 12 12
12 72 2011-12-12 2011 12 12
NaN 67 2011-12-12 2011 12 12
0 0 2011-12-12 2011 12 12
7 55 2011-12-13 2011 12 13
I am trying to get output in this format (the NaN and zero values of a particular day are replaced by the average temperature of that day only).
Output will be
Temp_in_C Temp_in_F Date Year Month Day
23 65 2011-12-12 2011 12 12
12 72 2011-12-12 2011 12 12
17.5 67 2011-12-12 2011 12 12
17.5 68 2011-12-12 2011 12 12
7 55 2011-12-13 2011 12 13
These values should be replaced by the mean of that particular day. I am trying to do this:
temp_df = csv_data_df[csv_data_df["Temp_in_C"] != 0]
temp_df["Temp_in_C"] = temp_df["Temp_in_C"].replace('*', np.nan)
x = temp_df["Temp_in_C"].mean()
csv_data_df["Temp_in_C"] = csv_data_df["Temp_in_C"].replace(0.0, x)
csv_data_df["Temp_in_C"] = csv_data_df["Temp_in_C"].fillna(x)
This code takes the mean of the whole column and uses it for every replacement.
How can I group by day, take the mean, and then replace values for that particular day only?
First, replace zeros with NaN
df = df.replace(0,np.nan)
Then fill the missing values using transform (see this post)
df.groupby('Date').transform(lambda x: x.fillna(x.mean()))
Gives:
Temp_in_C Temp_in_F Year Month Day
0 23.0 65.0 2011 12 12
1 12.0 72.0 2011 12 12
2 17.5 67.0 2011 12 12
3 17.5 68.0 2011 12 12
4 7.0 55.0 2011 12 13
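Note that transform consumes the grouping key, which is why Date is missing from the output above. A sketch for writing the filled values back into the original frame with Date intact:

cols = ['Temp_in_C', 'Temp_in_F']
df[cols] = df[cols].replace(0, np.nan)
# fill each NaN with the mean computed over its own day only
df[cols] = df.groupby('Date')[cols].transform(lambda x: x.fillna(x.mean()))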
I have a pandas dataframe and I need to work out the cumulative sum for each month.
Date Amount
2017/01/12 50
2017/01/12 30
2017/01/15 70
2017/01/23 80
2017/02/01 90
2017/02/01 10
2017/02/02 10
2017/02/03 10
2017/02/03 20
2017/02/04 60
2017/02/04 90
2017/02/04 100
The cumulative sum is the running total across the days of each month, i.e. 01-31. However, some days are missing. The data frame should look like
Date Sum_Amount
2017/01/12 80
2017/01/15 150
2017/01/23 230
2017/02/01 100
2017/02/02 110
2017/02/03 140
2017/02/04 390
If you only need the cumulative sum by month, you can groupby with sum and then group by the values of the index converted to month numbers:
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 140
6 2017-02-04 390
But if you need to distinguish months and years, convert the index to a monthly period with to_period:
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
The difference is easier to see in a changed df - a different year added:
print (df)
Date Amount
0 2017/01/12 50
1 2017/01/12 30
2 2017/01/15 70
3 2017/01/23 80
4 2017/02/01 90
5 2017/02/01 10
6 2017/02/02 10
7 2017/02/03 10
8 2018/02/03 20
9 2018/02/04 60
10 2018/02/04 90
11 2018/02/04 100
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 140
7 2018-02-04 390
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 20
7 2018-02-04 270
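The period-based variant can also be written as a single chain; a sketch:

out = (df.assign(Date=pd.to_datetime(df.Date))
         .groupby('Date')['Amount'].sum()                 # one row per day
         .groupby(lambda d: d.to_period('M')).cumsum()    # running total per year-month
         .reset_index())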
I have this dataframe
df1_9
date store_nbr item_nbr units station_nbr tavg preciptotal
8 2012-01-01 1 9 29 1 42 0.05
119 2012-01-02 1 9 60 1 41 0.01
...
452 2012-01-05 1 9 16 1 32 0.00
563 2012-01-06 1 9 12 1 36 T
I want to replace the 'T' in the preciptotal column with the value 0.01.
df1_9.ix[df1_9.preciptotal == 'T', 'preciptotal'] = 0.01
I wrote this code, but for some reason it is not working. I have been staring at this for a while, any help would be appreciated.
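Two things are worth checking here. First, .ix was deprecated in pandas 0.20 and removed in 1.0, so on a current pandas use .loc instead. Second, if the comparison matches nothing even on an old pandas, the 'T' values may carry padding whitespace (an assumption worth verifying with df1_9.preciptotal.unique()). A sketch covering both:

# strip whitespace before comparing, in case values are padded like ' T'
mask = df1_9['preciptotal'].astype(str).str.strip() == 'T'
df1_9.loc[mask, 'preciptotal'] = 0.01
# optionally make the column numeric afterwards
df1_9['preciptotal'] = pd.to_numeric(df1_9['preciptotal'])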