I have a dataframe "dfnan" that has NaN values, and I need to replace those values with data from a different dataframe, "dffill". Each fill row must go into the position that matches on "name" and "month". My data looks like this for dfnan:
index result result result result result month name year
1 4 4 4 4 4 1 Bears 2022
2 20 20 20 20 20 2 Bears 2022
3 8 8 8 8 8 3 Bears 2022
4 5 5 5 5 5 4 Bears 2022
5 3 3 3 3 3 5 Bears 2022
6 19 19 19 19 19 6 Bears 2022
7 nan nan nan nan nan 7 Bears 2022
8 nan nan nan nan nan 8 Bears 2022
9 nan nan nan nan nan 9 Bears 2022
10 nan nan nan nan nan 10 Bears 2022
11 nan nan nan nan nan 11 Bears 2022
12 nan nan nan nan nan 12 Bears 2022
13 5 5 5 5 5 1 Eagles 2022
14 9 9 9 9 9 2 Eagles 2022
15 12 12 12 12 12 3 Eagles 2022
16 21 21 21 21 21 4 Eagles 2022
17 2 2 2 2 2 5 Eagles 2022
18 17 17 17 17 17 6 Eagles 2022
19 nan nan nan nan nan 7 Eagles 2022
20 nan nan nan nan nan 8 Eagles 2022
21 nan nan nan nan nan 9 Eagles 2022
22 nan nan nan nan nan 10 Eagles 2022
23 nan nan nan nan nan 11 Eagles 2022
24 nan nan nan nan nan 12 Eagles 2022
The data I need to fill the NaN values with is here in "dffill":
index month name 1 2 3 4 5
1 7 Bears 10 25 14 4 22
2 8 Bears 5 8 6 24 18
3 9 Bears 18 8 8 14 16
4 10 Bears 19 11 13 8 9
5 11 Bears 16 25 3 9 6
6 12 Bears 17 11 18 3 24
7 7 Eagles 15 24 11 2 25
8 8 Eagles 1 7 18 9 17
9 9 Eagles 11 11 8 18 20
10 10 Eagles 16 20 3 24 2
11 11 Eagles 10 24 6 4 19
12 12 Eagles 8 16 12 19 22
I cannot work out how to insert this row data in the correct position, given that "name" and "month" must match. Thank you for your help. Here is the desired final result:
index result result result result result month name year
1 4 4 4 4 4 1 Bears 2022
2 20 20 20 20 20 2 Bears 2022
3 8 8 8 8 8 3 Bears 2022
4 5 5 5 5 5 4 Bears 2022
5 3 3 3 3 3 5 Bears 2022
6 19 19 19 19 19 6 Bears 2022
7 10 25 14 4 22 7 Bears 2022
8 5 8 6 24 18 8 Bears 2022
9 18 8 8 14 16 9 Bears 2022
10 19 11 13 8 9 10 Bears 2022
11 16 25 3 9 6 11 Bears 2022
12 17 11 18 3 24 12 Bears 2022
13 5 5 5 5 5 1 Eagles 2022
14 9 9 9 9 9 2 Eagles 2022
15 12 12 12 12 12 3 Eagles 2022
16 21 21 21 21 21 4 Eagles 2022
17 2 2 2 2 2 5 Eagles 2022
18 17 17 17 17 17 6 Eagles 2022
19 15 24 11 2 25 7 Eagles 2022
20 1 7 18 9 17 8 Eagles 2022
21 11 11 8 18 20 9 Eagles 2022
22 16 20 3 24 2 10 Eagles 2022
23 10 24 6 4 19 11 Eagles 2022
24 8 16 12 19 22 12 Eagles 2022
Here is one way to do it, using df.update (here df is dfnan and df2 is dffill).
This assumes it is OK to set the index and rename the df2 columns.
# set the index on both dataframes
df.set_index(['name','month'], inplace=True)
df2.set_index(['name','month'], inplace=True)
# rename the numeric df2 columns (1..5) to match df: result, result.1, ..., result.4
df2.columns=['result.' + str(int(col)-1) if str(col).isdigit() else col for col in df2.columns]
df2.rename(columns={'result.0':'result'}, inplace=True)
# fill only the NaN cells in df with the matching values from df2
df.update(df2, overwrite=False)
# reset the index, if needed
df.reset_index()
name month index result result.1 result.2 result.3 result.4 year
0 Bears 1 1 4.0 4.0 4.0 4.0 4.0 2022
1 Bears 2 2 20.0 20.0 20.0 20.0 20.0 2022
2 Bears 3 3 8.0 8.0 8.0 8.0 8.0 2022
3 Bears 4 4 5.0 5.0 5.0 5.0 5.0 2022
4 Bears 5 5 3.0 3.0 3.0 3.0 3.0 2022
5 Bears 6 6 19.0 19.0 19.0 19.0 19.0 2022
6 Bears 7 7 10.0 25.0 14.0 4.0 22.0 2022
7 Bears 8 8 5.0 8.0 6.0 24.0 18.0 2022
8 Bears 9 9 18.0 8.0 8.0 14.0 16.0 2022
9 Bears 10 10 19.0 11.0 13.0 8.0 9.0 2022
10 Bears 11 11 16.0 25.0 3.0 9.0 6.0 2022
11 Bears 12 12 17.0 11.0 18.0 3.0 24.0 2022
12 Eagles 1 13 5.0 5.0 5.0 5.0 5.0 2022
13 Eagles 2 14 9.0 9.0 9.0 9.0 9.0 2022
14 Eagles 3 15 12.0 12.0 12.0 12.0 12.0 2022
15 Eagles 4 16 21.0 21.0 21.0 21.0 21.0 2022
16 Eagles 5 17 2.0 2.0 2.0 2.0 2.0 2022
17 Eagles 6 18 17.0 17.0 17.0 17.0 17.0 2022
18 Eagles 7 19 15.0 24.0 11.0 2.0 25.0 2022
19 Eagles 8 20 1.0 7.0 18.0 9.0 17.0 2022
20 Eagles 9 21 11.0 11.0 8.0 18.0 20.0 2022
21 Eagles 10 22 16.0 20.0 3.0 24.0 2.0 2022
22 Eagles 11 23 10.0 24.0 6.0 4.0 19.0 2022
23 Eagles 12 24 8.0 16.0 12.0 19.0 22.0 2022
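For a self-contained check of the same idea, here is a minimal sketch on a cut-down version of the data (two result columns instead of five). The toy frames and the names `keys`, `fill`, and `filled` are illustrative assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for dfnan and dffill (assumed shapes, two result columns only).
dfnan = pd.DataFrame({
    "result": [4.0, np.nan, 5.0, np.nan],
    "result.1": [4.0, np.nan, 5.0, np.nan],
    "month": [6, 7, 6, 7],
    "name": ["Bears", "Bears", "Eagles", "Eagles"],
    "year": [2022] * 4,
})
dffill = pd.DataFrame({
    "month": [7, 7],
    "name": ["Bears", "Eagles"],
    "1": [10, 15],
    "2": [25, 24],
})

keys = ["name", "month"]

# Rename the fill columns to match the target, then align both frames on the keys.
fill = dffill.rename(columns={"1": "result", "2": "result.1"}).set_index(keys)
filled = dfnan.set_index(keys)

# update() works in place; overwrite=False only touches cells that are NaN.
filled.update(fill, overwrite=False)
filled = filled.reset_index()
print(filled)
```

Because `update` aligns on the (name, month) index, the row order of `dffill` does not matter.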
Given this dataframe:
print(df)
0 1 2
0 354.7 April 4.0
1 55.4 August 8.0
2 176.5 December 12.0
3 95.5 February 2.0
4 85.6 January 1.0
5 152 July 7.0
6 238.7 June 6.0
7 104.8 March 3.0
8 283.5 May 5.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
If I sort by column 2 using df.sort_values('2'), I get:
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152.0 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
Is there a smart way to re-define the index column (from 0 to 11) preserving the new order I got?
Use reset_index:
df.sort_values('2').reset_index(drop=True)
Alternatively, this overwrites the original dataframe in place and keeps its existing 0 to 11 index:
df[:] = df.sort_values('2').values
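As a quick illustration of the first approach, here is a minimal sketch on three rows of the data above. The variable names `by_reset` and `by_flag` are made up for the example; `ignore_index=True` is a `sort_values` parameter available in pandas 1.0 and later that has the same effect.

```python
import pandas as pd

df = pd.DataFrame({
    "0": [354.7, 55.4, 85.6],
    "1": ["April", "August", "January"],
    "2": [4.0, 8.0, 1.0],
})

# Sort, then throw away the old index and number the rows 0..n-1.
by_reset = df.sort_values("2").reset_index(drop=True)

# Same result in one call (pandas >= 1.0).
by_flag = df.sort_values("2", ignore_index=True)

print(by_reset)
```

Without `drop=True`, `reset_index` would keep the old index as an extra column.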
Year Week_Number DC_Zip Asin_code
1 2016 1 84105 NaN
2 2016 1 85034 NaN
3 2016 1 93711 NaN
4 2016 1 98433 NaN
5 2016 2 12206 21.0
6 2016 2 29306 10.0
7 2016 2 33426 11.0
8 2016 2 37206 1.0
9 2017 1 12206 266.0
10 2017 1 29306 81.0
11 2017 1 33426 NaN
12 2017 1 37206 NaN
13 2017 1 45216 99.0
14 2017 1 60160 100.0
15 2017 1 76110 76.0
16 2018 1 12206 562.0
17 2018 1 29306 184.0
18 2018 1 33426 NaN
19 2018 1 37206 NaN
20 2018 1 45216 187.0
21 2018 1 60160 192.0
22 2018 1 76110 202.0
23 2019 1 12206 511.0
24 2019 1 29306 NaN
25 2019 1 33426 224.0
26 2019 1 37206 78.0
27 2019 1 45216 160.0
28 2019 1 60160 NaN
29 2019 1 76110 221.0
30 2020 6 93711 NaN
31 2020 6 98433 NaN
32 2020 7 12206 74.0
33 2020 7 29306 22.0
34 2020 7 33426 32.0
35 2020 7 37206 10.0
36 2020 7 45216 34.0
I want to fill the NaN values with the average of Asin_code for that particular year. I am able to fill the values for 2016 with this code:
df["Asin_code"]=df.Asin_code.fillna(df.Asin_code.loc[(df.Year==2016)].mean(),axis=0)
But I am unable to do this for the whole dataframe.
Use groupby().transform() and fillna:
df['Asin_code'] = df['Asin_code'].fillna(df.groupby('Year').Asin_code.transform('mean'))
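To see why this works, here is a minimal sketch on a toy frame (the data and the name `year_mean` are assumptions for illustration). `transform('mean')` returns a Series with the same index as `df`, holding each row's per-year mean, so `fillna` can align it row by row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017],
    "Asin_code": [np.nan, 10.0, 20.0, np.nan, 100.0],
})

# One mean per row, computed within that row's Year (NaN values are skipped).
year_mean = df.groupby("Year")["Asin_code"].transform("mean")

# Only the NaN cells are replaced; existing values are left alone.
df["Asin_code"] = df["Asin_code"].fillna(year_mean)
print(df)
```

A year whose values are all NaN has no mean, so its rows stay NaN.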
Year Week_Number DC_Zip Asin_code
0 2016 1 12206 NaN
1 2016 1 29306 NaN
2 2016 1 33426 NaN
3 2016 1 37206 NaN
4 2016 1 45216 NaN
5 2016 1 60160 NaN
6 2016 1 76110 NaN
7 2016 1 80215 NaN
8 2016 1 84105 NaN
9 2016 1 85034 NaN
10 2016 1 93711 NaN
11 2016 1 98433 NaN
12 2016 2 12206 21.0
13 2016 2 29306 10.0
14 2016 2 33426 11.0
15 2016 2 37206 1.0
16 2016 2 45216 5.0
17 2016 2 60160 7.0
18 2016 2 76110 12.0
19 2016 2 80215 NaN
20 2016 2 84105 2.0
21 2016 2 85034 1.0
22 2016 2 93711 23.0
23 2016 2 98433 7.0
24 2016 3 12206 95.0
25 2016 3 29306 26.0
26 2016 3 33426 51.0
27 2016 3 37206 18.0
28 2016 3 45216 34.0
29 2016 3 60160 30.0
... ... ... ... ...
2778 2020 29 76110 33.0
2779 2020 29 80215 5.0
2780 2020 29 84105 3.0
2781 2020 29 85034 8.0
2782 2020 29 93711 53.0
2783 2020 29 98433 15.0
2784 2020 30 12206 75.0
2785 2020 30 29306 27.0
2786 2020 30 33426 34.0
2787 2020 30 37206 12.0
2788 2020 30 45216 14.0
2789 2020 30 60160 28.0
2790 2020 30 76110 47.0
2791 2020 30 80215 11.0
2792 2020 30 84105 3.0
2793 2020 30 85034 17.0
2794 2020 30 93711 62.0
2795 2020 30 98433 13.0
2796 2020 31 12206 109.0
2797 2020 31 29306 30.0
2798 2020 31 33426 31.0
2799 2020 31 37206 14.0
2800 2020 31 45216 23.0
2801 2020 31 60160 21.0
2802 2020 31 76110 25.0
2803 2020 31 80215 7.0
2804 2020 31 84105 4.0
2805 2020 31 85034 8.0
2806 2020 31 93711 71.0
2807 2020 31 98433 9.0
2808 rows × 4 columns
This is the sales data I am dealing with. I have to compute a weighted average of Asin_code with weighted rate = [5, 5, 20, 30, 40] for the respective years 2016, 2017, 2018, 2019 and 2020. I have to create a function that gives me a column containing the weighted average of Asin_code. NaN values should be dropped. It should also be possible to change the weighted rate in the future to view more patterns in the data. Any help would be appreciated.
I am trying the following code:
for i in range(len(df.Asin_code)):
df["Weighted_avg"]=rate[0]*df.Asin_code[i]/df.Asin_code.loc[(df.Year==2016)].sum()
I am just facing difficulties consolidating the data for the whole 5 years.
It becomes much simpler if you define your weights as a dict instead of a list; then a simple use of apply() works.
# define weights for year as a dict
wr = {2016:5, 2017:5, 2018:20, 2019:30, 2020:40}
df["Weighted_avg"] = df.apply(lambda r:
# numerator is weight * Asin_code[i]
( r["Asin_code"] * wr[r["Year"]]
/
# denominator is sum(Asin_code for that year)
df.Asin_code.loc[(df.Year==r["Year"])].sum() ), axis=1)
output
Idx Year Week_Number DC_Zip Asin_code Weighted_avg
25 2016 3 29306 26.0 0.367232
26 2016 3 33426 51.0 0.720339
27 2016 3 37206 18.0 0.254237
28 2016 3 45216 34.0 0.480226
29 2016 3 60160 30.0 0.423729
2778 2020 29 76110 33.0 1.625616
2779 2020 29 80215 5.0 0.246305
2780 2020 29 84105 3.0 0.147783
2781 2020 29 85034 8.0 0.394089
2782 2020 29 93711 53.0 2.610837
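The same per-row formula can also be written without apply(), which avoids recomputing the yearly sum for every row. This is a minimal sketch under the assumption that the formula above (weight * Asin_code / yearly sum) is what is wanted; `add_weighted_avg` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def add_weighted_avg(df, weights):
    """Hypothetical helper: drop NaN rows and add a Weighted_avg column."""
    out = df.dropna(subset=["Asin_code"]).copy()
    # Yearly total of Asin_code, repeated on every row of that year.
    year_total = out.groupby("Year")["Asin_code"].transform("sum")
    # Look up each row's weight from the dict, then apply the formula.
    out["Weighted_avg"] = out["Asin_code"] * out["Year"].map(weights) / year_total
    return out

toy = pd.DataFrame({
    "Year": [2016, 2016, 2017, 2017],
    "Asin_code": [10.0, 30.0, 20.0, np.nan],
})
result = add_weighted_avg(toy, {2016: 5, 2017: 5})
print(result)
```

Passing the weights as an argument makes it easy to try a different rate later.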
Supplementary update
Updated request: weighted_average[at index 1]=rate[for year 2016]*Asin_code[at first index of 2016]+rate[for year 2017]*Asin_code[at first index of 2017]+rate[for year 2018]*Asin_code[at first index of 2018]+rate[for year 2019]*Asin_code[at first index of 2019]+rate[for year 2020]*Asin_code[at first index of 2020]
df.dropna().groupby("Year").agg({"Asin_code":"first"}).reset_index()\
.assign(wa=lambda dfa:
dfa.apply(lambda r: r["Asin_code"]*wr[r['Year']],axis=1))["wa"].sum()
df["Weighted_avg"] = df.apply(lambda r: ( (r["Asin_code"] *wr[r["Year"]]).sum(axis = 0)), axis=1)
Output
12 2016 2 12206 21.0 105.0
13 2016 2 29306 10.0 50.0
14 2016 2 33426 11.0 55.0
15 2016 2 37206 1.0 5.0
16 2016 2 45216 5.0 25.0
17 2016 2 60160 7.0 35.0
18 2016 2 76110 12.0 60.0
19 2016 2 80215 NaN NaN
20 2016 2 84105 2.0 10.0
21 2016 2 85034 1.0 5.0
22 2016 2 93711 23.0 115.0
23 2016 2 98433 7.0 35.0
24 2016 3 12206 95.0 475.0
25 2016 3 29306 26.0 130.0
26 2016 3 33426 51.0 255.0
27 2016 3 37206 18.0 90.0
28 2016 3 45216 34.0 170.0
29 2016 3 60160 30.0 150.0
... ... ... ... ... ...
2778 2020 29 76110 33.0 1320.0
2779 2020 29 80215 5.0 200.0
2780 2020 29 84105 3.0 120.0
2781 2020 29 85034 8.0 320.0
2782 2020 29 93711 53.0 2120.0
2783 2020 29 98433 15.0 600.0
2784 2020 30 12206 75.0 3000.0
2785 2020 30 29306 27.0 1080.0
2786 2020 30 33426 34.0 1360.0
2787 2020 30 37206 12.0 480.0
2788 2020 30 45216 14.0 560.0
2789 2020 30 60160 28.0 1120.0
2790 2020 30 76110 47.0 1880.0
2791 2020 30 80215 11.0 440.0
2792 2020 30 84105 3.0 120.0
2793 2020 30 85034 17.0 680.0
2794 2020 30 93711 62.0 2480.0
2795 2020 30 98433 13.0 520.0
2796 2020 31 12206 109.0 4360.0
2797 2020 31 29306 30.0 1200.0
2798 2020 31 33426 31.0 1240.0
2799 2020 31 37206 14.0 560.0
2800 2020 31 45216 23.0 920.0
2801 2020 31 60160 21.0 840.0
2802 2020 31 76110 25.0 1000.0
2803 2020 31 80215 7.0 280.0
2804 2020 31 84105 4.0 160.0
2805 2020 31 85034 8.0 320.0
2806 2020 31 93711 71.0 2840.0
2807 2020 31 98433 9.0 360.0
Got my solution with this.
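For the single number described in the updated request (each year's first non-NaN Asin_code times that year's rate, summed over the years), a shorter form is possible. This is a minimal sketch on toy data; `first_per_year` and `total` are names made up for the example.

```python
import numpy as np
import pandas as pd

weights = {2016: 5, 2017: 5}
toy = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017],
    "Asin_code": [np.nan, 10.0, 30.0, 20.0],
})

# First non-NaN Asin_code of each year, indexed by Year.
first_per_year = toy.dropna(subset=["Asin_code"]).groupby("Year")["Asin_code"].first()

# Multiplying by a Series built from the dict aligns on the Year index.
total = (first_per_year * pd.Series(weights)).sum()
print(total)  # 150.0
```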
I have the following dataframe:
id value year audit
1 21 2007 NaN
1 36 2008 2011
1 7 2009 Nan
2 44 2007 NaN
2 41 2008 Nan
2 15 2009 Nan
3 51 2007 NaN
3 15 2008 2011
3 51 2009 Nan
4 10 2007 NaN
4 12 2008 Nan
4 24 2009 2011
5 30 2007 2011
5 35 2008 Nan
5 122 2009 Nan
Basically, I want to create another variable audit2 where all the cells are 2011, if at least one audit is 2011, for each id.
I tried to put an if-statement inside a loop, but I cannot get any results
I would like to get this new dataframe
id value year audit audit2
1 21 2007 NaN 2011
1 36 2008 2011 2011
1 7 2009 Nan 2011
2 44 2007 NaN NaN
2 41 2008 Nan NaN
2 15 2009 Nan NaN
3 51 2007 NaN 2011
3 15 2008 2011 2011
3 51 2009 Nan 2011
4 10 2007 NaN 2011
4 12 2008 Nan 2011
4 24 2009 2011 2011
5 30 2007 2011 2011
5 35 2008 Nan 2011
5 122 2009 Nan 2011
Could you help me please?
df.groupby('id')['audit'].transform(lambda s: s[s.first_valid_index()] if s.first_valid_index() is not None else np.nan)
output:
>>> df
0 2011.0
1 2011.0
2 2011.0
3 NaN
4 NaN
5 NaN
6 2011.0
7 2011.0
8 2011.0
9 2011.0
10 2011.0
11 2011.0
12 2011.0
13 2011.0
14 2011.0
Name: audit, dtype: float64
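An alternative that tests the stated condition directly (at least one audit equal to 2011 within the id) is sketched below. The toy data and the name `has_2011` are assumptions for illustration; the first row deliberately has its valid value at index 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "audit": [2011, np.nan, np.nan, np.nan, np.nan, 2011],
})

# True on every row of an id that has at least one audit == 2011.
has_2011 = df["audit"].eq(2011).groupby(df["id"]).transform("any")

# 2011 for those ids, NaN for the rest.
df["audit2"] = np.where(has_2011, 2011, np.nan)
print(df)
```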
I have two data frames and need to group the first one based on some criteria from the second df.
df1=
summary participant_id response_date
0 2.0 11 2016-04-30
1 3.0 11 2016-05-01
2 3.0 11 2016-05-02
3 3.0 11 2016-05-03
4 3.0 11 2016-05-04
5 3.0 11 2016-05-05
6 3.0 11 2016-05-06
7 4.0 11 2016-05-07
8 4.0 11 2016-05-08
9 3.0 11 2016-05-09
10 3.0 11 2016-05-10
11 3.0 11 2016-05-11
12 3.0 11 2016-05-12
13 3.0 11 2016-05-13
14 3.0 11 2016-05-14
15 3.0 11 2016-05-15
16 3.0 11 2016-05-16
17 4.0 11 2016-05-17
18 3.0 11 2016-05-18
19 3.0 11 2016-05-19
20 3.0 11 2016-05-20
21 4.0 11 2016-05-21
22 4.0 11 2016-05-22
23 4.0 11 2016-05-23
24 3.0 11 2016-05-24
25 3.0 11 2016-05-25
26 3.0 11 2016-05-26
27 3.0 11 2016-05-27
28 3.0 11 2016-05-28
29 3.0 11 2016-05-29
.. ... ... ...
df2 =
summary participant_id response_date
0 12.0 11 2016-04-30
1 12.0 11 2016-05-14
2 14.0 11 2016-05-28
. ... ... ...
I need to group (get blocks) of df1 between the dates in the column of df2. Namely:
df1=
summary participant_id response_date
2.0 11 2016-04-30
3.0 11 2016-05-01
3.0 11 2016-05-02
3.0 11 2016-05-03
3.0 11 2016-05-04
3.0 11 2016-05-05
3.0 11 2016-05-06
4.0 11 2016-05-07
4.0 11 2016-05-08
3.0 11 2016-05-09
3.0 11 2016-05-10
3.0 11 2016-05-11
3.0 11 2016-05-12
3.0 11 2016-05-13
3.0 11 2016-05-14
3.0 11 2016-05-15
3.0 11 2016-05-16
4.0 11 2016-05-17
3.0 11 2016-05-18
3.0 11 2016-05-19
3.0 11 2016-05-20
4.0 11 2016-05-21
4.0 11 2016-05-22
4.0 11 2016-05-23
3.0 11 2016-05-24
3.0 11 2016-05-25
3.0 11 2016-05-26
3.0 11 2016-05-27
3.0 11 2016-05-28
3.0 11 2016-05-29
.. ... ... ...
Is there an elegant solution with groupby?
There might be a more elegant solution, but you can loop through the response_date values in df2, build a boolean series for each one by comparing it against all the response_date values in df1, and sum the results.
df1['group'] = 0
for rd in df2.response_date.values:
df1['group'] += df1.response_date > rd
Output:
summary participant_id response_date group
0 2.0 11 2016-04-30 0
1 3.0 11 2016-05-01 1
2 3.0 11 2016-05-02 1
3 3.0 11 2016-05-03 1
4 3.0 11 2016-05-04 1
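The loop counts, for each row of df1, how many df2 dates are strictly earlier than it. Assuming that is the intended grouping, the same count can be computed in one step with `np.searchsorted`. This is a minimal sketch on a few of the dates above; `edges` is a name made up for the example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"response_date": pd.to_datetime(
    ["2016-04-30", "2016-05-01", "2016-05-14", "2016-05-15", "2016-05-29"])})
df2 = pd.DataFrame({"response_date": pd.to_datetime(
    ["2016-04-30", "2016-05-14", "2016-05-28"])})

# searchsorted needs sorted edges; side="left" counts edges strictly below each value.
edges = df2["response_date"].sort_values().to_numpy()
df1["group"] = np.searchsorted(edges, df1["response_date"].to_numpy(), side="left")

print(df1)
# The blocks can then be processed with df1.groupby("group").
```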
Building off of @Scott's answer:
You can use pd.cut but you will need to add a date before the earliest date and after the latest date in response_date from df2
dates = ([pd.Timestamp('2000-1-1')] +
         df2.response_date.sort_values().tolist() +
         [pd.Timestamp('2020-1-1')])
df1['group'] = pd.cut(df1['response_date'], dates)
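To make the padding idea concrete, here is a minimal sketch on a few of the dates above. The padding dates are arbitrary assumptions (any dates outside the data range would do), and `labels=False` is used so the bins come back as integer codes rather than intervals.

```python
import pandas as pd

df1 = pd.DataFrame({"response_date": pd.to_datetime(
    ["2016-04-30", "2016-05-01", "2016-05-14", "2016-05-15", "2016-05-29"])})
df2 = pd.DataFrame({"response_date": pd.to_datetime(
    ["2016-04-30", "2016-05-14", "2016-05-28"])})

# Pad the sorted df2 dates on both sides so every df1 date falls inside some bin.
edges = pd.DatetimeIndex(
    [pd.Timestamp("2000-01-01"),
     *df2["response_date"].sort_values(),
     pd.Timestamp("2100-01-01")]
)

# Bins are right-closed by default: (edge[i], edge[i+1]].
df1["group"] = pd.cut(df1["response_date"], bins=edges, labels=False)
print(df1)
```

Without the padding, dates on or before the first edge and after the last edge fall outside every bin and get NaN.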
You want pd.cut. This lets you bin your dates by some other list of dates.
df1['cuts'] = pd.cut(df1['response_date'], df2['response_date'])
grouped = df1.groupby('cuts')
print(grouped.max())  # for example