Missing replacement by a loop - Python

I have the following dataframe:
id value year audit
1 21 2007 NaN
1 36 2008 2011
1 7 2009 NaN
2 44 2007 NaN
2 41 2008 NaN
2 15 2009 NaN
3 51 2007 NaN
3 15 2008 2011
3 51 2009 NaN
4 10 2007 NaN
4 12 2008 NaN
4 24 2009 2011
5 30 2007 2011
5 35 2008 NaN
5 122 2009 NaN
Basically, I want to create another variable, audit2, where every cell is 2011 for a given id if at least one audit value for that id is 2011.
I tried to put an if-statement inside a loop, but I could not get any results.
I would like to get this new dataframe
id value year audit audit2
1 21 2007 NaN 2011
1 36 2008 2011 2011
1 7 2009 NaN 2011
2 44 2007 NaN NaN
2 41 2008 NaN NaN
2 15 2009 NaN NaN
3 51 2007 NaN 2011
3 15 2008 2011 2011
3 51 2009 NaN 2011
4 10 2007 NaN 2011
4 12 2008 NaN 2011
4 24 2009 2011 2011
5 30 2007 2011 2011
5 35 2008 NaN 2011
5 122 2009 NaN 2011
Could you help me please?

df.groupby('id')['audit'].transform(lambda s: s[s.first_valid_index()] if s.first_valid_index() is not None else np.nan)
Output (the transformed series):
0 2011.0
1 2011.0
2 2011.0
3 NaN
4 NaN
5 NaN
6 2011.0
7 2011.0
8 2011.0
9 2011.0
10 2011.0
11 2011.0
12 2011.0
13 2011.0
14 2011.0
Name: audit, dtype: float64
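A simpler alternative is to transform with 'first', which skips NaN, so each id's first non-missing audit year is broadcast back to every row of that group. A minimal, self-contained sketch on a made-up toy frame covering only the first two ids from the question:
import numpy as np
import pandas as pd
# toy frame with the same structure as the question (first two ids only)
df = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2, 2],
    'value': [21, 36, 7, 44, 41, 15],
    'year':  [2007, 2008, 2009, 2007, 2008, 2009],
    'audit': [np.nan, 2011, np.nan, np.nan, np.nan, np.nan],
})
# 'first' ignores NaN: each group gets its first non-missing audit value
# (or NaN if the whole group is missing), broadcast to every row of the group
df['audit2'] = df.groupby('id')['audit'].transform('first')
print(df)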

Related

Replace Pandas NaN with Data from Different DF Based on Multiple Conditions

I have a dataframe "dfnan" that has NaN values, and I need to replace those values with data from a different dataframe, "dffill", inserted at the rows where "name" and "month" match. My data looks like this for dfnan:
index result result result result result month name year
1 4 4 4 4 4 1 Bears 2022
2 20 20 20 20 20 2 Bears 2022
3 8 8 8 8 8 3 Bears 2022
4 5 5 5 5 5 4 Bears 2022
5 3 3 3 3 3 5 Bears 2022
6 19 19 19 19 19 6 Bears 2022
7 nan nan nan nan nan 7 Bears 2022
8 nan nan nan nan nan 8 Bears 2022
9 nan nan nan nan nan 9 Bears 2022
10 nan nan nan nan nan 10 Bears 2022
11 nan nan nan nan nan 11 Bears 2022
12 nan nan nan nan nan 12 Bears 2022
13 5 5 5 5 5 1 Eagles 2022
14 9 9 9 9 9 2 Eagles 2022
15 12 12 12 12 12 3 Eagles 2022
16 21 21 21 21 21 4 Eagles 2022
17 2 2 2 2 2 5 Eagles 2022
18 17 17 17 17 17 6 Eagles 2022
19 nan nan nan nan nan 7 Eagles 2022
20 nan nan nan nan nan 8 Eagles 2022
21 nan nan nan nan nan 9 Eagles 2022
22 nan nan nan nan nan 10 Eagles 2022
23 nan nan nan nan nan 11 Eagles 2022
24 nan nan nan nan nan 12 Eagles 2022
The data I need to fill the NaN values with is here in "dffill":
index month name 1 2 3 4 5
1 7 Bears 10 25 14 4 22
2 8 Bears 5 8 6 24 18
3 9 Bears 18 8 8 14 16
4 10 Bears 19 11 13 8 9
5 11 Bears 16 25 3 9 6
6 12 Bears 17 11 18 3 24
7 7 Eagles 15 24 11 2 25
8 8 Eagles 1 7 18 9 17
9 9 Eagles 11 11 8 18 20
10 10 Eagles 16 20 3 24 2
11 11 Eagles 10 24 6 4 19
12 12 Eagles 8 16 12 19 22
I am sorry, but I cannot work out how to insert this row data in the correct positions given the requirement of matching "name" and "month". Thank you for your help; here is the desired final result.
index result result result result result month name year
1 4 4 4 4 4 1 Bears 2022
2 20 20 20 20 20 2 Bears 2022
3 8 8 8 8 8 3 Bears 2022
4 5 5 5 5 5 4 Bears 2022
5 3 3 3 3 3 5 Bears 2022
6 19 19 19 19 19 6 Bears 2022
7 10 25 14 4 22 7 Bears 2022
8 5 8 6 24 18 8 Bears 2022
9 18 8 8 14 16 9 Bears 2022
10 19 11 13 8 9 10 Bears 2022
11 16 25 3 9 6 11 Bears 2022
12 17 11 18 3 24 12 Bears 2022
13 5 5 5 5 5 1 Eagles 2022
14 9 9 9 9 9 2 Eagles 2022
15 12 12 12 12 12 3 Eagles 2022
16 21 21 21 21 21 4 Eagles 2022
17 2 2 2 2 2 5 Eagles 2022
18 17 17 17 17 17 6 Eagles 2022
19 15 24 11 2 25 7 Eagles 2022
20 1 7 18 9 17 8 Eagles 2022
21 11 11 8 18 20 9 Eagles 2022
22 16 20 3 24 2 10 Eagles 2022
23 10 24 6 4 19 11 Eagles 2022
24 8 16 12 19 22 12 Eagles 2022
Here is one way to do it, which is to make use of df.update, assuming it's OK to set the index and rename the df2 columns (here df is dfnan and df2 is dffill):
# set the index on both DataFrames
df.set_index(['name','month'], inplace=True)
df2.set_index(['name','month'], inplace=True)
# rename df2's numeric columns so they match df's result columns
df2.columns = ['result.' + str(int(col)-1) if str(col).isdigit() else col for col in df2.columns]
df2.rename(columns={'result.0':'result'}, inplace=True)
# use update to fill only the NaN cells (overwrite=False keeps existing values)
df.update(df2, overwrite=False)
# reset the index, if needed
df.reset_index()
name month index result result.1 result.2 result.3 result.4 year
0 Bears 1 1 4.0 4.0 4.0 4.0 4.0 2022
1 Bears 2 2 20.0 20.0 20.0 20.0 20.0 2022
2 Bears 3 3 8.0 8.0 8.0 8.0 8.0 2022
3 Bears 4 4 5.0 5.0 5.0 5.0 5.0 2022
4 Bears 5 5 3.0 3.0 3.0 3.0 3.0 2022
5 Bears 6 6 19.0 19.0 19.0 19.0 19.0 2022
6 Bears 7 7 10.0 25.0 14.0 4.0 22.0 2022
7 Bears 8 8 5.0 8.0 6.0 24.0 18.0 2022
8 Bears 9 9 18.0 8.0 8.0 14.0 16.0 2022
9 Bears 10 10 19.0 11.0 13.0 8.0 9.0 2022
10 Bears 11 11 16.0 25.0 3.0 9.0 6.0 2022
11 Bears 12 12 17.0 11.0 18.0 3.0 24.0 2022
12 Eagles 1 13 5.0 5.0 5.0 5.0 5.0 2022
13 Eagles 2 14 9.0 9.0 9.0 9.0 9.0 2022
14 Eagles 3 15 12.0 12.0 12.0 12.0 12.0 2022
15 Eagles 4 16 21.0 21.0 21.0 21.0 21.0 2022
16 Eagles 5 17 2.0 2.0 2.0 2.0 2.0 2022
17 Eagles 6 18 17.0 17.0 17.0 17.0 17.0 2022
18 Eagles 7 19 15.0 24.0 11.0 2.0 25.0 2022
19 Eagles 8 20 1.0 7.0 18.0 9.0 17.0 2022
20 Eagles 9 21 11.0 11.0 8.0 18.0 20.0 2022
21 Eagles 10 22 16.0 20.0 3.0 24.0 2.0 2022
22 Eagles 11 23 10.0 24.0 6.0 4.0 19.0 2022
23 Eagles 12 24 8.0 16.0 12.0 19.0 22.0 2022
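For reference, a minimal, self-contained sketch of the same df.update approach on made-up, cut-down toy frames (two result columns and two rows only), so it can be run and inspected on its own:
import numpy as np
import pandas as pd
# toy versions of dfnan (df) and dffill (df2), two result columns only
df = pd.DataFrame({
    'result':   [4.0, np.nan],
    'result.1': [4.0, np.nan],
    'month':    [1, 7],
    'name':     ['Bears', 'Bears'],
    'year':     [2022, 2022],
})
df2 = pd.DataFrame({'month': [7], 'name': ['Bears'], 1: [10], 2: [25]})
# align both frames on the (name, month) key
df.set_index(['name', 'month'], inplace=True)
df2.set_index(['name', 'month'], inplace=True)
# rename df2's numeric columns 1, 2 to df's column names result, result.1
df2.columns = ['result.' + str(int(c) - 1) if str(c).isdigit() else c for c in df2.columns]
df2.rename(columns={'result.0': 'result'}, inplace=True)
# fill only the NaN cells in df with the matching values from df2
df.update(df2, overwrite=False)
print(df.reset_index())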

Get "Last Purchase Year" from Sales Data Pivot in Pandas

I have pivoted the Customer ID against their year of purchase, so that I know how many times each customer purchased in different years:
Customer ID 1996 1997 ... 2019 2020
100000000000001 7 7 ... NaN NaN
100000000000002 8 8 ... NaN NaN
100000000000003 7 4 ... NaN NaN
100000000000004 NaN NaN ... 21 24
100000000000005 17 11 ... 18 NaN
My desired result is to append columns with the latest year of purchase and, from that, the number of years since their last purchase:
Customer ID 1996 1997 ... 2019 2020 Last Recency
100000000000001 7 7 ... NaN NaN 1997 23
100000000000002 8 8 ... NaN NaN 1997 23
100000000000003 7 4 ... NaN NaN 1997 23
100000000000004 NaN NaN ... 21 24 2020 0
100000000000005 17 11 ... 18 NaN 2019 1
Here is what I tried:
df_pivot["Last"] = 2020
k = 2020
while math.isnan(df_pivot2[k]):
    df_pivot["Last"] = k-1
    k = k-1
df_pivot["Recency"] = 2020 - df_pivot["Last"]
However, what I got is "TypeError: cannot convert the series to <class 'float'>".
Could anyone help me to get the result I need?
Thanks a lot!
Dennis
You can get the last year of purchase using notna + cumsum and idxmax along axis=1, then subtract this last year of purchase from the max year to compute Recency:
c = df.filter(regex=r'\d+').columns
df['Last'] = df[c].notna().cumsum(1).idxmax(1)
df['Recency'] = c.max() - df['Last']
Customer ID 1996 1997 2019 2020 Last Recency
0 100000000000001 7.0 7.0 NaN NaN 1997 23
1 100000000000002 8.0 8.0 NaN NaN 1997 23
2 100000000000003 7.0 4.0 NaN NaN 1997 23
3 100000000000004 NaN NaN 21.0 24.0 2020 0
4 100000000000005 17.0 11.0 18.0 NaN 2019 1
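The same idea as a minimal, self-contained sketch: the frame below is a cut-down version of the pivot in the question (only the year columns shown above), with integer year column labels assumed:
import numpy as np
import pandas as pd
# cut-down pivot: integer year columns, NaN where there was no purchase
df = pd.DataFrame({
    'Customer ID': [100000000000001, 100000000000004, 100000000000005],
    1996: [7.0, np.nan, 17.0],
    1997: [7.0, np.nan, 11.0],
    2019: [np.nan, 21.0, 18.0],
    2020: [np.nan, 24.0, np.nan],
})
c = df.filter(regex=r'\d+').columns             # the integer year columns
df['Last'] = df[c].notna().cumsum(1).idxmax(1)  # label of the last non-NaN year column
df['Recency'] = c.max() - df['Last']            # years since the last purchase
print(df)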
One idea is to apply applymap(float) to your DataFrame; see the pandas documentation for DataFrame.applymap.

Filling of NaN values with the average of Quantity corresponding to a particular year

Year Week_Number DC_Zip Asin_code
1 2016 1 84105 NaN
2 2016 1 85034 NaN
3 2016 1 93711 NaN
4 2016 1 98433 NaN
5 2016 2 12206 21.0
6 2016 2 29306 10.0
7 2016 2 33426 11.0
8 2016 2 37206 1.0
9 2017 1 12206 266.0
10 2017 1 29306 81.0
11 2017 1 33426 NaN
12 2017 1 37206 NaN
13 2017 1 45216 99.0
14 2017 1 60160 100.0
15 2017 1 76110 76.0
16 2018 1 12206 562.0
17 2018 1 29306 184.0
18 2018 1 33426 NaN
19 2018 1 37206 NaN
20 2018 1 45216 187.0
21 2018 1 60160 192.0
22 2018 1 76110 202.0
23 2019 1 12206 511.0
24 2019 1 29306 NaN
25 2019 1 33426 224.0
26 2019 1 37206 78.0
27 2019 1 45216 160.0
28 2019 1 60160 NaN
29 2019 1 76110 221.0
30 2020 6 93711 NaN
31 2020 6 98433 NaN
32 2020 7 12206 74.0
33 2020 7 29306 22.0
34 2020 7 33426 32.0
35 2020 7 37206 10.0
36 2020 7 45216 34.0
I want to fill the NaN values with the average of Asin_code for that particular year. I am able to fill the values for 2016 with this code:
df["Asin_code"] = df.Asin_code.fillna(df.Asin_code.loc[df.Year == 2016].mean(), axis=0)
But I am unable to do it for the whole dataframe.
Use groupby().transform() and fillna:
df['Asin_code'] = df['Asin_code'].fillna(df.groupby('Year').Asin_code.transform('mean'))
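A minimal, self-contained sketch of that one-liner on a made-up frame with the same column names, so the per-year means are easy to check by eye:
import numpy as np
import pandas as pd
# made-up frame: the 2016 mean is 16.0 and the 2017 mean is 100.0
df = pd.DataFrame({
    'Year':      [2016, 2016, 2016, 2017, 2017],
    'Asin_code': [np.nan, 21.0, 11.0, np.nan, 100.0],
})
# per-year mean broadcast back to the original shape, then used to fill the gaps
yearly_mean = df.groupby('Year').Asin_code.transform('mean')
df['Asin_code'] = df['Asin_code'].fillna(yearly_mean)
print(df)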

Pandas count monthly rainy vs not rainy days starting from hourly data

I have a large dataset (here is a link to a subset: https://drive.google.com/open?id=1o7dEsRUYZYZ2-L9pd_WFnIX1n10hSA-f) with a tstamp index (2010-01-01 00:00:00) and the mm of rain. Measurements are taken every 5 minutes for many years:
mm
tstamp
2010-01-01 00:00:00 0.0
2010-01-01 00:05:00 0.0
2010-01-01 00:10:00 0.0
2010-01-01 00:15:00 0.0
2010-01-01 00:20:00 0.0
........
What I want to get is the count of rainy days for each month of each year, so ideally a dataframe like the following:
tstamp rainy not rainy
2010-01 11 20
2010-02 20 8
......
2012-10 15 16
2012-11 30 0
What I'm able to obtain is a nested dict object like d = {year: {month: {'rainy': 10, 'not_rainy': 20}, ...}, ...}, made with this small code snippet:
from collections import defaultdict
d = defaultdict(lambda: defaultdict(dict))
for year in df.index.year.unique():
    try:
        for month in df.index.month.unique():
            a = df['{}-{}'.format(year, month)].resample('D').sum()
            d[year][month]['rainy'] = a[a['mm'] != 0].count()
            d[year][month]['not_rainy'] = a[a['mm'] == 0].count()
    except:
        pass
But I think I'm missing an easier and more straightforward solution. Any suggestion?
One way is to do two groupby operations:
daily = df['mm'].gt(0).groupby(df.index.normalize()).any()
monthly = (daily.groupby(daily.index.to_period('M'))
                .value_counts()
                .unstack())
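A self-contained version of the same two-step groupby, run on a couple of months of synthetic 5-minute data (the rainfall values are randomly generated with a fixed seed, so the counts are only for illustration):
import numpy as np
import pandas as pd
# synthetic 5-minute rainfall series for Jan-Feb 2010 (values are made up)
idx = pd.date_range('2010-01-01', '2010-02-28 23:55', freq='5min')
rng = np.random.default_rng(0)
df = pd.DataFrame({'mm': rng.choice([0.0, 0.2], size=len(idx), p=[0.999, 0.001])}, index=idx)
df.index.name = 'tstamp'
# a day is rainy if any 5-minute reading on that day is above zero
daily = df['mm'].gt(0).groupby(df.index.normalize()).any()
# count rainy / not-rainy days per month
monthly = (daily.groupby(daily.index.to_period('M'))
                .value_counts()
                .unstack(fill_value=0)
                .rename(columns={True: 'rainy', False: 'not rainy'}))
print(monthly)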
You can do this; I don't see any non-rainy months:
df = pd.read_csv('rain.csv')
df['tstamp'] = pd.to_datetime(df['tstamp'])
df['month'] = df['tstamp'].dt.month
df['year'] = df['tstamp'].dt.year
df = df.groupby(by=['year', 'month'], as_index=False).sum()
print(df)
Output:
year month mm
0 2010 1 1.0
1 2010 2 15.4
2 2010 3 21.8
3 2010 4 9.6
4 2010 5 118.4
5 2010 6 82.8
6 2010 7 96.0
7 2010 8 161.6
8 2010 9 109.2
9 2010 10 51.2
10 2010 11 52.4
11 2010 12 39.6
12 2011 1 5.6
13 2011 2 0.8
14 2011 3 13.4
15 2011 4 1.8
16 2011 5 97.6
17 2011 6 167.8
18 2011 7 128.8
19 2011 8 67.6
20 2011 9 155.8
21 2011 10 71.6
22 2011 11 0.4
23 2011 12 29.4
24 2012 1 17.6
25 2012 2 2.2
26 2012 3 13.0
27 2012 4 55.8
28 2012 5 36.8
29 2012 6 108.4
30 2012 7 182.4
31 2012 8 191.8
32 2012 9 89.0
33 2012 10 93.6
34 2012 11 161.2
35 2012 12 26.4

How to put Month and Year in the same Column in Python Pandas

Column1 Month Quantity Year
48 4 12.00 2006
49 5 13.00 2006
50 6 46.00 2006
51 7 11.00 2006
52 8 18.00 2006
53 9 16.00 2006
54 10 28.00 2006
83 1 6.00 2006
How can I merge the month column with the year column, and get meaningful time data?
In [42]: df['Date'] = pd.to_datetime(df.assign(Day=1).loc[:, ['Year','Month','Day']])
In [43]: df
Out[43]:
Column1 Month Quantity Year Date
0 48 4 12.0 2006 2006-04-01
1 49 5 13.0 2006 2006-05-01
2 50 6 46.0 2006 2006-06-01
3 51 7 11.0 2006 2006-07-01
4 52 8 18.0 2006 2006-08-01
5 53 9 16.0 2006 2006-09-01
6 54 10 28.0 2006 2006-10-01
7 83 1 6.0 2006 2006-01-01
Or a much nicer solution from @piRSquared:
In [55]: df['Date'] = pd.to_datetime(df[['Year', 'Month']].assign(Day=1))
In [56]: df
Out[56]:
Column1 Month Quantity Year Date
0 48 4 12.0 2006 2006-04-01
1 49 5 13.0 2006 2006-05-01
2 50 6 46.0 2006 2006-06-01
3 51 7 11.0 2006 2006-07-01
4 52 8 18.0 2006 2006-08-01
5 53 9 16.0 2006 2006-09-01
6 54 10 28.0 2006 2006-10-01
7 83 1 6.0 2006 2006-01-01
df['Date'] = pd.to_datetime(df.Year.astype(str) + '-' + df.Month.astype(str))
print(df)
Column1 Month Quantity Year Date
0 48 4 12.0 2006 2006-04-01
1 49 5 13.0 2006 2006-05-01
2 50 6 46.0 2006 2006-06-01
3 51 7 11.0 2006 2006-07-01
4 52 8 18.0 2006 2006-08-01
5 53 9 16.0 2006 2006-09-01
6 54 10 28.0 2006 2006-10-01
7 83 1 6.0 2006 2006-01-01
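A small variation on the snippet above, passing an explicit format so the parsing is unambiguous (the frame here is made up, with the same Year and Month columns):
import pandas as pd
# made-up frame with the same Year/Month columns
df = pd.DataFrame({'Year': [2006, 2006], 'Month': [4, 10]})
# explicit format; %m accepts both '4' and '04'
df['Date'] = pd.to_datetime(df.Year.astype(str) + '-' + df.Month.astype(str), format='%Y-%m')
print(df)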
