I have a large dataset (a subset is available at https://drive.google.com/open?id=1o7dEsRUYZYZ2-L9pd_WFnIX1n10hSA-f) with a tstamp index (e.g. 2010-01-01 00:00:00) and the rainfall in mm. Measurements are taken every 5 minutes over many years:
mm
tstamp
2010-01-01 00:00:00 0.0
2010-01-01 00:05:00 0.0
2010-01-01 00:10:00 0.0
2010-01-01 00:15:00 0.0
2010-01-01 00:20:00 0.0
........
What I want to get is the count of rainy days for each month of each year, ideally as a dataframe like the following:
tstamp rainy not rainy
2010-01 11 20
2010-02 20 8
......
2012-10 15 16
2012-11 30 0
What I'm able to obtain is a nested dict object like d = {year: {month: {'rainy': 10, 'not_rainy': 20}, ...}, ...}, made with this small code snippet:
from collections import defaultdict
d = defaultdict(lambda: defaultdict(dict))
for year in df.index.year.unique():
    try:
        for month in df.index.month.unique():
            # select one year-month with .loc, then sum to daily totals
            a = df.loc['{}-{}'.format(year, month)].resample('D').sum()
            d[year][month]['rainy'] = a[a['mm'] != 0].count()
            d[year][month]['not_rainy'] = a[a['mm'] == 0].count()
    except Exception:
        pass
But I think I'm missing an easier and more straightforward solution. Any suggestion?
One way is to use two groupby operations, first flagging each day as rainy or not and then counting the flags per month:
daily = df['mm'].gt(0).groupby(df.index.normalize()).any()
monthly = (daily.groupby(daily.index.to_period('M'))
                .value_counts()
                .unstack()
)
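As a rough, self-contained sketch of the same idea, the snippet below builds a small synthetic 5-minute series (the dates and rainfall values are made up for illustration, not taken from the linked dataset) and counts rainy and non-rainy days per month. The column names `rainy` and `not rainy` follow the layout requested in the question; a day is assumed to be rainy when its daily total is above zero.

```python
import pandas as pd

# Synthetic 5-minute data covering January and February 2010 (assumed values).
idx = pd.date_range("2010-01-01", "2010-02-28 23:55", freq="5min")
df = pd.DataFrame({"mm": 0.0}, index=idx)
df.index.name = "tstamp"
df.loc[pd.Timestamp("2010-01-03 10:00"), "mm"] = 0.4
df.loc[pd.Timestamp("2010-01-03 10:05"), "mm"] = 0.2
df.loc[pd.Timestamp("2010-02-10 08:00"), "mm"] = 1.0

# Step 1: one flag per day (1 if any rain fell that day, else 0).
is_rainy = df["mm"].resample("D").sum().gt(0).astype(int)

# Step 2: count the flags per calendar month.
by_month = is_rainy.groupby(is_rainy.index.to_period("M"))
rainy = by_month.sum()
monthly = pd.DataFrame({"rainy": rainy, "not rainy": by_month.size() - rainy})
monthly.index.name = "tstamp"
print(monthly)
```

Deriving the non-rainy count from the group size keeps a column of zeros for months where every day was rainy, which `value_counts().unstack()` would instead report as NaN.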
You can sum the rainfall by year and month like this. I don't see any non-rainy months:
df = pd.read_csv('rain.csv')
df['tstamp'] = pd.to_datetime(df['tstamp'])
df['month'] = df['tstamp'].dt.month
df['year'] = df['tstamp'].dt.year
df = df.groupby(by=['year', 'month'], as_index=False).sum()
print(df)
Output:
year month mm
0 2010 1 1.0
1 2010 2 15.4
2 2010 3 21.8
3 2010 4 9.6
4 2010 5 118.4
5 2010 6 82.8
6 2010 7 96.0
7 2010 8 161.6
8 2010 9 109.2
9 2010 10 51.2
10 2010 11 52.4
11 2010 12 39.6
12 2011 1 5.6
13 2011 2 0.8
14 2011 3 13.4
15 2011 4 1.8
16 2011 5 97.6
17 2011 6 167.8
18 2011 7 128.8
19 2011 8 67.6
20 2011 9 155.8
21 2011 10 71.6
22 2011 11 0.4
23 2011 12 29.4
24 2012 1 17.6
25 2012 2 2.2
26 2012 3 13.0
27 2012 4 55.8
28 2012 5 36.8
29 2012 6 108.4
30 2012 7 182.4
31 2012 8 191.8
32 2012 9 89.0
33 2012 10 93.6
34 2012 11 161.2
35 2012 12 26.4
Related
Year Week_Number DC_Zip Asin_code
1 2016 1 84105 NaN
2 2016 1 85034 NaN
3 2016 1 93711 NaN
4 2016 1 98433 NaN
5 2016 2 12206 21.0
6 2016 2 29306 10.0
7 2016 2 33426 11.0
8 2016 2 37206 1.0
9 2017 1 12206 266.0
10 2017 1 29306 81.0
11 2017 1 33426 NaN
12 2017 1 37206 NaN
13 2017 1 45216 99.0
14 2017 1 60160 100.0
15 2017 1 76110 76.0
16 2018 1 12206 562.0
17 2018 1 29306 184.0
18 2018 1 33426 NaN
19 2018 1 37206 NaN
20 2018 1 45216 187.0
21 2018 1 60160 192.0
22 2018 1 76110 202.0
23 2019 1 12206 511.0
24 2019 1 29306 NaN
25 2019 1 33426 224.0
26 2019 1 37206 78.0
27 2019 1 45216 160.0
28 2019 1 60160 NaN
29 2019 1 76110 221.0
30 2020 6 93711 NaN
31 2020 6 98433 NaN
32 2020 7 12206 74.0
33 2020 7 29306 22.0
34 2020 7 33426 32.0
35 2020 7 37206 10.0
36 2020 7 45216 34.0
I want to fill the NaN values with the average of Asin_code for that particular year. I am able to fill the values for 2016 with this code:
df["Asin_code"]=df.Asin_code.fillna(df.Asin_code.loc[(df.Year==2016)].mean(),axis=0)
But I am unable to do this for the whole dataframe.
Use groupby().transform() and fillna:
df['Asin_code'] = df['Asin_code'].fillna(df.groupby('Year').Asin_code.transform('mean'))
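To see why this works, here is a minimal sketch on a made-up six-row frame (the values are illustrative only). `transform('mean')` returns a Series aligned with the original rows, holding each row's per-year mean, so `fillna` can pick the matching value for every missing cell.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the numbers are assumptions, not the asker's data.
df = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017, 2017],
    "Asin_code": [np.nan, 21.0, 11.0, 266.0, np.nan, 100.0],
})

# Per-year mean, broadcast back to one value per original row.
year_mean = df.groupby("Year")["Asin_code"].transform("mean")

# Only the NaN cells are replaced; existing values are left untouched.
df["Asin_code"] = df["Asin_code"].fillna(year_mean)
print(df)
```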
Week Year new
0 43 2016 2016-10-24
1 44 2016 2016-10-31
2 51 2016 2016-12-19
3 2 2017 2017-01-09
4 5 2017 2017-01-30
5 12 2017 2017-03-20
6 52 2018 2018-12-24
7 53 2018 2018-12-31
8 1 2019 2018-12-31
9 2 2019 2019-01-07
10 5 2019 2019-01-28
11 52 2019 2019-12-23
How can I add a 0 in front of the week if its length is 1? I need to merge Year and Week together, as in 201702.
Try this
df["Week"] = df.Week.astype('str').str.zfill(2)
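The second half of the question, merging Year and Week, could then be handled with plain string concatenation. A short sketch, where the column name `YearWeek` is a hypothetical choice and the three rows are a cut-down version of the table above:

```python
import pandas as pd

df = pd.DataFrame({"Week": [43, 2, 5], "Year": [2016, 2017, 2017]})

# Pad single-digit weeks to two characters: 2 -> "02".
df["Week"] = df["Week"].astype(str).str.zfill(2)

# Concatenate the year and the padded week as strings.
df["YearWeek"] = df["Year"].astype(str) + df["Week"]
print(df)
```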
I have the following dataframe:
id value year audit
1 21 2007 NaN
1 36 2008 2011
1 7 2009 NaN
2 44 2007 NaN
2 41 2008 NaN
2 15 2009 NaN
3 51 2007 NaN
3 15 2008 2011
3 51 2009 NaN
4 10 2007 NaN
4 12 2008 NaN
4 24 2009 2011
5 30 2007 2011
5 35 2008 NaN
5 122 2009 NaN
Basically, I want to create another variable, audit2, where all the cells for a given id are 2011 if at least one audit value for that id is 2011.
I tried to put an if-statement inside a loop, but I could not get any results.
I would like to get this new dataframe:
id value year audit audit2
1 21 2007 NaN 2011
1 36 2008 2011 2011
1 7 2009 NaN 2011
2 44 2007 NaN NaN
2 41 2008 NaN NaN
2 15 2009 NaN NaN
3 51 2007 NaN 2011
3 15 2008 2011 2011
3 51 2009 NaN 2011
4 10 2007 NaN 2011
4 12 2008 NaN 2011
4 24 2009 2011 2011
5 30 2007 2011 2011
5 35 2008 NaN 2011
5 122 2009 NaN 2011
Could you help me please?
df.groupby('id')['audit'].transform(lambda s: s[s.first_valid_index()] if s.first_valid_index() is not None else np.nan)
output:
>>> df
0 2011.0
1 2011.0
2 2011.0
3 NaN
4 NaN
5 NaN
6 2011.0
7 2011.0
8 2011.0
9 2011.0
10 2011.0
11 2011.0
12 2011.0
13 2011.0
14 2011.0
Name: audit, dtype: float64
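A possibly simpler variant, sketched on a reduced two-id version of the data, relies on the groupby `first` aggregation, which skips NaN by default, and assigns the result to the requested audit2 column. It assumes, as in the question, that any non-missing audit value within an id is the one to propagate.

```python
import numpy as np
import pandas as pd

# Reduced version of the question's frame: id 1 has one audit, id 2 has none.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "audit": [np.nan, 2011, np.nan, np.nan, np.nan, np.nan],
})

# 'first' returns the first non-null value per group (NaN if there is none);
# transform broadcasts it back to every row of that group.
df["audit2"] = df.groupby("id")["audit"].transform("first")
print(df)
```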
I have a pandas dataframe that looks like this:
year week val1 val2
0 2017 45 10.1 20.2
0 2017 48 10.3 20.3
0 2017 49 10.4 20.4
0 2017 52 10.3 20.5
0 2018 1 10.1 20.2
0 2018 2 10.3 20.3
0 2018 5 10.4 20.4
0 2018 9 10.3 20.5
....
Notice that the weeks are not contiguous. What is the best way to fill in the missing rows, with the val1 and val2 numbers as NaN? E.g., so that my years would run from 2017 to 2018 and my weeks would be 45-52 and 1-9.
Thanks so much.
You can group by year and then reindex with the union of existing and missing weeks:
(df.set_index("week")
   .groupby("year")
   .apply(lambda x: x.reindex(x.index.union(np.arange(x.index.min(), x.index.max()))))
   .drop(columns="year")
   .reset_index()
   .rename(columns={"level_1": "week"}))
year week val1 val2
0 2017 45 10.1 20.2
1 2017 46 nan nan
2 2017 47 nan nan
3 2017 48 10.3 20.3
4 2017 49 10.4 20.4
5 2017 50 nan nan
6 2017 51 nan nan
7 2017 52 10.3 20.5
8 2018 1 10.1 20.2
9 2018 2 10.3 20.3
10 2018 3 nan nan
11 2018 4 nan nan
12 2018 5 10.4 20.4
13 2018 6 nan nan
14 2018 7 nan nan
15 2018 8 nan nan
16 2018 9 10.3 20.5
I'd create a reference dataframe and merge
ref = pd.DataFrame(
    [[y, w] for y, s in df.groupby('year').week for w in range(s.min(), s.max() + 1)],
    columns=['year', 'week']
)
ref.merge(df, 'left')
year week val1 val2
0 2017 45 10.1 20.2
1 2017 46 NaN NaN
2 2017 47 NaN NaN
3 2017 48 10.3 20.3
4 2017 49 10.4 20.4
5 2017 50 NaN NaN
6 2017 51 NaN NaN
7 2017 52 10.3 20.5
8 2018 1 10.1 20.2
9 2018 2 10.3 20.3
10 2018 3 NaN NaN
11 2018 4 NaN NaN
12 2018 5 10.4 20.4
13 2018 6 NaN NaN
14 2018 7 NaN NaN
15 2018 8 NaN NaN
16 2018 9 10.3 20.5
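For readers who want to try the reference-frame approach end to end, here is a small runnable sketch on a four-row frame (the rows are illustrative, and the result name `filled` is a hypothetical choice). It spells out the merge keys explicitly instead of relying on the shared column names.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018],
    "week": [50, 52, 1, 3],
    "val1": [10.1, 10.3, 10.1, 10.4],
})

# Reference frame: every week between the first and last one seen in each year.
ref = pd.DataFrame(
    [(y, w) for y, s in df.groupby("year")["week"] for w in range(s.min(), s.max() + 1)],
    columns=["year", "week"],
)

# A left merge keeps every reference row; unmatched weeks get NaN values.
filled = ref.merge(df, how="left", on=["year", "week"])
print(filled)
```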
I'd make use of the time series / date functionality: combine the year and week columns into a datetime index, then resample the dataframe with something like:
df.index = pd.to_datetime(
    df.year.map(str) + " " + df.week.map(str) + " 3",
    format="%Y %W %w"
)
df = df.resample("W").mean()
df.year = df.index.year
# DatetimeIndex.week was removed in pandas 2.0; isocalendar().week replaces it
df.week = df.index.isocalendar().week
Note that your index is overwritten.
I have two data frames in Python. The first is raw rainfall data for a single day of the year, and the second is the sum of daily rainfall obtained with groupby.
One data frame looks like this (with many more rows in between device_ids):
>>> df1
device_id rain day month year
0 9z849362-b05d-4317-96f5-f267c1adf8d6 0.0 31 12 2016
1 9z849362-b05d-4317-96f5-f267c1adf8d6 0.0 31 12 2016
6 e7z581f0-2693-42ad-9896-0048550ccda7 0.0 31 12 2016
11 e7z581f0-2693-42ad-9896-0048550ccda7 0.0 31 12 2016
12 ceez972b-135f-45b3-be4w-7c23102676bq 0.2 31 12 2016
13 ceez972b-135f-45b3-be4w-7c23102676bq 0.0 31 12 2016
18 ceez972b-135f-45b3-be4w-7c23102676bq 0.0 31 12 2016
19 1d28dz3a-c923-4967-a7bb-5881d232c9a7 0.0 31 12 2016
24 1d28dz3a-c923-4967-a7bb-5881d232c9a7 0.0 31 12 2016
25 a044ag4f-fd7c-4ae4-bff3-9158cebad3b1 0.0 31 12 2016
29 a044ag4f-fd7c-4ae4-bff3-9158cebad3b1 0.0 31 12 2016
29 a044ag4f-fd7c-4ae4-bff3-9158cebad3b1 0.0 31 12 2016
... ... ... ... ... ...
3903 9z849362-b05d-4317-96f5-f267c1adf8d6 0.0 31 12 2016
3904 9z849362-b05d-4317-96f5-f267c1adf8d6 0.0 31 12 2016
3905 9z849362-b05d-4317-96f5-f267c1adf8d6 0.0 31 12 2016
And the other looks something like this:
>>> df2
rain
device_id
1d28dz3a-c923-4967-a7bb-5881d232c9a7 0.0
9z849362-b05d-4317-96f5-f267c1adf8d6 0.0
a044ag4f-fd7c-4ae4-bff3-9158cebad3b1 1.2
ceez972b-135f-45b3-be4w-7c23102676bq 2.2
e7z581f0-2693-42ad-9896-0048550ccda7 0.2
... which I got by using:
df2 = df1.groupby(['device_id'])[["rain"]].sum()
I want my final data frame to look like this:
>>> df3
rain day month year
device_id
1d28dz3a-c923-4967-a7bb-5881d232c9a7 0.0 31 12 2016
9z849362-b05d-4317-96f5-f267c1adf8d6 0.0 31 12 2016
a044ag4f-fd7c-4ae4-bff3-9158cebad3b1 1.2 31 12 2016
ceez972b-135f-45b3-be4w-7c23102676bq 2.2 31 12 2016
e7z581f0-2693-42ad-9896-0048550ccda7 0.2 31 12 2016
Which is to say that I want the "day month year" columns from df1 to be added to df2. I'm not sure if I should use merge, append, or do something else.
Maybe this will work? Group by day, month, and year as well.
df.groupby(['device_id', 'day', 'month', 'year']).sum()
rain
device_id day month year
1d28dz3a-c923-4967-a7bb-5881d232c9a7 31 12 2016 0.0
9z849362-b05d-4317-96f5-f267c1adf8d6 31 12 2016 0.0
a044ag4f-fd7c-4ae4-bff3-9158cebad3b1 31 12 2016 0.0
ceez972b-135f-45b3-be4w-7c23102676bq 31 12 2016 0.2
e7z581f0-2693-42ad-9896-0048550ccda7 31 12 2016 0.0
Or you could add reset_index to return these columns to the DataFrame like
df.groupby(['device_id', 'day', 'month', 'year']).sum().reset_index()
0 1d28dz3a-c923-4967-a7bb-5881d232c9a7 31 12 2016 0.0
1 9z849362-b05d-4317-96f5-f267c1adf8d6 31 12 2016 0.0
2 a044ag4f-fd7c-4ae4-bff3-9158cebad3b1 31 12 2016 0.0
3 ceez972b-135f-45b3-be4w-7c23102676bq 31 12 2016 0.2
4 e7z581f0-2693-42ad-9896-0048550ccda7 31 12 2016 0.0
Or the following should match your index / column structure exactly.
df.groupby(['device_id', 'day', 'month', 'year']).sum().reset_index([1, 2, 3])
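A small sketch of that last variant on made-up data (the short device ids and rain values are placeholders, not the asker's) shows the resulting shape: device_id stays as the index, while day, month, and year come back as ordinary columns. Naming the levels, rather than passing positions, is an optional choice made here for readability.

```python
import pandas as pd

# Placeholder data: two devices, two readings each, all on the same day.
df1 = pd.DataFrame({
    "device_id": ["a", "a", "b", "b"],
    "rain": [0.0, 0.2, 1.0, 0.2],
    "day": 31,
    "month": 12,
    "year": 2016,
})

df3 = (
    df1.groupby(["device_id", "day", "month", "year"])[["rain"]]
    .sum()
    .reset_index(level=["day", "month", "year"])  # keep device_id as the index
)

# Reorder columns to match the layout asked for in the question.
df3 = df3[["rain", "day", "month", "year"]]
print(df3)
```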