Using numpy, how do you calculate snowfall per month?

I have a data set with snowfall records per day for one year. Date variable is in YYYYMMDD form.
Date Snow
20010101 0
20010102 10
20010103 5
20010104 3
20010105 0
...
20011231 0
The actual data is here
https://github.com/emily737373/emily737373/blob/master/COX_SNOW-1.csv
I want to calculate the number of days it snowed each month. I know how to do this with pandas, but for a school project I need to do it using only numpy. I cannot import datetime either; it must be done with numpy alone.
The output should be in this form
Month # days snowed
January 13
February 19
March 20
...
December 15
My question is how do I only count the number of days it snowed (basically when snow variable is not 0) without having to do it separately for each month?

I hope you are allowed to use built-in modules such as datetime, since it is very useful when working with dates.
import numpy as np
import datetime as dt
df = np.genfromtxt('test_files/COX_SNOW-1.csv', delimiter=',', skip_header=1, dtype=str)
date = np.array([dt.datetime.strptime(d, "%Y%m%d").month for d in df[:, 0]])
snow = df[:, 1].copy().astype(np.int32)
has_snowed = snow > 0
for month in range(1, 13):
    month_str = dt.datetime(year=1, month=month, day=1).strftime('%B')
    how_much_snow = len(snow[has_snowed & (date == month)])
    print(month_str, ':', how_much_snow)
I loaded the data as str so that we can reliably parse the Date column as dates later on. That's why we also need to explicitly convert the Snow column to int32; otherwise the > comparison won't work.
The output is as follows:
January : 13
February : 19
March : 20
April : 13
May : 8
June : 9
July : 2
August : 7
September : 9
October : 19
November : 16
December : 15
Let me know if this worked for you or if you have any further questions.
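Since the assignment forbids datetime, here is a minimal numpy-only sketch of the same idea (the inline arrays are hypothetical stand-ins for the CSV columns): the month can be recovered by truncating the YYYYMMDD strings.

```python
import numpy as np

# Hypothetical stand-in for the CSV (Date as YYYYMMDD strings, Snow as ints)
dates = np.array(['20010101', '20010102', '20010215', '20010216', '20010301'])
snow = np.array([0, 10, 5, 0, 3])

# Casting to 'U6' truncates each string to 'YYYYMM'; the month is the value mod 100
months = dates.astype('U6').astype(int) % 100

month_names = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

has_snowed = snow > 0
for month in range(1, 13):
    n_days = int(np.count_nonzero(has_snowed & (months == month)))
    print(month_names[month - 1], ':', n_days)
```

With the real file, `dates` and `snow` would come from the same `np.genfromtxt` call as above (column 0 and column 1).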

Related

How to eliminate leap years in pandas data frame

I have daily temperature data from 1901-1940. I want to exclude leap years i.e. remove any temperature data that falls on 2/29. My data is currently one long array. I am reshaping it so that every year is a row and every column is a day. I'm trying to remove the leap years with the last line of code here:
import requests
import pandas as pd
from datetime import date
params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly", "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
#package into the dataframe
df = pd.DataFrame(columns=['date', 'tmax'], data=resp.json()['data'])
#convert the date column to datetimes
df['date']=pd.to_datetime(df['date'])
#select years
mask = (df['date'] >= '1900-01-01') & (df['date'] <= '1940-12-31')
Baseline=df.loc[mask]
#get rid of leap years:
Baseline=Baseline.loc[(Baseline['date'].dt.day!=29) & (Baseline['date'].dt.month!=2)]
but when I reshape the array I notice that there are 366 columns instead of 365, so I don't think I'm actually getting rid of February 29th data. How would I completely eliminate any temperature data recorded on 2/29 throughout my data set? I only want 365 data points for each year.
daily=pd.DataFrame(data={'date':Baseline.date,'tmax':Baseline.tmax})
daily['day']=daily.date.dt.dayofyear
daily['year']=daily.date.dt.year
daily.pivot(index='year', columns='day', values='tmax')
The source of your problem is that you used daily.date.dt.dayofyear.
Each day in a year, including Feb 29 has its own number.
To make things worse, e.g. Mar 1 has dayofyear:
61 in leap years,
60 in non-leap years.
One of possible solutions is to set the day column to a string
representation of month and day.
To provide proper sort in the pivoted table, the month part should come first.
So, after you convert date column to datetime, to create both
additional columns run:
daily['year'] = daily.date.dt.year
daily['day'] = daily.date.dt.strftime('%m-%d')
Then you can filter out Feb 29 and generate the pivot table in one go:
result = daily[daily.day != '02-29'].pivot(index='year', columns='day', values='tmax')
For some limited source data sample, other than yours, I got:
day 02-27 02-28 03-01 03-02
year
2020 11 10 14 15
2021 11 21 22 24
An alternative
Create 3 additional columns:
daily['year'] = daily.date.dt.year
daily['month'] = daily.date.dt.strftime('%m')
daily['day'] = daily.date.dt.strftime('%d')
Note the string representation of month and day, to keep leading
zeroes.
Then filter out Feb 29 and generate the pivot table with a MultiIndex on columns:
result = daily[(daily.month != '02') | (daily.day != '29')].pivot(
    index='year', columns=['month', 'day'], values='tmax')
This time the result is:
month 02 03
day 27 28 01 02
year
2020 11 10 14 15
2021 11 21 22 24
The easy way is to eliminate those items before building the array.
import requests
from datetime import date
params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly", "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
vals = resp.json()
rows = [row for row in vals['data'] if '02-29' not in row[0]]
print(rows)
You get 366 columns because of using dayofyear. That will calculate the day per the actual calendar (i.e. without removing 29 Feb).
To see this:
>>> daily.iloc[1154:1157]
date tmax day year
1154 1904-02-28 38.000000 59 1904
1156 1904-03-01 39.000000 61 1904
1157 1904-03-02 37.000000 62 1904
Notice the day goes from 59 to 61 (the 60th day was 29 February 1904).
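To double-check that the string-based filter really leaves 365 columns, here is a small self-contained sketch using synthetic data (one leap year, one normal year) in place of the ACIS feed:

```python
import pandas as pd

# Synthetic daily data covering a leap year (1904) and a non-leap year (1905)
dates = pd.date_range('1904-01-01', '1905-12-31', freq='D')
daily = pd.DataFrame({'date': dates, 'tmax': range(len(dates))})

daily['year'] = daily.date.dt.year
daily['day'] = daily.date.dt.strftime('%m-%d')

# Drop Feb 29 by its string label, then pivot: every year has exactly 365 columns
result = daily[daily.day != '02-29'].pivot(index='year', columns='day', values='tmax')
print(result.shape)  # (2, 365)
```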

select a specifc day from a data set, else the next working day if not available

I have a large dataset spanning many years and I want to subset this data frame by selecting data based on a specific day of the month using python.
This is simple enough and I have achieved with the following line of code:
df[df.index.day == 12]
This selects data from the 12th of each month for all years in the data set. Great.
The problem I have, however, is that the original data set is based on working-day data. Therefore the 12th might actually be a weekend or national holiday and thus doesn't appear in the data set; nothing is returned for that month.
What I would like to happen is to select the 12th where available, else select the next working day in the data set.
All help appreciated!
Here's a solution that looks at three days from every month (12, 13, and 14), and then picks the minimum. If the 12th is a weekend it won't exist in the original dataframe, and you'll get the 13th; if that's missing too, the 14th.
Here's the code:
# Create dummy data - initial range
df = pd.DataFrame(pd.date_range("2018-01-01", "2020-06-01"), columns = ["date"])
# Create dummy data - Drop weekends
df = df[df.date.dt.weekday.isin(range(5))]
# get only the 12, 13, and 14 of every month
# group by year and month.
# get the minimum
df[df.date.dt.day.isin([12, 13, 14])].groupby(by=[df.date.dt.year, df.date.dt.month], as_index=False).min()
Result:
date
0 2018-01-12
1 2018-02-12
2 2018-03-12
3 2018-04-12
4 2018-05-14
5 2018-06-12
6 2018-07-12
7 2018-08-13
8 2018-09-12
9 2018-10-12
...
Edit
Per a question in the comments about national holidays: the same solution applies. Instead of picking 3 days (12, 13, 14), pick a larger number (e.g. 12-18). Then, get the minimum of these that actually exists in the dataframe - and that's the first working day starting with the 12th.
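A sketch of that wider-window variant (days 12-18 give a full week of slack; the dummy business-day data stands in for the real working-day index):

```python
import pandas as pd

# Dummy data: business days only, so weekend 12ths are absent
df = pd.DataFrame({'date': pd.bdate_range('2018-01-01', '2018-12-31')})

# Keep days 12-18, then take the earliest surviving date per (year, month)
sub = df[df.date.dt.day.isin(range(12, 19))]
first_working = sub.groupby(
    [sub.date.dt.year.rename('year'), sub.date.dt.month.rename('month')])['date'].min()
print(first_working)
```

A 7-day window always contains at least five business days, so every month is guaranteed a hit (add more days if long holiday runs are possible).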
You can backfill the dataframe first to fill the missing values then select the date you want
df = df.asfreq('d', method='bfill')
Then you can do df[df.index.day == 12]
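A minimal sketch of this backfill approach with dummy business-day data (the value column is hypothetical):

```python
import numpy as np
import pandas as pd

# Dummy data indexed on business days only, so some 12ths are missing
idx = pd.bdate_range('2018-01-01', '2018-12-31')
df = pd.DataFrame({'value': np.arange(len(idx))}, index=idx)

# Reindex to calendar-day frequency; gaps take the next available (working-day) value
filled = df.asfreq('d', method='bfill')

# Every month now has a row dated the 12th
twelfths = filled[filled.index.day == 12]
```

One caveat: when the 12th was missing, the selected row carries the next working day's value but is still labeled with the 12th, so the original working-day date is not preserved.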
This is my approach, I will explain each line below the code. Please feel free to add a comment if there's something unclear:
!pip install workalendar #Install the module
import pandas as pd #Import pandas
from workalendar.usa import NewYork #Import the required country and city
df = pd.DataFrame(pd.date_range(start='1/1/2018', end='12/31/2018')).rename(columns={0:'Dates'}) #Create a dataframe with dates for the year 2018
cal = NewYork() #Instance the calendar
df['Is_Working_Day'] = df['Dates'].map(lambda x: cal.is_working_day(x)) #Create an extra column, True for working days, False otherwise
df[(df['Dates'].dt.day >= 12) & (df['Is_Working_Day'] == True)].groupby(df['Dates'].dt.month)['Dates'].first()
Essentially this last line returns all days with values equal to or higher than 12 that are actual working days; we then group them by month and return the first day for each where this condition is met (day >= 12 and Is_Working_Day == True).
Output:
Dates
1 2018-01-12
2 2018-02-13
3 2018-03-12
4 2018-04-12
5 2018-05-14
6 2018-06-12
7 2018-07-12
8 2018-08-13
9 2018-09-12
10 2018-10-12
11 2018-11-13
12 2018-12-12

Is there some Python function like .to_period that could help me extract a fiscal year's week number based on a date?

Essentially, I want to apply some lambda function(?) of some sort to apply to a column in my dataframe that contains dates. Originally, I used dt.week to extract the week number but the calendar dates don't match up with the fiscal year I'm using (Apr 2019 - Mar 2020).
I have tried using pandas' function to_period('Q-MAR') but that seems to be a little off. I have been researching other ways but nothing seems to work properly.
Apr 1 2019 -> Week 1
Apr 3 2019 -> Week 1
Apr 30 2019 -> Week 5
May 1 2019 -> Week 5
May 15 2019 -> Week 6
Thank you for any advice or tips in advance!
You can create a DataFrame which contains the dates with a frequency of weeks:
date_rng = pd.date_range(start='01/04/2019',end='31/03/2020', freq='W')
df = pd.DataFrame(date_rng, columns=['date'])
You can then query df for which index the date is smaller than or equal to the value:
df.index[df.date <= query_date][-1]
This will output the largest index which is smaller than or equal to the date you want to examine. I imagine you can pour this into a lambda yourself?
NOTE
This solution has limitations, the biggest one being you have to manually define the datetime dataframe.
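If the fiscal weeks can be treated as plain 7-day blocks counted from the fiscal-year start (an assumption; the question doesn't pin down the week convention), the lambda needs no lookup table at all:

```python
import pandas as pd

# Assumed fiscal-year start (Apr 2019 - Mar 2020, per the question)
fy_start = pd.Timestamp('2019-04-01')

df = pd.DataFrame({'date': pd.to_datetime(
    ['2019-04-01', '2019-04-03', '2019-04-30', '2019-05-01'])})

# Week number = completed 7-day blocks since the fiscal-year start, plus 1
df['fiscal_week'] = df['date'].map(lambda d: (d - fy_start).days // 7 + 1)
```

This reproduces the first four sample rows (weeks 1, 1, 5, 5). Note that under this convention May 15 2019 falls in week 7, not week 6 as in the sample, so confirm which week boundary is actually intended.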
I created a fiscal calendar that can later be turned into a function in Spark:
from fiscalyear import *
import numpy as np
import pandas as pd
from pyspark.sql.functions import col, explode, expr, weekofyear, year
beginDate = '2016-01-01'
endDate = '2021-12-31'
#create empty dataframe
df = spark.createDataFrame([()])
#create date from given date range
df1 = df.withColumn("date",explode(expr(f"sequence(to_date('{beginDate}'), to_date('{endDate}'), interval 1 day)")))
# get week
df1 = df1.withColumn('week',weekofyear(col("date"))).withColumn('year',year(col("date")))
#translate to use pandas in python
df1 = df1.toPandas()
#get fiscal year
df1['financial_year'] = df1['date'].map(lambda x: x.year if x.month > 3 else x.year-1)
df1['date'] = pd.to_datetime(df1['date'])
#get calendar qtr
df1['quarter_old'] = df1['date'].dt.quarter
#get fiscal calendar
df1['quarter'] = np.where(df1['financial_year']< (df1['year']),df1['quarter_old']+3,df1['quarter_old'])
df1['quarter'] = np.where(df1['financial_year'] == (df1['year']),df1['quarter_old']-1,df1['quarter'])
#get fiscal week by shifting as per the number of months the fiscal calendar differs from the usual calendar
df1["fiscal_week"] = df1.week.shift(91)
df1 = df1.loc[(df1['date'] >= '2020-01-01')]
df1.display()

grouping time-series data based on starting and ending date

I have time-series data of a yearly sports tournament, with the date when each game was played. I want to group the games by the season (year) they were played in. Each season starts in August and ends in July of the NEXT year.
How would I go about grouping the games by season, like -
season(2016-2017), season(2017-2018), etc..
This Answer involving df.resample() may be related, but I'm not sure how I'd go about doing it.
This is what the date column looks like:
DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
...
29/09/18
07/10/18
28/10/18
03/11/18
I want to group by seasons so that I can perform visualization operations over the aggregated data.
UPDATE: For the time being my solution is to split up the dataframe into groups of 32 as I know each season has 32 games. This is the code I've used:
split_df = np.array_split(df, np.arange(0, len(df),32))
But I'd rather prefer something more elegant and more inclusive of time-series data so I'll keep the question open.
The key to success is proper grouping, in your case pd.Grouper(key='DATA', freq='AS-AUG').
Note that freq='AS-AUG' states that your groups should start from the start of
August each year.
Look at the following script:
import numpy as np
import pandas as pd

# Source columns
dates = [ '01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09',
'25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
'25/12/10', '20/04/11', '31/07/11' ]
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))
# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})
# Convert string date to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')
# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))
If you print the results:
for name, group in gr:
    print()
    print(name)
    print(group)
you will get:
2008-08-01 00:00:00
DATA SCORE_X SCORE_Y
0 2009-04-01 16 11
1 2009-07-31 10 7
2009-08-01 00:00:00
DATA SCORE_X SCORE_Y
2 2009-08-01 19 6
3 2009-09-26 14 5
4 2009-10-04 8 11
5 2009-12-17 12 19
6 2010-01-25 0 0
7 2010-04-20 17 6
8 2010-07-31 18 2
2010-08-01 00:00:00
DATA SCORE_X SCORE_Y
9 2010-08-01 15 18
10 2010-10-28 2 4
11 2010-11-03 8 16
12 2010-12-25 13 1
13 2011-04-20 19 7
14 2011-07-31 8 3
As you can see, each group starts on the 1st of August and ends on the 31st of July.
Then you can do whatever you want with your groups.
Use -
df.groupby(df['DATE'].dt.year).count()
Output
DATE
DATE
2009 5
2018 4
Custom Season Grouping
min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()
Output
DATE
DATE
(2009-07-31, 2010-07-31] 3
(2010-07-31, 2011-07-31] 0
(2011-07-31, 2012-07-31] 0
(2012-07-31, 2013-07-31] 0
(2013-07-31, 2014-07-31] 0
(2014-07-31, 2015-07-31] 0
(2015-07-31, 2016-07-31] 0
(2016-07-31, 2017-07-31] 0
(2017-07-31, 2018-07-31] 1
Resampling using 'A-JUL' as an anchored offset alias should do the trick:
>>> df
SAMPLE
DATE
2009-01-30 1
2009-07-10 4
2009-11-20 3
2010-01-01 5
2010-05-13 1
2010-08-01 1
>>> df.resample('A-JUL').sum()
SAMPLE
DATE
2009-07-31 5
2010-07-31 9
2011-07-31 1
A indicates it is a yearly interval, -JUL indicates it ends in July.
You could build a season column and group by that. In the code below, I used pandas.DateOffset() to move all dates 7 months back, so a game that happened in August looks like it happened in January, aligning the season year with the calendar year. Building the season string is fairly straightforward after that.
import pandas as pd
from datetime import date
dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])
# copy the date column to a separate dataframe to do the work
df_tmp = df[['date']].copy()  # .copy() avoids a SettingWithCopyWarning on the assignments below
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)
# copy season column to the main dataframe
df['season'] = df_tmp['season']
df.groupby('season').count()

Average hourly week profile for a year excluding weekend days and holidays

With Pandas I have created a DataFrame from an imported .csv file (this file is generated through simulation). The DataFrame consists of half-hourly energy consumption data for a single year. I have already created a DateTimeindex for the dates.
I would like to be able to reformat this data into average hourly week and weekend profile results. With the week profile excluding holidays.
DataFrame:
Date_Time Equipment:Electricity:LGF Equipment:Electricity:GF
01/01/2000 00:30 0.583979872 0.490327348
01/01/2000 01:00 0.583979872 0.490327348
01/01/2000 01:30 0.583979872 0.490327348
01/01/2000 02:00 0.583979872 0.490327348
I found an example (Getting the average of a certain hour on weekdays over several years in a pandas dataframe) that explains doing this for several years, but not explicitly for a week (without holidays) and weekend.
I realised that there is no resampling techniques in Pandas that do this directly, I used several aliases (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) for creating Monthly and Daily profiles.
I was thinking of using the business day frequency and create a new dateindex with working days and compare that to my DataFrame datetimeindex for every half hour. Then return values for working days and weekend days when true or false respectively to create a new dataset, but am not sure how to do this.
PS; I am just getting into Python and Pandas.
Dummy data (for future reference, more likely to get an answer if you post some in a copy-paste-able form)
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'a': np.random.randn(1000)},
                  index=pd.date_range(start='2000-01-01', periods=1000, freq='30T'))
Here's an approach. First define a US (or modify as appropriate) business day offset with holidays, and generate a range covering your dates.
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
bday_over_df = pd.date_range(start=df.index.min().date(),
end=df.index.max().date(), freq=bday_us)
Then, develop your two grouping columns. An hour column is easy.
df['hour'] = df.index.hour
For weekday/weekend/holiday, define a function to group the data.
def group_day(date):
    if date.weekday() in [5, 6]:
        return 'weekend'
    elif date.date() in bday_over_df:
        return 'weekday'
    else:
        return 'holiday'
df['day_group'] = df.index.map(group_day)
Then, just group by the two columns as you wish.
In [140]: df.groupby(['day_group', 'hour']).sum()
Out[140]:
a
day_group hour
holiday 0 1.890621
1 -0.029606
2 0.255001
3 2.837000
4 -1.787479
5 0.644113
6 0.407966
7 -1.798526
8 -0.620614
9 -0.567195
10 -0.822207
11 -2.675911
12 0.940091
13 -1.601885
14 1.575595
15 1.500558
16 -2.512962
17 -1.677603
18 0.072809
19 -1.406939
20 2.474293
21 -1.142061
22 -0.059231
23 -0.040455
weekday 0 9.192131
1 2.759302
2 8.379552
3 -1.189508
4 3.796635
5 3.471802
... ...
18 -5.217554
19 3.294072
20 -7.461023
21 8.793223
22 4.096128
23 -0.198943
weekend 0 -2.774550
1 0.461285
2 1.522363
3 4.312562
4 0.793290
5 2.078327
6 -4.523184
7 -0.051341
8 0.887956
9 2.112092
10 -2.727364
11 2.006966
12 7.401570
13 -1.958666
14 1.139436
15 -1.418326
16 -2.353082
17 -1.381131
18 -0.568536
19 -5.198472
20 -3.405137
21 -0.596813
22 1.747980
23 -6.341053
[72 rows x 1 columns]
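Since the question asks for average profiles rather than totals, swap .sum() for .mean(). A stripped-down sketch on the same kind of dummy data (holiday handling omitted for brevity, so this is weekday/weekend only):

```python
import numpy as np
import pandas as pd

# Half-hourly dummy data, as in the answer above
df = pd.DataFrame(data={'a': np.random.randn(1000)},
                  index=pd.date_range(start='2000-01-01', periods=1000, freq='30min'))
df['hour'] = df.index.hour
# Simplified grouping: Mon-Fri vs Sat-Sun (no holiday calendar here)
df['day_group'] = np.where(df.index.weekday < 5, 'weekday', 'weekend')

# Mean consumption per (group, hour) -- the "average hourly profile"
profile = df.groupby(['day_group', 'hour'])['a'].mean()
```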
