I created an hourly dates dataframe, and now I would like to create a column that flags whether each row (hour) is in Daylight Saving Time or not. For example, in summer hours, the flag should == 1, and in winter hours, the flag should == 0.
# Localized dates dataframe
dates = pd.DataFrame(data=pd.date_range('2018-1-1', '2019-1-1', freq='h', tz='America/Denver'), columns=['date_time'])
# My failed attempt to create the flag column
dates['dst_flag'] = np.where(dates['date_time'].dt.daylight_saving_time == True, 1, 0)
There's a nice link in the comments that at least lets you do this manually. AFAIK, there isn't a built-in vectorized way to do this.
import pandas as pd
import numpy as np
from pytz import timezone
# Generate data (as opposed to index)
date_range = pd.to_datetime(pd.date_range('1/1/2018', '1/1/2019', freq='h', tz='America/Denver'))
date_range = [date for date in date_range]
# Localized dates dataframe
df = pd.DataFrame(data=date_range, columns=['date_time'])
# Map transition times to year for some efficiency gain
tz = timezone('America/Denver')
transition_times = tz._utc_transition_times[1:]
transition_times = [t.astimezone(tz) for t in transition_times]
transition_times_by_year = {}
for start_time, stop_time in zip(transition_times[::2], transition_times[1::2]):
    year = start_time.year
    transition_times_by_year[year] = [start_time, stop_time]
# If the date is in DST, mark true, else false
def mark_dst(dates):
    for date in dates:
        start_dst, stop_dst = transition_times_by_year[date.year]
        yield start_dst <= date <= stop_dst
df['dst_flag'] = [dst_flag for dst_flag in mark_dst(df['date_time'])]
# Do a quick sanity check to make sure we did this correctly for year 2018
dst_start = df[df['dst_flag'] == True]['date_time'].iloc[0]   # first DST time 2018
dst_end = df[df['dst_flag'] == True]['date_time'].iloc[-1]    # last DST time 2018
print(dst_start)
print(dst_end)
this outputs:
2018-03-11 07:00:00-06:00
2018-11-04 06:00:00-07:00
which is likely correct. I didn't do the UTC conversions by hand to check that the hours are exactly right for the given timezone, but you can at least verify the dates with a quick Google search.
Some gotchas:
pd.date_range generates a DatetimeIndex, not column data. I changed your original code slightly so the dates are data rather than the index, since I assume you already have the data.
tz._utc_transition_times has an odd structure. It holds the start/stop UTC DST transition times, but the earliest entries are irregular. It should be reliable from 1965 onward; if you need earlier dates, change tz._utc_transition_times[1:] to tz._utc_transition_times, and note that not all years before 1965 are present.
tz._utc_transition_times is a private attribute. It is liable to change without warning or notice, and may or may not work for future or past versions of pytz. I'm using pytz version 2017.3. I recommend you run this code to make sure the output matches, and if not, use version 2017.3. For a check that avoids pytz internals entirely, see the sketch below.
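If you want a sanity check that does not rely on pytz internals at all, datetime.dst() on a localized timestamp returns a non-zero offset exactly when that instant is in DST. A minimal sketch (the two test dates are arbitrary examples):
import datetime
from pytz import timezone
tz = timezone('America/Denver')
summer = tz.localize(datetime.datetime(2018, 7, 1, 12))
winter = tz.localize(datetime.datetime(2018, 1, 1, 12))
print(bool(summer.dst()))  # True -> DST in effect
print(bool(winter.dst()))  # False -> standard time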
HTH, good luck with your research/regression problem!
If you are looking for a vectorized way of doing this (which you probably should be), you can use something like the code below.
The fundamental idea behind this is to find the difference between the current time in your timezone and the UTC time. In the winter months, the offset will be one extra hour behind UTC. Whatever the offsets are, add the constant needed to map them to 1 and 0 for the flag.
In Denver, summer months are UTC-6 and winter months are UTC-7. So, if you take the difference between the tz-aware time in Denver and UTC time, then add 7, you'll get a value of 1 for summer months and a value of 0 for winter months.
import pandas as pd
start = pd.to_datetime('2020-10-30')
end = pd.to_datetime('2020-11-02')
dates = pd.date_range(start=start, end=end, freq='h', tz='America/Denver')
df1 = pd.DataFrame({'dst_flag': 1, 'date1': dates.tz_localize(None)}, index=dates)
# add extra day on each end so that there are no nan's after the join
dates = pd.to_datetime(pd.date_range(start=start - pd.to_timedelta(1, 'd'), end=end + pd.to_timedelta(1, 'd'), freq='h'), utc=True)
df2 = pd.DataFrame({'date2': dates.tz_localize(None)}, index=dates)
out = df1.join(df2)
out['dst_flag'] = (out['date1'] - out['date2']) / pd.to_timedelta(1, unit='h') + 7
out.drop(columns=['date1', 'date2'], inplace=True)
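If you would rather avoid the join, the same offset arithmetic works directly on a tz-aware DatetimeIndex by comparing the naive wall-clock times with the naive UTC times. A sketch under the same Denver assumptions:
import pandas as pd
idx = pd.date_range('2018-01-01', '2019-01-01', freq='h', tz='America/Denver')
# Wall clock minus UTC clock gives the UTC offset in hours (-6 or -7 in Denver)
offset_hours = (idx.tz_localize(None) - idx.tz_convert('UTC').tz_localize(None)) / pd.Timedelta('1h')
dst_flag = (offset_hours + 7).astype(int)  # 1 in summer, 0 in winter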
Here is what I ended up doing, and it works for my purposes:
import pandas as pd
import pytz
# Create dates table and flag Daylight Saving Time dates
dates = pd.DataFrame(data=pd.date_range('2018-1-1', '2018-12-31-23', freq='h'), columns=['date_time'])
# Create a list of start and end dates for DST in each year, in UTC time
dst_changes_utc = pytz.timezone('America/Denver')._utc_transition_times[1:]
# Convert to local times from UTC times and then remove timezone information
dst_changes = [pd.Timestamp(i).tz_localize('UTC').tz_convert('America/Denver').tz_localize(None) for i in dst_changes_utc]
flag_list = []
# .iteritems() was removed in pandas 2.0; .items() is the current spelling
for index, row in dates['date_time'].items():
    # Isolate the start and end dates for DST in each year
    dst_dates_in_year = [date for date in dst_changes if date.year == row.year]
    spring = dst_dates_in_year[0]
    fall = dst_dates_in_year[1]
    if (row >= spring) & (row < fall):
        flag = 1
    else:
        flag = 0
    flag_list.append(flag)
print(flag_list)
dates['dst_flag'] = flag_list
del flag_list
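For reference, the loop above can also be vectorized with np.searchsorted. This is a sketch that assumes, as the loop does, that dst_changes alternates strictly between spring starts and fall ends:
import numpy as np
# side='right' makes a timestamp exactly at a spring transition count as DST
pos = np.searchsorted(pd.DatetimeIndex(dst_changes).values, dates['date_time'].values, side='right')
dates['dst_flag'] = pos % 2  # odd position -> between a start and an end -> DST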
The following vectorized way seems to work fine.
The idea behind it is the same as Nick Klavoht's idea: find the difference between the current time in your timezone and the UTC time.
# Localized dates dataframe
df = pd.DataFrame(data=pd.date_range('2018-1-1', '2019-1-1', freq='h', tz='America/Denver'), columns=['date_time'])
df['utc_offset'] = df['date_time'].dt.strftime('%z').str[0:3].astype(float)
df['utc_offset_shifted'] = df['utc_offset'].shift(-1)
df['dst'] = df['utc_offset'] - df['utc_offset_shifted']
df_dst = df[(df['dst'] != 0) & df['dst'].notna()]  # drop the NaN introduced by the shift
df_dst = df_dst.drop(['utc_offset', 'utc_offset_shifted'], axis=1).reset_index(drop=True)
print(df_dst)
This outputs:
date_time dst
0 2018-03-11 01:00:00-07:00 -1.0
1 2018-11-04 01:00:00-06:00 1.0
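If you want the 0/1 flag on every row rather than just the transition rows, the utc_offset column already holds everything needed. A sketch, assuming (as in Denver) exactly two distinct offsets occur in the data:
df['dst_flag'] = (df['utc_offset'] - df['utc_offset'].min()).astype(int)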
If you know what time zone you are dealing with, you could use:
dates['dst_flag'] = dates['date_time'].apply(lambda x: x.tzname() == 'CEST')
This would flag all hours in CET as False and all hours in CEST as True. I'm not sure I'd want to apply that row by row on a huge column, though. For the America/Denver data above, the equivalent check is sketched below.
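A sketch for Denver, whose DST abbreviation is MDT (standard time is MST):
dates['dst_flag'] = dates['date_time'].apply(lambda x: int(x.tzname() == 'MDT'))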
I'm trying to group an xarray.Dataset object into a custom October-January period (months 10, 11, 12, and 1) with an annual frequency. This is complicated because the period crosses New Year.
I've been trying to use the approach
wb_start = temperature.sel(time=temperature.time.dt.month.isin([10,11,12,1]))
wb_start1 = wb_start.groupby('time.year')
But this predictably groups the January data with its own calendar year instead of with the preceding October-December (i.e. year + 1). Any help would be appreciated!
I fixed this in a somewhat clunky albeit effective way by adding a year to the months after January. My method essentially moves months 10, 11, and 12 up one year while leaving the January data in place, and then groups by year on the reindexed time data.
import pandas as pd
from dateutil.relativedelta import relativedelta

wb_start = temperature.sel(time=temperature.time.dt.month.isin([10, 11, 12, 1]))
# convert cftime to datetime
datetimeindex = wb_start.indexes['time'].to_datetimeindex()
wb_start['time'] = pd.to_datetime(datetimeindex)
# add custom group-by-year functionality
custom_year = wb_start['time'].dt.year
# convert time type to pd.Timestamp
time1 = [pd.Timestamp(i) for i in custom_year['time'].values]
# add a year to Timestamp objects when the month is October or later
# (relativedelta does not work on np.datetime64)
time2 = [i + relativedelta(years=1) if i.month >= 10 else i for i in time1]
wb_start['time'] = time2
# group by using the new time index
wb_start1 = wb_start.groupby('time.year')
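A sketch of the same shift without the Python-level list comprehensions, assuming the time coordinate has already been converted to a pandas DatetimeIndex as above:
import pandas as pd
times = pd.DatetimeIndex(wb_start['time'].values)
# Push October-December stamps into the next year; leave January in place
wb_start['time'] = times.where(times.month < 10, times + pd.DateOffset(years=1))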
Here is the code for sample simulated data. Actual data can have varying start and end dates.
import pandas as pd
import numpy as np
dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
dfb = df.resample('B').apply(lambda x: x.iloc[-1])
From dfb, I want to select only the rows belonging to months that have values for all of their days.
In dfb, January 2010 and January 2020 have incomplete data, so I would like the data from February 2010 till December 2019.
For this particular dataset, I could do
df_out=dfb['2010-02':'2019-12']
But please help me with a better solution
Edit: it seems there is plenty of confusion in the question. I want to omit the rows of a month that does not begin on the first day of the month, and the rows of a month that does not end on the last day of the month. Hope that's clear.
When you say "better" solution - I assume you mean make the range dynamic based on input data.
OK, since you mention that your data is continuous after the start date - it is a safe assumption that dates are sorted in increasing order. With this in mind, consider the code:
import pandas as pd
import numpy as np
from datetime import date, timedelta
dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
print(df)
dfb = df.resample('B').apply(lambda x: x.iloc[-1])
# fd is the first index in your dataframe
fd = df.index[0]
# checks if the first month's data is incomplete, i.e. does not start on day 1
if fd.day != 1:
    if fd.month == 12:
        # December rolls over to January of the following year
        first_day_of_next_month = fd.replace(year=fd.year + 1, month=1, day=1)
    else:
        first_day_of_next_month = fd.replace(month=fd.month + 1, day=1)
else:
    first_day_of_next_month = fd
# ld is the last index in your dataframe
ld = df.index[-1]
# computes the next day
next_day = ld + timedelta(days=1)
if next_day.month != ld.month:
    # ld is already the last day of its month, so keep it
    last_day_of_prev_month = ld
else:
    last_day_of_prev_month = ld.replace(day=1) - timedelta(days=1)
df_out = dfb[first_day_of_next_month:last_day_of_prev_month]
There is another way using dateutil.relativedelta, but you will need to install the python-dateutil module; a sketch of that variant follows below. The solution above deliberately avoids any extra modules.
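For completeness, a sketch of that relativedelta variant (untested here, using fd and ld as defined above):
from dateutil.relativedelta import relativedelta
fd = df.index[0]
ld = df.index[-1]
first_day_of_next_month = fd if fd.day == 1 else (fd + relativedelta(months=1)).replace(day=1)
# if the day after ld falls in another month, ld is already a month end
last_day_of_prev_month = ld if (ld + relativedelta(days=1)).month != ld.month else ld.replace(day=1) - relativedelta(days=1)
df_out = dfb[first_day_of_next_month:last_day_of_prev_month]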
I assume that in the general case the table is chronologically ordered (if not use .sort_index). The idea is to extract the year and month from the date and select only the lines where (year, month) is not equal to the first and last lines.
dfb['year'] = dfb.index.year # col#1
dfb['month'] = dfb.index.month # col#2
first_month = (dfb['year']==dfb.iloc[0, 1]) & (dfb['month']==dfb.iloc[0, 2])
last_month = (dfb['year']==dfb.iloc[-1, 1]) & (dfb['month']==dfb.iloc[-1, 2])
dfb = dfb.loc[(~first_month) & (~last_month)]
dfb = dfb.drop(['year', 'month'], axis=1)
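The same selection can be written more compactly with monthly periods, skipping the helper columns entirely; a sketch:
p = dfb.index.to_period('M')
df_out = dfb[(p != p[0]) & (p != p[-1])]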
I have this kind of dataframe.
The data represents the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following one), but sometimes more often. The value can be reset to "0" if the counter breaks and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, and that entry has to be the nearest to the first day of the month AND earlier than the 15th day of the month (if the day were later, it could be a measurement from the end of the month). Another condition: if the difference between two consecutive values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data needs to be:
The purpose is to compute a single consumption figure per month.
A solution would be to iterate over the dataframe (as an array) and apply some if-condition statements, but I wonder if there is a "simple" alternative to achieve that.
Thank you
You can normalize each date to its month end with MonthEnd, compute each row's distance to it, and then drop duplicates on that column, keeping the nearest row.
from pandas.tseries.offsets import MonthEnd
df['New'] = df.index + MonthEnd(1)
df['Diff'] = (df['New'] - df.index).dt.days.abs()
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New', 'Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into StackOverflow if this isn't doing the job.
Defining the dataframe, converting the index to datetime, defining helper columns, using them with the shift method to conditionally remove rows, and finally dropping the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame(
    [[1254], [1265], [1277], [1301], [1345], [1541]],
    columns=["Value"],
    index=[dt.strptime("05-10-19", '%d-%m-%y'),
           dt.strptime("29-10-19", '%d-%m-%y'),
           dt.strptime("30-10-19", '%d-%m-%y'),
           dt.strptime("04-11-19", '%d-%m-%y'),
           dt.strptime("30-11-19", '%d-%m-%y'),
           dt.strptime("03-02-20", '%d-%m-%y')],
)
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541
I have a basic dataframe that is read into pandas, with a few rows of existing data that don't matter much.
df = pd.read_csv('myfile.csv')
df['Date'] = pd.to_datetime(df['Date'])
I need a method that loops through the dates between two endpoints and adds each one as a new row. The dates follow a cycle: 21 days on out of a 28-day cycle, i.e. three weeks on and one week off. So if the start date was 4/1/13 and my end date was 6/1/19, I want to add a row for each date in the 21 "on" days, then skip a week, and repeat.
Desired output:
A, Date
x, 4/1/13
x, 4/2/13
x, 4/3/13
x, 4/4/13
x, 4/5/13
... cont'd
x, 4/21/13
y, 4/29/13
y, 4/30/13
... cont'd
You can see that between x and y there was a new cycle.
I think I am supposed to use datetime for this, but please correct me if I am wrong. I am not sure where to start.
EDIT
I started with this:
import datetime
# The size of each step in days
day_delta = datetime.timedelta(days=1)
start_date = datetime.date(2013, 4, 1)
end_date = start_date + 21*day_delta
for i in range((end_date - start_date).days):
    print(start_date + i*day_delta)
And got this:
2013-04-01
2013-04-02
2013-04-03
2013-04-04
2013-04-05
2013-04-06
2013-04-07
2013-04-08
2013-04-09
2013-04-10
2013-04-11
2013-04-12
2013-04-13
2013-04-14
2013-04-15
2013-04-16
2013-04-17
2013-04-18
2013-04-19
2013-04-20
2013-04-21
But I am not sure how to implement the cycle in here.
TYIA!
Interesting question, I spent almost half an hour on this.
Yes, you will need the datetime module for this.
import datetime

base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(100)]
I made a list of dates as you did. This is a list of datetime.datetime objects; I recommend converting all your dates into this format to make calculations easier. We set a base date (the first day) to compare with the rest later on in a loop.
date_list_filtered = []
for d in date_list:
    date_list_filtered.append(d.strftime('%d/%m/%y'))
strftime() changes the datetime.datetime object into a readable date string; my own preference is the dd/mm/yy format. You can look up other formats online.
df = pd.DataFrame({'Raw':date_list,'Date':date_list_filtered})
Here I made a loop to count the difference in days between each date in the loop and the base date, changing the base date every time it hits -21.
Edit: Oops I did 21 days instead of 28, but I'm sure you can tweak it
import string
import numpy as np

base = df['Raw'][0]
unique_list = []
no21 = 0
for date in df['Raw'].values:
    try:
        res = (date - base).days
    except AttributeError:
        # numpy datetime64 subtraction yields timedelta64, which has no .days
        res = (date - base).astype('timedelta64[D]') / np.timedelta64(1, 'D')
    if res == -21.0:
        # start a new cycle: reset the base date and move to the next letter
        base = date
        no21 += 1
    unique_list.append(string.ascii_letters[no21])
I used the string library to get the unique letters I wanted.
Lastly, put it in the data frame.
df['Unique'] = unique_list
Thanks for asking this question, it was really fun.
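For reference, the same labelling can be done without the explicit loop. A sketch, assuming the descending date_list built above (so the day difference grows from 0):
import string
days_from_base = (df['Raw'].iloc[0] - df['Raw']).dt.days
df['Unique'] = [string.ascii_letters[d // 21] for d in days_from_base]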
You can floor divide the difference in days from the start date by 28 to get the number of cycles.
import datetime

date_start = datetime.datetime(2013, 4, 1)
date1 = datetime.datetime(2013, 5, 26)
And to check the difference
diff_days = (date1 - date_start).days
diff_days
# 55
cycle = (date1 - date_start).days // 28
cycle
# 1
Then you can sum over the dates within the same cycle.
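A sketch of that last step, assuming the dataframe from the question with its DatetimeIndex and column 'A':
df['cycle'] = (df.index - df.index[0]).days // 28
per_cycle = df.groupby('cycle')['A'].sum()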
I'm trying to import some time series data and convert it into UTC so I can merge it with another dataset. This data seems to be 24-hour data without DST adjustments. This post gives a similar answer, but they simply drop the offending row; I need to shift it so I can merge it with my other data.
When I run my code:
df = pd.read_csv('http://rredc.nrel.gov/solar/old_data/nsrdb/1991-2010/data/hourly/{}/{}_{}_solar.csv'.format(723898,723898,1998), usecols=["YYYY-MM-DD", "HH:MM (LST)","Meas Glo (Wh/m^2)","Meas Dir (Wh/m^2)","Meas Dif (Wh/m^2)"])
def clean_time(obj):
    # shift the file's 1-24 hour values back by one to get a 0-23 clock
    hour = str(int(obj[0:-3]) - 1)
    if len(hour) == 2:
        return hour + ":00"
    else:
        return "0" + hour + ":00"
df['HH:MM (LST)'] = df['HH:MM (LST)'].apply(clean_time)
df['DateTime'] = df['YYYY-MM-DD'] + " " + df['HH:MM (LST)']
df = df.set_index(pd.DatetimeIndex(df['DateTime']))
df.drop(["YYYY-MM-DD", "HH:MM (LST)",'DateTime'],axis=1,inplace=True)
df.index = df.index.tz_localize('US/Pacific', ambiguous='infer')
I get:
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 1998-10-25 01:00:00 as there are no repeated times
If I leave ambiguous='raise' (the default), it gives me:
pytz.exceptions.NonExistentTimeError: 1998-04-05 02:00:00
So I'm stuck on either the start or the end of daylight saving time.
There are quite a few of these datasets (multiple sites over multiple years) I need to merge, so I'd prefer not to hand-code specific hours to shift, but I'm still a novice and can't quite figure out my next steps here.
Appreciate the help!
Minimal reproduction scenario:
from datetime import datetime, timedelta
import pandas as pd
df = pd.DataFrame([[datetime(2019, 10, 27, 0) + timedelta(hours=i), i] for i in range(24)],
                  columns=['dt', 'i']).set_index('dt')
df.index.tz_localize('Europe/Amsterdam', ambiguous='infer')
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2019-10-27 02:00:00 as there are no repeated times
Solution: manually specify, for each row, whether an ambiguous local time should be treated as DST (True) or as standard time (False). See the tz_localize documentation for the ambiguous argument.
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
df = pd.DataFrame([[datetime(2019, 10, 27, 0) + timedelta(hours=i), i] for i in range(24)],
                  columns=['dt', 'i']).set_index('dt')
# One entry per row: True marks an ambiguous time as DST, False as standard (winter) time.
# The array must correspond to the iloc positions of df.index.
dst_mask = np.array([False] * df.shape[0])
df.index.tz_localize('Europe/Amsterdam', ambiguous=dst_mask)  # no error
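As an aside for the NREL data in the earlier question: since that data is recorded in local standard time all year, localizing to a fixed-offset zone sidesteps the ambiguity entirely. A sketch (Etc/GMT+8 is UTC-8, i.e. US/Pacific standard time; note the reversed sign convention of the Etc zones):
df.index = df.index.tz_localize('Etc/GMT+8').tz_convert('UTC')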