Resample daily pandas timeseries with start at time other than midnight [duplicate] - python

This question already has answers here:
Resample hourly TimeSeries with certain starting hour
(3 answers)
Closed 9 years ago.
I have a pandas timeseries of 10-minute frequency data and need to find the maximum value in each 24-hour period. However, this 24-hour period needs to start each day at 5AM, not at the default midnight that pandas assumes.
I've been checking out DateOffset but so far am drawing blanks. I might have expected something akin to pandas.tseries.offsets.Week(weekday=n), e.g. pandas.tseries.offsets.Week(hour=5), but this is not supported as far as I can tell.
I can do a nasty workaround by shifting the data first, but it's unintuitive, and even coming back to the same code after just a week I have problems wrapping my head around the shift direction!
Any more elegant ideas would be much appreciated.
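For reference, the shift-based workaround mentioned above might look roughly like this (a sketch with made-up 10-minute data, not the actual code in question):
import pandas as pd
import numpy as np
idx = pd.date_range('2012-01-01', freq='10min', periods=24 * 6 * 3)
s = pd.Series(np.arange(len(idx)), index=idx)
# Shift the index back 5 hours so each calendar day of the shifted series
# covers 05:00 -> 05:00 of the original data, take the daily max,
# then restore the 05:00 bin labels.
shifted = s.copy()
shifted.index = shifted.index - pd.Timedelta(hours=5)
daily_max = shifted.resample('D').max()
daily_max.index = daily_max.index + pd.Timedelta(hours=5)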

The base keyword can do the trick (see docs):
s.resample('24h', base=5)
Eg:
In [35]: idx = pd.date_range('2012-01-01 00:00:00', freq='5min', periods=24*12*3)
In [36]: s = pd.Series(np.arange(len(idx)), index=idx)
In [38]: s.resample('24h', base=5)
Out[38]:
2011-12-31 05:00:00 29.5
2012-01-01 05:00:00 203.5
2012-01-02 05:00:00 491.5
2012-01-03 05:00:00 749.5
Freq: 24H, dtype: float64
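Note for readers on newer pandas: the base argument was later deprecated in favour of origin/offset (added around pandas 1.1), and the example above shows the old default mean aggregation; for the original question you would chain .max() explicitly. A rough modern equivalent:
# daily maximum for windows running 05:00 -> 05:00 (newer pandas)
s.resample('24h', offset='5h').max()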

I've just spotted an answered question which didn't come up on Google or Stack Overflow previously:
Resample hourly TimeSeries with certain starting hour
This uses the base parameter, which appears to have been added after Wes McKinney's Python for Data Analysis was written. I've given the parameter a go and it seems to do the trick.

Related

How to use pandas beyond the pd.Timestamp.min and pd.Timestamp.max value?

pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145224193')
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
I found out that pandas has minimum and maximum Timestamp values. If I need dates beyond these values, is that possible?
Is it not possible to move the min/max values, for example by a century window?
Any alternatives without pandas, then?
Thanks very much.
This is a known limitation due to the nanosecond precision of Timestamps.
Timestamp limitations
Since pandas represents timestamps in nanosecond resolution, the time
span that can be represented using a 64-bit integer is limited to
approximately 584 years
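That 584-year figure follows directly from the int64 nanosecond range; a quick back-of-the-envelope check:
2**63 / (1e9 * 60 * 60 * 24 * 365.25)   # ~292 years on each side of the epoch, ~584 years in total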
The documentation suggests using pandas.period_range:
Representing out-of-bounds span
If you have data that is outside of the Timestamp bounds, see
Timestamp limitations, then you can use a PeriodIndex and/or Series of
Periods to do computations.
pd.period_range("1215-01-01", "1381-01-01", freq="D")
PeriodIndex(['1215-01-01', '1215-01-02', '1215-01-03', '1215-01-04',
'1215-01-05', '1215-01-06', '1215-01-07', '1215-01-08',
'1215-01-09', '1215-01-10',
...
'1380-12-23', '1380-12-24', '1380-12-25', '1380-12-26',
'1380-12-27', '1380-12-28', '1380-12-29', '1380-12-30',
'1380-12-31', '1381-01-01'],
dtype='period[D]', length=60632)
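Once you are working with Periods, ordinary arithmetic works within that representation, for example:
p = pd.Period('1215-06-15', freq='D')
p + 100   # Period('1215-09-23', 'D')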
Converting a Series
There is no direct method (like to_period) to convert an existing Series; you need to go through a PeriodIndex:
df = pd.DataFrame({'str': ['1900-01-01', '2500-01-01']})
df['period'] = pd.PeriodIndex(df['str'], freq='D').values
output:
print(df)
str period
0 1900-01-01 1900-01-01
1 2500-01-01 2500-01-01
print(df.dtypes)
str object
period period[D]
dtype: object
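As for alternatives without pandas: the standard library's datetime.date and datetime.datetime cover years 1 through 9999, so plain date arithmetic works fine outside the Timestamp bounds. A quick illustration:
from datetime import date
(date(2500, 1, 1) - date(1900, 1, 1)).days   # 219146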

How to get the day difference between date-column and maximum date of same column or different column in Python?

I am setting up a new column as the day difference in Python (in a Jupyter notebook).
I have already computed the day difference between the date column and the current day, and also between the date column and a date derived from the current day (current day -/+ an input number of days, via timedelta).
However, whenever I use max() of the same column (or of a different column), the day difference column is all NaN. It doesn't make sense to me; maybe I am missing something about the date type, yet when I check, all of the types seem to be datetime64 (I converted them myself).
I thought the reason might be that the date was not big enough, but it happens with any specific date such as max(datecolumn) + timedelta(days=i).
t=data_signups[["date_joined"]].max()
date_joined 2019-07-18 07:47:24.963450
dtype: datetime64[ns]
t = t + timedelta(30)
date_joined 2019-08-17 07:47:24.963450
dtype: datetime64[ns]
data_signups['joined_to_today'] = (t - data_signups['date_joined']).dt.days
data_signups.head(2)
(output shortened)
date_joined                  joined_to_today
2019-05-31 10:52:06.327341   NaN
2019-04-02 09:20:26.520272   NaN
However, it worked for the current-day task, as below.
Currentdate = datetime.datetime.now()
print(Currentdate)
2019-09-01 17:05:48.934362
before_days=int(input("Enter the number of days before today for analysis "))
30
Done
last_day_for_analysis = Currentdate - timedelta(days=before_days)
print(last_day_for_analysis)
2019-08-02 17:05:48.934362
data_signups['joined_to_today'] = (last_day_for_analysis - data_signups['date_joined']).dt.days
data_signups.head(2)
(output shortened)
date_joined                  joined_to_today
2019-05-31 10:52:06.327341   63
2019-04-02 09:20:26.520272   122
I suspect there is a date type problem, but I could not figure it out since all of the columns are datetime64, and there are no NaN values in them.
Thank you for your help. I am a newbie and I am trying to learn continuously.
Although I was busy with this question for two days, I have now realized that I made a big mistake. Sorry, everyone.
The reason the maximum value could not be used as a date is the following.
Existing one: t = data_signups[["date_joined"]].max()
Must be: t = data_signups["date_joined"].max()
So it works as below.
data_signups['joined_to_today'] = (data_signups['date_joined'].max() - data_signups['date_joined']).dt.days
data_signups.head(3)
There should be no double brackets. Such a silly mistake. Thank you.
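For anyone else hitting this, the root cause is index alignment: df[["date_joined"]].max() returns a one-element Series indexed by the column label, and subtracting it from a row-indexed Series aligns on the (disjoint) index labels, producing all NaT. A minimal sketch with made-up data:
import pandas as pd
df = pd.DataFrame({'date_joined': pd.to_datetime(['2019-04-02', '2019-05-31'])})
t_frame = df[['date_joined']].max()    # Series indexed by the column label 'date_joined'
t_scalar = df['date_joined'].max()     # a single Timestamp
# Series minus Series aligns on index labels, which don't match -> all NaT -> NaN days
(t_frame - df['date_joined']).dt.days
# Timestamp minus Series broadcasts as intended
(t_scalar - df['date_joined']).dt.days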

Pandas TimeGrouper: Drop "non full groups"

I'm grouping my data on some frequency, but it appears that TimeGrouper creates a last group on the right for some "left over" data.
df.groupby([pd.TimeGrouper("2AS", label='left')]).sum()['shopping'].plot()
I expect the data to be fairly constant over time, but the last data point at 2013 drops by almost half. I expect this to happen because with biannual grouping, the second half (2014) is missing.
rolling_mean allows center=True, which will put NaN/drop remainders on the left and right. Is there a similar feature for the Grouper? I couldn't find any in the manual, but perhaps there is a workaround?
I don't think the issue here really concerns options available with TimeGrouper, but rather, how you want to deal with uneven data. You basically have 4 options that I can think of:
1) Drop enough observations (at the start or end) such that you have a multiple of 2 years worth of observations.
2) Extrapolate your starting (or ending) period such that it is comparable to the periods with complete data.
3) Normalize your data to 2 year sums based on underlying time periods of less than 2 years. This approach could be combined with the other two.
4) Instead of a groupby sort of approach, just do a rolling_sum.
Example dataframe:
rng = pd.date_range('1/1/2010', periods=60, freq='M')  # month-end dates
df = pd.DataFrame({'shopping': np.random.choice(12, 60)}, index=rng)
I just made the example data set with 5 years of data starting on Jan 1, so if you did this on an annual basis, you'd be done.
df.groupby([pd.TimeGrouper("AS", label='left')]).sum()['shopping']
Out[206]:
2010-01-01 78
2011-01-01 60
2012-01-01 76
2013-01-01 51
2014-01-01 60
Freq: AS-JAN, Name: shopping, dtype: int64
Here's your problem in table form, with the first 2 groups based on 2 years of data but the third group based on only 1 year of data.
df.groupby([pd.TimeGrouper("2AS", label='left')]).sum()['shopping']
Out[205]:
2010-01-01 138
2012-01-01 127
2014-01-01 60
Freq: 2AS-JAN, Name: shopping, dtype: int64
If you take approach (1) above, you just need to drop some observations. It's very easy to drop the later observations and retype the same command. It's a little trickier to drop the earlier observations because then your first observation doesn't begin on Jan 1 of an even year and you lose the automatic labelling and such. Here's an approach that will drop the first year and keep the last 4, but you lose the nice labelling (you can compare with annual data above to verify that this is correct):
In [202]: df2 = df[12:]
In [203]: df2['group24'] = (np.arange( len(df2) ) / 24 ).astype(int)
In [204]: df2.groupby('group24').sum()['shopping']
Out[204]:
group24
0 136
1 111
Alternatively, let's try approach (2), extrapolating. To do this, just replace sum() with mean() and multiply by 24. For the last period, this just means we assume that the 60 in 2014 will be equaled by another 60 in 2015. Whether or not this is reasonable will be a judgement call for you to make and you'd probably want to label with an asterisk and call it an estimate.
df.groupby([pd.TimeGrouper("2AS")]).mean()['shopping']*24
Out[208]:
2010-01-01 138
2012-01-01 127
2014-01-01 120
Freq: 2AS-JAN, Name: shopping, dtype: float64
Also keep in mind this is just one simple (probably simplistic) way you could extrapolate at the end of the period. Whether this is the best way to do it (or whether it makes sense to extrapolate at all) is a judgement call for you to make depending on the situation.
Next, you could take approach (3) and do some sort of normalization. I'm not sure exactly what you want, so I'll just sketch the ideas. If you want to display two year sums, you could just use the earlier example of replacing "2AS" with "AS" and then multiply by 2. This basically makes the table look wrong, but would be a really simple way to make the graph look OK.
Finally, just use rolling sum:
pd.rolling_sum(df.shopping, window=24)
It doesn't display well as a table, but it plots well.
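Note that this answer predates some API changes: on current pandas, TimeGrouper has been replaced by pd.Grouper and pd.rolling_sum by the .rolling accessor. Rough modern equivalents (a sketch, not tested against the versions used above):
# two-year groups, labelled on the left edge
df.groupby(pd.Grouper(freq='2AS', label='left'))['shopping'].sum()
# rolling 24-month sum
df['shopping'].rolling(window=24).sum()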

Calculate difference between two datetimes if both present in pandas DataFrame

I currently have various time columns (DateTime format) in a pandas DataFrame, as shown below:
Entry Time      Exit Time
00:30:59.555    06:30:59.555
00:56:43.200
10:30:30.500    11:30:30.500
I would like to return the difference between these times (Exit Time - Entry Time) in a new column in the dataframe if both Entry Time and Exit Time are present. Otherwise, I would like to skip the row, as shown below:
Entry Time      Exit Time       Time Difference
00:30:59.555    06:30:59.555    06:00:00.000
00:56:43.200
10:30:30.500    12:00:30.500    01:30:00.000
I am fairly new to Python, so my apologies if this is an obvious question. Any help would be greatly appreciated!
If your dtypes are really datetimes then it's really simple:
In [36]:
df['Difference Time'] = df['Exit Time'] - df['Entry Time']
df
Out[36]:
Entry Time Exit Time Difference Time
0 2014-08-01 00:30:59.555000 2014-08-01 06:30:59.555000 06:00:00
1 2014-08-01 00:56:43.200000 NaT NaT
2 2014-08-01 10:30:30.500000 2014-08-01 11:30:30.500000 01:00:00
[3 rows x 3 columns]
If they are not then you need to convert them using pd.to_datetime e.g.
df['Entry Time'] = pd.to_datetime(df['Entry Time'])
EDIT
There seems to be some additional weirdness with your data which I don't quite understand but the following seems to have worked for you:
df.dropna()['Exit_Time'] - df.dropna()['Entry_Time']
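More generally, if you only want the difference where both timestamps are present, and want to leave the other rows blank rather than dropping them, a boolean mask works. A sketch using the column names from the question, assuming both columns are already datetime64:
both_present = df['Entry Time'].notna() & df['Exit Time'].notna()
df.loc[both_present, 'Time Difference'] = df.loc[both_present, 'Exit Time'] - df.loc[both_present, 'Entry Time']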

Date ranges in Pandas

After fighting with NumPy and dateutil for days, I recently discovered the amazing Pandas library. I've been poring through the documentation and source code, but I can't figure out how to get date_range() to generate indices at the right breakpoints.
from datetime import date
import pandas as pd
start = date(2012, 1, 15)
end = date(2012, 9, 20)
# 'M' is month-end, instead I need same-day-of-month
pd.date_range(start, end, freq='M')
What I want:
2012-01-15
2012-02-15
2012-03-15
...
2012-09-15
What I get:
2012-01-31
2012-02-29
2012-03-31
...
2012-08-31
I need month-sized chunks that account for the variable number of days in a month. This is possible with dateutil.rrule:
from dateutil.rrule import rrule, MONTHLY
rrule(freq=MONTHLY, dtstart=start, bymonthday=(start.day, -1), bysetpos=1)
Ugly and illegible, but it works. How can I do this with pandas? I've played with both date_range() and period_range(), so far with no luck.
My actual goal is to use groupby, crosstab and/or resample to calculate values for each period based on sums/means/etc of individual entries within the period. In other words, I want to transform data from:
total
2012-01-10 00:01 50
2012-01-15 01:01 55
2012-03-11 00:01 60
2012-04-28 00:01 80
#Hypothetical usage
dataframe.resample('total', how='sum', freq='M', start='2012-01-09', end='2012-04-15')
to
total
2012-01-09 105 # Values summed
2012-02-09 0 # Missing from dataframe
2012-03-09 60
2012-04-09 0 # Data past end date, not counted
Given that Pandas originated as a financial analysis tool, I'm virtually certain that there's a simple and fast way to do this. Help appreciated!
freq='M' is for month-end frequencies (see here). But you can use .shift to shift it by any number of days (or any frequency for that matter):
pd.date_range(start, end, freq='M').shift(15, freq=pd.datetools.day)
There actually is no "day of month" frequency (e.g. "DOMXX" like "DOM09"), but I don't see any reason not to add one.
http://github.com/pydata/pandas/issues/2289
I don't have a simple workaround for you at the moment because resample requires passing a known frequency rule. I think it should be augmented to be able to take any date range to be used as arbitrary bin edges, also. Just a matter of time and hacking...
try
pd.date_range(start, end, freq=pd.DateOffset(months=1))
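Note that pd.datetools in the earlier answer has since been removed; on current pandas the shift can be written with a plain 'D' frequency, and the DateOffset approach anchors directly to the start day. A self-contained sketch:
import pandas as pd
from datetime import date
start, end = date(2012, 1, 15), date(2012, 9, 20)
# month-end range shifted onto the 15th of the following month
pd.date_range(start, end, freq='M').shift(15, freq='D')
# anchored to the start day: 2012-01-15, 2012-02-15, ..., 2012-09-15
pd.date_range(start, end, freq=pd.DateOffset(months=1))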
