Irregular, non-contiguous Periods in Pandas - python

I need to represent a sequence of events. These events are a little unusual in that they are:
- non-contiguous
- non-overlapping
- of irregular duration
For example:
1200 - 1203
1210 - 1225
1304 - 1502
I would like to represent these events using pandas.PeriodIndex, but I can't figure out how to create Period objects with irregular durations.
I have two questions:
Is there a way to create Period objects with irregular durations using existing Pandas functionality?
If not, could you suggest how to modify Pandas in order to provide irregular duration Period objects? (this comment suggests that it might be possible "using custom DateOffset classes with appropriately crafted onOffset, rollforward, rollback, and apply methods")
Notes
The docstring for Period suggests that it is possible to specify arbitrary durations like 5T for "5 minutes". I believe this docstring is incorrect. Running pd.Period('2013-01-01', freq='5T') produces an exception ValueError: Only mult == 1 supported. I have reported this issue.
The "time stamps vs time spans" section in the Pandas documentation states "For regular time spans, pandas uses Period objects for scalar values and PeriodIndex for sequences of spans. Better support for irregular intervals with arbitrary start and end points are forth-coming in future releases." (my emphasis)
Update 1
Building a Period with a custom duration looks pretty straightforward, but I think the main stumbling block will be persuading PeriodIndex to accept Periods with different freqs, e.g.:
In [93]: pd.PeriodIndex([pd.Period('2000', freq='D'),
                         pd.Period('2001', freq='T')])
ValueError: 2001-01-01 00:00 is wrong freq
It looks like a central assumption in PeriodIndex is that every Period has the same freq.

A possible solution, depending on the application, is to bin your data: create a PeriodIndex whose period is the smallest unit of time resolution you need, then divide the data amongst the bins covered by each event, leaving the remaining bins null.
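For example, a minimal sketch of the binning idea (the event times, values, and the 1-minute resolution are assumptions for illustration):

import pandas as pd

# Bins at the finest resolution needed (here: 1 minute), spanning all events.
bins = pd.period_range('2013-01-01 12:00', '2013-01-01 15:02', freq='T')
s = pd.Series(index=bins, dtype=float)  # all bins start out null (NaN)

# Fill the bins covered by each (start, end, value) event; the rest stay null.
events = [('2013-01-01 12:00', '2013-01-01 12:03', 1.0),
          ('2013-01-01 12:10', '2013-01-01 12:25', 2.0),
          ('2013-01-01 13:04', '2013-01-01 15:02', 3.0)]
for start, end, value in events:
    s.loc[pd.Period(start, freq='T'):pd.Period(end, freq='T')] = value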

If your periods have minute resolution, you must pass datetimes that include minutes, like so:
pd.PeriodIndex([pd.Period('2000-01-01 00:00', freq='T'),
                pd.Period('2001-01-01 00:00', freq='T')])
The result:
PeriodIndex(['2000-01-01 00:00', '2001-01-01 00:00'], dtype='period[T]', freq='T')

Related

Understanding Plotly Time Difference units

So, I have a problem similar to this question. I have a DataFrame with a column 'diff' and a column 'date' with the following dtypes:
delta_df['diff'].dtype
>>> dtype('<m8[ns]')
delta_df['date'].dtype
>>> datetime64[ns, UTC]
According to this answer, they are (kind of) equivalent. However, when I plot using plotly (I used histogram and scatter), the 'diff' axis has weird units: something like 2T, 2.5T, 3T, etc. What is this? The data in the 'diff' column looks like 0 days 00:29:36.000001, so I don't understand what is happening (the 'date' column looks like 2018-06-11 01:04:25.000005+00:00).
BTW, the diff column was generated using df['date'].diff().
So my question is:
What is this T? Is it a standard chosen by plotly, like 30 mins, so that 2T is 1 hour? If so, how do I check the value of the chosen T?
Maybe more important, how do I plot with the axis formatted as the values appear in the column, so it's easier to read?
The "T" you see in the axis label of your plot represents a time unit, and in Plotly, it stands for "Time". By default, Plotly uses seconds as the time unit, but if your data spans more than a few minutes, it will switch to larger time units like minutes (T), hours (H), or days (D). This is probably what is causing the weird units you see in your plot.
It's worth noting that using "T" as a shorthand for minutes is a convention adopted by some developers and libraries because "M" is already used to represent months.
To confirm that the weird units you see are due to Plotly switching to larger time units, you can check the largest value in your 'diff' column. If the largest value is more than a few minutes, Plotly will switch to using larger time units.
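One way to get a readable axis is to convert the timedeltas to a plain numeric unit before plotting. A minimal sketch, reusing the column names from the question ('diff_minutes' is a hypothetical new column):

import plotly.express as px

# Convert the timedelta column to a float number of minutes, so the axis
# shows ordinary numbers instead of frequency codes like 2T or 2.5T.
delta_df['diff_minutes'] = delta_df['diff'].dt.total_seconds() / 60
fig = px.scatter(delta_df, x='date', y='diff_minutes')
fig.show()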

how to combine 4D xarray data

I have a 4D xarray DataArray with the dimensions time, lev, lat, and lon. The data is for a specific day, so the length of time is 1. My goal is to build a 4D DataArray with the same attributes but containing a month of data, so that the time length will be 30.
I have tried googling it but cannot find useful information. I'd appreciate it if anyone could provide some insight.
If you have multiple points in a time series, you can use xr.DataArray.resample to change the frequency of a datetime dimension. Once you have resampled, you'll get a DataArrayResample object, to which you can apply any of the methods listed in the DataArrayResample API docs.
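For instance, a minimal sketch, assuming da is a DataArray with a datetime dimension named 'time' that has more than one point:

# Aggregate to monthly frequency; .mean() could equally be .first(),
# .max(), or any other method the DataArrayResample object supports.
monthly = da.resample(time='1M').mean()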
If you only have a single point in time, you can't resample to a higher frequency. Your best bet is probably to simply select and drop the time dim altogether, then use expand_dims to expand the dimensions again to include the full time dim you want. Just be careful because this overwrites the time dimension's values with whatever you want, regardless of what was in there before:
import pandas as pd

target_dates = pd.date_range('2018-08-01', '2018-08-30', freq='D')
daily = (
    da
    .isel(time=0, drop=True)
    .expand_dims(time=target_dates)
)

Python xarray: Extract first and last time value within each month of a timeseries

EDIT 2016-01-24: This behavior was from a bug in xarray (at the time known as 'xray'). See answer by skc below.
I have an xarray.DataArray comprising daily data spanning multiple years. I want to compute the time tendency of that data for each month in the timeseries. I can get the numerator, i.e. the change in the quantity over each month, using resample. Supposing arr is my xarray.DataArray object, with the time coordinate named 'time':
data_first = arr.resample('1M', 'time', how='first')
data_last = arr.resample('1M', 'time', how='last')
Then data_last - data_first gives me the change in that variable over that month.
However, this doesn't work on the time=arr.time object itself: both 'first' and 'last' kwarg values yield the same value, which is the last day of that month. Also, I can't use the groupby methods, because doing so with time.month groups all the Januaries together, all the Februaries together, etc., when I want the first and last time value within each individual month in the timeseries.
Is there a simple way to do this in xarray? I suspect yes, but I'm new to the package and am failing miserably.
Since 'time' is a coordinate in the DataArray you provided, it is for the moment not possible¹ to perform resample directly upon it. A possible workaround is to create a new DataArray with the time coordinate values as a variable (still linked with the same coordinate 'time').
If arr is the DataArray you are starting from I would suggest something like this:
time = xray.DataArray(arr.time.values, coords=[arr.time.values], dims=['time'])
time_first = time.resample('1M', 'time', how='first')
time_last = time.resample('1M', 'time', how='last')
time_diff = time_last - time_first
¹ This is not the intended behavior -- see Stephan's comment above.
Update: Pull request 648 has fixed this issue, so there should no longer be a need to use a workaround.
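With that fix in place, something like the following sketch (using the newer keyword-style resample syntax) should work directly on the time coordinate:

# Resample the time coordinate itself to get the first and last date
# within each month, then take the difference.
time_first = arr['time'].resample(time='1M').first()
time_last = arr['time'].resample(time='1M').last()
time_diff = time_last - time_first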

Selecting data for one hour in a timeseries dataframe

I'm having trouble selecting data in a dataframe dependent on the hour.
I have a month's worth of data at 10-minute intervals.
I would like to select the data (creating another dataframe) for each hour in a specific day. However, I am having trouble creating an expression.
This is how I did it to select the day:
x = all_data.resample('D').index
for day in range(20):
    c = x.day[day]
    d = x.month[day]
    print data['%(a)s-%(b)s-2009' % {'a': c, 'b': d}]
But if I do it for the hour, it will not work:
x = data['04-09-2009'].resample('H').index
for hour in range(8):
    daydata = data['4-9-2009 %(a)s' % {'a': x.hour[hour]}]
I get the error:
raise KeyError('no item named %s' % com.pprint_thing(item))
KeyError: u'no item named 4-9-2009 0'
which is true, as it is in the format dd/mm/yyyy hh:mm:ss.
I'm sure this should be easy and something to do with resample. The trouble is I don't want to do anything with the data, just select the data frame (to correlate it afterwards).
Cheers
You don't need to resample your data unless you want to aggregate into a daily value (e.g., sum, max, median).
If you just want a specific day's worth of data, you can use the following example of the .loc attribute to get started:
import numpy
import pandas
N = 3700
data = numpy.random.normal(size=N)
time = pandas.DatetimeIndex(freq='10T', start='2013-02-15 14:30', periods=N)
ts = pandas.Series(data=data, index=time)
ts.loc['2013-02-16']
The great thing about using .loc on a time series is that you can be as general or as specific as you want with the dates. So for a particular hour, you'd say:
ts.loc['2013-02-16 13'] # notice that i didn't put any minutes in there
Similarly, you can pull out a whole month with:
ts.loc['2013-02']
The issue you're having with the string formatting is that you're manually padding the string with a 0. So if you have a 2-digit hour (i.e. in the afternoon) you end up with a 3-digit representation of the hour (and that's not valid). So if I wanted to loop through a specific set of hours, I would do:
hours = [2, 7, 12, 22]
for hr in hours:
    print(ts.loc['2013-02-16 {0:02d}'.format(hr)])
The 02d format string tells Python to construct a string from a digit (integer) that is at least two characters wide, padding the string with a 0 on the left side if necessary. Also, you probably need to format your date as YYYY-mm-dd instead of the other way around.

Remove Holidays and Weekends in a very long time-serie, how to model time-series in Python?

Is there some function in Python to handle this? Google Docs has a WEEKDAY operation, so perhaps there is something like that in Python. I am pretty sure someone must have solved this; similar problems occur with sparse data, such as in finance and research. I am basically trying to organize a huge number of different-sized vectors indexed by days, i.e. time series. I am not sure how I should handle the days -- mark the first day with 1 and the last day with N, use unix time, or something else? I am also not sure whether the time series should be saved into a matrix so I could model them more easily, for calculating correlation matrices and such things. Is there anything ready-made for this?
Let's try to solve this problem without the "practical" extra clutter:
from itertools import cycle, compress

seq = range(100000)
criteria = cycle([True]*10 + [False]*801)
list(compress(seq, criteria))
Now I have to change these into days, and then change the $\mathbb{R}$ into an $(\mathbb{R}, \mathbb{R})$ tuple. So a map $V : \mathbb{R} \mapsto \mathbb{R}^{2}$ is missing; investigating.
[Update]
Let's play! The code below solves the subproblem -- it creates some test data -- now we need to create arbitrary days with valuations there, to try to test it on arbitrary time series. If we can create some function $V$, we are very close to solving this problem... it must account for holidays and weekends, though, so it may not be easy (not sure).
import itertools as i
import numpy

def createRandomData():
    samples = []
    for x in range(5):
        seq = range(5)
        criteria = i.cycle([True]*x + [False]*3)
        samples += [list(i.compress(seq, criteria))]
    return samples

def createNNtriangularMatrix(data):
    N = len(data)
    return [aa + [0]*(N - len(aa)) for aa in data]

A = createNNtriangularMatrix(createRandomData())
print numpy.array(A)
print numpy.corrcoef(A)
I think you should figure out some way to determine the days you want to INCLUDE, and create a (probably looping) subroutine that uses slicing operations on your big list.
For discontinuous slices, you can take a look at this question:
Discontinuous slice in python list
Or perhaps you could make the days you do not want receive a null value (zero or None).
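For example, a minimal sketch of the nulling idea, assuming s is a pandas Series with a DatetimeIndex:

# Keep weekdays (dayofweek 0-4) and turn weekend values into NaN,
# leaving the index itself intact.
s_masked = s.where(s.index.dayofweek < 5)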
Try using pandas. You can create a DateOffset for business days and include your data in a DataFrame (see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html) to analyze it.
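A minimal sketch of that idea (the holiday list here is an assumption for illustration; in practice you would use a real calendar):

import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

# Business days skip weekends and any holidays you supply.
holidays = pd.to_datetime(['2012-01-01', '2012-12-25'])
bday = CustomBusinessDay(holidays=holidays)
idx = pd.date_range('2012-01-01', '2012-12-31', freq=bday)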
I think it depends on the scope of your problem; for a personal calendar, 'day' is good enough for indexing.
A life is at most about 200 years, i.e. roughly 73000 days. Simply calculate and record them all, perhaps using a dict, e.g.:
day = {}
# day[0] = [event_a, event_b, ...]
# or you may want to rewrite the __getitem__ method like this: day['09-05-2012']
Why would you want to remove the holidays and weekends? Is it because they are outliers or zeroes? If they are zeroes, they will be handled by the model. You would want to leave the data in the time series and use dummy variables to model the seasonal effects (i.e. monthly dummies), day-of-the-week dummies, and holiday dummies. Clearly, I am dumbfounded. I have seen people who are unable to deal with time series analysis even break the weekdays into one time series and the weekends into another, which completely ignores the lead and lag impacts around holidays.
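A minimal sketch of the dummy-variable idea, assuming s is a pandas Series with a DatetimeIndex:

import pandas as pd

# One indicator column per weekday, to model weekly seasonality
# instead of deleting the weekend observations.
dow = pd.get_dummies(s.index.dayofweek, prefix='dow')
dow.index = s.index  # align the dummies with the original observations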
If it is trading days you want, then you can use the pandas-datareader package to download the S&P 500 historical prices for the U.S. and use the index of dates as a mask for your data.
Answered on mobile, I'll add links and code later.
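In the meantime, the masking step itself is a one-liner; a minimal sketch, where my_data is your time-indexed DataFrame and spx is an already-downloaded S&P 500 price DataFrame (both names are placeholders):

# Keep only the rows of your data whose dates are trading days,
# i.e. dates present in the downloaded price history's index.
trading_only = my_data.loc[my_data.index.intersection(spx.index)]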
