So, I have a problem similar to this question. I have a DataFrame with a column 'diff' and a column 'date' with the following dtypes:
delta_df['diff'].dtype
>>> dtype('<m8[ns]')
delta_df['date'].dtype
>>> datetime64[ns, UTC]
According to this answer, they are (kind of) equivalent. However, when I plot using plotly (I tried both histogram and scatter), the 'diff' axis has a weird unit: something like 2T, 2.5T, 3T, etc. What is this? The data in the 'diff' column looks like 0 days 00:29:36.000001, so I don't understand what is happening (the 'date' column looks like 2018-06-11 01:04:25.000005+00:00).
BTW, the diff column was generated using df['date'].diff().
So my question is:
What is this T? Is it a unit chosen by plotly, e.g. 30 minutes, so that 2T is 1 hour? If so, how do I check the value of the chosen T?
Maybe more important: how do I plot with the axis formatted the way the values appear in the column, so it's easier to read?
The "T" you see in the axis label of your plot represents a time unit, and in Plotly, it stands for "Time". By default, Plotly uses seconds as the time unit, but if your data spans more than a few minutes, it will switch to larger time units like minutes (T), hours (H), or days (D). This is probably what is causing the weird units you see in your plot.
It's worth noting that using "T" as a shorthand for minutes is a convention adopted by some developers and libraries because "M" is already used to represent months.
To confirm that the weird units come from Plotly switching to a larger time unit, check the largest value in your 'diff' column; if it spans more than a few minutes, Plotly will use larger units.
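If that is the case, a simple way to get a readable axis is to convert the timedelta column to a plain number before plotting. A minimal sketch, assuming plotly.express is in use and that minutes are an acceptable unit ('diff_minutes' is a hypothetical column name):

import plotly.express as px

# Check the largest gap to see why Plotly picked a larger unit
print(delta_df['diff'].max())

# Convert the timedelta column to minutes so the axis shows
# plain numbers instead of unit suffixes
delta_df['diff_minutes'] = delta_df['diff'].dt.total_seconds() / 60
fig = px.histogram(delta_df, x='diff_minutes')
fig.show()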
Related
I have a data set as follows:
[Time of notification], [Station], [Category]
2019-02-04 19.36:22, Location A, Alert
2019-02-04 20.06:35, Location B, Request
2019-02-05 07.04:53, Location A, Incident
Time of notification is in datetime64[ns] format. The time span is one year.
I am trying to get the following line graphs:
One per station
Time on x axis. Preferably: Accumulated for days of the week and hours (e.g. all Mondays, Tuesdays etc together, so that a daily/weekly trend over the whole year becomes visible).
Number of notifications (for that station) on the y axis. Category is irrelevant.
I have tried a lot, but I am new to time series and to visualization, and I am getting nowhere after hours of trying. I have been trying with plt.subplots, value_counts, et cetera. I also tried making this graph for one station first, but even that didn't work out.
Can anyone help?
Thank you!
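A minimal sketch of one way to do this, assuming the DataFrame is called df with the columns shown in the sample (the column names and the timestamp format string are assumptions based on that sample):

import pandas as pd
import matplotlib.pyplot as plt

# Parse the timestamps; the format string assumes the dotted
# 'HH.MM:SS' style shown in the sample data
df['Time of notification'] = pd.to_datetime(
    df['Time of notification'], format='%Y-%m-%d %H.%M:%S')
df['weekday'] = df['Time of notification'].dt.dayofweek  # 0 = Monday
df['hour'] = df['Time of notification'].dt.hour

# Accumulate counts over the whole year per station, weekday and hour
counts = df.groupby(['Station', 'weekday', 'hour']).size()

# One line graph per station; the x axis runs Monday 00h .. Sunday 23h
stations = counts.index.get_level_values('Station').unique()
fig, axes = plt.subplots(len(stations), 1, sharex=True, squeeze=False)
for ax, station in zip(axes.ravel(), stations):
    counts.loc[station].plot(ax=ax, title=station)
    ax.set_ylabel('notifications')
plt.tight_layout()
plt.show()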
I'm trying to find the maximum rainfall value for each season (DJF, MAM, JJA, SON) over a 10 year period. I am using netcdf data and xarray to try and do this. The data consists of rainfall (recorded every 3 hours), lat, and lon data. Right now I have the following code:
ds.groupby('time.season').max('time')
However, when I do it this way the output has a shape of (4, 145, 192), indicating that it's taking the maximum value for each season over the entire period. I would like the maximum for each individual season of every year. In other words, the output should have a shape like (40, 145, 192) (4 seasons per year x 10 years).
I've looked into doing this with Dataset.resample as well, using time='3M' as the frequency, but then it doesn't split the months up correctly. If I have to, I can alter the dataset so it starts in the right place, but I was hoping there would be an easier way, considering there's already a function that groups it correctly.
Thanks, and let me know if you need any more details!
Resample is going to be the easiest tool for this job. You are close with the time frequency but you probably want to use the quarterly frequency with an offset:
ds.resample(time='QS-Mar').max('time')
These offsets can be further configured as described in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
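For example (a sketch, assuming ds has a 3-hourly 'time' coordinate covering the 10 years; 'QS-Mar' anchors quarters at Mar/Jun/Sep/Dec, which lines up with MAM, JJA, SON and DJF):

seasonal_max = ds.resample(time='QS-Mar').max('time')

# Over 10 years this yields roughly 40 time steps, i.e. one maximum
# per season per year, giving a shape like (40, 145, 192)
print(seasonal_max['time'])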
As I am preparing to do some regressions on a rather big dataset I would like to visualize the data at first.
The data we are talking about is data about the New York subway (hourly entries, rain, weather and such) for May, 2011.
When creating the dataframe I converted hours and time to pandas datetime format.
Now I realize that what I want to do does not make much sense from a logical point of view for the example at hand. However, I would still like to plot the exact time of day against the hourly entries. As I said, this is not very meaningful, since ENTRIESn_hourly is aggregated, but let's assume for the sake of argument that ENTRIESn_hourly is explicitly related to the exact timestamp.
Now how would I go about taking only the times (ignoring the dates) and plotting that?
Please find the jupyter notebook here: https://github.com/FBosler/Udacity/blob/master/Example.ipynb
Thanks a lot!
IIUC you can do it this way:
In [9]: weather_turnstile.plot.line(x=weather_turnstile.Date_Time.dt.time, y='ENTRIESn_hourly', marker='o', alpha=0.3)
Out[9]: <matplotlib.axes._subplots.AxesSubplot at 0xc2a63c8>
.dt accessor gives you access to the following attributes:
In [10]: weather_turnstile.Date_Time.dt.<TAB>
ceil  date  day  dayofweek  dayofyear  days_in_month  daysinmonth  floor
freq  hour  is_month_end  is_month_start  is_quarter_end  is_quarter_start
is_year_end  is_year_start  microsecond  minute  month  nanosecond
normalize  quarter  round  second  strftime  time  to_period  to_pydatetime
tz  tz_convert  tz_localize  week  weekday  weekday_name  weekofyear  year
I will be shocked if there isn't some standard library function for this, especially in numpy or scipy, but no amount of Googling is providing a decent answer.
I am getting data from the Poloniex exchange - cryptocurrency. Think of it like getting stock prices - buy and sell orders - pushed to your computer. So what I have is a time series of prices for any given market. One market might get an update 10 times a day while another gets updated 10 times a minute - it all depends on how many people are buying and selling on the market.
So my timeseries data will end up being something like:
[1 0.0003234,
1.01 0.0003233,
10.0004 0.00033,
124.23 0.0003334,
...]
Where the 1st column is the time value (I use Unix timestamps to the microsecond, but didn't think that was necessary in the example). The 2nd column would be one of the prices, either the buy or the sell price.
What I want is to convert it into a matrix where the data is "sampled" at a regular time frame. So the interpolated (zero-order hold) matrix would be:
[1 0.0003234,
2 0.0003233,
3 0.0003233,
...
10 0.0003233,
11 0.00033,
12 0.00033,
13 0.00033,
...
120 0.00033,
125 0.0003334,
...]
I want to do this with any reasonable time step. Right now I use np.linspace(start_time, end_time, time_step) to create the new time vector.
Writing my own, admittedly crude, zero-order hold interpolator won't be that hard. I'll loop through the original time vector and use np.nonzero to find all the indices in the new time vector that fall between one timestamp (t0) and the next (t1), then fill those indices with the value from time t0.
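A sketch of that crude approach, with hypothetical names (t_orig/v_orig are the original timestamps and values, t_new the regular grid):

import numpy as np

def crude_zoh(t_orig, v_orig, t_new):
    # Hold each original value until the next original timestamp
    out = np.empty(len(t_new), dtype=v_orig.dtype)
    for i, t0 in enumerate(t_orig):
        t1 = t_orig[i + 1] if i + 1 < len(t_orig) else np.inf
        idx = np.nonzero((t_new >= t0) & (t_new < t1))
        out[idx] = v_orig[i]
    return out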
For now, the crude method will work. The matrix of prices isn't that big. But I have to think there a faster method using one of the built-in libraries. I just can't find it.
Also, for the example above I only use an Nx2 matrix (column 1: times, column 2: prices), but ultimately the market has 6 or 8 different parameters that might get updated. A method/library function that could handle multiple prices and such in different columns would be great.
Python 3.5 via Anaconda on Windows 7 (hopefully won't matter).
TIA
For your problem you can use scipy.interpolate.interp1d. It seems to be able to do everything that you want. It does a zero-order hold interpolation if you specify kind="zero", and it can simultaneously interpolate multiple columns of a matrix; you just have to specify the appropriate axis. f = interp1d(xData, yDataColumns, kind='zero', axis=0) will then return a function that you can evaluate at any point in the interpolation range. You can then get your regularly sampled data by calling f(np.linspace(start_time, end_time, time_step)).
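A minimal sketch with hypothetical data shaped like the example above (note that np.linspace's third argument is a sample count; np.arange takes a step instead):

import numpy as np
from scipy.interpolate import interp1d

# Irregular timestamps and one column per market parameter
t = np.array([1.0, 1.01, 10.0004, 124.23])
prices = np.array([[0.0003234],
                   [0.0003233],
                   [0.00033],
                   [0.0003334]])

# kind='zero' gives a zero-order hold; axis=0 interpolates
# every column at once
f = interp1d(t, prices, kind='zero', axis=0)

# Evaluate on a regular grid inside the interpolation range
t_new = np.arange(1.0, 124.0, 1.0)
resampled = f(t_new)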
I need to represent a sequence of events. These events are a little unusual in that they are:
non-contiguous
non-overlapping
irregular duration
For example:
1200 - 1203
1210 - 1225
1304 - 1502
I would like to represent these events using Pandas.PeriodIndex but I can't figure out how to create Period objects with irregular durations.
I have two questions:
Is there a way to create Period objects with irregular durations using existing Pandas functionality?
If not, could you suggest how to modify Pandas in order to provide irregular duration Period objects? (this comment suggests that it might be possible "using custom DateOffset classes with appropriately crafted onOffset, rollforward, rollback, and apply methods")
Notes
The docstring for Period suggests that it is possible to specify arbitrary durations like 5T for "5 minutes". I believe this docstring is incorrect: running pd.Period('2013-01-01', freq='5T') raises ValueError: Only mult == 1 supported. I have reported this issue.
The "time stamps vs time spans" section in the Pandas documentation states "For regular time spans, pandas uses Period objects for scalar values and PeriodIndex for sequences of spans. Better support for irregular intervals with arbitrary start and end points are forth-coming in future releases." (my emphasis)
Update 1
Building a Period with a custom duration looks pretty straightforward. But I think the main stumbling block will be persuading PeriodIndex to accept Periods with different freqs, e.g.:
In [93]: pd.PeriodIndex([pd.Period('2000', freq='D'),
pd.Period('2001', freq='T')])
ValueError: 2001-01-01 00:00 is wrong freq
It looks like a central assumption in PeriodIndex is that every Period has the same freq.
A possible solution, depending on the application, is to bin your data: create a PeriodIndex whose period is the smallest unit of time resolution you need to handle your data, then divide the data amongst the bins covered by each event, leaving the remaining bins null.
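For example, a sketch at one-minute resolution, using the event times from the example above on an arbitrary date:

import pandas as pd

# Hypothetical events: (start, end) pairs on an arbitrary date
events = [('12:00', '12:03'), ('12:10', '12:25'), ('13:04', '15:02')]

# One bin per minute across the full range
bins = pd.period_range('2013-01-01 12:00', '2013-01-01 15:02', freq='T')
covered = pd.Series(False, index=bins)

# Mark the bins that fall inside each event; the rest stay null/False
for start, end in events:
    covered.loc['2013-01-01 ' + start:'2013-01-01 ' + end] = True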
If your periods have a frequency of minutes, you must pass datetimes that include the minutes, like so:
pd.PeriodIndex([pd.Period('2000-01-01 00:00', freq='T'),
pd.Period('2001-01-01 00:00', freq='T')])
The result:
PeriodIndex(['2000-01-01 00:00', '2001-01-01 00:00'], dtype='period[T]', freq='T')