I am struggling to aggregate a timedelta column, including plotting the result. The raw data is available here.
Essentially the data has Submit (datetime), Resolved (datetime), PauseTime (timedelta) and Resolved-Submit-Pausetime (which is the actual time to resolve).
import pandas as pd

test_df = pd.read_csv('test_df.csv')
# Convert Submit and Resolved to datetime
test_df[['Submit', 'Resolved']] = test_df[['Submit', 'Resolved']].apply(pd.to_datetime)
# Convert PauseTime and Resolved-Submit-Pausetime to timedelta
test_df['PauseTime'] = pd.to_timedelta(test_df['PauseTime'])
test_df['Resolved-Submit-Pausetime'] = pd.to_timedelta(test_df['Resolved-Submit-Pausetime'])
I am trying to aggregate the mean for each day of 'Resolved':
test_df.groupby([pd.Grouper(key='Resolved', freq='D')])['Resolved-Submit-Pausetime'].mean()
which gives me an error: DataError: No numeric types to aggregate
1) How can I aggregate the mean?
2) I'd also appreciate some guidance on plotting the trend of the mean time to resolve (the x-axis would have all the dates and the y-axis the aggregated mean timedelta of 'Resolved-Submit-Pausetime').
Use this step to convert your timedelta column into seconds:
test_df['Resolved-Submit-Pausetime'] = test_df['Resolved-Submit-Pausetime'].astype('timedelta64[s]')
0 1234.0
1 27380.0
2 33017.0
3 5454.0
4 433.0
5 2302.0
6 21753.0
7 3405.0
8 4779.0
9 3974.0
10 3389.0
11 114.0
Name: Resolved-Submit-Pausetime, dtype: float64
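Note that on recent pandas versions astype('timedelta64[s]') keeps a timedelta dtype (truncated to seconds) rather than returning float seconds as shown above; if that bites you, .dt.total_seconds() is a version-stable alternative:

# Version-stable alternative: always returns float seconds
test_df['Resolved-Submit-Pausetime'] = test_df['Resolved-Submit-Pausetime'].dt.total_seconds()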
Then run your groupby statement to compute the mean:
test_df.groupby([pd.Grouper(key='Resolved', freq='D')])['Resolved-Submit-Pausetime'].mean()
Resolved
2017-04-01 20543.666667
2017-04-02 7485.500000
2017-04-03 3132.200000
Name: Resolved-Submit-Pausetime, dtype: float64
You can use pandas' built-in plotting tools to do a quick-and-dirty plot of the mean time with respect to the groupby day:
test_df.groupby([pd.Grouper(key='Resolved', freq='D')])['Resolved-Submit-Pausetime'].mean().plot()
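For a slightly more polished chart, you can capture the matplotlib Axes that pandas returns and label it; a minimal sketch, assuming matplotlib is available (pandas plotting requires it) and that the seconds conversion above has been applied:

import matplotlib.pyplot as plt

daily_mean = test_df.groupby(pd.Grouper(key='Resolved', freq='D'))['Resolved-Submit-Pausetime'].mean()
ax = daily_mean.plot(marker='o')  # pandas returns the matplotlib Axes
ax.set_xlabel('Resolved date')
ax.set_ylabel('Mean time to resolve (seconds)')
plt.show()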
Sorry if this question has been asked before, but I can't seem to find one that describes my current issue.
Basically, I have a large climate dataset that is not bound to "real" dates. The dataset starts at "year one" and goes to "year 9999". The dates are stored as strings such as Jan-01, Feb-01, Mar-01 etc., where the number indicates the year. When trying to convert this column to datetime objects, I get an out-of-range error. (My reading suggests this is due to the 64-bit limit on the datetime timestamps that can be represented.)
What is a good way to work around this problem and process the date information so I can effectively plot the associated data against these dates over this ~10,000-year period?
Thanks
The cftime library was created specifically for this purpose, and xarray has a convenient xr.cftime_range function that makes creating such a range easy:
In [3]: import xarray as xr, pandas as pd
In [4]: date_range = xr.cftime_range('0001-01-01', '9999-01-01', freq='D')
In [5]: type(date_range)
Out[5]: xarray.coding.cftimeindex.CFTimeIndex
This creates a CFTimeIndex object which plays nicely with pandas:
In [8]: df = pd.DataFrame({"date": date_range, "vals": range(len(date_range))})
In [9]: df
Out[9]:
date vals
0 0001-01-01 00:00:00 0
1 0001-01-02 00:00:00 1
2 0001-01-03 00:00:00 2
3 0001-01-04 00:00:00 3
4 0001-01-05 00:00:00 4
... ... ...
3651692 9998-12-28 00:00:00 3651692
3651693 9998-12-29 00:00:00 3651693
3651694 9998-12-30 00:00:00 3651694
3651695 9998-12-31 00:00:00 3651695
3651696 9999-01-01 00:00:00 3651696
[3651697 rows x 2 columns]
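If your raw column holds strings like Jan-01 (month, then year), you will also need to parse them yourself, since pd.to_datetime will overflow. A hedged sketch, assuming a standard Gregorian calendar and a column named date (both are assumptions; adjust to your data):

import cftime

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def parse_climate_date(s):
    # 'Jan-01' -> cftime date for January of year 1
    month, year = s.split('-')
    return cftime.DatetimeGregorian(int(year), MONTHS[month], 1)

df['date'] = df['date'].map(parse_climate_date)

For plotting, xarray can plot against a CFTimeIndex directly if the optional nc-time-axis package is installed.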
I am trying to filter a DataFrame to show only values from 1 hour before to 1 hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame but contains only the rows from 1 hour before to 1 hour after a specified timestamp, i.e. only rows within this 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame but contains only the rows within 1 hour before and 1 hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter that drops all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with:
import pandas as pd
import datetime as dt

data = {"date": ["2011-01-15 03:10:00", "2011-01-15 03:40:00", "2011-01-15 04:10:00",
                 "2011-01-15 04:40:00", "2011-01-15 05:10:00", "2011-01-15 07:10:00"],
        "value": [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')

date_search = dt.datetime.strptime("2011-01-15 05:20:00", '%Y-%m-%d %H:%M:%S')
# Keep rows within one hour either side of the search time
mask = (df['date'] > date_search - dt.timedelta(hours=1)) & (df['date'] <= date_search + dt.timedelta(hours=1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
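For what it's worth, the original attempt was close: the SyntaxError comes from calling df with a slice in parentheses. With the timestamp column set as a sorted DatetimeIndex, label-based slicing via .loc does the same job; a minimal sketch, assuming the column names from the question:

hour = pd.Timedelta(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")

df = df.set_index("timestamp").sort_index()
# .loc slicing on a DatetimeIndex is inclusive at both ends
window = df.loc[date - hour : date + hour]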
I am working on a large dataset that looks like this:
Time, Value
01.01.2018 00:00:00.000, 5.1398
01.01.2018 00:01:00.000, 5.1298
01.01.2018 00:02:00.000, 5.1438
01.01.2018 00:03:00.000, 5.1228
01.01.2018 00:04:00.000, 5.1168
...
31.12.2018 23:59:59.000, 6.3498
The data is minute-level data from the first day of the year to the last day of the year.
I want to use Pandas to find a trailing 5-day average for every day.
For example:
The average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is the average for 05.01.2018.
The next average, from 02.01.2018 00:00:00.000 to 06.01.2018 23:59:59.000, is the average for 06.01.2018.
The next average, from 03.01.2018 00:00:00.000 to 07.01.2018 23:59:59.000, is the average for 07.01.2018.
And so on... We increment the day by 1 but calculate each average over the past 5 days, including the current date.
For a given day there are 24 hours * 60 minutes = 1440 data points, so I need the average of 1440 data points * 5 days = 7200 data points.
The final DataFrame will look like this, with the time in [DD.MM.YYYY] format (without hh:mm:ss) and the Value being the average over the past 5 days including the current date:
Time, Value
05.01.2018, 5.1398
06.01.2018, 5.1298
07.01.2018, 5.1438
...
31.12.2018, 6.3498
The bottom line is to calculate, for each day, the average of the data over the past 5 days, presented as above.
I tried iterating with a Python loop, but I wanted something better that Pandas can do natively.
Perhaps this will work?
import numpy as np
import pandas as pd

# Create one year of random data spaced evenly in 1-minute intervals.
np.random.seed(0)  # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})
>>> df.shape
(524161, 2)
Given the dataframe with 1 minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the original timestamps using the dt accessor method to grab the date, and then take the last rolling_5d_avg value for each date.
df = (
    df
    # 7200-minute (= 5-day) trailing mean computed at every minute
    .assign(rolling_5d_avg=df.rolling(window=5*24*60)['Value'].mean())
    # Keep the value at the last minute of each calendar day
    .groupby(df['Time'].dt.date)['rolling_5d_avg']
    .last()
)
>>> df.head(10)
Time
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 5.786603
2018-01-06 5.784011
2018-01-07 5.790133
2018-01-08 5.786967
2018-01-09 5.789944
2018-01-10 5.789299
Name: rolling_5d_avg, dtype: float64
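A terser variant, under the assumption that every day really does contain the same 1440 one-minute samples (so the average of daily averages equals the average over the raw points): resample to daily means first, then take a 5-day rolling mean.

daily = df.set_index('Time')['Value'].resample('D').mean()
rolling_5d_avg = daily.rolling(window=5).mean()  # NaN for the first four days, as above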
We have sales data readily available for certain periods, like 1 week, 1 month, ..., 1 year:
time_pillars = pd.Series(['1W', '1M', '3M', '1Y'])
sales = pd.Series([4.75, 5.00, 5.10, 5.75])
data = {'time_pillar': time_pillars, 'sales': sales}
df = pd.DataFrame(data)
I would like to do two operations.
Firstly, create a new column of date type, df['date'], that corresponds to the actual date 1 week, 1 month, ..., 1 year from now.
Then, I'd like to create another column df['days_from_now'], holding how many days away these pillars are (1 week would be 7 days, 1 month around 30 days, 1 year around 365 days).
The goal is then to use any day as input to a simple linear_interpolation_method() to obtain sales data for any given day (e.g., what are the sales for 4 October 2018? ---> We would interpolate between 3 months and 1 year).
Many thanks.
I'm not exactly sure what you mean regarding your interpolation, but here is a way to make your dataframe in pandas (starting from your original df you provided in your post):
from datetime import datetime
from dateutil.relativedelta import relativedelta
def create_dates(df):
    # Add one relativedelta per pillar (1W, 1M, 3M, 1Y) to today's datetime,
    # then keep just the date part.
    df['date'] = [i.date() for i in
                  [d + delt for d, delt in zip([datetime.now()] * 4,
                                               [relativedelta(weeks=1), relativedelta(months=1),
                                                relativedelta(months=3), relativedelta(years=1)])]]
    # Timedelta between each pillar date and today
    df['days_from_now'] = df['date'] - datetime.now().date()
    return df
create_dates(df)
sales time_pillar date days_from_now
0 4.75 1W 2018-04-11 7 days
1 5.00 1M 2018-05-04 30 days
2 5.10 3M 2018-07-04 91 days
3 5.75 1Y 2019-04-04 365 days
I wrapped it in a function, so that you can call it on any given day and get your results for 1 week, 3 weeks, etc. from that exact day.
Note: if you want days_from_now to simply be an integer number of days, use df['days_from_now'] = [i.days for i in df['date'] - datetime.now().date()] in the function instead of df['days_from_now'] = df['date'] - datetime.now().date()
Explanation:
df['date'] = [i.date() for i in
              [d + delt for d, delt in zip([datetime.now()] * 4,
                                           [relativedelta(weeks=1), relativedelta(months=1),
                                            relativedelta(months=3), relativedelta(years=1)])]]
takes a list of today's datetime (datetime.now()) repeated 4 times, adds a relativedelta (a time difference) of 1 week, 1 month, 3 months, and 1 year respectively, extracts the date (i.date() for ...), and finally creates a new column from the resulting list.
df['days_from_now'] = df['date'] - datetime.now().date()
is much more straightforward: it simply subtracts those new dates from today's date. The result is a timedelta object, which pandas conveniently formats as "n days".
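As for the interpolation goal itself, here is a minimal sketch with np.interp, assuming days_from_now holds timedeltas as produced above (the target day of 180 is just a hypothetical input):

import numpy as np

days = np.array([d.days for d in df['days_from_now']])  # e.g. [7, 30, 91, 365]
target_day = 180  # hypothetical: roughly six months from now
estimated_sales = np.interp(target_day, days, df['sales'])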
Say I've got a dataframe with a datetime index which covers the last financial year and one day in the current financial year (starting on April 1):
Units
date
2016-01-01 8734
2016-06-30 6120
2016-09-30 7346
2016-12-31 5925
2016-03-31 7542
2016-06-30 9916
2016-09-30 9547
2016-12-31 8063
2017-01-01 7000
2017-03-31 5672
2017-04-01 7856
I'd like to be able to select the data for the last complete four quarters - in this case ignoring the first and last rows.
I know I can do this with slicing, thus:
df["2016-04-01":"2017-03-31"]
What's the most elegant - pythonic - solution to filter the data according to the last four complete quarters programmatically?
You should first define your quarters. You can use pd.period_range for that with the correct freq.
For example:
quarters = pd.period_range('2016Q1', '2017Q1', freq='Q-MAR')
This would give you a PeriodIndex on which you can change the frequency to get the dates you want with asfreq :
quarters.asfreq('D', 'E')
That would give you the PeriodIndex you can use to slice your index.
There are more examples in the documentation.
pandas.DatetimeIndex.quarter might also be useful.
And then you can use groupby to aggregate easily.
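Putting those pieces together, a minimal sketch, assuming the index is sorted and that, as in the question, the last four complete quarters run from April 2016 to March 2017 (labelled 2017Q1 through 2017Q4 under the Q-MAR convention):

quarters = pd.period_range('2017Q1', '2017Q4', freq='Q-MAR')
start = quarters[0].to_timestamp(how='start')  # 2016-04-01
end = quarters[-1].to_timestamp(how='end')     # end of 2017-03-31
last_four = df.loc[start:end]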
Using Alex's pointer to the DateOffset functionality in Pandas, together with the datetime module, I found a partial solution:
import datetime
from pandas.tseries.offsets import BQuarterEnd, MonthBegin

now = datetime.datetime.now()
# End of the most recent complete business quarter, and the same point twelve months earlier
end_year = now - BQuarterEnd(n=1)
start_year = end_year - 12 * MonthBegin()
df[start_year.strftime("%Y-%m-%d") : end_year.strftime("%Y-%m-%d")]
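Note that with a sorted DatetimeIndex you can also slice with the Timestamps directly, e.g. df.loc[start_year:end_year], without formatting them as strings first.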