Count the number of hours in each day from a datetime index - python

From the dataframe below, I want to count the number of hours of data in each single day. Each record represents 5 minutes. The DatetimeIndex is TZ-aware.
Year data
Timestamp
2008-11-13 16:50:00+09:30 177.83
2008-11-13 16:55:00+09:30 165.73
2008-11-15 17:00:00+09:30 160.34
2008-11-15 17:15:00+09:30 148.90
2008-11-15 17:40:00+09:30 113.66
2008-11-20 17:15:00+09:30 121.12
2008-11-20 17:20:00+09:30 109.55
2008-11-20 17:35:00+09:30 100.86
2008-11-20 17:50:00+09:30 90.72
2008-11-20 07:55:00+09:30 86.27
The expected result is
Year hrs/day
Timestamp
2008-11-13 00:00:00+09:30 0.16666666666666666 # <-- 10 min / 60
2008-11-15 00:00:00+09:30 0.25 # <-- 15 min / 60
2008-11-20 00:00:00+09:30 0.4166666666666667 # <-- 25 min / 60
This is what I did.
df['Hour'] = df.index.hour.astype(int)
days = df.resample('D').apply({'Hour':'count'})
which gives me a column 'Hour' whose values are the number of records per day.
Next...
days['Hr/dy'] = (days['Hour'] * 5.0)/60.0
where '5.0' is the timestamp interval. This way, I get the expected result.
But I must switch between many data frames with different timestamp intervals. Providing the interval by hand every time I switch to a new dataset is not convenient. I need to get the timestamp interval automatically from the timestamp index.
freqdays = pd.infer_freq(df.index[0:10])
gives a non-integer frequency string ('5T') which is not usable in mathematical operations to further derive the hours.
What I need is either:
- a method to get the frequency (interval) from the timestamp index in integer or float, or
- to calculate the length of hours per day directly from the timestamp index.
Edit:
The original data has 5 minute interval with many missing records. The start and end hour is different from day to day.

You can get the minimum difference in seconds between consecutive entries of your index with:
print(df.index.to_series().sort_values().diff().min().total_seconds())
300.0
So to get your result, group by day, multiply the per-day counts by that minimum index difference, and divide by 3600 to convert to hours:
df_agg = df.groupby(df.index.date).count() \
    * df.index.to_series().sort_values().diff().min().total_seconds() / 3600
print(df_agg)
date
2008-11-13 0.166667
2008-11-15 0.250000
2008-11-20 0.416667
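For reference, the whole approach can be reproduced end-to-end on the question's sample data (a sketch; the column and variable names here are illustrative):

```python
import pandas as pd

idx = pd.to_datetime([
    '2008-11-13 16:50:00+09:30', '2008-11-13 16:55:00+09:30',
    '2008-11-15 17:00:00+09:30', '2008-11-15 17:15:00+09:30',
    '2008-11-15 17:40:00+09:30', '2008-11-20 17:15:00+09:30',
    '2008-11-20 17:20:00+09:30', '2008-11-20 17:35:00+09:30',
    '2008-11-20 17:50:00+09:30', '2008-11-20 07:55:00+09:30',
])
df = pd.DataFrame({'data': [177.83, 165.73, 160.34, 148.90, 113.66,
                            121.12, 109.55, 100.86, 90.72, 86.27]}, index=idx)

# Infer the sampling interval as the smallest gap between consecutive
# (sorted) timestamps; robust to missing records as long as at least one
# pair of adjacent samples survives.
interval_s = df.index.to_series().sort_values().diff().min().total_seconds()

# Records per day, times the interval, converted to hours.
hours_per_day = df.groupby(df.index.date).size() * interval_s / 3600
```

Note that `size()` is used instead of `count()` so the result is a single Series rather than one column per original column.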

Related

How can I filter for rows one hour before and after a set timestamp in Python?

I am trying to filter a DataFrame to only show values 1-hour before and 1-hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame, but only rows for 1-hour before and 1-hour after a specified timestamp, and so only rows within this specified 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1-hour before 2011-07-14 06:15:00, and 1-hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter where I get rid of all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00","2011-01-15 03:40:00","2011-01-15 04:10:00","2011-01-15 04:40:00","2011-01-15 05:10:00","2011-01-15 07:10:00"],
"value":[1,2,3,4,5,6]}
df=pd.DataFrame(data)
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='ignore')
date_search= dt.datetime.strptime("2011-01-15 05:20:00",'%Y-%m-%d %H:%M:%S')
mask = (df['date'] > date_search-dt.timedelta(hours = 1)) & (df['date'] <= date_search+dt.timedelta(hours = 1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
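Alternatively, the original attempt in the question was close: once the timestamps are a sorted DatetimeIndex, `.loc` label slicing selects the window directly; the syntax error came from writing `df([...])` instead of `df.loc[...]`. A minimal sketch with made-up 1-minute data:

```python
import pandas as pd

# Hypothetical 1-minute data, indexed by timestamp (already sorted).
df = pd.DataFrame(
    {'value': [34, 36, 37, 37, 39, 29, 28]},
    index=pd.date_range('2011-01-15 03:25:00', periods=7, freq='min'),
)

hour = pd.DateOffset(hours=1)
date = pd.Timestamp('2011-01-15 04:26:00')

# .loc label slicing on a sorted DatetimeIndex is inclusive on both ends.
window = df.loc[date - hour : date + hour]
```

This keeps every row whose timestamp falls within one hour either side of `date`.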

Count the average number per day

I need to count the average number of data points that occur per day, but I can't figure out how to write the code for it in Python. The data below is an example of what the data looks like. It is an ndarray of pandas Timestamps. The expected result is 2 per day for 01-01, 1 per day for 01-02, and 2 per day for 01-03.
temp time = array([ Timestamp('1979-01-01 11:21:59.904000'),
Timestamp('1979-01-01 19:59:00.096000'),
Timestamp('1979-01-02 07:54:59.904000'),
Timestamp('1979-01-03 01:03:00'),
Timestamp('1979-01-03 07:41:59.712000')]
If I understand you right, you want to use pd.Grouper with frequency set to 'D'.
For example:
import numpy as np
import pandas as pd

time = np.array([pd.Timestamp('1979-01-01 11:21:59.904000'),
                 pd.Timestamp('1979-01-01 19:59:00.096000'),
                 pd.Timestamp('1979-01-02 07:54:59.904000'),
                 pd.Timestamp('1979-01-03 01:03:00'),
                 pd.Timestamp('1979-01-03 07:41:59.712000')])
df = pd.DataFrame({'time': time})
print(df.groupby(pd.Grouper(key='time', freq='D'))['time'].count())
Prints:
time
1979-01-01 2
1979-01-02 1
1979-01-03 2
Freq: D, Name: time, dtype: int64

pandas: a rolling window of hour-of-day average

Related to: daily data, resample every 3 days, calculate over trailing 5 days efficiently; but here the summing is over strided, non-consecutive data.
I have an hourly time series. For every hour I would like to have the average of the same hour of the day, in the last preceding 10 days window. E.g. at 2019-08-14 23:00, I would like to have an average of all 23:00 data from 2019-08-04 till 2019-08-13.
Is there an efficient way to do this in pandas/numpy, or should I roll up my sleeves and write my own loops and data structures?
Extra points: if today is a workday (Mon-Fri), the average should be for the previous 10 workdays. If it's a weekend (Sat-Sun), for the previous 10 weekend-days (span about 2.5 months)
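No answer was posted, but the basic case (ignoring the workday/weekend split) can be sketched by grouping on hour of day and taking a trailing 10-value rolling mean, shifted by one so the current observation is excluded from its own window. A minimal sketch on synthetic hourly data:

```python
import numpy as np
import pandas as pd

# Synthetic hourly series: 20 days of hourly values (values are arbitrary).
idx = pd.date_range('2019-08-01', periods=24 * 20, freq='h')
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Within each hour-of-day bucket, average the 10 preceding same-hour values;
# shift(1) keeps the current hour out of its own window, and min_periods=10
# yields NaN until a full 10-day history exists.
trailing = s.groupby(s.index.hour).transform(
    lambda g: g.shift(1).rolling(10, min_periods=10).mean()
)
```

So `trailing.loc['2019-08-14 23:00']` is the average of the 23:00 values from 2019-08-04 through 2019-08-13, as the question asks. Extending this to the workday/weekend variant would mean splitting the series on `s.index.dayofweek < 5` first and applying the same transform to each part.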

Pandas: Calculate average of values for a time frame

I am working on a large datasets that looks like this:
Time, Value
01.01.2018 00:00:00.000, 5.1398
01.01.2018 00:01:00.000, 5.1298
01.01.2018 00:02:00.000, 5.1438
01.01.2018 00:03:00.000, 5.1228
01.01.2018 00:04:00.000, 5.1168
.... , ,,,,
31.12.2018 23:59:59.000, 6.3498
The data is a minute data from the first day of the year to the last day of the year
I want to use Pandas to find the average of every 5 days.
For example:
Average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is average for 05.01.2018
The next average will be from 02.01.2018 00:00:00.000 to 6.01.2018 23:59:59.000 is average for 06.01.2018
The next average will be from 03.01.2018 00:00:00.000 to 7.01.2018 23:59:59.000 is average for 07.01.2018
and so on... We increment the day by 1, but calculate the average over the trailing 5 days, including the current date.
For a given day, there are 24hours * 60minutes = 1440 data points. So I need to get the average of 1440 data points * 5 days = 7200 data points.
The final DataFrame will look like this, with the time in format [DD.MM.YYYY] (without hh:mm:ss) and the Value being the 5-day average including the current date:
Time, Value
05.01.2018, 5.1398
06.01.2018, 5.1298
07.01.2018, 5.1438
.... , ,,,,
31.12.2018, 6.3498
The bottom line is to calculate, for each day, the average over that day and the preceding days of the 5-day window, with the result shown as above.
I tried iterating with a Python loop, but I want something better that Pandas can do.
Perhaps this will work?
import numpy as np
import pandas as pd

# Create one year of random data spaced evenly in 1-minute intervals.
np.random.seed(0)  # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})
>>> df.shape
(524161, 2)
Given the dataframe with 1 minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the original timestamps using the dt accessor method to grab the date, and then take the last rolling_5d_avg value for each date.
df = (
    df
    .assign(rolling_5d_avg=df.rolling(window=5*24*60)['Value'].mean())
    .groupby(df['Time'].dt.date)['rolling_5d_avg']
    .last()
)
>>> df.head(10)
Time
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 5.786603
2018-01-06 5.784011
2018-01-07 5.790133
2018-01-08 5.786967
2018-01-09 5.789944
2018-01-10 5.789299
Name: rolling_5d_avg, dtype: float64

Propagate dates pandas and interpolate

We have some ready available sales data for certain periods, like 1week, 1month...1year:
time_pillars = pd.Series(['1W', '1M', '3M', '1Y'])
sales = pd.Series([4.75, 5.00, 5.10, 5.75])
data = {'time_pillar': time_pillars, 'sales': sales}
df = pd.DataFrame(data)
I would like to do two operations.
Firstly, create a new column of date type, df['date'], that corresponds to the actual date of 1week, 1month..1year from now.
Then, I'd like to create another column df['days_from_now'], taking how many days are on these pillars (1week would be 7days, 1month would be around 30days..1year around 365days).
The goal is then to use any day as input to a simple linear_interpolation_method() to obtain sales data for any given day (e.g., what are sales for 4 October 2018? We would interpolate between 3 months and 1 year).
Many thanks.
I'm not exactly sure what you mean regarding your interpolation, but here is a way to make your dataframe in pandas (starting from your original df you provided in your post):
from datetime import datetime
from dateutil.relativedelta import relativedelta
def create_dates(df):
    df['date'] = [i.date() for i in
                  [d + delt for d, delt in zip([datetime.now()] * 4,
                                               [relativedelta(weeks=1), relativedelta(months=1),
                                                relativedelta(months=3), relativedelta(years=1)])]]
    df['days_from_now'] = df['date'] - datetime.now().date()
    return df
create_dates(df)
sales time_pillar date days_from_now
0 4.75 1W 2018-04-11 7 days
1 5.00 1M 2018-05-04 30 days
2 5.10 3M 2018-07-04 91 days
3 5.75 1Y 2019-04-04 365 days
I wrapped it in a function so that you can call it on any given day and get your results for 1 week, 1 month, 3 months, and 1 year from that exact day.
Note: if you want your days_from_now to simply be an integer of the number of days, use df['days_from_now'] = [i.days for i in df['date'] - datetime.now().date()] in the function, instead of df['days_from_now'] = df['date'] - datetime.now().date()
Explanation:
df['date'] = [i.date() for i in
              [d + delt for d, delt in zip([datetime.now()] * 4,
                                           [relativedelta(weeks=1), relativedelta(months=1),
                                            relativedelta(months=3), relativedelta(years=1)])]]
This takes a list of today's date (datetime.now()) repeated 4 times, adds a relativedelta (a time difference) of 1 week, 1 month, 3 months, and 1 year, respectively, extracts the date (i.date() for ...), and finally creates a new column from the resulting list.
df['days_from_now'] = df['date'] - datetime.now().date()
is much more straightforward, it simply subtracts those new dates that you got above from the date today. The result is a timedelta object, which pandas conveniently formats as "n days".
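For the interpolation step the question ultimately asks about, one simple option is np.interp over the days_from_now values. A sketch, where the pillar numbers are the question's and the 183-day query is an illustrative stand-in for "roughly 4 October":

```python
import numpy as np

# Pillar horizons in days (1W, 1M, 3M, 1Y) and the corresponding sales.
days_from_now = np.array([7, 30, 91, 365])
sales = np.array([4.75, 5.00, 5.10, 5.75])

# Linearly interpolate sales for an arbitrary horizon, e.g. 183 days out,
# which falls between the 3M and 1Y pillars.
estimate = np.interp(183, days_from_now, sales)
```

np.interp requires the x-values (here days_from_now) to be increasing, which they are by construction.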
