Pandas: Calculate average of values for a time frame - python

I am working on a large dataset that looks like this:
Time, Value
01.01.2018 00:00:00.000, 5.1398
01.01.2018 00:01:00.000, 5.1298
01.01.2018 00:02:00.000, 5.1438
01.01.2018 00:03:00.000, 5.1228
01.01.2018 00:04:00.000, 5.1168
.... , ,,,,
31.12.2018 23:59:59.000, 6.3498
The data is minute-by-minute, from the first day of the year to the last day of the year.
I want to use Pandas to compute a rolling 5-day average, one value per day.
For example:
The average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is the average for 05.01.2018.
The next average, from 02.01.2018 00:00:00.000 to 06.01.2018 23:59:59.000, is the average for 06.01.2018.
The next average, from 03.01.2018 00:00:00.000 to 07.01.2018 23:59:59.000, is the average for 07.01.2018,
and so on. The window advances by one day each time, and each average covers the 5 days up to and including the current date.
For a given day, there are 24 hours * 60 minutes = 1440 data points. So I need the average of 1440 data points * 5 days = 7200 data points.
The final DataFrame will look like this, with the time in the format [DD.MM.YYYY] (without hh:mm:ss) and the Value being the average over the 5 days up to and including that date:
Time, Value
05.01.2018, 5.1398
06.01.2018, 5.1298
07.01.2018, 5.1438
.... , ,,,,
31.12.2018, 6.3498
The bottom line is to calculate, for each day, the average of the data from the past 5 days including that day, and to show the averages as above.
I tried iterating with a Python loop, but I would like something better that uses Pandas directly.

Perhaps this will work?
import numpy as np
import pandas as pd
# Create one year of random data spaced evenly in 1 minute intervals.
np.random.seed(0) # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})
>>> df.shape
(524161, 2)
Given the dataframe with 1 minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the original timestamps using the dt accessor method to grab the date, and then take the last rolling_5d_avg value for each date.
df = (
    df
    .assign(rolling_5d_avg=df.rolling(window=5*24*60)['Value'].mean())
    .groupby(df['Time'].dt.date)['rolling_5d_avg']
    .last()
)
>>> df.head(10)
Time
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 5.786603
2018-01-06 5.784011
2018-01-07 5.790133
2018-01-08 5.786967
2018-01-09 5.789944
2018-01-10 5.789299
Name: rolling_5d_avg, dtype: float64
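A count-based window of 7200 rows only covers exactly five days when no minutes are missing. As a hedged alternative sketch (not the answerer's method), one could first reduce each day to a sum and a count and then roll over five daily rows. The toy data below, where every value equals the day of the month, is an assumption chosen so the result is easy to check by hand.

```python
import pandas as pd

# Toy data (assumption): one week of minute data where each value is the day of the month.
idx = pd.date_range("2018-01-01", "2018-01-07 23:59", freq="min")
s = pd.Series(idx.day.to_numpy(dtype=float), index=idx, name="Value")

# Reduce each calendar day to the sum and the number of its data points.
daily = s.resample("D").agg(["sum", "count"])

# Roll over 5 daily rows; dividing total sum by total count weights every minute equally.
window = daily.rolling(window=5).sum()
rolling_5d_avg = window["sum"] / window["count"]

# Label each row with the date only, in DD.MM.YYYY format.
rolling_5d_avg.index = rolling_5d_avg.index.strftime("%d.%m.%Y")
print(rolling_5d_avg)
```

The first four days are NaN because a full five-day window is not yet available, which matches the behaviour of the answer above.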

Related

Pandas resampling data with bigger interval than a whole index range

Situation
I have the following pandas time series data:
date        predicted1
2001-03-13  0.994756
2005-08-22  0.551661
2000-05-07  0.001396
I need to handle resampling into an interval longer than 5 years, e.g. 10 years:
sample = data.set_index(pd.DatetimeIndex(data['date'])).drop('date', axis=1)['predicted1']
sample.resample('10Y').sum()
I get the following:
date
2000-12-31    0.001396
2010-12-31    1.546418
So the resampling function groups the data for the first year separately from the other years.
Question
How can I group all of the data into a single 10-year interval? I want to get something like this:
date
2000-12-31    1.5478132011506138
You can change the reference, closing and label in resample:
sample.resample('10Y', origin=sample.index.min(), closed='left', label='left').sum()
Output:
date
1999-12-31 1.547813
Freq: 10A-DEC, Name: predicted1, dtype: float64
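If the exact bin edges produced by resample are not important, a different technique is to group on a decade label computed from the year. This is only a sketch under that assumption; the names decade and by_decade are made up for the illustration, and the values are the rounded ones from the question.

```python
import pandas as pd

data = pd.DataFrame({
    "date": ["2001-03-13", "2005-08-22", "2000-05-07"],
    "predicted1": [0.994756, 0.551661, 0.001396],
})
sample = data.set_index(pd.DatetimeIndex(data["date"]))["predicted1"]

# Integer-divide the year by 10 to get a decade label: 2000..2009 -> 2000.
decade = (sample.index.year // 10) * 10

# All three rows fall into the 2000s, so they end up in a single group.
by_decade = sample.groupby(decade).sum()
print(by_decade)
```

Unlike resample, this does not create empty bins for decades without data, and the bins always start at years ending in 0.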

Counting consecutive days of temperature data

So I have some sea surface temperature anomaly data. These data have been filtered down so that these are the values that are below a certain threshold. However, I am trying to identify cold spells - that is, to isolate events that last longer than 5 consecutive days. A sample of my data is below (I've been working between xarray datasets/dataarrays and pandas dataframes). Note, the 'day' is the day number of the month I am looking at (eventually will be expanded to the whole year). I have been scouring SO/the internet for ways to extract these 5-day-or-longer events based on the 'day' column, but I haven't gotten anything to work. I'm still relatively new to coding so my first thought was looping over the rows of the 'day' column but I'm not sure. Any insight is appreciated.
Here's what some of my data look like as a pandas df:
lat lon time day ssta
5940 24.125 262.375 1984-06-03 3 -1.233751
21072 24.125 262.375 1984-06-04 4 -1.394495
19752 24.125 262.375 1984-06-05 5 -1.379742
10223 24.125 262.375 1984-06-27 27 -1.276407
47355 24.125 262.375 1984-06-28 28 -1.840763
... ... ... ... ... ...
16738 30.875 278.875 2015-06-30 30 -1.345640
3739 30.875 278.875 2020-06-16 16 -1.212824
25335 30.875 278.875 2020-06-17 17 -1.446407
41891 30.875 278.875 2021-06-01 1 -1.714249
27740 30.875 278.875 2021-06-03 3 -1.477497
64228 rows × 5 columns
As a filtered xarray:
xarray.Dataset
Dimensions: lat: 28, lon: 68, time: 1174
Coordinates:
time (time) datetime64[ns] 1982-06-01 ... 2021-06-04
lon (lon) float32 262.1 262.4 262.6 ... 278.6 278.9
lat (lat) float32 24.12 24.38 24.62 ... 30.62 30.88
day (time) int64 1 2 3 4 5 6 7 ... 28 29 30 1 2 3 4
Data variables:
ssta (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (0)
TL;DR: I want to identify (and retain the information of) events that last 5+ consecutive days, i.e. if there were a day 3 through day 8, or day 21 through day 30, etc.
I think that rather than filtering your original data, you should try to do it the pandas way, which in this case means obtaining a series of True/False values depending on your condition.
Your data seems not to include temperatures so here is my example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'temp':np.random.randint(10,high=40,size=64228,dtype='int64')})
This will generate a DataFrame with a single column containing random integer temperatures from 10 to 39 degrees (the upper bound of randint is exclusive). Notice that I can just work with the auto-generated index, but you might have to switch it to a column like time or date using .set_index. Say we are interested in the consecutive days with more than 30 degrees.
is_over_30 = df['temp'] > 30
will give us a True/False array with that information. Notice that this format is very useful since we can index with it. E.g. df[is_over_30] will give us the rows of the dataframe for days where the temperature is over 30 deg. Now we want to shift the True/False values in is_over_30 by one position and generate a new series that is True only where both are True, like so:
is_over_30 & np.roll(is_over_30, -1)
Basically we are done here and could write 3 more of those & rolls. But there is a way to write it more concisely.
from functools import reduce
is_consecutively_over_30 = reduce(lambda a,b: a&b, [np.roll(is_over_30, -i) for i in range(5)])
Keep in mind that the last 4 days can never start a run of 5 consecutive days over 30 deg, yet they might still be flagged here, because roll wraps the first values around into those positions. You can resolve this by setting the last 4 values to False.
is_consecutively_over_30[-4:] = False
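As a hedged alternative to chaining rolls, a rolling sum over the boolean mask avoids the wrap-around issue altogether. The sketch below uses a tiny hand-made mask as an assumption; window_end and window_start are illustrative names, not part of the answer above.

```python
import pandas as pd

# Toy mask (assumption): five True days, one False day, then two True days.
is_over_30 = pd.Series([True] * 5 + [False] + [True] * 2)

# A window of 5 days sums to 5 only if all five days are True.
# The flag lands on the LAST day of each qualifying 5-day window.
window_end = is_over_30.astype(int).rolling(window=5).sum().eq(5)

# Shift back by 4 positions to flag the FIRST day of each window instead,
# which corresponds to what the np.roll version marks.
window_start = window_end.shift(-4, fill_value=False)

print(window_end.tolist())
print(window_start.tolist())
```

Because rolling never wraps around the ends of the series, no manual correction of the last 4 values is needed here.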
You can pull the day ranges of the spells using this approach:
min_spell_days = 6
days = {'day': [1,2,5,6,7,8,9,10,17,19,21,22,23,24,25,26,27,31]}
df = pd.DataFrame(days)
Find number of days between consecutive entries:
diff = df['day'].diff()
Mark the last day of a spell:
df['last'] = (diff == 1) & (diff.shift(-1) > 1)
Accumulate the number of days in each spell:
df['diff0'] = np.where(diff > 1, 0, diff)
df['cs'] = df['diff0'].eq(0).cumsum()
df['spell_days'] = df.groupby('cs')['diff0'].transform('cumsum')
Mark the last entry as the last day of a spell if applicable:
if diff.iat[-1] == 1:
    df.loc[df.index[-1], 'last'] = True
Select the last day of all qualifying spells:
df_spells = (df[df['last'] & (df['spell_days'] >= (min_spell_days-1))]).copy()
Identify the start, end and duration of each spell:
df_spells['end_day'] = df_spells['day']
df_spells['start_day'] = (df_spells['day'] - df['spell_days'])
df_spells['spell_days'] = df['spell_days'] + 1
Resulting df:
df_spells[['start_day','end_day','spell_days']].astype('int')
    start_day  end_day  spell_days
7           5       10           6
16         21       27           7
Also, using date arithmetic, 'day' could instead hold a serial day number relative to some base date, such as 1/1/1900. That way, spells that span month and year boundaries could be handled, and it would be trivial to convert the serial number back to a date.
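The date-arithmetic idea can be sketched directly on a datetime column, without a serial day number. This is a minimal illustration for a single location, assuming made-up dates; with the real data one would presumably group by lat and lon first. The names run_id and spells are hypothetical.

```python
import pandas as pd

# Toy dates (assumption): a 3-day run, then a 6-day run that crosses a month boundary.
times = pd.Series(pd.to_datetime([
    "1984-06-03", "1984-06-04", "1984-06-05",
    "1984-06-27", "1984-06-28", "1984-06-29",
    "1984-06-30", "1984-07-01", "1984-07-02",
]))
min_spell_days = 5

# A new run starts wherever the gap to the previous date is not exactly one day.
run_id = (times.diff() != pd.Timedelta(days=1)).cumsum()

# Summarise each run, then keep only the runs that are long enough.
spells = times.groupby(run_id).agg(start="min", end="max", spell_days="count")
spells = spells[spells["spell_days"] >= min_spell_days]
print(spells)
```

Only the second run survives the filter, and it is counted correctly even though it spans the end of June.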

How can I filter for rows one hour before and after a set timestamp in Python?

I am trying to filter a DataFrame to only show values 1-hour before and 1-hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame, but only rows for 1-hour before and 1-hour after a specified timestamp, and so only rows within this specified 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1-hour before 2011-07-14 06:15:00, and 1-hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter that drops all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with a boolean mask:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00","2011-01-15 03:40:00","2011-01-15 04:10:00","2011-01-15 04:40:00","2011-01-15 05:10:00","2011-01-15 07:10:00"],
"value":[1,2,3,4,5,6]}
df=pd.DataFrame(data)
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
date_search= dt.datetime.strptime("2011-01-15 05:20:00",'%Y-%m-%d %H:%M:%S')
mask = (df['date'] > date_search-dt.timedelta(hours = 1)) & (df['date'] <= date_search+dt.timedelta(hours = 1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
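The SyntaxError in the question comes from calling the DataFrame with parentheses around a slice; label slicing needs .loc with square brackets. A minimal sketch of that fix follows, assuming the index is a sorted DatetimeIndex. The helper name window_around and the sample data are made up for illustration.

```python
import pandas as pd

def window_around(df, when, hours=1):
    """Return the rows within `hours` before and after `when` (both ends inclusive)."""
    center = pd.Timestamp(when)
    delta = pd.Timedelta(hours=hours)
    # Label slicing on a sorted DatetimeIndex includes both endpoints.
    return df.loc[center - delta : center + delta]

# Toy data (assumption): 1-minute rows covering a few hours.
idx = pd.date_range("2011-07-14 04:00", "2011-07-14 09:00", freq="min")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)
df.index.name = "timestamp"

result = window_around(df, "2011-07-14 06:15:00")
print(result.index.min(), result.index.max(), len(result))
```

The user only supplies the single timestamp; the two-hour window is derived inside the function.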

Python count the number of hours in each day from datetime index

From the dataframe below, I want to count the number of hours in each single day. Each record represents 5 minutes. DateTime is TZ-aware.
Year data
Timestamp
2008-11-13 16:50:00+09:30 177.83
2008-11-13 16:55:00+09:30 165.73
2008-11-15 17:00:00+09:30 160.34
2008-11-15 17:15:00+09:30 148.90
2008-11-15 17:40:00+09:30 113.66
2008-11-20 17:15:00+09:30 121.12
2008-11-20 17:20:00+09:30 109.55
2008-11-20 17:35:00+09:30 100.86
2008-11-20 17:50:00+09:30 90.72
2008-11-20 07:55:00+09:30 86.27
The expected result is
Year hrs/day
Timestamp
2008-11-13 00:00:00+09:30 0.16666666666666666 # <-- 10 min / 60
2008-11-15 00:00:00+09:30 0.25 # <-- 15 min / 60
2008-11-20 00:00:00+09:30 0.4166666666666667 # <-- 25 min / 60
This is what I did.
df['Hour'] = df.index.hour.astype(int)
days = df.resample('D').apply({'Hour':'count'})
which gives me a column 'Hour' whose values are the number of records per day.
Next...
days['Hr/dy'] = (days['Hour'] * 5.0)/60.0
where '5.0' is the timestamp interval in minutes. This way, I can get the expected result.
But I must switch between many data frames with different timestamp intervals. Providing the interval by hand every time I switch to new data is not convenient. I need to get the timestamp interval automatically from the timestamp index.
freqdays = pd.infer_freq(df.index[0:10])
gives a non-integer timestamp frequency ('5T'), which is not usable in mathematical operations to get the hours.
What I need is either:
- a method to get the frequency (interval) from the timestamp index in integer or float, or
- to calculate the length of hours per day directly from the timestamp index.
Edit:
The original data has 5 minute interval with many missing records. The start and end hour is different from day to day.
You can try to get the minimum difference in seconds between consecutive entries of your index with:
print (df.index.to_series().sort_values().diff().min().total_seconds())
300.0
So to get your result, group by day, count the records, multiply by the minimum index difference in seconds, and divide by 3600 to convert to hours:
df_agg = df.groupby(df.index.date).count()\
    *df.index.to_series().sort_values().diff().min().total_seconds()/3600
print (df_agg)
date
2008-11-13 0.166667
2008-11-15 0.250000
2008-11-20 0.416667
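For reference, here is a self-contained sketch of the same idea that keeps the time zone in the daily labels by grouping on the normalized index. It assumes that at least one pair of neighbouring records is exactly one sampling interval apart; the names step_hours and hours are illustrative, and the rows are copied from the question.

```python
import pandas as pd

# Rows copied from the question; the +09:30 offset makes the index tz-aware.
idx = pd.to_datetime([
    "2008-11-13 16:50:00+09:30", "2008-11-13 16:55:00+09:30",
    "2008-11-15 17:00:00+09:30", "2008-11-15 17:15:00+09:30",
    "2008-11-15 17:40:00+09:30", "2008-11-20 17:15:00+09:30",
    "2008-11-20 17:20:00+09:30", "2008-11-20 17:35:00+09:30",
    "2008-11-20 17:50:00+09:30", "2008-11-20 07:55:00+09:30",
])
df = pd.DataFrame({"data": [177.83, 165.73, 160.34, 148.90, 113.66,
                            121.12, 109.55, 100.86, 90.72, 86.27]}, index=idx)

# Smallest gap between sorted timestamps, taken as the sampling interval (assumption).
step = df.index.to_series().sort_values().diff().min()
step_hours = step / pd.Timedelta(hours=1)

# Count records per local calendar day and convert the count to hours.
hours = df["data"].groupby(df.index.normalize()).count() * step_hours
print(hours)
```

Dividing one Timedelta by another yields a plain float, so no frequency string such as '5T' has to be parsed.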

Propagate dates pandas and interpolate

We have some ready available sales data for certain periods, like 1week, 1month...1year:
time_pillars = pd.Series(['1W', '1M', '3M', '1Y'])
sales = pd.Series([4.75, 5.00, 5.10, 5.75])
data = {'time_pillar': time_pillars, 'sales': sales}
df = pd.DataFrame(data)
I would like to do two operations.
Firstly, create a new column of date type, df['date'], that corresponds to the actual date of 1week, 1month..1year from now.
Then, I'd like to create another column df['days_from_now'], taking how many days are on these pillars (1week would be 7days, 1month would be around 30days..1year around 365days).
The goal of this is then to use any day as input for a simple linear_interpolation_method() to obtain sales data for any given day (e.g., what are sales for 4 October 2018? ---> We would interpolate between 3 months and 1 year).
Many thanks.
I'm not exactly sure what you mean regarding your interpolation, but here is a way to make your dataframe in pandas (starting from your original df you provided in your post):
from datetime import datetime
from dateutil.relativedelta import relativedelta
def create_dates(df):
    df['date'] = [i.date() for i in
                  [d+delt for d,delt in zip([datetime.now()] * 4 ,
                   [relativedelta(weeks=1), relativedelta(months=1),
                    relativedelta(months=3), relativedelta(years=1)])]]
    df['days_from_now'] = df['date'] - datetime.now().date()
    return df
create_dates(df)
sales time_pillar date days_from_now
0 4.75 1W 2018-04-11 7 days
1 5.00 1M 2018-05-04 30 days
2 5.10 3M 2018-07-04 91 days
3 5.75 1Y 2019-04-04 365 days
I wrapped it in a function, so that you can call it on any given day and get your results for 1 week, 1 month, etc. from that exact day.
Note: if you want your days_from_now to simply be an integer of the number of days, use df['days_from_now'] = [i.days for i in df['date'] - datetime.now().date()] in the function, instead of df['days_from_now'] = df['date'] - datetime.now().date()
Explanation:
df['date'] = [i.date() for i in
[d+delt for d,delt in zip([datetime.now()] * 4 ,
[relativedelta(weeks=1), relativedelta(months=1),
relativedelta(months=3), relativedelta(years=1)])]]
Takes a list of the date today (datetime.now()) repeated 4 times, and adds a relativedelta (a time difference) of 1 week, 1 month, 3 months, and 1 year, respectively, extracts the date (i.date() for ...), finally creating a new column using the resulting list.
df['days_from_now'] = df['date'] - datetime.now().date()
is much more straightforward: it simply subtracts today's date from the new dates that you got above. The result is a timedelta object, which pandas conveniently formats as "n days".
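The interpolation step asked about in the question could be sketched with numpy's interp on the days_from_now column. This is only an illustration under stated assumptions: the base date is fixed to 2018-04-04 (the date implied by the output above) instead of datetime.now(), and the offsets mapping and the interpolate_sales helper are hypothetical names.

```python
import numpy as np
import pandas as pd

# Fixed base date (assumption) so that the example is reproducible.
base = pd.Timestamp("2018-04-04")

# Hypothetical mapping from pillar labels to calendar offsets.
offsets = {
    "1W": pd.DateOffset(weeks=1),
    "1M": pd.DateOffset(months=1),
    "3M": pd.DateOffset(months=3),
    "1Y": pd.DateOffset(years=1),
}

df = pd.DataFrame({"time_pillar": ["1W", "1M", "3M", "1Y"],
                   "sales": [4.75, 5.00, 5.10, 5.75]})
df["date"] = [base + offsets[p] for p in df["time_pillar"]]
df["days_from_now"] = (df["date"] - base).dt.days

def interpolate_sales(day):
    """Linearly interpolate sales for `day` between the two surrounding pillars."""
    days = (pd.Timestamp(day) - base).days
    # np.interp clamps to the first/last sales value outside the pillar range.
    return float(np.interp(days, df["days_from_now"], df["sales"]))

value = interpolate_sales("2018-10-04")
print(df)
print(value)
```

For 4 October 2018 the query falls between the 3-month and 1-year pillars, so the result lies between 5.10 and 5.75.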
