Count the average number per day - python

I need to count the average number of data points that occur per day, but I can't figure out how to write the code for it in Python. The data below is an example of what the data looks like. It is an ndarray of pandas Timestamp objects. The expected result is that 01-01 would have 2 per day, 01-02 would have 1 per day, and 01-03 would have 2 per day.
temp time = array([ Timestamp('1979-01-01 11:21:59.904000'),
Timestamp('1979-01-01 19:59:00.096000'),
Timestamp('1979-01-02 07:54:59.904000'),
Timestamp('1979-01-03 01:03:00'),
Timestamp('1979-01-03 07:41:59.712000')])

If I understand you right, you want to use pd.Grouper with frequency set to 'D'.
For example:
import numpy as np
import pandas as pd

time = np.array([pd.Timestamp('1979-01-01 11:21:59.904000'),
                 pd.Timestamp('1979-01-01 19:59:00.096000'),
                 pd.Timestamp('1979-01-02 07:54:59.904000'),
                 pd.Timestamp('1979-01-03 01:03:00'),
                 pd.Timestamp('1979-01-03 07:41:59.712000')])
df = pd.DataFrame({'time': time})
# Group the timestamps into calendar days and count the rows in each day.
print(df.groupby(pd.Grouper(key='time', freq='D'))['time'].count())
Prints:
time
1979-01-01 2
1979-01-02 1
1979-01-03 2
Freq: D, Name: time, dtype: int64
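If a single average over all days is what you are after, one possible follow-up is to take the mean of the daily counts. This is only a sketch using the same sample timestamps; the names counts and average_per_day are mine, not from the question.

```python
import numpy as np
import pandas as pd

time = np.array([pd.Timestamp('1979-01-01 11:21:59.904000'),
                 pd.Timestamp('1979-01-01 19:59:00.096000'),
                 pd.Timestamp('1979-01-02 07:54:59.904000'),
                 pd.Timestamp('1979-01-03 01:03:00'),
                 pd.Timestamp('1979-01-03 07:41:59.712000')])

# One row per timestamp; resampling by day then sums the rows of each calendar day.
# Days inside the range that have no data would show up with a count of 0.
counts = pd.Series(1, index=pd.DatetimeIndex(time)).resample('D').sum()

# Average number of data points per day over the whole range.
average_per_day = counts.mean()
print(counts)
print(average_per_day)
```

With the sample data the daily counts are 2, 1 and 2, so the average is 5/3. Whether empty days should count toward the average depends on your use case.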

Related

Date frequency detection, but one that occurs most often

I am looking for a way to check the frequency of dates in a column. My dates have a weekly frequency, but sometimes there is a gap of 2 or 3 weeks, and the pd.infer_freq method returns None.
My data:
2022-01-01
2022-01-08
2022-01-23
2022-01-30
Your sample data is too small and irregular for pd.infer_freq to infer a frequency. You could instead find the most common time difference between consecutive dates and use that as the frequency:
import pandas as pd
dates = ['2022-01-01', '2022-01-08', '2022-01-23', '2022-01-30']
s = pd.Series(pd.to_datetime(dates))
print((s - s.shift(1)).mode())
Output
0 7 days
dtype: timedelta64[ns]
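Building on that idea, here is a small sketch of how the most common difference could be turned into a plain number of days and used to locate the irregular gaps. The variable names (gaps, step, after_long_gap) are illustrative only.

```python
import pandas as pd

dates = ['2022-01-01', '2022-01-08', '2022-01-23', '2022-01-30']
s = pd.Series(pd.to_datetime(dates))

gaps = s.diff()                 # time difference to the previous date
step = gaps.mode().iloc[0]      # most common spacing, a pd.Timedelta
step_days = step.days           # the same spacing as a plain integer

# Dates that follow a gap longer than the usual step.
after_long_gap = s[gaps > step]
print(step_days)
print(after_long_gap)
```

For the sample dates, step_days is 7 and the only date flagged is 2022-01-23, which comes 15 days after the previous entry.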

Counting consecutive days of temperature data

So I have some sea surface temperature anomaly data. These data have been filtered down so that these are the values that are below a certain threshold. However, I am trying to identify cold spells - that is, to isolate events that last longer than 5 consecutive days. A sample of my data is below (I've been working between xarray datasets/dataarrays and pandas dataframes). Note, the 'day' is the day number of the month I am looking at (eventually will be expanded to the whole year). I have been scouring SO/the internet for ways to extract these 5-day-or-longer events based on the 'day' column, but I haven't gotten anything to work. I'm still relatively new to coding so my first thought was looping over the rows of the 'day' column but I'm not sure. Any insight is appreciated.
Here's what some of my data look like as a pandas df:
lat lon time day ssta
5940 24.125 262.375 1984-06-03 3 -1.233751
21072 24.125 262.375 1984-06-04 4 -1.394495
19752 24.125 262.375 1984-06-05 5 -1.379742
10223 24.125 262.375 1984-06-27 27 -1.276407
47355 24.125 262.375 1984-06-28 28 -1.840763
... ... ... ... ... ...
16738 30.875 278.875 2015-06-30 30 -1.345640
3739 30.875 278.875 2020-06-16 16 -1.212824
25335 30.875 278.875 2020-06-17 17 -1.446407
41891 30.875 278.875 2021-06-01 1 -1.714249
27740 30.875 278.875 2021-06-03 3 -1.477497
64228 rows × 5 columns
As a filtered xarray:
xarray.Dataset
Dimensions: lat: 28, lon: 68, time: 1174
Coordinates:
time (time) datetime64[ns] 1982-06-01 ... 2021-06-04
lon (lon) float32 262.1 262.4 262.6 ... 278.6 278.9
lat (lat) float32 24.12 24.38 24.62 ... 30.62 30.88
day (time) int64 1 2 3 4 5 6 7 ... 28 29 30 1 2 3 4
Data variables:
ssta (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (0)
TL;DR: I want to identify (and retain the information for) events that last 5+ consecutive days, e.g. day 3 through day 8, or day 21 through day 30.
Rather than filtering your original data, I think you should do it the pandas way, which in this case means building a Series of True/False values from your condition.
Your data seems not to include temperatures so here is my example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'temp':np.random.randint(10,high=40,size=64228,dtype='int64')})
This generates a DataFrame with a single column of random integer temperatures from 10 to 39 degrees (the high bound of 40 is exclusive). Note that I can just work with the auto-generated index, but you might have to switch to a column such as time or date using .set_index. Say we are interested in runs of consecutive days above 30 degrees.
is_over_30 = df['temp'] > 30
This gives us a True/False Series with that information. This format is very useful because we can index with it: for example, df[is_over_30] returns the rows of the dataframe for days where the temperature is over 30 degrees. Next, we shift the True/False values in is_over_30 by one position, so that each day lines up with the following day, and build a new array that is True only where both are True:
is_over_30 & np.roll(is_over_30, -1)
At this point we are basically done and could chain three more of those & rolls by hand, but there is a more concise way to write it:
from functools import reduce
is_consecutively_over_30 = reduce(lambda a,b: a&b, [np.roll(is_over_30, -i) for i in range(5)])
Keep in mind that the last 4 days cannot start a run of 5 consecutive days over 30 degrees, yet they may still be flagged, because np.roll wraps the first values around to the end of the array. You can set the last 4 values to False to resolve this:
is_consecutively_over_30[-4:] = False
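As an alternative that avoids the wrap-around altogether, the same check can be sketched with a pandas rolling window instead of np.roll. This is a swapped-in technique, not the approach above, and the temperatures are made-up values chosen so the result is easy to verify.

```python
import pandas as pd

# Hypothetical temperatures: a 6-day warm run, a break, then a 4-day warm run.
temp = pd.Series([31, 32, 33, 34, 35, 36, 20, 31, 32, 33, 34, 20])
is_over_30 = temp > 30

# True on the last day of every window of 5 consecutive warm days.
ends_window = is_over_30.astype(int).rolling(window=5).sum() == 5

# Shift back by 4 so True marks the first day of each window instead,
# matching the np.roll version; the tail is filled with False.
starts_window = ends_window.shift(-4, fill_value=False)
print(starts_window[starts_window].index.tolist())
```

Only the 6-day run qualifies here: it contains two 5-day windows, starting at positions 0 and 1, while the 4-day run is ignored.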
You can pull the day ranges of the spells using this approach:
import numpy as np
import pandas as pd
min_spell_days = 6
days = {'day': [1,2,5,6,7,8,9,10,17,19,21,22,23,24,25,26,27,31]}
df = pd.DataFrame(days)
Find number of days between consecutive entries:
diff = df['day'].diff()
Mark the last day of a spell:
df['last'] = (diff == 1) & (diff.shift(-1) > 1)
Accumulate the number of days in each spell:
df['diff0'] = np.where(diff > 1, 0, diff)
df['cs'] = df['diff0'].eq(0).cumsum()
df['spell_days'] = df.groupby('cs')['diff0'].transform('cumsum')
Mark the last entry as the last day of a spell if applicable:
if diff.iat[-1] == 1:
    # Assign through df.loc; the chained form df['last'].iat[-1] may not modify df.
    df.loc[df.index[-1], 'last'] = True
Select the last day of all qualifying spells:
df_spells = (df[df['last'] & (df['spell_days'] >= (min_spell_days-1))]).copy()
Identify the start, end and duration of each spell:
df_spells['end_day'] = df_spells['day']
df_spells['start_day'] = (df_spells['day'] - df['spell_days'])
df_spells['spell_days'] = df['spell_days'] + 1
Resulting df:
df_spells[['start_day','end_day','spell_days']].astype('int')
start_day end_day spell_days
7 5 10 6
16 21 27 7
Also, 'day' could instead hold a serial day number relative to some base date (such as 1/1/1900), computed with date arithmetic. That way, spells that span month and year boundaries can be handled. Converting the serial number back to a date is then trivial, again using date arithmetic.
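A minimal sketch of that serial-day idea, assuming an arbitrary base date and a few hypothetical dates that cross a month boundary (neither comes from the question's data):

```python
import pandas as pd

base = pd.Timestamp('1900-01-01')   # arbitrary base date (an assumption)
times = pd.to_datetime(['1984-06-29', '1984-06-30', '1984-07-01', '1984-07-02'])

# Whole days elapsed since the base date, one integer per timestamp.
serial_day = (times - base).days

# The serial numbers keep increasing by 1 across the month boundary,
# whereas the day-of-month column would jump from 30 back to 1.
consecutive = pd.Series(serial_day).diff().dropna().eq(1).all()

# Convert the serial numbers back to dates.
recovered = base + pd.to_timedelta(serial_day, unit='D')
print(consecutive)
print(recovered)
```

The serial_day values could replace the 'day' column in the spell detection above without any other change.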

Python count the number of hours in each day from datetime index

From the dataframe below, I want to count the number of hours of data in each day. Each record represents 5 minutes. The DatetimeIndex is tz-aware.
Year data
Timestamp
2008-11-13 16:50:00+09:30 177.83
2008-11-13 16:55:00+09:30 165.73
2008-11-15 17:00:00+09:30 160.34
2008-11-15 17:15:00+09:30 148.90
2008-11-15 17:40:00+09:30 113.66
2008-11-20 17:15:00+09:30 121.12
2008-11-20 17:20:00+09:30 109.55
2008-11-20 17:35:00+09:30 100.86
2008-11-20 17:50:00+09:30 90.72
2008-11-20 07:55:00+09:30 86.27
The expected result is
Year hrs/day
Timestamp
2008-11-13 00:00:00+09:30 0.16666666666666666 # <-- 10 min / 60
2008-11-15 00:00:00+09:30 0.25 # <-- 15 min / 60
2008-11-20 00:00:00+09:30 0.4166666666666667 # <-- 25 min / 60
This is what I did.
df['Hour'] = df.index.hour.astype(int)
days = df.resample('D').apply({'Hour':'count'})
which gives me a column 'Hour' whose values are the number of records per day.
Next...
days['Hr/dy'] = (days['Hour'] * 5.0)/60.0
where '5.0' is the timestamp interval. With this way, I can get the expected result.
But, I must switch between many data frames with different timestamp intervals. Providing the interval one-by-one every time I switch to a new data is not convenient. I need to get the timestamp interval automatically from the timestamp index.
freqdays = pd.infer_freq(df.index[0:10])
gives the frequency as a string ('5T'), which is not usable for the arithmetic needed to get the hours.
What I need is either:
- a method to get the frequency (interval) from the timestamp index in integer or float, or
- to calculate the length of hours per day directly from the timestamp index.
Edit:
The original data has 5 minute interval with many missing records. The start and end hour is different from day to day.
You can get the minimum difference in seconds between consecutive index entries with:
# Older pandas needed to_series(keep_tz=True); recent versions no longer accept that argument and keep the timezone by default.
print(df.index.to_series().sort_values().diff().min().total_seconds())
300.0
To get your result, group by day, count the records, multiply by that minimum difference, and divide by 3600 to convert to hours:
step_seconds = df.index.to_series().sort_values().diff().min().total_seconds()
df_agg = df.groupby(df.index.date).count() * step_seconds / 3600
print(df_agg)
print (df_agg)
date
2008-11-13 0.166667
2008-11-15 0.250000
2008-11-20 0.416667
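For reference, here is a self-contained sketch of the same calculation on the sample rows from the question. The fixed +09:30 offset and the names step, records_per_day and hours_per_day are assumptions made for this example; it also assumes the smallest gap between records equals the sampling interval.

```python
from datetime import timedelta, timezone

import pandas as pd

# Fixed +09:30 offset, as shown in the question's timestamps.
tz = timezone(timedelta(hours=9, minutes=30))
index = pd.DatetimeIndex([
    '2008-11-13 16:50', '2008-11-13 16:55',
    '2008-11-15 17:00', '2008-11-15 17:15', '2008-11-15 17:40',
    '2008-11-20 17:15', '2008-11-20 17:20', '2008-11-20 17:35',
    '2008-11-20 17:50', '2008-11-20 07:55',
]).tz_localize(tz)
df = pd.DataFrame(
    {'data': [177.83, 165.73, 160.34, 148.90, 113.66,
              121.12, 109.55, 100.86, 90.72, 86.27]},
    index=index,
)

# Smallest gap between sorted timestamps, taken as the sampling interval.
step = df.index.to_series().sort_values().diff().min()
step_hours = step / pd.Timedelta(hours=1)

# Records per local calendar day, keyed by tz-aware midnight timestamps.
records_per_day = df['data'].groupby(df.index.normalize()).count()
hours_per_day = records_per_day * step_hours
print(hours_per_day)
```

Grouping on df.index.normalize() keeps the result index tz-aware, which is closer to the expected output in the question than grouping on df.index.date.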

Pandas: Calculate average of values for a time frame

I am working on a large dataset that looks like this:
Time, Value
01.01.2018 00:00:00.000, 5.1398
01.01.2018 00:01:00.000, 5.1298
01.01.2018 00:02:00.000, 5.1438
01.01.2018 00:03:00.000, 5.1228
01.01.2018 00:04:00.000, 5.1168
.... , ,,,,
31.12.2018 23:59:59.000, 6.3498
The data is minute-by-minute, from the first day of the year to the last day of the year.
I want to use Pandas to find the average of every 5 days.
For example:
Average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is average for 05.01.2018
The next average will be from 02.01.2018 00:00:00.000 to 6.01.2018 23:59:59.000 is average for 06.01.2018
The next average will be from 03.01.2018 00:00:00.000 to 7.01.2018 23:59:59.000 is average for 07.01.2018
and so on... We are incrementing day by 1 but calculating an average from the day to past 5days, including the current date.
For a given day, there are 24hours * 60minutes = 1440 data points. So I need to get the average of 1440 data points * 5 days = 7200 data points.
The final DataFrame will look like this, time format [DD.MM.YYYY] (without hh:mm:ss) and the Value is the average of 5 data including the current date:
Time, Value
05.01.2018, 5.1398
06.01.2018, 5.1298
07.01.2018, 5.1438
.... , ,,,,
31.12.2018, 6.3498
The bottom line is to calculate the average of data from today to the past 5 days and the average value is shown as above.
I tried iterating with a Python loop, but I would like something better that uses pandas.
Perhaps this will work?
import numpy as np
import pandas as pd
# Create one year of random data spaced evenly in 1 minute intervals.
np.random.seed(0) # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})
>>> df.shape
(524161, 2)
Given the dataframe with 1 minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the original timestamps using the dt accessor method to grab the date, and then take the last rolling_5d_avg value for each date.
df = (
df
.assign(rolling_5d_avg=df.rolling(window=5*24*60)['Value'].mean())
.groupby(df['Time'].dt.date)['rolling_5d_avg']
.last()
)
>>> df.head(10)
Time
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 5.786603
2018-01-06 5.784011
2018-01-07 5.790133
2018-01-08 5.786967
2018-01-09 5.789944
2018-01-10 5.789299
Name: rolling_5d_avg, dtype: float64
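If the real data has missing minutes, a fixed window of 7200 rows no longer covers exactly five days. One possible variation, sketched here on a shorter synthetic series, is a time-based window ('5D') on a DatetimeIndex. This is a different technique from the row-count window above, and the values are made up so the averages can be checked by hand.

```python
import numpy as np
import pandas as pd

# Seven days of minute data whose values simply count upward from 0.
time_idx = pd.date_range(start='2018-01-01', end='2018-01-07 23:59', freq='min')
values = pd.Series(np.arange(len(time_idx), dtype=float), index=time_idx)

# Time-based window: each point averages everything in the preceding 5 days,
# covering the interval (t - 5 days, t].
rolling_5d = values.rolling('5D').mean()

# Keep the last value of each day; the first four days have incomplete
# windows, so they are dropped.
daily = rolling_5d.resample('D').last().iloc[4:]
print(daily)
```

With these counting values the three daily results are 3599.5, 5039.5 and 6479.5, i.e. the mean of the 7200 minute values ending on each day.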

How to call previous two week same day value in python

I am trying to fetch the data for the same weekday in previous weeks and then take the average of the value ("current_demand") as today's forecast.
For example:
Today is Monday, so I want to fetch the data from the last two Mondays for the same time block and average ["current_demand"] to predict today's value.
Input Data:
current_demand Date Blockno weekday
18839 01-06-2018 1 4
18836 01-06-2018 2 4
12256 02-06-2018 1 5
12266 02-06-2018 2 5
17957 08-06-2018 1 4
17986 08-06-2018 2 4
18491 09-06-2018 1 5
18272 09-06-2018 2 5
Expecting result:
18398 15-06-2018 1 4
Something like that. I want to take the values for the same block and the same weekday from the previous two weeks and average them to get the next value.
I have tried something:
def forecast(DATA):
    df = DATA
    day = {0:'Monday',1:'Tuesday',2:'Wednesday',3:'Thursday',4:'Friday',5:'Saturday',6:'Sunday'}
    df.friday = day - timedelta(days=day.weekday() + 3)
    print df
forecast(DATA)
Please suggest something. Thank you in advance.
I like relativedelta for this kind of job:
import datetime
from dateutil.relativedelta import relativedelta
# Date exactly two weeks before today.
(datetime.datetime.today() + relativedelta(weeks=-2)).date()
Output:
datetime.date(2018, 7, 23)
Without the actual structure of your df, it's hard to provide a solution tailored to your needs.
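That said, assuming the df looks like the input data in the question, a rough sketch of the two-week average could look like the following. The forecast function, its parameters and the column handling are my own guesses at the structure, not a confirmed solution.

```python
import pandas as pd

df = pd.DataFrame({
    'current_demand': [18839, 18836, 12256, 12266, 17957, 17986, 18491, 18272],
    'Date': ['01-06-2018', '01-06-2018', '02-06-2018', '02-06-2018',
             '08-06-2018', '08-06-2018', '09-06-2018', '09-06-2018'],
    'Blockno': [1, 2, 1, 2, 1, 2, 1, 2],
})
# The question's dates are day-month-year.
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')


def forecast(df, target_date, weeks=2):
    """Average current_demand per block over the same weekday in previous weeks."""
    target_date = pd.Timestamp(target_date)
    # Same weekday 1, 2, ... weeks before the target date.
    past_dates = [target_date - pd.Timedelta(weeks=k) for k in range(1, weeks + 1)]
    history = df[df['Date'].isin(past_dates)]
    return history.groupby('Blockno')['current_demand'].mean()


prediction = forecast(df, '2018-06-15')
print(prediction)
```

For 15-06-2018 this averages the rows from 08-06-2018 and 01-06-2018, giving 18398 for block 1, which matches the expected result in the question.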
