I want to integrate the following dataframe, such that I have the integrated value for every hour. I have roughly a 10 s sampling rate, but if it is necessary to have an even time interval, I guess I can just use df.resample().
Timestamp Power [W]
2022-05-05 06:00:05+02:00 2.0
2022-05-05 06:00:15+02:00 1.2
2022-05-05 06:00:25+02:00 0.3
2022-05-05 06:00:35+02:00 4.3
2022-05-05 06:00:45+02:00 1.1
...
2022-05-06 20:59:19+02:00 1.4
2022-05-06 20:59:29+02:00 2.0
2022-05-06 20:59:39+02:00 4.1
2022-05-06 20:59:49+02:00 1.3
2022-05-06 20:59:59+02:00 0.8
So I want to be able to integrate over both hours and days, so my output could look like:
Timestamp Energy [Wh]
2022-05-05 07:00:00+02:00 some values
2022-05-05 08:00:00+02:00 .
2022-05-05 09:00:00+02:00 .
2022-05-05 10:00:00+02:00 .
2022-05-05 11:00:00+02:00
...
2022-05-06 20:00:00+02:00
2022-05-06 21:00:00+02:00
(hour 07:00 is to include values between 06:00-07:00, and so on...)
and
Timestamp Energy [Wh]
2022-05-05 .
2022-05-06 .
So how do I achieve this? I was thinking I could use scipy.integrate, but my outputs look a bit weird.
Thank you.
You could create a new column representing your Timestamp truncated to hours:
df['Timestamp_hour'] = df['Timestamp'].dt.floor('h')
Please note that in that case, rows with timestamps between 06:00 and 06:59 will be grouped into hour 6, not hour 7.
Then you can group your rows by your new column before applying your integration computation:
df_integrated_hour = (
    df
    .groupby('Timestamp_hour')
    .agg({
        'Power [W]': YOUR_INTEGRATION_FUNCTION
    })
    .rename(columns={'Power [W]': 'Energy [Wh]'})
    .reset_index()
)
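For example, one hedged sketch of what YOUR_INTEGRATION_FUNCTION could be is the trapezoidal rule; here it is applied per group with groupby().apply() instead of agg() so the timestamps remain available inside the function (the sample data below is fabricated for illustration):

```python
import numpy as np
import pandas as pd

# fabricated sample data at an assumed 10 s cadence, two hours long
times = pd.date_range("2022-05-05 06:00:05+02:00", periods=720, freq="10s")
df = pd.DataFrame({"Timestamp": times, "Power [W]": np.random.rand(len(times)) * 5})
df["Timestamp_hour"] = df["Timestamp"].dt.floor("h")

def trapz_wh(group):
    # trapezoidal rule with x in hours, so the result is in Wh
    t = (group["Timestamp"] - group["Timestamp"].iloc[0]).dt.total_seconds() / 3600
    y = group["Power [W]"].to_numpy()
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(t))

energy = (
    df.groupby("Timestamp_hour")[["Timestamp", "Power [W]"]]
    .apply(trapz_wh)
    .rename("Energy [Wh]")
    .reset_index()
)
print(energy)
```

With random power between 0 and 5 W, each hourly value lands around 2.5 Wh.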
Hope this helps.
Here's a very simple solution using rectangle integration, with rectangles spaced at 10-second intervals starting at zero and therefore NOT centered exactly on the data points (assuming that the data is delivered at regular intervals and no data is missing); in other words, a simple average.
from numpy import random
import pandas as pd
times = pd.date_range('2022-05-05 06:00:04+02:00', '2022-05-06 21:00:00+02:00', freq='10S')
watts = random.rand(len(times)) * 5
df = pd.DataFrame(index=times, data=watts, columns=["Power [W]"])
hourly = df.groupby([df.index.date, df.index.hour]).mean()
hourly.columns = ["Energy [Wh]"]
print(hourly)
hours_in_a_day = 24 # add special casing for leap days here, if required
daily = df.groupby(df.index.date).mean() * hours_in_a_day
daily.columns = ["Energy [Wh]"]
print(daily)
Output:
Energy [Wh]
2022-05-05 6 2.625499
7 2.365678
8 2.579349
9 2.569170
10 2.543611
11 2.742332
12 2.478145
13 2.444210
14 2.507821
15 2.485770
16 2.414057
17 2.567755
18 2.393725
19 2.609375
20 2.525746
21 2.421578
22 2.520466
23 2.653466
2022-05-06 0 2.559110
1 2.519032
2 2.472282
3 2.436023
4 2.378289
5 2.549572
6 2.558478
7 2.470721
8 2.429454
9 2.390543
10 2.538194
11 2.537564
12 2.492308
13 2.387632
14 2.435582
15 2.581616
16 2.389549
17 2.461523
18 2.576084
19 2.523577
20 2.572270
Energy [Wh]
2022-05-05 60.597007
2022-05-06 59.725029
Trapezoidal integration should give a slightly better approximation but it's harder to implement right. You'd have to deal carefully with the hour boundaries. That's basically just a matter of inserting interpolated values twice at the full hour (at 09:59:59.999 and 10:00:00). But then you'd also have to figure out a way to extrapolate to the start and end of the range, i.e. in your example go from 06:00:05 to 06:00:00. But careful, what to do if your measurements only start somewhere in the middle like 06:17:23?
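A rough sketch of that boundary handling, on fabricated data, could look like the following. It interpolates extra samples at each full hour, counts them toward both adjacent hours, and trapezoid-integrates hour by hour; the edges are crudely extrapolated by just holding the nearest measured value, which sidesteps (rather than solves) the 06:17:23 problem mentioned above:

```python
import numpy as np
import pandas as pd

# fabricated sample data at an assumed 10 s cadence, two hours long
times = pd.date_range("2022-05-05 06:00:05+02:00", periods=720, freq="10s")
power = pd.Series(np.random.rand(len(times)) * 5, index=times)

# add exact hour marks to the index, interpolate the power at those marks,
# and hold the first/last value at the ragged edges
hours = pd.date_range(times[0].floor("h"), times[-1].ceil("h"), freq="h")
power = power.reindex(power.index.union(hours)).interpolate(method="time").bfill().ffill()

energy = {}
for start, end in zip(hours[:-1], hours[1:]):
    chunk = power.loc[start:end]  # .loc slicing is inclusive, so hour marks count twice
    t = (chunk.index - start).total_seconds() / 3600  # hours since the start of the bin
    y = chunk.to_numpy()
    energy[start] = np.sum((y[1:] + y[:-1]) / 2 * np.diff(t))
energy = pd.Series(energy, name="Energy [Wh]")
print(energy)
```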
This solution uses a package called staircase, which is part of the pandas ecosystem and exists to make working with step functions (i.e. piecewise constant) easier.
It will create a Stairs object (which represents a step function) from a pandas.Series, then bin across arbitrary DatetimeIndex values, then integrate.
This solution requires staircase 2.4.2 or above
setup
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(
            [
                "2022-05-05 06:00:05+02:00",
                "2022-05-05 06:00:15+02:00",
                "2022-05-05 06:00:25+02:00",
                "2022-05-05 06:00:35+02:00",
                "2022-05-05 06:00:45+02:00",
            ]
        ),
        "Power [W]": [2.0, 1.2, 0.3, 4.3, 1.1],
    }
)
solution
import staircase as sc

# create step function
sf = sc.Stairs.from_values(
    initial_value=0,
    values=df.set_index("Timestamp")["Power [W]"],
)

# optional: plot
sf.plot(style="hlines")

# create the bins (datetime index) over which you want to integrate
# using 20s intervals in this example
bins = pd.date_range(
    "2022-05-05 06:00:00+02:00", "2022-05-05 06:01:00+02:00", freq="20s"
)

# slice into bins and integrate
result = sf.slice(bins).integral()
result will be a pandas.Series with an IntervalIndex and Timedelta values. The IntervalIndex retains timezone info, it just doesn't display it:
[2022-05-05 06:00:00, 2022-05-05 06:00:20) 0 days 00:00:26
[2022-05-05 06:00:20, 2022-05-05 06:00:40) 0 days 00:00:30.500000
[2022-05-05 06:00:40, 2022-05-05 06:01:00) 0 days 00:00:38
dtype: timedelta64[ns]
You can change the index to be the "left" values (and see this timezone info) like this:
result.index = result.index.left
You can change values to a float with division by an appropriate Timedelta. Eg to convert to minutes:
result/pd.Timedelta("1min")
note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
I am trying to calculate the regression coefficient of weight for every animal_id and cycle_nr in my df:
animal_id  cycle_nr  feed_date   weight
1003       8         2020-02-06  221
1003       8         2020-02-10  226
1003       8         2020-02-14  230
1004       1         2020-02-20  231
1004       1         2020-02-21  243
What I tried, using this source:
import pandas as pd
import statsmodels.api as sm
def GroupRegress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params
result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'feed_date', ['weight'])
This code fails because my variable includes a date.
What I tried next:
I figured I could create a numeric column to use instead of my date column. I created a simple count_id column:
animal_id  cycle_nr  feed_date   weight  id
1003       8         2020-02-06  221     1
1003       8         2020-02-10  226     2
1003       8         2020-02-14  230     3
1004       1         2020-02-20  231     4
1004       1         2020-02-21  243     5
Then I ran my regression on this column
result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'id', ['weight'])
The slope calculation looks good, but the intercept of course makes no sense.
Then I realized that this method is only usable when the interval between measurements is regular. In most cases the interval is 7 days, but sometimes it is 10, 14 or 21 days.
I dropped records where the interval was not 7 days and re-ran my regression... It works, but I hate that I have to throw away perfectly fine data.
I'm wondering if there is a better approach where I can either include the date in my regression or can correct for the varying intervals of my dates. Any suggestions?
I'm wondering if there is a better approach where I can either include the date in my regression or can correct for the varying intervals of my dates.
If the feed dates are strings make a datetime Series using pandas.to_datetime.
Use that new Series to calculate the actual time difference between feedings
Use the resultant Timedeltas in your regression instead of a linearly fabricated sequence. The Timedeltas have different attributes (e.g. microseconds, days) that can be used depending on the resolution you need.
My first instinct would be to produce the Timedeltas for each group separately. The first feeding in each group would of course be time zero.
Making the Timedeltas may not even be necessary - there are probably datetime aware regression methods in Numpy or Scipy or maybe even Pandas - I imagine there would have to be, it is a common enough application.
Instead of Timedeltas the datetime Series could be converted to ordinal values for use in the regression.
df = pd.DataFrame(
    {
        "feed_date": [
            "2020-02-06",
            "2020-02-10",
            "2020-02-14",
            "2020-02-20",
            "2020-02-21",
        ]
    }
)
>>> q = pd.to_datetime(df.feed_date)
>>> q
0 2020-02-06
1 2020-02-10
2 2020-02-14
3 2020-02-20
4 2020-02-21
Name: feed_date, dtype: datetime64[ns]
>>> q.apply(pd.Timestamp.toordinal)
0 737461
1 737465
2 737469
3 737475
4 737476
Name: feed_date, dtype: int64
>>>
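Putting the pieces together, here is a hedged sketch of the per-group regression with days since the first feeding as the regressor. It reconstructs the sample data above and uses np.polyfit instead of statsmodels to stay self-contained; the slope comes out the same either way:

```python
import numpy as np
import pandas as pd

# sample data reconstructed from the question
df = pd.DataFrame({
    "animal_id": [1003, 1003, 1003, 1004, 1004],
    "cycle_nr":  [8, 8, 8, 1, 1],
    "feed_date": pd.to_datetime(
        ["2020-02-06", "2020-02-10", "2020-02-14", "2020-02-20", "2020-02-21"]),
    "weight":    [221, 226, 230, 231, 243],
})

def fit(group):
    # time zero is the first feeding in each group; irregular gaps are fine
    days = (group["feed_date"] - group["feed_date"].iloc[0]).dt.days
    slope, intercept = np.polyfit(days, group["weight"], 1)
    return pd.Series({"slope": slope, "intercept": intercept})

result = df.groupby(["animal_id", "cycle_nr"])[["feed_date", "weight"]].apply(fit)
print(result)
```

Because the regressor is actual elapsed days, records with 10, 14 or 21 day gaps no longer have to be dropped.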
So I have some sea surface temperature anomaly data. These data have been filtered down so that these are the values that are below a certain threshold. However, I am trying to identify cold spells - that is, to isolate events that last longer than 5 consecutive days. A sample of my data is below (I've been working between xarray datasets/dataarrays and pandas dataframes). Note, the 'day' is the day number of the month I am looking at (eventually will be expanded to the whole year). I have been scouring SO/the internet for ways to extract these 5-day-or-longer events based on the 'day' column, but I haven't gotten anything to work. I'm still relatively new to coding so my first thought was looping over the rows of the 'day' column but I'm not sure. Any insight is appreciated.
Here's what some of my data look like as a pandas df:
lat lon time day ssta
5940 24.125 262.375 1984-06-03 3 -1.233751
21072 24.125 262.375 1984-06-04 4 -1.394495
19752 24.125 262.375 1984-06-05 5 -1.379742
10223 24.125 262.375 1984-06-27 27 -1.276407
47355 24.125 262.375 1984-06-28 28 -1.840763
... ... ... ... ... ...
16738 30.875 278.875 2015-06-30 30 -1.345640
3739 30.875 278.875 2020-06-16 16 -1.212824
25335 30.875 278.875 2020-06-17 17 -1.446407
41891 30.875 278.875 2021-06-01 1 -1.714249
27740 30.875 278.875 2021-06-03 3 -1.477497
64228 rows × 5 columns
As a filtered xarray:
xarray.Dataset
Dimensions: lat: 28, lon: 68, time: 1174
Coordinates:
time (time) datetime64[ns] 1982-06-01 ... 2021-06-04
lon (lon) float32 262.1 262.4 262.6 ... 278.6 278.9
lat (lat) float32 24.12 24.38 24.62 ... 30.62 30.88
day (time) int64 1 2 3 4 5 6 7 ... 28 29 30 1 2 3 4
Data variables:
ssta (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (0)
TL;DR: I want to identify (and retain the information of) events that are 5+ consecutive days, i.e. if there were a day 3 through day 8, or a day 21 through day 30, etc.
I think rather than filtering your original data you should try to do it the pandas way, which in this case means obtaining a series with True/False values depending on your condition.
Your data seems not to include temperatures so here is my example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'temp':np.random.randint(10,high=40,size=64228,dtype='int64')})
This will generate a DataFrame with a single column containing random temperatures between 10 and 40 degrees. Notice that I just work with the auto-generated index, but you might have to switch it to a column like time or date using .set_index. Say we are interested in consecutive days with more than 30 degrees.
is_over_30 = df['temp'] > 30
will give us a True/False array with that information. Notice that this format is very useful, since we can index with it. E.g. df[is_over_30] will give us the rows of the dataframe for days where the temperature is over 30 deg. Now we want to shift the True/False values in is_over_30 one spot forward and generate a new series that is True only if both are True, like so:
is_over_30 & np.roll(is_over_30, -1)
Basically we are done here and could write 3 more of those & rolls, but there is a more concise way to write it.
from functools import reduce
is_consecutively_over_30 = reduce(lambda a,b: a&b, [np.roll(is_over_30, -i) for i in range(5)])
Keep in mind that even though the last 4 days can't be consecutively over 30 deg, this might still happen here, since roll shifts the first values into the positions relevant for that. But you can just set the last 4 values to False to resolve this.
is_consecutively_over_30[-4:] = False
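For comparison, here is a hedged alternative sketch of the same test using a rolling window instead of shifted copies; note that here True marks the last day of a qualifying run rather than the first, and the sample temperatures are fabricated:

```python
import numpy as np
import pandas as pd

# fabricated sample: two runs of 5+ days over 30 deg, broken by one cool day
temps = pd.Series([31, 32, 33, 34, 35, 20, 31, 32, 31, 33, 34, 36])
is_over_30 = temps > 30

# a 5-long window is entirely True exactly when its sum is 5;
# True marks the *last* day of each qualifying 5-day stretch
run_end = is_over_30.rolling(5).sum() == 5
print(run_end[run_end].index.tolist())  # -> [4, 10, 11]
```

This also avoids the wrap-around caveat of np.roll, since rolling windows that extend before the start of the series are simply NaN.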
You can pull the day ranges of the spells using this approach:
import numpy as np
import pandas as pd

min_spell_days = 6
days = {'day': [1,2,5,6,7,8,9,10,17,19,21,22,23,24,25,26,27,31]}
df = pd.DataFrame(days)
Find number of days between consecutive entries:
diff = df['day'].diff()
Mark the last day of a spell:
df['last'] = (diff == 1) & (diff.shift(-1) > 1)
Accumulate the number of days in each spell:
df['diff0'] = np.where(diff > 1, 0, diff)
df['cs'] = df['diff0'].eq(0).cumsum()
df['spell_days'] = df.groupby('cs')['diff0'].transform('cumsum')
Mark the last entry as the last day of a spell if applicable:
if diff.iat[-1] == 1:
    df['last'].iat[-1] = True
Select the last day of all qualifying spells:
df_spells = (df[df['last'] & (df['spell_days'] >= (min_spell_days-1))]).copy()
Identify the start, end and duration of each spell:
df_spells['end_day'] = df_spells['day']
df_spells['start_day'] = (df_spells['day'] - df['spell_days'])
df_spells['spell_days'] = df['spell_days'] + 1
Resulting df:
df_spells[['start_day','end_day','spell_days']].astype('int')
start_day end_day spell_days
7 5 10 6
16 21 27 7
Also, using date arithmetic, 'day' could represent a serial day number relative to some base date, like 1/1/1900. That way spells that span month and year boundaries could be handled, and it would then be trivial to convert back to a date from that serial number.
I am working on a large datasets that looks like this:
Time, Value
01.01.2018 00:00:00.000, 5.1398
01.01.2018 00:01:00.000, 5.1298
01.01.2018 00:02:00.000, 5.1438
01.01.2018 00:03:00.000, 5.1228
01.01.2018 00:04:00.000, 5.1168
.... , ,,,,
31.12.2018 23:59:59.000, 6.3498
The data is minute data from the first day of the year to the last day of the year.
I want to use Pandas to find the average of every 5 days.
For example:
Average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is average for 05.01.2018
The next average will be from 02.01.2018 00:00:00.000 to 6.01.2018 23:59:59.000 is average for 06.01.2018
The next average will be from 03.01.2018 00:00:00.000 to 7.01.2018 23:59:59.000 is average for 07.01.2018
and so on... We increment the day by 1, but calculate the average over the past 5 days, including the current date.
For a given day there are 24 hours * 60 minutes = 1440 data points, so I need the average of 1440 data points * 5 days = 7200 data points.
The final DataFrame will look like this, time format [DD.MM.YYYY] (without hh:mm:ss) and the Value is the average of 5 data including the current date:
Time, Value
05.01.2018, 5.1398
06.01.2018, 5.1298
07.01.2018, 5.1438
.... , ,,,,
31.12.2018, 6.3498
The bottom line is to calculate the average of data from today to the past 5 days and the average value is shown as above.
I tried iterating with a Python loop, but I wanted something better that can be done with Pandas.
Perhaps this will work?
import numpy as np
import pandas as pd
# Create one year of random data spaced evenly in 1 minute intervals.
np.random.seed(0) # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})
>>> df.shape
(524161, 2)
Given the dataframe with 1 minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the original timestamps using the dt accessor method to grab the date, and then take the last rolling_5d_avg value for each date.
df = (
df
.assign(rolling_5d_avg=df.rolling(window=5*24*60)['Value'].mean())
.groupby(df['Time'].dt.date)['rolling_5d_avg']
.last()
)
>>> df.head(10)
Time
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 5.786603
2018-01-06 5.784011
2018-01-07 5.790133
2018-01-08 5.786967
2018-01-09 5.789944
2018-01-10 5.789299
Name: rolling_5d_avg, dtype: float64
I am having trouble illustrating my problem without complicating things with the form the data is in. So bear with me: the following screenshot is for explaining the problem only (i.e. the data is not in this form):
I would like to identify the past 14 days with a number > 0 across all bins (aka the total row has a value greater than 0). This would include all days except for days 5 and 12 (highlighted in red). I would then like to sum across bins horizontally for those 14 days (aka sum all days expect for 5 and 12, by bin), with the goal of ultimately calculating a 14 day average by Bin number.
Note the example above would be for one “Lane”, where my data has > 10,000. The example also only illustrates today being day 16. But I would like to apply this logic to every day in the data set. I.e. on day 20 (along with any other date), it would look at the last 14 days with a value across all bins, then use that data range to aggregate across Bin. This is a screenshot sample of how the data looks:
A simple example using the data as it is structured, with only 3 Bins, 1 Lane, and a 3 data point/date look back:
Lane Date Bin KG
AMS-ORD 2018-08-26 3 10
AMS-ORD 2018-08-29 1 25
AMS-ORD 2018-08-30 2 30
AMS-ORD 2018-09-03 2 20
AMS-ORD 2018-09-04 1 40
Note KG here is a sum. Again this is for one day (aka today), but I would like every date in my data set to follow the same logic. The output would look like the following:
Lane Date Bin KG Average
AMS-ORD 2018-09-04 1 40 13.33
AMS-ORD 2018-09-04 2 50 16.67
AMS-ORD 2018-09-04 3 0 -
I have messed around with .rolling(14).mean(), .tail(), and some others. The problem I have is specifying the correct date range for the correct Bin aggregation.
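Not a full solution, but one hedged sketch of the look-back logic from the 3-date example above, under the assumption that "the last N dates with data" means the N most recent distinct dates on or before the day in question:

```python
import pandas as pd

# sample data reconstructed from the question (1 lane, 3 bins)
df = pd.DataFrame({
    "Lane": ["AMS-ORD"] * 5,
    "Date": pd.to_datetime(
        ["2018-08-26", "2018-08-29", "2018-08-30", "2018-09-03", "2018-09-04"]),
    "Bin": [3, 1, 2, 2, 1],
    "KG": [10, 25, 30, 20, 40],
})

lookback = 3  # 3-date look back for the example; the real data would use 14

def bin_average(group, asof):
    # the last `lookback` distinct dates with data on or before `asof`
    dates = sorted(group.loc[group["Date"] <= asof, "Date"].unique())[-lookback:]
    window = group[group["Date"].isin(dates)]
    # sum KG per bin over the window, then divide by the number of dates
    return (window.groupby("Bin")["KG"].sum() / len(dates)).round(2)

avg = bin_average(df[df["Lane"] == "AMS-ORD"], pd.Timestamp("2018-09-04"))
print(avg)  # Bin 1 -> 13.33, Bin 2 -> 16.67
```

Bins with no data in the window (Bin 3 here) simply drop out; they could be reindexed to 0 if needed. Applying this to every date and lane would mean calling it per group, which may be slow at 10,000+ lanes.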
I have a pandas data frame which looks like the one below, with 30 days in each month. Now I would like to convert this data frame to the regular (Julian day) calendar, put NA in the days with missing dates (e.g. 1/31/2001: NA, and so on), and interpolate later. Can anyone suggest an option to handle this in pandas?
Date X
1/1/2001 30.56787109
1/2/2001 29.57751465
1/3/2001 30.38424683
1/4/2001 28.64764404
1/5/2001 27.54763794
......
......
1/29/2001 27.44857788
1/30/2001 27.16296387
2/1/2001 28.02816772
2/2/2001 28.28137207
2/3/2001 28.38671875
.......
.......
02/29/2001 32.23730469
02/30/2001 32.56161499
3/1/2001 31.38146973
3/2/2001 30.73623657
3/3/2001 30.81912231
......
3/28/2001 33.7562561
3/29/2001 34.46350098
3/30/2001 33.49130249
4/1/2001 30.91223145
4/2/2001 30.94335938
.....
4/30/2001 30.02526855
......
......
12/29/2001 27.44161987
12/30/2001 28.43597412
So, I'm assuming that your Date column is just a string and is not an index. And I'm also replacing X with an integer value to make it easier to track what's happening to it. So first, convert to datetime, and set as index.
>>> df.Date=pd.to_datetime(df.Date,errors='coerce')
>>> df = df.set_index('Date')
2001-02-27 10
2001-02-28 11
NaT 12
NaT 13
2001-03-01 14
2001-03-02 15
So that uses python/pandas built in time awareness to identify invalid dates (Feb 29 in a non-leap year and Feb 30 in any year).
Then you can just resample to get the index onto a valid calendar. You also have some fill options (besides the default NaN) with resample or you can interpolate later on.
>>> df = df.resample('d').asfreq()
2001-01-29 3
2001-01-30 4
2001-01-31 NaN
2001-02-01 5
2001-02-02 6
...
2001-02-27 10
2001-02-28 11
2001-03-01 14
2001-03-02 15
First, set the column as a pandas.DatetimeIndex and then use the to_julian_date() method. You can then use the interpolate() method to fill the in-between dates that are missing.
Source:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.to_julian_date.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html
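A minimal sketch combining the two linked methods, with assumed sample values around the 1/31 gap:

```python
import pandas as pd

# Julian day numbers from a DatetimeIndex
idx = pd.DatetimeIndex(["2001-01-30", "2001-02-01"])
jd = idx.to_julian_date()  # fractional Julian day numbers
print(jd)

# put the series onto a full daily calendar, then fill the gap linearly
s = pd.Series([27.16, 28.03], index=idx)
s = s.asfreq("D").interpolate()  # inserts 2001-01-31 and interpolates it
print(s)
```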