Slicing window on a pandas DataFrame - Python

I have a pandas DataFrame with time-series data at 1-minute intervals. Is there a pythonic way to slice my data into 15-minute windows, like this?
import numpy as np
import pandas as pd

a = pd.DataFrame(index=pd.date_range('2017-01-01 00:04', '2017-01-01 01:04', freq='1T'))
a['data'] = np.arange(61)

for i in range(0, len(a), 15):
    print(a[i:i+15])
Is there any built in function for this in pandas?

IIUC, use groupby with pd.Grouper and freq='15min':
for _, g in a.groupby(pd.Grouper(freq='15min')):
    print(g)
You can also do:
groups = a.groupby(pd.Grouper(freq='15min'))
list(groups)
Output:
data
2017-01-01 00:04:00 0
2017-01-01 00:05:00 1
2017-01-01 00:06:00 2
2017-01-01 00:07:00 3
2017-01-01 00:08:00 4
2017-01-01 00:09:00 5
2017-01-01 00:10:00 6
2017-01-01 00:11:00 7
2017-01-01 00:12:00 8
2017-01-01 00:13:00 9
2017-01-01 00:14:00 10
data
2017-01-01 00:15:00 11
2017-01-01 00:16:00 12
2017-01-01 00:17:00 13
2017-01-01 00:18:00 14
2017-01-01 00:19:00 15
2017-01-01 00:20:00 16
2017-01-01 00:21:00 17
2017-01-01 00:22:00 18
2017-01-01 00:23:00 19
2017-01-01 00:24:00 20
2017-01-01 00:25:00 21
2017-01-01 00:26:00 22
2017-01-01 00:27:00 23
2017-01-01 00:28:00 24
2017-01-01 00:29:00 25
data
2017-01-01 00:30:00 26
2017-01-01 00:31:00 27
2017-01-01 00:32:00 28
2017-01-01 00:33:00 29
2017-01-01 00:34:00 30
2017-01-01 00:35:00 31
2017-01-01 00:36:00 32
2017-01-01 00:37:00 33
2017-01-01 00:38:00 34
2017-01-01 00:39:00 35
2017-01-01 00:40:00 36
2017-01-01 00:41:00 37
2017-01-01 00:42:00 38
2017-01-01 00:43:00 39
2017-01-01 00:44:00 40
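For completeness, a Resampler over a DatetimeIndex is iterable too and should produce the same clock-aligned 15-minute bins; a minimal sketch:
import numpy as np
import pandas as pd

a = pd.DataFrame(index=pd.date_range('2017-01-01 00:04', '2017-01-01 01:04', freq='1T'))
a['data'] = np.arange(61)

# resample() yields (bin label, sub-frame) pairs, just like the Grouper-based groupby
for label, g in a.resample('15min'):
    print(label, len(g))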

Related

Resampling Hourly Data into Half Hourly in Pandas

I have the following DataFrame called prices:
DateTime PriceAmountGBP
0 2022-03-27 23:00:00 202.807890
1 2022-03-28 00:00:00 197.724150
2 2022-03-28 01:00:00 191.615328
3 2022-03-28 02:00:00 188.798436
4 2022-03-28 03:00:00 187.706682
... ... ...
19 2023-01-24 18:00:00 216.915400
20 2023-01-24 19:00:00 197.050516
21 2023-01-24 20:00:00 168.227992
22 2023-01-24 21:00:00 158.954200
23 2023-01-24 22:00:00 149.039322
I'm trying to resample prices to show Half Hourly data instead of Hourly, with PriceAmountGBP repeating on the half hour, desired output below:
DateTime PriceAmountGBP
0 2022-03-27 23:00:00 202.807890
1 2022-03-27 23:30:00 202.807890
2 2022-03-28 00:00:00 197.724150
3 2022-03-28 00:30:00 197.724150
4 2022-03-28 01:00:00 191.615328
... ... ...
19 2023-01-24 18:00:00 216.915400
20 2023-01-24 18:30:00 216.915400
21 2023-01-24 19:00:00 197.050516
22 2023-01-24 19:30:00 197.050516
23 2023-01-24 20:00:00 168.227992
I've attempted the below, which is incorrect:
prices.set_index('DateTime').resample('30T').interpolate()
Output:
PriceAmountGBP
DateTime
2022-03-27 23:00:00 202.807890
2022-03-27 23:30:00 200.266020
2022-03-28 00:00:00 197.724150
2022-03-28 00:30:00 194.669739
2022-03-28 01:00:00 191.615328
... ...
2023-01-24 20:00:00 168.227992
2023-01-24 20:30:00 163.591096
2023-01-24 21:00:00 158.954200
2023-01-24 21:30:00 153.996761
2023-01-24 22:00:00 149.039322
Any help appreciated!
You want to resample without any transformation, and then do a so-called "forward fill" of the resulting null values.
That's:
result = (
    prices.set_index('DateTime')
          .resample('30T')
          .asfreq()  # no transformation
          .ffill()   # drag previous values down
)
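A minimal runnable sketch, using a three-row toy frame (values copied from the question) to check the behaviour; chain .reset_index() if you want DateTime back as a column:
import pandas as pd

prices = pd.DataFrame({
    'DateTime': pd.date_range('2022-03-27 23:00', periods=3, freq='H'),
    'PriceAmountGBP': [202.807890, 197.724150, 191.615328],
})

result = (
    prices.set_index('DateTime')
          .resample('30T')
          .asfreq()        # insert the half-hour rows as NaN, no aggregation
          .ffill()         # repeat each hourly price on the half hour
          .reset_index()   # restore DateTime as a column
)
print(result)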

Slicing Pandas based on threshold and timestamp before and after

I have a dataframe that looks as follows:
Timestamp (Index) Status value
2017-01-01 12:01:00 OPEN 83
2017-01-01 12:02:00 OPEN 82
2017-01-01 12:03:00 OPEN 87
2017-01-01 12:04:00 CLOSE 82
2017-01-01 12:05:00 CLOSE 81
2017-01-01 12:06:00 CLOSE 81
2017-01-01 12:07:00 CLOSE 81
2017-01-01 12:08:00 CLOSE 81
2017-01-01 12:09:00 CLOSE 81
2017-01-01 12:10:00 CLOSE 81
2017-01-01 12:11:00 CLOSE 81
2017-01-01 12:12:00 OPEN 81
2017-01-01 12:13:00 OPEN 81
2017-01-01 12:14:00 OPEN 81
2017-01-01 12:15:00 OPEN 81
2017-01-01 12:16:00 CLEAR 34
2017-01-01 12:17:00 CLOSE 23
2017-01-01 12:18:00 CLOSE 23
2017-01-01 12:19:00 CLOSE 75
2017-01-01 12:20:00 CLOSE 65
2017-01-01 12:21:00 CLOSE 72
2017-01-01 12:22:00 CLOSE 76
2017-01-01 12:23:00 CLOSE 77
2017-01-01 12:24:00 OPEN 87
2017-01-01 12:25:00 OPEN 87
2017-01-01 12:26:00 OPEN 87
2017-01-01 12:27:00 OPEN 87
2017-01-01 12:28:00 OPEN 87
2017-01-01 12:29:00 CLOSE 75
2017-01-01 12:30:00 CLOSE 75
2017-01-01 12:31:00 CLOSE 75
Whenever the first of the consecutive CLOSE values is below 70, I want to delete the OPEN block that comes before it along with that CLOSE block. It should look like this:
Timestamp (Index) Status value
2017-01-01 12:01:00 OPEN 83
2017-01-01 12:02:00 OPEN 82
2017-01-01 12:03:00 OPEN 87
2017-01-01 12:04:00 CLOSE 82
2017-01-01 12:05:00 CLOSE 81
2017-01-01 12:06:00 CLOSE 81
2017-01-01 12:07:00 CLOSE 81
2017-01-01 12:08:00 CLOSE 81
2017-01-01 12:09:00 CLOSE 81
2017-01-01 12:10:00 CLOSE 81
2017-01-01 12:11:00 CLOSE 81
2017-01-01 12:24:00 OPEN 87
2017-01-01 12:25:00 OPEN 87
2017-01-01 12:26:00 OPEN 87
2017-01-01 12:27:00 OPEN 87
2017-01-01 12:28:00 OPEN 87
2017-01-01 12:29:00 CLOSE 75
2017-01-01 12:30:00 CLOSE 75
2017-01-01 12:31:00 CLOSE 75
Any idea on how I can get hold of the relevant Timestamps in order to remove those periods?
Try:
df[df.groupby((df.status.shift().bfill().ne(df.status)
               & df.status.eq('OPEN')).cumsum())
   .transform('min').value.ge(70)]
Result:
status value
timestamp
2017-01-01 12:01:00 OPEN 83
2017-01-01 12:02:00 OPEN 82
2017-01-01 12:03:00 OPEN 87
2017-01-01 12:04:00 CLOSE 82
2017-01-01 12:05:00 CLOSE 81
2017-01-01 12:06:00 CLOSE 81
2017-01-01 12:07:00 CLOSE 81
2017-01-01 12:08:00 CLOSE 81
2017-01-01 12:09:00 CLOSE 81
2017-01-01 12:10:00 CLOSE 81
2017-01-01 12:11:00 CLOSE 81
2017-01-01 12:24:00 OPEN 87
2017-01-01 12:25:00 OPEN 87
2017-01-01 12:26:00 OPEN 87
2017-01-01 12:27:00 OPEN 87
2017-01-01 12:28:00 OPEN 87
2017-01-01 12:29:00 CLOSE 75
2017-01-01 12:30:00 CLOSE 75
2017-01-01 12:31:00 CLOSE 75
The method is to start a new group wherever the status differs from the previous status and the new status is 'OPEN', then keep only the rows of groups whose minimum value is greater than or equal to 70.
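For readability, here is a step-by-step equivalent of the one-liner (column names assumed lowercase, matching the printed result):
starts_open_block = df.status.shift().bfill().ne(df.status) & df.status.eq('OPEN')
block_id = starts_open_block.cumsum()                    # one id per OPEN-then-CLOSE block
block_min = df.groupby(block_id).value.transform('min')  # minimum value within each block
result = df[block_min.ge(70)]                            # keep blocks whose minimum is >= 70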

How do I resample a pandas Series using values around the hour

I have time-series data recorded at 10-minute frequency. I want to average the values at one-hour intervals, but each average should use the 3 values before the hour, the value on the hour, and the 2 values after it, with the result assigned to the exact hour timestamp.
For example, I have the series:
index = pd.date_range('2000-01-01T00:30:00', periods=63, freq='10min')
series = pd.Series(range(63), index=index)
series
2000-01-01 00:30:00 0
2000-01-01 00:40:00 1
2000-01-01 00:50:00 2
2000-01-01 01:00:00 3
2000-01-01 01:10:00 4
2000-01-01 01:20:00 5
2000-01-01 01:30:00 6
2000-01-01 01:40:00 7
2000-01-01 01:50:00 8
2000-01-01 02:00:00 9
2000-01-01 02:10:00 10
..
2000-01-01 08:50:00 50
2000-01-01 09:00:00 51
2000-01-01 09:10:00 52
2000-01-01 09:20:00 53
2000-01-01 09:30:00 54
2000-01-01 09:40:00 55
2000-01-01 09:50:00 56
2000-01-01 10:00:00 57
2000-01-01 10:10:00 58
2000-01-01 10:20:00 59
2000-01-01 10:30:00 60
2000-01-01 10:40:00 61
2000-01-01 10:50:00 62
Freq: 10T, Length: 63, dtype: int64
So, if I do
series.resample('1H').mean()
2000-01-01 00:00:00 1.0
2000-01-01 01:00:00 5.5
2000-01-01 02:00:00 11.5
2000-01-01 03:00:00 17.5
2000-01-01 04:00:00 23.5
2000-01-01 05:00:00 29.5
2000-01-01 06:00:00 35.5
2000-01-01 07:00:00 41.5
2000-01-01 08:00:00 47.5
2000-01-01 09:00:00 53.5
2000-01-01 10:00:00 59.5
Freq: H, dtype: float64
the first value is the average of 0, 1 and 2, assigned to hour 0; the second is the average of the values from 01:00:00 to 01:50:00, assigned to 01:00:00; and so on.
What I would like to have is the first average centered at 1:00:00 calculated using values from 00:30:00 through 01:20:00, the second centered at 02:00:00 calculated from 01:30:00 to 02:20:00 and so on...
What would be the best way to do that?
Thanks!
You should be able to do that with:
series.index = series.index - pd.Timedelta(30, unit='m')
series_grouped_mean = series.groupby(pd.Grouper(freq='60min')).mean()
series_grouped_mean.index = series_grouped_mean.index + pd.Timedelta(60, unit='m')
series_grouped_mean
I got:
2000-01-01 01:00:00 2.5
2000-01-01 02:00:00 8.5
2000-01-01 03:00:00 14.5
2000-01-01 04:00:00 20.5
2000-01-01 05:00:00 26.5
2000-01-01 06:00:00 32.5
2000-01-01 07:00:00 38.5
2000-01-01 08:00:00 44.5
2000-01-01 09:00:00 50.5
2000-01-01 10:00:00 56.5
2000-01-01 11:00:00 61.0
Freq: H, dtype: float64
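An equivalent that leaves the original index untouched, assuming pandas >= 1.1 for the offset argument of resample; the bins run [HH:30, HH+1:30), so each one covers the 3 values before the hour, the hour itself, and the 2 values after it:
import pandas as pd

index = pd.date_range('2000-01-01T00:30:00', periods=63, freq='10min')
series = pd.Series(range(63), index=index)

centered = series.resample('60min', offset='30min').mean()
# bin labels are the bin starts (HH:30); shift them to the centre hour
centered.index = centered.index + pd.Timedelta(minutes=30)
print(centered)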

Pandas - Sum of first X hours of datetime index

I have a dataframe with a datetime index and 100 columns.
I want to have a new dataframe with the same datetime index and columns, but the values would contain the sum of the first 10 hours of each day.
So if I had an original dataframe like this:
A B C
---------------------------------
2018-01-01 00:00:00 2 5 -10
2018-01-01 01:00:00 6 5 7
2018-01-01 02:00:00 7 5 9
2018-01-01 03:00:00 9 5 6
2018-01-01 04:00:00 10 5 2
2018-01-01 05:00:00 7 5 -1
2018-01-01 06:00:00 1 5 -1
2018-01-01 07:00:00 -4 5 10
2018-01-01 08:00:00 9 5 10
2018-01-01 09:00:00 21 5 -10
2018-01-01 10:00:00 2 5 -1
2018-01-01 11:00:00 8 5 -1
2018-01-01 12:00:00 8 5 10
2018-01-01 13:00:00 8 5 9
2018-01-01 14:00:00 7 5 -10
2018-01-01 15:00:00 7 5 5
2018-01-01 16:00:00 7 5 -10
2018-01-01 17:00:00 4 5 7
2018-01-01 18:00:00 5 5 8
2018-01-01 19:00:00 2 5 8
2018-01-01 20:00:00 2 5 4
2018-01-01 21:00:00 8 5 3
2018-01-01 22:00:00 1 5 3
2018-01-01 23:00:00 1 5 1
2018-01-02 00:00:00 2 5 2
2018-01-02 01:00:00 3 5 8
2018-01-02 02:00:00 4 5 6
2018-01-02 03:00:00 5 5 6
2018-01-02 04:00:00 1 5 7
2018-01-02 05:00:00 7 5 7
2018-01-02 06:00:00 5 5 1
2018-01-02 07:00:00 2 5 2
2018-01-02 08:00:00 4 5 3
2018-01-02 09:00:00 6 5 4
2018-01-02 10:00:00 9 5 4
2018-01-02 11:00:00 11 5 5
2018-01-02 12:00:00 2 5 8
2018-01-02 13:00:00 2 5 0
2018-01-02 14:00:00 4 5 5
2018-01-02 15:00:00 5 5 4
2018-01-02 16:00:00 7 5 4
2018-01-02 17:00:00 -1 5 7
2018-01-02 18:00:00 1 5 7
2018-01-02 19:00:00 1 5 7
2018-01-02 20:00:00 5 5 7
2018-01-02 21:00:00 2 5 7
2018-01-02 22:00:00 2 5 7
2018-01-02 23:00:00 8 5 7
So for all rows with date 2018-01-01:
The value for column A would be 68 (2+6+7+9+10+7+1-4+9+21)
The value for column B would be 50 (5+5+5+5+5+5+5+5+5+5)
The value for column C would be 22 (-10+7+9+6+2-1-1+10+10-10)
So for all rows with date 2018-01-02:
The value for column A would be 39 (2+3+4+5+1+7+5+2+4+6)
The value for column B would be 50 (5+5+5+5+5+5+5+5+5+5)
The value for column C would be 46 (2+8+6+6+7+7+1+2+3+4)
The outcome would be:
A B C
---------------------------------
2018-01-01 00:00:00 68 50 22
2018-01-01 01:00:00 68 50 22
2018-01-01 02:00:00 68 50 22
2018-01-01 03:00:00 68 50 22
2018-01-01 04:00:00 68 50 22
2018-01-01 05:00:00 68 50 22
2018-01-01 06:00:00 68 50 22
2018-01-01 07:00:00 68 50 22
2018-01-01 08:00:00 68 50 22
2018-01-01 09:00:00 68 50 22
2018-01-01 10:00:00 68 50 22
2018-01-01 11:00:00 68 50 22
2018-01-01 12:00:00 68 50 22
2018-01-01 13:00:00 68 50 22
2018-01-01 14:00:00 68 50 22
2018-01-01 15:00:00 68 50 22
2018-01-01 16:00:00 68 50 22
2018-01-01 17:00:00 68 50 22
2018-01-01 18:00:00 68 50 22
2018-01-01 19:00:00 68 50 22
2018-01-01 20:00:00 68 50 22
2018-01-01 21:00:00 68 50 22
2018-01-01 22:00:00 68 50 22
2018-01-01 23:00:00 68 50 22
2018-01-02 00:00:00 39 50 46
2018-01-02 01:00:00 39 50 46
2018-01-02 02:00:00 39 50 46
2018-01-02 03:00:00 39 50 46
2018-01-02 04:00:00 39 50 46
2018-01-02 05:00:00 39 50 46
2018-01-02 06:00:00 39 50 46
2018-01-02 07:00:00 39 50 46
2018-01-02 08:00:00 39 50 46
2018-01-02 09:00:00 39 50 46
2018-01-02 10:00:00 39 50 46
2018-01-02 11:00:00 39 50 46
2018-01-02 12:00:00 39 50 46
2018-01-02 13:00:00 39 50 46
2018-01-02 14:00:00 39 50 46
2018-01-02 15:00:00 39 50 46
2018-01-02 16:00:00 39 50 46
2018-01-02 17:00:00 39 50 46
2018-01-02 18:00:00 39 50 46
2018-01-02 19:00:00 39 50 46
2018-01-02 20:00:00 39 50 46
2018-01-02 21:00:00 39 50 46
2018-01-02 22:00:00 39 50 46
2018-01-02 23:00:00 39 50 46
I figured I'd group by date first and perform a sum and then merge the results based on the date. Is there a better/faster way to do this?
Thanks.
EDIT: I worked out this answer in the meantime:
df = df.between_time('0:00', '9:00').groupby(pd.Grouper(freq='D')).sum()
df = df.resample('1H').ffill()
You need to group by df.index.date and use transform with a lambda function to find the sum of the first 10 values, as follows:
df.loc[:,['A','B','C']] = df.groupby(df.index.date).transform(lambda x: x[:10].sum())
Or, if the column order is the same for both the grouped values and the original columns:
df.loc[:,:] = df.groupby(df.index.date).transform(lambda x: x[:10].sum())
print(df)
A B C
2018-01-01 00:00:00 68 50 22
2018-01-01 01:00:00 68 50 22
2018-01-01 02:00:00 68 50 22
2018-01-01 03:00:00 68 50 22
2018-01-01 04:00:00 68 50 22
2018-01-01 05:00:00 68 50 22
2018-01-01 06:00:00 68 50 22
2018-01-01 07:00:00 68 50 22
2018-01-01 08:00:00 68 50 22
2018-01-01 09:00:00 68 50 22
2018-01-01 10:00:00 68 50 22
2018-01-01 11:00:00 68 50 22
2018-01-01 12:00:00 68 50 22
2018-01-01 13:00:00 68 50 22
2018-01-01 14:00:00 68 50 22
2018-01-01 15:00:00 68 50 22
2018-01-01 16:00:00 68 50 22
2018-01-01 17:00:00 68 50 22
2018-01-01 18:00:00 68 50 22
2018-01-01 19:00:00 68 50 22
2018-01-01 20:00:00 68 50 22
2018-01-01 21:00:00 68 50 22
2018-01-01 22:00:00 68 50 22
2018-01-01 23:00:00 68 50 22
2018-01-02 00:00:00 39 50 46
2018-01-02 01:00:00 39 50 46
2018-01-02 02:00:00 39 50 46
2018-01-02 03:00:00 39 50 46
2018-01-02 04:00:00 39 50 46
2018-01-02 05:00:00 39 50 46
2018-01-02 06:00:00 39 50 46
2018-01-02 07:00:00 39 50 46
2018-01-02 08:00:00 39 50 46
2018-01-02 09:00:00 39 50 46
2018-01-02 10:00:00 39 50 46
2018-01-02 11:00:00 39 50 46
2018-01-02 12:00:00 39 50 46
2018-01-02 13:00:00 39 50 46
2018-01-02 14:00:00 39 50 46
2018-01-02 15:00:00 39 50 46
2018-01-02 16:00:00 39 50 46
2018-01-02 17:00:00 39 50 46
2018-01-02 18:00:00 39 50 46
2018-01-02 19:00:00 39 50 46
2018-01-02 20:00:00 39 50 46
2018-01-02 21:00:00 39 50 46
2018-01-02 22:00:00 39 50 46
2018-01-02 23:00:00 39 50 46
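As a possible faster variant that avoids a Python-level lambda per group, you could combine the asker's between_time idea with a reindex-based broadcast back to the hourly index (a sketch, not benchmarked):
import pandas as pd

# sum of the 10 rows from 00:00 through 09:00 inclusive, one row per day
daily_sums = df.between_time('00:00', '09:00').groupby(pd.Grouper(freq='D')).sum()

# broadcast each day's sums back onto every hourly row of that day
result = daily_sums.reindex(df.index.normalize()).set_axis(df.index)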

How can I efficiently convert hourly data into dates and times for every day of the year using Python pandas?

I have a pandas DataFrame with one value for every hour of a day, and I want to repeat that daily profile for every day of a year. I have written the 'naive' way to do it. Is there a more efficient way?
Naive way (that works correctly, but takes a lot of time):
dfConsoFrigo = pd.read_csv("../assets/datas/refregirateur.csv", sep=';')
dataframe = pd.DataFrame(columns=['Puissance'])

iterator = 0
for day in pd.date_range("01 Jan 2017 00:00", "31 Dec 2017 23:00", freq='1H'):
    iterator = iterator % 24
    dataframe.loc[day] = dfConsoFrigo.iloc[iterator]['Puissance']
    iterator += 1
Input (time;value) 24 rows:
Heure;Puissance
00:00;48.0
01:00;47.0
02:00;46.0
03:00;46.0
04:00;45.0
05:00;46.0
...
19:00;55.0
20:00;53.0
21:00;51.0
22:00;50.0
23:00;49.0
Expected Output (8760 rows):
Puissance
2017-01-01 00:00:00 48
2017-01-01 01:00:00 47
2017-01-01 02:00:00 46
2017-01-01 03:00:00 46
2017-01-01 04:00:00 45
...
2017-12-31 20:00:00 53
2017-12-31 21:00:00 51
2017-12-31 22:00:00 50
2017-12-31 23:00:00 49
I think you need numpy.tile:
np.random.seed(10)
df = pd.DataFrame({'Puissance':np.random.randint(100, size=24)})
rng = pd.date_range("01 Jan 2017 00:00", "31 Dec 2017 23:00", freq='1H')
df = pd.DataFrame({'a':np.tile(df['Puissance'].values, 365)}, index=rng)
print (df.head(30))
a
2017-01-01 00:00:00 9
2017-01-01 01:00:00 15
2017-01-01 02:00:00 64
2017-01-01 03:00:00 28
2017-01-01 04:00:00 89
2017-01-01 05:00:00 93
2017-01-01 06:00:00 29
2017-01-01 07:00:00 8
2017-01-01 08:00:00 73
2017-01-01 09:00:00 0
2017-01-01 10:00:00 40
2017-01-01 11:00:00 36
2017-01-01 12:00:00 16
2017-01-01 13:00:00 11
2017-01-01 14:00:00 54
2017-01-01 15:00:00 88
2017-01-01 16:00:00 62
2017-01-01 17:00:00 33
2017-01-01 18:00:00 72
2017-01-01 19:00:00 78
2017-01-01 20:00:00 49
2017-01-01 21:00:00 51
2017-01-01 22:00:00 54
2017-01-01 23:00:00 77
2017-01-02 00:00:00 9
2017-01-02 01:00:00 15
2017-01-02 02:00:00 64
2017-01-02 03:00:00 28
2017-01-02 04:00:00 89
2017-01-02 05:00:00 93
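A sketch applying the same idea to the question's CSV (path and column names taken from the question; it assumes the file really contains exactly 24 rows):
import numpy as np
import pandas as pd

dfConsoFrigo = pd.read_csv("../assets/datas/refregirateur.csv", sep=';')

rng = pd.date_range("01 Jan 2017 00:00", "31 Dec 2017 23:00", freq='1H')  # 8760 hours
dataframe = pd.DataFrame(
    {'Puissance': np.tile(dfConsoFrigo['Puissance'].to_numpy(), 365)},   # 24 values x 365 days
    index=rng,
)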
