How to add values occurring between 2 consecutive hours? - python

I have a df as follows:
dates values
2020-01-01 00:15:00 87.321
2020-01-01 00:30:00 87.818
2020-01-01 00:45:00 88.514
2020-01-01 01:00:00 89.608
2020-01-01 01:15:00 90.802
2020-01-01 01:30:00 91.896
2020-01-01 01:45:00 92.393
2020-01-01 02:00:00 91.995
2020-01-01 02:15:00 90.504
2020-01-01 02:30:00 88.216
2020-01-01 02:45:00 85.929
2020-01-01 03:00:00 84.238
I want to keep only the hourly rows (where the minute is 00), each holding the sum of the values since the previous full hour.
Example: For finding the value at 2020-01-01 01:00:00, the values from 2020-01-01 00:15:00 to 2020-01-01 01:00:00 should be added (87.321+87.818+88.514+89.608 = 353.261). Similarly, for finding the value at 2020-01-01 02:00:00, the values from 2020-01-01 01:15:00 to 2020-01-01 02:00:00 should be added (90.802+91.896+92.393+91.995 = 367.086).
Desired output
dates values
2020-01-01 01:00:00 353.261
2020-01-01 02:00:00 367.086
2020-01-01 03:00:00 348.887
I used df['dates'].dt.minute.eq(0) to obtain the boolean masking, but I am unable to find a way to add them.
Thanks in advance

hourly = (
    df.set_index('dates')  # Set the dates as index
    .resample('1H', closed='right', label='right')  # One bin per hour, closed and labelled on its right edge
    .sum()  # Set the sum of values as new value
)
hourly = hourly.reset_index()  # If you want to have the dates as a column again
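To check the resample answer against the numbers in the question, here is a minimal self-contained sketch. It rebuilds the twelve rows by hand (the construction via pd.date_range is my assumption about how the frame looks) and uses the lowercase '1h' alias, which recent pandas versions prefer over '1H'.

```python
import pandas as pd

# The twelve 15-minute readings from the question
df = pd.DataFrame({
    "dates": pd.date_range("2020-01-01 00:15:00", periods=12, freq="15min"),
    "values": [87.321, 87.818, 88.514, 89.608,
               90.802, 91.896, 92.393, 91.995,
               90.504, 88.216, 85.929, 84.238],
})

hourly = (
    df.set_index("dates")
      # closed='right': the bin (00:00, 01:00] includes the 01:00 reading
      # label='right': that bin is labelled 01:00
      .resample("1h", closed="right", label="right")
      .sum()
      .reset_index()
)
print(hourly)
```

For the twelve rows shown, the three hourly sums come out as 353.261, 367.086 and 348.887.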

Related

Matplotlib not shown x tick labels

I have a dataframe as follows (reproducible data):
import pandas as pd
import numpy as np
from datetime import datetime
np.random.seed(365)
rows = 24
cols = 1
start_date=datetime.strptime('2020-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
data = np.random.uniform(2.1, 6.5, size=(rows, cols))
index = pd.bdate_range(start_date, freq='H', periods=rows)
df = pd.DataFrame(data=data, index=index, columns=['Ta']) #Ta: Room temperature
Ta
2020-01-01 00:00:00 6.242405
2020-01-01 01:00:00 4.923052
2020-01-01 02:00:00 5.112286
2020-01-01 03:00:00 4.689673
2020-01-01 04:00:00 4.493104
2020-01-01 05:00:00 3.719512
2020-01-01 06:00:00 5.473153
2020-01-01 07:00:00 3.442055
2020-01-01 08:00:00 4.045178
2020-01-01 09:00:00 2.585951
2020-01-01 10:00:00 4.028845
2020-01-01 11:00:00 5.411510
2020-01-01 12:00:00 3.383155
2020-01-01 13:00:00 5.997180
2020-01-01 14:00:00 6.485442
2020-01-01 15:00:00 4.240901
2020-01-01 16:00:00 3.637405
2020-01-01 17:00:00 2.766216
2020-01-01 18:00:00 6.024569
2020-01-01 19:00:00 5.503587
2020-01-01 20:00:00 5.532941
2020-01-01 21:00:00 4.251602
2020-01-01 22:00:00 4.444596
2020-01-01 23:00:00 2.935362
I'm trying to plot the temperature over the entire day, but I can't see the tick marks with the specific times. Only the first one appears, and I want to see every tick.
Here's the code:
df['Ta'].plot(figsize=(20,12),legend=True,subplots=True,ylim=(0,12),
xticks=list(df.index.values),fontsize=10,grid=True,
rot=0, xlim=(pd.Timestamp('2020-01-01 00:00:00'),pd.Timestamp('2020-01-01 23:00:00')))
I have tried everything I can think of, but I can't figure it out.
Use matplotlib annotation to attach labels to the chart:
data="""Date,Ta
2020-01-01 00:00:00,6.242405
2020-01-01 01:00:00,4.923052
2020-01-01 02:00:00,5.112286
2020-01-01 03:00:00,4.689673
2020-01-01 04:00:00,4.493104
2020-01-01 05:00:00,3.719512
2020-01-01 06:00:00,5.473153
2020-01-01 07:00:00,3.442055
2020-01-01 08:00:00,4.045178
2020-01-01 09:00:00,2.585951
2020-01-01 10:00:00,4.028845
2020-01-01 11:00:00,5.411510
2020-01-01 12:00:00,3.383155
2020-01-01 13:00:00,5.997180
2020-01-01 14:00:00,6.485442
2020-01-01 15:00:00,4.240901
2020-01-01 16:00:00,3.637405
2020-01-01 17:00:00,2.766216
2020-01-01 18:00:00,6.024569
2020-01-01 19:00:00,5.503587
2020-01-01 20:00:00,5.532941
2020-01-01 21:00:00,4.251602
2020-01-01 22:00:00,4.444596
2020-01-01 23:00:00,2.935362
"""
import pandas as pd
import matplotlib.pyplot as plt
from io import StringIO

def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(20,12), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.xlim([pd.Timestamp('2020-01-01 00:00:00'),pd.Timestamp('2020-01-01 23:00:00')])
    plt.ylim(0,12)
    for index in range(len(df)):
        y2 = y.iloc[index]  # positional access, since y is indexed by date
        x2 = x[index]
        value = "{:.2f}".format(y2)
        # Pass the label positionally: the keyword was renamed from s to text in newer matplotlib
        plt.annotate(value, xy=(x2, y2))
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

df = pd.read_csv(StringIO(data), sep=',', index_col=0, parse_dates=['Date'])
plot_df(df, x=df.index, y=df.Ta, title='Temperature')
Filter your dataframe before the plot:
df.loc[df.index.normalize() == '2020-01-01'] \
.plot(figsize=(20,12), legend=True, subplots=True,
ylim=(0,12), fontsize=10, grid=True)
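Neither answer sets the ticks explicitly, so here is a hedged sketch of one more option: plot with matplotlib directly and place one tick per timestamp yourself. The data is a random stand-in (not the values from the question), and the HH:MM label format is my choice, not something the question asks for.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; omit for interactive use
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data: 24 hourly readings (random values, not the ones from the question)
rng = np.random.default_rng(365)
index = pd.date_range("2020-01-01 00:00:00", periods=24, freq="h")
df = pd.DataFrame({"Ta": rng.uniform(2.1, 6.5, size=24)}, index=index)

fig, ax = plt.subplots(figsize=(20, 12))
times = df.index.to_pydatetime()
ax.plot(times, df["Ta"].to_numpy(), label="Ta")

# One tick per timestamp, labelled as HH:MM
ax.set_xticks(mdates.date2num(times))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))
ax.set_ylim(0, 12)
ax.grid(True)
ax.legend()
fig.canvas.draw()  # render; use plt.show() or fig.savefig(...) to actually view the figure
```

Going through matplotlib's own date handling avoids relying on how DataFrame.plot chooses ticks for a regular time series.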

Split time series in intervals of non-uniform length

I have a time series with breaks (times w/o recordings) in between. A simplified example would be:
df = pd.DataFrame(
np.random.rand(13), columns=["values"],
index=pd.date_range(start='1/1/2020 11:00:00',end='1/1/2020 23:00:00',freq='H'))
df.iloc[4:7] = np.nan
df.dropna(inplace=True)
df
values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166
Now I would like to split it in intervals which are divided by a certain time span (e.g. 2h). In the example above this would be:
( values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023,
values
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166)
I was a bit surprised that I didn't find anything on this, since I thought it was a common problem. My current solution to get the start and end index of each interval is:
def intervals(data: pd.DataFrame, delta_t: timedelta = timedelta(hours=2)):
    data = data.sort_values(by=['event_timestamp'], ignore_index=True)
    breaks = (data['event_timestamp'].diff() > delta_t).astype(bool).values
    ranges = []
    start = 0
    end = start
    for i, e in enumerate(breaks):
        if not e:
            end = i
            if i == len(breaks) - 1:
                ranges.append((start, end))
                start = i
                end = start
        elif i != 0:
            ranges.append((start, end))
            start = i
            end = start
    return ranges
Any suggestions how I could do this in a smarter way? I suspect this should be somehow possible using groupby.
Yes, you can use the very convenient np.split:
dt = pd.Timedelta('2H')
parts = np.split(df, np.where(np.diff(df.index) > dt)[0] + 1)
Which gives, for your example:
>>> parts
[ values
2020-01-01 11:00:00 0.557374
2020-01-01 12:00:00 0.942296
2020-01-01 13:00:00 0.181189
2020-01-01 14:00:00 0.758822,
values
2020-01-01 18:00:00 0.682125
2020-01-01 19:00:00 0.818187
2020-01-01 20:00:00 0.053515
2020-01-01 21:00:00 0.572342
2020-01-01 22:00:00 0.423129
2020-01-01 23:00:00 0.882215]
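If passing a DataFrame to np.split gives you warnings in your environment (an assumption about your versions, not something stated above), the same cut points can be used with plain iloc slices. A minimal sketch, with the example frame rebuilt from the question and the names gaps, cuts and bounds made up for illustration:

```python
import numpy as np
import pandas as pd

# Rebuild the example: hourly data with 15:00 to 17:00 missing
df = pd.DataFrame(
    np.random.rand(13), columns=["values"],
    index=pd.date_range("2020-01-01 11:00:00", "2020-01-01 23:00:00", freq="h"),
)
df = df.drop(df.index[4:7])

dt = pd.Timedelta(hours=2)

# True where the gap to the previous row exceeds dt; each such row starts a new part
gaps = df.index.to_series().diff() > dt
cuts = np.flatnonzero(gaps.to_numpy())

# Slice between consecutive cut positions
bounds = [0, *cuts, len(df)]
parts = [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
```

Each element of parts is an ordinary DataFrame slice, so the original index is preserved.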
@Pierre thanks for your input. I have now arrived at a solution that is convenient for me:
df['diff'] = df.index.to_series().diff()
max_gap = timedelta(hours=2)
df['gapId'] = 0
df.loc[df['diff'] >= max_gap, ['gapId']] = 1
df['gapId'] = df['gapId'].cumsum()
list(df.groupby('gapId'))
gives:
[(0,
values date diff gapId
0 1.0 2020-01-01 11:00:00 NaT 0
1 1.0 2020-01-01 12:00:00 0 days 01:00:00 0
2 1.0 2020-01-01 13:00:00 0 days 01:00:00 0
3 1.0 2020-01-01 14:00:00 0 days 01:00:00 0),
(1,
values date diff gapId
7 1.0 2020-01-01 18:00:00 0 days 04:00:00 1
8 1.0 2020-01-01 19:00:00 0 days 01:00:00 1
9 1.0 2020-01-01 20:00:00 0 days 01:00:00 1
10 1.0 2020-01-01 21:00:00 0 days 01:00:00 1
11 1.0 2020-01-01 22:00:00 0 days 01:00:00 1
12 1.0 2020-01-01 23:00:00 0 days 01:00:00 1)]
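The groupby idea also works without adding helper columns to the frame, by grouping on a separate key Series. A short sketch under the same assumptions as the question's example (the name gap_id is only illustrative; note that it keeps the >= comparison used above, whereas the np.split answer uses >):

```python
import numpy as np
import pandas as pd

# Rebuild the example: hourly data with 15:00 to 17:00 missing
df = pd.DataFrame(
    np.random.rand(13), columns=["values"],
    index=pd.date_range("2020-01-01 11:00:00", "2020-01-01 23:00:00", freq="h"),
)
df = df.drop(df.index[4:7])

max_gap = pd.Timedelta(hours=2)

# Running count of gaps seen so far; rows in the same stretch share an id
gap_id = (df.index.to_series().diff() >= max_gap).cumsum()

# df itself is left untouched, each part keeps only the original columns
parts = [part for _, part in df.groupby(gap_id)]
```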

How to extract hourly data from a df in python?

I have the following df
dates Final
2020-01-01 00:15:00 94.7
2020-01-01 00:30:00 94.1
2020-01-01 00:45:00 94.1
2020-01-01 01:00:00 95.0
2020-01-01 01:15:00 96.6
2020-01-01 01:30:00 98.4
2020-01-01 01:45:00 99.8
2020-01-01 02:00:00 99.8
2020-01-01 02:15:00 98.0
2020-01-01 02:30:00 95.1
2020-01-01 02:45:00 91.9
2020-01-01 03:00:00 89.5
The full dataset runs until 2021-01-01 00:00:00 (value 95.6) at 15-minute intervals.
Since the freq is 15mins, I would like to change it to 1 hour and maybe drop the middle values
Expected output
dates Final
2020-01-01 01:00:00 95.0
2020-01-01 02:00:00 99.8
2020-01-01 03:00:00 89.5
With the last row being 2021-01-01 00:00:00 95.6
How can this be done?
Thanks
Use Series.dt.minute to perform boolean indexing:
df_filtered = df.loc[df['dates'].dt.minute.eq(0)]
#if necessary
#df_filtered = df.loc[pd.to_datetime(df['dates']).dt.minute.eq(0)]
print(df_filtered)
dates Final
3 2020-01-01 01:00:00 95.0
7 2020-01-01 02:00:00 99.8
11 2020-01-01 03:00:00 89.5
If you're doing data analysis or data science I don't think dropping the middle values is a good approach at all! You should sum them I guess (I don't know about your use case but I know some stuff about Time Series data).
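To make the difference between the two suggestions concrete, here is a small sketch on the twelve rows from the question. Whether keeping the on-the-hour reading or aggregating is right depends on what Final measures, which the question does not say; the mean below is only one possible aggregate.

```python
import pandas as pd

# The twelve 15-minute readings from the question
df = pd.DataFrame({
    "dates": pd.date_range("2020-01-01 00:15:00", periods=12, freq="15min"),
    "Final": [94.7, 94.1, 94.1, 95.0, 96.6, 98.4,
              99.8, 99.8, 98.0, 95.1, 91.9, 89.5],
})

# Option 1: keep only the on-the-hour rows, dropping the rest
on_the_hour = df.loc[df["dates"].dt.minute.eq(0)]

# Option 2: average each hour's four readings, labelled by the closing hour
hourly_mean = (
    df.set_index("dates")
      .resample("1h", closed="right", label="right")["Final"]
      .mean()
)

print(on_the_hour)
print(hourly_mean)
```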

How to fill the first date in the column?

I have a df:
dates values
2020-01-01 00:15:00 38.61487
2020-01-01 00:30:00 36.905204
2020-01-01 00:45:00 35.136584
2020-01-01 01:00:00 33.60378
2020-01-01 01:15:00 32.306791999999994
2020-01-01 01:30:00 31.304574
I am creating a new column named start as follows:
df = df.rename(columns={'dates': 'end'})
df['start']= df['end'].shift(1)
When I do this, I get the following:
end values start
2020-01-01 00:15:00 38.61487 NaT
2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
2020-01-01 01:00:00 33.60378 2020-01-01 00:45:00
2020-01-01 01:15:00 32.306791999999994 2020-01-01 01:00:00
2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
I want to fill that NaT value with
2020-01-01 00:00:00
How can this be done?
Use Series.fillna with a datetime, e.g. a Timestamp:
df['start']= df['end'].shift().fillna(pd.Timestamp('2020-01-01'))
Or, with pandas 0.24+, use the fill_value parameter:
df['start']= df['end'].shift(fill_value=pd.Timestamp('2020-01-01'))
If the datetimes are regular, always 15 minutes apart, you can instead subtract an offsets.DateOffset:
df['start']= df['end'] - pd.offsets.DateOffset(minutes=15)
print (df)
end values start
0 2020-01-01 00:15:00 38.614870 2020-01-01 00:00:00
1 2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2 2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
3 2020-01-01 01:00:00 33.603780 2020-01-01 00:45:00
4 2020-01-01 01:15:00 32.306792 2020-01-01 01:00:00
5 2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
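As a quick check that the shift and subtraction approaches agree on regularly spaced data, here is a self-contained sketch on the six rows from the question. It subtracts a pd.Timedelta rather than an offsets.DateOffset; for a fixed 15 minutes I would expect both to behave the same, but that swap is mine.

```python
import pandas as pd

# The six rows from the question, with 'dates' already renamed to 'end'
df = pd.DataFrame({
    "end": pd.date_range("2020-01-01 00:15:00", periods=6, freq="15min"),
    "values": [38.61487, 36.905204, 35.136584,
               33.60378, 32.306791999999994, 31.304574],
})

# Shift down by one row and fill the hole at the top explicitly
by_shift = df["end"].shift(fill_value=pd.Timestamp("2020-01-01"))

# Subtract the fixed step instead (only valid if every step is 15 minutes)
by_offset = df["end"] - pd.Timedelta(minutes=15)

df["start"] = by_shift
print(df)
```

If the data has gaps, the two diverge: the shift gives the previous row's end, while the subtraction always gives end minus 15 minutes.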
How about that?
df = pd.DataFrame(columns = ['end'])
df.loc[:, 'end'] = pd.date_range(start=pd.Timestamp(2019,1,1,0,15), end=pd.Timestamp(2019,1,2), freq='15min')
df.loc[:, 'start'] = df.loc[:, 'end'].shift(1)
delta = df.loc[df.index[3], 'end'] - df.loc[df.index[2], 'end']
df.loc[df.index[0], 'start'] = df.loc[df.index[1], 'start'] - delta
df
end start
0 2019-01-01 00:15:00 2019-01-01 00:00:00
1 2019-01-01 00:30:00 2019-01-01 00:15:00
2 2019-01-01 00:45:00 2019-01-01 00:30:00
3 2019-01-01 01:00:00 2019-01-01 00:45:00
4 2019-01-01 01:15:00 2019-01-01 01:00:00
... ... ...
91 2019-01-01 23:00:00 2019-01-01 22:45:00
92 2019-01-01 23:15:00 2019-01-01 23:00:00
93 2019-01-01 23:30:00 2019-01-01 23:15:00
94 2019-01-01 23:45:00 2019-01-01 23:30:00
95 2019-01-02 00:00:00 2019-01-01 23:45:00

Using pandas.resample().agg() with 'interpolate'

I need to resample a df, applying a different function to each column.
import pandas as pd
import numpy as np
df=pd.DataFrame(index=pd.date_range(start='2020-01-01 00:00:00', end='2020-01-02 00:00:00', freq='3H'),
data=np.random.rand(9,3),
columns=['A','B','C'])
df = df.resample('1H').agg({'A': 'ffill',
'B': 'interpolate',
'C': 'max'})
Functions like 'mean', 'max', 'sum' work.
But it seems that 'interpolate' can't be used like that.
Any workarounds?
One work-around would be to concat the different aggregated frames:
df_out = pd.concat([df[['A']].resample('1H').ffill(),
df[['B']].resample('1H').interpolate(),
df[['C']].resample('1H').max().ffill()],
axis=1)
[out]
A B C
2020-01-01 00:00:00 0.836547 0.436186 0.520913
2020-01-01 01:00:00 0.836547 0.315646 0.520913
2020-01-01 02:00:00 0.836547 0.195106 0.520913
2020-01-01 03:00:00 0.577291 0.074566 0.754697
2020-01-01 04:00:00 0.577291 0.346092 0.754697
2020-01-01 05:00:00 0.577291 0.617617 0.754697
2020-01-01 06:00:00 0.490666 0.889143 0.685191
2020-01-01 07:00:00 0.490666 0.677584 0.685191
2020-01-01 08:00:00 0.490666 0.466025 0.685191
2020-01-01 09:00:00 0.603678 0.254466 0.605424
2020-01-01 10:00:00 0.603678 0.358240 0.605424
2020-01-01 11:00:00 0.603678 0.462014 0.605424
2020-01-01 12:00:00 0.179458 0.565788 0.596706
2020-01-01 13:00:00 0.179458 0.477367 0.596706
2020-01-01 14:00:00 0.179458 0.388946 0.596706
2020-01-01 15:00:00 0.702992 0.300526 0.476644
2020-01-01 16:00:00 0.702992 0.516952 0.476644
2020-01-01 17:00:00 0.702992 0.733378 0.476644
2020-01-01 18:00:00 0.884276 0.949804 0.793237
2020-01-01 19:00:00 0.884276 0.907233 0.793237
2020-01-01 20:00:00 0.884276 0.864661 0.793237
2020-01-01 21:00:00 0.283859 0.822090 0.186542
2020-01-01 22:00:00 0.283859 0.834956 0.186542
2020-01-01 23:00:00 0.283859 0.847822 0.186542
2020-01-02 00:00:00 0.410897 0.860688 0.894249
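Another possible workaround, sketched here under the assumption that you are only upsampling (3-hourly to hourly, as in the question): upsample once with asfreq and then fill each column in its own way. The random data and the seed are stand-ins.

```python
import numpy as np
import pandas as pd

# Stand-in data: nine 3-hourly rows, three columns
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((9, 3)),
    index=pd.date_range("2020-01-01 00:00:00", "2020-01-02 00:00:00", freq="3h"),
    columns=["A", "B", "C"],
)

# Upsample once; the newly created hourly rows are NaN until filled
out = df.resample("1h").asfreq()

out["A"] = out["A"].ffill()        # carry the last known value forward
out["B"] = out["B"].interpolate()  # linear between the 3-hourly points
out["C"] = out["C"].ffill()        # each original bin holds one value, so its max is that value
print(out)
```

This does not generalise to downsampling, where 'max' actually has several values to reduce; in that case the concat approach above is the safer pattern.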
