I have a dataframe as follows (reproducible data):
import pandas as pd
import numpy as np
from datetime import datetime
np.random.seed(365)
rows = 24
start_date = datetime.strptime('2020-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
data = np.random.uniform(2.1, 6.5, size=(rows, 1))
index = pd.date_range(start_date, freq='H', periods=rows)
df = pd.DataFrame(data=data, index=index, columns=['Ta'])  # Ta: room temperature
Ta
2020-01-01 00:00:00 6.242405
2020-01-01 01:00:00 4.923052
2020-01-01 02:00:00 5.112286
2020-01-01 03:00:00 4.689673
2020-01-01 04:00:00 4.493104
2020-01-01 05:00:00 3.719512
2020-01-01 06:00:00 5.473153
2020-01-01 07:00:00 3.442055
2020-01-01 08:00:00 4.045178
2020-01-01 09:00:00 2.585951
2020-01-01 10:00:00 4.028845
2020-01-01 11:00:00 5.411510
2020-01-01 12:00:00 3.383155
2020-01-01 13:00:00 5.997180
2020-01-01 14:00:00 6.485442
2020-01-01 15:00:00 4.240901
2020-01-01 16:00:00 3.637405
2020-01-01 17:00:00 2.766216
2020-01-01 18:00:00 6.024569
2020-01-01 19:00:00 5.503587
2020-01-01 20:00:00 5.532941
2020-01-01 21:00:00 4.251602
2020-01-01 22:00:00 4.444596
2020-01-01 23:00:00 2.935362
I'm trying to plot the temperature over the entire day, but I can't see the tick marks with the specific dates: only the first one appears, and I want to see every tick.
Here's the code:
df['Ta'].plot(figsize=(20, 12), legend=True, subplots=True, ylim=(0, 12),
              xticks=list(df.index.values), fontsize=10, grid=True, rot=0,
              xlim=(pd.Timestamp('2020-01-01 00:00:00'), pd.Timestamp('2020-01-01 23:00:00')))
I have tried everything I could think of, but I can't figure it out.
Use matplotlib annotation to attach labels to the chart:
data="""Date,Ta
2020-01-01 00:00:00,6.242405
2020-01-01 01:00:00,4.923052
2020-01-01 02:00:00,5.112286
2020-01-01 03:00:00,4.689673
2020-01-01 04:00:00,4.493104
2020-01-01 05:00:00,3.719512
2020-01-01 06:00:00,5.473153
2020-01-01 07:00:00,3.442055
2020-01-01 08:00:00,4.045178
2020-01-01 09:00:00,2.585951
2020-01-01 10:00:00,4.028845
2020-01-01 11:00:00,5.411510
2020-01-01 12:00:00,3.383155
2020-01-01 13:00:00,5.997180
2020-01-01 14:00:00,6.485442
2020-01-01 15:00:00,4.240901
2020-01-01 16:00:00,3.637405
2020-01-01 17:00:00,2.766216
2020-01-01 18:00:00,6.024569
2020-01-01 19:00:00,5.503587
2020-01-01 20:00:00,5.532941
2020-01-01 21:00:00,4.251602
2020-01-01 22:00:00,4.444596
2020-01-01 23:00:00,2.935362
"""
from io import StringIO

import matplotlib.pyplot as plt
import pandas as pd

def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(20, 12), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.xlim([pd.Timestamp('2020-01-01 00:00:00'), pd.Timestamp('2020-01-01 23:00:00')])
    plt.ylim(0, 12)
    for index in range(len(df)):
        x2, y2 = x[index], y[index]
        value = "{:.2f}".format(y2)
        plt.annotate(text=value, xy=(x2, y2))  # keyword is 'text' (was 's' before matplotlib 3.3)
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

df = pd.read_csv(StringIO(data), sep=',', index_col=0, parse_dates=['Date'])
plot_df(df, x=df.index, y=df.Ta, title='Temperature')
Filter your dataframe before the plot:
df.loc[df.index.normalize() == '2020-01-01'] \
.plot(figsize=(20,12), legend=True, subplots=True,
ylim=(0,12), fontsize=10, grid=True)
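If the goal is a visible tick for every hour, another option is to bypass the pandas .plot wrapper (which handles datetime ticks on its own) and set an explicit locator and formatter with plain matplotlib. This is a sketch under that assumption, with the question's data rebuilt inline:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(365)
index = pd.date_range("2020-01-01", freq="h", periods=24)
df = pd.DataFrame({"Ta": np.random.uniform(2.1, 6.5, 24)}, index=index)

fig, ax = plt.subplots(figsize=(20, 6))
ax.plot(df.index, df["Ta"], label="Ta")
ax.xaxis.set_major_locator(mdates.HourLocator(interval=1))   # one tick per hour
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))  # e.g. 13:00
ax.set_ylim(0, 12)
ax.grid(True)
ax.legend()
fig.autofmt_xdate()  # rotate the hourly labels so they don't overlap
```

With HourLocator(interval=1) every hour gets its own tick regardless of figure width, which is exactly what the pandas wrapper was suppressing.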
I have a time series with breaks (times w/o recordings) in between. A simplified example would be:
df = pd.DataFrame(
np.random.rand(13), columns=["values"],
index=pd.date_range(start='1/1/2020 11:00:00',end='1/1/2020 23:00:00',freq='H'))
df.iloc[4:7] = np.nan
df.dropna(inplace=True)
df
values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166
Now I would like to split it into intervals that are divided by a certain time span (e.g. 2h). In the example above this would be:
( values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023,
values
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166)
I was a bit surprised that I didn't find anything on this, since I thought it's a common problem. My current solution to get the start and end index of each interval is:
def intervals(data: pd.DataFrame, delta_t: timedelta = timedelta(hours=2)):
    # assumes the timestamps are in an 'event_timestamp' column
    data = data.sort_values(by=['event_timestamp'], ignore_index=True)
    breaks = (data['event_timestamp'].diff() > delta_t).astype(bool).values
    ranges = []
    start = 0
    end = start
    for i, e in enumerate(breaks):
        if not e:
            end = i
            if i == len(breaks) - 1:
                ranges.append((start, end))
                start = i
                end = start
        elif i != 0:
            ranges.append((start, end))
            start = i
            end = start
    return ranges
Any suggestions how I could do this in a smarter way? I suspect this should be somehow possible using groupby.
Yes, you can use the very convenient np.split:
dt = pd.Timedelta('2H')
parts = np.split(df, np.where(np.diff(df.index) > dt)[0] + 1)
Which gives, for your example:
>>> parts
[ values
2020-01-01 11:00:00 0.557374
2020-01-01 12:00:00 0.942296
2020-01-01 13:00:00 0.181189
2020-01-01 14:00:00 0.758822,
values
2020-01-01 18:00:00 0.682125
2020-01-01 19:00:00 0.818187
2020-01-01 20:00:00 0.053515
2020-01-01 21:00:00 0.572342
2020-01-01 22:00:00 0.423129
2020-01-01 23:00:00 0.882215]
@Pierre, thanks for your input. I now got to a solution that is convenient for me:
df['diff'] = df.index.to_series().diff()
max_gap = timedelta(hours=2)
df['gapId'] = 0
df.loc[df['diff'] >= max_gap, ['gapId']] = 1
df['gapId'] = df['gapId'].cumsum()
list(df.groupby('gapId'))
gives:
[(0,
values date diff gapId
0 1.0 2020-01-01 11:00:00 NaT 0
1 1.0 2020-01-01 12:00:00 0 days 01:00:00 0
2 1.0 2020-01-01 13:00:00 0 days 01:00:00 0
3 1.0 2020-01-01 14:00:00 0 days 01:00:00 0),
(1,
values date diff gapId
7 1.0 2020-01-01 18:00:00 0 days 04:00:00 1
8 1.0 2020-01-01 19:00:00 0 days 01:00:00 1
9 1.0 2020-01-01 20:00:00 0 days 01:00:00 1
10 1.0 2020-01-01 21:00:00 0 days 01:00:00 1
11 1.0 2020-01-01 22:00:00 0 days 01:00:00 1
12 1.0 2020-01-01 23:00:00 0 days 01:00:00 1)]
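For reference, the same diff/cumsum idea can be condensed into a couple of lines without the helper columns. A sketch (the values here are just placeholders, the timestamps match the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"values": np.arange(10.0)},
    index=pd.DatetimeIndex(["2020-01-01 11:00", "2020-01-01 12:00",
                            "2020-01-01 13:00", "2020-01-01 14:00",
                            "2020-01-01 18:00", "2020-01-01 19:00",
                            "2020-01-01 20:00", "2020-01-01 21:00",
                            "2020-01-01 22:00", "2020-01-01 23:00"]))

max_gap = pd.Timedelta(hours=2)
# diff() is NaT for the first row, which compares as False, so the first group id is 0
gap_id = df.index.to_series().diff().gt(max_gap).cumsum()  # increments at every break
parts = [group for _, group in df.groupby(gap_id)]
```

Each element of parts is a sub-DataFrame covering one gap-free interval.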
I have a DataFrame that resembles as follows:
import pandas as pd
import numpy as np
date = pd.date_range(start='2020-01-01', freq='H', periods=4)
locations = ["AA3", "AB1", "AD1", "AC0"]
x = [5.5, 10.2, np.nan, 2.3, 11.2, np.nan, 2.1, 4.0, 6.1, np.nan, 20.3, 11.3, 4.9, 15.2, 21.3, np.nan]
df = pd.DataFrame({'x': x})
df.index = pd.MultiIndex.from_product([locations, date], names=['location', 'date'])
df = df.sort_index()
df
x
location date
AA3 2020-01-01 00:00:00 5.5
2020-01-01 01:00:00 10.2
2020-01-01 02:00:00 NaN
2020-01-01 03:00:00 2.3
AB1 2020-01-01 00:00:00 11.2
2020-01-01 01:00:00 NaN
2020-01-01 02:00:00 2.1
2020-01-01 03:00:00 4.0
AC0 2020-01-01 00:00:00 4.9
2020-01-01 01:00:00 15.2
2020-01-01 02:00:00 21.3
2020-01-01 03:00:00 NaN
AD1 2020-01-01 00:00:00 6.1
2020-01-01 01:00:00 NaN
2020-01-01 02:00:00 20.3
2020-01-01 03:00:00 11.3
Index values are location codes and hours of the day. I want to fill missing values of column x with a valid value of the same column from the nearest location on the same day and hour, where the proximity of each location to the others is defined as
nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
"AB1": ["AB1", "AA3", "AC0", "AD1"],
"AD1": ["AD1", "AC0", "AB1", "AA3"],
"AC0": ["AC0", "AD1", "AA3", "AB1"]})
nearest
AA3 AB1 AD1 AC0
0 AA3 AB1 AD1 AC0
1 AB1 AA3 AC0 AD1
2 AD1 AC0 AB1 AA3
3 AC0 AD1 AA3 AB1
In this dataset, column names are location codes, and the row values under each column list the other locations in order of their proximity to the location named in the column header.
If the nearest location also has missing value on the same day and hour, then I would take the value of the second nearest location on the same day and hour. If the second nearest location is missing, then the third nearest location on the same day and hour, and so on.
Desired output:
x
location date
AA3 2020-01-01 00:00:00 5.5
2020-01-01 01:00:00 10.2
2020-01-01 02:00:00 2.1
2020-01-01 03:00:00 2.3
AB1 2020-01-01 00:00:00 11.2
2020-01-01 01:00:00 10.2
2020-01-01 02:00:00 2.1
2020-01-01 03:00:00 4.0
AC0 2020-01-01 00:00:00 4.9
2020-01-01 01:00:00 15.2
2020-01-01 02:00:00 21.3
2020-01-01 03:00:00 11.3
AD1 2020-01-01 00:00:00 6.1
2020-01-01 01:00:00 15.2
2020-01-01 02:00:00 20.3
2020-01-01 03:00:00 11.3
The following, based on suggestions by @kiona1018, works as intended, but it is slow.
def fillna_by_nearest(x: pd.Series, nn_data: pd.DataFrame):
    out = x.copy()
    for index, value in x.items():  # iteritems() was removed in pandas 2.0
        if np.isnan(value) and (index[0] in nn_data.columns):
            location, date = index
            for near_location in nn_data[location]:
                if ((near_location, date) in x.index) and pd.notna(x.loc[near_location, date]):
                    out.loc[index] = x.loc[near_location, date]
                    break
    return out

fillna_by_nearest(df['x'], nearest)
I agree with Serial Lazer that there is no neater pandas/numpy one-liner for this. The requirement is a little complicated, so in such a case you should write your own function. An example is below.
nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
"AB1": ["AB1", "AA3", "AC0", "AD1"],
"AD1": ["AD1", "AC0", "AB1", "AA3"],
"AC0": ["AC0", "AD1", "AA3", "AB1"]})
def fill_by_nearest(sr: pd.Series):
    if not np.isnan(sr['x']):
        return sr
    location, date = sr.name
    for near_location in nearest[location]:
        if not np.isnan(df.loc[near_location, date]['x']):
            sr['x'] = df.loc[near_location, date]['x']
            return sr
    return sr

df = df.apply(fill_by_nearest, axis=1)
You can use the apply function:
def find_nearest(row):
    for item in list(nearest[row['location']]):
        match = df[lambda x: (x['location'] == item) & (x['date'] == row['date']) & (~pd.isnull(x['x']))]
        if len(match):
            return match.x.values[0]

df = df.reset_index()
df = df.assign(x=lambda x: x.apply(find_nearest, axis=1))
Output:
location date x
0 AA3 2020-01-01 00:00:00 5.5
1 AA3 2020-01-01 01:00:00 10.2
2 AA3 2020-01-01 02:00:00 2.1
3 AA3 2020-01-01 03:00:00 2.3
4 AB1 2020-01-01 00:00:00 11.2
5 AB1 2020-01-01 01:00:00 10.2
6 AB1 2020-01-01 02:00:00 2.1
7 AB1 2020-01-01 03:00:00 4.0
8 AC0 2020-01-01 00:00:00 4.9
9 AC0 2020-01-01 01:00:00 15.2
10 AC0 2020-01-01 02:00:00 21.3
11 AC0 2020-01-01 03:00:00 11.3
12 AD1 2020-01-01 00:00:00 6.1
13 AD1 2020-01-01 01:00:00 15.2
14 AD1 2020-01-01 02:00:00 20.3
15 AD1 2020-01-01 03:00:00 11.3
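For larger frames, the row-wise apply can be avoided by reshaping to a dates-by-locations table and filling each column from its neighbours in proximity order. This is a sketch of that idea, not the answerers' code:

```python
import numpy as np
import pandas as pd

# question's data, rebuilt inline
date = pd.date_range(start="2020-01-01", freq="h", periods=4)
locations = ["AA3", "AB1", "AD1", "AC0"]
x = [5.5, 10.2, np.nan, 2.3, 11.2, np.nan, 2.1, 4.0,
     6.1, np.nan, 20.3, 11.3, 4.9, 15.2, 21.3, np.nan]
df = pd.DataFrame({"x": x},
                  index=pd.MultiIndex.from_product([locations, date],
                                                   names=["location", "date"])).sort_index()

nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
                        "AB1": ["AB1", "AA3", "AC0", "AD1"],
                        "AD1": ["AD1", "AC0", "AB1", "AA3"],
                        "AC0": ["AC0", "AD1", "AA3", "AB1"]})

wide = df["x"].unstack("location")   # rows: dates, columns: locations
filled = wide.copy()
for loc in wide.columns:
    for near in nearest[loc]:        # walk the neighbours in proximity order
        filled[loc] = filled[loc].fillna(wide[near])

out = filled.unstack().rename("x").sort_index()  # back to a (location, date) index
```

Each fillna call operates on a whole column at once, so the Python-level loop runs only locations × neighbours times rather than once per row.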
I have a df as follows:
dates values
2020-01-01 00:15:00 87.321
2020-01-01 00:30:00 87.818
2020-01-01 00:45:00 88.514
2020-01-01 01:00:00 89.608
2020-01-01 01:15:00 90.802
2020-01-01 01:30:00 91.896
2020-01-01 01:45:00 92.393
2020-01-01 02:00:00 91.995
2020-01-01 02:15:00 90.504
2020-01-01 02:30:00 88.216
2020-01-01 02:45:00 85.929
2020-01-01 03:00:00 84.238
I want to keep only the hourly values (where the minute is 00), and the values occurring before each hour must be added to it.
Example: for the value at 2020-01-01 01:00:00, the values from 2020-01-01 00:15:00 to 2020-01-01 01:00:00 should be added (87.321 + 87.818 + 88.514 + 89.608 = 353.261). Similarly, for the value at 2020-01-01 02:00:00, the values from 2020-01-01 01:15:00 to 2020-01-01 02:00:00 should be added (90.802 + 91.896 + 92.393 + 91.995 = 367.086)
Desired output
dates values
2020-01-01 01:00:00 353.261
2020-01-01 02:00:00 367.086
2020-01-01 03:00:00 348.887
I used df['dates'].dt.minute.eq(0) to obtain the boolean mask, but I am unable to find a way to add the values.
Thanks in advance
hourly = (df.set_index('dates')                             # set the dates as index
            .resample('1H', closed='right', label='right')  # one right-closed bin per hour
            .sum())                                         # sum the values in each bin
hourly = hourly.reset_index()                               # if you want the dates as a column again
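A self-contained run of this approach on the question's data (rebuilt inline) can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({
    "dates": pd.date_range("2020-01-01 00:15:00", periods=12, freq="15min"),
    "values": [87.321, 87.818, 88.514, 89.608, 90.802, 91.896,
               92.393, 91.995, 90.504, 88.216, 85.929, 84.238],
})

hourly = (df.set_index("dates")
            .resample("1h", closed="right", label="right")  # lowercase freq alias for recent pandas
            .sum()
            .reset_index())
```

closed='right' puts each hour mark and the three quarter-hour values before it into the same bin, and label='right' names that bin after the hour mark, which is exactly the grouping the question describes.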
I have a df:
dates values
2020-01-01 00:15:00 38.61487
2020-01-01 00:30:00 36.905204
2020-01-01 00:45:00 35.136584
2020-01-01 01:00:00 33.60378
2020-01-01 01:15:00 32.306791999999994
2020-01-01 01:30:00 31.304574
I am creating a new column named start as follows:
df = df.rename(columns={'dates': 'end'})
df['start']= df['end'].shift(1)
When I do this, I get the following:
end values start
2020-01-01 00:15:00 38.61487 NaT
2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
2020-01-01 01:00:00 33.60378 2020-01-01 00:45:00
2020-01-01 01:15:00 32.306791999999994 2020-01-01 01:00:00
2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
I want to fill that NaT value with
2020-01-01 00:00:00
How can this be done?
Use Series.fillna with datetimes, e.g. by Timestamp:
df['start']= df['end'].shift().fillna(pd.Timestamp('2020-01-01'))
Or if pandas 0.24+ with fill_value parameter:
df['start']= df['end'].shift(fill_value=pd.Timestamp('2020-01-01'))
If all datetimes are regular, always with a 15-minute difference, it is possible to subtract an offsets.DateOffset:
df['start']= df['end'] - pd.offsets.DateOffset(minutes=15)
print (df)
end values start
0 2020-01-01 00:15:00 38.614870 2020-01-01 00:00:00
1 2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2 2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
3 2020-01-01 01:00:00 33.603780 2020-01-01 00:45:00
4 2020-01-01 01:15:00 32.306792 2020-01-01 01:00:00
5 2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
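The fill_value variant can be checked end to end with a small frame (the values are rounded here for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "end": pd.date_range("2020-01-01 00:15:00", periods=6, freq="15min"),
    "values": [38.61, 36.91, 35.14, 33.60, 32.31, 31.30],
})
# shift down by one row; the freshly exposed first slot gets the Timestamp instead of NaT
df["start"] = df["end"].shift(fill_value=pd.Timestamp("2020-01-01"))
```

Because fill_value is applied during the shift, no separate fillna pass is needed and the column dtype stays datetime64 throughout.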
How about that?
df = pd.DataFrame(columns = ['end'])
df.loc[:, 'end'] = pd.date_range(start=pd.Timestamp(2019,1,1,0,15), end=pd.Timestamp(2019,1,2), freq='15min')
df.loc[:, 'start'] = df.loc[:, 'end'].shift(1)
delta = df.loc[df.index[3], 'end'] - df.loc[df.index[2], 'end']
df.loc[df.index[0], 'start'] = df.loc[df.index[1], 'start'] - delta
df
end start
0 2019-01-01 00:15:00 2019-01-01 00:00:00
1 2019-01-01 00:30:00 2019-01-01 00:15:00
2 2019-01-01 00:45:00 2019-01-01 00:30:00
3 2019-01-01 01:00:00 2019-01-01 00:45:00
4 2019-01-01 01:15:00 2019-01-01 01:00:00
... ... ...
91 2019-01-01 23:00:00 2019-01-01 22:45:00
92 2019-01-01 23:15:00 2019-01-01 23:00:00
93 2019-01-01 23:30:00 2019-01-01 23:15:00
94 2019-01-01 23:45:00 2019-01-01 23:30:00
95 2019-01-02 00:00:00 2019-01-01 23:45:00
I need to resample a df, applying different functions to different columns.
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(start='2020-01-01 00:00:00', end='2020-01-02 00:00:00', freq='3H'),
                  data=np.random.rand(9, 3),
                  columns=['A', 'B', 'C'])
df = df.resample('1H').agg({'A': 'ffill',
'B': 'interpolate',
'C': 'max'})
Functions like 'mean', 'max', 'sum' work.
But it seems that 'interpolate' can't be used like that.
Any workarounds?
One workaround would be to concat the differently aggregated frames:
df_out = pd.concat([df[['A']].resample('1H').ffill(),
df[['B']].resample('1H').interpolate(),
df[['C']].resample('1H').max().ffill()],
axis=1)
[out]
A B C
2020-01-01 00:00:00 0.836547 0.436186 0.520913
2020-01-01 01:00:00 0.836547 0.315646 0.520913
2020-01-01 02:00:00 0.836547 0.195106 0.520913
2020-01-01 03:00:00 0.577291 0.074566 0.754697
2020-01-01 04:00:00 0.577291 0.346092 0.754697
2020-01-01 05:00:00 0.577291 0.617617 0.754697
2020-01-01 06:00:00 0.490666 0.889143 0.685191
2020-01-01 07:00:00 0.490666 0.677584 0.685191
2020-01-01 08:00:00 0.490666 0.466025 0.685191
2020-01-01 09:00:00 0.603678 0.254466 0.605424
2020-01-01 10:00:00 0.603678 0.358240 0.605424
2020-01-01 11:00:00 0.603678 0.462014 0.605424
2020-01-01 12:00:00 0.179458 0.565788 0.596706
2020-01-01 13:00:00 0.179458 0.477367 0.596706
2020-01-01 14:00:00 0.179458 0.388946 0.596706
2020-01-01 15:00:00 0.702992 0.300526 0.476644
2020-01-01 16:00:00 0.702992 0.516952 0.476644
2020-01-01 17:00:00 0.702992 0.733378 0.476644
2020-01-01 18:00:00 0.884276 0.949804 0.793237
2020-01-01 19:00:00 0.884276 0.907233 0.793237
2020-01-01 20:00:00 0.884276 0.864661 0.793237
2020-01-01 21:00:00 0.283859 0.822090 0.186542
2020-01-01 22:00:00 0.283859 0.834956 0.186542
2020-01-01 23:00:00 0.283859 0.847822 0.186542
2020-01-02 00:00:00 0.410897 0.860688 0.894249
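The concat approach runs end to end; a self-contained sketch with a fixed seed (the seed value is arbitrary, chosen only to make the run reproducible):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(index=pd.date_range("2020-01-01", "2020-01-02", freq="3h"),
                  data=np.random.rand(9, 3), columns=["A", "B", "C"])

# each column gets its own resampler: step-fill A, linearly interpolate B,
# take the max of C per bin and forward-fill the empty hourly bins
df_out = pd.concat([df[["A"]].resample("1h").ffill(),
                    df[["B"]].resample("1h").interpolate(),
                    df[["C"]].resample("1h").max().ffill()],
                   axis=1)
```

Since each piece is resampled onto the same hourly index, the axis=1 concat aligns them back into one frame with no missing values.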