matplotlib plots strange horizontal lines on graph - python

I have used openpyxl to read data from an Excel spreadsheet into a pandas data frame, called 'tides'. The dataset contains over 32,000 rows of data (of tide times in the UK measured every 15 minutes). One of the columns contains date and time information (variable called 'datetime') and another contains the height of the tide (called 'tide'):
I want to plot datetime along the x-axis and tide on the y axis using:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import openpyxl
import datetime as dt
from matplotlib.dates import date2num
# Data imported from Excel spreadsheet into DataFrame using openpyxl.
# Code omitted for ease of reading.
# Convert datatime variable to datetime64 format:
tides['datetime'] = pd.to_datetime(tides['datetime'])
# Plot figure of 'datetime' vs 'tide':
fig = plt.figure()
ax_tides = fig.add_subplot(1,1,1)
ax_tides.plot_date(date2num(tides['datetime']), tides['tide'], '-', xdate=True, label='Tides 2011', linewidth=0.5)
min_datetime = dt.datetime.strptime('01/01/2011 00:00:00',"%d/%m/%Y %H:%M:%S")
max_datetime = dt.datetime.strptime('03/01/2011 23:59:45',"%d/%m/%Y %H:%M:%S")
ax_tides.set_xlim( [min_datetime, max_datetime] )
plt.show()
The plot shows just the first few days of data. However, at the change from one day to the next, something strange happens; after the last point of day 1, the line disappears off to the right and then returns to plot the first point of the second day - but the data is plotted incorrectly on the y axis. This happens throughout the dataset. A printout shows that the data seems to be OK.
number datetime tide
0 1 2011-01-01 00:00:00 4.296
1 2 2011-01-01 00:15:00 4.024
2 3 2011-01-01 00:30:00 3.768
3 4 2011-01-01 00:45:00 3.521
4 5 2011-01-01 01:00:00 3.292
5 6 2011-01-01 01:15:00 3.081
6 7 2011-01-01 01:30:00 2.887
7 8 2011-01-01 01:45:00 2.718
8 9 2011-01-01 02:00:00 2.577
9 10 2011-01-01 02:15:00 2.470
10 11 2011-01-01 02:30:00 2.403
11 12 2011-01-01 02:45:00 2.389
12 13 2011-01-01 03:00:00 2.417
13 14 2011-01-01 03:15:00 2.492
14 15 2011-01-01 03:30:00 2.611
15 16 2011-01-01 03:45:00 2.785
16 17 2011-01-01 04:00:00 3.020
17 18 2011-01-01 04:15:00 3.314
18 19 2011-01-01 04:30:00 3.665
19 20 2011-01-01 04:45:00 4.059
20 21 2011-01-01 05:00:00 4.483
[21 rows x 3 columns]
number datetime tide
90 91 2011-01-01 22:30:00 7.329
91 92 2011-01-01 22:45:00 7.014
92 93 2011-01-01 23:00:00 6.690
93 94 2011-01-01 23:15:00 6.352
94 95 2011-01-01 23:30:00 6.016
95 96 2011-01-01 23:45:00 5.690
96 97 2011-02-01 00:00:00 5.366
97 98 2011-02-01 00:15:00 5.043
98 99 2011-02-01 00:30:00 4.729
99 100 2011-02-01 00:45:00 4.426
100 101 2011-02-01 01:00:00 4.123
101 102 2011-02-01 01:15:00 3.832
102 103 2011-02-01 01:30:00 3.562
103 104 2011-02-01 01:45:00 3.303
104 105 2011-02-01 02:00:00 3.055
105 106 2011-02-01 02:15:00 2.827
106 107 2011-02-01 02:30:00 2.620
107 108 2011-02-01 02:45:00 2.434
108 109 2011-02-01 03:00:00 2.268
109 110 2011-02-01 03:15:00 2.141
110 111 2011-02-01 03:30:00 2.060
[21 rows x 3 columns]
number datetime tide
35020 35021 2011-12-31 19:00:00 5.123
35021 35022 2011-12-31 19:15:00 4.838
35022 35023 2011-12-31 19:30:00 4.551
35023 35024 2011-12-31 19:45:00 4.279
35024 35025 2011-12-31 20:00:00 4.033
35025 35026 2011-12-31 20:15:00 3.803
35026 35027 2011-12-31 20:30:00 3.617
35027 35028 2011-12-31 20:45:00 3.438
35028 35029 2011-12-31 21:00:00 3.278
35029 35030 2011-12-31 21:15:00 3.141
35030 35031 2011-12-31 21:30:00 3.019
35031 35032 2011-12-31 21:45:00 2.942
35032 35033 2011-12-31 22:00:00 2.909
35033 35034 2011-12-31 22:15:00 2.918
35034 35035 2011-12-31 22:30:00 2.923
35035 35036 2011-12-31 22:45:00 2.985
35036 35037 2011-12-31 23:00:00 3.075
35037 35038 2011-12-31 23:15:00 3.242
35038 35039 2011-12-31 23:30:00 3.442
35039 35040 2011-12-31 23:45:00 3.671
I am at a loss to explain this. Can anyone explain what is happening, why it is happening, and how I can correct it?
Thanks in advance.
Phil

Doh! Finally found the answer. The original workflow was quite complicated. I stored the data in an Excel spreadsheet and used openpyxl to read data from a named cell range. This was then converted to a pandas DataFrame. The date-and-time variable was converted to datetime format using pandas' .to_datetime() function. And finally the data were plotted using matplotlib. As I was preparing the data to post to this forum (as suggested by rauparaha) and paring down the script to its bare essentials, I noticed that Day 1 data were plotted on 01 Jan 2011 but Day 2 data were plotted on 01 Feb 2011. If you look at the output in the original post, the dates are in mixed formats: the last date given is '2011-12-31' (i.e. year-month-day) but the second date, representing 2 Jan 2011, is '2011-02-01' (i.e. year-day-month).
So it looks like I misunderstood how the pandas .to_datetime() function interprets datetime information. I had purposely not set the infer_datetime_format argument (default=False) and had assumed any problems would have been flagged up. But it seems pandas assumes dates are in a month-first format; when a value cannot be month-first (e.g. the day is greater than 12), it silently falls back to day-first for that value. I should have picked that up!
I have corrected the problem by providing a string that explicitly defines the datetime format. All is fine again.
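For anyone hitting the same problem, here is a minimal sketch of the fix, assuming day-first strings like '01/01/2011 00:00:00' (the same format used in the strptime calls above):
# An explicit format string stops pandas from guessing month-first vs day-first:
tides['datetime'] = pd.to_datetime(tides['datetime'], format="%d/%m/%Y %H:%M:%S")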
Thanks again for your suggestions. And apologies for any confusion.
Cheers.

I have been unable to replicate your error but perhaps my working dummy code can help diagnose the problem. I generated dummy data and plotted it with this code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
ydata = np.sin(np.linspace(0, 10, num=200))
# pd.datetime and pd.datetools were removed from pandas; use a date string and a freq alias
time_index = pd.date_range(start='2000-01-01', periods=200, freq='15min')
df = pd.DataFrame({'tides': ydata, 'datetime': time_index})
df.plot(x='datetime', y='tides')
plt.show()
My data looks like this
datetime tides
0 2000-01-01 00:00:00 0.000000
1 2000-01-01 00:15:00 0.050230
2 2000-01-01 00:30:00 0.100333
3 2000-01-01 00:45:00 0.150183
4 2000-01-01 01:00:00 0.199654
[200 rows]
and generates the expected plot with no stray lines (image omitted).

Related

Efficiently slicing non-integer multilevel indexes with integers in Pandas

The following code generates a sample DataFrame with a multilevel index. The first level is a string, the second level is a datetime.
Script
import pandas as pd
from datetime import datetime
import random
networks = ['ALPHA','BETA','GAMMA']
times = pd.date_range(datetime.strptime('2021-01-01 00:00:00','%Y-%m-%d %H:%M:%S'),
                      datetime.strptime('2021-01-01 12:00:00','%Y-%m-%d %H:%M:%S'), 7).tolist()
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for n in networks:
    for t in times:
        rows.append({'network': n, 'time': t,
                     'active_clients': random.randint(10, 30),
                     'throughput': random.randint(1500, 5000),
                     'speed': random.randint(10000, 12000)})
df = pd.DataFrame(rows).set_index(['network', 'time'])
print(df.to_string())
Output
active_clients throughput speed
network time
ALPHA 2021-01-01 00:00:00 16 4044 11023
2021-01-01 02:00:00 17 2966 10933
2021-01-01 04:00:00 10 4649 11981
2021-01-01 06:00:00 23 3629 10113
2021-01-01 08:00:00 30 2520 11159
2021-01-01 10:00:00 10 4200 11309
2021-01-01 12:00:00 16 3878 11366
BETA 2021-01-01 00:00:00 17 3073 11798
2021-01-01 02:00:00 20 1941 10640
2021-01-01 04:00:00 17 1980 11869
2021-01-01 06:00:00 23 3346 10002
2021-01-01 08:00:00 10 1952 10063
2021-01-01 10:00:00 28 3788 11047
2021-01-01 12:00:00 24 4993 10487
GAMMA 2021-01-01 00:00:00 21 4366 11587
2021-01-01 02:00:00 22 3404 11669
2021-01-01 04:00:00 20 1608 10344
2021-01-01 06:00:00 28 1849 10278
2021-01-01 08:00:00 14 3229 11925
2021-01-01 10:00:00 21 3408 10411
2021-01-01 12:00:00 12 1799 10492
For each item in the first level, I want to select the last three records in the second level. The catch is that I don't know the datetime values, so I need to select by integer-based index location instead. What's the most efficient way of slicing the DataFrame to achieve the following?
Desired output
active_clients throughput speed
network time
ALPHA 2021-01-01 08:00:00 30 2520 11159
2021-01-01 10:00:00 10 4200 11309
2021-01-01 12:00:00 16 3878 11366
BETA 2021-01-01 08:00:00 10 1952 10063
2021-01-01 10:00:00 28 3788 11047
2021-01-01 12:00:00 24 4993 10487
GAMMA 2021-01-01 08:00:00 14 3229 11925
2021-01-01 10:00:00 21 3408 10411
2021-01-01 12:00:00 12 1799 10492
My attempts
Returns the full dataframe:
df_sel = df.iloc[:,-3:]
Raises an error because loc doesn't support using integer values on datetime objects:
df_sel = df.loc[:,-3:]
Returns the last three entries in the second level, but only for the last entry in the first level:
df_sel = df.loc[:].iloc[-3:]
I have two methods to solve this problem:
Method 1:
As mentioned in the first comment by Quang Hoang, you can use groupby, which I believe gives the shortest code:
df.groupby(level=0).tail(3)
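groupby(level=0) groups on the outer network level, and tail(3) then keeps the last three rows of each group in index order.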
Method 2:
You can also slice each network in networks separately and then concat the pieces:
pd.concat([df.loc[[i]][-3:] for i in networks])
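Note that df.loc[[i]] (with double brackets) keeps the network level in the index, so the concatenated result retains the original MultiIndex.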
Both of these methods output the result you want.
Another method is to do some reshaping:
df.unstack(0).iloc[-3:].stack().swaplevel(0,1).sort_index()
Output:
active_clients throughput speed
network time
ALPHA 2021-01-01 08:00:00 26 4081 11325
2021-01-01 10:00:00 13 3370 10716
2021-01-01 12:00:00 13 3691 10737
BETA 2021-01-01 08:00:00 28 2105 10465
2021-01-01 10:00:00 21 2444 10158
2021-01-01 12:00:00 24 1947 11226
GAMMA 2021-01-01 08:00:00 13 1850 10288
2021-01-01 10:00:00 23 2241 11521
2021-01-01 12:00:00 30 3515 11138
Details:
unstack the outermost index level, level=0
Use iloc to select the last three records in the DataFrame
stack that level back into the index, then swaplevel and sort_index
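Decomposed step by step (the same operations as the one-liner above):
tmp = df.unstack(0)                      # move the 'network' level into the columns
tmp = tmp.iloc[-3:]                      # last three timestamps for all networks at once
out = tmp.stack()                        # bring 'network' back into the index
out = out.swaplevel(0, 1).sort_index()   # restore the (network, time) ordering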

creating pandas-vectorized 'subtraction' table

I have a Series with a DatetimeIndex and an integer value. I want to make a table that shows the change in value from each time to all the other subsequent times.
Below is a visual representation of what I want (image omitted). The gray and orange cells are irrelevant data.
I can't figure out a way to create this in a vectorized style inside pandas.
import random
import pandas as pd
# pd.DatetimeIndex no longer accepts start/periods/freq; use pd.date_range instead
z = pd.date_range(start='2018-12-01', periods=10, freq='H')
df = pd.DataFrame(random.sample(range(1, 100), 10), index=z, columns=['foo'])
I've tried things like:
df['foo'].sub(df['foo'].transpose())
But that doesn't work.
The output DataFrame could either have a MultiIndex (beforeTime, afterTime) or a single index "beforeTime" with a column for each possible "afterTime". I think they're equivalent, as I can use unstack() and related functions to get the shape I want?
I think you can use np.subtract.outer to calculate all the values and create the dataframe like:
df_output = pd.DataFrame(np.subtract.outer(df.foo, df.foo),
                         columns=df.index.time, index=df.index.time)
print (df_output.head())
00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 \
00:00:00 0 6 -7 -57 -33 3
01:00:00 -6 0 -13 -63 -39 -3
02:00:00 7 13 0 -50 -26 10
03:00:00 57 63 50 0 24 60
04:00:00 33 39 26 -24 0 36
06:00:00 07:00:00 08:00:00 09:00:00
00:00:00 -53 -28 5 17
01:00:00 -59 -34 -1 11
02:00:00 -46 -21 12 24
03:00:00 4 29 62 74
04:00:00 -20 5 38 50
You can use np.triu to zero out all the values shown in grey in your example, such as:
pd.DataFrame(np.triu(np.subtract.outer(df.foo, df.foo)), columns = ...)
Note that the .time is not necessary when creating columns= and index=; it was just to keep the pasted dataframe readable.
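Putting the pieces together, a minimal self-contained sketch using the same dummy-data recipe as the question (the random values are illustrative only):
import random
import numpy as np
import pandas as pd
z = pd.date_range(start='2018-12-01', periods=10, freq='H')
foo = pd.Series(random.sample(range(1, 100), 10), index=z)
# full pairwise difference table, with the lower triangle zeroed out
table = pd.DataFrame(np.triu(np.subtract.outer(foo.values, foo.values)),
                     index=z.time, columns=z.time)
print(table)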

Python Pandas: Aggregate data by hour and display it instead of the index

I would like to aggregate some data by hour using pandas and display the date instead of an index.
The code I have right now is the following:
import pandas as pd
import numpy as np
dates = pd.date_range('1/1/2011', periods=20, freq='25min')
data = pd.Series(np.random.randint(100, size=20), index=dates)
result = data.groupby(data.index.hour).sum().reset_index(name='Sum')
print(result)
Which displays something along the lines of:
index Sum
0 0 131
1 1 116
2 2 180
3 3 62
4 4 95
5 5 107
6 6 89
7 7 169
The problem is that instead of index I want to display the date associated with that hour.
The result I'm trying to achieve is the following:
index Sum
0 2011-01-01 01:00:00 131
1 2011-01-01 02:00:00 116
2 2011-01-01 03:00:00 180
3 2011-01-01 04:00:00 62
4 2011-01-01 05:00:00 95
5 2011-01-01 06:00:00 107
6 2011-01-01 07:00:00 89
7 2011-01-01 08:00:00 169
Is there any way I can do that easily using pandas?
data.groupby(data.index.strftime('%Y-%m-%d %H:00:00')).sum().reset_index(name='Sum')
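This groups on each timestamp truncated to the hour as a string, so the group labels carry the full date and hour rather than a bare integer.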
You could use resample.
data.resample('H').sum()
Output:
2011-01-01 00:00:00 84
2011-01-01 01:00:00 121
2011-01-01 02:00:00 160
2011-01-01 03:00:00 70
2011-01-01 04:00:00 88
2011-01-01 05:00:00 131
2011-01-01 06:00:00 56
2011-01-01 07:00:00 109
Freq: H, dtype: int32
Option #2
data.groupby(data.index.floor('H')).sum()
Output:
2011-01-01 00:00:00 84
2011-01-01 01:00:00 121
2011-01-01 02:00:00 160
2011-01-01 03:00:00 70
2011-01-01 04:00:00 88
2011-01-01 05:00:00 131
2011-01-01 06:00:00 56
2011-01-01 07:00:00 109
dtype: int32
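Either way the full hourly timestamps end up in the index; as in the question, .reset_index(name='Sum') can be chained onto either option to turn the sums into a named column.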

python pandas String to Timestamps conversion ambiguous

I'm trying to slice a DataFrame using a DateTimeIndex, but I've hit an issue.
When the sliced DataFrame crosses into a new month, the day and the month are switched.
Here is my dataframe:
Valeur
date
2015-01-08 00:00:00 93
2015-01-08 00:10:00 90
2015-01-08 00:20:00 88
2015-01-08 00:30:00 103
2015-01-08 00:40:00 86
2015-01-08 00:50:00 88
2015-01-08 01:00:00 86
2015-01-08 01:10:00 84
2015-01-08 01:20:00 95
2015-01-08 01:30:00 88
2015-01-08 01:40:00 85
2015-01-08 01:50:00 92
... ...
2016-10-30 22:20:00 98
2016-10-30 22:30:00 94
2016-10-30 22:40:00 94
2016-10-30 22:50:00 103
2016-10-30 23:00:00 92
2016-10-30 23:10:00 85
2016-10-30 23:20:00 98
2016-10-30 23:30:00 96
2016-10-30 23:40:00 95
2016-10-30 23:50:00 101
[65814 rows x 1 columns]
Here are my two Timestamps:
startingDate : 2015-10-31 23:50:00
lastDate : 2016-10-30 23:50:00
When I slice my df like this:
dfconso = dfconso[startingDate:lastDate]
I get something like this:
Valeur
date
2015-10-31 23:50:00 88
2015-01-11 00:00:00 83
2015-01-11 00:10:00 82
2015-01-11 00:20:00 87
2015-01-11 00:30:00 77
2015-01-11 00:40:00 72
2015-01-11 00:50:00 86
2015-01-11 01:00:00 77
2015-01-11 01:10:00 80
... ...
2016-10-30 23:10:00 85
2016-10-30 23:20:00 98
2016-10-30 23:30:00 96
2016-10-30 23:40:00 95
2016-10-30 23:50:00 101
The problem is that the slice starts at the right date, but when the DateTimeIndex changes month, something goes wrong: it jumps from 31 October 2015 to 11 January 2015.
And I don't understand why.
I tried printing the month and day, and I got this:
In:
print("Index 0 : month", dfconso.index[0].month, ", day", dfconso.index[0].day)
print("Index 1 : month", dfconso.index[1].month, ", day", dfconso.index[1].day)
Out:
Index 0 : month 10 , day 31
Index 1 : month 1 , day 11
If someone has an idea, I'd appreciate it.
EDIT:
After calling df.sort_index() on my df, I can see that the conversion from string dates to Timestamp dates sometimes didn't work and switched month and day.
String format:
"31/08/2015 20:00:00"
My code to transform from string to Timestamps:
dfconso.index = pd.to_datetime(dfconso.index, infer_datetime_format=True, format="%d/%m/%Y")
SOLUTION:
That was a bad use of pd.to_datetime; I changed infer_datetime_format to dayfirst=True:
dfconso.index = pd.to_datetime(dfconso.index, dayfirst=True)
That solved my problem.
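A quick illustration of the ambiguity (a minimal sketch, not taken from the original data):
import pandas as pd
print(pd.to_datetime("02/01/2015"))                 # parsed month-first: 2015-02-01
print(pd.to_datetime("02/01/2015", dayfirst=True))  # parsed day-first:   2015-01-02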
The error might not be a mix-up of day and month, but just an ordering problem. Try reordering the data before slicing it (the provided part of your data looks fine, but who knows about the rest...).
Here is how reordering works: Sort a pandas datetime index

Pandas.resample to a non-integer multiple frequency

I have to resample my dataset from a 10-minute interval to a 15-minute interval to make it in sync with another dataset. Based on my searches on Stack Overflow I have some ideas how to proceed, but none of them delivers a clean and clear solution.
Problem
Problem set up
#%% Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#%% make timestamps
periods = 12
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)
#%% Make DataFrame and fill it with some data
df = pd.DataFrame(index=timestamp10min)
y = -(np.arange(periods)-periods/2)**2
df['y'] = y
Desired output
Now I want the values that are already at the 10 minutes to be unchanged, and the values at **:15 and **:45 to be the mean of **:10, **:20 and **:40, **:50. The core of the problem is that 15 minutes is not an integer multiple of 10 minutes. Otherwise simply applying df.resample('15Min').mean() would have worked.
Possible solutions
Simply use the 15 minutes resampling and just live with the small introduced error.
Using two forms of resample, with closed='left', label='left' and closed='right', label='right', and afterwards averaging both resampled forms. This will still introduce some error, but a smaller one than the first method.
Resample everything to 5 minute data and then apply a rolling average. Something like that is applied here: Pandas: rolling mean by time interval
Resample and average with a varying number of input: Use numpy.average with weights for resampling a pandas array
Therefore I would have to create a new Series with weights of varying length, where the weights alternate between 1 and 2.
Resample everything to 5 minute data and then apply linear interpolation. This method is close to method 3. Pandas data frame: resample with linear interpolation
Edit: #Paul H gave a workable solution along these lines, which is still readable. Thanks!
None of these methods is really satisfying to me. Some introduce a small error, and others would be quite difficult for an outsider to read.
Implementation
The implementation of methods 1, 2 and 5, together with the desired output and a visualization.
#%% start plot
plt.figure()
plt.plot(df.index, df['y'], label='original')
#%% resample the data to 15 minutes and plot the result
close = 'left'; label='left'
dfresamplell = pd.DataFrame()
dfresamplell['15min'] = df.y.resample('15Min', closed=close, label=label).mean()  # how= was removed from resample
labelstring = 'close ' + close + ' label ' + label
plt.plot(dfresamplell.index, dfresamplell['15min'], label=labelstring)
close = 'right'; label='right'
dfresamplerr = pd.DataFrame()
dfresamplerr['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label
plt.plot(dfresamplerr.index, dfresamplerr['15min'], label=labelstring)
#%% make an average
dfresampleaverage = pd.DataFrame(index=dfresamplell.index)
dfresampleaverage['15min'] = (dfresamplell['15min'].values+dfresamplerr['15min'].values[:-1])/2
plt.plot(dfresampleaverage.index, dfresampleaverage['15min'], label='average of both resampling methods')
#%% desired output
ydesired = np.zeros(periods // 3 * 2)  # integer division so np.zeros gets an int
i = 0
j = 0
k = 0
for val in ydesired:
    if i + k == len(y):
        k = 0
    ydesired[j] = np.mean([y[i], y[i+k]])
    j += 1
    i += 1
    if k == 0:
        k = 1
    else:
        k = 0
        i += 1
plt.plot(dfresamplell.index, ydesired, label='ydesired')
#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5T', periods=periods*2))
dfreindex.interpolate(inplace=True)
dfreindex = dfreindex.resample('15T').first().head()
plt.plot(dfreindex.index, dfreindex['y'], label='method Paul H')
#%% finalize plot
plt.legend()
Implementation for angles
As a bonus I have added the code I will use for the interpolation of angles. This is done by using complex numbers. Because complex interpolation is not implemented (yet), I split the complex numbers into a real and an imaginary part. After averaging, these numbers can be converted back to angles. For certain angles this is a better resampling method than simply averaging the two angles, for example: 345 and 5 degrees.
#%% make timestamps
periods = 24*6
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)
#%% Make DataFrame and fill it with some data
degrees = np.cumsum(np.random.randn(periods)*25) % 360
df = pd.DataFrame(index=timestamp10min)
df['deg'] = degrees
df['zreal'] = np.cos(df['deg']*np.pi/180)
df['zimag'] = np.sin(df['deg']*np.pi/180)
#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5T', periods=periods*2))
dfreindex = dfreindex.interpolate()
dfresample = dfreindex.resample('15T').first()
#%% convert complex to degrees
def f(x):
    return np.angle(x['zreal'] + x['zimag']*1j, deg=True)
dfresample['degrees'] = dfresample[['zreal', 'zimag']].apply(f, axis=1)
#%% set all the values between 0-360 degrees
dfresample.loc[dfresample['degrees'] < 0, 'degrees'] += 360
#%% wrong resampling
dfresample['deg'] = dfresample['deg'] % 360
#%% plot different sampling methods
plt.figure()
plt.plot(df.index, df['deg'], label='normal', marker='v')
plt.plot(dfresample.index, dfresample['degrees'], label='resampled according #Paul H', marker='^')
plt.plot(dfresample.index, dfresample['deg'], label='wrong resampling', marker='<')
plt.legend()
I might be misunderstanding the problem, but does this work?
TL;DR version:
import numpy as np
import pandas
data = np.arange(0, 101, 8)
index_10T = pandas.date_range(freq='10T', start='2012-01-01 00:00', periods=data.shape[0])
index_05T = pandas.date_range(freq='05T', start=index_10T[0], end=index_10T[-1])
index_15T = pandas.date_range(freq='15T', start=index_10T[0], end=index_10T[-1])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1.reindex(index=index_05T).interpolate().loc[index_15T])
Long version
setup fake data
import numpy as np
import pandas
data = np.arange(0, 101, 8)
index_10T = pandas.date_range(freq='10T', start='2012-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1)
A
2012-01-01 00:00:00 0
2012-01-01 00:10:00 8
2012-01-01 00:20:00 16
2012-01-01 00:30:00 24
2012-01-01 00:40:00 32
2012-01-01 00:50:00 40
2012-01-01 01:00:00 48
2012-01-01 01:10:00 56
2012-01-01 01:20:00 64
2012-01-01 01:30:00 72
2012-01-01 01:40:00 80
2012-01-01 01:50:00 88
2012-01-01 02:00:00 96
So then build a new 5-minute index and reindex the original dataframe
index_05T = pandas.date_range(freq='05T', start=index_10T[0], end=index_10T[-1])
df2 = df1.reindex(index=index_05T)
print(df2)
A
2012-01-01 00:00:00 0
2012-01-01 00:05:00 NaN
2012-01-01 00:10:00 8
2012-01-01 00:15:00 NaN
2012-01-01 00:20:00 16
2012-01-01 00:25:00 NaN
2012-01-01 00:30:00 24
2012-01-01 00:35:00 NaN
2012-01-01 00:40:00 32
2012-01-01 00:45:00 NaN
2012-01-01 00:50:00 40
2012-01-01 00:55:00 NaN
2012-01-01 01:00:00 48
2012-01-01 01:05:00 NaN
2012-01-01 01:10:00 56
2012-01-01 01:15:00 NaN
2012-01-01 01:20:00 64
2012-01-01 01:25:00 NaN
2012-01-01 01:30:00 72
2012-01-01 01:35:00 NaN
2012-01-01 01:40:00 80
2012-01-01 01:45:00 NaN
2012-01-01 01:50:00 88
2012-01-01 01:55:00 NaN
2012-01-01 02:00:00 96
and then linearly interpolate
print(df2.interpolate())
A
2012-01-01 00:00:00 0
2012-01-01 00:05:00 4
2012-01-01 00:10:00 8
2012-01-01 00:15:00 12
2012-01-01 00:20:00 16
2012-01-01 00:25:00 20
2012-01-01 00:30:00 24
2012-01-01 00:35:00 28
2012-01-01 00:40:00 32
2012-01-01 00:45:00 36
2012-01-01 00:50:00 40
2012-01-01 00:55:00 44
2012-01-01 01:00:00 48
2012-01-01 01:05:00 52
2012-01-01 01:10:00 56
2012-01-01 01:15:00 60
2012-01-01 01:20:00 64
2012-01-01 01:25:00 68
2012-01-01 01:30:00 72
2012-01-01 01:35:00 76
2012-01-01 01:40:00 80
2012-01-01 01:45:00 84
2012-01-01 01:50:00 88
2012-01-01 01:55:00 92
2012-01-01 02:00:00 96
build a 15-minute index and use that to pull out data:
index_15T = pandas.date_range(freq='15T', start=index_10T[0], end=index_10T[-1])
print(df2.interpolate().loc[index_15T])
A
2012-01-01 00:00:00 0
2012-01-01 00:15:00 12
2012-01-01 00:30:00 24
2012-01-01 00:45:00 36
2012-01-01 01:00:00 48
2012-01-01 01:15:00 60
2012-01-01 01:30:00 72
2012-01-01 01:45:00 84
2012-01-01 02:00:00 96
Ok, here's one way to do it.
Make a list of the times you want to have filled in
Make a combined index that includes the times you want and the times you already have
Take your data and "forward fill it"
Take your data and "backward fill it"
Average the forward and backward fills
Select only the rows you want
Note this only works since you want the values exactly halfway between the values you already have, time-wise. Note the last time comes out np.nan because you don't have any later data.
import datetime as dt  # needed for dt.timedelta below
times_15 = []
current = df.index[0]
while current < df.index[-2]:
    current = current + dt.timedelta(minutes=15)
    times_15.append(current)
combined = sorted(set(times_15) | set(df.index))  # reindex wants an ordered, list-like index
df = df.reindex(combined)
df['ff'] = df['y'].ffill()
df['bf'] = df['y'].bfill()
df['solution'] = df[['ff', 'bf']].mean(axis=1)
df.loc[times_15, :]
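Averaging a forward fill and a backward fill matches linear interpolation only at the exact midpoints between existing samples, which is why the halfway caveat above matters.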
In case someone is working with data without any regularity at all, here is a solution adapted from the one provided by Paul H above.
If you don't want to interpolate throughout the time-series, but only in those places where resample is meaningful, you may keep the interpolated column side by side and finish with a resample and dropna.
import numpy as np
import pandas
data = np.arange(0, 101, 3)
index_setup = pandas.date_range(freq='01T', start='2022-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_setup, columns=['A'])
df1 = df1.sample(frac=0.2).sort_index()
print(df1)
A
2022-01-01 00:03:00 9
2022-01-01 00:06:00 18
2022-01-01 00:08:00 24
2022-01-01 00:18:00 54
2022-01-01 00:25:00 75
2022-01-01 00:27:00 81
2022-01-01 00:30:00 90
Notice that resampling this DF, which has no regularity at all, forces values to the floor index without interpolating.
print(df1.resample('05T').mean())
A
2022-01-01 00:00:00 9.0
2022-01-01 00:05:00 24.0
2022-01-01 00:10:00 39.0
2022-01-01 00:15:00 51.0
2022-01-01 00:20:00 NaN
2022-01-01 00:25:00 79.5
A better solution can be achieved by interpolating on a small enough interval and then resampling. The resulting DF now has too many rows, but a dropna() brings it back close to its original shape.
index_1min = pandas.date_range(freq='01T', start='2022-01-01 00:00', end='2022-01-01 23:59')
df2 = df1.reindex(index=index_1min)
df2['A_interp'] = df2['A'].interpolate(limit_direction='both')
print(df2.resample('05T').first().dropna())
A A_interp
2022-01-01 00:00:00 9.0 9.0
2022-01-01 00:05:00 21.0 15.0
2022-01-01 00:10:00 39.0 30.0
2022-01-01 00:15:00 51.0 45.0
2022-01-01 00:25:00 75.0 75.0
