I have to resample my dataset from a 10-minute interval to a 15-minute interval to bring it in sync with another dataset. Based on my searches on Stack Overflow I have some ideas how to proceed, but none of them deliver a clean and clear solution.
Problem
Problem setup
#%% Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#%% make timestamps
periods = 12
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)
#%% Make DataFrame and fill it with some data
df = pd.DataFrame(index=timestamp10min)
y = -(np.arange(periods)-periods/2)**2
df['y'] = y
Desired output
Now I want the values that are already at the 10-minute marks to stay unchanged, and the values at **:15 and **:45 to be the mean of **:10, **:20 and of **:40, **:50 respectively. The core of the problem is that 15 minutes is not an integer multiple of 10 minutes. Otherwise simply applying df.resample('15Min').mean() would have worked.
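Since **:15 and **:45 fall exactly halfway between two 10-minute samples, linear interpolation on a 5-minute grid reproduces exactly those means. A self-contained sketch of the desired output, repeating the toy data from above:

```python
import numpy as np
import pandas as pd

periods = 12
idx10 = pd.date_range('2010-01-01', freq='10min', periods=periods)
df = pd.DataFrame({'y': -(np.arange(periods) - periods / 2) ** 2}, index=idx10)

# Upsample to a 5-minute grid, interpolate linearly, then keep every 15 minutes.
# At **:15 the interpolated value is exactly the mean of **:10 and **:20.
idx5 = pd.date_range(idx10[0], idx10[-1], freq='5min')
out = df.reindex(idx5).interpolate().asfreq('15min')
```

Values at 00:00, 00:30, ... pass through unchanged; values at 00:15, 00:45, ... are the two-point means the question asks for.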
Possible solutions
Simply use 15-minute resampling and live with the small error it introduces.
Use two forms of resample, with closed='left', label='left' and closed='right', label='right'. Afterwards I could average both resampled forms. This still introduces some error, but a smaller one than the first method.
Resample everything to 5-minute data and then apply a rolling average. Something like that is applied here: Pandas: rolling mean by time interval
Resample and average with a varying number of inputs: Use numpy.average with weights for resampling a pandas array
For that I would have to create a new Series with weights of varying length, where the weights alternate between 1 and 2.
Resample everything to 5-minute data and then apply linear interpolation. This method is close to method 3. Pandas data frame: resample with linear interpolation
Edit: @Paul H gave a workable solution along these lines, which is still readable. Thanks!
None of these methods is really satisfying to me. Some introduce a small error, and others would be quite difficult for an outsider to read.
Implementation
The implementation of methods 1, 2 and 5 together with the desired output, including visualization.
#%% start plot
plt.figure()
plt.plot(df.index, df['y'], label='original')
#%% resample the data to 15 minutes and plot the result
closed = 'left'; label = 'left'
dfresamplell = pd.DataFrame()
dfresamplell['15min'] = df.y.resample('15Min', closed=closed, label=label).mean()
labelstring = 'closed ' + closed + ' label ' + label
plt.plot(dfresamplell.index, dfresamplell['15min'], label=labelstring)
closed = 'right'; label = 'right'
dfresamplerr = pd.DataFrame()
dfresamplerr['15min'] = df.y.resample('15Min', closed=closed, label=label).mean()
labelstring = 'closed ' + closed + ' label ' + label
plt.plot(dfresamplerr.index, dfresamplerr['15min'], label=labelstring)
#%% make an average
dfresampleaverage = pd.DataFrame(index=dfresamplell.index)
dfresampleaverage['15min'] = (dfresamplell['15min'].values+dfresamplerr['15min'].values[:-1])/2
plt.plot(dfresampleaverage.index, dfresampleaverage['15min'], label='average of both resampling methods')
#%% desired output
ydesired = np.zeros(periods // 3 * 2)
i = 0
j = 0
k = 0
for val in ydesired:
    if i + k == len(y):
        k = 0
    ydesired[j] = np.mean([y[i], y[i + k]])
    j += 1
    i += 1
    if k == 0:
        k = 1
    else:
        k = 0
        i += 1
plt.plot(dfresamplell.index, ydesired, label='ydesired')
#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5T', periods=periods*2))
dfreindex.interpolate(inplace=True)
dfreindex = dfreindex.resample('15T').first()
plt.plot(dfreindex.index, dfreindex['y'], label='method Paul H')
#%% finalize plot
plt.legend()
Implementation for angles
As a bonus I have added the code I will use for the interpolation of angles. This is done by using complex numbers. Because complex interpolation is not implemented (yet), I split the complex numbers into a real and an imaginary part. After averaging, these numbers can be converted back to angles. For certain angles this is a better resampling method than simply averaging the two angles, for example: 345 and 5 degrees.
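Why the complex-number detour helps can be seen in a tiny sketch (not part of the original code): averaging the unit vectors and converting back gives the circular mean, so 345° and 5° average to roughly 355° instead of the naive 175°.

```python
import numpy as np

def circular_mean_deg(angles):
    # Represent each angle as a unit vector in the complex plane,
    # average the vectors, then convert the mean vector back to 0-360 degrees.
    z = np.exp(1j * np.radians(np.asarray(angles, dtype=float)))
    return np.degrees(np.angle(z.mean())) % 360

circular_mean_deg([345, 5])  # ~355, where np.mean([345, 5]) gives the wrong 175
```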
#%% make timestamps
periods = 24*6
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)
#%% Make DataFrame and fill it with some data
degrees = np.cumsum(np.random.randn(periods)*25) % 360
df = pd.DataFrame(index=timestamp10min)
df['deg'] = degrees
df['zreal'] = np.cos(df['deg']*np.pi/180)
df['zimag'] = np.sin(df['deg']*np.pi/180)
#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5T', periods=periods*2))
dfreindex = dfreindex.interpolate()
dfresample = dfreindex.resample('15T').first()
#%% convert complex to degrees
def f(x):
    return np.angle(x[0] + x[1]*1j, deg=True)
dfresample['degrees'] = dfresample[['zreal', 'zimag']].apply(f, axis=1)
#%% set all the values between 0-360 degrees
dfresample.loc[dfresample['degrees'] < 0, 'degrees'] += 360
#%% wrong resampling
dfresample['deg'] = dfresample['deg'] % 360
#%% plot different sampling methods
plt.figure()
plt.plot(df.index, df['deg'], label='normal', marker='v')
plt.plot(dfresample.index, dfresample['degrees'], label='resampled according @Paul H', marker='^')
plt.plot(dfresample.index, dfresample['deg'], label='wrong resampling', marker='<')
plt.legend()
I might be misunderstanding the problem, but does this work?
TL;DR version:
import numpy as np
import pandas
data = np.arange(0, 101, 8)
index_10T = pandas.date_range(freq='10T', start='2012-01-01 00:00', periods=data.shape[0])
index_05T = pandas.date_range(freq='05T', start=index_10T[0], end=index_10T[-1])
index_15T = pandas.date_range(freq='15T', start=index_10T[0], end=index_10T[-1])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1.reindex(index=index_05T).interpolate().loc[index_15T])
Long version
setup fake data
import numpy as np
import pandas
data = np.arange(0, 101, 8)
index_10T = pandas.date_range(freq='10T', start='2012-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1)
A
2012-01-01 00:00:00 0
2012-01-01 00:10:00 8
2012-01-01 00:20:00 16
2012-01-01 00:30:00 24
2012-01-01 00:40:00 32
2012-01-01 00:50:00 40
2012-01-01 01:00:00 48
2012-01-01 01:10:00 56
2012-01-01 01:20:00 64
2012-01-01 01:30:00 72
2012-01-01 01:40:00 80
2012-01-01 01:50:00 88
2012-01-01 02:00:00 96
So then build a new 5-minute index and reindex the original dataframe
index_05T = pandas.date_range(freq='05T', start=index_10T[0], end=index_10T[-1])
df2 = df1.reindex(index=index_05T)
print(df2)
A
2012-01-01 00:00:00 0
2012-01-01 00:05:00 NaN
2012-01-01 00:10:00 8
2012-01-01 00:15:00 NaN
2012-01-01 00:20:00 16
2012-01-01 00:25:00 NaN
2012-01-01 00:30:00 24
2012-01-01 00:35:00 NaN
2012-01-01 00:40:00 32
2012-01-01 00:45:00 NaN
2012-01-01 00:50:00 40
2012-01-01 00:55:00 NaN
2012-01-01 01:00:00 48
2012-01-01 01:05:00 NaN
2012-01-01 01:10:00 56
2012-01-01 01:15:00 NaN
2012-01-01 01:20:00 64
2012-01-01 01:25:00 NaN
2012-01-01 01:30:00 72
2012-01-01 01:35:00 NaN
2012-01-01 01:40:00 80
2012-01-01 01:45:00 NaN
2012-01-01 01:50:00 88
2012-01-01 01:55:00 NaN
2012-01-01 02:00:00 96
and then linearly interpolate
print(df2.interpolate())
A
2012-01-01 00:00:00 0
2012-01-01 00:05:00 4
2012-01-01 00:10:00 8
2012-01-01 00:15:00 12
2012-01-01 00:20:00 16
2012-01-01 00:25:00 20
2012-01-01 00:30:00 24
2012-01-01 00:35:00 28
2012-01-01 00:40:00 32
2012-01-01 00:45:00 36
2012-01-01 00:50:00 40
2012-01-01 00:55:00 44
2012-01-01 01:00:00 48
2012-01-01 01:05:00 52
2012-01-01 01:10:00 56
2012-01-01 01:15:00 60
2012-01-01 01:20:00 64
2012-01-01 01:25:00 68
2012-01-01 01:30:00 72
2012-01-01 01:35:00 76
2012-01-01 01:40:00 80
2012-01-01 01:45:00 84
2012-01-01 01:50:00 88
2012-01-01 01:55:00 92
2012-01-01 02:00:00 96
build a 15-minute index and use that to pull out data:
index_15T = pandas.date_range(freq='15T', start=index_10T[0], end=index_10T[-1])
print(df2.interpolate().loc[index_15T])
A
2012-01-01 00:00:00 0
2012-01-01 00:15:00 12
2012-01-01 00:30:00 24
2012-01-01 00:45:00 36
2012-01-01 01:00:00 48
2012-01-01 01:15:00 60
2012-01-01 01:30:00 72
2012-01-01 01:45:00 84
2012-01-01 02:00:00 96
Ok, here's one way to do it.
Make a list of the times you want to have filled in
Make a combined index that includes the times you want and the times you already have
Take your data and "forward fill it"
Take your data and "backward fill it"
Average the forward and backward fills
Select only the rows you want
Note this only works because you want values that are, time-wise, exactly halfway between the values you already have. Also note the last time comes out np.nan because you don't have any later data.
import datetime as dt

times_15 = []
current = df.index[0]
while current < df.index[-2]:
    current = current + dt.timedelta(minutes=15)
    times_15.append(current)
combined = set(times_15) | set(df.index)
df = df.reindex(combined).sort_index(axis=0)
df['ff'] = df['y'].ffill()
df['bf'] = df['y'].bfill()
df['solution'] = df[['ff', 'bf']].mean(axis=1)
df.loc[times_15, :]
In case someone is working with data without any regularity at all, here is a solution adapted from the one provided by Paul H above.
If you don't want to interpolate throughout the time series, but only where resampling is meaningful, you can keep the interpolated column side by side and finish with a resample and dropna.
import numpy as np
import pandas
data = np.arange(0, 101, 3)
index_setup = pandas.date_range(freq='01T', start='2022-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_setup, columns=['A'])
df1 = df1.sample(frac=0.2).sort_index()
print(df1)
A
2022-01-01 00:03:00 9
2022-01-01 00:06:00 18
2022-01-01 00:08:00 24
2022-01-01 00:18:00 54
2022-01-01 00:25:00 75
2022-01-01 00:27:00 81
2022-01-01 00:30:00 90
Notice that resampling this DataFrame without any regularity forces values onto the floor of each index bin, without interpolating.
print(df1.resample('05T').mean())
A
2022-01-01 00:00:00 9.0
2022-01-01 00:05:00 24.0
2022-01-01 00:10:00 39.0
2022-01-01 00:15:00 51.0
2022-01-01 00:20:00 NaN
2022-01-01 00:25:00 79.5
A better solution can be achieved by interpolating on a small enough interval and then resampling. The resulting DataFrame now has too many rows, but a dropna() brings it close to its original shape.
index_1min = pandas.date_range(freq='01T', start='2022-01-01 00:00', end='2022-01-01 23:59')
df2 = df1.reindex(index=index_1min)
df2['A_interp'] = df2['A'].interpolate(limit_direction='both')
print(df2.resample('05T').first().dropna())
A A_interp
2022-01-01 00:00:00 9.0 9.0
2022-01-01 00:05:00 21.0 15.0
2022-01-01 00:10:00 39.0 30.0
2022-01-01 00:15:00 51.0 45.0
2022-01-01 00:25:00 75.0 75.0
Related
I have the following pandas df (datetime is of type datetime64):
device datetime
0 846ee 2020-03-22 14:27:29
1 0a26e 2020-03-22 15:33:31
2 8a906 2020-03-27 16:19:06
3 6bf11 2020-03-27 16:05:20
4 d3923 2020-03-23 18:58:51
I wanted to use the KDE function of Seaborn's distplot. Even though I don't exactly understand why, I got this to work:
df['hour'] = df['datetime'].dt.floor('T').dt.time
df['hour'] = pd.to_timedelta(df['hour'].astype(str)) / pd.Timedelta(hours=1)
and then
sns.distplot(df['hour'], hist=False, bins=arr, label='tef')
The question is: how do I do the same, but only counting unique devices? I have tried
df.groupby(['hour']).nunique().reset_index()
df.groupby(['hour'])[['device']].size().reset_index()
But they give me different results (same order of magnitude, but a few more or less). I think I don't understand what I'm doing in pd.to_timedelta(df['hour'].astype(str)) / pd.Timedelta(hours=1), and that is preventing me from reasoning about the uniques... maybe.
pd.to_timedelta(df['time'].astype(str)) creates output like 0 days 01:00:00
pd.to_timedelta(df['time'].astype(str)) / pd.Timedelta(hours=1) creates output like 1.00, which is a float number of hours.
See Calculate Pandas DataFrame Time Difference Between Two Columns in Hours and Minutes for a more detailed discussion of timedeltas.
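The two conversions can be checked in isolation; a small sketch with hypothetical time strings:

```python
import pandas as pd

t = pd.Series(['01:00:00', '00:15:00'])

# Step 1: strings become a timedelta64 column, e.g. "0 days 01:00:00".
td = pd.to_timedelta(t)

# Step 2: dividing by a one-hour Timedelta yields float hours: 1.0 and 0.25.
hours = td / pd.Timedelta(hours=1)
```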
import pandas as pd
import numpy as np  # for test data
import random  # for test data
from datetime import datetime  # for the test date range start
# test data
np.random.seed(365)
random.seed(365)
rows = 40
data = {'device': [random.choice(['846ee', '0a26e', '8a906', '6bf11', 'd3923']) for _ in range(rows)],
'datetime': pd.bdate_range(datetime(2020, 7, 1), freq='15min', periods=rows).tolist()}
# create test dataframe
df = pd.DataFrame(data)
# this date column is already in a datetime format; for the real dataframe, make sure it's converted
# df.datetime = pd.to_datetime(df.datetime)
# this extracts the time component from the datetime and is a datetime.time object
df['time'] = df['datetime'].dt.floor('T').dt.time
# this creates a timedelta column; note its format
df['timedelta'] = pd.to_timedelta(df['time'].astype(str))
# this creates a float representing the hour and its fractional component (minutes)
df['hours'] = pd.to_timedelta(df['time'].astype(str)) / pd.Timedelta(hours=1)
# extracts just the hour
df['hour'] = df['datetime'].dt.hour
display(df.head(15))
This view should elucidate the difference between the time extraction methods.
device datetime time timedelta hours hour
0 8a906 2020-07-01 00:00:00 00:00:00 0 days 00:00:00 0.00 0
1 0a26e 2020-07-01 00:15:00 00:15:00 0 days 00:15:00 0.25 0
2 8a906 2020-07-01 00:30:00 00:30:00 0 days 00:30:00 0.50 0
3 d3923 2020-07-01 00:45:00 00:45:00 0 days 00:45:00 0.75 0
4 0a26e 2020-07-01 01:00:00 01:00:00 0 days 01:00:00 1.00 1
5 d3923 2020-07-01 01:15:00 01:15:00 0 days 01:15:00 1.25 1
6 6bf11 2020-07-01 01:30:00 01:30:00 0 days 01:30:00 1.50 1
7 d3923 2020-07-01 01:45:00 01:45:00 0 days 01:45:00 1.75 1
8 6bf11 2020-07-01 02:00:00 02:00:00 0 days 02:00:00 2.00 2
9 d3923 2020-07-01 02:15:00 02:15:00 0 days 02:15:00 2.25 2
10 0a26e 2020-07-01 02:30:00 02:30:00 0 days 02:30:00 2.50 2
11 846ee 2020-07-01 02:45:00 02:45:00 0 days 02:45:00 2.75 2
12 0a26e 2020-07-01 03:00:00 03:00:00 0 days 03:00:00 3.00 3
13 846ee 2020-07-01 03:15:00 03:15:00 0 days 03:15:00 3.25 3
14 846ee 2020-07-01 03:30:00 03:30:00 0 days 03:30:00 3.50 3
Plot device counts for each hour with seaborn.countplot
plt.figure(figsize=(8, 6))
sns.countplot(x='hour', hue='device', data=df)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
Plot a seaborn.distplot for each device
Use seaborn.FacetGrid
This will give the hourly distribution for each device.
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.FacetGrid(df, row='device', height=5)
g.map(sns.distplot, 'hours', bins=24, kde=True)
g.set(xlim=(0, 24), xticks=range(0, 25, 1))
You can try
df['hour'] = df['datetime'].dt.strftime('%Y-%m-%d %H')
s = df.groupby('hour')['device'].value_counts()
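On the unique-devices part of the question: the difference between the two groupbys tried is that size counts rows while nunique counts distinct devices. A small sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    'device': ['a', 'a', 'b', 'b', 'b', 'c'],
    'hour':   [0,   0,   0,   1,   1,   1],
})

rows_per_hour = df.groupby('hour')['device'].size()        # counts every row
devices_per_hour = df.groupby('hour')['device'].nunique()  # counts distinct devices
```

Hour 0 has three rows but only two unique devices, which is exactly the "a few more or less" discrepancy described above.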
I have a bunch of timestamp data in a csv file like this:
2012-01-01 00:00:00, data
2012-01-01 00:01:00, data
2012-01-01 00:02:00, data
...
2012-01-01 00:59:00, data
2012-01-01 01:00:00, data
2012-01-01 01:01:00, data
I want to delete data every minute and only display every hour in python like the following:
2012-01-01 00:00:00, data
2012-01-01 01:00:00, data
2012-01-01 02:00:00, data
Could any one help me? Thank you.
I believe you need to use pandas resample; here is an example of how it is used to achieve the output you desire. However, keep in mind that since this is a resampling operation during frequency conversion, you must pass a function that defines how the other columns should behave (summing all values in the new timeframe, calculating an average, calculating the difference, etc.), otherwise you will just get a DatetimeIndexResampler back. Here is an example:
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='40T')
series = pd.Series(range(9),index=index)
print(series)
Output:
2000-01-01 00:00:00 0
2000-01-01 00:40:00 1
2000-01-01 01:20:00 2
2000-01-01 02:00:00 3
2000-01-01 02:40:00 4
2000-01-01 03:20:00 5
2000-01-01 04:00:00 6
2000-01-01 04:40:00 7
2000-01-01 05:20:00 8
Applying resample hourly without passing the aggregation function:
print(series.resample('H'))
Output:
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
After passing .sum():
print(series.resample('H').sum())
Output:
2000-01-01 00:00:00 1
2000-01-01 01:00:00 2
2000-01-01 02:00:00 7
2000-01-01 03:00:00 5
2000-01-01 04:00:00 13
2000-01-01 05:00:00 8
Freq: H, dtype: int64
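Since the question asks to keep only the on-the-hour rows rather than aggregate them, note that asfreq (or equivalently .resample('H').first()) selects values instead of summing. A minimal sketch with synthetic minutely data:

```python
import pandas as pd

# One value per minute over two hours.
index = pd.date_range('2012-01-01', periods=121, freq='T')
series = pd.Series(range(121), index=index)

# asfreq keeps only the rows stamped exactly at :00, discarding the rest.
hourly = series.asfreq('H')
```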
I have a time series and I want to group the rows by hour of day (regardless of date) and visualize these as boxplots. So I'd want 24 boxplots starting from hour 1, then hour 2, then hour 3 and so on.
The way I see this working is splitting the dataset up into 24 series (1 for each hour of the day), creating a boxplot for each series and then plotting this on the same axes.
The only way I can think of to do this is to manually select all the values between each hour, is there a faster way?
some sample data:
Date Actual Consumption
2018-01-01 00:00:00 47.05
2018-01-01 00:15:00 46
2018-01-01 00:30:00 44
2018-01-01 00:45:00 45
2018-01-01 01:00:00 43.5
2018-01-01 01:15:00 43.5
2018-01-01 01:30:00 43
2018-01-01 01:45:00 42.5
2018-01-01 02:00:00 43
2018-01-01 02:15:00 42.5
2018-01-01 02:30:00 41
2018-01-01 02:45:00 42.5
2018-01-01 03:00:00 42.04
2018-01-01 03:15:00 41.96
2018-01-01 03:30:00 44
2018-01-01 03:45:00 44
2018-01-01 04:00:00 43.54
2018-01-01 04:15:00 43.46
2018-01-01 04:30:00 43.5
2018-01-01 04:45:00 43
2018-01-01 05:00:00 42.04
This is what I've tried so far:
zero = df.between_time('00:00', '00:59')
one = df.between_time('01:00', '01:59')
two = df.between_time('02:00', '02:59')
and then I would plot a boxplot for each of these on the same axes. However it's very tedious to do this for all 24 hours in a day.
This is the kind of output I want:
https://www.researchgate.net/figure/Boxplot-of-the-NOx-data-by-hour-of-the-day_fig1_24054015
There are two steps to achieve this:
Convert Actual to datetime:
df.Actual = pd.to_datetime(df.Actual)
Group by the hour:
df.groupby([df.Date, df.Actual.dt.hour+1]).Consumption.sum().reset_index()
I assumed you wanted to sum the Consumption (if you want the mean or something else, just change it). One note: hour+1 makes the hours start from 1 rather than 0 (remove it if you want 0 to be midnight).
desired result:
Date Actual Consumption
0 2018-01-01 1 182.05
1 2018-01-01 2 172.50
2 2018-01-01 3 169.00
3 2018-01-01 4 172.00
4 2018-01-01 5 173.50
5 2018-01-01 6 42.04
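For the boxplot-per-hour output the linked figure actually shows, pandas can group directly on index.hour without splitting the frame into 24 series. A hedged sketch with synthetic data (the column name and date range are assumptions, not the asker's real data):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove for on-screen plots
import matplotlib.pyplot as plt

# Synthetic 15-minute consumption data spanning one week.
rng = pd.date_range('2018-01-01', periods=96 * 7, freq='15min')
df = pd.DataFrame({'Consumption': 40 + 10 * np.random.rand(len(rng))}, index=rng)

# One box per hour of day, regardless of date.
ax = df.assign(hour=df.index.hour).boxplot(column='Consumption', by='hour')
plt.suptitle('')  # drop the automatic "Boxplot grouped by ..." super-title
```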
I have a dataframe with a column "time" of float numbers, representing days from 0 to 8, and one more column with other data. The time step is not uniform.
import random
import numpy as np
import pandas as pd

time_clean = np.arange(0, 8, 0.1)
noise = [random.random()/10 for n in range(len(time_clean))]
time = time_clean + noise
data = [random.random()*100 for n in range(len(time_clean))]
df = pd.DataFrame({"time": time, "data": data})
df.head()
data time
0 89.965240 0.041341
1 95.964621 0.109215
2 70.552763 0.232596
3 74.457244 0.330750
4 13.228426 0.471623
I want to resample and interpolate the data to every 15 minutes, (15/(60*24) days).
I think the most efficient way to do this would be using the resample method of pandas dataframes, but in order to do that I need to convert the time column into a timestamp, and make it the index.
What is the most efficient way of doing this? Is it possible to transform an int to datetime?
I think you first need to convert the time column with to_timedelta and then sort_values followed by resample:
Also, I think it is best to add one new row with 0 so resample always starts from 0 (if 0 is not in the time column, it starts from the minimal time value).
df.loc[-1] = 0
df.time = pd.to_timedelta(df.time, unit='d')
df = df.sort_values('time').set_index('time').resample('15T').ffill()
print (df.head(20))
data
time
00:00:00 0.000000
00:15:00 0.000000
00:30:00 0.000000
00:45:00 0.000000
01:00:00 0.000000
01:15:00 0.000000
01:30:00 50.869889
01:45:00 50.869889
02:00:00 50.869889
02:15:00 50.869889
02:30:00 50.869889
02:45:00 50.869889
03:00:00 50.869889
03:15:00 8.846017
03:30:00 8.846017
03:45:00 8.846017
04:00:00 8.846017
04:15:00 8.846017
04:30:00 8.846017
04:45:00 8.846017
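The float-days column can be mapped onto a TimedeltaIndex directly, after which resample works as usual; a minimal sketch with toy values (not the question's random data), using mean-then-interpolate instead of ffill:

```python
import pandas as pd

df = pd.DataFrame({'time': [0.04, 0.11, 0.23, 0.33],
                   'data': [89.9, 96.0, 70.6, 74.5]})

# 1.0 in the 'time' column means one day, so unit='D' converts float days.
df.index = pd.to_timedelta(df['time'], unit='D')

# Bin into 15-minute intervals; empty bins are filled by linear interpolation.
resampled = df['data'].resample('15T').mean().interpolate()
```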
I have used openpyxl to read data from an Excel spreadsheet into a pandas data frame, called 'tides'. The dataset contains over 32,000 rows of data (of tide times in the UK measured every 15 minutes). One of the columns contains date and time information (variable called 'datetime') and another contains the height of the tide (called 'tide'):
I want to plot datetime along the x-axis and tide on the y axis using:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import openpyxl
import datetime as dt
from matplotlib.dates import date2num
<-- Data imported from Excel spreadsheet into DataFrame using openpyxl. -->
<-- Code omitted for ease of reading. -->
# Convert datatime variable to datetime64 format:
tides['datetime'] = pd.to_datetime(tides['datetime'])
# Plot figure of 'datetime' vs 'tide':
fig = plt.figure()
ax_tides = fig.add_subplot(1,1,1)
ax_tides.plot_date(date2num(tides['datetime']),tides['tide'],'-',xdate=True,label='Tides 2011',linewidth=0.5)
min_datetime = dt.datetime.strptime('01/01/2011 00:00:00',"%d/%m/%Y %H:%M:%S")
max_datetime = dt.datetime.strptime('03/01/2011 23:59:45',"%d/%m/%Y %H:%M:%S")
ax_tides.set_xlim( [min_datetime, max_datetime] )
plt.show()
The plot shows just the first few days of data. However, at the change from one day to the next, something strange happens; after the last point of day 1, the line disappears off to the right and then returns to plot the first point of the second day - but the data is plotted incorrectly on the y axis. This happens throughout the dataset. A printout shows that the data seems to be OK.
number datetime tide
0 1 2011-01-01 00:00:00 4.296
1 2 2011-01-01 00:15:00 4.024
2 3 2011-01-01 00:30:00 3.768
3 4 2011-01-01 00:45:00 3.521
4 5 2011-01-01 01:00:00 3.292
5 6 2011-01-01 01:15:00 3.081
6 7 2011-01-01 01:30:00 2.887
7 8 2011-01-01 01:45:00 2.718
8 9 2011-01-01 02:00:00 2.577
9 10 2011-01-01 02:15:00 2.470
10 11 2011-01-01 02:30:00 2.403
11 12 2011-01-01 02:45:00 2.389
12 13 2011-01-01 03:00:00 2.417
13 14 2011-01-01 03:15:00 2.492
14 15 2011-01-01 03:30:00 2.611
15 16 2011-01-01 03:45:00 2.785
16 17 2011-01-01 04:00:00 3.020
17 18 2011-01-01 04:15:00 3.314
18 19 2011-01-01 04:30:00 3.665
19 20 2011-01-01 04:45:00 4.059
20 21 2011-01-01 05:00:00 4.483
[21 rows x 3 columns]
number datetime tide
90 91 2011-01-01 22:30:00 7.329
91 92 2011-01-01 22:45:00 7.014
92 93 2011-01-01 23:00:00 6.690
93 94 2011-01-01 23:15:00 6.352
94 95 2011-01-01 23:30:00 6.016
95 96 2011-01-01 23:45:00 5.690
96 97 2011-02-01 00:00:00 5.366
97 98 2011-02-01 00:15:00 5.043
98 99 2011-02-01 00:30:00 4.729
99 100 2011-02-01 00:45:00 4.426
100 101 2011-02-01 01:00:00 4.123
101 102 2011-02-01 01:15:00 3.832
102 103 2011-02-01 01:30:00 3.562
103 104 2011-02-01 01:45:00 3.303
104 105 2011-02-01 02:00:00 3.055
105 106 2011-02-01 02:15:00 2.827
106 107 2011-02-01 02:30:00 2.620
107 108 2011-02-01 02:45:00 2.434
108 109 2011-02-01 03:00:00 2.268
109 110 2011-02-01 03:15:00 2.141
110 111 2011-02-01 03:30:00 2.060
[21 rows x 3 columns]
number datetime tide
35020 35021 2011-12-31 19:00:00 5.123
35021 35022 2011-12-31 19:15:00 4.838
35022 35023 2011-12-31 19:30:00 4.551
35023 35024 2011-12-31 19:45:00 4.279
35024 35025 2011-12-31 20:00:00 4.033
35025 35026 2011-12-31 20:15:00 3.803
35026 35027 2011-12-31 20:30:00 3.617
35027 35028 2011-12-31 20:45:00 3.438
35028 35029 2011-12-31 21:00:00 3.278
35029 35030 2011-12-31 21:15:00 3.141
35030 35031 2011-12-31 21:30:00 3.019
35031 35032 2011-12-31 21:45:00 2.942
35032 35033 2011-12-31 22:00:00 2.909
35033 35034 2011-12-31 22:15:00 2.918
35034 35035 2011-12-31 22:30:00 2.923
35035 35036 2011-12-31 22:45:00 2.985
35036 35037 2011-12-31 23:00:00 3.075
35037 35038 2011-12-31 23:15:00 3.242
35038 35039 2011-12-31 23:30:00 3.442
35039 35040 2011-12-31 23:45:00 3.671
I am at a loss to explain this. Can anyone explain what is happening, why it is happening and how can I correct it?
Thanks in advance.
Phil
Doh! Finally found the answer. The original workflow was quite complicated. I stored the data in an Excel spreadsheet and used openpyxl to read data from a named cell range. This was then converted to a pandas DataFrame. The date-and-time variable was converted to datetime format using pandas' .to_datetime() function. And finally the data were plotted using matplotlib. As I was preparing the data to post to this forum (as suggested by rauparaha) and paring down the script to its bare essentials, I noticed that Day 1 data were plotted on 01 Jan 2011 but Day 2 data were plotted on 01 Feb 2011. If you look at the output in the original post, the dates are in mixed formats: the last date given is '2011-12-31' (i.e. year-month-day) but the 2nd date, representing 2nd Jan 2011, is '2011-02-01' (i.e. year-day-month).
So, it looks like I misunderstood how the pandas .to_datetime() function interprets datetime information. I had purposely not set the infer_datetime_format attribute (default=False) and had assumed any problems would be flagged up. But it seems pandas assumes dates are in a month-first format. Unless they're not, in which case it switches to a day-first format. I should have picked that up!
I have corrected the problem by providing a string that explicitly defines the datetime format. All is fine again.
Thanks again for your suggestions. And apologies for any confusion.
Cheers.
I have been unable to replicate your error but perhaps my working dummy code can help diagnose the problem. I generated dummy data and plotted it with this code:
import pandas as pd
import numpy as np
ydata = np.sin(np.linspace(0, 10, num=200))
time_index = pd.date_range(start=pd.Timestamp(2000, 1, 1, 0, 0), periods=200, freq='15min')
df = pd.DataFrame({'tides': ydata, 'datetime': time_index})
df.plot(x='datetime', y='tides')
My data looks like this
datetime tides
0 2000-01-01 00:00:00 0.000000
1 2000-01-01 00:15:00 0.050230
2 2000-01-01 00:30:00 0.100333
3 2000-01-01 00:45:00 0.150183
4 2000-01-01 01:00:00 0.199654
[200 rows]
and generates the following plot