I have a bunch of timestamp data in a csv file like this:
2012-01-01 00:00:00, data
2012-01-01 00:01:00, data
2012-01-01 00:02:00, data
...
2012-01-01 00:59:00, data
2012-01-01 01:00:00, data
2012-01-01 01:01:00, data
In Python, I want to drop the per-minute rows and keep only one row per hour, like the following:
2012-01-01 00:00:00, data
2012-01-01 01:00:00, data
2012-01-01 02:00:00, data
Could anyone help me? Thank you.
I believe you need pandas resample; here is an example of how it can produce the output you want. Keep in mind that resampling is a frequency conversion, so you must also say how the values that fall into each new time frame are combined (summing them, averaging them, taking the difference, etc.). Without an aggregation you only get a DatetimeIndexResampler object back. Here is an example:
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='40T')
series = pd.Series(range(9),index=index)
print(series)
Output:
2000-01-01 00:00:00 0
2000-01-01 00:40:00 1
2000-01-01 01:20:00 2
2000-01-01 02:00:00 3
2000-01-01 02:40:00 4
2000-01-01 03:20:00 5
2000-01-01 04:00:00 6
2000-01-01 04:40:00 7
2000-01-01 05:20:00 8
Applying resample hourly without passing the aggregation function:
print(series.resample('H'))
Output:
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
After passing .sum():
print(series.resample('H').sum())
Output:
2000-01-01 00:00:00 1
2000-01-01 01:00:00 2
2000-01-01 02:00:00 7
2000-01-01 03:00:00 5
2000-01-01 04:00:00 13
2000-01-01 05:00:00 8
Freq: H, dtype: int64
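If the goal is only to keep the rows that already sit exactly on the hour, no aggregation is needed at all. The following is a minimal sketch under that assumption; the frame df and its "data" column are made up for illustration and stand in for the parsed csv file.

```python
import pandas as pd

# Hypothetical stand-in for the csv: one row per minute for two hours.
index = pd.date_range("2012-01-01 00:00:00", periods=121, freq="min")
df = pd.DataFrame({"data": range(121)}, index=index)

# Option 1: boolean mask, keeps only rows whose time stamp is on the hour.
on_the_hour = df[(df.index.minute == 0) & (df.index.second == 0)]

# Option 2: resample to hourly bins and take the first row of each bin.
hourly_first = df.resample("h").first()

print(on_the_hour)
```

Both options give the same three rows here. The mask never changes a value, while first() would fall back to a later row of the bin if the on-the-hour row were missing.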
Based on the pandas documentation from here: Docs
And the examples:
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
Freq: T, dtype: int64
After resampling:
>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00 0
2000-01-01 00:03:00 6
2000-01-01 00:06:00 15
2000-01-01 00:09:00 15
As I understand it, the bins should look like this after resampling:
=========bin 01=========
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
=========bin 02=========
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
=========bin 03=========
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
Am I right so far?
So after .sum() I expected this:
2000-01-01 00:02:00 3
2000-01-01 00:05:00 12
2000-01-01 00:08:00 21
I just do not understand how these two rows come out:
2000-01-01 00:00:00 0
(because label='right', 2000-01-01 00:00:00 cannot be the right edge of any bin in this case).
2000-01-01 00:09:00 15
(the label 2000-01-01 00:09:00 does not even exist in the original Series).
Short answer: If you use closed='left' and loffset='2T' then you'll get what you expected:
series.resample('3T', label='left', closed='left', loffset='2T').sum()
2000-01-01 00:02:00 3
2000-01-01 00:05:00 12
2000-01-01 00:08:00 21
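The loffset argument was deprecated in pandas 1.1 and removed in pandas 2.0, so the call above fails on recent versions. A rough equivalent, sketched here on the assumption that only the labels need to move, is to resample first and then add the offset to the resulting index:

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Left-closed bins [00:00, 00:03), [00:03, 00:06), [00:06, 00:09), labelled by their left edge.
result = series.resample("3min", label="left", closed="left").sum()

# Shift only the labels by two minutes; the bin contents are unchanged.
result.index = result.index + pd.Timedelta(minutes=2)

print(result)
```

The sums are computed before the shift, so the values are the same as with loffset='2T'; only the displayed time stamps change.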
Long answer: (or why the results you got were correct, given the arguments you used) This may not be clear from the documentation, but open and closed in this setting are about strict vs. non-strict inequality (e.g. < vs <=).
An example should make this clear. Using an interior interval from your example, this is the difference from changing the value of closed:
closed='right' => ( 3:00, 6:00 ] or 3:00 < x <= 6:00
closed='left' => [ 3:00, 6:00 ) or 3:00 <= x < 6:00
You can find an explanation of the interval notation (parentheses vs brackets) in many places like here, for example:
https://en.wikipedia.org/wiki/Interval_(mathematics)
The label parameter merely controls whether the left (3:00) or right (6:00) side is displayed, but doesn't impact the results themselves.
Also note that you can shift the displayed labels with the loffset parameter (which should be entered as a time delta). It moves only the labels; the bin edges themselves stay where they are.
Back to the example, where we change only the labeling from 'right' to 'left':
series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00 0
2000-01-01 00:03:00 6
2000-01-01 00:06:00 15
2000-01-01 00:09:00 15
series.resample('3T', label='left', closed='right').sum()
1999-12-31 23:57:00 0
2000-01-01 00:00:00 6
2000-01-01 00:03:00 15
2000-01-01 00:06:00 15
As you can see, the results are the same, only the index label changes. Pandas only lets you display the right or left label, but if it showed both, then it would look like this (below I'm using standard interval notation where ( on the left side means open and ] on the right side means closed):
( 1999-12-31 23:57:00, 2000-01-01 00:00:00 ] 0 # = 0
( 2000-01-01 00:00:00, 2000-01-01 00:03:00 ] 6 # = 1+2+3
( 2000-01-01 00:03:00, 2000-01-01 00:06:00 ] 15 # = 4+5+6
( 2000-01-01 00:06:00, 2000-01-01 00:09:00 ] 15 # = 7+8
Note that the first bin (23:57:00,00:00:00] is not empty; it contains a single row, and the value in that row is zero. If you change 'sum' to 'count' this becomes more obvious:
series.resample('3T', label='left', closed='right').count()
1999-12-31 23:57:00 1
2000-01-01 00:00:00 3
2000-01-01 00:03:00 3
2000-01-01 00:06:00 2
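To see both edges of each bin at once, the left edge can be rebuilt from the right label by subtracting the bin width. This is only an illustrative sketch; the column names left_edge_open and right_edge_closed are invented here and are not part of pandas.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Right-closed bins, labelled by their right (closed) edge.
sums = series.resample("3min", closed="right", label="right").sum()

# Rebuild the open left edge of every bin from its right label.
width = pd.Timedelta(minutes=3)
bins = pd.DataFrame({
    "left_edge_open": sums.index - width,
    "right_edge_closed": sums.index,
    "sum": sums.to_numpy(),
})

print(bins)
```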
Per JohnE's answer I put together a little helpful infographic which should settle this issue once and for all:
It is important to understand that resampling first produces a raster, which is a sequence of instants (not periods, intervals, or durations), and that this is done independently of the 'label' and 'closed' parameters. Only the 'freq' parameter determines the spacing of the raster. In your case, the system will produce the following raster:
2000-01-01 00:00:00
2000-01-01 00:03:00
2000-01-01 00:06:00
2000-01-01 00:09:00
Note again that at this moment there is no interpretation in terms of intervals or periods. The 'loffset' parameter does not move the raster; it only shifts the labels shown in the result.
Then the system will use the 'closed' parameter in order to choose between two options:
(start, end]
[start, end)
Here start and end are two adjacent time stamps in the raster. The 'label' parameter is used to choose whether start or end are used as a representative of the interval.
In your example, if you choose closed='right' then you will get the following intervals:
( previous_interval , 2000-01-01 00:00:00] - {0}
(2000-01-01 00:00:00, 2000-01-01 00:03:00] - {1,2,3}
(2000-01-01 00:03:00, 2000-01-01 00:06:00] - {4,5,6}
(2000-01-01 00:06:00, 2000-01-01 00:09:00] - {7,8}
Note that after you aggregate the values over these intervals, the result is displayed in two versions depending on the 'label' parameter, that is, whether one and the same interval is represented by its left or right time stamp.
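The two interval conventions can be checked by hand with plain boolean masks over adjacent raster points. The helper manual_sums below is a hypothetical name used only for this sketch, and the two rasters are written out explicitly rather than taken from pandas internals.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

def manual_sums(series, raster, closed):
    """Sum the values between each pair of adjacent raster points."""
    sums = []
    for start, end in zip(raster[:-1], raster[1:]):
        if closed == "right":   # (start, end]
            mask = (series.index > start) & (series.index <= end)
        else:                   # [start, end)
            mask = (series.index >= start) & (series.index < end)
        sums.append(int(series[mask].sum()))
    return sums

# With closed='right' one extra raster point is needed before the first value.
raster_right = pd.date_range("1999-12-31 23:57", "2000-01-01 00:09", freq="3min")
raster_left = pd.date_range("2000-01-01 00:00", "2000-01-01 00:09", freq="3min")

right_sums = manual_sums(series, raster_right, "right")
left_sums = manual_sums(series, raster_left, "left")
print(right_sums, left_sums)
```

The hand-computed sums agree with series.resample('3min', closed=...).sum() for both settings, which is a quick way to convince yourself which values end up in which interval.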
I now realize how it works, but it is still strange that an additional timestamp is added on the right side, which is counter-intuitive in a way. I guess this is similar to how range or iloc treat their end points.
I have a dataframe with a column "time" of float numbers, representing days from 0 to 8, and one more column with other data. The time step is not regular.
import random
import numpy as np
import pandas as pd
time_clean = np.arange(0, 8, 0.1)
noise = [random.random()/10 for n in range(len(time_clean))]
time = time_clean + noise
data = [random.random()*100 for n in range(len(time_clean))]
df = pd.DataFrame({"time": time, "data":data})
df.head()
data time
0 89.965240 0.041341
1 95.964621 0.109215
2 70.552763 0.232596
3 74.457244 0.330750
4 13.228426 0.471623
I want to resample and interpolate the data to 15-minute steps (15/(60*24) days).
I think the most efficient way to do this would be the resample method of pandas dataframes, but in order to use it I need to convert the time column into a timestamp and make it the index.
What is the most efficient way of doing this? Is it possible to transform a float to datetime?
I think you first need to convert the time column with to_timedelta, then sort_values and resample.
I also think it is best to add one new row with 0 so that the resample always starts from 0 (if 0 is not in the time column, it starts from the minimal time value):
df.loc[-1] = 0
df.time = pd.to_timedelta(df.time, unit='d')
df = df.sort_values('time').set_index('time').resample('15T').ffill()
print (df.head(20))
data
time
00:00:00 0.000000
00:15:00 0.000000
00:30:00 0.000000
00:45:00 0.000000
01:00:00 0.000000
01:15:00 0.000000
01:30:00 50.869889
01:45:00 50.869889
02:00:00 50.869889
02:15:00 50.869889
02:30:00 50.869889
02:45:00 50.869889
03:00:00 50.869889
03:15:00 8.846017
03:30:00 8.846017
03:45:00 8.846017
04:00:00 8.846017
04:15:00 8.846017
04:30:00 8.846017
04:45:00 8.846017
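The answer above forward-fills, while the question asked for interpolated values. One possible way to get those, sketched here with numpy.interp instead of resample (a swapped-in technique, not what the answer uses), is to interpolate directly onto a 15-minute grid expressed in days. The small frame below is invented so that the result is easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical irregular samples: time in days, data chosen as 1000 * time.
df = pd.DataFrame({"time": [0.00, 0.04, 0.11, 0.23],
                   "data": [0.0, 40.0, 110.0, 230.0]}).sort_values("time")

step_days = 15 / (60 * 24)                          # 15 minutes expressed in days
n_steps = int(df["time"].max() // step_days) + 1    # grid points that fit in the data
grid_days = np.arange(n_steps) * step_days

# Linear interpolation of the data onto the regular grid.
values = np.interp(grid_days, df["time"].to_numpy(), df["data"].to_numpy())

# Label the grid with timedeltas for readability.
grid = pd.to_timedelta(np.arange(n_steps) * 15, unit="min")
result = pd.Series(values, index=grid, name="data")

print(result.head())
```

Because the grid is built from integer multiples of 15 minutes, the labels are exact even though the original times are floats.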
Let's say I have a series of instantaneous temperature measurements (i.e. they capture the temperature at an exact moment in time).
index = pd.date_range('1/1/2000', periods=9, freq='6H')
series = pd.Series(range(9), index=index)
series
Out[130]:
2000-01-01 00:00:00 0
2000-01-01 06:00:00 1
2000-01-01 12:00:00 2
2000-01-01 18:00:00 3
2000-01-02 00:00:00 4
2000-01-02 06:00:00 5
2000-01-02 12:00:00 6
2000-01-02 18:00:00 7
2000-01-03 00:00:00 8
Freq: 6H, dtype: int64
I want to get the average daily temperature. The problem is that I want to include 00:00:00 of both the current day and the next day in the average for the current day. For example, I want to average 2000-01-01 00:00:00 through 2000-01-02 00:00:00 inclusive. The pandas resample function will not include 2000-01-02 00:00:00 in that bin because it belongs to a different day.
I would imagine this situation comes up often when dealing with instantaneous measurements that need to be resampled. What's the solution?
setup
index = pd.date_range('1/1/2000', periods=9, freq='6H')
series = pd.Series(range(9), index=index)
series
2000-01-01 00:00:00 0
2000-01-01 06:00:00 1
2000-01-01 12:00:00 2
2000-01-01 18:00:00 3
2000-01-02 00:00:00 4
2000-01-02 06:00:00 5
2000-01-02 12:00:00 6
2000-01-02 18:00:00 7
2000-01-03 00:00:00 8
Freq: 6H, dtype: int64
solution
series.rolling(5).mean().resample('D').first()
2000-01-01 NaN
2000-01-02 2.0
2000-01-03 6.0
Freq: D, dtype: float64
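Note that in this output each average is labelled with the day on which its window ends (2.0 covers 2000-01-01 00:00 through 2000-01-02 00:00 but is labelled 2000-01-02). If the label should be the day the window starts, one option is to keep the midnight values and move the index back by one day. This is a sketch under the assumption of a regular 6-hour series, as in the setup.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="6h")
series = pd.Series(range(9), index=index)

# Mean over the current sample and the four before it (24 hours, both ends included).
window_mean = series.rolling(5).mean()

# Keep only the values at midnight and drop the incomplete first window.
at_midnight = window_mean[window_mean.index == window_mean.index.normalize()].dropna()

# Relabel each average with the day its window starts on.
daily = at_midnight.copy()
daily.index = daily.index - pd.Timedelta(days=1)

print(daily)
```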
I want to resample the pandas series
import pandas as pd
index_1 = pd.date_range('1/1/2000', periods=4, freq='T')
index_2 = pd.date_range('1/2/2000', periods=3, freq='T')
series = pd.Series(range(4), index=index_1)
series = pd.concat([series, pd.Series(range(3), index=index_2)])
print(series)
>>>2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 00:02:00 2
such that the resulting Series only contains every second entry, i.e.
>>>2000-01-01 00:00:00 0
2000-01-01 00:02:00 2
2000-01-02 00:00:00 0
2000-01-02 00:02:00 2
using the (poorly documented) resample method of pandas in the following way:
resampled_series = series.resample('2T', closed='right').mean()
print(resampled_series)
I get
>>>1999-12-31 23:58:00 0.0
2000-01-01 00:00:00 1.5
2000-01-01 00:02:00 3.0
2000-01-01 00:04:00 NaN
2000-01-01 00:56:00 NaN
...
2000-01-01 23:54:00 NaN
2000-01-01 23:56:00 NaN
2000-01-01 23:58:00 0.0
2000-01-02 00:00:00 1.5
2000-01-02 00:02:00 3.0
Why does it start 2 minutes earlier than the original series? Why does it contain all the time steps in between, which are not contained in the original series? How can I get my desired result?
resample() is not the right function for your purpose, because it always builds a regular grid of bins over the whole time span.
Try this:
series[series.index.minute % 2 == 0]
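For comparison, here is a small sketch that runs the mask next to a resample-based variant. It assumes left-closed 2-minute bins and drops the empty bins that resample creates for the gap between the two days; for this particular series both give the same rows.

```python
import pandas as pd

index_1 = pd.date_range("2000-01-01", periods=4, freq="min")
index_2 = pd.date_range("2000-01-02", periods=3, freq="min")
series = pd.concat([pd.Series(range(4), index=index_1),
                    pd.Series(range(3), index=index_2)])

# Mask: keep rows whose minute is even. No new time stamps are created.
by_mask = series[series.index.minute % 2 == 0]

# Resample: first value of each 2-minute bin, then drop the empty bins.
by_resample = series.resample("2min").first().dropna().astype(int)

print(by_mask)
```

The mask is the cheaper choice, since resample first materialises every 2-minute bin between the first and last time stamp.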
I have to resample my dataset from a 10-minute interval to a 15-minute interval to make it in sync with another dataset. Based on my searches at stackoverflow I have some ideas how to proceed, but none of them deliver a clean and clear solution.
Problem
Problem set up
#%% Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#%% make timestamps
periods = 12
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)
#%% Make DataFrame and fill it with some data
df = pd.DataFrame(index=timestamp10min)
y = -(np.arange(periods)-periods/2)**2
df['y'] = y
Desired output
Now I want the values that already fall on the 15-minute grid (**:00 and **:30) to be unchanged, and the values at **:15 and **:45 to be the mean of **:10, **:20 and of **:40, **:50, respectively. The core of the problem is that 15 minutes is not an integer multiple of 10 minutes. Otherwise simply applying df.resample('10Min', how='mean') would have worked.
Possible solutions
1. Simply use the 15 minutes resampling and just live with the small introduced error.
2. Use two forms of resample, with closed='left', label='left' and closed='right', label='right', and afterwards average both resampled forms. This still gives some error, but a smaller one than the first method.
3. Resample everything to 5 minute data and then apply a rolling average. Something like that is applied here: Pandas: rolling mean by time interval
4. Resample and average with a varying number of inputs: Use numpy.average with weights for resampling a pandas array. For this I would have to create a new Series of weights of varying length, where the weights alternate between 1 and 2.
5. Resample everything to 5 minute data and then apply linear interpolation. This method is close to method 3. Pandas data frame: resample with linear interpolation
Edit: @Paul H gave a workable solution along these lines, which is still readable. Thanks!
None of these methods is really satisfying to me. Some lead to a small error, and others would be quite difficult to read for an outsider.
Implementation
The implementation of methods 1, 2 and 5, together with the desired output and a visualization.
#%% start plot
plt.figure()
plt.plot(df.index, df['y'], label='original')
#%% resample the data to 15 minutes and plot the result
close = 'left'; label='left'
dfresamplell = pd.DataFrame()
dfresamplell['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label
plt.plot(dfresamplell.index, dfresamplell['15min'], label=labelstring)
close = 'right'; label='right'
dfresamplerr = pd.DataFrame()
dfresamplerr['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label
plt.plot(dfresamplerr.index, dfresamplerr['15min'], label=labelstring)
#%% make an average
dfresampleaverage = pd.DataFrame(index=dfresamplell.index)
dfresampleaverage['15min'] = (dfresamplell['15min'].values+dfresamplerr['15min'].values[:-1])/2
plt.plot(dfresampleaverage.index, dfresampleaverage['15min'], label='average of both resampling methods')
#%% desired output
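ydesired = np.zeros(periods // 3 * 2)  # integer division: np.zeros needs an int
i = 0  # index into y
j = 0  # index into ydesired
k = 0  # 0: copy y[i]; 1: average y[i] and y[i+1]
for val in ydesired:
    if i+k==len(y): k=0
    ydesired[j] = np.mean([y[i],y[i+k]])
    j+=1
    i+=1
    if k==0: k=1
    else: k=0; i+=1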
plt.plot(dfresamplell.index, ydesired, label='ydesired')
#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
dfreindex.interpolate(inplace=True)
dfreindex = dfreindex.resample('15min').first().head()
plt.plot(dfreindex.index, dfreindex['y'], label='method Paul H')
#%% finalize plot
plt.legend()
Implementation for angles
As a bonus I have added the code I will use for the interpolation of angles. This is done by using complex numbers. Because complex interpolation is not implemented (yet), I split the complex numbers into a real and an imaginary part. After averaging, these numbers can be converted to angles again. For certain angles this is a better resampling method than simply averaging the two angles, for example: 345 and 5 degrees.
#%% make timestamps
periods = 24*6
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)
#%% Make DataFrame and fill it with some data
degrees = np.cumsum(np.random.randn(periods)*25) % 360
df = pd.DataFrame(index=timestamp10min)
df['deg'] = degrees
df['zreal'] = np.cos(df['deg']*np.pi/180)
df['zimag'] = np.sin(df['deg']*np.pi/180)
#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
dfreindex = dfreindex.interpolate()
dfresample = dfreindex.resample('15min').first()
#%% convert complex to degrees
def f(x):
    # rebuild the complex number from its parts and take its angle
    return np.angle(x['zreal'] + x['zimag']*1j, deg=True)
dfresample['degrees'] = dfresample[['zreal', 'zimag']].apply(f, axis=1)
#%% set all the values between 0-360 degrees
dfresample.loc[dfresample['degrees']<0, 'degrees'] += 360
#%% wrong resampling
dfresample['deg'] = dfresample['deg'] % 360
#%% plot different sampling methods
plt.figure()
plt.plot(df.index, df['deg'], label='normal', marker='v')
plt.plot(dfresample.index, dfresample['degrees'], label='resampled according #Paul H', marker='^')
plt.plot(dfresample.index, dfresample['deg'], label='wrong resampling', marker='<')
plt.legend()
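The 345 and 5 degree case mentioned above can be checked in isolation. The function circular_mean_deg is a hypothetical helper written for this sketch; it averages the cosine and sine parts and converts back to an angle, which is the same idea as the zreal/zimag columns.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, returned in [0, 360)."""
    radians = np.deg2rad(np.asarray(angles_deg, dtype=float))
    mean_sin = np.sin(radians).mean()
    mean_cos = np.cos(radians).mean()
    return float(np.rad2deg(np.arctan2(mean_sin, mean_cos)) % 360)

naive = float(np.mean([345, 5]))           # plain average, points the wrong way
circular = circular_mean_deg([345, 5])     # average taken on the unit circle

print(naive, circular)
```

The plain average lands at 175 degrees, on the opposite side of the circle, while the circular mean stays at about 355 degrees between the two inputs.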
I might be misunderstanding the problem, but does this work?
TL;DR version:
import numpy as np
import pandas
data = np.arange(0, 101, 8)
index_10T = pandas.date_range(freq='10min', start='2012-01-01 00:00', periods=data.shape[0])
index_05T = pandas.date_range(freq='5min', start=index_10T[0], end=index_10T[-1])
index_15T = pandas.date_range(freq='15min', start=index_10T[0], end=index_10T[-1])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1.reindex(index=index_05T).interpolate().loc[index_15T])
Long version
setup fake data
import numpy as np
import pandas
data = np.arange(0, 101, 8)
index_10T = pandas.date_range(freq='10min', start='2012-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1)
A
2012-01-01 00:00:00 0
2012-01-01 00:10:00 8
2012-01-01 00:20:00 16
2012-01-01 00:30:00 24
2012-01-01 00:40:00 32
2012-01-01 00:50:00 40
2012-01-01 01:00:00 48
2012-01-01 01:10:00 56
2012-01-01 01:20:00 64
2012-01-01 01:30:00 72
2012-01-01 01:40:00 80
2012-01-01 01:50:00 88
2012-01-01 02:00:00 96
So then build a new 5-minute index and reindex the original dataframe:
index_05T = pandas.date_range(freq='5min', start=index_10T[0], end=index_10T[-1])
df2 = df1.reindex(index=index_05T)
print(df2)
A
2012-01-01 00:00:00 0
2012-01-01 00:05:00 NaN
2012-01-01 00:10:00 8
2012-01-01 00:15:00 NaN
2012-01-01 00:20:00 16
2012-01-01 00:25:00 NaN
2012-01-01 00:30:00 24
2012-01-01 00:35:00 NaN
2012-01-01 00:40:00 32
2012-01-01 00:45:00 NaN
2012-01-01 00:50:00 40
2012-01-01 00:55:00 NaN
2012-01-01 01:00:00 48
2012-01-01 01:05:00 NaN
2012-01-01 01:10:00 56
2012-01-01 01:15:00 NaN
2012-01-01 01:20:00 64
2012-01-01 01:25:00 NaN
2012-01-01 01:30:00 72
2012-01-01 01:35:00 NaN
2012-01-01 01:40:00 80
2012-01-01 01:45:00 NaN
2012-01-01 01:50:00 88
2012-01-01 01:55:00 NaN
2012-01-01 02:00:00 96
and then linearly interpolate
print(df2.interpolate())
A
2012-01-01 00:00:00 0
2012-01-01 00:05:00 4
2012-01-01 00:10:00 8
2012-01-01 00:15:00 12
2012-01-01 00:20:00 16
2012-01-01 00:25:00 20
2012-01-01 00:30:00 24
2012-01-01 00:35:00 28
2012-01-01 00:40:00 32
2012-01-01 00:45:00 36
2012-01-01 00:50:00 40
2012-01-01 00:55:00 44
2012-01-01 01:00:00 48
2012-01-01 01:05:00 52
2012-01-01 01:10:00 56
2012-01-01 01:15:00 60
2012-01-01 01:20:00 64
2012-01-01 01:25:00 68
2012-01-01 01:30:00 72
2012-01-01 01:35:00 76
2012-01-01 01:40:00 80
2012-01-01 01:45:00 84
2012-01-01 01:50:00 88
2012-01-01 01:55:00 92
2012-01-01 02:00:00 96
build a 15-minute index and use that to pull out data:
index_15T = pandas.date_range(freq='15min', start=index_10T[0], end=index_10T[-1])
print(df2.interpolate().loc[index_15T])
A
2012-01-01 00:00:00 0
2012-01-01 00:15:00 12
2012-01-01 00:30:00 24
2012-01-01 00:45:00 36
2012-01-01 01:00:00 48
2012-01-01 01:15:00 60
2012-01-01 01:30:00 72
2012-01-01 01:45:00 84
2012-01-01 02:00:00 96
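The steps above can be wrapped into one helper. The sketch below is a variation rather than the exact method of the answer: instead of a 5-minute index it reindexes onto the union of the original and target time stamps and interpolates with method='time', so no common divisor of the two frequencies is needed. The name resample_linear is made up for this example.

```python
import numpy as np
import pandas as pd

def resample_linear(df, freq):
    """Linearly interpolate df onto a regular grid with the given frequency."""
    target = pd.date_range(start=df.index[0], end=df.index[-1], freq=freq)
    combined = df.index.union(target)            # original and target time stamps
    filled = df.reindex(combined).interpolate(method="time")
    return filled.reindex(target)                # keep only the target grid

data = np.arange(0, 101, 8)
index_10min = pd.date_range(start="2012-01-01 00:00", periods=data.shape[0], freq="10min")
df1 = pd.DataFrame(data=data, index=index_10min, columns=["A"])

df_15min = resample_linear(df1, "15min")
print(df_15min)
```

With method='time' the interpolation weights follow the actual spacing of the time stamps, so the same helper also works when the input is not evenly spaced.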
Ok, here's one way to do it.
1. Make a list of the times you want to have filled in
2. Make a combined index that includes the times you want and the times you already have
3. Take your data and "forward fill it"
4. Take your data and "backward fill it"
5. Average the forward and backward fills
6. Select only the rows you want
Note this only works since you want the values exactly halfway between the values you already have, time-wise. Note the last time comes out np.nan because you don't have any later data.
import datetime as dt
import pandas as pd
# build the list of 15-minute time stamps to fill in
times_15 = []
current = df.index[0]
while current < df.index[-2]:
    current = current + dt.timedelta(minutes=15)
    times_15.append(current)
# combined, sorted index of the wanted times and the existing times
combined = df.index.union(pd.DatetimeIndex(times_15))
df = df.reindex(combined)
df['ff'] = df['y'].ffill()
df['bf'] = df['y'].bfill()
# average of the forward and backward fills
df['solution'] = df[['ff', 'bf']].mean(axis=1)
df.loc[times_15, :]
In case someone is working with data without regularity at all, here is an adapted solution from the one provided by Paul H above.
If you don't want to interpolate throughout the time-series, but only in those places where resample is meaningful, you may keep the interpolated column side by side and finish with a resample and dropna.
import numpy as np
import pandas
data = np.arange(0, 101, 3)
index_setup = pandas.date_range(freq='01T', start='2022-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_setup, columns=['A'])
df1 = df1.sample(frac=0.2).sort_index()
print(df1)
A
2022-01-01 00:03:00 9
2022-01-01 00:07:00 21
2022-01-01 00:09:00 27
2022-01-01 00:13:00 39
2022-01-01 00:17:00 51
2022-01-01 00:25:00 75
2022-01-01 00:28:00 84
Notice that resampling this irregular DF simply assigns each value to the bin that starts at the floored time stamp, without interpolating.
print(df1.resample('05T').mean())
A
2022-01-01 00:00:00 9.0
2022-01-01 00:05:00 24.0
2022-01-01 00:10:00 39.0
2022-01-01 00:15:00 51.0
2022-01-01 00:20:00 NaN
2022-01-01 00:25:00 79.5
A better solution can be achieved by interpolating on a small enough interval and then resampling. The resulting DF now has too many rows, but a dropna() brings it close to its original shape.
index_1min = pandas.date_range(freq='01T', start='2022-01-01 00:00', end='2022-01-01 23:59')
df2 = df1.reindex(index=index_1min)
df2['A_interp'] = df2['A'].interpolate(limit_direction='both')
print(df2.resample('05T').first().dropna())
A A_interp
2022-01-01 00:00:00 9.0 9.0
2022-01-01 00:05:00 21.0 15.0
2022-01-01 00:10:00 39.0 30.0
2022-01-01 00:15:00 51.0 45.0
2022-01-01 00:25:00 75.0 75.0
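Because sample() draws rows at random, the printed frames change from run to run. When reproducing this example, passing a fixed random_state (the value 42 below is arbitrary) should make the draw repeatable. A minimal sketch:

```python
import numpy as np
import pandas as pd

data = np.arange(0, 101, 3)
index_setup = pd.date_range(start="2022-01-01 00:00", periods=data.shape[0], freq="min")
full = pd.DataFrame(data=data, index=index_setup, columns=["A"])

# Same seed, same rows: the two draws are identical.
draw_1 = full.sample(frac=0.2, random_state=42).sort_index()
draw_2 = full.sample(frac=0.2, random_state=42).sort_index()

print(draw_1)
```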