I have a .csv file with some data in the following format:
1.69511909, 0.57561167, 0.31437427, 0.35458831, 0.15841189, 0.28239582, -0.18180907, 1.34761404, -1.5059083, 1.29246638
-1.66764664, 0.1488095, 1.03832221, -0.35229205, 1.35705861, -1.56747104, -0.36783851, -0.57636948, 0.9854391, 1.63031066
0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838, -0.08557089, 1.71855596, -0.61235066, -0.32520105, 1.54162629
Every line corresponds to a specific day, and every record in a line corresponds to a specific hour in that day.
Is there a convenient way to import the data with read_csv so that everything is correctly indexed, i.e. so that the import distinguishes days (lines) from hours within a day (the separate records in each line)?
Something like this. Note that I couldn't copy your string for some reason, so my dataset is cut off.
Read in the string (it comes back as a DataFrame because mine had newlines in it), then coerce it to a Series.
In [23]: s = pd.read_csv(StringIO(data)).values
In [24]: s
Out[24]:
array([[-1.66764664, 0.1488095 , 1.03832221, -0.35229205, 1.35705861,
-1.56747104, -0.36783851, -0.57636948, 0.9854391 , 1.63031066],
[ 0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838,
-0.08557089, 1.71855596, nan, nan, nan]])
In [25]: s = Series(pd.read_csv(StringIO(data)).values.ravel())
In [26]: s
Out[26]:
0 -1.667647
1 0.148810
2 1.038322
3 -0.352292
4 1.357059
5 -1.567471
6 -0.367839
7 -0.576369
8 0.985439
9 1.630311
10 0.877638
11 0.607572
12 0.649083
13 -0.683577
14 0.334998
15 -0.085571
16 1.718556
17 NaN
18 NaN
19 NaN
dtype: float64
Just set the index directly. Note that you are solely responsible for alignment; it is VERY
easy to be off by one here.
In [27]: s.index = pd.date_range('20130101',freq='H',periods=len(s))
In [28]: s
Out[28]:
2013-01-01 00:00:00 -1.667647
2013-01-01 01:00:00 0.148810
2013-01-01 02:00:00 1.038322
2013-01-01 03:00:00 -0.352292
2013-01-01 04:00:00 1.357059
2013-01-01 05:00:00 -1.567471
2013-01-01 06:00:00 -0.367839
2013-01-01 07:00:00 -0.576369
2013-01-01 08:00:00 0.985439
2013-01-01 09:00:00 1.630311
2013-01-01 10:00:00 0.877638
2013-01-01 11:00:00 0.607572
2013-01-01 12:00:00 0.649083
2013-01-01 13:00:00 -0.683577
2013-01-01 14:00:00 0.334998
2013-01-01 15:00:00 -0.085571
2013-01-01 16:00:00 1.718556
2013-01-01 17:00:00 NaN
2013-01-01 18:00:00 NaN
2013-01-01 19:00:00 NaN
Freq: H, dtype: float64
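A minimal, self-contained sketch of the same idea. The three-column sample string is made up for illustration, and header=None is my addition so that the first line is kept as data rather than read as column names. As in the answer above, the hours are numbered consecutively across rows, so the days only line up if each row holds 24 values.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: two "days" with three readings each.
raw = "1.0, 2.0, 3.0\n4.0, 5.0, 6.0\n"

# header=None keeps the first line as data; skipinitialspace drops the blank after each comma.
values = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True).to_numpy()

# Flatten row by row, then attach one consecutive hourly timestamp per value.
flat = pd.Series(values.ravel())
flat.index = pd.date_range("2013-01-01", periods=len(flat), freq="h")
print(flat)
```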
First just read in the DataFrame:
df = pd.read_csv(file_name, sep=r',\s+', header=None, engine='python')
Then set the index to be the dates and the columns to be the hours*
df.index = pd.date_range('2012-01-01', freq='D', periods=len(df))
from pandas.tseries.offsets import Hour
df.columns = [Hour(7+t) for t in df.columns]
In [5]: df
Out[5]:
<7 Hours> <8 Hours> <9 Hours> <10 Hours> <11 Hours> <12 Hours> <13 Hours> <14 Hours> <15 Hours> <16 Hours>
2012-01-01 1.695119 0.575612 0.314374 0.354588 0.158412 0.282396 -0.181809 1.347614 -1.505908 1.292466
2012-01-02 -1.667647 0.148810 1.038322 -0.352292 1.357059 -1.567471 -0.367839 -0.576369 0.985439 1.630311
2012-01-03 0.877638 0.607572 0.649083 -0.683577 0.334998 -0.085571 1.718556 -0.612351 -0.325201 1.541626
Then stack it and add the Date and the Hour levels of the MultiIndex:
s = df.stack()
s.index = [x[0]+x[1] for x in s.index]
In [8]: s
Out[8]:
2012-01-01 07:00:00 1.695119
2012-01-01 08:00:00 0.575612
2012-01-01 09:00:00 0.314374
2012-01-01 10:00:00 0.354588
2012-01-01 11:00:00 0.158412
2012-01-01 12:00:00 0.282396
2012-01-01 13:00:00 -0.181809
2012-01-01 14:00:00 1.347614
2012-01-01 15:00:00 -1.505908
2012-01-01 16:00:00 1.292466
2012-01-02 07:00:00 -1.667647
2012-01-02 08:00:00 0.148810
...
* You can use different offsets, see here, e.g. Minute, Second.
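The same reshaping can also be sketched with plain timedeltas instead of Hour offsets. The sample string and the first_hour value of 7 are assumptions for illustration (7 mirrors the offset used in the answer above); pd.to_timedelta and get_level_values are standard pandas calls.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: one line per day, one value per hour.
raw = "1.0, 2.0, 3.0\n4.0, 5.0, 6.0\n"
first_hour = 7  # assumed hour of the first column, as in the answer above

df = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)
df.index = pd.date_range("2012-01-01", periods=len(df), freq="D")

# stack() gives a (day, column number) MultiIndex; combine both levels into one timestamp.
stacked = df.stack()
days = stacked.index.get_level_values(0)
hours = pd.to_timedelta(stacked.index.get_level_values(1) + first_hour, unit="h")
stacked.index = days + hours
print(stacked)
```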
Related
I want to compute the hourly mean for a time series of wind speed and direction, but I want to set the time at the half hour. So, the average for values from 14:00 to 15:00 will be at 14:30. Right now, I can only seem to get it on left or right of the interval. Here is what I currently have:
ts_g=[item.replace(second=0, microsecond=0) for item in dates_g]
dg = {'ws': data_g.ws, 'wdir': data_g.wdir}
df_g = pandas.DataFrame(data=dg, index=ts_g, columns=['ws','wdir'])
grouped_g = df_g.groupby(pandas.TimeGrouper('H'))
hourly_ws_g = grouped_g['ws'].mean()
hourly_wdir_g = grouped_g['wdir'].mean()
the output for this looks like:
2016-04-08 06:00:00+00:00 46.980000
2016-04-08 07:00:00+00:00 64.313333
2016-04-08 08:00:00+00:00 75.678333
2016-04-08 09:00:00+00:00 127.383333
2016-04-08 10:00:00+00:00 145.950000
2016-04-08 11:00:00+00:00 184.166667
....
but I would like it to be like:
2016-04-08 06:30:00+00:00 54.556
2016-04-08 07:30:00+00:00 78.001
....
Thanks for your help!
So the easiest way is to resample and then use linear interpolation:
In [21]: rng = pd.date_range('1/1/2011', periods=72, freq='H')
In [22]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
...:
In [23]: ts.head()
Out[23]:
2011-01-01 00:00:00 0.796704
2011-01-01 01:00:00 -1.153179
2011-01-01 02:00:00 -1.919475
2011-01-01 03:00:00 0.082413
2011-01-01 04:00:00 -0.397434
Freq: H, dtype: float64
In [24]: ts2 = ts.resample('30T').interpolate()
In [25]: ts2.head()
Out[25]:
2011-01-01 00:00:00 0.796704
2011-01-01 00:30:00 -0.178237
2011-01-01 01:00:00 -1.153179
2011-01-01 01:30:00 -1.536327
2011-01-01 02:00:00 -1.919475
Freq: 30T, dtype: float64
I believe this is what you need.
Edit to add clarifying example
Perhaps it's easier to see what's going on without random Data:
In [29]: ts.head()
Out[29]:
2011-01-01 00:00:00 0
2011-01-01 01:00:00 1
2011-01-01 02:00:00 2
2011-01-01 03:00:00 3
2011-01-01 04:00:00 4
Freq: H, dtype: int64
In [30]: ts2 = ts.resample('30T').interpolate()
In [31]: ts2.head()
Out[31]:
2011-01-01 00:00:00 0.0
2011-01-01 00:30:00 0.5
2011-01-01 01:00:00 1.0
2011-01-01 01:30:00 1.5
2011-01-01 02:00:00 2.0
Freq: 30T, dtype: float64
This post is several years old and uses an API that has long been deprecated. Modern pandas provides the resample method, which is easier to use than pandas.TimeGrouper. However, it only labels each interval by its left or right edge; labelling an interval by its midpoint is not available out of the box.
Still, this is not hard to do.
First we fill in the data that we want to resample:
ts_g=[datetime.datetime.fromisoformat('2019-11-20') +
datetime.timedelta(minutes=10*x) for x in range(0,100)]
dg = {'ws': range(0,100), 'wdir': range(0,100)}
df_g = pd.DataFrame(data=dg, index=ts_g, columns=['ws','wdir'])
df_g.head()
The output would be:
ws wdir
2019-11-20 00:00:00 0 0
2019-11-20 00:10:00 1 1
2019-11-20 00:20:00 2 2
2019-11-20 00:30:00 3 3
2019-11-20 00:40:00 4 4
Now we first resample to 30 minute intervals
grouped_g = df_g.resample('30min')
halfhourly_ws_g = grouped_g['ws'].mean()
halfhourly_ws_g.head()
The output would be:
2019-11-20 00:00:00 1.0
2019-11-20 00:30:00 4.0
2019-11-20 01:00:00 7.0
2019-11-20 01:30:00 10.0
2019-11-20 02:00:00 13.0
Freq: 30T, Name: ws, dtype: float64
Finally the trick to get the centered intervals:
hourly_ws_g = halfhourly_ws_g.add(halfhourly_ws_g.shift(1)).div(2)\
.loc[halfhourly_ws_g.index.minute % 60 == 30]
hourly_ws_g.head()
This would produce the expected output:
2019-11-20 00:30:00 2.5
2019-11-20 01:30:00 8.5
2019-11-20 02:30:00 14.5
2019-11-20 03:30:00 20.5
2019-11-20 04:30:00 26.5
Freq: 60T, Name: ws, dtype: float64
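Another way to arrive at the same labels, offered only as a sketch: take the ordinary hourly mean (labelled at the left edge) and then move every label forward by 30 minutes. The data below is rebuilt from the example above; shifting the index afterwards is my own variation, not part of the original answer.

```python
import pandas as pd

# Same toy data as above: one reading every 10 minutes, values 0..99.
index = pd.date_range("2019-11-20", periods=100, freq="10min")
df_g = pd.DataFrame({"ws": range(100), "wdir": range(100)}, index=index)

# Ordinary hourly mean, labelled by the start of each hour.
hourly_ws = df_g["ws"].resample("h").mean()

# Move each label to the middle of its hour.
centered_ws = hourly_ws.copy()
centered_ws.index = hourly_ws.index + pd.Timedelta(minutes=30)
print(centered_ws.head())
```

Unlike the half-hour trick, this averages all samples of an hour in one step, so the two results only agree when both half-hours contain the same number of samples.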
I'm using Pandas 0.17.1 and I oftentimes encounter hourly Series data that contains partial days. It does not seem that there is any functionality built into pandas that permits you to discard values that correspond to incomplete segments of a coarser date offset on the boundaries of the Series data (I would only like to discard partial data that exist at the beginning and/or the end of the Series).
My intuition, given the above, is that I would have to code something up to abstract the criterion (e.g. groupby with a count aggregation, discard hours in days with < 24 hours):
>> hist_data.groupby(lambda x: x.date()).agg('count')
2007-01-01 23
2007-01-02 24
...
An example of desired behavior:
>> hourly_data
2016-01-01 04:00:00 0.603820
2016-01-01 05:00:00 0.806696
2016-01-01 06:00:00 0.938521
2016-01-01 07:00:00 0.781834
2016-01-01 08:00:00 0.154952
...
2016-01-03 22:00:00 0.082177
2016-01-03 23:00:00 0.753210
2016-01-04 00:00:00 0.458402
2016-01-04 01:00:00 0.649496
2016-01-04 02:00:00 0.525321
2016-01-04 03:00:00 0.242605
Freq: H, dtype: float64
>> remove_partial_boundary_data(hourly_data)
2016-01-02 00:00:00 0.833063
2016-01-02 01:00:00 0.131586
2016-01-02 02:00:00 0.876609
2016-01-02 03:00:00 0.319436
2016-01-02 04:00:00 0.056246
...
2016-01-03 20:00:00 0.405725
2016-01-03 21:00:00 0.541096
2016-01-03 22:00:00 0.082177
2016-01-03 23:00:00 0.753210
Freq: H, dtype: float64
However, if my timezone is anything other than UTC (timezone-aware), the approach suggested above seems fraught with peril because counts of hours on DST transition days would be either 23 or 25.
Does anyone know of a clever or built-in way to handle this?
You can do this with a groupby and filter on groups that are not complete. To check for completeness, I first reindexed the data and then checked if there are NaN values:
In [10]: hourly_data = pd.Series(np.random.randn(72), index=pd.date_range('2016-01-01 04:00', periods=72, freq='H'))
In [11]: new_idx = pd.date_range(hourly_data.index[0].date(), hourly_data.index[-1].date() + pd.Timedelta('1 day'), freq='H')
In [12]: hourly_data.reindex(new_idx)
Out[12]:
2016-01-01 00:00:00 NaN
2016-01-01 01:00:00 NaN
2016-01-01 02:00:00 NaN
2016-01-01 03:00:00 NaN
2016-01-01 04:00:00 -0.941332
2016-01-01 05:00:00 1.802739
2016-01-01 06:00:00 0.798968
2016-01-01 07:00:00 -0.444979
...
2016-01-04 17:00:00 NaN
2016-01-04 18:00:00 NaN
2016-01-04 19:00:00 NaN
2016-01-04 20:00:00 NaN
2016-01-04 21:00:00 NaN
2016-01-04 22:00:00 NaN
2016-01-04 23:00:00 NaN
2016-01-05 00:00:00 NaN
Freq: H, dtype: float64
This resulted in a timeseries that includes all hours of the dates that are present in the timeseries. This way, we can check if a date was complete by checking if there are NaN values for that date (this method should work for DST transitions), and filter with this criterion:
In [13]: hourly_data.reindex(new_idx).groupby(lambda x: x.date()).filter(lambda x: x.isnull().sum() == 0)
Out[13]:
2016-01-02 00:00:00 -1.231445
2016-01-02 01:00:00 2.371690
2016-01-02 02:00:00 -0.695448
2016-01-02 03:00:00 0.745308
2016-01-02 04:00:00 0.814579
2016-01-02 05:00:00 1.345674
2016-01-02 06:00:00 -1.491470
2016-01-02 07:00:00 0.407182
...
2016-01-03 16:00:00 -0.742151
2016-01-03 17:00:00 0.677229
2016-01-03 18:00:00 0.832271
2016-01-03 19:00:00 -0.183729
2016-01-03 20:00:00 1.938594
2016-01-03 21:00:00 -0.816370
2016-01-03 22:00:00 1.745757
2016-01-03 23:00:00 0.223487
Freq: H, dtype: float64
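The reindex-and-filter steps can be wrapped in a small helper. This is a sketch under assumptions: the function name drop_partial_days is hypothetical, the index is taken to be timezone-naive and sorted, and DST transitions are not exercised here. Note that, like the filter above, it also drops any interior day that has a gap or a NaN, not just the boundary days.

```python
import numpy as np
import pandas as pd


def drop_partial_days(series):
    """Keep only calendar days for which every hour is present and non-NaN."""
    start = series.index[0].normalize()
    end = series.index[-1].normalize() + pd.Timedelta(days=1)
    # Every hour of every day touched by the data, without the trailing midnight.
    full_index = pd.date_range(start, end, freq="h")[:-1]
    padded = series.reindex(full_index)
    day = padded.index.normalize()
    # One boolean per day: True only if no hour of that day is missing.
    complete = padded.notna().groupby(day).all()
    return padded[complete.reindex(day).to_numpy()]


hourly_data = pd.Series(
    np.random.randn(72),
    index=pd.date_range("2016-01-01 04:00", periods=72, freq="h"),
)
trimmed = drop_partial_days(hourly_data)
print(trimmed.index[0], trimmed.index[-1])
```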
ORIGINAL ANSWER
You can do this with resample by providing a custom function, and in that function you can then specify that NaN values should not be skipped.
Short answer:
hist_data.resample('D', how=lambda x: x.mean(skipna=False))
if the missing hours are already present as NaNs. Otherwise, you can first resample it to a regular hourly series:
hist_data.resample('H').resample('D', how=lambda x: x.mean(skipna=False))
Long answer with an example. With some dummy data (and I insert a NaN in one of the days):
In [77]: hist_data = pd.Series(np.random.randn(72), index=pd.date_range('2016-01-01', periods=72, freq='H'))
In [78]: hist_data
Out[78]:
2016-01-01 00:00:00 -0.717624
2016-01-01 01:00:00 0.029151
2016-01-01 02:00:00 0.535843
...
2016-01-03 21:00:00 0.659923
2016-01-03 22:00:00 -1.085640
2016-01-03 23:00:00 0.571347
Freq: H, dtype: float64
In [80]: hist_data.iloc[30] = np.nan
With count you can see that there is indeed one missing value for the second day:
In [81]: hist_data.resample('D', how='count')
Out[81]:
2016-01-01 24
2016-01-02 23
2016-01-03 24
Freq: D, dtype: int64
By default, 'mean' will ignore this NaN value:
In [83]: hist_data.resample('D', how='mean')
Out[83]:
2016-01-01 0.106537
2016-01-02 -0.112774
2016-01-03 -0.292248
Freq: D, dtype: float64
But you can change this behaviour with the skipna keyword argument:
In [82]: hist_data.resample('D', how=lambda x: x.mean(skipna=False))
Out[82]:
2016-01-01 0.106537
2016-01-02 NaN
2016-01-03 -0.292248
Freq: D, dtype: float64
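The how= keyword used in this original answer was deprecated and later removed from pandas. A rough equivalent for newer versions, where resample returns an object you aggregate afterwards, might look like this (the data is random, as above):

```python
import numpy as np
import pandas as pd

hist_data = pd.Series(
    np.random.randn(72),
    index=pd.date_range("2016-01-01", periods=72, freq="h"),
)
hist_data.iloc[30] = np.nan  # knock out one hour on the second day

# Number of non-NaN hours per day.
counts = hist_data.resample("D").count()

# Daily mean that becomes NaN as soon as any hour of the day is NaN.
strict_mean = hist_data.resample("D").apply(lambda x: x.mean(skipna=False))
print(counts)
print(strict_mean)
```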
I have a df with the following index
df.index
>>> [2010-01-04 10:00:00, ..., 2010-12-31 16:00:00]
The main column is volume.
In the timestamp sequence, weekends and some other weekdays are not present. I want to resample my time index to have the aggregate sum of volume per minute. So I do the following:
df = df.resample('60S', how=sum)
There are some missing minutes. In other words, there are minutes where there are no trades. I want to include these missing minutes and add a 0 to the column volume.
To solve this, I would usually do something like:
new_range = pd.date_range('20110104 09:30:00','20111231 16:00:00',
freq='60s')+df.index
df = df.reindex(new_range)
df = df.between_time(start_time='10:00', end_time='16:00') # time interval per day that I want
df = df.fillna(0)
But now I am stuck with unwanted dates like the weekends and some other days. How can I get rid of the dates that were not originally in my timestamp index?
Just construct the range of datetimes you want and reindex to it.
Entire range
In [9]: rng = pd.date_range('20130101 09:00','20130110 16:00',freq='30T')
In [10]: rng
Out[10]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:00:00, ..., 2013-01-10 16:00:00]
Length: 447, Freq: 30T, Timezone: None
Eliminate times out of range
In [11]: rng = rng.take(rng.indexer_between_time('09:30','16:00'))
In [12]: rng
Out[12]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:30:00, ..., 2013-01-10 16:00:00]
Length: 140, Freq: None, Timezone: None
Eliminate non-weekdays
In [13]: rng = rng[rng.weekday<5]
In [14]: rng
Out[14]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:30:00, ..., 2013-01-10 16:00:00]
Length: 112, Freq: None, Timezone: None
This just shows the resulting values; you probably want df.reindex(index=rng)
In [15]: rng.to_series()
Out[15]:
2013-01-01 09:30:00 2013-01-01 09:30:00
2013-01-01 10:00:00 2013-01-01 10:00:00
2013-01-01 10:30:00 2013-01-01 10:30:00
2013-01-01 11:00:00 2013-01-01 11:00:00
2013-01-01 11:30:00 2013-01-01 11:30:00
2013-01-01 12:00:00 2013-01-01 12:00:00
2013-01-01 12:30:00 2013-01-01 12:30:00
2013-01-01 13:00:00 2013-01-01 13:00:00
2013-01-01 13:30:00 2013-01-01 13:30:00
2013-01-01 14:00:00 2013-01-01 14:00:00
2013-01-01 14:30:00 2013-01-01 14:30:00
2013-01-01 15:00:00 2013-01-01 15:00:00
2013-01-01 15:30:00 2013-01-01 15:30:00
2013-01-01 16:00:00 2013-01-01 16:00:00
2013-01-02 09:30:00 2013-01-02 09:30:00
...
2013-01-09 16:00:00 2013-01-09 16:00:00
2013-01-10 09:30:00 2013-01-10 09:30:00
2013-01-10 10:00:00 2013-01-10 10:00:00
2013-01-10 10:30:00 2013-01-10 10:30:00
2013-01-10 11:00:00 2013-01-10 11:00:00
2013-01-10 11:30:00 2013-01-10 11:30:00
2013-01-10 12:00:00 2013-01-10 12:00:00
2013-01-10 12:30:00 2013-01-10 12:30:00
2013-01-10 13:00:00 2013-01-10 13:00:00
2013-01-10 13:30:00 2013-01-10 13:30:00
2013-01-10 14:00:00 2013-01-10 14:00:00
2013-01-10 14:30:00 2013-01-10 14:30:00
2013-01-10 15:00:00 2013-01-10 15:00:00
2013-01-10 15:30:00 2013-01-10 15:30:00
2013-01-10 16:00:00 2013-01-10 16:00:00
Length: 112
You could also start with a constructed business-day frequency series (and/or add a custom business day if you want holidays; this is new in 0.14.0, see here).
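To keep only the dates that actually occur in the original data, one option is to filter the constructed range by the set of traded days before reindexing. The sketch below uses made-up trades on a Friday and the following Monday; the variable names are illustrative, and fill_value=0 stands in for the fillna(0) step from the question.

```python
import pandas as pd

# Hypothetical trades: a Friday and the following Monday, nothing on the weekend.
idx = pd.to_datetime(["2010-01-08 10:00", "2010-01-08 10:02", "2010-01-11 10:01"])
df = pd.DataFrame({"volume": [100, 50, 70]}, index=idx)

per_minute = df.resample("1min").sum()

# Days that really occur in the original index.
traded_days = df.index.normalize().unique()

# Every minute over the whole span, then restrict to the daily window and to traded days.
rng = pd.date_range(traded_days.min(), traded_days.max() + pd.Timedelta(days=1), freq="1min")
rng = rng[rng.indexer_between_time("10:00", "16:00")]
rng = rng[rng.normalize().isin(traded_days)]

result = per_minute.reindex(rng, fill_value=0)
print(result.head())
```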
I have a pandas dataframe indexed by time:
>>> dframe.head()
aw_FATFREEMASS raw aw_FATFREEMASS sym
TIMESTAMP
2011-12-08 23:13:23 139.3 H
2011-12-08 23:12:18 139.2 H
2011-12-08 22:31:53 139.2 H
2011-12-09 07:08:50 138.2 H
2011-12-10 21:36:20 137.6 H
[5 rows x 2 columns]
>>> type(dframe.index)
<class 'pandas.tseries.index.DatetimeIndex'>
I'm trying to do a simple time series query similar to this SQL:
SELECT * FROM dframe WHERE tstart <= TIMESTAMP <= tend
where tstart and tend are appropriately represented timestamps. With pandas I'm getting behavior I just don't understand.
This does what I expect:
>>> dframe['2011-11-01' : '2011-11-20']
Empty DataFrame
Columns: [aw_FATFREEMASS raw, aw_FATFREEMASS sym]
Index: []
[0 rows x 2 columns]
This does the same thing:
dframe['2011-11-01 00:00:00' : '2011-11-20 00:00:00']
However:
>>> from dateutil.parser import parse
>>> dframe[parse('2011-11-01 00:00:00') : '2011-11-20 00:00:00']
*** TypeError: 'datetime.datetime' object is not iterable
>>> dframe[parse('2011-11-01') : '2011-11-20 00:00:00']
*** TypeError: 'datetime.datetime' object is not iterable
>>> dframe[parse('2011-11-01') : parse('2011-11-01')]
*** KeyError: Timestamp('2011-11-01 00:00:00', tz=None)
When I provide a time represented as a pandas Timestamp I get slice behavior I don't understand. Can someone explain this behavior and/or tell me how I can achieve the SQL query above?
docs are here
This is called partial string indexing. In a nutshell, providing a string gets you the results that 'match', i.e. those included in the specified interval, while if you specify a Timestamp/datetime then it's exact: it HAS to be in the index.
Can you show how you constructed the DatetimeIndex?
Which version of pandas are you using?
In [4]: df = DataFrame(np.random.randn(20,2),index=date_range('20130101',periods=20,freq='H'))
In [5]: df
Out[5]:
0 1
2013-01-01 00:00:00 -0.339751 1.223660
2013-01-01 01:00:00 0.525203 -0.987815
2013-01-01 02:00:00 1.724239 0.213446
2013-01-01 03:00:00 -0.074797 -1.658876
2013-01-01 04:00:00 0.483425 -2.112314
2013-01-01 05:00:00 0.094140 0.327681
2013-01-01 06:00:00 -1.265337 -0.858521
2013-01-01 07:00:00 -1.470041 0.168871
2013-01-01 08:00:00 -0.609185 0.829035
2013-01-01 09:00:00 0.047774 0.221399
2013-01-01 10:00:00 0.814162 -1.415824
2013-01-01 11:00:00 1.070209 0.720150
2013-01-01 12:00:00 0.887571 -0.611207
2013-01-01 13:00:00 1.669451 -0.022434
2013-01-01 14:00:00 -1.796565 -1.186899
2013-01-01 15:00:00 0.417758 0.082021
2013-01-01 16:00:00 -1.064019 -0.377208
2013-01-01 17:00:00 0.939902 0.430784
2013-01-01 18:00:00 -0.645667 1.611992
2013-01-01 19:00:00 -0.172148 -1.725041
[20 rows x 2 columns]
In [6]: df['20130101 7:00:01':'20130101 10:00:00']
Out[6]:
0 1
2013-01-01 08:00:00 -0.609185 0.829035
2013-01-01 09:00:00 0.047774 0.221399
2013-01-01 10:00:00 0.814162 -1.415824
[3 rows x 2 columns]
In [7]: df.index
Out[7]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 19:00:00]
Length: 20, Freq: H, Timezone: None
If you already have Timestamps/datetimes, then just construct a boolean expression
df[(df.index > Timestamp('20130101 10:00:00')) & (df.index < Timestamp('20130101 17:00:00'))]
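For the SQL-style range query from the question, here is a small sketch with inclusive bounds. The frame and the tstart/tend values are made up; the rows are shuffled on purpose, because the index in the question is not sorted. The boolean mask does not depend on row order, whereas the label slice is applied to a sorted copy.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2013-01-01", periods=20, freq="h")
df = pd.DataFrame(np.arange(40).reshape(20, 2), index=idx)
shuffled = df.sample(frac=1, random_state=0)  # rows in arbitrary order

tstart = pd.Timestamp("2013-01-01 07:30")
tend = pd.Timestamp("2013-01-01 10:00")

# Boolean mask: works regardless of how the index is ordered.
by_mask = shuffled[(shuffled.index >= tstart) & (shuffled.index <= tend)]

# Label slice with Timestamps on a sorted copy; both ends are inclusive.
by_slice = shuffled.sort_index().loc[tstart:tend]
print(by_slice)
```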
I'd like to create multiple columns while resampling a pandas DataFrame like the built-in ohlc method.
def mhl(data):
    return pandas.Series([np.mean(data),np.max(data),np.min(data)],index = ['mean','high','low'])
ts.resample('30Min',how=mhl)
Dies with
Exception: Must produce aggregated value
Any suggestions? Thanks!
You can pass a dictionary of functions to the resample method:
In [35]: ts
Out[35]:
2013-01-01 00:00:00 0
2013-01-01 00:15:00 1
2013-01-01 00:30:00 2
2013-01-01 00:45:00 3
2013-01-01 01:00:00 4
2013-01-01 01:15:00 5
...
2013-01-01 23:00:00 92
2013-01-01 23:15:00 93
2013-01-01 23:30:00 94
2013-01-01 23:45:00 95
2013-01-02 00:00:00 96
Freq: 15T, Length: 97
Create a dictionary of functions:
mhl = {'m':np.mean, 'h':np.max, 'l':np.min}
Pass the dictionary to the how parameter of resample:
In [36]: ts.resample("30Min", how=mhl)
Out[36]:
h m l
2013-01-01 00:00:00 1 0.5 0
2013-01-01 00:30:00 3 2.5 2
2013-01-01 01:00:00 5 4.5 4
2013-01-01 01:30:00 7 6.5 6
2013-01-01 02:00:00 9 8.5 8
2013-01-01 02:30:00 11 10.5 10
2013-01-01 03:00:00 13 12.5 12
2013-01-01 03:30:00 15 14.5 14
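The how= parameter shown above no longer exists in current pandas. A sketch of the same result with the newer interface, assuming a version where resample(...).agg accepts a list of function names; the rename to 'high' and 'low' follows the labels from the question:

```python
import pandas as pd

ts = pd.Series(range(97), index=pd.date_range("2013-01-01", periods=97, freq="15min"))

# One column per aggregation, computed for each 30-minute bin.
mhl = ts.resample("30min").agg(["mean", "max", "min"])
mhl = mhl.rename(columns={"max": "high", "min": "low"})
print(mhl.head())
```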