Pandas: Generate Sequential Timestamp with Jump

I have a df with the following index
df.index
>>> [2010-01-04 10:00:00, ..., 2010-12-31 16:00:00]
The main column is volume.
In the timestamp sequence, weekends and some other weekdays are not present. I want to resample my time index to have the aggregate sum of volume per minute. So I do the following:
df = df.resample('60S').sum()   # equivalent to the older df.resample('60S', how=sum)
There are some missing minutes. In other words, there are minutes where there are no trades. I want to include these missing minutes and add a 0 to the column volume.
To solve this, I would usually do something like:
new_range = pd.date_range('20110104 09:30:00', '20111231 16:00:00',
                          freq='60s').union(df.index)  # '+' on indexes in old pandas meant union
df = df.reindex(new_range)
df = df.between_time(start_time='10:00', end_time='16:00') # time interval per day that I want
df = df.fillna(0)
But now I am stuck with unwanted dates like the weekends and some other days. How can I get rid of the dates that were not originally in my timestamp index?

Just construct the range of datetimes you want and reindex to it.
Entire range
In [9]: rng = pd.date_range('20130101 09:00','20130110 16:00',freq='30T')
In [10]: rng
Out[10]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:00:00, ..., 2013-01-10 16:00:00]
Length: 447, Freq: 30T, Timezone: None
Eliminate times out of range
In [11]: rng = rng.take(rng.indexer_between_time('09:30','16:00'))
In [12]: rng
Out[12]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:30:00, ..., 2013-01-10 16:00:00]
Length: 140, Freq: None, Timezone: None
Eliminate non-weekdays
In [13]: rng = rng[rng.weekday<5]
In [14]: rng
Out[14]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:30:00, ..., 2013-01-10 16:00:00]
Length: 112, Freq: None, Timezone: None
Just looking at the values here; you probably want df.reindex(index=rng).
In [15]: rng.to_series()
Out[15]:
2013-01-01 09:30:00 2013-01-01 09:30:00
2013-01-01 10:00:00 2013-01-01 10:00:00
2013-01-01 10:30:00 2013-01-01 10:30:00
2013-01-01 11:00:00 2013-01-01 11:00:00
2013-01-01 11:30:00 2013-01-01 11:30:00
2013-01-01 12:00:00 2013-01-01 12:00:00
2013-01-01 12:30:00 2013-01-01 12:30:00
2013-01-01 13:00:00 2013-01-01 13:00:00
2013-01-01 13:30:00 2013-01-01 13:30:00
2013-01-01 14:00:00 2013-01-01 14:00:00
2013-01-01 14:30:00 2013-01-01 14:30:00
2013-01-01 15:00:00 2013-01-01 15:00:00
2013-01-01 15:30:00 2013-01-01 15:30:00
2013-01-01 16:00:00 2013-01-01 16:00:00
2013-01-02 09:30:00 2013-01-02 09:30:00
...
2013-01-09 16:00:00 2013-01-09 16:00:00
2013-01-10 09:30:00 2013-01-10 09:30:00
2013-01-10 10:00:00 2013-01-10 10:00:00
2013-01-10 10:30:00 2013-01-10 10:30:00
2013-01-10 11:00:00 2013-01-10 11:00:00
2013-01-10 11:30:00 2013-01-10 11:30:00
2013-01-10 12:00:00 2013-01-10 12:00:00
2013-01-10 12:30:00 2013-01-10 12:30:00
2013-01-10 13:00:00 2013-01-10 13:00:00
2013-01-10 13:30:00 2013-01-10 13:30:00
2013-01-10 14:00:00 2013-01-10 14:00:00
2013-01-10 14:30:00 2013-01-10 14:30:00
2013-01-10 15:00:00 2013-01-10 15:00:00
2013-01-10 15:30:00 2013-01-10 15:30:00
2013-01-10 16:00:00 2013-01-10 16:00:00
Length: 112
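To close the loop on the original question, a minimal sketch of the final step, assuming df is the minute-level volume frame from the question and that zero is the right fill for tradeless minutes:
# Reindex to the cleaned-up range; minutes with no trades get volume 0.
df = df.reindex(rng, fill_value=0)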
You could also start from a constructed business-day frequency range (and/or use a custom business day if you want to skip holidays; custom business days are new in 0.14.0). A sketch of that follows.
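A minimal sketch of that approach, assuming the US federal holiday calendar is the one you want (pandas ships it as USFederalHolidayCalendar; swap in your own Holiday rules otherwise):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

# Business days only, skipping US federal holidays.
bday = CustomBusinessDay(calendar=USFederalHolidayCalendar())
days = pd.date_range('20130101', '20130110', freq=bday)

# Expand each surviving day into the intraday 09:30-16:00 grid at 30-minute steps.
rng = pd.DatetimeIndex([d + t for d in days
                        for t in pd.timedelta_range('09:30:00', '16:00:00', freq='30T')])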

Related

Generating list of 5 minute interval between two times

I have the following strings:
start = "07:00:00"
end = "17:00:00"
How can I generate a list of 5-minute intervals between those times, i.e.
["07:00:00","07:05:00",...,"16:55:00","17:00:00"]
This works for me; I'm sure you can figure out how to put the results in a list instead of printing them out (a list-building variant is sketched after the output below):
>>> import datetime
>>> start = "07:00:00"
>>> end = "17:00:00"
>>> delta = datetime.timedelta(minutes=5)
>>> start = datetime.datetime.strptime(start, '%H:%M:%S')
>>> end = datetime.datetime.strptime(end, '%H:%M:%S')
>>> t = start
>>> while t <= end:
...     print(t.strftime('%H:%M:%S'))
...     t += delta
...
07:00:00
07:05:00
07:10:00
07:15:00
07:20:00
07:25:00
07:30:00
07:35:00
07:40:00
07:45:00
07:50:00
07:55:00
08:00:00
08:05:00
08:10:00
08:15:00
08:20:00
08:25:00
08:30:00
08:35:00
08:40:00
08:45:00
08:50:00
08:55:00
09:00:00
09:05:00
09:10:00
09:15:00
09:20:00
09:25:00
09:30:00
09:35:00
09:40:00
09:45:00
09:50:00
09:55:00
10:00:00
10:05:00
10:10:00
10:15:00
10:20:00
10:25:00
10:30:00
10:35:00
10:40:00
10:45:00
10:50:00
10:55:00
11:00:00
11:05:00
11:10:00
11:15:00
11:20:00
11:25:00
11:30:00
11:35:00
11:40:00
11:45:00
11:50:00
11:55:00
12:00:00
12:05:00
12:10:00
12:15:00
12:20:00
12:25:00
12:30:00
12:35:00
12:40:00
12:45:00
12:50:00
12:55:00
13:00:00
13:05:00
13:10:00
13:15:00
13:20:00
13:25:00
13:30:00
13:35:00
13:40:00
13:45:00
13:50:00
13:55:00
14:00:00
14:05:00
14:10:00
14:15:00
14:20:00
14:25:00
14:30:00
14:35:00
14:40:00
14:45:00
14:50:00
14:55:00
15:00:00
15:05:00
15:10:00
15:15:00
15:20:00
15:25:00
15:30:00
15:35:00
15:40:00
15:45:00
15:50:00
15:55:00
16:00:00
16:05:00
16:10:00
16:15:00
16:20:00
16:25:00
16:30:00
16:35:00
16:40:00
16:45:00
16:50:00
16:55:00
17:00:00
>>>
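For reference, the same loop accumulating the strings into a list instead of printing them; standard library only, nothing assumed beyond the code above:
import datetime

start = datetime.datetime.strptime("07:00:00", '%H:%M:%S')
end = datetime.datetime.strptime("17:00:00", '%H:%M:%S')
delta = datetime.timedelta(minutes=5)

times = []
t = start
while t <= end:  # inclusive of the 17:00:00 endpoint
    times.append(t.strftime('%H:%M:%S'))
    t += delta
# times == ['07:00:00', '07:05:00', ..., '16:55:00', '17:00:00']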
Try:
# import modules
from datetime import datetime, timedelta

# create starting and ending datetime objects from the strings
start = datetime.strptime("07:00:00", "%H:%M:%S")
end = datetime.strptime("17:00:00", "%H:%M:%S")

# gap in minutes
min_gap = 5

# compute the datetime intervals; the +1 makes the range include 17:00:00
arr = [(start + timedelta(minutes=min_gap * i)).strftime("%H:%M:%S")
       for i in range(int((end - start).total_seconds() / 60.0 / min_gap) + 1)]

print(arr)
# ['07:00:00', '07:05:00', '07:10:00', '07:15:00', '07:20:00', '07:25:00', '07:30:00', ..., '17:00:00']
Explanations:
First, you need to convert the time strings to datetime objects; strptime does that.
Then, we find the number of minutes between the starting and the ending datetime:
(end - start).total_seconds() / 60.0
However, we only want one element every n minutes, so we divide by n. And since we iterate over that count in a for loop, we convert it to an int, adding 1 so that the end time itself is included:
int((end - start).total_seconds() / 60.0 / min_gap) + 1
Then, at each step of the loop, we add the corresponding number of minutes to the initial datetime. The timedelta function is designed for exactly this; we pass it the number of minutes to add: min_gap * i.
Finally, we convert each datetime object back to a string using strftime.
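Since the rest of this page leans on pandas anyway, the same list can be built in one line with it; a sketch, assuming pandas is importable (date_range attaches today's date to the bare times, so only the time part is formatted back out):
import pandas as pd

times = pd.date_range('07:00:00', '17:00:00', freq='5min').strftime('%H:%M:%S').tolist()
# ['07:00:00', '07:05:00', ..., '17:00:00']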

Grouping dates by 5 minute periods irrespective of day

I have a DataFrame with data similar to the following
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta

df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
                       for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5-minute period over the course of a 24-hour day, without considering which day those values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
    aggregates.append(
        (
            start_time,
            df.between_time(start_time.time(),
                            (start_time + timedelta(minutes=5)).time(),
                            include_end=False).value.mean()
        )
    )
    start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index to 5-minute intervals and then group on the time attribute, aggregating the mean.
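A self-contained sketch of that approach for anyone running it fresh (assumes pandas >= 0.18, where DatetimeIndex.floor was added; the toy data mimics the question's jittered 5-minute grid):
import pandas as pd
import numpy as np

# Toy data: three days of ~5-minute observations with up-to-a-minute jitter.
idx = pd.date_range('2016-01-02', periods=3 * 288, freq='5min')
idx = idx + pd.to_timedelta(np.random.randint(-60, 60, idx.size), unit='s')
df = pd.DataFrame({'value': np.random.randn(idx.size)}, index=idx)

# Snap each timestamp onto the 5-minute grid, then group by time-of-day
# across all days and take the mean.
out = df.groupby(df.index.floor('5min').time).mean()
print(out.head())  # up to 288 rows, one per 5-minute slot of the day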

iterate over days (pandas)

How can I iterate over the days in a dataframe in pandas?
Example:
My dataframe:
time consumption
time
2016-10-17 09:00:00 2016-10-17 09:00:00 2754.483333
2016-10-17 10:00:00 2016-10-17 10:00:00 2135.966666
2016-10-17 11:00:00 2016-10-17 11:00:00 1497.716666
2016-10-17 12:00:00 2016-10-17 12:00:00 448.100000
2016-10-24 09:00:00 2016-10-24 09:00:00 1527.716666
2016-10-24 10:00:00 2016-10-24 10:00:00 1219.833333
2016-10-24 11:00:00 2016-10-24 11:00:00 1284.350000
2016-10-24 12:00:00 2016-10-24 12:00:00 14195.633333
2016-10-31 09:00:00 2016-10-31 09:00:00 2120.933333
2016-10-31 10:00:00 2016-10-31 10:00:00 1630.700000
2016-10-31 11:00:00 2016-10-31 11:00:00 1241.866666
2016-10-31 12:00:00 2016-10-31 12:00:00 1156.266666
Pseudocode:
for day in df:
    print(day)
First iteration return:
time consumption
time
2016-10-17 09:00:00 2016-10-17 09:00:00 2754.483333
2016-10-17 10:00:00 2016-10-17 10:00:00 2135.966666
2016-10-17 11:00:00 2016-10-17 11:00:00 1497.716666
2016-10-17 12:00:00 2016-10-17 12:00:00 448.100000
Second iteration return:
2016-10-24 09:00:00 2016-10-24 09:00:00 1527.716666
2016-10-24 10:00:00 2016-10-24 10:00:00 1219.833333
2016-10-24 11:00:00 2016-10-24 11:00:00 1284.350000
2016-10-24 12:00:00 2016-10-24 12:00:00 14195.633333
Third iteration return :
2016-10-31 09:00:00 2016-10-31 09:00:00 2120.933333
2016-10-31 10:00:00 2016-10-31 10:00:00 1630.700000
2016-10-31 11:00:00 2016-10-31 11:00:00 1241.866666
2016-10-31 12:00:00 2016-10-31 12:00:00 1156.266666
Use groupby by the date, which is slightly different from grouping by day:
# groupby by index date
for idx, day in df.groupby(df.index.date):
    print(day)
time consumption
time
2016-10-17 09:00:00 2016-10-17 09:00:00 2754.483333
2016-10-17 10:00:00 2016-10-17 10:00:00 2135.966666
2016-10-17 11:00:00 2016-10-17 11:00:00 1497.716666
2016-10-17 12:00:00 2016-10-17 12:00:00 448.100000
time consumption
time
2016-10-24 09:00:00 2016-10-24 09:00:00 1527.716666
2016-10-24 10:00:00 2016-10-24 10:00:00 1219.833333
2016-10-24 11:00:00 2016-10-24 11:00:00 1284.350000
2016-10-24 12:00:00 2016-10-24 12:00:00 14195.633333
time consumption
time
2016-10-31 09:00:00 2016-10-31 09:00:00 2120.933333
2016-10-31 10:00:00 2016-10-31 10:00:00 1630.700000
2016-10-31 11:00:00 2016-10-31 11:00:00 1241.866666
2016-10-31 12:00:00 2016-10-31 12:00:00 1156.266666
Or:
# groupby by column time
for idx, day in df.groupby(df.time.dt.date):
    print(day)
time consumption
time
2016-10-17 09:00:00 2016-10-17 09:00:00 2754.483333
2016-10-17 10:00:00 2016-10-17 10:00:00 2135.966666
2016-10-17 11:00:00 2016-10-17 11:00:00 1497.716666
2016-10-17 12:00:00 2016-10-17 12:00:00 448.100000
time consumption
time
2016-10-24 09:00:00 2016-10-24 09:00:00 1527.716666
2016-10-24 10:00:00 2016-10-24 10:00:00 1219.833333
2016-10-24 11:00:00 2016-10-24 11:00:00 1284.350000
2016-10-24 12:00:00 2016-10-24 12:00:00 14195.633333
time consumption
time
2016-10-31 09:00:00 2016-10-31 09:00:00 2120.933333
2016-10-31 10:00:00 2016-10-31 10:00:00 1630.700000
2016-10-31 11:00:00 2016-10-31 11:00:00 1241.866666
2016-10-31 12:00:00 2016-10-31 12:00:00 1156.266666
The difference shows up if the first 2 rows are changed to a different month: grouping by df.index.day lumps 2016-09-17 in with 2016-10-17 (same day number), while grouping by df.index.date keeps them apart:
for idx, day in df.groupby(df.index.day):
    print(day)
time consumption
time
2016-09-17 09:00:00 2016-10-17 09:00:00 2754.483333
2016-09-17 10:00:00 2016-10-17 10:00:00 2135.966666
2016-10-17 11:00:00 2016-10-17 11:00:00 1497.716666
2016-10-17 12:00:00 2016-10-17 12:00:00 448.100000
time consumption
time
2016-10-24 09:00:00 2016-10-24 09:00:00 1527.716666
2016-10-24 10:00:00 2016-10-24 10:00:00 1219.833333
2016-10-24 11:00:00 2016-10-24 11:00:00 1284.350000
2016-10-24 12:00:00 2016-10-24 12:00:00 14195.633333
time consumption
time
2016-10-31 09:00:00 2016-10-31 09:00:00 2120.933333
2016-10-31 10:00:00 2016-10-31 10:00:00 1630.700000
2016-10-31 11:00:00 2016-10-31 11:00:00 1241.866666
2016-10-31 12:00:00 2016-10-31 12:00:00 1156.266666
for idx, day in df.groupby(df.index.date):
    print(day)
time consumption
time
2016-09-17 09:00:00 2016-10-17 09:00:00 2754.483333
2016-09-17 10:00:00 2016-10-17 10:00:00 2135.966666
time consumption
time
2016-10-17 11:00:00 2016-10-17 11:00:00 1497.716666
2016-10-17 12:00:00 2016-10-17 12:00:00 448.100000
time consumption
time
2016-10-24 09:00:00 2016-10-24 09:00:00 1527.716666
2016-10-24 10:00:00 2016-10-24 10:00:00 1219.833333
2016-10-24 11:00:00 2016-10-24 11:00:00 1284.350000
2016-10-24 12:00:00 2016-10-24 12:00:00 14195.633333
time consumption
time
2016-10-31 09:00:00 2016-10-31 09:00:00 2120.933333
2016-10-31 10:00:00 2016-10-31 10:00:00 1630.700000
2016-10-31 11:00:00 2016-10-31 11:00:00 1241.866666
2016-10-31 12:00:00 2016-10-31 12:00:00 1156.266666
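An alternative sketch using pd.Grouper, which groups on the DatetimeIndex at a daily frequency; note that, unlike grouping by df.index.date, it can emit empty groups for calendar days that have no rows:
for day_start, day in df.groupby(pd.Grouper(freq='D')):
    if not day.empty:  # skip the empty gap days
        print(day)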

Pandas filtering values in dataframe

I have this dataframe. The columns represent the highs and the lows in daily EURUSD price:
df.low df.high
2013-01-17 16:00:00 1.33394 2013-01-17 20:00:00 1.33874
2013-01-18 18:00:00 1.32805 2013-01-18 09:00:00 1.33983
2013-01-21 00:00:00 1.32962 2013-01-21 09:00:00 1.33321
2013-01-22 11:00:00 1.32667 2013-01-22 09:00:00 1.33715
2013-01-23 17:00:00 1.32645 2013-01-23 14:00:00 1.33545
2013-01-24 10:00:00 1.32860 2013-01-24 18:00:00 1.33926
2013-01-25 04:00:00 1.33497 2013-01-25 17:00:00 1.34783
2013-01-28 10:00:00 1.34246 2013-01-28 16:00:00 1.34771
2013-01-29 13:00:00 1.34143 2013-01-29 21:00:00 1.34972
2013-01-30 08:00:00 1.34820 2013-01-30 21:00:00 1.35873
2013-01-31 13:00:00 1.35411 2013-01-31 17:00:00 1.35944
I combined them into a single column (df.extremes), ordered by timestamp.
df.extremes
2013-01-17 16:00:00 1.33394
2013-01-17 20:00:00 1.33874
2013-01-18 18:00:00 1.32805
2013-01-18 09:00:00 1.33983
2013-01-21 00:00:00 1.32962
2013-01-21 09:00:00 1.33321
2013-01-22 09:00:00 1.33715
2013-01-22 11:00:00 1.32667
2013-01-23 14:00:00 1.33545
2013-01-23 17:00:00 1.32645
2013-01-24 10:00:00 1.32860
2013-01-24 18:00:00 1.33926
2013-01-25 04:00:00 1.33497
2013-01-25 17:00:00 1.34783
2013-01-28 10:00:00 1.34246
2013-01-28 16:00:00 1.34771
2013-01-29 13:00:00 1.34143
2013-01-29 21:00:00 1.34972
2013-01-30 08:00:00 1.34820
2013-01-30 21:00:00 1.35873
2013-01-31 13:00:00 1.35411
2013-01-31 17:00:00 1.35944
But now I want to filter some values out of df.extremes.
To explain what to filter, I'll try this "pseudocode":
IF, following the index, we move from: previous df.low --> df.low --> df.high:
    IF df.low > previous df.low: delete df.low
    IF df.low < previous df.low: delete previous df.low
If I try to work this out with a for loop, it gives me a KeyError: 1.3339399999999999.
day = df.groupby(pd.TimeGrouper('D'))
is_day_min = day.extremes.apply(lambda x: x == x.min())
for i in df.extremes:
    if is_day_min[i] == True and is_day_min[i+1] == True:
        if df.extremes[i] > df.extremes[i+1]:
            del df.extremes[i]
for i in df.extremes:
    if is_day_min[i] == True and is_day_min[i+1] == True:
        if df.extremes[i] < df.extremes[i+1]:
            del df.extremes[i+1]
How can I filter/delete the values as explained in the pseudocode?
I am struggling with the indexing and booleans and can't solve this. I suspect I need to apply a lambda function, but I don't know how. I hope I've been clear enough.
All you're really missing is a way of saying "previous low" in a vectorized fashion. That's spelled df.low.shift(1); note that shift(-1) looks forward and would give you the next low instead. (As for the KeyError: iterating df.extremes yields the float values themselves, which you then use as lookup keys.) With the shifts, the two rules in your pseudocode become boolean masks you can combine:
prev = df.low.shift(1)   # previous low
nxt = df.low.shift(-1)   # next low
drop = (df.low > prev) | (nxt < df.low)   # rule 1 | rule 2
filtered_df = df[~drop]
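A tiny end-to-end demo of that filter on toy lows (the values are made up; the two rules keep only the lower of each run of consecutive lows):
import pandas as pd

df = pd.DataFrame({'low': [1.33394, 1.32805, 1.32962, 1.32667]},
                  index=pd.date_range('2013-01-17', periods=4, freq='D'))

prev = df.low.shift(1)
drop = (df.low > prev) | (df.low.shift(-1) < df.low)
print(df[~drop])  # keeps 1.32805 and 1.32667, the local minima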

python pandas date_range when importing with read_csv

I have a .csv file with some data in the following format:
1.69511909, 0.57561167, 0.31437427, 0.35458831, 0.15841189, 0.28239582, -0.18180907, 1.34761404, -1.5059083, 1.29246638
-1.66764664, 0.1488095, 1.03832221, -0.35229205, 1.35705861, -1.56747104, -0.36783851, -0.57636948, 0.9854391, 1.63031066
0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838, -0.08557089, 1.71855596, -0.61235066, -0.32520105, 1.54162629
Every line corresponds to a specific day, and every record in a line corresponds to a specific hour in that day.
Is there is a convenient way of importing the data with read_csv such that everything would be correctly indexed, i.e. the importing function would discriminate different days (lines), and hours within days (separate records in lines).
Something like this. Note that I couldn't copy your whole string for some reason, so my dataset is cut off.
Read in the string (it comes in as a DataFrame because mine had newlines in it), then coerce it to a Series. StringIO comes from the io module here, and data holds the csv text.
In [23]: s = pd.read_csv(StringIO(data)).values
In [24]: s
Out[24]:
array([[-1.66764664, 0.1488095 , 1.03832221, -0.35229205, 1.35705861,
-1.56747104, -0.36783851, -0.57636948, 0.9854391 , 1.63031066],
[ 0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838,
-0.08557089, 1.71855596, nan, nan, nan]])
In [25]: s = pd.Series(pd.read_csv(StringIO(data)).values.ravel())
In [26]: s
Out[26]:
0 -1.667647
1 0.148810
2 1.038322
3 -0.352292
4 1.357059
5 -1.567471
6 -0.367839
7 -0.576369
8 0.985439
9 1.630311
10 0.877638
11 0.607572
12 0.649083
13 -0.683577
14 0.334998
15 -0.085571
16 1.718556
17 NaN
18 NaN
19 NaN
dtype: float64
Just set the index directly. Note that you are solely responsible for alignment; it is VERY easy to end up off by one.
In [27]: s.index = pd.date_range('20130101',freq='H',periods=len(s))
In [28]: s
Out[28]:
2013-01-01 00:00:00 -1.667647
2013-01-01 01:00:00 0.148810
2013-01-01 02:00:00 1.038322
2013-01-01 03:00:00 -0.352292
2013-01-01 04:00:00 1.357059
2013-01-01 05:00:00 -1.567471
2013-01-01 06:00:00 -0.367839
2013-01-01 07:00:00 -0.576369
2013-01-01 08:00:00 0.985439
2013-01-01 09:00:00 1.630311
2013-01-01 10:00:00 0.877638
2013-01-01 11:00:00 0.607572
2013-01-01 12:00:00 0.649083
2013-01-01 13:00:00 -0.683577
2013-01-01 14:00:00 0.334998
2013-01-01 15:00:00 -0.085571
2013-01-01 16:00:00 1.718556
2013-01-01 17:00:00 NaN
2013-01-01 18:00:00 NaN
2013-01-01 19:00:00 NaN
Freq: H, dtype: float64
First just read in the DataFrame:
df = pd.read_csv(file_name, sep=r',\s+', header=None, engine='python')  # regex separators need the python engine
Then set the index to be the dates and the columns to be the hours*
df.index = pd.date_range('2012-01-01', freq='D', periods=len(df))
from pandas.tseries.offsets import Hour
df.columns = [Hour(7+t) for t in df.columns]
In [5]: df
Out[5]:
<7 Hours> <8 Hours> <9 Hours> <10 Hours> <11 Hours> <12 Hours> <13 Hours> <14 Hours> <15 Hours> <16 Hours>
2012-01-01 1.695119 0.575612 0.314374 0.354588 0.158412 0.282396 -0.181809 1.347614 -1.505908 1.292466
2012-01-02 -1.667647 0.148810 1.038322 -0.352292 1.357059 -1.567471 -0.367839 -0.576369 0.985439 1.630311
2012-01-03 0.877638 0.607572 0.649083 -0.683577 0.334998 -0.085571 1.718556 -0.612351 -0.325201 1.541626
Then stack it and combine the Date and Hour levels of the MultiIndex into a single index:
s = df.stack()
s.index = [x[0]+x[1] for x in s.index]
In [8]: s
Out[8]:
2012-01-01 07:00:00 1.695119
2012-01-01 08:00:00 0.575612
2012-01-01 09:00:00 0.314374
2012-01-01 10:00:00 0.354588
2012-01-01 11:00:00 0.158412
2012-01-01 12:00:00 0.282396
2012-01-01 13:00:00 -0.181809
2012-01-01 14:00:00 1.347614
2012-01-01 15:00:00 -1.505908
2012-01-01 16:00:00 1.292466
2012-01-02 07:00:00 -1.667647
2012-01-02 08:00:00 0.148810
...
* You can use different offsets, e.g. Minute or Second.
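For reference, a compact sketch combining the two answers with current pandas idioms; 'data.csv' and the 2012-01-01 start date are assumptions carried over from the examples above:
import pandas as pd

# One row per day, ten comma-separated hourly values per row, no header.
df = pd.read_csv('data.csv', sep=r',\s+', header=None, engine='python')

# Dates down the rows, hour-of-day offsets across the columns.
df.index = pd.date_range('2012-01-01', periods=len(df), freq='D')
df.columns = pd.timedelta_range('07:00:00', periods=df.shape[1], freq='H')

# Stack into a Series and collapse (date, hour) into a single DatetimeIndex.
s = df.stack()
s.index = s.index.get_level_values(0) + s.index.get_level_values(1)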
