I am using Pandas dataframes with DatetimeIndex to manipulate timeseries data. The data is stored in UTC and I usually keep it that way (with a naive DatetimeIndex), and only use timezones for output. I like it that way because nothing in the world confuses me more than trying to manipulate timezones.
e.g.
In: ts = pd.date_range('2017-01-01 00:00','2017-12-31 23:30',freq='30Min')
data = np.random.rand(17520,1)
df= pd.DataFrame(data,index=ts,columns = ['data'])
df.head()
Out[15]:
data
2017-01-01 00:00:00 0.697478
2017-01-01 00:30:00 0.506914
2017-01-01 01:00:00 0.792484
2017-01-01 01:30:00 0.043271
2017-01-01 02:00:00 0.558461
I want to plot a chart of data versus time for each day of the year so I reshape the dataframe to have time along the index and dates for columns
df.index = [df.index.time,df.index.date]
df_new = df['data'].unstack()
In: df_new.head()
Out :
2017-01-01 2017-01-02 2017-01-03 2017-01-04 2017-01-05 \
00:00:00 0.697478 0.143626 0.189567 0.061872 0.748223
00:30:00 0.506914 0.470634 0.430101 0.551144 0.081071
01:00:00 0.792484 0.045259 0.748604 0.305681 0.333207
01:30:00 0.043271 0.276888 0.034643 0.413243 0.921668
02:00:00 0.558461 0.723032 0.293308 0.597601 0.120549
If I'm not worried about timezones I can plot like this:
fig, ax = plt.subplots()
ax.plot(df_new.index,df_new)
but I want to plot the data in the local timezone (tz = pytz.timezone('Australia/Sydney')), making allowance for daylight saving time. However, the times and dates are no longer Timestamp objects, so I can't use Pandas timezone handling. Or can I?
Assuming I can't, I'm trying to do the shift manually, (given DST starts 1/10 at 2am and finishes 1/4 at 2am), so I've got this far:
df_new[[c for c in df_new.columns if c >= dt.datetime(2017,4,1) and c <dt.datetime(2017,10,1)]].shift_by(+10)
df_new[[c for c in df_new.columns if c < dt.datetime(2017,4,1) or c >= dt.datetime(2017,10,1)]].shift_by(+11)
but am not sure how to write the function shift_by.
(This doesn't handle midnight to 2am on the changeover days correctly, which is not ideal, but I could live with it.)
Use tz_localize + tz_convert on the DatetimeIndex to convert the dataframe dates to a particular timezone.
df.index = df.index.tz_localize('UTC').tz_convert('Australia/Sydney')
df.index = [df.index.time, df.index.date]
Be a little careful when creating the MultiIndex: as you observed, the DST changeover produces rows with duplicate timestamps, so if that's the case, get rid of them with duplicated:
df = df[~df.index.duplicated()]
df = df['data'].unstack()
You can also create subplots with df.plot:
df.plot(subplots=True)
plt.show()
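For reference, the whole conversion can be sketched end to end as below. This is a minimal, self-contained sketch under the assumptions of the question (half-hourly naive UTC data for 2017); the names local, local_df and by_day are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Half-hourly naive UTC data for 2017, as in the question.
ts = pd.date_range('2017-01-01 00:00', '2017-12-31 23:30', freq='30min')
df = pd.DataFrame({'data': np.random.rand(len(ts))}, index=ts)

# Mark the naive index as UTC, then convert to Sydney local time.
# pandas applies the DST rules from the timezone database.
local = df.index.tz_localize('UTC').tz_convert('Australia/Sydney')

# Build a (time, date) MultiIndex from the local wall-clock values.
local_df = df.copy()
local_df.index = pd.MultiIndex.from_arrays(
    [local.time, local.date], names=['time', 'date'])

# Wall-clock times repeat when DST ends, so keep only the first occurrence.
local_df = local_df[~local_df.index.duplicated()]

# Rows are times of day, columns are local dates.
by_day = local_df['data'].unstack()
```

Slots that do not exist in local time (the hour skipped when DST starts, and the partial first and last days) come out as NaN in by_day.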
I have one year's worth of data at four minute time series intervals. I need to always load 24 hours of data and run a function on this dataframe at intervals of eight hours. I need to repeat this process for all the data in the ranges of 2021's start and end dates.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
#Proxy DataFrame
year_df = pd.DataFrame()
start = pd.to_datetime('2021-01-01 00:00:00', infer_datetime_format=True)
end = pd.to_datetime('2021-12-31 23:56:00', infer_datetime_format=True)
myIndex = pd.date_range(start, end, freq='4T')
year_df = year_df.rename(columns={'Timestamp': 'delete'}).drop('delete', axis=1).reindex(myIndex).reset_index().rename(columns={'index':'Timestamp'})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00
This approach avoids explicit for loops but the apply method is essentially a for loop under the hood so it's not that efficient. But until more functionality based on rolling datetime windows is introduced to pandas then this might be the only option.
The example uses the mean of the timestamps. Knowing exactly what function you want to apply may help with a better answer.
s = pd.Series(myIndex, index=myIndex)
def myfunc(e):
temp = s[s.between(e, e+pd.Timedelta("24h"))]
return temp.mean()
s.apply(myfunc)
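If an explicit loop is acceptable, the windows described in the question (24 hours of data, advanced in steps of eight hours) could be sketched as follows. The 'value' column and the use of mean() are placeholders for the real data and function, and the names step, window, stride and out are illustrative.

```python
import pandas as pd

step = pd.Timedelta(minutes=4)
idx = pd.date_range('2021-01-01 00:00:00', '2021-12-31 23:56:00', freq=step)

# Placeholder data: a running counter stands in for the real measurements.
year_df = pd.DataFrame({'value': range(len(idx))}, index=idx)

window = pd.Timedelta(hours=24)   # amount of data loaded each time
stride = pd.Timedelta(hours=8)    # distance between window starts

results = {}
for start in pd.date_range(idx[0], idx[-1], freq=stride):
    stop = start + window - step  # last timestamp inside the window
    if stop > idx[-1]:
        break                     # skip incomplete windows at the end
    chunk = year_df.loc[start:stop]   # label slicing includes both ends
    results[start] = chunk['value'].mean()  # placeholder function

out = pd.Series(results)
```

Each chunk holds 360 rows (24 hours at four-minute spacing), and out is indexed by the start time of each window.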
I have a dataframe with a datetime column. I want to group by the time component only and aggregate, e.g. by taking the mean.
I know that I can use pd.Grouper to group by date AND time, but it doesn't work on time only.
Say we have the following dataframe:
import numpy as np
import pandas as pd
drange = pd.date_range('2019-08-01 00:00', '2019-08-12 12:00', freq='1T')
time = drange.time
c0 = np.random.rand(len(drange))
c1 = np.random.rand(len(drange))
df = pd.DataFrame(dict(drange=drange, time=time, c0=c0, c1=c1))
print(df.head())
drange time c0 c1
0 2019-08-01 00:00:00 00:00:00 0.031946 0.159739
1 2019-08-01 00:01:00 00:01:00 0.809171 0.681942
2 2019-08-01 00:02:00 00:02:00 0.036720 0.133443
3 2019-08-01 00:03:00 00:03:00 0.650522 0.409797
4 2019-08-01 00:04:00 00:04:00 0.239262 0.814565
In this case, the following throws a TypeError:
grouper = pd.Grouper(key='time', freq='5T')
grouped = df.groupby(grouper).mean()
I could set key='drange' to group by date and time and then:
Reset the index
Transform the new column to float
Bin with pd.cut
Cast back to time
Finally group-by and then aggregate
... But I wonder whether there is a cleaner way to achieve the same results.
Series.dt.time/DatetimeIndex.time returns the time as datetime.time. This isn't great because pandas works best with timedelta64, and so your 'time' column is cast to object, losing all datetime functionality.
You can subtract off the normalized date to obtain the time as a timedelta so you can continue to use the datetime tools of pandas. You can floor this to group.
s = (df.drange - df.drange.dt.normalize()).dt.floor('5T')
df.groupby(s).mean()
c0 c1
drange
00:00:00 0.436971 0.530201
00:05:00 0.441387 0.518831
00:10:00 0.465008 0.478130
... ... ...
23:45:00 0.523233 0.515991
23:50:00 0.468695 0.434240
23:55:00 0.569989 0.510291
Alternatively, if you feel unsure of floor, this gives identical output up to the index name:
df['time'] = (df.drange - df.drange.dt.normalize()) # timedelta64[ns]
df.groupby(pd.Grouper(key='time', freq='5T')).mean()
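A small deterministic check of the floor approach may make the binning easier to see. This is only a sketch: the two-day range and the counter values in c0 are made up for illustration, and tod and binned are illustrative names.

```python
import numpy as np
import pandas as pd

# Two full days at one-minute resolution, with a simple counter as data.
drange = pd.date_range('2019-08-01 00:00', '2019-08-02 23:59', freq='1min')
df = pd.DataFrame({'drange': drange,
                   'c0': np.arange(len(drange), dtype=float)})

# Time of day as a timedelta: timestamp minus midnight of the same day.
tod = df['drange'] - df['drange'].dt.normalize()

# Floor to 5-minute bins and average across all days.
binned = df.groupby(tod.dt.floor('5min'))['c0'].mean()
```

There are 288 five-minute bins in a day, and each bin averages the matching minutes from both days.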
When you use DataFrame.groupby you can pass a Series as an argument. Moreover, if your series is a datetime, you can use series.dt to access the properties of the date. In your case df['drange'].dt.hour or df['drange'].dt.time should do it.
# df['drange']=pd.to_datetime(df['drange'])
df.groupby(df['drange'].dt.hour).agg(...)
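As a concrete sketch of grouping by the hour of day, with made-up constant data and illustrative output column names (c0_mean, n):

```python
import numpy as np
import pandas as pd

# Two days of one-minute data with a constant value.
drange = pd.date_range('2019-08-01', periods=2 * 24 * 60, freq='1min')
df = pd.DataFrame({'drange': drange, 'c0': np.ones(len(drange))})

# Group by the hour of day (0 to 23) and aggregate.
hourly = df.groupby(df['drange'].dt.hour).agg(
    c0_mean=('c0', 'mean'),
    n=('c0', 'size'),
)
```

Note that this groups into whole hours, and grouping by .dt.time groups at the resolution of the data (one minute here), so neither gives 5-minute bins on its own.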
I have a time series that I resampled into this dataframe df.
My data runs from 6th June to 28th June. I want to extend it to cover 1st June to 30th June. The count column should be 0 in the extended period only, and keep my real values from the 6th to the 28th.
Out[123]:
count
Timestamp
2009-06-07 02:00:00 1
2009-06-07 03:00:00 0
2009-06-07 04:00:00 0
2009-06-07 05:00:00 0
2009-06-07 06:00:00 0
I need to make the
start date:2009-06-01 00:00:00
end date:2009-06-30 23:00:00
so the data would look something like this:
count
Timestamp
2009-06-01 01:00:00 0
2009-06-01 02:00:00 0
2009-06-01 03:00:00 0
Is there an effective way to do this? The only way I can think of is not very effective. I have been trying since yesterday. Please help.
index = pd.date_range('2009-06-01 00:00:00','2009-06-30 23:00:00', freq='H')
df = pd.DataFrame(np.zeros((len(index), 1)), index=index)
df.columns=['zeros']
result= pd.concat([df2,df])
result1= pd.concat([df,result])
result1 = result1.fillna(0)
del result1['zeros']
You can create a new index with the desired start and end day/times, resample the time series data and aggregate by count, then set the index to the new index.
import pandas as pd
# create the index with the start and end times you want
t_index = pd.DatetimeIndex(pd.date_range(start='2009-06-01', end='2009-06-30 23:00:00', freq="1h"))
# create the data frame
df = pd.DataFrame([['2009-06-07 02:07:42'],
['2009-06-11 17:25:28'],
['2009-06-11 17:50:42'],
['2009-06-11 17:59:18']], columns=['daytime'])
df['daytime'] = pd.to_datetime(df['daytime'])
# resample the data to 1 hour, aggregate by counts,
# then reindex to the full range and fill the na's with 0
df2 = df.resample('1h', on='daytime').count().reindex(t_index).fillna(0)
Note: calling DatetimeIndex() with start/end arguments no longer works; it raises __new__() got an unexpected keyword argument 'start'. Build the index with pd.date_range instead.
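If the hourly counts already exist, as in the question, reindexing onto the full range with fill_value=0 is enough. A minimal sketch, where the four sample counts are made up for illustration:

```python
import pandas as pd

hour = pd.Timedelta(hours=1)

# Hypothetical hourly counts covering only part of the month.
counts = pd.Series(
    [1, 0, 0, 2],
    index=pd.date_range('2009-06-07 02:00', periods=4, freq=hour),
    name='count',
)

# Full hourly range for June 2009.
full_index = pd.date_range('2009-06-01 00:00', '2009-06-30 23:00', freq=hour)

# Existing values are kept; hours outside the original data become 0.
extended = counts.reindex(full_index, fill_value=0)
```

Passing fill_value=0 to reindex keeps the integer dtype, whereas reindexing followed by fillna(0) goes through NaN and converts the column to float.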
Say I have a dataframe with several timestamps and values. I would like to measure Δ values / Δt every 2.5 seconds. Does Pandas provide any utilities for time differentiation?
time_stamp values
19492 2014-10-06 17:59:40.016000-04:00 1832128
167106 2014-10-06 17:59:41.771000-04:00 2671048
202511 2014-10-06 17:59:43.001000-04:00 2019434
161457 2014-10-06 17:59:44.792000-04:00 1294051
203944 2014-10-06 17:59:48.741000-04:00 867856
It most certainly does. First, you'll need to convert your timestamps into a pandas DatetimeIndex, and then use the resampling functions available to series/dataframes indexed with that class. Helpful documentation here. Read more here about offset aliases.
This code should resample your data to 2.5s intervals
#df is your dataframe
index = pd.DatetimeIndex(df['time_stamp'])
values = pd.Series(df['values'].values, index=index)
#Read above link about the different Offset Aliases, S=Seconds
#resample() needs an aggregation; mean() is one reasonable choice here
resampled_values = values.resample('2.5S').mean()
resampled_values.diff() #compute the difference between each point!
That should do it.
If you really want the time derivative, then you also need to divide by the time difference (delta time, dt) since last sample
An example:
dti = pd.DatetimeIndex([
'2018-01-01 00:00:00',
'2018-01-01 00:00:02',
'2018-01-01 00:00:03'])
X = pd.DataFrame({'data': [1,3,4]}, index=dti)
X.head()
data
2018-01-01 00:00:00 1
2018-01-01 00:00:02 3
2018-01-01 00:00:03 4
You can find the time delta by using the diff() on the DatetimeIndex. This gives you a series of type Time Deltas. You only need the values in seconds, though
dt = pd.Series(X.index).diff().dt.total_seconds().values
dXdt = X.diff().div(dt, axis=0)
dXdt.head()
data
2018-01-01 00:00:00 NaN
2018-01-01 00:00:02 1.0
2018-01-01 00:00:03 1.0
As you can see, this approach takes into account that there are two seconds between the first two values, and only one between the two last values. :)
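Putting the two answers together, a rate on a fixed 2.5 second grid could be sketched like this. It assumes that taking the last observation in each bin is an acceptable way to sample the values, and it drops the -04:00 offset from the question's timestamps for brevity; step, sampled and rate are illustrative names.

```python
import pandas as pd

# Sample data from the question (timezone offset dropped for brevity).
times = pd.to_datetime([
    '2014-10-06 17:59:40.016',
    '2014-10-06 17:59:41.771',
    '2014-10-06 17:59:43.001',
    '2014-10-06 17:59:44.792',
    '2014-10-06 17:59:48.741',
])
values = pd.Series([1832128, 2671048, 2019434, 1294051, 867856], index=times)

step = pd.Timedelta(seconds=2.5)

# Last observed value in each 2.5 s bin; empty bins become NaN.
sampled = values.resample(step).last()

# Change per second between consecutive bins.
rate = sampled.diff() / step.total_seconds()
```

Bins with no observations stay NaN, and so do the differences that touch them, so gaps in the data are not silently interpolated.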
I need to convert a timezone aware date_range (TimeStamps) to UNIX epoch values for use in an external Javascript library.
My approach is:
# Create localized test data for one day
rng = pd.date_range('1.1.2014', freq='H', periods=24, tz="Europe/Berlin")
val = np.random.randn(24)
df = pd.DataFrame(data=val, index=rng, columns=['values'])
# Reset index as df column
df = df.reset_index()
# Convert the index column to the desired UNIX epoch format
df['index'] = df['index'].apply(lambda x: x.value // 10**6 )
df['index'] contains the UNIX epoch values as expected, but they are stored in UTC(!).
I suppose this is because pandas stores timestamps in numpy UTC datetime64 values under the hood.
Is there a smart way to get "right" epoch values in the requested time zone?
This proposal doesn't work with DST
In [17]: df
Out[17]:
values
2014-01-01 00:00:00+01:00 1.027799
2014-01-01 01:00:00+01:00 1.579586
2014-01-01 02:00:00+01:00 0.202947
2014-01-01 03:00:00+01:00 -0.214921
2014-01-01 04:00:00+01:00 0.021499
2014-01-01 05:00:00+01:00 -1.368302
2014-01-01 06:00:00+01:00 -0.261738
2014-01-01 22:00:00+01:00 0.808506
2014-01-01 23:00:00+01:00 0.459895
[24 rows x 1 columns]
Use the index method asi8 to convert to int64 (which is already in ns since epoch)
These are the UTC times!
In [18]: df.index.asi8//10**6
Out[18]:
array([1388530800000, 1388534400000, 1388538000000, 1388541600000,
1388545200000, 1388548800000, 1388552400000, 1388556000000,
1388559600000, 1388563200000, 1388566800000, 1388570400000,
1388574000000, 1388577600000, 1388581200000, 1388584800000,
1388588400000, 1388592000000, 1388595600000, 1388599200000,
1388602800000, 1388606400000, 1388610000000, 1388613600000])
These are the local times since epoch. Note that this is NOT a public method; normally, I would always exchange UTC data (and the timezone if you need it).
In [7]: df.index._local_timestamps()//10**6
Out[7]:
array([1388534400000, 1388538000000, 1388541600000, 1388545200000,
1388548800000, 1388552400000, 1388556000000, 1388559600000,
1388563200000, 1388566800000, 1388570400000, 1388574000000,
1388577600000, 1388581200000, 1388584800000, 1388588400000,
1388592000000, 1388595600000, 1388599200000, 1388602800000,
1388606400000, 1388610000000, 1388613600000, 1388617200000])
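The same two arrays can also be obtained with public API only. A sketch, assuming that "local epoch" means the wall-clock time reinterpreted as if it were UTC; the names epoch, one_ms, utc_ms and local_ms are illustrative.

```python
import pandas as pd

rng = pd.date_range('2014-01-01', periods=24,
                    freq=pd.Timedelta(hours=1), tz='Europe/Berlin')

epoch = pd.Timestamp('1970-01-01')
one_ms = pd.Timedelta(milliseconds=1)

# True UNIX epoch milliseconds: convert to UTC, drop the tz, subtract epoch.
utc_ms = (rng.tz_convert('UTC').tz_localize(None) - epoch) // one_ms

# "Local" milliseconds: drop the tz while keeping the wall-clock time.
local_ms = (rng.tz_localize(None) - epoch) // one_ms
```

For this winter day the two differ by exactly one hour (3600000 ms). Around a DST change the local values would repeat or skip an hour, which is one reason to exchange UTC values plus the timezone name instead.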