I have a dataframe that I grouped with function groupby. In order to do so, I had to use DatetimeIndex. However, I would like to transform my datetimeindex as integer to use it as index for aa dynamic optimization model. I'm able to transform my date time index as float by not as integer differenciating hours.
# My data look like this:
[ Date Hour MktDemand HOEP hour
Datetime
2019-01-01 01:00:00 2019-01-01 1 16231 0.00 0
2019-01-01 02:00:00 2019-01-01 2 16051 0.00 1
2019-01-01 03:00:00 2019-01-01 3 15805 -0.11 2
2019-01-01 04:00:00 2019-01-01 4 15580 -1.84 3
2019-01-01 05:00:00 2019-01-01 5 15609 -0.47 4
...
import datetime as dt
df['Datetime'] = pd.to_datetime(df.Date) + pd.to_timedelta(df.Hour, unit='h')
df['datetime'] = pd.to_datetime(df.Date) + pd.to_timedelta(df.Hour, unit='h')
grouped = df.set_index('Datetime').groupby(pd.Grouper(freq="15d"))
for name, group in grouped:
print(pd.to_numeric(group.index, downcast='integer'))
# It returns this:
Int64Index([1546304400000000000, 1546308000000000000, 1546311600000000000,
1546315200000000000, 1546318800000000000, 1546322400000000000,
1546326000000000000, 1546329600000000000, 1546333200000000000,
1546336800000000000,
...
# However, I would like to have integers in this format:
20190523
20190524
# I tried this but it doesn't work:
for name, group in grouped:
print(pd.to_timedelta(group.index).dt.total_hours().astype(int))
ERROR: dtype datetime64[ns] cannot be converted to timedelta64[ns]
The integers you expect represent a datetime format; they're not an actual numeric representation of datetime (which pd.to_numeric gives you, as nanoseconds since 1970-1-1 UTC).
Therefore, you'll want to format to string and then convert to integer.
Ex:
import pandas as pd
# some synthetic example data...
dti = pd.date_range("2015", "2016", freq='d')
df = pd.DataFrame({'some_value': [i for i in range(len(dti))]})
grouped = df.set_index(dti).groupby(pd.Grouper(freq="15d"))
for name, group in grouped:
print(group.index.strftime('%Y%m%d').astype(int))
# gives you e.g.
Int64Index([20150101, 20150102, 20150103, 20150104, 20150105, 20150106,
20150107, 20150108, 20150109, 20150110, 20150111, 20150112,
20150113, 20150114, 20150115],
dtype='int64')
...
You could also extend the strftime directive to give you additional parameters like hours or minutes.
Related
Say I have this dataframe:
import pandas as pd
import datetime
x = [datetime.time(23,0),datetime.time(6,0),datetime.time(18,0),datetime.time(17,0)]
y = [datetime.time(22,0),datetime.time(9,0),datetime.time(9,0),datetime.time(23,0)]
df = pd.DataFrame({'time1':x,'time2':y})
which looks like this:
How would I compute the absolute difference between the two columns? Subtraction doesn't work. The result should look like this:
df['abs_diff'] = [1,3,9,6]
Thanks so much!
Pandas doesn't like datetime objects so very much; it labels the series as object dtype, so you can't really do any arithmetics on those. You can convert the data to Pandas' timedelta:
df['abs_diff'] = (pd.to_timedelta(df['time1'].astype(str)) # convert to timedelta
.sub(pd.to_timedelta(df['time2'].astype(str))) # then you can subtract
.abs().div(pd.Timedelta('1H')) # and absolute value, and divide
)
Output:
time1 time2 abs_diff
0 23:00:00 22:00:00 1.0
1 06:00:00 09:00:00 3.0
2 18:00:00 09:00:00 9.0
3 17:00:00 23:00:00 6.0
I need to split a year in enumerated 20-minute chunks and then find the sequece number of corresponding time range chunk for randomly distributed timestamps in a year for further processing.
I tried to use pandas for this, but I can't find a way to index timestamp in date_range:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from datetime import timedelta
if __name__ == '__main__':
date_start = pd.to_datetime('2018-01-01')
date_end = date_start + timedelta(days=365)
index = pd.date_range(start=date_start, end=date_end, freq='20min')
data = range(len(index))
df = pd.DataFrame(data, index=index, columns=['A'])
print(df)
event_ts = pd.to_datetime('2018-10-14 02:17:43')
# How to find the corresponding df['A'] for event_ts?
# print(df.loc[event_ts])
Output:
A
2018-01-01 00:00:00 0
2018-01-01 00:20:00 1
2018-01-01 00:40:00 2
2018-01-01 01:00:00 3
2018-01-01 01:20:00 4
... ...
2018-12-31 22:40:00 26276
2018-12-31 23:00:00 26277
2018-12-31 23:20:00 26278
2018-12-31 23:40:00 26279
2019-01-01 00:00:00 26280
[26281 rows x 1 columns]
What is the best practice to do it in python? I imagine how to find the range "by hand" converting date_range to integers and comparing it, but may be there are some elegant pandas/python-style ways to do it?
First of all, I've worked with a small interval, one week:
date_end = date_start + timedelta(days=7)
Then I've followed your steps, and got a portion of your dataframe.
My event_ts is this:
event_ts = pd.to_datetime('2018-01-04 02:17:43')
And I've chosen to reset the index, and have a dataframe easy to manipulate:
df = df.reset_index()
With this code I found the last value where event_ts belongs:
for i in df['index']:
if i <= event_ts:
run.append(i)
print(max(run))
#2018-01-04 02:00:00
or:
top = max(run)
Finally:
df.loc[df['index'] == top].index[0]
222
event_ts belongs to index df[222]
I have a dataframe with a datetime column. I want to group by the time component only and aggregate, e.g. by taking the mean.
I know that I can use pd.Grouper to group by date AND time, but it doesn't work on time only.
Say we have the following dataframe:
import numpy as np
import pandas as pd
drange = pd.date_range('2019-08-01 00:00', '2019-08-12 12:00', freq='1T')
time = drange.time
c0 = np.random.rand(len(drange))
c1 = np.random.rand(len(drange))
df = pd.DataFrame(dict(drange=drange, time=time, c0=c0, c1=c1))
print(df.head())
drange time c0 c1
0 2019-08-01 00:00:00 00:00:00 0.031946 0.159739
1 2019-08-01 00:01:00 00:01:00 0.809171 0.681942
2 2019-08-01 00:02:00 00:02:00 0.036720 0.133443
3 2019-08-01 00:03:00 00:03:00 0.650522 0.409797
4 2019-08-01 00:04:00 00:04:00 0.239262 0.814565
In this case, the following throws a TypeError:
grouper = pd.Grouper(key='time', freq='5T')
grouped = df.groupby(grouper).mean()
I could set key=drange to group by date and time and then:
Reset the index
Transform the new column to float
Bin with pd.cut
Cast back to time
Finally group-by and then aggregate
... But I wonder whether there is a cleaner way to achieve the same results.
Series.dt.time/DatetimeIndex.time returns the time as datetime.time. This isn't great because pandas works best withtimedelta64 and so your 'time' column is cast to object, losing all datetime functionality.
You can subtract off the normalized date to obtain the time as a timedelta so you can continue to use the datetime tools of pandas. You can floor this to group.
s = (df.drange - df.drange.dt.normalize()).dt.floor('5T')
df.groupby(s).mean()
c0 c1
drange
00:00:00 0.436971 0.530201
00:05:00 0.441387 0.518831
00:10:00 0.465008 0.478130
... ... ...
23:45:00 0.523233 0.515991
23:50:00 0.468695 0.434240
23:55:00 0.569989 0.510291
Alternatively if you feel unsure of floor, this gets the identical output up to the index name
df['time'] = (df.drange - df.drange.dt.normalize()) # timedelta64[ns]
df.groupby(pd.Grouper(key='time', freq='5T')).mean()
When you use DataFrame.groupby you can a Series an argument. Moreover, if your series is a datetime, you can use the series.dt to access the properties of date. In your case df['drange'].dt.hour or df['drange'].dt.time should do it.
# df['drange']=pd.to_datetime(df['drange'])
df.groupby(df['drange'].dt.hour).agg(...)
I have a DataFrame like this:
Date X
....
2014-01-02 07:00:00 16
2014-01-02 07:15:00 20
2014-01-02 07:30:00 21
2014-01-02 07:45:00 33
2014-01-02 08:00:00 22
....
2014-01-02 23:45:00 0
....
1)
So my "Date" Column is a datetime and has values vor every 15min of a day.
What i want is to remove ALL Rows where the time is NOT between 08:00 and 18:00 o'clock.
2)
Some days are missing in the datas...how could i put the missing days in my dataframe and fill them with the value 0 as X.
My approach: Create a new Series between two Dates and set 15min as frequenz and concat my X Column with the new created Series. Is that right?
Edit:
Problem for my second Question:
#create new full DF without missing dates and reindex
full_range = pandas.date_range(start='2014-01-02', end='2017-11-
14',freq='15min')
df = df.reindex(full_range,fill_value=0)
df.head()
Output:
Date X
2014-01-02 00:00:00 1970-01-01 0
2014-01-02 00:15:00 1970-01-01 0
2014-01-02 00:30:00 1970-01-01 0
2014-01-02 00:45:00 1970-01-01 0
2014-01-02 01:00:00 1970-01-01 0
That didnt work as you see.
The "Date" Column is not a index btw. i need it as Column in my df
and why did he take "1970-01-01"? 1970 as year makes no sense to me
What I want is to remove ALL Rows where the time is NOT between 08:00
and 18:00 o'clock.
Create a mask with datetime.time. Example:
from datetime import time
idx = pd.date_range('2014-01-02', freq='15min', periods=10000)
df = pd.DataFrame({'x': np.empty(idx.shape[0])}, index=idx)
t1 = time(8); t2 = time(18)
times = df.index.time
mask = (times > t1) & (times < t2)
df = df.loc[mask]
Some days are missing in the data...how could I put the missing days
in my DataFrame and fill them with the value 0 as X?
Build a date range that doesn't have missing data with pd.date_range() (see above).
Call reindex() on df and specify fill_value=0.
Answering your questions in comments:
np.empty creates an empty array. I was just using it to build some "example" data that is basically garbage. Here idx.shape is the shape of your index (length, width), a tuple. So np.empty(idx.shape[0]) creates an empty 1d array with the same length as idx.
times = df.index.time creates a variable (a NumPy array) called times. df.index.time is the time for each element in the index of df. You can explore this yourself by just breaking the code down in pieces and experimenting with it on your own.
I've a time series that i resampled into this dataframe df ,
My data is from 6th june to 28 june. it want to extend the data from 1st june to 30th june. count column will have 0 value in only extended period and my real values from 6th to 28th.
Out[123]:
count
Timestamp
2009-06-07 02:00:00 1
2009-06-07 03:00:00 0
2009-06-07 04:00:00 0
2009-06-07 05:00:00 0
2009-06-07 06:00:00 0
i need to the make the
start date:2009-06-01 00:00:00
end date:2009-06-30 23:00:00
so the data would look something like this:
count
Timestamp
2009-06-01 01:00:00 0
2009-06-01 02:00:00 0
2009-06-01 03:00:00 0
is there an effective way to perform this. the only way i can think of is not that effective.i am trying this since yesterday. please help
index = pd.date_range('2009-06-01 00:00:00','2009-06-30 23:00:00', freq='H')
df = pandas.DataFrame(numpy.zeros(len(index),1), index=index)
df.columns=['zeros']
result= pd.concat([df2,df])
result1= pd.concat([df,result])
result1.fillna(0)
del result1['zero']
You can create a new index with the desired start and end day/times, resample the time series data and aggregate by count, then set the index to the new index.
import pandas as pd
# create the index with the start and end times you want
t_index = pd.DatetimeIndex(pd.date_range(start='2009-06-01', end='2009-06-30 23:00:00', freq="1h"))
# create the data frame
df = pd.DataFrame([['2009-06-07 02:07:42'],
['2009-06-11 17:25:28'],
['2009-06-11 17:50:42'],
['2009-06-11 17:59:18']], columns=['daytime'])
df['daytime'] = pd.to_datetime(df['daytime'])
# resample the data to 1 hour, aggregate by counts,
# then reset the index and fill the na's with 0
df2 = df.resample('1h', on='daytime').count().reindex(t_index).fillna(0)
DatetimeIndex() no longer works with those arguments, raises __new__() got an unexpected keyword argument 'start'