Unable to combine date and time in pandas - python

I would like to combine the following date and time columns into a single date_time column:
casinghourly[['Date','Time']].head()
Out[275]:
Date Time
0 2014-01-01 00:00:00
1 2014-01-01 01:00:00
2 2014-01-01 02:00:00
3 2014-01-01 03:00:00
4 2014-01-01 04:00:00
I've used the following code:
casinghourly.loc[:,'Date_Time'] = pd.to_datetime(casinghourly.Date.astype(str)+' '+casinghourly.Time.astype(str))
But I get the following error:
ValueError: Unknown string format
FYI:
casinghourly[['Date','Time']].dtypes
Out[276]:
Date datetime64[ns]
Time timedelta64[ns]
dtype: object
Can somebody help me here please?

Your conversion fails because astype(str) on a timedelta64 column produces strings like "0 days 01:00:00", which pd.to_datetime cannot parse, hence the ValueError. There is no need for the string round-trip: you can directly add datetime64[ns] and timedelta64[ns]:
df['Date'] = df['Date']+df['Time']
print(df['Date'])
0 2014-01-01 00:00:00
1 2014-01-01 01:00:00
2 2014-01-01 02:00:00
3 2014-01-01 03:00:00
4 2014-01-01 04:00:00
Name: Date, dtype: datetime64[ns]
print(df)
Date Time
0 2014-01-01 00:00:00 00:00:00
1 2014-01-01 01:00:00 01:00:00
2 2014-01-01 02:00:00 02:00:00
3 2014-01-01 03:00:00 03:00:00
4 2014-01-01 04:00:00 04:00:00
print(df.dtypes)
Date datetime64[ns]
Time timedelta64[ns]
dtype: object
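If you want to keep both original columns and match the Date_Time name from the question, a minimal sketch of the same idea, just assigned to a new column:
# datetime64 + timedelta64 adds directly; no string round-trip is needed
casinghourly['Date_Time'] = casinghourly['Date'] + casinghourly['Time']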

Related

How to fill the first date in the column?

I have a df:
dates values
2020-01-01 00:15:00 38.61487
2020-01-01 00:30:00 36.905204
2020-01-01 00:45:00 35.136584
2020-01-01 01:00:00 33.60378
2020-01-01 01:15:00 32.306791999999994
2020-01-01 01:30:00 31.304574
I am creating a new column named start as follows:
df = df.rename(columns={'dates': 'end'})
df['start']= df['end'].shift(1)
When I do this, I get the following:
end values start
2020-01-01 00:15:00 38.61487 NaT
2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
2020-01-01 01:00:00 33.60378 2020-01-01 00:45:00
2020-01-01 01:15:00 32.306791999999994 2020-01-01 01:00:00
2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
I want to fill that NaT value with
2020-01-01 00:00:00
How can this be done?
Use Series.fillna with a datetime, e.g. a Timestamp:
df['start']= df['end'].shift().fillna(pd.Timestamp('2020-01-01'))
Or, with pandas 0.24+, use the fill_value parameter:
df['start']= df['end'].shift(fill_value=pd.Timestamp('2020-01-01'))
If the datetimes are regular (always a 15-minute difference), you can instead subtract an offsets.DateOffset:
df['start']= df['end'] - pd.offsets.DateOffset(minutes=15)
print (df)
end values start
0 2020-01-01 00:15:00 38.614870 2020-01-01 00:00:00
1 2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2 2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
3 2020-01-01 01:00:00 33.603780 2020-01-01 00:45:00
4 2020-01-01 01:15:00 32.306792 2020-01-01 01:00:00
5 2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
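A pd.Timedelta works just as well here, as a sketch; for plain minutes, DateOffset and Timedelta behave identically, so this is a matter of taste:
df['start'] = df['end'] - pd.Timedelta(minutes=15)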
How about this? Infer the regular step from the data itself and use it to fill the first value:
df = pd.DataFrame(columns=['end'])
df.loc[:, 'end'] = pd.date_range(start=pd.Timestamp(2019,1,1,0,15), end=pd.Timestamp(2019,1,2), freq='15min')
df.loc[:, 'start'] = df.loc[:, 'end'].shift(1)
# infer the step from two consecutive rows
delta = df.loc[df.index[3], 'end'] - df.loc[df.index[2], 'end']
# step back from the second row to fill the first 'start'
df.loc[df.index[0], 'start'] = df.loc[df.index[1], 'start'] - delta
df
end start
0 2019-01-01 00:15:00 2019-01-01 00:00:00
1 2019-01-01 00:30:00 2019-01-01 00:15:00
2 2019-01-01 00:45:00 2019-01-01 00:30:00
3 2019-01-01 01:00:00 2019-01-01 00:45:00
4 2019-01-01 01:15:00 2019-01-01 01:00:00
... ... ...
91 2019-01-01 23:00:00 2019-01-01 22:45:00
92 2019-01-01 23:15:00 2019-01-01 23:00:00
93 2019-01-01 23:30:00 2019-01-01 23:15:00
94 2019-01-01 23:45:00 2019-01-01 23:30:00
95 2019-01-02 00:00:00 2019-01-01 23:45:00

How can I get a conditional hourly mean in pandas?

I have the dataframe below, and I want to make an hourly mean dataframe, with the condition that each hour's mean is calculated only from the 00:15:00~00:45:00 values.
date/time form a MultiIndex.
aaa
date time
2017-01-01 00:00:00 146.88
00:15:00 143.28
00:30:00 143.28
00:45:00 141.12
01:00:00 134.64
01:15:00 132.48
01:30:00 136.80
01:45:00 138.24
02:00:00 131.76
02:15:00 131.04
02:30:00 134.64
02:45:00 139.68
03:00:00 136.08
03:15:00 132.48
03:30:00 132.48
03:45:00 139.68
04:00:00 134.64
04:15:00 131.04
04:30:00 160.56
04:45:00 177.12
...
The results should be as below. How can I do it?
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
...
It seems you only need to select rows whose times end with 00:00:
df2 = df1[df1.index.get_level_values(1).astype(str).str.endswith('00:00')]
print (df2)
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
But if you need the mean of only the 00:15-00:45 values, it is more complicated:
lvl1 = pd.Series(df1.index.get_level_values(1))
m = ~lvl1.astype(str).str.endswith('00:00')
lvl1new = lvl1.mask(m).ffill()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       lvl1new.where(m)], names=df1.index.names)
print (df1)
aaa
date time
2017-01-01 NaN 146.88
00:00:00 143.28
00:00:00 143.28
00:00:00 141.12
NaN 134.64
01:00:00 132.48
01:00:00 136.80
01:00:00 138.24
NaN 131.76
02:00:00 131.04
02:00:00 134.64
02:00:00 139.68
NaN 136.08
03:00:00 132.48
03:00:00 132.48
03:00:00 139.68
NaN 134.64
04:00:00 131.04
04:00:00 160.56
04:00:00 177.12
df = df1['aaa'].groupby(level=[0,1]).mean()
print (df)
date time
2017-01-01 00:00:00 142.56
01:00:00 135.84
02:00:00 135.12
03:00:00 134.88
04:00:00 156.24
Name: aaa, dtype: float64
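An alternative sketch, starting from the original df1 and assuming the time level parses with pd.to_timedelta: flatten the MultiIndex into a single DatetimeIndex, keep only the quarter-hour rows, and label each group with the hour it belongs to (floor('H') maps 00:15-00:45 back to 00:00):
idx = (pd.to_datetime(df1.index.get_level_values(0)) +
       pd.to_timedelta(df1.index.get_level_values(1).astype(str)))
s = pd.Series(df1['aaa'].to_numpy(), index=idx)

quarters = s[s.index.minute.isin([15, 30, 45])]           # drop the on-the-hour rows
out = quarters.groupby(quarters.index.floor('H')).mean()  # 00:15-00:45 -> labelled 00:00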

Pandas timeseries groupby using TimeGrouper

I have a time series which is like this
Time Demand
Date
2014-01-01 0:00 2899.0
2014-01-01 0:15 2869.0
2014-01-01 0:30 2827.0
2014-01-01 0:45 2787.0
2014-01-01 1:00 2724.0
2014-01-01 1:15 2687.0
2014-01-01 1:30 2596.0
2014-01-01 1:45 2543.0
2014-01-01 2:00 2483.0
It is in 15-minute increments. I want the average for every hour of every day, so I tried something like df.groupby(pd.TimeGrouper(freq='H')).mean(). It didn't work out quite right because it returned mostly NaNs.
Now my dataset has data like this for the whole year, and I would like to calculate the mean for each hour of the day across the whole year, so that I have 24 points, e.g. the first point is the mean of the first hour over all days and months. The expected output would be:
2014 00:00:00 2884.0
2014 01:00:00 2807.0
2014 02:00:00 2705.5
2014 03:00:00 2569.5
..........
2014 23:00:00 2557.5
How can I achieve this?
I think you first need to add the Time column to the index:
df.index = df.index + pd.to_timedelta(df.Time + ':00')
print (df)
Time Demand
2014-01-01 00:00:00 0:00 2899.0
2014-01-01 00:15:00 0:15 2869.0
2014-01-01 00:30:00 0:30 2827.0
2014-01-01 00:45:00 0:45 2787.0
2014-01-01 01:00:00 1:00 2724.0
2014-01-01 01:15:00 1:15 2687.0
2014-01-01 01:30:00 1:30 2596.0
2014-01-01 01:45:00 1:45 2543.0
2014-01-01 02:00:00 2:00 2483.0
print (df.groupby(pd.Grouper(freq='H')).mean())
#same as the deprecated pd.TimeGrouper (removed in pandas 1.0):
#print (df.groupby(pd.TimeGrouper(freq='H')).mean())
Demand
2014-01-01 00:00:00 2845.5
2014-01-01 01:00:00 2637.5
2014-01-01 02:00:00 2483.0
Thanks pansen for another idea, resample:
print (df.resample("H").mean())
Demand
2014-01-01 00:00:00 2845.5
2014-01-01 01:00:00 2637.5
2014-01-01 02:00:00 2483.0
EDIT:
print (df)
Time Demand
Date
2014-01-01 0:00 1.0
2014-01-01 0:15 2.0
2014-01-01 0:30 4.0
2014-01-01 0:45 5.0
2014-01-01 1:00 1.0
2014-01-01 1:15 0.0
2015-01-01 1:30 1.0
2015-01-01 1:45 2.0
2015-01-01 2:00 3.0
df.index = df.index + pd.to_timedelta(df.Time + ':00')
print (df)
Time Demand
2014-01-01 00:00:00 0:00 1.0
2014-01-01 00:15:00 0:15 2.0
2014-01-01 00:30:00 0:30 4.0
2014-01-01 00:45:00 0:45 5.0
2014-01-01 01:00:00 1:00 1.0
2014-01-01 01:15:00 1:15 0.0
2015-01-01 01:30:00 1:30 1.0
2015-01-01 01:45:00 1:45 2.0
2015-01-01 02:00:00 2:00 3.0
df1 = df.groupby([df.index.year, df.index.hour]).mean().reset_index()
df1.columns = ['year','hour','Demand']
print (df1)
year hour Demand
0 2014 0 3.0
1 2014 1 0.5
2 2015 1 1.5
3 2015 2 3.0
For DatetimeIndex use:
df1 = df.groupby([df.index.year, df.index.hour]).mean()
df1.index = pd.to_datetime(df1.index.get_level_values(0).astype(str) +
                           df1.index.get_level_values(1).astype(str), format='%Y%H')
print (df1)
Demand
2014-01-01 00:00:00 3.0
2014-01-01 01:00:00 0.5
2015-01-01 01:00:00 1.5
2015-01-01 02:00:00 3.0
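If you only want the 24 hour-of-day averages across the entire year (one point per hour, as in the expected output), a minimal sketch using the DatetimeIndex built above:
hourly_mean = df['Demand'].groupby(df.index.hour).mean()
This returns a Series indexed 0-23, each value being the mean of that hour over all days.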

Pandas compute hourly average and set at middle of interval

I want to compute the hourly mean for a time series of wind speed and direction, but I want to set the time at the half hour. So, the average of values from 14:00 to 15:00 will be labelled 14:30. Right now, I can only seem to get it on the left or right edge of the interval. Here is what I currently have:
ts_g=[item.replace(second=0, microsecond=0) for item in dates_g]
dg = {'ws': data_g.ws, 'wdir': data_g.wdir}
df_g = pandas.DataFrame(data=dg, index=ts_g, columns=['ws','wdir'])
grouped_g = df_g.groupby(pandas.TimeGrouper('H'))
hourly_ws_g = grouped_g['ws'].mean()
hourly_wdir_g = grouped_g['wdir'].mean()
the output for this looks like:
2016-04-08 06:00:00+00:00 46.980000
2016-04-08 07:00:00+00:00 64.313333
2016-04-08 08:00:00+00:00 75.678333
2016-04-08 09:00:00+00:00 127.383333
2016-04-08 10:00:00+00:00 145.950000
2016-04-08 11:00:00+00:00 184.166667
....
but I would like it to be like:
2016-04-08 06:30:00+00:00 54.556
2016-04-08 07:30:00+00:00 78.001
....
Thanks for your help!
So the easiest way is to resample and then use linear interpolation:
In [21]: rng = pd.date_range('1/1/2011', periods=72, freq='H')
In [22]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [23]: ts.head()
Out[23]:
2011-01-01 00:00:00 0.796704
2011-01-01 01:00:00 -1.153179
2011-01-01 02:00:00 -1.919475
2011-01-01 03:00:00 0.082413
2011-01-01 04:00:00 -0.397434
Freq: H, dtype: float64
In [24]: ts2 = ts.resample('30T').interpolate()
In [25]: ts2.head()
Out[25]:
2011-01-01 00:00:00 0.796704
2011-01-01 00:30:00 -0.178237
2011-01-01 01:00:00 -1.153179
2011-01-01 01:30:00 -1.536327
2011-01-01 02:00:00 -1.919475
Freq: 30T, dtype: float64
I believe this is what you need.
Edit to add clarifying example
Perhaps it's easier to see what's going on without random data:
In [29]: ts.head()
Out[29]:
2011-01-01 00:00:00 0
2011-01-01 01:00:00 1
2011-01-01 02:00:00 2
2011-01-01 03:00:00 3
2011-01-01 04:00:00 4
Freq: H, dtype: int64
In [30]: ts2 = ts.resample('30T').interpolate()
In [31]: ts2.head()
Out[31]:
2011-01-01 00:00:00 0.0
2011-01-01 00:30:00 0.5
2011-01-01 01:00:00 1.0
2011-01-01 01:30:00 1.5
2011-01-01 02:00:00 2.0
Freq: 30T, dtype: float64
This post is already several years old and uses an API that has long been deprecated. Modern pandas provides the resample method, which is easier to use than pandas.TimeGrouper, but it only allows left- and right-labelled intervals; labelling the interval at its midpoint is not readily available.
Yet this is not hard to do.
First we fill in the data that we want to resample:
ts_g = [datetime.datetime.fromisoformat('2019-11-20') +
        datetime.timedelta(minutes=10*x) for x in range(0, 100)]
dg = {'ws': range(0,100), 'wdir': range(0,100)}
df_g = pd.DataFrame(data=dg, index=ts_g, columns=['ws','wdir'])
df_g.head()
The output would be:
ws wdir
2019-11-20 00:00:00 0 0
2019-11-20 00:10:00 1 1
2019-11-20 00:20:00 2 2
2019-11-20 00:30:00 3 3
2019-11-20 00:40:00 4 4
Now we first resample to 30-minute intervals:
grouped_g = df_g.resample('30min')
halfhourly_ws_g = grouped_g['ws'].mean()
halfhourly_ws_g.head()
The output would be:
2019-11-20 00:00:00 1
2019-11-20 00:30:00 4
2019-11-20 01:00:00 7
2019-11-20 01:30:00 10
2019-11-20 02:00:00 13
Freq: 30T, Name: ws, dtype: int64
Finally the trick to get the centered intervals:
hourly_ws_g = halfhourly_ws_g.add(halfhourly_ws_g.shift(1)).div(2)\
.loc[halfhourly_ws_g.index.minute % 60 == 30]
hourly_ws_g.head()
This would produce the expected output:
2019-11-20 00:30:00 2.5
2019-11-20 01:30:00 8.5
2019-11-20 02:30:00 14.5
2019-11-20 03:30:00 20.5
2019-11-20 04:30:00 26.5
Freq: 60T, Name: ws, dtype: float64
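A simpler route to the same centered labels, as a sketch: take plain hourly means (left-labelled by default) and shift the labels forward half an hour, so the 14:00-15:00 average lands on 14:30. For the evenly spaced data above, this reproduces the 2.5, 8.5, ... values exactly:
hourly_ws_g = df_g['ws'].resample('H').mean()
hourly_ws_g.index = hourly_ws_g.index + pd.Timedelta(minutes=30)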

Find recurring events in time series with pandas

I have a time series of events and I would like to count previous non-consecutive occurrences of each type of event in the time series. I want to do this with pandas. I could do it by iterating through the items, but I wonder if there is a clever way of doing it without loops.
To make it clearer. Consider the following time series:
dates = pd.date_range('1/1/2011', periods=4, freq='H')
data = ['a', 'a', 'b', 'a']
df = pd.DataFrame(data,index=dates,columns=["event"])
event
2011-01-01 00:00:00 a
2011-01-01 01:00:00 a
2011-01-01 02:00:00 b
2011-01-01 03:00:00 a
I would like to add a new column that tells, for each element in the "event" column, how many non-consecutive times that element has previously appeared. That is, something like this:
event #prev-occurr
2011-01-01 00:00:00 a 0
2011-01-01 01:00:00 a 0
2011-01-01 02:00:00 b 0
2011-01-01 03:00:00 a 1
We don't really have good groupby support for contiguous groups yet, but we can use the shift-compare-cumsum pattern and then a dense rank to get what you need, IIUC (shown here on an extended version of your example data):
>>> egroup = (df["event"] != df["event"].shift()).cumsum()
>>> df["prev_occur"] = egroup.groupby(df["event"]).rank(method="dense") - 1
>>> df
event prev_occur
2011-01-01 00:00:00 a 0
2011-01-01 01:00:00 a 0
2011-01-01 02:00:00 b 0
2011-01-01 03:00:00 a 1
2011-01-01 04:00:00 a 1
2011-01-01 05:00:00 b 1
2011-01-01 06:00:00 a 2
This works because we get a contiguous event group count:
>>> egroup
2011-01-01 00:00:00 1
2011-01-01 01:00:00 1
2011-01-01 02:00:00 2
2011-01-01 03:00:00 3
2011-01-01 04:00:00 3
2011-01-01 05:00:00 4
2011-01-01 06:00:00 5
Freq: H, Name: event, dtype: int64
and then we can group this by the event types, giving us the non-ranked version:
>>> for k,g in egroup.groupby(df["event"]):
... print(g)
...
2011-01-01 00:00:00 1
2011-01-01 01:00:00 1
2011-01-01 03:00:00 3
2011-01-01 04:00:00 3
2011-01-01 06:00:00 5
Name: event, dtype: int64
2011-01-01 02:00:00 2
2011-01-01 05:00:00 4
Name: event, dtype: int64
which we can finally do a dense rank on.
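An equivalent formulation, as a sketch: instead of dense-ranking, factorize the contiguous-group ids within each event type; pd.factorize assigns 0, 1, 2, ... in order of first appearance, which is exactly the dense rank minus one here:
egroup = (df["event"] != df["event"].shift()).cumsum()
df["prev_occur"] = egroup.groupby(df["event"]).transform(lambda s: pd.factorize(s)[0])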
