Determining Consecutive Days Using Pandas - python

I have a Pandas DataFrame that looks something like this:
activity_id start end type ... site heart_rate grp_idx date
user_id ...
72xyy89c74cc57178e02f103187ad579 dcb12345678b5c8e84cf2931b1a553cb 2015-09-12 00:40:33.171000+00:00 2015-09-12 00:53:33.171000+00:00 run ... data\dcb12345678b5c8e84cf2931b1a553cb.json NaN 2015-37 2015-09-12
The DataFrame has already been filtered such that all user_id is the same.
The dtypes are:
activity_id object
start datetime64[ns, UTC]
end datetime64[ns, UTC]
type object
distance float64
steps float64
speed float64
pace float64
calories float64
ascent float64
descent float64
site object
heart_rate float64
grp_idx object
date object
I need to determine if there are x (e.g. 4) consecutive days in a row and find the number of times that has occurred. For example:
2015-09-12
2015-09-13
2015-09-14
2015-09-15
2015-09-16
2015-09-17
2015-09-18
2015-09-19
Would count as two.
I have tried using groupby, e.g.:
s = repeat_runner_df.groupby('user_id').start.diff().dt.days.fillna(1).cumsum()
repeat_runner_df.groupby(['user_id', s]).filter(lambda x: len(x) < 3)
print(repeat_runner_df)
but that has gotten me nowhere and neither has my Google skills. Any help would be greatly appreciated.
c7e962db02da12345f02fe3d8a86c99d 2018-04-08 03:16:06+00:00 2018-04-08 03:41:30+00:00 run ... NaN 2018-14 2018-04-08
c7e962db02da12345f02fe3d8a86c99d 2018-04-09 17:21:37+00:00 2018-04-09 18:27:17+00:00 run ... NaN 2018-14 2018-04-09
c7e962db02da12345f02fe3d8a86c99d 2018-04-10 19:05:39+00:00 2018-04-10 19:38:32+00:00 run ... NaN 2018-15 2018-04-10

Try this:
count = df.groupby(df.start.diff().ne(pd.Timedelta(days=1)).cumsum()).apply(len).ge(4).sum()
Output:
>>> count
3

I think this is what youre looking for:
consecutive = 4
(df.groupby(['user_id',df.groupby('user_id')['start']
.transform(lambda x: x.diff().dt.days.ne(1).cumsum())])['user_id']
.cumcount().add(1).eq(consecutive).sum())

Related

Getting min value across multiple datetime columns in Pandas

I have the following dataframe
df = pd.DataFrame({
'DATE1': ['NaT', 'NaT', '2010-04-15 19:09:08+00:00', '2011-01-25 15:29:37+00:00', '2010-04-10 12:29:02+00:00', 'NaT'],
'DATE2': ['NaT', 'NaT', 'NaT', 'NaT', '2014-04-10 12:29:02+00:00', 'NaT']})
df.DATE1 = pd.to_datetime(df.DATE1)
df.DATE2 = pd.to_datetime(df.DATE2)
and I would like to create a new column with the minimum value across the two columns (ignoring the NaTs) like so:
df.min(axis=1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
dtype: float64
If I remove the timezone information (the +00:00) from every single cell then the desired output is produced like so:
0 NaT
1 NaT
2 2010-04-15 19:09:08
3 2011-01-25 15:29:37
4 2010-04-10 12:29:02
5 NaT
dtype: datetime64[ns]
Why does adding the timezone information break the function? My dataset has timezones so I would need to know how to remove them as a workaround.
This is good question, it should be a bug here with timezone
df.apply(lambda x : np.max(x),1)
0 NaT
1 NaT
2 2010-04-15 19:09:08+00:00
3 2011-01-25 15:29:37+00:00
4 2014-04-10 12:29:02+00:00
5 NaT
dtype: datetime64[ns, UTC]
Odd. Seems like a bug. You could keep the timezone format and use this.
df.apply(lambda x: x.min(),axis=1)
0 NaT
1 NaT
2 2010-04-15 19:09:08+00:00
3 2011-01-25 15:29:37+00:00
4 2010-04-10 12:29:02+00:00
5 NaT
dtype: datetime64[ns, UTC]

pandas: how to get current datetime for NaT

I'm trying to calculate ageing, by doing difference of 2 dynamic dates and converting this into respective hours
here is my code snippet
l_series = getattr(df, field_def['primaryField'])
r_series = getattr(df, field_def['secondaryField'])
result = l_series - r_series
here is my output
4 NaT
14 2020-03-02 13:59:11+00:00
I could understand there is no field updated_at for first record and this is why this is throwing NaT.
Is there a way I can replace the NaT with the current datetime?
Thanks
First idea is replace only by Series.fillna, but because different timezones output is object:
result1 = result.fillna(pd.to_datetime('now'))
print (result1)
4 2020-03-13 10:33:10.557126
14 2020-03-02 13:59:11+00:00
Name: s, dtype: object
For datetimes with timezones add Timestamp.tz_localize:
result2 = result.fillna(pd.to_datetime('now').tz_localize('utc'))
print (result2)
4 2020-03-13 10:33:10.559126+00:00
14 2020-03-02 13:59:11+00:00
Name: s, dtype: datetime64[ns, UTC]

pandas resample: forcing specific start time of time bars

I have some time series data (pandas.DataFrame) and I resample it in '600S' bars:
import numpy as np
data.resample('600S', level='time').aggregate({'abc':np.sum})
I get something like this:
abc
time
09:30:01.446000 19836
09:40:01.446000 8577
09:50:01.446000 29746
10:00:01.446000 29340
10:10:01.446000 5197
...
How can I force the time bars to start at 09:30:00.000000 instead of at the time of the 1st row in the data? I.e. output should be something like this:
abc
time
09:30:00.000000 *****
09:40:00.000000 ****
09:50:00.000000 *****
10:00:00.000000 *****
10:10:00.000000 ****
...
Thank you for your help!
You can add Series.dt.floor to your code:
df.time = df.time.dt.floor('10 min')
time abc
0 2018-12-05 09:30:00 19836
1 2018-12-05 09:40:00 8577
2 2018-12-05 09:50:00 29746
3 2018-12-05 10:00:00 29340
4 2018-12-05 10:10:00 5197
.resample is a bit of a wildcard. It behaves rather differently with datetime64[ns] and timedelta64[ns] so I personally find it more reliable to work with groupby, when just doing things like .sum or .first.
Sample Data
import pandas as pd
import numpy as np
n = 1000
np.random.seed(123)
df = pd.DataFrame({'time': pd.date_range('2018-01-01 01:13:43', '2018-01-01 23:59:59', periods=n),
'abc': np.random.randint(1,1000,n)})
When the dtype is datetime64[ns] it will resample to "round" bins:
df.dtypes
#time datetime64[ns]
#abc int32
#dtype: object
df.set_index('time').resample('600S').sum()
abc
time
2018-01-01 01:10:00 2572
2018-01-01 01:20:00 2257
2018-01-01 01:30:00 2470
2018-01-01 01:40:00 3131
2018-01-01 01:50:00 3402
With timedelta64[ns] it instead begins the bins based on your first observation:
df['time'] = pd.to_timedelta(df.time.dt.time.astype('str'))
df.dtypes
#time timedelta64[ns]
#abc int32
#dtype: object
df.set_index('time').resample('600S').sum()
abc
time
01:13:43 3432
01:23:43 2447
01:33:43 2588
01:43:43 3202
01:53:43 2547
So in the case of a timedelta64[ns] column, I'd advise you to go with groupby creating bins out of .dt.floor to create your 10 minute bins that go from [XX:00:00 - XX:10:00]
df.groupby(df.time.dt.floor('600S')).sum()
# abc
#time
#01:10:00 2572
#01:20:00 2257
#01:30:00 2470
#01:40:00 3131
#01:50:00 3402
This is the same result we got in the first case with the datetime64[ns] dtype, which binned to the "round" bins.
If your use case is robust to it and you want to extend the time before the actual starting time, a solution is to add an empty row at the starting time you want.
E.g. truncating the first time (df.loc[0] if the index is sorted, else df.index.min()) to its hour (.floor("h")) :
df.loc[df.index.min().floor("h")] = None
df.sort_index(inplace=True) # cleaner, but not even needed
Then resample() will use this time as the starting point (9:00 in OP’s case).
This can also be applied to extend the time range after the actual end of the dataset.

pandas asfreq returns NaN if exact date DNE

Let's say I have financial data in a pandas.Series, called fin_series.
Here's a peek at fin_series.
In [565]: fin_series
Out[565]:
Date
2008-05-16 1000.000000
2008-05-19 1001.651747
2008-05-20 1004.137434
...
2014-12-22 1158.085200
2014-12-23 1150.139126
2014-12-24 1148.934665
Name: Close, Length: 1665
I'm interested in looking at the quarterly endpoints of the data. However, not all financial trading days fall exactly on the 'end of the quarter.'
For example:
In [566]: fin_series.asfreq('q')
Out[566]:
2008-06-30 976.169624
2008-09-30 819.518923
2008-12-31 760.429261
...
2009-06-30 795.768956
2009-09-30 870.467121
2009-12-31 886.329978
...
2011-09-30 963.304679
2011-12-31 NaN
2012-03-31 NaN
....
2012-09-30 NaN
2012-12-31 1095.757137
2013-03-31 NaN
2013-06-30 NaN
...
2014-03-31 1138.548881
2014-06-30 1168.248194
2014-09-30 1147.000073
Freq: Q-DEC, Name: Close, dtype: float64
Here's a little function that accomplishes what I'd like, along with the desired end result.
def bmg_qt_asfreq(series):
ind = series[1:].index.quarter != series[:-1].index.quarter
ind = numpy.append(ind, True)
return tmp[ind]
which gives me:
In [15]: bmg_asfreq(tmp)
Out[15]:
Date
2008-06-30 976.169425
2008-09-30 819.517607
2008-12-31 760.428770
...
2011-09-30 963.252831
2011-12-30 999.742132
2012-03-30 1049.848583
...
2012-09-28 1086.689824
2012-12-31 1093.943357
2013-03-28 1117.111859
Name: Close, dtype: float64
Note that I'm preserving the dates of the "closest prior price," instead of simply using pandas.asfreq(freq = 'q', method = 'ffill'), as the preservations of dates that exist within the original Series.Index is crucial.
This seems like a silly problem that many people have had and must be addressed by all of the pandas time manipulation functionality, but I can't figure out how to do it with resample or asfreq.
Anyone who could show me the builtin pandas functionality to accomplish this would be greatly appreciated.
Regards,
Assuming the input is a dataframe Series , first do
import pandas as pd
fin_series.resample("q",pd.Series.last_valid_index)
to get a series with the last non-NA index for each quarter. Then
fin_series.resample("q","last")
for the last non-NA value. You can then join these together. As you suggested in your comment:
fin_series[fin_series.resample("q",pd.Series.last_valid_index)]
df.asfreq('d').interpolate().asfreq('q')

How to remove the date information in a column, just keep time

I am using pandas dataframe. there is a specific column has time information.
the raw data likes this:
5:15am
5:28am
6:15am
so I need to convert the raw data into datetime format:
format = '%I:%M%p'
dataset['TimeStamp'] = pd.to_datetime(dataset['TimeStamp'],format)
However, I got:
2014-07-04 05:15:00
2014-07-04 05:28:00
2014-07-04 06:15:00
I don't want the year and date information, just want time. How can I remove it. Thanks.
Since version 0.17.0 you can just do
dataset['TimeStamp'].dt.time
For versions older than 0.17.0:
You can just call apply and access the time function on the datetime object create the column initially like this without the need for post processing:
In [143]:
dataset['TimeStamp'] = pd.to_datetime(dataset['TimeStamp'],format).apply(lambda x: x.time())
dataset
Out[143]:
TimeStamp
0 05:15:00
1 05:28:00
2 06:15:00
The following will convert what you have to datetime.time() objects:
dataset['TimeStamp'] = pd.Series([val.time() for val in dataset['TimeStamp']])
Output
TimeStamp
0 05:15:00
1 05:28:00
2 06:15:00
Just use the datetime.time() function
datetime.time()
Return time object with same hour, minute, second and microsecond. tzinfo is None. See also method timetz().
This will return a datetime.time object and you can access the data with the time.hour time.minute and time.second attributes.
your_date_df.dt.time
Lets say that your column with the date ans time is df['arrived_date']:
0 2015-01-06 00:43:00
1 2015-01-06 07:56:00
2 2015-01-06 11:02:00
3 2015-01-06 11:22:00
4 2015-01-06 15:27:00
Name: arrived_date, dtype: datetime64[ns]
Whith pandas, you just need to do:
df['arrived_time']=df['arrived_date'].dt.time
The new column df['arrived_time'] will look like this:
0 00:43:00
1 07:56:00
2 11:02:00
3 11:22:00
4 15:27:00
Name: arrived_time, dtype: object
Observe that the new column, df['arrived_time'], is no longer a datetime64 type, the type of the column is just a pandas object
There's a simpler way to do it using pandas, although most, if not all solutions are correct
df.TimeStamp = pd.to_datetime(df.TimeStamp).dt.strftime('%H:%M')
dataset['TimeStamp']=dataset['TimeStamp'].str.slice(11,18)

Categories

Resources