I have a DataFrame with a DatetimeIndex, and I need to create a column containing the difference in hours between consecutive rows of the index. This is what I have:
Datetime Numbers
2020-11-27 08:30:00 1
2020-11-27 13:00:00 2
2020-11-27 15:15:00 3
2020-11-27 20:45:00 4
2020-11-28 08:45:00 5
2020-11-28 10:45:00 6
2020-12-01 04:00:00 7
2020-12-01 08:15:00 8
2020-12-01 12:45:00 9
2020-12-01 14:45:00 10
2020-12-01 17:15:00 11
...
This is what I need:
Datetime Numbers Delta
2020-11-27 08:30:00 1 NaN
2020-11-27 13:00:00 2 4.5
2020-11-27 15:15:00 3 2.25
2020-11-27 20:45:00 4 5.5
2020-11-28 08:45:00 5 12
2020-11-28 10:45:00 6 2
2020-12-01 04:00:00 7 65.25
2020-12-01 08:15:00 8 4.25
2020-12-01 12:45:00 9 4.5
2020-12-01 14:45:00 10 2
2020-12-01 17:15:00 11 2.5
...
The DataFrame has thousands of rows, so I can't use a "for" loop. Thanks in advance!
EDIT: I found a solution:
import numpy as np

df = df.reset_index()
# Convert datetimes to Unix seconds, then diff and convert to hours.
df['Time'] = df['Datetime'].astype(np.int64) // 10**9
df['Delta'] = df['Time'].diff() / 3600
df.drop(columns=['Time'], inplace=True)
df.set_index('Datetime', inplace=True)
I assume that Datetime is set as the index:
df.reset_index(inplace=True)
df['Delta'] = df['Datetime'].diff().dt.total_seconds()/3600
df.set_index('Datetime', inplace=True)
OUTPUT:
Numbers Delta
Datetime
2020-11-27 08:30:00 1 NaN
2020-11-27 13:00:00 2 4.50
2020-11-27 15:15:00 3 2.25
2020-11-27 20:45:00 4 5.50
2020-11-28 08:45:00 5 12.00
2020-11-28 10:45:00 6 2.00
2020-12-01 04:00:00 7 65.25
2020-12-01 08:15:00 8 4.25
2020-12-01 12:45:00 9 4.50
2020-12-01 14:45:00 10 2.00
2020-12-01 17:15:00 11 2.50
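As a side note, if Datetime stays as the index you can also skip the reset_index/set_index round trip and diff the index directly; a minimal sketch:
# Diff the DatetimeIndex itself; total_seconds() / 3600 gives hours.
df['Delta'] = df.index.to_series().diff().dt.total_seconds() / 3600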
Related
I have this df:
Index Dates
0 2017-01-01 23:30:00
1 2017-01-12 22:30:00
2 2017-01-20 13:35:00
3 2017-01-21 14:25:00
4 2017-01-28 22:30:00
5 2017-08-01 13:00:00
6 2017-09-26 09:39:00
7 2017-10-08 06:40:00
8 2017-10-04 07:30:00
9 2017-12-13 07:40:00
10 2017-12-31 14:55:00
The goal was that a new df would be created for the time range 05:00 to 11:59, with those rows labelled "morning". To achieve this I converted those hours to booleans:
hour_morning=(pd.to_datetime(df['Dates']).dt.strftime('%H:%M:%S').between('05:00:00','11:59:00'))
and then tried to turn them into a list of "morning" strings:
text_morning=[str('morning') for x in hour_morning if x==True]
The problem is in the last line: it only returns 'morning' string values, as if the 'x' ignored the 'if' condition. Why is this happening and how do I fix it?
A trailing 'if' in a list comprehension is a filter: elements where the condition is False are skipped entirely, not mapped to something else, which is why your list contains only 'morning'. Use a conditional expression instead:
text_morning = ['morning' if x else 'not_morning' for x in hour_morning]
You can also use np.where:
text_morning = np.where(hour_morning, 'morning', 'not morning')
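For a self-contained sketch of the np.where route (the toy frame and the 'period' column name here are just illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Dates': ['2017-01-01 23:30:00', '2017-09-26 09:39:00']})
hour_morning = pd.to_datetime(df['Dates']).dt.strftime('%H:%M:%S').between('05:00:00', '11:59:00')
df['period'] = np.where(hour_morning, 'morning', 'not morning')
# df['period'] -> ['not morning', 'morning']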
Given:
Dates values
0 2017-01-01 23:30:00 0
1 2017-01-12 22:30:00 1
2 2017-01-20 13:35:00 2
3 2017-01-21 14:25:00 3
4 2017-01-28 22:30:00 4
5 2017-08-01 13:00:00 5
6 2017-09-26 09:39:00 6
7 2017-10-08 06:40:00 7
8 2017-10-04 07:30:00 8
9 2017-12-13 07:40:00 9
10 2017-12-31 14:55:00 10
Doing:
# df.Dates = pd.to_datetime(df.Dates)
df = df.set_index("Dates")
Now we can use pd.DataFrame.between_time:
new_df = df.between_time('05:00:00','11:59:00')
print(new_df)
Output:
values
Dates
2017-09-26 09:39:00 6
2017-10-08 06:40:00 7
2017-10-04 07:30:00 8
2017-12-13 07:40:00 9
Or use it to update the original dataframe:
df.loc[df.between_time('05:00:00','11:59:00').index, 'morning'] = 'morning'
# Output:
values morning
Dates
2017-01-01 23:30:00 0 NaN
2017-01-12 22:30:00 1 NaN
2017-01-20 13:35:00 2 NaN
2017-01-21 14:25:00 3 NaN
2017-01-28 22:30:00 4 NaN
2017-08-01 13:00:00 5 NaN
2017-09-26 09:39:00 6 morning
2017-10-08 06:40:00 7 morning
2017-10-04 07:30:00 8 morning
2017-12-13 07:40:00 9 morning
2017-12-31 14:55:00 10 NaN
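If you'd rather not set Dates as the index at all, a possible sketch is to compare datetime.time objects directly on the original frame (same 'morning' label as above):
import datetime

# dt.time yields datetime.time objects; between() compares them elementwise.
mask = pd.to_datetime(df['Dates']).dt.time.between(datetime.time(5, 0), datetime.time(11, 59))
df.loc[mask, 'morning'] = 'morning'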
I'm trying to filter my dataframe down to a 3-hourly frequency, keeping only 0000hr, 0300hr, 0600hr, 0900hr, 1200hr, 1500hr, 1800hr, 2100hr, and so on.
A sample of my dataframe would look like this
Time A
2019-05-25 03:54:00 1
2019-05-25 03:57:00 2
2019-05-25 04:00:00 3
...
2020-05-25 03:54:00 4
2020-05-25 03:57:00 5
2020-05-25 04:00:00 6
Desired output:
Time A
2019-05-25 06:00:00 1
2019-05-25 09:00:00 2
2019-05-25 12:00:00 3
...
2020-05-25 00:00:00 4
2020-05-25 03:00:00 5
2020-05-25 06:00:00 6
2020-05-25 09:00:00 6
2020-05-25 12:00:00 6
2020-05-25 15:00:00 6
2020-05-25 18:00:00 6
2020-05-25 21:00:00 6
2020-05-26 00:00:00 6
...
You can define a date range with a 3-hour interval using pd.date_range() and then filter your dataframe with .loc and isin(), as follows:
date_rng_3H = pd.date_range(start=df['Time'].dt.date.min(), end=df['Time'].dt.date.max() + pd.DateOffset(days=1), freq='3H')
df_out = df.loc[df['Time'].isin(date_rng_3H)]
Input data:
date_rng = pd.date_range(start='2019-05-25 03:54:00', end='2020-05-25 04:00:00', freq='3T')
np.random.seed(123)
df = pd.DataFrame({'Time': date_rng, 'A': np.random.randint(1, 6, len(date_rng))})
Time A
0 2019-05-25 03:54:00 3
1 2019-05-25 03:57:00 5
2 2019-05-25 04:00:00 3
3 2019-05-25 04:03:00 2
4 2019-05-25 04:06:00 4
... ... ...
175678 2020-05-25 03:48:00 2
175679 2020-05-25 03:51:00 1
175680 2020-05-25 03:54:00 2
175681 2020-05-25 03:57:00 2
175682 2020-05-25 04:00:00 1
175683 rows × 2 columns
Output:
print(df_out)
Time A
42 2019-05-25 06:00:00 4
102 2019-05-25 09:00:00 2
162 2019-05-25 12:00:00 1
222 2019-05-25 15:00:00 3
282 2019-05-25 18:00:00 5
... ... ...
175422 2020-05-24 15:00:00 1
175482 2020-05-24 18:00:00 5
175542 2020-05-24 21:00:00 2
175602 2020-05-25 00:00:00 3
175662 2020-05-25 03:00:00 3
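An alternative sketch avoids building an explicit date range and instead tests each timestamp against the 3-hour grid (assumes 'Time' is already datetime64):
# Keep rows whose hour is a multiple of 3 and whose minutes/seconds are zero.
on_grid = (df['Time'].dt.hour % 3 == 0) & (df['Time'].dt.minute == 0) & (df['Time'].dt.second == 0)
df_out = df.loc[on_grid]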
I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
With this DataFrame, I want a third column, consecutive_hour, such that whenever the value at a timestamp is less than 1000 it gets 3 hours, growing to 6, 9, ... across consecutive such occurrences, as shown above.
Lastly, I want to summarize the table by counting how many times each consecutive-hours value occurs, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff() etc., as mentioned in: How to groupby consecutive values in pandas DataFrame, and more. I spent several days on it but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
# Running count of consecutive True values that resets at every False.
cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
# Flag values below 1000 (NaN compares False), restart each day, scale to hours.
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
    .groupby(pd.Grouper(freq="D")) \
    .apply(lambda b: cumcount_reset(b)).mul(3) \
    .reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
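A possibly simpler sketch for the same run-length count uses a day key plus a gap counter as the group labels (assuming 'date' is datetime64; NaN < 1000 is False, so NaN rows break runs):
valid = df['value'] < 1000
day = df['date'].dt.normalize()      # restart the count each day
gap = (~valid).cumsum()              # new run id after every gap
df['consecutive_hour'] = valid.astype(int).groupby([day, gap]).cumsum() * 3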
Summary table: keep each run's peak count (the row where the count stops increasing within the day), then tally how often each peak occurs:
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
.apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
"consecutive_hour"] \
.value_counts().reindex([3, 6, 9, 12], fill_value=0) \
.rename("number_of_day") \
.rename_axis("consecutive_hour") \
.reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
I have a dataframe with a datetime type column and a float type column.
date value
0 2010-01-01 01:23:00 21.2
1 2010-01-02 01:33:00 63.4
2 2010-01-03 06:02:00 80.6
3 2010-01-04 06:05:00 50.1
4 2010-01-05 06:20:00 346.5
5 2010-01-06 07:44:00 111.8
6 2010-01-07 08:00:00 113.1
7 2010-01-08 08:22:00 10.6
8 2010-01-09 09:00:00 287.2
9 2010-01-10 09:14:00 1652.6
I want to create a new column recording the mean of the values in the hour before each row's time.
[UPDATE] Example:
If the current iteration is 4 2010-01-05 06:20:00 346.5 , I need to calculate (50.1 + 80.6) / 2 (value in range 2010-01-05 05:20:00~2010-01-05 06:20:00 and calculate mean).
date value before_1hr_mean
4 2010-01-05 06:20:00 346.5 65.35
I used iterrows() to solve this problem, as in the following code. But this method is really slow, and iterrows() is generally discouraged in pandas:
[UPDATE]
df['before_1hr_mean'] = np.nan
for index, row in df.iterrows():
df.loc[index, 'before_1hr_mean'] = df[(df['date'] < row['date']) & \
(df['date'] >= row['date'] - pd.Timedelta(hours=1))]['value'].mean()
Is there a better way to deal with this situation?
I took the liberty of changing your data to make it all the same day. It's the only way I could make sense of your question.
df.join(
df.set_index('date').value.rolling('H').mean().rename('before_1hr_mean'),
on='date'
)
date value before_1hr_mean
0 2010-01-01 01:23:00 21.2 21.200000
1 2010-01-01 01:33:00 63.4 42.300000
2 2010-01-01 06:02:00 80.6 80.600000
3 2010-01-01 06:05:00 50.1 65.350000
4 2010-01-01 06:20:00 346.5 159.066667
5 2010-01-01 07:44:00 111.8 111.800000
6 2010-01-01 08:00:00 113.1 112.450000
7 2010-01-01 08:22:00 10.6 78.500000
8 2010-01-01 09:00:00 287.2 148.900000
9 2010-01-01 09:14:00 1652.6 650.133333
If you want to exclude the current row, you have to track the sum and count of the rolling hour and back out what the average is after adjusting for the current value.
s = df.set_index('date')
sagg = s.rolling('H').agg(['sum', 'count']).value.rename(columns=str.title)
agged = df.join(sagg, on='date')
agged
date value Sum Count
0 2010-01-01 01:23:00 21.2 21.2 1.0
1 2010-01-01 01:33:00 63.4 84.6 2.0
2 2010-01-01 06:02:00 80.6 80.6 1.0
3 2010-01-01 06:05:00 50.1 130.7 2.0
4 2010-01-01 06:20:00 346.5 477.2 3.0
5 2010-01-01 07:44:00 111.8 111.8 1.0
6 2010-01-01 08:00:00 113.1 224.9 2.0
7 2010-01-01 08:22:00 10.6 235.5 3.0
8 2010-01-01 09:00:00 287.2 297.8 2.0
9 2010-01-01 09:14:00 1652.6 1950.4 3.0
Then do some math and assign a new column
df.assign(before_1hr_mean=agged.eval('(Sum - value) / (Count - 1)'))
date value before_1hr_mean
0 2010-01-01 01:23:00 21.2 NaN
1 2010-01-01 01:33:00 63.4 21.20
2 2010-01-01 06:02:00 80.6 NaN
3 2010-01-01 06:05:00 50.1 80.60
4 2010-01-01 06:20:00 346.5 65.35
5 2010-01-01 07:44:00 111.8 NaN
6 2010-01-01 08:00:00 113.1 111.80
7 2010-01-01 08:22:00 10.6 112.45
8 2010-01-01 09:00:00 287.2 10.60
9 2010-01-01 09:14:00 1652.6 148.90
Notice that you get nulls when there isn't an hour's worth of prior data to calculate over.
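In newer pandas versions there is a shorter route: closed='left' makes the time-based window [t - 1h, t), which excludes the current row outright, so no sum/count bookkeeping is needed. A sketch under that assumption:
s = df.set_index('date')['value']
# Window covers the hour strictly before each timestamp; empty windows give NaN.
df['before_1hr_mean'] = s.rolling('1H', closed='left').mean().to_numpy()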
I am working on a dataframe in pandas with four columns of user_id, time_stamp1, time_stamp2, and interval. Time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum up interval values for each user_id in the dataframe and I tried to calculate it in many ways as:
1)df["duration"]= df.groupby('user_id')['interval'].apply (lambda x: x.sum())
2)df ["duration"]= df.groupby('user_id').aggregate (np.sum)
3)df ["duration"]= df.groupby('user_id').agg (np.sum)
but none of them works; the duration values come out as NaT after running the code.
UPDATE: you can use the transform() method; unlike a plain groupby sum, which is indexed by user_id and therefore misaligns (yielding NaT) when assigned back to the frame, transform() returns a result aligned to the original index:
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...
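As a minimal sketch of why the original assignments produced NaT (toy data, illustrative only):
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 2, 2],
                   'interval': pd.to_timedelta(['1 days', '2 days', '3 days', '4 days'])})

agg = df.groupby('user_id')['interval'].sum()   # indexed by user_id (1, 2)
df['duration'] = agg                            # aligns on row labels 0-3: mostly NaT
df['duration'] = df.groupby('user_id')['interval'].transform('sum')  # row-aligned: works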