Reorder timestamps pandas - python

I have a pandas column that contains timestamps that are unordered. When I sort them it works fine except for the values in H:MM:SS format (a single-digit hour).
import pandas as pd

d = ({
'A' : ['8:00:00','9:00:00','10:00:00','20:00:00','24:00:00','26:20:00'],
})
df = pd.DataFrame(data=d)
df = df.sort_values(by='A',ascending=True)
Out:
A
2 10:00:00
3 20:00:00
4 24:00:00
5 26:20:00
0 8:00:00
1 9:00:00
Ideally, I'd like to add a leading zero to the shorter strings, so 8:00:00 becomes 08:00:00. If I convert them all to timedelta, it converts the times after midnight into 1 day plus n hours, e.g.
df['A'] = pd.to_timedelta(df['A'])
A
0 0 days 08:00:00
1 0 days 09:00:00
2 0 days 10:00:00
3 0 days 20:00:00
4 1 days 00:00:00
5 1 days 02:20:00
Intended Output:
A
0 08:00:00
1 09:00:00
2 10:00:00
3 20:00:00
4 24:00:00
5 26:20:00

If you only need to sort by the column as a timedelta, you can convert it with pd.to_timedelta and use argsort on the result to get the order in which to index the data frame:
df.iloc[pd.to_timedelta(df.A).argsort()]
# A
#0 8:00:00
#1 9:00:00
#2 10:00:00
#3 20:00:00
#4 24:00:00
#5 26:20:00
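If you also want the zero-padded strings shown in the intended output, one small sketch (assuming every value stays below 100 hours, so 8 characters is enough) is to left-pad the column with str.zfill; the padded strings then sort correctly even as plain text:
df['A'] = df['A'].str.zfill(8)  # '8:00:00' -> '08:00:00'
df = df.sort_values(by='A')     # lexicographic order now matches time order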

Related

How can I get different statistics for a rolling datetime range up to a current value in a pandas dataframe?

I have a dataframe that has four different columns and looks like the table below:
index_example | column_a | column_b | column_c | datetime_column
1             | A        | 1,000    | 1        | 2020-01-01 11:00:00
2             | A        | 2,000    | 2        | 2019-11-01 10:00:00
3             | A        | 5,000    | 3        | 2019-12-01 08:00:00
4             | B        | 1,000    | 4        | 2020-01-01 05:00:00
5             | B        | 6,000    | 5        | 2019-01-01 01:00:00
6             | B        | 7,000    | 6        | 2019-04-01 11:00:00
7             | A        | 8,000    | 7        | 2019-11-30 07:00:00
8             | B        | 500      | 8        | 2020-01-01 05:00:00
9             | B        | 1,000    | 9        | 2020-01-01 03:00:00
10            | B        | 2,000    | 10       | 2020-01-01 02:00:00
11            | A        | 1,000    | 11       | 2019-05-02 01:00:00
Purpose:
For each row, get rolling statistics for column_b over a window of time in datetime_column defined as the last N months. The window, however, should only include rows that share the same value in column_a.
Code example using a for loop which is not feasible given the size:
from datetime import timedelta

mean_dict = {}
for index, value in enumerate(df.datetime_column):
    test_date = value
    test_column_a = df.column_a[index]
    subset_df = df[(df.datetime_column < test_date) &
                   (df.datetime_column >= test_date - timedelta(days=180)) &
                   (df.column_a == test_column_a)]
    mean_dict[index] = subset_df.column_b.mean()
For example for row #1:
Target date = 2020-01-01 11:00:00
Target value in column_a = A
Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
Average would be the mean of rows 2,3,7
If I wanted average for row #2 then it would be:
Target date = 2019-11-01 10:00:00
Target value in column_a = A
Date Range: from 2019-05-01 10:00 to 2019-11-01 10:00:00
Average would be the mean of row 11
and so on...
I cannot use the grouper since in reality I do not have dates but datetimes.
Has anyone encountered this before?
Thanks!
EDIT
The dataframe is big (~2M rows), which means that looping is not an option. I already tried looping and creating a subset based on conditional values, but it takes too long.
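One possible vectorised direction (a sketch, not from the original post; it assumes a reasonably recent pandas, that the frame is sorted by datetime_column, and that column_b has been converted to a numeric dtype, i.e. the thousands separators shown above are stripped): pandas supports time-based rolling windows per group, and closed='left' excludes the current row, matching the strict < in the loop. The rolling_mean_180d column name is just illustrative.
import pandas as pd

df = df.sort_values('datetime_column')
# 180-day rolling mean of column_b within each column_a group,
# excluding the current row itself (closed='left').
rolling_mean = (
    df.groupby('column_a')
      .rolling('180D', on='datetime_column', closed='left')['column_b']
      .mean()
      .reset_index(level=0, drop=True)  # drop the group level so the index aligns with df
)
df['rolling_mean_180d'] = rolling_mean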

How to get the time difference for specific rows included in one column of data using python

I have a dataset with a time column and three inputs. I calculate the time difference using pandas.
The code is:
data['Time_different'] = pd.to_timedelta(data['time'].astype(str)).diff(-1).dt.total_seconds().div(60)
This computes the time difference between every pair of consecutive rows, but I want code that finds the time difference only between the specific rows that have X3 values.
I tried to write it using a for loop, but it doesn't work properly. Can the code be written without a for loop?
As you can see in my image I have three inputs, X1, X2 and X3. When I used that code it showed the time difference across X1, X2 and X3.
What I want is the time difference only between the X3 inputs that have values.
time X3
6:00:00 0
7:00:00 2
8:00:00 0
9:00:00 50
10:00:00 0
11:00:00 0
12:00:00 0
13:45:00 0
15:00:00 0
16:00:00 0
17:00:00 0
18:00:00 0
19:00:00 20
I want to skip the times where X3 is 0 and read only the time differences between the rows where X3 has a value.
time X3
7:00:00 2
9:00:00 50
So the time difference is 2 hrs.
Then the second:
9:00:00 50
19:00:00 20
Then the time difference is 10 hrs.
Likewise, I want to write the code for my whole column. Can anyone help me solve this?
When I tried writing the code I got negative values for the time difference.
You can try to:
Find the rows where X3 is different from 0
Compute the difference in hours using shift
Update the dataframe using join
Full example:
data = """time X3
6:00:00 0
7:00:00 2
8:00:00 0
9:00:00 50
10:00:00 0
11:00:00 0
12:00:00 0
13:45:00 0
15:00:00 0
16:00:00 0
17:00:00 0
18:00:00 0
19:00:00 20"""
# Build dataframe from example
df = pd.read_csv(StringIO(data), sep=r'\s{1,}')
df['X1'] = np.random.randint(0,10,len(df)) # Add random values for "X1" column
df['X2'] = np.random.randint(0,10,len(df)) # Add random values for "X2" column
# Convert the time column to datetime object
df.time = pd.to_datetime(df.time, format="%H:%M:%S")
print(df)
# time X3 X1 X2
# 0 1900-01-01 06:00:00 0 5 4
# 1 1900-01-01 07:00:00 2 7 1
# 2 1900-01-01 08:00:00 0 2 8
# 3 1900-01-01 09:00:00 50 1 0
# 4 1900-01-01 10:00:00 0 3 9
# 5 1900-01-01 11:00:00 0 8 4
# 6 1900-01-01 12:00:00 0 0 2
# 7 1900-01-01 13:45:00 0 5 0
# 8 1900-01-01 15:00:00 0 5 7
# 9 1900-01-01 16:00:00 0 0 8
# 10 1900-01-01 17:00:00 0 6 7
# 11 1900-01-01 18:00:00 0 1 5
# 12 1900-01-01 19:00:00 20 4 7
# Compute difference
sub_df = df[df.X3 != 0]
out_values = (sub_df.time.dt.hour - sub_df.shift().time.dt.hour) \
    .to_frame() \
    .fillna(sub_df.time.dt.hour.iloc[0]) \
    .rename(columns={'time': 'out'})  # Rename column
print(out_values)
# out
# 1 7.0
# 3 2.0
# 12 10.0
df = df.join(out_values) # Add out values
print(df)
# time X3 X1 X2 out
# 0 1900-01-01 06:00:00 0 2 9 NaN
# 1 1900-01-01 07:00:00 2 7 4 7.0
# 2 1900-01-01 08:00:00 0 6 6 NaN
# 3 1900-01-01 09:00:00 50 9 1 2.0
# 4 1900-01-01 10:00:00 0 2 9 NaN
# 5 1900-01-01 11:00:00 0 5 3 NaN
# 6 1900-01-01 12:00:00 0 6 4 NaN
# 7 1900-01-01 13:45:00 0 9 3 NaN
# 8 1900-01-01 15:00:00 0 3 0 NaN
# 9 1900-01-01 16:00:00 0 1 8 NaN
# 10 1900-01-01 17:00:00 0 7 5 NaN
# 11 1900-01-01 18:00:00 0 6 7 NaN
# 12 1900-01-01 19:00:00 20 1 5 10.0
Here I use .fillna(sub_df.time.dt.hour.iloc[0]) to replace the first value (which has no previous row to subtract, so the subtraction gives NaN) with its own hour. You can define your own rule for the value in fillna().
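If any of the non-zero rows fall on a time that is not a whole hour (for example the 13:45:00 row, had its X3 been set), a small variant of the same idea (just a sketch; the out_hours name is illustrative) is to diff the full timestamps instead of only the hour component, so the minutes are preserved. Unlike above, the first non-zero row is left as NaN rather than filled with its own hour:
# Difference between consecutive non-zero X3 rows, in hours, keeping the minutes
sub_df = df[df.X3 != 0]
df['out_hours'] = sub_df.time.diff().dt.total_seconds().div(3600)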

How to find the datetime difference between rows in a column, based on the condition?

I have the following pandas DataFrame df:
date time val1
2018-12-31 09:00:00 15
2018-12-31 10:00:00 22
2018-12-31 11:00:00 19
2018-12-31 11:30:00 10
2018-12-31 11:45:00 5
2018-12-31 12:00:00 1
2018-12-31 12:05:00 6
I want to find how many minutes there are between the val1 value that is greater than 20 and the val1 value that is lower than or equal to 5.
In this example, the answer is 1 hour and 45 minutes = 105 minutes.
I know how to check the difference between two datetime values:
(df.from_datetime-df.to_datetime).astype('timedelta64[m]')
But how to slice it over the DataFrame, detecting the proper rows?
UPDATE: taking into consideration that the date might differ between rows.
Convert the date column to datetime, the time column to timedelta, and combine them to get a full datetime:
df.time = pd.to_timedelta(df.time)
df.date = pd.to_datetime(df.date)
df['date_time'] = df['date'] + df['time']
df
date time val1 date_time
0 2018-12-31 09:00:00 15 2018-12-31 09:00:00
1 2018-12-31 10:00:00 22 2018-12-31 10:00:00
2 2018-12-31 11:00:00 19 2018-12-31 11:00:00
3 2018-12-31 11:30:00 10 2018-12-31 11:30:00
4 2018-12-31 11:45:00 5 2018-12-31 11:45:00
5 2018-12-31 12:00:00 1 2018-12-31 12:00:00
6 2018-12-31 12:05:00 6 2018-12-31 12:05:00
Now you could use one of these two methods:
1) I love lambdas, and this works with Series objects.
subtr = lambda d1, d2: abs(d1 - d2)/np.timedelta64(1, 'm')
d20 = df[df.val1 > 20].date_time.iloc[0]
d5 = df[df.val1 <= 5].date_time.iloc[0]
subtr(d20, d5)
105.0
2) Needs DataFrame objects instead of Series objects, which I find less aesthetically pleasing.
d20 = df[df.val1 > 20][['date_time']].iloc[0]
d5 = df[df.val1 <= 5][['date_time']].iloc[0]
abs(d5 - d20).astype('timedelta64[m]')[0]
105.0
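A compact variant of either method (a sketch reusing the same filters): dividing a Timedelta by pd.Timedelta(minutes=1) converts it to minutes directly.
d20 = df.loc[df.val1 > 20, 'date_time'].iloc[0]
d5 = df.loc[df.val1 <= 5, 'date_time'].iloc[0]
abs(d5 - d20) / pd.Timedelta(minutes=1)  # 105.0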
So this is my approach:
1) Keep only the rows where val1 >= 20 or val1 <= 5
df = pd.DataFrame({'date':['2018-12-31','2018-12-31','2018-12-31','2018-12-31','2018-12-31','2018-12-31','2018-12-31'],
'time':['09:00:00', '10:00:00', '11:00:00', '11:30:00', '11:45:00', '12:00:00', '12:05:00'],
'val1': [15,22,19,10,5,1,6]})
df2 = df[(df['val1'] >= 20)|(df['val1'] <= 5)].copy()
Then we run the following code:
df2['TimeDiff'] = np.where(df2['val1'] - df2['val1'].shift(-1) >= 15,
                           df2['time'].astype('datetime64[ns]').shift(-1) - df2['time'].astype('datetime64[ns]'),
                           np.NaN)
Let me go through this.
np.where works like an if statement: if the first argument (the condition) is true it uses the second argument, otherwise the third.
df2['val1'] - df2['val1'].shift(-1) >= 15: since we filtered the df, the minimum difference between two kept rows must be greater than or equal to 15.
If it is true:
df2['time'].astype('datetime64[ns]').shift(-1) - df2['time'].astype('datetime64[ns]'): we take the later time and subtract the earlier time from it.
If not true, we just return np.NaN
We get a df that looks like the following:
date time val1 TimeDiff
1 2018-12-31 10:00:00 22 01:45:00
4 2018-12-31 11:45:00 5 NaT
5 2018-12-31 12:00:00 1 NaT
If you want to attach the TimeDiff to the end time instead, you can do the following:
df2['TimeDiff'] = np.where(df2['val1'] - df2['val1'].shift(1) <= -15,
                           df2['time'].astype('datetime64[ns]') - df2['time'].astype('datetime64[ns]').shift(),
                           np.NaN)
and you will get:
date time val1 TimeDiff
1 2018-12-31 10:00:00 22 NaT
4 2018-12-31 11:45:00 5 01:45:00
5 2018-12-31 12:00:00 1 NaT
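If you then want the gap as a number of minutes rather than a timedelta (as asked in the question), one possible follow-up sketch on the TimeDiff column built above:
pd.to_timedelta(df2['TimeDiff']).dt.total_seconds().div(60)  # 105.0 for the 01:45:00 row, NaN elsewhere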

Convert and order time in a pandas df

I am trying to order timestamps in a pandas df. The times begin around 08:00:00 am and finish around 3:00:00 am. I'd like to add 24 hrs to the times after midnight, so the times read from 08:00:00 to 27:00:00. The problem is the times aren't ordered.
Example:
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
})
df = pd.DataFrame(data=d)
If I try to order the times via
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
df = df.sort_values(by='time',ascending=True)
Out:
time
4 02:00:00
6 03:00:00
0 08:00:00
1 12:00:00
5 13:00:00
2 16:00:00
3 20:00:00
Whereas I'm hoping the output is:
time
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
I'm not sure if this can be done though, specifically whether I can differentiate between 8:00:00 am and the times after midnight (1 am to 3 am).
Add a day offset to the times after midnight and before the point where a new "day" is supposed to begin (pick some cutoff after 3 am and before 7 am), then sort the values:
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True)
# Out:
0 0 days 08:00:00
1 0 days 12:00:00
2 0 days 13:00:00
3 0 days 16:00:00
4 0 days 20:00:00
5 1 days 02:00:00
6 1 days 03:00:00
The last two values are numerically equal to 26 hours & 27 hours, just displayed differently.
If you need them in HH:MM:SS format, use string-formatting with the appropriate timedelta components
Ex:
x = df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True).dt.components
x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
#Out:
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
dtype: object
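An equivalent vectorised form (a sketch, assuming df['time'] has already been converted to timedelta as in the question): Series.where keeps the values above the cutoff and adds a day to the rest, which avoids the Python-level apply.
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
shifted = df['time'].where(df['time'] > cutoff, df['time'] + day)  # add 24h to the small hours
shifted.sort_values().reset_index(drop=True)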

Changing datetime column to integer number without loop

I have a pandas dataset like this:
user_id datetime
1 13 days 21:50:00
2 0 days 02:05:00
5 10 days 00:10:00
7 2 days 01:20:00
1 3 days 11:50:00
2 1 days 02:30:00
I want to have a column that contains the minutes, so in this case the result would be:
user_id datetime minutes
1 13 days 21:50:00 20030
2 0 days 02:05:00 125
5 10 days 00:10:00 14410
7 2 days 01:20:00 2960
1 3 days 11:50:00 5030
2 1 days 02:30:00 1590
Is there any way to do that without a loop?
Yes, there is a special dt accessor for date/time series:
df['minutes'] = df['datetime'].dt.total_seconds() / 60
If you only want whole minutes, cast the result using .astype(int).
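For example, a minimal sketch assuming the datetime column already has timedelta dtype, as in the question:
df['minutes'] = (df['datetime'].dt.total_seconds() / 60).astype(int)  # 20030 for 13 days 21:50:00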
Here is a way with pd.Timedelta:
df['minutes'] = pd.to_timedelta(df.datetime) / pd.Timedelta(1, 'm')
>>> df
user_id datetime minutes
0 1 13 days 21:50:00 20030.0
1 2 0 days 02:05:00 125.0
2 5 10 days 00:10:00 14410.0
3 7 2 days 01:20:00 2960.0
4 1 3 days 11:50:00 5030.0
5 2 1 days 02:30:00 1590.0
if your datetime column is already of dtype timedelta, you can omit the explicit casting and just use:
df['minutes'] = df.datetime / pd.Timedelta(1, 'm')
