I'm trying to subtract two columns of a dataset that contain times as strings, in order to get a duration for statistical analysis.
Basically, TOC is start time and IA is end time.
Something is slightly wrong:
dfc = pd.DataFrame(zip(*[TOC,IA]),columns=['TOC','IA'])
print (dfc)
dfc.['TOC']= dfc.['TOC'].astype(dt.datetime)
dfc['TOC'] = pd.to_datetime(dfc['TOC'])
dfc['TOC'] = [time.time() for time in dfc['TOC']]
Convert the columns to datetime before subtracting:
>>> pd.to_datetime(dfc["IA"], format="%H:%M:%S")-pd.to_datetime(dfc["TOC"], format="%H:%M:%S")
0 0 days 00:08:07
1 0 days 00:15:29
2 0 days 00:11:14
3 0 days 00:27:50
dtype: timedelta64[ns]
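For the statistical analysis mentioned in the question, a plain number is usually easier to work with than a timedelta. Here is a minimal sketch, assuming the times all fall on the same day; the TOC and IA values are hypothetical sample data (the original lists are not shown), chosen so the differences match the output above, and the column name elapsed_min is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the real TOC/IA lists are not shown in the question.
TOC = ["10:00:00", "11:30:15", "13:05:00", "22:10:10"]  # start times
IA = ["10:08:07", "11:45:44", "13:16:14", "22:38:00"]   # end times
dfc = pd.DataFrame({"TOC": TOC, "IA": IA})

# Parse both columns, then subtract to get a timedelta64[ns] Series.
elapsed = (pd.to_datetime(dfc["IA"], format="%H:%M:%S")
           - pd.to_datetime(dfc["TOC"], format="%H:%M:%S"))

# Convert to float minutes so that mean(), std(), etc. return plain numbers.
dfc["elapsed_min"] = elapsed.dt.total_seconds() / 60
print(dfc["elapsed_min"].describe())
```

If an end time can fall after midnight, the strings alone are not enough and the dates would have to be included as well.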
I'm calculating time differences, in seconds, between buses' expected and actual stop times.
My problem looks like this:
# creating data
d = {
'time_A': ['2022-08-30 06:21:00', '2022-08-30 16:41:00'],
'time_B': ['2022-08-30 06:21:09', '2022-08-30 16:40:16'],
}
# creating DataFrame
my_df = pd.DataFrame(d)
my_df['time_A'] = pd.to_datetime(my_df['time_A'])
my_df['time_B'] = pd.to_datetime(my_df['time_B'])
# subtracting times
my_df['difference'] = my_df['time_B'] - my_df['time_A']
my_df
result:
time_A time_B difference
0 2022-08-30 06:21:00 2022-08-30 06:21:09 0 days 00:00:09
1 2022-08-30 16:41:00 2022-08-30 16:40:16 -1 days +23:59:16
I don't understand why the difference between today 16:40:16 and today 16:41:00 is -1 days +23:59:16.
If I do this:
my_df['difference'] = (my_df['time_B'] - my_df['time_A']).dt.seconds
Then I get
time_A time_B difference
0 2022-08-30 06:21:00 2022-08-30 06:21:09 9
1 2022-08-30 16:41:00 2022-08-30 16:40:16 86356
I would like the "difference" cell on row 0 to display something like "+9", and the one below to display "-44".
How do I do this? Thanks!
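Subtracting datetime.datetime values gives datetime.timedelta values, which are represented that way: a negative duration is normalized so that only the days component is negative. Use .total_seconds() to get the signed number of seconds. Consider the following simple example: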
import datetime
import pandas as pd
df = pd.DataFrame({"schedule":pd.to_datetime(["2000-01-01 12:00:00"]),"actual":pd.to_datetime(["2000-01-01 12:00:05"])})
df['difference_sec'] = (df['schedule'] - df['actual']).apply(datetime.timedelta.total_seconds)
print(df)
output
schedule actual difference_sec
0 2000-01-01 12:00:00 2000-01-01 12:00:05 -5.0
Note that this is a feature of datetime.timedelta; it is not specific to pandas.
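Applied to the data from the question, the same idea can be written with the Series .dt accessor instead of apply. This is only a sketch: the variable name diff and the cast to int are my own choices, not something the question asked for.

```python
import pandas as pd

my_df = pd.DataFrame({
    "time_A": ["2022-08-30 06:21:00", "2022-08-30 16:41:00"],
    "time_B": ["2022-08-30 06:21:09", "2022-08-30 16:40:16"],
})

diff = pd.to_datetime(my_df["time_B"]) - pd.to_datetime(my_df["time_A"])

# .dt.seconds returns only the non-negative seconds field of the normalized
# timedelta; .dt.total_seconds() returns the signed duration.
my_df["seconds_field"] = diff.dt.seconds
my_df["difference"] = diff.dt.total_seconds().astype(int)
print(my_df[["seconds_field", "difference"]])
```

The difference column then holds 9 and -44, while seconds_field reproduces the 86356 seen in the question.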
I need to import a .xlsx sheet into pandas which has a column for the processing time of an associated activity. All entries in this column look somewhat like this:
01:20:34
12:22:30
25:01:02
155:20:56
These show how many hours, minutes and seconds were needed. When I use pd.read_excel, pandas correctly interprets each of the timestamps with less than 24 hours and reads them as above in the first two cases. The timestamps with more than 24 hours (the last two), on the other hand, are converted into datetime objects, which look like this: 1900-01-02T14:58:03 instead of 62:58:03.
Is there a simple solution?
I think that part of the problem is not in Python/pandas but in Excel. The date '1900-01-01' is the base date used by Excel, represented by the number '1'. You can check this: if you write '0' in a cell and then format that cell as a date, you get '1900-01-00', and with '1' you get '1900-01-01'.
So, try to export your Excel file to a CSV file before importing to pandas and then import this way:
import pandas as pd
df1 = pd.read_csv('sample_data.csv')
In this case, you can get this DataFrame with the column Duration as a string (I added a column id for reference).
duration id
0 01:20:34 1
1 12:22:30 2
2 25:01:02 3
3 155:20:56 4
Then, for your purpose, I suggest you do not try to convert those values to a datetime type but to a timedelta. One strategy is to split the strings on the colons and then build a timedelta instance from the three fields: hours, minutes, and seconds.
import datetime as dt
def converter1(x):
    # Split "HH:MM:SS" (hours may exceed 24) into integer fields
    vals = x.split(':')
    vals = [int(val) for val in vals]
    out = dt.timedelta(hours=vals[0], minutes=vals[1], seconds=vals[2])
    return out
df1['deltat'] = df1['duration'].apply(converter1)
duration id deltat
0 01:20:34 1 0 days 01:20:34
1 12:22:30 2 0 days 12:22:30
2 25:01:02 3 1 days 01:01:02
3 155:20:56 4 6 days 11:20:56
If you need to convert those values to decimal hours or other new fields, use the total_seconds() method of timedelta:
df1['deltat_hr'] = df1['deltat'].apply(lambda x: x.total_seconds()/3600)
duration id deltat deltat_hr
0 01:20:34 1 0 days 01:20:34 1.342778
1 12:22:30 2 0 days 12:22:30 12.375000
2 25:01:02 3 1 days 01:01:02 25.017222
3 155:20:56 4 6 days 11:20:56 155.348889
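For larger files, the same split-and-rebuild strategy can be done without a Python-level function per row. This is a rough sketch, assuming every entry has exactly three colon-separated integer fields; the intermediate names parts and seconds are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({
    "duration": ["01:20:34", "12:22:30", "25:01:02", "155:20:56"],
    "id": [1, 2, 3, 4],
})

# Split "HH:MM:SS" into three integer columns labelled 0, 1 and 2.
parts = df1["duration"].str.split(":", expand=True).astype(int)

# Total seconds per row, then one vectorized conversion to timedelta.
seconds = parts[0] * 3600 + parts[1] * 60 + parts[2]
df1["deltat"] = pd.to_timedelta(seconds, unit="s")
df1["deltat_hr"] = df1["deltat"].dt.total_seconds() / 3600
print(df1)
```

Malformed entries (missing fields, fractional seconds) would make the astype(int) step raise, so they need handling before this point.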
I've subtracted two datetimes from each other, like so:
df['Time Difference'] = df['Time 1'] - df['Time 2']
resulting in a timedelta object. I need the total number of minutes from this object, but I can't for the life of me figure it out. Currently, the "Time Difference" column looks like this:
1 0 days 00:01:00.000000000
2 0 days 00:04:00.000000000
3 0 days 00:03:00.000000000
4 0 days 00:01:00.000000000
5 0 days 00:03:00.000000000
I've tried dividing by a numpy timedelta (which seems to be the most common suggestion) as well as by pandas timedelta, as well as a few other things. Operations such as df['Time Difference'].seconds, or .seconds(), or .total_seconds, (all suggestions I've seen for this), all give errors. I'm really at a loss for what to do here. I need this in minutes in order to make graphs in matplotlib, and I'm kind of stuck until I figure this out, so any suggestions are very much appreciated. Thanks!
Use the .dt accessor, i.e. .dt.total_seconds(), and divide by 60 to get the minutes:
import pandas as pd
df = pd.DataFrame({'td': pd.to_timedelta(['0 days 00:01:00.000000000',
'0 days 00:04:00.000000000',
'0 days 00:03:00.000000000',
'1 days 00:01:00.000000000',
'0 days 00:03:00.000000000'])})
df['delta_min'] = df['td'].dt.total_seconds() / 60
# df['delta_min']
# 0 1.0
# 1 4.0
# 2 3.0
# 3 1441.0
# 4 3.0
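Since the question mentions that dividing by a timedelta did not work, here is a small sketch showing that the division gives the same numbers as long as the column really has a timedelta64 dtype. The column names by_division and by_seconds are made up for the comparison.

```python
import pandas as pd

df = pd.DataFrame({"td": pd.to_timedelta([
    "0 days 00:01:00", "0 days 00:04:00", "0 days 00:03:00",
    "1 days 00:01:00", "0 days 00:03:00",
])})

# Dividing a timedelta Series by a one-minute Timedelta yields float minutes.
df["by_division"] = df["td"] / pd.Timedelta(minutes=1)

# Equivalent route through the .dt accessor.
df["by_seconds"] = df["td"].dt.total_seconds() / 60
print(df)
```

If either line raises, checking df["td"].dtype is a reasonable first step, because an object column of strings needs pd.to_timedelta first.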
This is my dataframe:
issue,time_taken
aa,2 days 20:00:00.95
bb,2 days 19:12:48.276000
I just want to convert the time_taken column format into hours. I only need the total number of hours.
For example, I have to display output like
issue,time_taken,time_taken_hours
aa,2 days 20:00:00.95,68
I think you can use:
import numpy as np
df['time_taken_hours'] = df['time_taken'] / np.timedelta64(1, 'h')
print(df)
issue time_taken time_taken_hours
0 aa 2 days 20:07:49.958000 68.130544
1 bb 2 days 19:12:13.383000 67.203717
See the frequency conversion section of the pandas documentation.
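The question asks for the total number of hours as a whole number (68), whereas the division above returns a float. One way to truncate, sketched here with the values from the question rebuilt by hand as Timedelta objects (an assumption about how the column is stored):

```python
import pandas as pd

df = pd.DataFrame({
    "issue": ["aa", "bb"],
    "time_taken": [
        pd.Timedelta(days=2, hours=20, milliseconds=950),
        pd.Timedelta(days=2, hours=19, minutes=12, seconds=48, milliseconds=276),
    ],
})

# Floor-divide the total seconds by 3600 to drop the partial hour.
df["time_taken_hours"] = (df["time_taken"].dt.total_seconds() // 3600).astype(int)
print(df)
```

This rounds toward negative infinity, which only matters if the column can contain negative durations.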
I have a pandas DataFrame with a column "StartTime" that could be any datetime value. I would like to create a second column that gives the StartTime relative to the beginning of the week (i.e., 12am on the previous Sunday). For example, this post is 5 days, 14 hours since the beginning of this week.
StartTime
1 2007-01-19 15:59:24
2 2007-03-01 04:16:08
3 2006-11-08 20:47:14
4 2008-09-06 23:57:35
5 2007-02-17 18:57:32
6 2006-12-09 12:30:49
7 2006-11-11 11:21:34
I can do this, but it's pretty dang slow:
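def time_since_week_beg(x):
    # Timestamp.to_datetime() was removed from later pandas versions;
    # to_pydatetime() is the replacement
    y = x.to_pydatetime()
    return pd.Timedelta(days=y.weekday(),
                        hours=y.hour,
                        minutes=y.minute,
                        seconds=y.second
                        )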
df['dt'] = df.StartTime.apply(time_since_week_beg)
What I want is something like this, that doesn't result in an error:
df['dt'] = pd.Timedelta(days=df.StartTime.dt.dayofweek,
hours=df.StartTime.dt.hour,
minute=df.StartTime.dt.minute,
second=df.StartTime.dt.second
)
TypeError: Invalid type <class 'pandas.core.series.Series'>. Must be int or float.
Any thoughts?
You can use a list comprehension:
df['dt'] = [pd.Timedelta(days=ts.dayofweek,
hours=ts.hour,
minutes=ts.minute,
seconds=ts.second)
for ts in df.StartTime]
>>> df
StartTime dt
0 2007-01-19 15:59:24 4 days 15:59:24
1 2007-03-01 04:16:08 3 days 04:16:08
2 2006-11-08 20:47:14 2 days 20:47:14
3 2008-09-06 23:57:35 5 days 23:57:35
4 2007-02-17 18:57:32 5 days 18:57:32
5 2006-12-09 12:30:49 5 days 12:30:49
6 2006-11-11 11:21:34 5 days 11:21:34
Depending on the format of StartTime, you may need:
...for ts in pd.to_datetime(df.StartTime)
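If the list comprehension is still too slow, the calculation can also be expressed with whole-column operations. This is a sketch rather than a benchmarked solution; time_of_day and dt_sunday are names I made up. Note that dayofweek counts from Monday (Monday is 0), so the second column shifts it for a week that starts on Sunday, as the question describes.

```python
import pandas as pd

df = pd.DataFrame({"StartTime": pd.to_datetime([
    "2007-01-19 15:59:24",  # a Friday
    "2007-03-01 04:16:08",  # a Thursday
])})

start = df["StartTime"]

# Time elapsed since midnight of the same day, as a timedelta Series.
time_of_day = start - start.dt.normalize()

# Whole days since Monday 00:00, matching the output shown above.
df["dt"] = pd.to_timedelta(start.dt.dayofweek, unit="D") + time_of_day

# Whole days since Sunday 00:00: shift the weekday numbering by one.
df["dt_sunday"] = pd.to_timedelta((start.dt.dayofweek + 1) % 7, unit="D") + time_of_day
print(df)
```

Unlike the versions above, this keeps any sub-second part of StartTime, because it subtracts midnight instead of rebuilding the time from hour, minute and second.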