Calculate difference between two datetimes if both present in pandas DataFrame - python

I currently have various time columns (DateTime format) in a pandas DataFrame, as shown below:
Entry Time Exit Time
00:30:59.555 06:30:59.555
00:56:43.200
10:30:30.500 11:30:30.500
I would like to return the difference between these times (Exit Time - Entry Time) in a new column in the dataframe if both Entry Time and Exit Time are present. Otherwise, I would like to skip the row, as shown below:
Entry Time Exit Time Time Difference
00:30:59.555 06:30:59.555 06:00:00.000
00:56:43.200
10:30:30.500 12:00:30.500 01:30:00.000
I am fairly new to Python, so my apologies if this is an obvious question. Any help would be greatly appreciated!

If your dtypes really are datetimes, then it's simple:
In [36]:
df['Difference Time'] = df['Exit Time'] - df['Entry Time']
df
Out[36]:
Entry Time Exit Time Difference Time
0 2014-08-01 00:30:59.555000 2014-08-01 06:30:59.555000 06:00:00
1 2014-08-01 00:56:43.200000 NaT NaT
2 2014-08-01 10:30:30.500000 2014-08-01 11:30:30.500000 01:00:00
[3 rows x 3 columns]
If they are not then you need to convert them using pd.to_datetime e.g.
df['Entry Time'] = pd.to_datetime(df['Entry Time'])
EDIT
There seems to be some additional weirdness with your data which I don't quite understand, but the following seems to have worked for you:
df.dropna()['Exit_Time'] - df.dropna()['Entry_Time']
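A minimal, self-contained sketch of the idea (column names from the question; the sample values are assumed): subtracting the columns directly lets NaT propagate, so rows with a missing Exit Time simply get NaT in the result and no explicit skipping is needed.

```python
import pandas as pd

df = pd.DataFrame({
    "Entry Time": pd.to_datetime(["00:30:59.555", "00:56:43.200", "10:30:30.500"]),
    "Exit Time": pd.to_datetime(["06:30:59.555", None, "11:30:30.500"]),
})

# NaT propagates through the subtraction, so incomplete rows yield NaT
df["Time Difference"] = df["Exit Time"] - df["Entry Time"]
```

If you truly want those rows dropped rather than marked NaT, `df.dropna(subset=["Entry Time", "Exit Time"])` before subtracting does that in one step.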

Related

Issue while converting pandas to datetime

I'm converting string to datetime datatype using pandas,
here is my snippet,
df[col] = pd.to_datetime(df[col], format='%H%M%S%d%m%Y', errors='coerce')
input :
col
00000001011970
00000001011970
...
00000001011970
output:
col
1970-01-01
1970-01-01
...
1970-01-01 00:00:00
the output consists of dates, and dates with time.
I need the output as date with time.
Please help me out; where am I going wrong?
The time is there. It just so happens, because it's midnight, 00:00:00, it is not showing explicitly.
You can see it with, e.g.,
df[col].dt.minute
which will give a Series of 0's.
To print out the time explicitly, you could use
df[col].dt.strftime('%H:%M:%S')
Alter the format as you see fit.
Keep in mind that the visual output of anything in pandas (or computers in general) does not have to be exactly what is stored. It is up to the programmer to format the output into what they want. But calculations on the variables still use all the (invisible) information.
Just like the other answer suggested, the time is there, but since it's midnight (00:00:00) it's not shown explicitly. To print out the date with the time you can try this (note that strftime returns strings, not datetimes):
df[col] = pd.to_datetime(df[col], format='%H%M%S%d%m%Y', errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')
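A small sketch tying both answers together (input value taken from the question): the parsed timestamp carries the midnight time even when the default display hides it, and strftime makes it explicit.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["00000001011970"]), format="%H%M%S%d%m%Y", errors="coerce")

# The time component is stored even if not displayed
assert (s.dt.hour == 0).all() and (s.dt.minute == 0).all()

# Force an explicit date-with-time representation (returns strings)
formatted = s.dt.strftime("%Y-%m-%d %H:%M:%S")
```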

Pandas | How to get the time difference in seconds between two columns that contain timestamps

I have two columns that both contain times and I need to get the difference of the two times. I would like to add the difference of each row timestamps to a new column "time_diff". The times are only going to be 10-30 seconds apart so I need the time_diff column to be a difference in the seconds(like this format 00:00:07).
I'm really struggling with this; it's for my work and a bit out of my element. I greatly appreciate all of the answers.
Example of the format of the two columns
start_time | end_time
00:06:34   | 00:06:45
00:06:59   | 00:07:02
00:07:36   | 00:07:34
First convert these into datetime format as given below:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
Then, you can perform subtract operations:
df['diff'] = df['end']-df['start']
This will give you the answer as a timedelta, displayed as HH:MM:SS.
In case you want the answer only in seconds, this will give the output as the total seconds of difference:
df['diff'] = (df['end']-df['start']).dt.total_seconds()
Similar to @Dhiraj's answer:
df["time_diff"] = pd.to_datetime(df["end_time"]) - pd.to_datetime(df["start_time"])
df["time_diff_secs"] = (pd.to_datetime(df["end_time"]) - pd.to_datetime(df["start_time"])).dt.total_seconds()
OUTPUT->
start_time end_time time_diff time_diff_secs
0 00:06:34 00:06:45 00:00:11 11.0
1 00:06:59 00:07:02 00:00:03 3.0
2 00:07:36 00:07:34 -1 days +23:59:58 -2.0
You should be able to just do df["difference"] = df["end_time"] - df["start_time"] assuming your columns aren't strings. You can use pandas.to_datetime() to convert a column into datetime if that's the case. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
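A compact sketch combining the answers above (sample times from the question): compute the timedelta, then, since the asker wants a zero-padded 00:00:07-style string, format the total seconds by hand. The `fmt_hms` helper is illustrative, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"start_time": ["00:06:34", "00:06:59"],
                   "end_time":   ["00:06:45", "00:07:02"]})

df["time_diff"] = pd.to_datetime(df["end_time"]) - pd.to_datetime(df["start_time"])
df["time_diff_secs"] = df["time_diff"].dt.total_seconds()

def fmt_hms(seconds):
    # Illustrative helper: render non-negative seconds as HH:MM:SS
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

df["time_diff_str"] = df["time_diff_secs"].apply(fmt_hms)
```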

Finding individual timespans between two columns of data in datetime format, possibly using for loop

I am new to Python and coding in general and I have an issue with programming a for loop, as suggested by an instructor, to find the time elapsed between the shutdown and restart times for a powerplant.
I managed to isolate the columns I was interested in by forming a dataframe:
oilSubData4 = pd.DataFrame(oilData[['Shutdown Date/Time', 'Restart Date/Time']])
I also managed to convert the columns into datetime format and removed the NaT rows:
oilSubData4['Shutdown Date/Time'] = pd.to_datetime(oilSubData4['Shutdown Date/Time'])
oilSubData4['Restart Date/Time'] = pd.to_datetime(oilSubData4['Restart Date/Time'])
oilShutdownTime = oilSubData4.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
It all culminated to form this
image of the data I wish to find the individual timespans of
It's at this point that I struggled to find a way to find the time difference between each restart time/date and shutdown time/date for each row of data. I am not experienced with for loops and I am unsure of how to begin.
My attempt is as follows:
for x in oilShutdownTime:
    oilShutdownTime['time_diff'][x+1] = oilShutdownTime['Restart Date/Time'][x+1] - oilShutdownTime['Restart Date/Time'][x]
and the following error shows:
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      1 for x in oilShutdownTime:
----> 2     oilShutdownTime['time_diff'][x+1] = oilShutdownTime['Restart Date/Time'][x+1] - oilShutdownTime['Restart Date/Time'][x]

TypeError: can only concatenate str (not "int") to str
Please advise on how to fix this if possible. I am aware that my code might be wholly inaccurate so any help is appreciated, thank you!
Your for-loop confuses me; I am not sure what you were trying to do.
Let's take the first row from your example:
"Shutdown Date/Time","Restart Date/Time"
2010-01-08 23:41:00,2010-01-13 09:17:00
Do you want the result for this row to be 4 days 09:36:00? This is how I understand your question. In that case, try the following:
>>> oilShutdownTime['Restart Date/Time'] - oilShutdownTime['Shutdown Date/Time']
0 4 days 09:36:00
1 0 days 01:00:00
2 0 days 00:00:40
3 0 days 12:10:00
4 1 days 10:03:15
dtype: timedelta64[ns]
Note: only the first row is real data; I did not type in all the dates from your image.
There might be no need for a loop at this point. Pandas can handle column-wise calculations way more efficiently internally than you could with a for-loop on your own.
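For completeness, a runnable sketch of the vectorized approach (the one row of real data from the question; the DataFrame construction is assumed):

```python
import pandas as pd

oilShutdownTime = pd.DataFrame({
    "Shutdown Date/Time": pd.to_datetime(["2010-01-08 23:41:00"]),
    "Restart Date/Time":  pd.to_datetime(["2010-01-13 09:17:00"]),
})

# One vectorized subtraction replaces the whole for-loop
oilShutdownTime["time_diff"] = (
    oilShutdownTime["Restart Date/Time"] - oilShutdownTime["Shutdown Date/Time"]
)
```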

How to get the day difference between date-column and maximum date of same column or different column in Python?

I am setting up a new column as the day difference in Python (on Jupyter notebook).
I calculated the day difference between the date column and the current day, and also between the date column and a date derived from the current day (current day -/+ input days, using the timedelta function).
However, whenever I use max() of the same column or a different column, the day-difference column has NaN values. That does not make sense to me; maybe I am missing something about the date types. When I checked the types, all of them seem to be datetime64 (already converted to datetime64 by me).
I thought the reason was the date not being big enough, but it happens with any specific date, e.g. max(datecolumn) + timedelta(days=i).
t=data_signups[["date_joined"]].max()
date_joined 2019-07-18 07:47:24.963450
dtype: datetime64[ns]
t = t + timedelta(30)
date_joined 2019-08-17 07:47:24.963450
dtype: datetime64[ns]
data_signups['joined_to_today'] = (t - data_signups['date_joined']).dt.days
data_signups.head(2)
shortened...
date_joined                  joined_to_today
2019-05-31 10:52:06.327341   NaN
2019-04-02 09:20:26.520272   NaN
However it worked on Current day task like below.
Currentdate = datetime.datetime.now()
print(Currentdate)
2019-09-01 17:05:48.934362
before_days=int(input("Enter the number of days before today for analysis "))
30
Done
last_day_for_analysis = Currentdate - timedelta(days=before_days)
print(last_day_for_analysis)
2019-08-02 17:05:48.934362
data_signups['joined_to_today'] = (last_day_for_analysis - data_signups['date_joined']).dt.days
data_signups.head(2)
shortened...
date_joined                  joined_to_today
2019-05-31 10:52:06.327341   63
2019-04-02 09:20:26.520272   122
I expect that there is datetype problem. However, I could not figure out since all of them are datetime64. There are no NaN values in the columns.
Thank you for your help. I am newbie and I try to learn everyday continuously.
Although I was busy with this question for 2 days, I now realize that I made a big mistake. Sorry, everyone.
The maximum value could not be used as a date because of the double brackets:
Existing one: t = data_signups[["date_joined"]].max()
Must be: t = data_signups["date_joined"].max()
Double brackets select a one-column DataFrame, so .max() returns a Series; single brackets select the column itself, so .max() returns a scalar Timestamp that subtracts cleanly. So it works as below:
data_signups['joined_to_today'] = (data_signups['date_joined'].max() - data_signups['date_joined']).dt.days
data_signups.head(3)
No double brackets. Such a simple mistake. Thank you.
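A small sketch of why the brackets matter (the dates below are assumed for illustration): the Series returned by the double-bracket version mis-aligns on subtraction, while the scalar from the single-bracket version broadcasts correctly.

```python
import pandas as pd

data_signups = pd.DataFrame(
    {"date_joined": pd.to_datetime(["2019-05-31", "2019-04-02", "2019-07-18"])})

series_max = data_signups[["date_joined"]].max()  # Series indexed by column name
scalar_max = data_signups["date_joined"].max()    # scalar Timestamp

# Series - Series aligns on index labels; the labels don't match, so every value is NaT
bad = series_max - data_signups["date_joined"]

# Scalar - Series broadcasts, giving the intended day differences
data_signups["joined_to_today"] = (scalar_max - data_signups["date_joined"]).dt.days
```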

A Multi-Index Construction for Intraday TimeSeries (10 min price data)

I have a file with intraday prices every ten minutes: times [0:41] in a day, so each date is repeated 42 times. The multi-index below should "collapse" the repeated dates into one for all times.
There are 62,035 rows x 3 columns: [date, time, price].
I would like to write a function to get the difference of the ten minute prices, restricting differences to each unique date.
In other words, 09:30 is the first time of each day and 16:20 is the last: I cannot overlap differences between days of price from 16:20 - 09:30. The differences should start as 09:40 - 09:30 and end as 16:20 - 16:10 for each unique date in the dataframe.
Here is my attempt. Any suggestions would be greatly appreciated.
def diffSeries(rounded, data):
    '''This function accepts a column called rounded from 'data'.
    The 2nd input 'data' is a dataframe.
    '''
    df = rounded.shift(1)
    idf = data.set_index(['date', 'time'])
    data['diff'] = ['000']
    for i in range(0, length(rounded)):
        for day in idf.index.levels[0]:
            for time in idf.index.levels[1]:
                if idf.index.levels[1] != 1620:
                    data['diff'] = rounded[i] - df[i]
                else:
                    day += 1
                    time += 2
    data[['date', 'time', 'price', 'II', 'diff']].to_csv('final.csv')
    return data['diff']
Then I call:
data=read_csv('file.csv')
rounded=roundSeries(data['price'],5)
diffSeries(rounded,data)
In the traceback, I get an AssertionError.
You can use groupby and then apply to achieve what you want:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
For a full example, suppose you create a test data set for 14 Nov to 16 Nov:
import pandas as pd
from numpy.random import randn
from datetime import datetime, time

# Create a date range with 10 minute intervals, and filter out irrelevant times
times = pd.bdate_range(start=datetime(2012, 11, 14), end=datetime(2012, 11, 17), freq='10min')
filtered_times = [x for x in times if time(9, 30) <= x.time() <= time(16, 20)]
prices = randn(len(filtered_times))

# Create a MultiIndex and data frame matching the format of your CSV
arrays = [[x.date() for x in filtered_times],
          [x.time() for x in filtered_times]]
tuples = list(zip(*arrays))
m_index = pd.MultiIndex.from_tuples(tuples, names=['date', 'time'])
data = pd.DataFrame({'prices': prices}, index=m_index)
You should get a DataFrame a bit like this:
prices
date time
2012-11-14 09:30:00 0.696054
09:40:00 -1.263852
09:50:00 0.196662
10:00:00 -0.942375
10:10:00 1.915207
As mentioned above, you can then get the differences by grouping by the first index and then subtracting the previous row for each row:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
Which gives you something like:
prices
date time
2012-11-14 09:30:00 NaN
09:40:00 -1.959906
09:50:00 1.460514
10:00:00 -1.139036
10:10:00 2.857582
Since you are grouping by the date, the function is not applied for 16:20 - 09:30.
You might want to consider using a TimeSeries instead of a DataFrame, because it will give you far greater flexibility with this kind of data. Supposing you have already loaded your DataFrame from the CSV file, you can easily convert it into a TimeSeries and perform a similar function to get the differences:
from datetime import datetime
dt_index = pd.DatetimeIndex([datetime.combine(i[0], i[1]) for i in data.index])
# or dt_index = pd.DatetimeIndex([datetime.combine(i.date, i.time) for i in data.index])
# if you don't have a multi-level index on data yet
ts = pd.Series(data.prices.values, dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
However, you would now have access to the built-in time series functions such as resampling. See here for more about time series in pandas.
@MattiJohn's construction gives a filtered list of length 86,772 when run over 1/3/2007 to 8/30/2012 with 42 times per day (10 minute intervals). Observe the data cleaning issues.
Here the data of prices coming from the csv is length: 62,034.
Hence, simply importing from the .csv, as follows, is problematic:
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
DF=pd.read_csv('MR10min.csv')
prices = DF.price
# I.E. rather than the generic: prices = randn(len(filtered_times)) above.
The fact that the real data falls short of the length it "should be" means there are data cleaning issues. Often we do not have the full times as bdate_time will generate (half days in the market, etc, holidays).
Your solution is elegant, but I am not sure how to overcome the mismatch between the actual data and the a priori, prescribed dataframe.
Your second TimeSeries suggestion seems to still require constructing a datetime index similar to the first one. For example, if I use the following lines to get the actual data of interest:
DF = pd.read_csv('MR10min.csv')
data = DF.set_index(['date', 'time'])
dt_index = pd.DatetimeIndex([datetime.combine(i[0], i[1]) for i in data.index])
It will generate a:
TypeError: combine() argument 1 must be datetime.date, not str
How does one make a bdate_time array completely informed by the actual data available?
Thank you to @MattiJohn and to anyone with interest in continuing this discussion.
