How to subtract dates while accounting for null values? - python

I have a column that calculates the duration in seconds from A to B, in the format hh:mm:ss. However, A and B may contain null values in the data.
Let's say A=05:15:00 and B=NaT; then the subtraction will return 5:15, which is misleading and wrong in context because B is missing! How can I specify that B should be subtracted from A only if both columns are NOT NULL?
This is the code I have:
df['A_to_B']=(df.B-df.A).dt.total_seconds()

Python does not use null; instead it has a value called None to represent the absence of a value. So you would check that df.B and df.A are both not None, perhaps like this:
if (df.A is not None) and (df.B is not None):
    df['A_to_B'] = (df.B-df.A).dt.total_seconds()
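Note, though, that for a pandas Series, is not None tests the Series object itself, never its elements; element-wise null checks go through Series.notna(). A minimal sketch (sample value taken from the question):
import pandas as pd

a = pd.Series(pd.to_timedelta(['05:15:00', None]))
print(a is not None)   # True: the Series object itself is never None
print(a.notna())       # element-wise check: [True, False]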

You can do:
df['A_to_B'] = np.where(df['A'].notna() & df['B'].notna(),
                        (df['A'] - df['B']).dt.total_seconds(),
                        np.nan)
Sample data:
A B
0 05:15:00 NaT
1 NaT 00:00:15
2 05:15:00 00:15:00
Output:
A B A_to_B
0 05:15:00 NaT NaN
1 NaT 00:00:15 NaN
2 05:15:00 00:15:00 18000.0
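For reference, a self-contained sketch of this approach, with the sample data above built as timedeltas:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': pd.to_timedelta(['05:15:00', None, '05:15:00']),
    'B': pd.to_timedelta([None, '00:00:15', '00:15:00']),
})
df['A_to_B'] = np.where(df['A'].notna() & df['B'].notna(),
                        (df['A'] - df['B']).dt.total_seconds(),
                        np.nan)
print(df)   # row 2: 05:15:00 - 00:15:00 -> 18000.0 seconds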

Related

how to convert time in unorthodox format to timestamp in pandas dataframe

I have a column in my dataframe which I want to convert to a Timestamp. However, it is in a bit of a strange format that I am struggling to manipulate. The column is in the format HHMMSS, but does not include the leading zeros.
For example for a time that should be '00:03:15' the dataframe has '315'. I want to convert the latter to a Timestamp similar to the former. Here is an illustration of the column:
message_time
25
35
114
1421
...
235347
235959
Thanks
Use Series.str.zfill to add the leading zeros and then to_datetime:
s = df['message_time'].astype(str).str.zfill(6)
df['message_time'] = pd.to_datetime(s, format='%H%M%S')
print (df)
message_time
0 1900-01-01 00:00:25
1 1900-01-01 00:00:35
2 1900-01-01 00:01:14
3 1900-01-01 00:14:21
4 1900-01-01 23:53:47
5 1900-01-01 23:59:59
In my opinion it is better to create timedeltas here, with to_timedelta:
s = df['message_time'].astype(str).str.zfill(6)
df['message_time'] = pd.to_timedelta(s.str[:2] + ':' + s.str[2:4] + ':' + s.str[4:])
print (df)
message_time
0 00:00:25
1 00:00:35
2 00:01:14
3 00:14:21
4 23:53:47
5 23:59:59
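One upside of the timedelta representation is that arithmetic works on it directly; a small sketch (sample values from the question):
import pandas as pd

s = pd.Series(['25', '35', '235959']).str.zfill(6)
td = pd.to_timedelta(s.str[:2] + ':' + s.str[2:4] + ':' + s.str[4:])
print(td.dt.total_seconds())   # 25.0, 35.0, 86399.0
print(td.max() - td.min())     # 0 days 23:59:34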

Using pandas to perform time delta from 2 "hh:mm:ss XX" columns in Microsoft Excel

I have an Excel file with a column named StartTime holding hh:mm:ss XX data, and the cells are in the `h:mm:ss AM/PM` custom format. For example,
ID StartTime
1 12:00:00 PM
2 1:00:00 PM
3 2:00:00 PM
I used the following code to read the file
df = pd.read_excel('./mydata.xls',
                   sheet_name='Sheet1',
                   converters={'StartTime': str},
                   )
df shows
ID StartTime
1 12:00:00
2 1:00:00
3 2:00:00
Is this a bug, or how do you overcome it? Thanks.
[Update: 7-Dec-2018]
I guess I may have made changes to the Excel file that made it weird. I created another Excel file and present it here (I could not attach an Excel file here, and it would not be safe anyway):
I created the following code to test:
import pandas as pd
df = pd.read_excel('./Book1.xlsx',
                   sheet_name='Sheet1',
                   converters={'StartTime': str,
                               'EndTime': str
                               }
                   )
df['Hours1'] = pd.NaT
df['Hours2'] = pd.NaT
print(df,'\n')
df.loc[~df.StartTime.isnull() & ~df.EndTime.isnull(),
       'Hours1'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
df['Hours2'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
print(df)
The outputs are
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 NaT NaT
1 1 12:00:00 13:00:00 NaT NaT
2 2 13:00:00 14:00:00 NaT NaT
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 3600000000000 01:00:00
1 1 12:00:00 13:00:00 3600000000000 01:00:00
2 2 13:00:00 14:00:00 3600000000000 01:00:00
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
Now the question has become: "Using pandas to perform time delta from 2 "hh:mm:ss XX" columns in Microsoft Excel". I have changed the title of the question too. Thanks to those who replied and tried it out.
The question is:
How can I represent the time value in hours instead of raw nanoseconds?
It seems that the StartTime column is formatted as text in your file.
Have you tried reading it with parse_dates along with a parser function specified via the date_parser parameter? It should work similarly to read_csv(), although the docs don't list the above options explicitly despite them being available.
Like so:
from datetime import datetime  # pd.datetime is deprecated; use the stdlib datetime

pd.read_excel(r'./mydata.xls',
              parse_dates=['StartTime'],
              date_parser=lambda x: datetime.strptime(x, '%I:%M:%S %p').time())
Given the update:
pd.read_excel(r'./mydata.xls', parse_dates=['StartTime', 'EndTime'])
(df['EndTime'] - df['StartTime']).dt.seconds//3600
alternatively
# '//' is available since pandas v0.23.4, otherwise use '/' and round
(df['EndTime'] - df['StartTime'])//pd.Timedelta(1, 'h')
both resulting in the same
0 1
1 1
2 1
dtype: int64
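The hours arithmetic can be checked without the Excel file; a minimal sketch with the parsed times rebuilt in memory (values from the update above):
import pandas as pd

df = pd.DataFrame({'StartTime': ['11:00:00', '12:00:00', '13:00:00'],
                   'EndTime':   ['12:00:00', '13:00:00', '14:00:00']})
delta = pd.to_datetime(df['EndTime']) - pd.to_datetime(df['StartTime'])
print(delta // pd.Timedelta(1, 'h'))   # 1, 1, 1 for every row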

How to update some of the rows from another series in pandas using df.update

I have a df like,
stamp value
0 00:00:00 2
1 00:00:00 3
2 01:00:00 5
converting to timedelta,
df['stamp']=pd.to_timedelta(df['stamp'])
slicing only the odd indices and adding 30 mins,
odd_df=pd.to_timedelta(df[1::2]['stamp'])+pd.to_timedelta('30 min')
#print(odd_df)
1 00:30:00
Name: stamp, dtype: timedelta64[ns]
now, updating df with odd_df,
as per the documentation it should give my expected output.
expected output:
df.update(odd_df)
#print(df)
stamp value
0 00:00:00 2
1 00:30:00 3
2 01:00:00 5
What I am getting,
df.update(odd_df)
#print(df)
stamp value
0 00:30:00 00:30:00
1 00:30:00 00:30:00
2 00:30:00 00:30:00
Please help; what is wrong with this?
Try this instead:
df.loc[1::2, 'stamp'] += pd.to_timedelta('30 min')
This ensures you update just the values specified by the .loc indexer while keeping the rest of your original DataFrame intact. To test, run df.shape: you will get (3, 2) with the method above.
In your code here:
odd_df=pd.to_timedelta(df[1::2]['stamp'])+pd.to_timedelta('30 min')
The odd_df Series only holds the parts of your original DataFrame that you sliced. The shape of odd_df is (1,).
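For completeness, a self-contained sketch of the .loc fix using the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'stamp': ['00:00:00', '00:00:00', '01:00:00'],
                   'value': [2, 3, 5]})
df['stamp'] = pd.to_timedelta(df['stamp'])
df.loc[1::2, 'stamp'] += pd.to_timedelta('30 min')
print(df)   # stamp: 00:00:00, 00:30:00, 01:00:00; value column untouched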

pandas get data for the end day of month?

The data is given as following:
return
2010-01-04 0.016676
2010-01-05 0.003839
...
2010-01-05 0.003839
2010-01-29 0.001248
2010-02-01 0.000134
...
What I want is to extract all values that fall on the last day of each month that appears in the data.
2010-01-29 0.001248
2010-02-28 ......
If I directly use pandas.resample, i.e. df.resample('M').last(), I select the correct rows but with the wrong index (it automatically uses the last calendar day of the month as the index):
2010-01-31 0.001248
2010-02-28 ......
How can I get the correct answer in a Pythonic way?
An assumption made here is that your date data is part of the index. If not, I recommend setting it first.
Single Year
I don't think the resampling or grouper functions will do the trick here. Let's group on the month number instead and call DataFrameGroupBy.tail.
df.groupby(df.index.month).tail(1)
Multiple Years
If your data spans multiple years, you'll need to group on both the year and the month. Using a single grouper created from index.strftime:
df.groupby(df.index.strftime('%Y-%m')).tail(1)
Or, using multiple groupers—
df.groupby([df.index.year, df.index.month]).tail(1)
Note—if your index is not a DatetimeIndex as assumed here, you'll need to replace df.index with pd.to_datetime(df.index, errors='coerce') above.
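A quick sketch of the multi-year variant on a toy frame (dates and values borrowed from the question):
import pandas as pd

idx = pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-29', '2010-02-01'])
df = pd.DataFrame({'return': [0.016676, 0.003839, 0.001248, 0.000134]}, index=idx)
print(df.groupby([df.index.year, df.index.month]).tail(1))
# keeps 2010-01-29 and 2010-02-01, preserving the original index labels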
Although this doesn't answer the question properly, I'll leave it here in case someone is interested.
An approach which only works if you are certain you have all days (IMPORTANT!) is to add one day with pd.Timedelta and check whether day == 1. I did a small timing test and it is 6x faster than the groupby solution.
df[(df['dates'] + pd.Timedelta(days=1)).dt.day == 1]
Or if index:
df[(df.index + pd.Timedelta(days=1)).day == 1]
Full example:
import pandas as pd
df = pd.DataFrame({
    'dates': pd.date_range(start='2016-01-01', end='2017-12-31'),
    'i': 1
}).set_index('dates')
dfout = df[(df.index + pd.Timedelta(days=1)).day == 1]
print(dfout)
Returns:
i
dates
2016-01-31 1
2016-02-29 1
2016-03-31 1
2016-04-30 1
2016-05-31 1
2016-06-30 1
2016-07-31 1
2016-08-31 1
2016-09-30 1
2016-10-31 1
2016-11-30 1
2016-12-31 1
2017-01-31 1
2017-02-28 1
2017-03-31 1
2017-04-30 1
2017-05-31 1
2017-06-30 1
2017-07-31 1
2017-08-31 1
2017-09-30 1
2017-10-31 1
2017-11-30 1
2017-12-31 1

find closest rows between dataframes with positive timedelta

I have two dataframes each with a datetime column:
df_long=
mytime_long
0 00:00:01 1/10/2013
1 00:00:05 1/10/2013
2 00:00:55 1/10/2013
df_short=
mytime_short
0 00:00:02 1/10/2013
1 00:00:03 1/10/2013
2 00:00:06 1/10/2013
The timestamps are unique and can be assumed sorted in each of the two dataframes.
I would like to create a new dataframe that contains, for each mytime_long, the nearest (index, mytime_short) value at or after that time (hence with a non-negative timedelta),
ex.
0 (0, 00:00:02 1/10/2013)
1 (2, 00:00:06 1/10/2013)
2 (np.nan,np.nat)
Write a function to get the closest index & timestamp in df_short given a timestamp:
import numpy as np

def get_closest(n):
    mask = df_short.mytime_short >= n
    ids = np.where(mask)[0]
    if ids.size > 0:
        return ids[0], df_short.mytime_short[ids[0]]
    else:
        return np.nan, np.nan
Apply this function over df_long.mytime_long to get a new Series with the index & timestamp values as tuples:
df = df_long.mytime_long.apply(get_closest)
df
# output:
0 (0, 2013-01-10 00:00:02)
1 (2, 2013-01-10 00:00:06)
2 (nan, nan)
ilia timofeev's answer reminded me of the pandas.merge_asof function, which is perfect for this type of join:
df = pd.merge_asof(df_long,
                   df_short.reset_index(),
                   left_on='mytime_long',
                   right_on='mytime_short',
                   direction='forward')[['index', 'mytime_short']]
df
# output:
index mytime_short
0 0.0 2013-01-10 00:00:02
1 2.0 2013-01-10 00:00:06
2 NaN NaT
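Note that merge_asof requires both frames to be sorted on their join keys, which the question guarantees; a self-contained version of the call:
import pandas as pd

df_long = pd.DataFrame({'mytime_long': pd.to_datetime(
    ['1/10/2013 00:00:01', '1/10/2013 00:00:05', '1/10/2013 00:00:55'])})
df_short = pd.DataFrame({'mytime_short': pd.to_datetime(
    ['1/10/2013 00:00:02', '1/10/2013 00:00:03', '1/10/2013 00:00:06'])})
res = pd.merge_asof(df_long, df_short.reset_index(),
                    left_on='mytime_long', right_on='mytime_short',
                    direction='forward')[['index', 'mytime_short']]
print(res)   # index 0.0 / 2.0 / NaN, matching the output above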
A little bit ugly, but an effective way to solve the task. The idea is to join them on timestamp and select the first "short" time after each "long" one, if any.
#recreate data
df_long = pd.DataFrame(
    pd.to_datetime(['00:00:01 1/10/2013', '00:00:05 1/10/2013', '00:00:55 1/10/2013']),
    index=[0, 1, 2], columns=['mytime_long'])
df_short = pd.DataFrame(
    pd.to_datetime(['00:00:02 1/10/2013', '00:00:03 1/10/2013', '00:00:06 1/10/2013']),
    index=[0, 1, 2], columns=['mytime_short'])
#join by time, preserving ids
df_all = df_short.assign(inx_s=df_short.index).set_index('mytime_short').join(
    df_long.assign(inx_l=df_long.index).set_index('mytime_long'), how='outer')
#mark all "short" rows with nearest "long" id
df_all['inx_l'] = df_all.inx_l.ffill().fillna(-1)
#select "short" rows
df_short_candidate = df_all[~df_all.inx_s.isnull()].astype(int)
df_short_candidate['mytime_short'] = df_short_candidate.index
#select the minimal "short" time in each "long" group,
#then join back with long to recover the empty intersection
df_res = df_long.join(df_short_candidate.groupby('inx_l').first())
print (df_res)
Out:
mytime_long inx_s mytime_short
0 2013-01-10 00:00:01 0.0 2013-01-10 00:00:02
1 2013-01-10 00:00:05 2.0 2013-01-10 00:00:06
2 2013-01-10 00:00:55 NaN NaT
Performance comparison on a sample of 100,000 elements:
186 ms to execute this implementation.
1min 3s to execute df_long.mytime_long.apply(get_closest).
UPD: but the winner is @Haleemur Ali's pd.merge_asof, with 10 ms.
