change string into datetime in pandas - python

How can I convert the following datetime strings into datetime in Python? Here's my dataframe:
IN                                             OUT
2022/6/10 10:20:30.00000000000000000000000000  2022/6/17 13:25:30
2022/6/5 12:48:10.0                            2022/6/11 10:15
2022/6/9 08:25:30                              2022/6/13 10:25:30
2022-06-08 17:18:37.00000000000000000000       0
0                                              0
2022-06-08 17:18:37                            2022/06/08 19:38
I want to delete the rows containing 0 values and convert the strings into datetimes of format '%Y-%m-%d %H:%M:%S'.
Here is my code:
import pandas as pd
from datetime import datetime as dt

def string_to_date(my_string):
    if '-' in my_string and '.' in my_string:
        return dt.strftime(dt.strptime(my_string[:26], '%Y-%m-%d %H:%M:%S.%f'), '%Y-%m-%d %H:%M:%S')
    elif '/' in my_string and '.' in my_string:
        return dt.strftime(dt.strptime(my_string[:26], '%Y/%m/%d %H:%M:%S.%f'), '%Y-%m-%d %H:%M:%S')
    elif '/' in my_string:
        return dt.strftime(dt.strptime(my_string[:26], '%Y/%m/%d %H:%M:%S'), '%Y-%m-%d %H:%M:%S')
    elif '-' in my_string:
        return dt.strftime(dt.strptime(my_string[:26], '%Y-%m-%d %H:%M:%S'), '%Y-%m-%d %H:%M:%S')
    else:
        return dt.strftime(dt.strptime(my_string[:26], '%Y-%m-%d %H:%M:%S.%f'), '%Y-%m-%d %H:%M:%S')

if __name__ == '__main__':
    df = pd.read_excel('data.xlsx')
    col = df.columns[0:]
    df = df.loc[~(df == '0').all(axis=1)]
    print(df)
    i = 0
    for n in col:
        df[col[i]] = df[col[i]].apply(string_to_date)
        i += 1
    print(df)

Letting pandas infer the format should get you started. You can parse to the datetime data type like this:
df['IN'] = pd.to_datetime(df['IN'], errors='coerce')
df['IN']
0 2022-06-10 10:20:30
1 2022-06-05 12:48:10
2 2022-06-09 08:25:30
3 2022-06-08 17:18:37
4 NaT
5 2022-06-08 17:18:37
Name: IN, dtype: datetime64[ns]
Note that setting the keyword errors='coerce' leaves NaT (not-a-time) for all elements that pandas cannot parse as a datetime, e.g. "0".
Now you can drop the rows that contain NaT, e.g.
df['OUT'] = pd.to_datetime(df['OUT'], errors='coerce')
df
IN OUT
0 2022-06-10 10:20:30 2022-06-17 13:25:30
1 2022-06-05 12:48:10 2022-06-11 10:15:00
2 2022-06-09 08:25:30 2022-06-13 10:25:30
3 2022-06-08 17:18:37 NaT
4 NaT NaT
5 2022-06-08 17:18:37 2022-06-08 19:38:00
df = df.dropna(axis=0, how='all')
df
IN OUT
0 2022-06-10 10:20:30 2022-06-17 13:25:30
1 2022-06-05 12:48:10 2022-06-11 10:15:00
2 2022-06-09 08:25:30 2022-06-13 10:25:30
3 2022-06-08 17:18:37 NaT
5 2022-06-08 17:18:37 2022-06-08 19:38:00
docs: pd.to_datetime, pd.DataFrame.dropna, related: parsing/formatting directives
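Putting the pieces together, here is a minimal end-to-end sketch. It reconstructs a simplified version of the sample frame instead of reading data.xlsx, and parses element-wise so the mixed /-separated and --separated formats don't trip up newer pandas versions that infer one format per column:

```python
import pandas as pd

# Simplified stand-in for df = pd.read_excel('data.xlsx')
df = pd.DataFrame({
    'IN':  ['2022/6/10 10:20:30', '2022/6/5 12:48:10.0', '2022-06-08 17:18:37', '0'],
    'OUT': ['2022/6/17 13:25:30', '2022/6/11 10:15', '2022/06/08 19:38', '0'],
})

for col in ['IN', 'OUT']:
    # parse per element; anything unparseable (e.g. '0') becomes NaT
    df[col] = df[col].apply(lambda x: pd.to_datetime(x, errors='coerce'))
    df[col] = pd.to_datetime(df[col])  # make sure the dtype is datetime64[ns]

df = df.dropna(how='any')  # drop every row that still contains NaT

# if plain strings in '%Y-%m-%d %H:%M:%S' are needed, format explicitly
formatted = df['IN'].dt.strftime('%Y-%m-%d %H:%M:%S')
```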

Related

Time intervals to evenly-spaced time series

I need to prepare data with time intervals for machine learning in such a way that I get equal spacing between timestamps. For example, for 3-hour spacing, I would like the following timestamps: 00:00, 03:00, 06:00, 09:00, 12:00, 15:00, ... For example:
df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
    df[col] = df[col].apply(pd.to_datetime)
print(df)
Output:
Start End Val
0 2022-07-01 11:30:00 2022-07-01 18:30:00 a
1 2022-07-01 22:30:00 2022-07-02 03:30:00 b
I try to get timestamps:
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
df['Datetime'] = df['Datetime'].dt.round('H')
print(df[['Datetime', 'Val']])
Output:
Datetime Val
0 2022-07-01 12:00:00 a
0 2022-07-01 14:00:00 a
0 2022-07-01 18:00:00 a
1 2022-07-01 22:00:00 b
1 2022-07-02 02:00:00 b
As you can see, those timestamps are not equally spaced. My expected result:
Datetime Val
4 2022-07-01 12:00:00 a
5 2022-07-01 15:00:00 a
6 2022-07-01 18:00:00 a
7 2022-07-01 21:00:00 NaN
8 2022-07-02 00:00:00 b
9 2022-07-02 03:00:00 b
We can use the function merge_asof:
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
date_min, date_max = df['Datetime'].dt.date.min(), df['Datetime'].dt.date.max() + pd.Timedelta('1D')
time_range = pd.date_range(date_min, date_max, freq='3H').to_series(name='Datetime')
df = pd.merge_asof(time_range, df, tolerance=pd.Timedelta('3H'))
df.truncate(df['Val'].first_valid_index(), df['Val'].last_valid_index())
Output:
Datetime Val
4 2022-07-01 12:00:00 a
5 2022-07-01 15:00:00 a
6 2022-07-01 18:00:00 a
7 2022-07-01 21:00:00 NaN
8 2022-07-02 00:00:00 b
9 2022-07-02 03:00:00 b
Annotated code
# Find min and max date of interval
s, e = df['Start'].min(), df['End'].max()
# Create a date range with freq=3H
# Create an output dataframe by assigning the date range to a datetime column
df_out = pd.DataFrame({'datetime': pd.date_range(s.ceil('H'), e, freq='3H')})
# Create interval index from start and end date
idx = pd.IntervalIndex.from_arrays(df['Start'], df['End'], closed='both')
# Set the index of df to interval index and select Val column to create mapping series
# Then use this mapping series to substitute values in output dataframe
df_out['Val'] = df_out['datetime'].map(df.set_index(idx)['Val'])
Result
datetime Val
0 2022-07-01 12:00:00 a
1 2022-07-01 15:00:00 a
2 2022-07-01 18:00:00 a
3 2022-07-01 21:00:00 NaN
4 2022-07-02 00:00:00 b
5 2022-07-02 03:00:00 b
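For completeness, the annotated approach runs end-to-end like this; the sketch only adds the imports and the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'],
                   'End':   ['2022-07-01 18:30', '2022-07-02 3:30'],
                   'Val':   ['a', 'b']})
for col in ['Start', 'End']:
    df[col] = pd.to_datetime(df[col])

# interval index: each row of df becomes a closed [Start, End] interval
idx = pd.IntervalIndex.from_arrays(df['Start'], df['End'], closed='both')

# regular 3-hourly grid covering the whole span
s, e = df['Start'].min(), df['End'].max()
df_out = pd.DataFrame({'datetime': pd.date_range(s.ceil('h'), e, freq='3h')})

# map each grid point to the Val of the interval containing it (NaN if none)
df_out['Val'] = df_out['datetime'].map(df.set_index(idx)['Val'])
```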
For this problem I really like to use pd.DataFrame.reindex. In particular, you can specify method='nearest' and tolerance='90m' to ensure you leave gaps where you need them.
You can create your regularly spaced time series using pd.date_range, building the start and end arguments with the .floor('3H') and .ceil('3H') methods, respectively.
import pandas as pd

df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
    df[col] = df[col].apply(pd.to_datetime)

df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)

result = pd.DataFrame()
for name, group in df.groupby('Val'):
    group = group.set_index('Datetime')
    group.index = group.index.ceil('1H')
    idx = pd.date_range(group.index.min().floor('3H'), group.index.max().ceil('3H'), freq='3H')
    group = group.reindex(idx, tolerance='90m', method='nearest')
    result = pd.concat([result, group])
result = result.sort_index()
which returns:
Val
2022-07-01 12:00:00 a
2022-07-01 15:00:00 a
2022-07-01 18:00:00 a
2022-07-01 21:00:00
2022-07-02 00:00:00 b
2022-07-02 03:00:00 b
Another method would be to simply add the step in hours to the start time in a loop.
import pandas as pd
from datetime import datetime, timedelta

# taking start time as current time just for example
start = datetime.now()
# taking end time as current time + 15 hours just for example
end = datetime.now() + timedelta(hours=15)
times = []
while end > start:
    start = start + timedelta(hours=3)
    print(start)
    times.append(start)
df = pd.DataFrame(columns=['Times'])
df['Times'] = times
Output
Times
0 2022-07-15 01:28:56.912013
1 2022-07-15 04:28:56.912013
2 2022-07-15 07:28:56.912013
3 2022-07-15 10:28:56.912013
4 2022-07-15 13:28:56.912013

How to simplify my resample DataFrame according to specific rules?

I have a dataframe with minute frequency; the times lie in [['09:30:00'-'11:30:00'], ['13:00:00'-'15:00:00']], and a demo is produced by generate_data().
I want to resample it according to elapsed hours, but each period of time is closed on both ends.
My approach with the resample() function:
1. Add a little time to the left edge.
2. Resample to 30T.
3. Add a small time to 10:00:00 and 11:00:00.
4. Separate the morning and afternoon data.
5. Resample the morning data to 30T again, and the afternoon data to H.
6. Combine the morning and afternoon data.
Can my resample() function be simplified?
My code is as follows:
import pandas as pd
import numpy as np
import datetime


def generate_data():
    datetime1 = pd.date_range('2021-12-01 09:30:00', '2021-12-01 15:00:00', freq='t')
    df = pd.DataFrame(datetime1, columns=['datetime'])
    df['time'] = df.datetime.dt.strftime('%H:%M:%S')
    df = df[(df.time <= '11:30:00') | (df.time >= '13:00:00')].reset_index(drop=True)
    np.random.seed(2021)
    df['close'] = np.random.random(len(df)) * 100
    return df


def resample(df):
    df.loc[df['time'].isin(['09:30:00', '13:00:00']), 'datetime'] = df.loc[
        df['time'].isin(['09:30:00', '13:00:00']), 'datetime'].apply(
        lambda x: x + datetime.timedelta(seconds=0.001))
    df = df.set_index('datetime')
    df = df.resample('30T', closed='right', label='right').last()
    df = df.dropna(subset=['close'])
    df = df.reset_index()
    df.loc[:, 'time'] = df.datetime.dt.strftime('%H:%M:%S')
    df.loc[df['time'].isin(['10:00:00', '11:00:00']), 'datetime'] = df.loc[
        df['time'].isin(['10:00:00', '11:00:00']), 'datetime'].apply(
        lambda x: x + datetime.timedelta(seconds=0.001))
    df = df.set_index('datetime')
    df1 = df[df.time < '12:00:00']
    df2 = df[df.time > '12:00:00']
    df1 = df1.resample('30T', closed='right', label='right').last()
    df2 = df2.resample('H', closed='right', label='right').last()
    df = pd.concat([df1, df2])
    df = df.dropna(subset=['close'])
    df = df.reset_index()
    return df


def main():
    df = generate_data()
    print('\nBefore resample:')
    print(df)
    df = resample(df)
    print('\nAfter resample:')
    print(df)


if __name__ == '__main__':
    main()
Before resample:
datetime time close
0 2021-12-01 09:30:00 09:30:00 60.597828
1 2021-12-01 09:31:00 09:31:00 73.336936
2 2021-12-01 09:32:00 09:32:00 13.894716
3 2021-12-01 09:33:00 09:33:00 31.267308
4 2021-12-01 09:34:00 09:34:00 99.724328
.. ... ... ...
237 2021-12-01 14:56:00 14:56:00 98.774245
238 2021-12-01 14:57:00 14:57:00 67.903063
239 2021-12-01 14:58:00 14:58:00 40.640360
240 2021-12-01 14:59:00 14:59:00 50.995722
241 2021-12-01 15:00:00 15:00:00 88.935107
[242 rows x 3 columns]
After resample:
datetime time close
0 2021-12-01 10:30:00 10:30:00 6.383650
1 2021-12-01 11:30:00 11:30:00 26.667989
2 2021-12-01 14:00:00 14:00:00 19.255257
3 2021-12-01 15:00:00 15:00:00 88.935107
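As a rough sketch of one possible simplification (my own, under the assumption that the desired bars are effectively hourly and right-closed, with the morning grid anchored on :30): the two tiny time shifts can be replaced by between_time plus a single resample with an offset. Because the aggregation is .last(), dropping the 09:30:00 and 13:00:00 edge samples does not change the result:

```python
import pandas as pd
import numpy as np

# same demo data as generate_data()
idx = pd.date_range('2021-12-01 09:30:00', '2021-12-01 15:00:00', freq='min')
df = pd.DataFrame({'datetime': idx})
df['time'] = df.datetime.dt.strftime('%H:%M:%S')
df = df[(df.time <= '11:30:00') | (df.time >= '13:00:00')].reset_index(drop=True)
np.random.seed(2021)
df['close'] = np.random.random(len(df)) * 100

close = df.set_index('datetime')['close']
res = pd.concat([
    # morning: hourly bars (09:30, 10:30] and (10:30, 11:30], labelled 10:30 / 11:30
    close.between_time('09:31', '11:30')
         .resample('60min', offset='30min', closed='right', label='right').last(),
    # afternoon: hourly bars (13:00, 14:00] and (14:00, 15:00]
    close.between_time('13:01', '15:00')
         .resample('60min', closed='right', label='right').last(),
])
```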

Check if datetime column is empty

Inside a function, I want to check whether a datetime column value is empty and, if so, do something.
My sample df:
date_dogovor date_pogash date_pogash_posle_prodl
0 2019-03-07 2020-03-06 NaT
1 2019-02-27 2020-02-05 NaT
2 2011-10-14 2016-10-13 2019-10-13
3 2019-03-28 2020-03-06 NaT
4 2019-04-17 2020-04-06 NaT
My function:
def term(date_contract, date_paymnt, date_paymnt_aftr_prlngtn):
    if date_paymnt_aftr_prlngtn is None:
        return date_paymnt - date_contract
    else:
        return date_paymnt_aftr_prlngtn - date_contract
Applying function to df:
df['term'] = df.apply(lambda x: term(x['date_dogovor'], x['date_pogash'], x['date_pogash_posle_prodl']), axis=1 )
Result is wrong:
df['term']
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
...
115337 NaT
115338 NaT
115339 2921 days
115340 NaT
115341 NaT
Name: term, Length: 115342, dtype: timedelta64[ns]
How to correctly check if datetime column is empty?
Better and faster here is numpy.where with Series.isna:
df['term'] = np.where(df['date_pogash_posle_prodl'].isna(),
                      df['date_pogash'] - df['date_dogovor'],
                      df['date_pogash_posle_prodl'] - df['date_dogovor'])
Alternatively, your function should test with pandas.isna (NaT is not None, so the is None check never fires):
def term(date_contract, date_paymnt, date_paymnt_aftr_prlngtn):
    if pd.isna(date_paymnt_aftr_prlngtn):
        return date_paymnt - date_contract
    else:
        return date_paymnt_aftr_prlngtn - date_contract
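A runnable sketch of the np.where version; the frame below is a hypothetical two-row reconstruction of the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date_dogovor':            pd.to_datetime(['2019-03-07', '2011-10-14']),
    'date_pogash':             pd.to_datetime(['2020-03-06', '2016-10-13']),
    'date_pogash_posle_prodl': pd.to_datetime([None, '2019-10-13']),
})

# where the prolongation date is missing, fall back to the original payment date
df['term'] = np.where(df['date_pogash_posle_prodl'].isna(),
                      df['date_pogash'] - df['date_dogovor'],
                      df['date_pogash_posle_prodl'] - df['date_dogovor'])
```

The second row reproduces the 2921 days seen in the question's output.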

Unable to convert a column to datetime

I have tried many suggestions from here, but none of them solved it.
I have two columns with observations like this: 15:08:19
If I write
df.time_entry.describe()
it appears:
count 814262
unique 56765
top 15:03:00
freq 103
Name: time_entry, dtype: object
I've already run this code:
df['time_entry'] = pd.to_datetime(df['time_entry'],format= '%H:%M:%S', errors='ignore' ).dt.time
But rerunning the describe code still returns dtype: object.
What is the purpose of dt.time?
Just remove dt.time and your conversion from object to datetime will work perfectly fine.
df['time_entry'] = pd.to_datetime(df['time_entry'],format= '%H:%M:%S')
The problem is that you are using the datetime accessor (.dt) with the property time, and then you are not able to subtract the two columns from each other. So, just leave out .dt.time and it should work.
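A tiny illustration of that point, with made-up times: once the column stays datetime64, subtraction works directly.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['15:08:19', '15:03:00']), format='%H:%M:%S')
# dtype is datetime64[ns]; pandas fills in a default date of 1900-01-01
diff = s.iloc[0] - s.iloc[1]
```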
Here is some data with 2 columns of strings
df = pd.DataFrame()
df['time_entry'] = ['12:01:00', '15:03:00', '16:43:00', '14:11:00']
df['time_entry2'] = ['13:03:00', '14:04:00', '19:23:00', '18:12:00']
print(df)
time_entry time_entry2
0 12:01:00 13:03:00
1 15:03:00 14:04:00
2 16:43:00 19:23:00
3 14:11:00 18:12:00
Convert both columns to datetime dtype
df['time_entry'] = pd.to_datetime(df['time_entry'], format= '%H:%M:%S', errors='ignore')
df['time_entry2'] = pd.to_datetime(df['time_entry2'], format= '%H:%M:%S', errors='ignore')
print(df)
time_entry time_entry2
0 1900-01-01 12:01:00 1900-01-01 13:03:00
1 1900-01-01 15:03:00 1900-01-01 14:04:00
2 1900-01-01 16:43:00 1900-01-01 19:23:00
3 1900-01-01 14:11:00 1900-01-01 18:12:00
print(df.dtypes)
time_entry datetime64[ns]
time_entry2 datetime64[ns]
dtype: object
(Optional) Specify timezone
df['time_entry'] = df['time_entry'].dt.tz_localize('US/Central')
df['time_entry2'] = df['time_entry2'].dt.tz_localize('US/Central')
Now perform the time difference (subtraction) between the 2 columns and get the time difference in number of days (as a float)
Method 1 gives Diff_days1
Method 2 gives Diff_days2
Method 3 gives Diff_days3
df['Diff_days1'] = (df['time_entry'] - df['time_entry2']).dt.total_seconds()/60/60/24
df['Diff_days2'] = (df['time_entry'] - df['time_entry2']) / np.timedelta64(1, 'D')
df['Diff_days3'] = (df['time_entry'].sub(df['time_entry2'])).dt.total_seconds()/60/60/24
print(df)
time_entry time_entry2 Diff_days1 Diff_days2 Diff_days3
0 1900-01-01 12:01:00 1900-01-01 13:03:00 -0.043056 -0.043056 -0.043056
1 1900-01-01 15:03:00 1900-01-01 14:04:00 0.040972 0.040972 0.040972
2 1900-01-01 16:43:00 1900-01-01 19:23:00 -0.111111 -0.111111 -0.111111
3 1900-01-01 14:11:00 1900-01-01 18:12:00 -0.167361 -0.167361 -0.167361
EDIT
If you're trying to access datetime attributes, then you can do so by using the time_entry column directly (not the time difference column). Here's an example
df['day1'] = df['time_entry'].dt.day
df['time1'] = df['time_entry'].dt.time
df['minute1'] = df['time_entry'].dt.minute
df['dayofweek1'] = df['time_entry'].dt.weekday
df['day2'] = df['time_entry2'].dt.day
df['time2'] = df['time_entry2'].dt.time
df['minute2'] = df['time_entry2'].dt.minute
df['dayofweek2'] = df['time_entry2'].dt.weekday
print(df[['day1', 'time1', 'minute1', 'dayofweek1',
'day2', 'time2', 'minute2', 'dayofweek2']])
day1 time1 minute1 dayofweek1 day2 time2 minute2 dayofweek2
0 1 12:01:00 1 0 1 13:03:00 3 0
1 1 15:03:00 3 0 1 14:04:00 4 0
2 1 16:43:00 43 0 1 19:23:00 23 0
3 1 14:11:00 11 0 1 18:12:00 12 0

Using pandas to perform time delta from 2 "hh:mm:ss XX" columns in Microsoft Excel

I have an Excel file with a column named StartTime holding hh:mm:ss XX data, where the cells are in 'h:mm:ss AM/PM' custom format. For example,
ID StartTime
1 12:00:00 PM
2 1:00:00 PM
3 2:00:00 PM
I used the following code to read the file
df = pd.read_excel('./mydata.xls',
                   sheet_name='Sheet1',
                   converters={'StartTime': str})
df shows
ID StartTime
1 12:00:00
2 1:00:00
3 2:00:00
Is it a bug or how do you overcome this? Thanks.
[Update: 7-Dec-2018]
I guess I may have made changes to the Excel file that made it weird. I created another Excel file and present it here (I could not attach an Excel file here, and it would not be safe anyway):
I created the following code to test:
import pandas as pd

df = pd.read_excel('./Book1.xlsx',
                   sheet_name='Sheet1',
                   converters={'StartTime': str,
                               'EndTime': str})
df['Hours1'] = pd.NaT
df['Hours2'] = pd.NaT
print(df, '\n')
df.loc[~df.StartTime.isnull() & ~df.EndTime.isnull(),
       'Hours1'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
df['Hours2'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
print(df)
The outputs are
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 NaT NaT
1 1 12:00:00 13:00:00 NaT NaT
2 2 13:00:00 14:00:00 NaT NaT
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 3600000000000 01:00:00
1 1 12:00:00 13:00:00 3600000000000 01:00:00
2 2 13:00:00 14:00:00 3600000000000 01:00:00
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
Now the question has become: "Using pandas to perform time delta from 2 "hh:mm:ss XX" columns in Microsoft Excel". I have changed the title of the question too. Thank you for those who replied and tried it out.
The question is:
How can I represent the time value in hours instead of microseconds?
It seems that the StartTime column is formatted as text in your file.
Have you tried reading it with parse_dates along with a parser function specified via the date_parser parameter? It should work similarly to read_csv(), although the docs don't list the above options explicitly despite them being available.
Like so:
pd.read_excel(r'./mydata.xls',
              parse_dates=['StartTime'],
              date_parser=lambda x: pd.datetime.strptime(x, '%I:%M:%S %p').time())
Given the update:
df = pd.read_excel(r'./mydata.xls', parse_dates=['StartTime', 'EndTime'])
(df['EndTime'] - df['StartTime']).dt.seconds//3600
alternatively
# '//' is available since pandas v0.23.4, otherwise use '/' and round
(df['EndTime'] - df['StartTime'])//pd.Timedelta(1, 'h')
both resulting in the same
0 1
1 1
2 1
dtype: int64
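Without the Excel file at hand, the same whole-hour arithmetic can be sketched on the 'h:mm:ss AM/PM' strings directly (the EndTime values below are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'StartTime': ['12:00:00 PM', '1:00:00 PM', '2:00:00 PM'],
                   'EndTime':   ['1:00:00 PM', '2:30:00 PM', '4:00:00 PM']})
for col in ['StartTime', 'EndTime']:
    # %I = 12-hour clock, %p = AM/PM marker
    df[col] = pd.to_datetime(df[col], format='%I:%M:%S %p')

# whole hours between the two columns
df['Hours'] = (df['EndTime'] - df['StartTime']) // pd.Timedelta(1, 'h')
```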
