Python / Pandas convert object with date and time into separate columns - python

I have two columns called Start_Time and End_Time; each contains dates in this format: "dd/mm/yyyy hh:mm".
I am trying to extract and clean up the info so that Start_Time and End_Time show the time only, and new columns Start_Date and End_Date show the date only.
I have seen 101 examples of this online and by all accounts the following should work:
df['Start_Date'] = pd.to_datetime(df['Start_Time']).dt.date
df['Start_Time'] = pd.to_datetime(df['Start_Time']).dt.time
df['End_Date'] = pd.to_datetime(df['End_Time']).dt.date
df['End_Time'] = pd.to_datetime(df['End_Time']).dt.time
I am getting the following error however:
"TypeError: <class 'datetime.time'> is not convertible to datetime"
Start_Time and End_Time are currently objects. I have tried converting them to type datetime but also run into errors.
Can anyone tell me what I am doing wrong? Thank you!

Use DataFrame.assign to avoid overwriting the original columns before all the new values are computed:
df = pd.DataFrame({'Start_Time': ['02/03/2022 15:15', '05/03/2022 15:15'],
                   'End_Time': ['12/03/2022 20:15', '07/04/2022 20:15']})
df = df.assign(Start_Date=pd.to_datetime(df['Start_Time'], dayfirst=True).dt.date,
               Start_Time=pd.to_datetime(df['Start_Time'], dayfirst=True).dt.time,
               End_Date=pd.to_datetime(df['End_Time'], dayfirst=True).dt.date,
               End_Time=pd.to_datetime(df['End_Time'], dayfirst=True).dt.time)
print(df)
Start_Time End_Time Start_Date End_Date
0 15:15:00 20:15:00 2022-03-02 2022-03-12
1 15:15:00 20:15:00 2022-03-05 2022-04-07
Or use helper variables:
df = pd.DataFrame({'Start_Time':['02/03/2022 15:15','05/03/2022 15:15'],
'End_Time':['12/03/2022 20:15','07/04/2022 20:15']})
s1 = pd.to_datetime(df['Start_Time'], dayfirst=True)
s2 = pd.to_datetime(df['End_Time'], dayfirst=True)
df['Start_Date'] = s1.dt.date
df['End_Date'] = s2.dt.date
df['Start_Time'] = s1.dt.time
df['End_Time'] = s2.dt.time
print (df)
Start_Time End_Time Start_Date End_Date
0 15:15:00 20:15:00 2022-03-02 2022-03-12
1 15:15:00 20:15:00 2022-03-05 2022-04-07
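The error message itself hints at a likely cause: pd.to_datetime was applied to a column that already held datetime.time objects, which can happen if the conversion code is run a second time on the same DataFrame. As a minimal sketch (the explicit format string is an assumption based on the "dd/mm/yyyy hh:mm" layout described in the question), each column can be parsed once and both parts derived from the parsed values:

```python
import pandas as pd

df = pd.DataFrame({
    "Start_Time": ["02/03/2022 15:15", "05/03/2022 15:15"],
    "End_Time": ["12/03/2022 20:15", "07/04/2022 20:15"],
})

# Assumed layout of the raw strings: day/month/year hour:minute
FMT = "%d/%m/%Y %H:%M"

for prefix in ("Start", "End"):
    # Parse the original strings once, before the column is overwritten
    parsed = pd.to_datetime(df[f"{prefix}_Time"], format=FMT)
    df[f"{prefix}_Date"] = parsed.dt.date
    df[f"{prefix}_Time"] = parsed.dt.time

print(df)
```

Running the loop a second time on the same df would be expected to fail again, because Start_Time and End_Time then hold time objects rather than the original strings.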

Related

Get range of dates between specified start and end date from csv using python

I have a CSV file with StartDate and EndDate columns, for example 01-02-2020 00:00:00 and 01-03-2020 00:00:00.
I want a Python program that finds the dates in between and appends them as new rows, like
So instead of the dots, it should increment StartDate and keep EndDate as it is.
import pandas as pd
df = pd.read_csv('MyData.csv')
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
df['Dates'] = [pd.date_range(x, y) for x , y in zip(df['StartDate'],df['EndDate'])]
df = df.explode('Dates')
df
So for example, if I have StartDate as 01-02-2020 00:00:00 and EndDate as 05-02-2020 00:00:00
As a result I should get
All the resulting datetimes should be in the same format as StartDate and EndDate in MyData.csv.
Only StartDate should change; the rest should stay the same.
I tried doing it with a date range, but I am not getting any result. Can anyone please help me with this?
Thanks
My two cents: a very simple solution based only on functions from pandas:
import pandas as pd
# Format of the dates in 'MyData.csv'
DT_FMT = '%m-%d-%Y %H:%M:%S'
df = pd.read_csv('MyData.csv')
# Parse dates with the provided format
for c in ('StartDate', 'EndDate'):
    df[c] = pd.to_datetime(df[c], format=DT_FMT)
# Create the DataFrame with the ranges of dates
date_df = pd.DataFrame(
    data=[[d] + list(row[1:])
          for row in df.itertuples(index=False, name=None)
          for d in pd.date_range(row[0], row[1])],
    columns=df.columns.copy()
)
# Convert dates to strings in the same format of 'MyData.csv'
for c in ('StartDate', 'EndDate'):
    date_df[c] = date_df[c].dt.strftime(DT_FMT)
If df is:
StartDate EndDate A B C
0 2020-01-02 2020-01-06 ME ME ME
1 2021-05-15 2021-05-18 KI KI KI
then date_df will be:
StartDate EndDate A B C
0 01-02-2020 00:00:00 01-06-2020 00:00:00 ME ME ME
1 01-03-2020 00:00:00 01-06-2020 00:00:00 ME ME ME
2 01-04-2020 00:00:00 01-06-2020 00:00:00 ME ME ME
3 01-05-2020 00:00:00 01-06-2020 00:00:00 ME ME ME
4 01-06-2020 00:00:00 01-06-2020 00:00:00 ME ME ME
5 05-15-2021 00:00:00 05-18-2021 00:00:00 KI KI KI
6 05-16-2021 00:00:00 05-18-2021 00:00:00 KI KI KI
7 05-17-2021 00:00:00 05-18-2021 00:00:00 KI KI KI
8 05-18-2021 00:00:00 05-18-2021 00:00:00 KI KI KI
Then you can save back the result to a CSV file with the to_csv method.
Does something like this achieve what you want?
from datetime import datetime, timedelta
FMT = "%d-%m-%Y %H:%M:%S"
date_lists = []
for base, end in zip(df['StartDate'], df['EndDate']):
    d1 = datetime.strptime(base, FMT)
    d2 = datetime.strptime(end, FMT)
    numdays = (d2 - d1).days
    # All dates from start to end (inclusive) for this row
    date_lists.append([d1 + timedelta(days=x) for x in range(numdays + 1)])
# One list of dates per row, so the length matches the DataFrame
df['Dates'] = date_lists
Actually the code you provided is working for me. I guess the only thing you need to change is the date formatting in reading and writing operations to make sure that is consistent with your requirements. In particular, you should leverage the dayfirst argument when reading and date_format when writing the output file. A toy example below:
Toy data
StartDate            EndDate              A   B   C
01-02-2020 00:00:00  06-02-2020 00:00:00  ME  ME  ME
01-04-2020 00:00:00  04-04-2020 00:00:00  PE  PE  PE
Sample code
import pandas as pd
s_dates = ['01-02-2020', '01-03-2020']
e_dates = ['01-04-2020', '01-05-2020']
df = pd.read_csv('dataSO.csv', parse_dates=[0,1], dayfirst=True)
cols = df.columns
df['Dates'] = [pd.date_range(x, y) for x , y in zip(df['StartDate'],df['EndDate'])]
df1 = df.explode('Dates')[cols]
df1.to_csv('resSO.csv', date_format="%d-%m-%Y %H:%M:%S", index=False)
And the output is what you described, except that StartDate is also in datetime format. Does this answer your question?
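To get rows where only StartDate advances, the exploded dates can replace StartDate directly instead of going into a separate Dates column. The following is a sketch under stated assumptions rather than a definitive solution: the day-first format string and the small inline CSV (a stand-in for MyData.csv) are illustrative.

```python
import io
import pandas as pd

# Assumption: the file stores dates day-first, e.g. 01-02-2020 is 1 February 2020
DT_FMT = "%d-%m-%Y %H:%M:%S"

# Stand-in for MyData.csv so the sketch is self-contained
csv_text = """StartDate,EndDate,A
01-02-2020 00:00:00,03-02-2020 00:00:00,ME
30-03-2020 00:00:00,01-04-2020 00:00:00,PE
"""
df = pd.read_csv(io.StringIO(csv_text))
for col in ("StartDate", "EndDate"):
    df[col] = pd.to_datetime(df[col], format=DT_FMT)

# One list of daily timestamps per row, from StartDate to EndDate inclusive
ranges = pd.Series(
    [list(pd.date_range(s, e)) for s, e in zip(df["StartDate"], df["EndDate"])],
    index=df.index,
    dtype=object,
)

# Replace StartDate with the list, then give each date its own row
out = df.assign(StartDate=ranges).explode("StartDate", ignore_index=True)
out["StartDate"] = pd.to_datetime(out["StartDate"])

# Write the dates back in the same textual format as the input
for col in ("StartDate", "EndDate"):
    out[col] = out[col].dt.strftime(DT_FMT)

print(out)
```

Because the dates are converted back to strings at the end, out.to_csv('result.csv', index=False) would write them in the input format without needing date_format.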

How to get rid of MonthEnds type

I am trying to get the delta in months between a starting date and an ending date within Pandas DataFrame. The result is not totally satisfying...
First, the outcome is some sort of Datetime type in the form of <[value] * MonthEnds>. I can't use this to calculate with. First question is how to convert this to an integer. I tried the .n attribute but then I get the following error:
AttributeError: 'Series' object has no attribute 'n'
Second, the outcome is 'missing' one month. Can this be avoided by using another solution/method? Or should I just add 1 month to the answer?
To support my questions I created some simplified code:
dates = [{'Start':'1-1-2020', 'End':'31-10-2020'}, {'Start':'1-2-2020', 'End':'30-11-2020'}]
df = pd.DataFrame(dates)
df['Start'] = pd.to_datetime(df['Start'], dayfirst=True)
df['End'] = pd.to_datetime(df['End'], dayfirst=True)
df['Duration'] = (df['End'].dt.to_period('M') - df['Start'].dt.to_period('M'))
df
This results in:
Start End Duration
0 2020-01-01 2020-10-31 <9 * MonthEnds>
1 2020-02-01 2020-11-30 <9 * MonthEnds>
The preferred result would be:
Start End Duration
0 2020-01-01 2020-10-31 10
1 2020-02-01 2020-11-30 10
Subtract the start-date from the end-date and convert the time delta to months.
import pandas as pd
dates = [{'Start':'1-1-2020', 'End':'31-10-2020'}, {'Start':'1-2-2020', 'End':'30-11-2020'}]
df = pd.DataFrame(dates)
df['Start'] = pd.to_datetime(df['Start'], dayfirst=True)
df['End'] = pd.to_datetime(df['End'], dayfirst=True)
df['Duration'] = (df['End']-df['Start']).astype('<m8[M]').astype(int)+1
print(df)
Output:
Start End Duration
0 2020-01-01 2020-10-31 10
1 2020-02-01 2020-11-30 10
Try This
dates = [{'Start':'1-1-2020', 'End':'31-10-2020'}, {'Start':'1-2-2020', 'End':'30-11-2020'}]
df = pd.DataFrame(dates)
df['Start'] = pd.to_datetime(df['Start'], dayfirst=True)
df['End'] = pd.to_datetime(df['End'], dayfirst=True)
df['Duration'] = (df['End'] - df['Start']).apply(lambda x:x.days//30)
print(df)
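Both answers above approximate months from a day count, and casting a timedelta to month resolution with '<m8[M]' may not be supported on newer pandas versions. As an alternative sketch that stays in calendar months, the count can be computed from the year and month components directly. The + 1 is an assumption that both the start and end months should be counted, as in the preferred result.

```python
import pandas as pd

dates = [{'Start': '1-1-2020', 'End': '31-10-2020'},
         {'Start': '1-2-2020', 'End': '30-11-2020'}]
df = pd.DataFrame(dates)
df['Start'] = pd.to_datetime(df['Start'], format='%d-%m-%Y')
df['End'] = pd.to_datetime(df['End'], format='%d-%m-%Y')

# Whole calendar months between the two dates, as plain integers
month_diff = ((df['End'].dt.year - df['Start'].dt.year) * 12
              + (df['End'].dt.month - df['Start'].dt.month))

# Assumption: count the start month as well, so Jan..Oct gives 10
df['Duration'] = month_diff + 1
print(df)
```

The result is an ordinary integer column, so it can be used in further arithmetic.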

Cannot compare type 'Timestamp' with type 'str' Pandas Python

I have two dataframes with datetime:
df["datetime"] = df[["date","time"]].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df["datetime"] = pd.to_datetime(df["datetime"], format='%Y-%m-%d %H:%M:%S')
and for the other:
df_labels.columns = ["start_date","start_time","end_date","end_time","mode"]
df_labels["start_datetime"] = df_labels[["start_date","start_time"]].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df_labels["end_datetime"] = df_labels[["end_date","end_time"]].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df_labels["start_datetime"] = df_labels["start_datetime"].str.replace("/","-")
df_labels["end_datetime"] = df_labels["end_datetime"].str.replace("/","-")
df_labels["start_datetime"] = pd.to_datetime(df_labels["start_datetime"], format='%Y-%m-%d %H:%M:%S')
df_labels["end_datetime"] = pd.to_datetime(df_labels["end_datetime"], format='%Y-%m-%d %H:%M:%S')
all of the above code ran successfully.
Example of df:
lat long u1 alt d date time datetime mode
0 39.921712 116.472343 0 13 39298.146204 2007-08-04 03:30:32 2007-08-04 03:30:32
1 39.921705 116.472343 0 13 39298.146215 2007-08-04 03:30:33 2007-08-04 03:30:33
2 39.921695 116.472345 0 13 39298.146227 2007-08-04 03:30:34 2007-08-04 03:30:34
3 39.921683 116.472342 0 13 39298.146238 2007-08-04 03:30:35 2007-08-04 03:30:35
4 39.921672 116.472342 0 13 39298.146250 2007-08-04 03:30:36 2007-08-04 03:30:36
Example of df_labels:
start_date start_time end_date end_time mode start_datetime end_datetime
0 2007/06/26 11:32:29 2007/06/26 11:40:29 bus 2007-06-26 11:32:29 2007-06-26 11:40:29
1 2008/03/28 14:52:54 2008/03/28 15:59:59 train 2008-03-28 14:52:54 2008-03-28 15:59:59
2 2008/03/28 16:00:00 2008/03/28 22:02:00 train 2008-03-28 16:00:00 2008-03-28 22:02:00
3 2008/03/29 01:27:50 2008/03/29 15:59:59 train 2008-03-29 01:27:50 2008-03-29 15:59:59
4 2008/03/29 16:00:00 2008/03/30 15:59:59 train 2008-03-29 16:00:00 2008-03-30 15:59:59
However, when I run this:
for index, row in df_labels.iterrows():
    df.loc[(df["datetime"] >= row["start_datetime"]) & (df["datetime"] < row["end_datetime"])] = row["mode"]
I get the following error:
TypeError: Cannot compare type 'Timestamp' with type 'str'
Please advise
Assuming the datetime values are in dd/mm/yy hh:mm:ss format:
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%m/%y %H:%M:%S')
df_labels["start_datetime"] = pd.to_datetime(df_labels["start_datetime"], format='%d/%m/%y %H:%M:%S')
df_labels["end_datetime"] = pd.to_datetime(df_labels["end_datetime"], format='%d/%m/%y %H:%M:%S')
Check the dtypes:
df.dtypes
df_labels.dtypes
The datetime columns should show datetime64[ns] when properly converted.
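If the dtypes are correct before the loop but the error still appears, a likely cause is the assignment in the loop itself: df.loc[mask] = value with no column selector writes the value into every column of the matching rows, including datetime, so the next comparison sees strings. A minimal sketch of the loop with the target column named explicitly (the tiny frames below are made-up sample data, not the asker's):

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2007-06-26 11:35:00",
        "2008-03-28 15:00:00",
        "2009-01-01 00:00:00",
    ]),
    "mode": [None, None, None],
})
df_labels = pd.DataFrame({
    "start_datetime": pd.to_datetime(["2007-06-26 11:32:29", "2008-03-28 14:52:54"]),
    "end_datetime": pd.to_datetime(["2007-06-26 11:40:29", "2008-03-28 15:59:59"]),
    "mode": ["bus", "train"],
})

for _, row in df_labels.iterrows():
    in_window = (df["datetime"] >= row["start_datetime"]) & (df["datetime"] < row["end_datetime"])
    # Name the column so only "mode" is written, not the whole row
    df.loc[in_window, "mode"] = row["mode"]

print(df)
```

With the column named, the datetime column keeps its datetime dtype across iterations, and rows outside every window keep an empty mode.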
Additional (efficiency):
import sqlite3
import pandas as pd
conn = sqlite3.connect(':memory:')
# Write both tables to the in-memory database
df.to_sql('df', conn, index=False)
df_labels.to_sql('df_label', conn, index=False)
qry = '''
select *
from df
inner join
(select mode df_label_mode, start_datetime, end_datetime from df_label) df_label
on (df.datetime between df_label.start_datetime and df_label.end_datetime)
'''
df_x = pd.read_sql_query(qry, conn)
df_x.head()
Reference: Converting date column

Why does pd.date_range return an empty sequence?

pd.date_range is not accepting string variables for start and end date.
I am getting my start and end date as variables from another dataframe:
start_date = yoy_traffic_df['dt'].iloc[0]
end_date = yoy_traffic_df['dt'].iloc[-1]
print(yoy_traffic_df['dt'].iloc[[0, -1]].to_dict())
{0: '2018-09-14', 5567: '2018-03-28'}
The start_date and end_date are type string:
print(type(start_date),type(end_date))
<class 'str'> <class 'str'>
print(end_date,start_date)
2018-09-14 2018-03-28
dates = pd.Series(pd.date_range(start=start_date, end=end_date, freq='D'))
Series([], dtype: datetime64[ns])
If I set the variables to specific dates, pd.date_range gives the wanted output:
start_date = '2018-03-28'
end_date = '2018-09-14'
d = pd.Series(pd.date_range(start=start_date, end=end_date, freq='D'))
d.head()
0 2018-03-28
1 2018-03-29
2 2018-03-30
3 2018-03-31
4 2018-04-01
dtype: datetime64[ns]
The expected output is a series.head() like
0 2018-03-28
1 2018-03-29
2 2018-03-30
3 2018-03-31
4 2018-04-01
It appears end_date is earlier than start_date.
start_date = yoy_traffic_df['dt'].iloc[0]
end_date = yoy_traffic_df['dt'].iloc[-1]
start_date < end_date
# False
So the date_range turns up empty:
pd.date_range(start_date, end_date)
# DatetimeIndex([], dtype='datetime64[ns]', freq='D')
This is similar to how Python's range works (list(range(5, 1)) returns an empty list). The simplest fix is to take the min and max instead:
start_date, end_date = yoy_traffic_df['dt'].min(), yoy_traffic_df['dt'].max()
If you specifically want the first and last values (not necessarily the min and max), swap them when they are out of order:
if start_date > end_date:
    start_date, end_date = end_date, start_date
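The behaviour can be reproduced with a small stand-in for yoy_traffic_df['dt'] (the three values below are illustrative). The sketch also converts the strings to datetimes before taking min and max, which is an extra step not in the answer above:

```python
import pandas as pd

# Stand-in for yoy_traffic_df['dt']: ISO date strings, newest first
dt = pd.Series(["2018-09-14", "2018-06-01", "2018-03-28"])

# First and last values are in descending order, so the range is empty
start_date, end_date = dt.iloc[0], dt.iloc[-1]
empty = pd.date_range(start=start_date, end=end_date, freq="D")
print(len(empty))

# Taking min and max of the parsed dates guarantees start <= end
parsed = pd.to_datetime(dt)
dates = pd.Series(pd.date_range(start=parsed.min(), end=parsed.max(), freq="D"))
print(dates.head())
```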

How to find the difference between two formatted dates in days?

I have a pandas DataFrame with the following content:
df =
start end
01/April 02/May
12/April 12/April
I need to add a column with the difference (in days) between end and start values (end - start).
How can I do it?
I tried the following:
import pandas as pd
df.startdate = pd.datetime(df.start, format='%B/%d')
df.enddate = pd.datetime(df.end, format='%B/%d')
But not sure if this is a right direction.
import pandas as pd
df = pd.DataFrame({"start":["01/April", "12/April"], "end": ["02/May", "12/April"]})
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
df["diff"] = (df["end"] - df["start"])
Output:
end start diff
0 2018-05-02 2018-04-01 31 days
1 2018-04-12 2018-04-12 0 days
This is one way.
df['start'] = pd.to_datetime(df['start']+'/2018', format='%d/%B/%Y')
df['end'] = pd.to_datetime(df['end']+'/2018', format='%d/%B/%Y')
df['diff'] = df['end'] - df['start']
# start end diff
# 0 2018-04-01 2018-05-02 31 days
# 1 2018-04-12 2018-04-12 0 days
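Both answers leave diff as a timedelta column. If a plain integer number of days is wanted, the .dt.days accessor can be applied to the difference. This is a small sketch: the year 2018 is an assumed placeholder, since the data has no year, and the choice matters whenever a range spans the end of February.

```python
import pandas as pd

df = pd.DataFrame({"start": ["01/April", "12/April"],
                   "end": ["02/May", "12/April"]})

# Assumption: every date falls in the same, arbitrarily chosen year
YEAR = "2018"
start = pd.to_datetime(df["start"] + "/" + YEAR, format="%d/%B/%Y")
end = pd.to_datetime(df["end"] + "/" + YEAR, format="%d/%B/%Y")

# .dt.days turns the timedelta into an integer day count
df["diff_days"] = (end - start).dt.days
print(df)
```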
