convert a pandas column of string dates to compare with datetime.date - python

I have a column of string values in pandas as follows:
2022-07-01 00:00:00+00:00
I want to compare it to a couple of dates as follows:
month_start_date = datetime.date(start_year, start_month, 1)
month_end_date = datetime.date(start_year, start_month, calendar.monthrange(start_year, start_month)[1])
df = df[(df[date] >= month_start_date) and (df[date] <= month_end_date)]
How do I convert the string values to datetime.date?
I have tried pd.to_datetime(df['date']), which says it can't compare datetime to date.
I tried pd.to_datetime(df['date']).dt.date, which says .dt can only be used with datetimelike values ("did you mean 'at'?").
I also tried to normalize it, but that brings more errors about timezones and tz-aware versus tz-naive values.
I also tried .astype('datetime64[ns]').
None of it is working.
UPDATE
Turns out none of the above are working because half the data is in this format: 2022-07-01 00:00:00+00:00
And the rest is in this format: 2022-07-01
Here is how I am getting around this issue:
for index, row in df_uscis.iterrows():
    df_uscis.loc[index, 'date'] = datetime.datetime.strptime(row['date'].split(' ')[0], "%Y-%m-%d").date()
Is there a simpler and faster way of doing this? I tried to make a new column with only the date values, but I'm not sure how to do that.

From your update, if you only need to turn the values from string to date objects, you can try:
df['date'] = pd.to_datetime(df['date'].str.split(' ').str[0])
df['date'] = df['date'].dt.date
Also, try to avoid iterrows: it is really slow, and there is usually a better way to achieve what you're trying to accomplish. If you really need to iterate through a DataFrame, try the df.itertuples() method.
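To tie the conversion back to the month filter from the question, here is a minimal sketch. The sample rows and the values of start_year and start_month are made up for illustration. It also assumes the two masks should be combined with & rather than and, since and raises an error on a pandas Series.

```python
import calendar
import datetime

import pandas as pd

# Hypothetical sample data mixing the two formats described in the question
df = pd.DataFrame({
    "date": [
        "2022-07-01 00:00:00+00:00",
        "2022-07-15",
        "2022-08-02 00:00:00+00:00",
        "2022-06-30",
    ]
})

# Keep only the YYYY-MM-DD part, parse it, then reduce to datetime.date objects
df["date"] = pd.to_datetime(
    df["date"].str.split(" ").str[0], format="%Y-%m-%d"
).dt.date

# Illustrative values; the question computes these elsewhere
start_year, start_month = 2022, 7
month_start_date = datetime.date(start_year, start_month, 1)
month_end_date = datetime.date(
    start_year, start_month, calendar.monthrange(start_year, start_month)[1]
)

# Element-wise comparison; the two boolean masks are combined with &
mask = (df["date"] >= month_start_date) & (df["date"] <= month_end_date)
filtered = df[mask]
print(filtered)
```

Both conditions are evaluated on the whole column at once, so no row-by-row loop is needed.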

Related

Convert object-type hours:minutes:seconds column to datetime type in Pandas

I have a column called Time in a dataframe that looks like this:
599359 12:32:25
326816 17:55:22
326815 17:55:22
358789 12:48:25
361553 12:06:45
...
814512 21:22:07
268266 18:57:31
659699 14:28:20
659698 14:28:20
268179 17:48:53
Name: Time, Length: 546967, dtype: object
And right now it is an object dtype. I've tried the following to convert it to a datetime:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc = True).dt.time
I understand that .dt.time is needed to prevent the year and month from being added, but I believe this is causing the dtype to revert to object.
Any workarounds? I know I could do
df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc = True)
but I have over 500,000 rows and this is taking forever.
When you do df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc = True).dt.time, you're converting the 'Time' column to the pandas dtype object, and the objects it holds are Python datetime.time instances.
The pandas datetime dtype (datetime64[ns]) is a different type from Python's datetime.datetime objects, and it does not support time-only values (i.e. you can't have pandas treat the column as a datetime without providing a date). This is why the dtype changes to object.
In the case of your second approach, df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc = True), something slightly different is happening. Here you're applying pd.to_datetime to each scalar element of the 'Time' series. Take a look at the return types of the function in the docs, but basically the time values in your df are being converted to pandas Timestamp objects on the 1st of January 1900 (i.e. a default date is added).
So: pandas is behaving correctly. If you only want the times, then it's okay to keep datetime.time objects in the column, but to operate on them you'll probably be relying on many [slow] df.apply calls. Alternatively, just keep the default date of 1900-01-01; then you can add and subtract the datetime columns and get the speed advantage of pandas. Strip off the date when you're done with it.
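A minimal sketch of the "keep the default date, strip it at the end" approach follows. The three sample times are taken from the question; the 30-minute shift and the Shifted and ShiftedTime column names are only illustrations.

```python
import pandas as pd

# A few of the times shown in the question
df = pd.DataFrame({"Time": ["12:32:25", "17:55:22", "12:48:25"]})

# Parsing with a time-only format fills in the default date 1900-01-01
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M:%S", errors="coerce")
print(df["Time"].dtype)  # a datetime64 dtype, not object

# Vectorized arithmetic works while the column stays datetime64
df["Shifted"] = df["Time"] + pd.Timedelta(minutes=30)

# Strip the date only at the end; this column becomes object dtype
df["ShiftedTime"] = df["Shifted"].dt.time
print(df)
```

A single vectorized pd.to_datetime call over the whole column should also be much faster than applying it element by element.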

trying to subtract two datetimes

Ok so I am trying to subtract the next time from the previous time in a dataframe column called local_time as indicated by this code. I have also tried this using list comprehension.
next_df = df.shift(-1)
def time_between(df):
    return datetime.combine(date.today(), next_df['Local Time']) - datetime.combine(date.today(), df['Local Time'])
df['time_diff'] = df.apply(time_between, axis=1)
However, I receive this error when trying to subtract:
return datetime.combine(date.today(), next_df['Local Time']) - datetime.combine(date.today(), df['Local Time'])
TypeError: combine() argument 2 must be datetime.time, not Series
You might check whether pd.DataFrame.diff works with datetime data. Assuming your data types are correct, date arithmetic should work fine.
Otherwise, you need to do vectorized calculations, using whole arrays as the operands in your arithmetic. Also use the dt accessors native to pandas, like pandas.Series.dt.date.
Instead of date.today(), you can use df['today'] = date.today() and df['today'].dt.date + df['Local Time'].dt.time. I'm 90% sure that will yield a datetime column. If so, you could then use df.diff() pretty easily.
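One vectorized alternative is to turn the times into timedeltas and subtract the shifted column. This is a sketch under assumptions: the error message suggests 'Local Time' holds datetime.time objects, the sample values are invented, and this is a swapped-in technique rather than the exact approach described above.

```python
import datetime

import pandas as pd

# Hypothetical sample: a column of datetime.time objects
df = pd.DataFrame({
    "Local Time": [
        datetime.time(9, 0, 0),
        datetime.time(9, 15, 30),
        datetime.time(10, 0, 0),
    ]
})

# time objects -> "HH:MM:SS" strings -> timedeltas since midnight
as_delta = pd.to_timedelta(df["Local Time"].astype(str))

# Next row minus current row; the last row has no successor, so it is NaT
df["time_diff"] = as_delta.shift(-1) - as_delta
print(df)
```

Because the subtraction runs on whole columns, there is no need for df.apply or datetime.combine.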

Separating Date and Time in Pandas

I have a data file with timestamps that look like this:
It gets loaded into pandas with a column name of "Time". I am trying to create two new datetime64 type columns, one with the date and one with the time (hour). I have explored a few solutions to this problem on StackOverflow but am still having issues. Quick note, I need the final columns to not be objects so I can use pandas and numpy functionality.
I load the dataframe and create two new columns like so:
df = pd.read_csv('C:\\Users\\...\\xyz.csv')
df['Date'] = pd.to_datetime(df['Time']).dt.date
df['Hour'] = pd.to_datetime(df['Time']).dt.time
This works but the Date and Hour columns are now objects.
I run this to convert the date to my desired datetime64 data type and it works:
df['Date'] = pd.to_datetime(df['Date'])
However, when I try to use this same code on the Hour column, I get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
I did some digging and found the following code which runs:
df['Hour'] = pd.to_datetime(df['Hour'], format='%H:%M:%S')
However the actual output includes a generic date like so:
When I try to run code referencing the Hour column like so:
HourVarb = '15:00:00'
df['Test'] = np.where(df['Hour']==HourVarb,1,np.nan)
It runs but doesn't produce the result I want.
Perhaps my HourVarb variable is in the wrong format for the numpy code? Alternatively, is the 1/1/1900 causing problems, so that the format %H:%M:%S needs to change? My end goal is to be able to reference the hour and the date to filter out specific date/hour combinations. Please help.
One note, when I change the HourVarb to '1/1/1900 15:00:00' the code above works as intended, but I'd still like to understand if there is a cleaner way that removes the date. Thanks
I'm not sure I understand the problem with the 'object' datatypes of these columns.
I loaded the data you provided this way:
df = pd.read_csv('xyz.csv')
df['Time'] = pd.to_datetime(df['Time'])
df['Date'] = df['Time'].dt.date
df['Hour'] = df['Time'].dt.time
print(df.dtypes)
And I get these data types:
Time datetime64[ns]
Date object
Hour object
The fact that Date and Hour are object types should not be a problem. The underlying values are Python date and time objects:
print(type(df.Date.iloc[0]))
print(type(df.Hour.iloc[0]))
<class 'datetime.date'>
<class 'datetime.time'>
This means you can use these columns as such. For example:
print(df['Date'] + pd.Timedelta('1D'))
What are you trying to do that is requiring the column dtype to be a Pandas dtype?
UPDATE
Here is how you achieve the last part of your question:
from datetime import datetime, time
hourVarb = datetime.strptime("15:00:00", '%H:%M:%S').time()
# or hourVarb = time(15, 0)
df['Test'] = df['Hour'] == hourVarb
print(df['Test'])
0 True
1 False
2 False
3 False
Name: Test, dtype: bool
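If the columns really must not be object dtype, one option is to keep the date as a midnight datetime64 value and the hour as an integer. This is only a sketch: the sample timestamps are invented because the original data is not shown, and using dt.normalize() and dt.hour is a suggestion rather than part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps standing in for the "Time" column
df = pd.DataFrame({
    "Time": ["2019-01-01 15:00:00", "2019-01-01 16:00:00", "2019-01-02 15:00:00"]
})
df["Time"] = pd.to_datetime(df["Time"])

# Date at midnight keeps a datetime64 dtype; hour becomes an integer column
df["Date"] = df["Time"].dt.normalize()
df["Hour"] = df["Time"].dt.hour

# Filter on a specific date/hour combination
mask = (df["Date"] == pd.Timestamp("2019-01-01")) & (df["Hour"] == 15)
df["Test"] = np.where(mask, 1, np.nan)
print(df)
```

This only distinguishes whole hours. Comparing df['Time'].dt.time against a datetime.time value, as shown above, is still needed for minute or second precision.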

How to pop out the error-causing date records using pandas?

I have a dataframe like as shown below
df = pd.DataFrame({'date': ['45:42.7','11/1/2012 0:00','20/1/2012 2:48','15/1/2012 0:00',np.nan]})
I would like to convert the date column to type datetime.
So, I tried the below
df['date'] = pd.to_datetime(df['date'])
I get the below error
ValueError: hour must be in 0..23
As we can see from the sample dataframe NA is not causing this error but the 1st record which is 45:42.7.
While the raw excel file displays this date value 45:42.7 when I open the file but when I double click the cell, it displays correctly the actual date.
How can I filter the dataframe to pop-out the first record as output (which is the error causing record)?
I expect my output to be like shown in sample dataframe below
df = pd.DataFrame({'error_date': ['45:42.7']})
First, if you need to see the wrong values, convert to datetimes with errors='coerce' and filter the rows that come back missing:
print(df[pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M',errors='coerce').isna()])
I think None is not a problem. You need to specify the column format; with the errors='coerce' parameter, rows that don't match the format become NaT:
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M',errors='coerce')
print (df)
date
0 2012-03-06 08:57:00
1 2012-01-11 00:00:00
2 2012-01-20 02:48:00
3 2012-01-15 00:00:00
4 NaT
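To get exactly the error_date frame asked for, the coerced result can be combined with a check that the original value was not already missing. This is a sketch using the sample data from the question; the names parsed and bad are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "date": ["45:42.7", "11/1/2012 0:00", "20/1/2012 2:48", "15/1/2012 0:00", np.nan]
})

# Values that don't match the format become NaT instead of raising
parsed = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M", errors="coerce")

# Keep rows that had a value but failed to parse (this excludes the original NaN)
bad = df.loc[parsed.isna() & df["date"].notna(), ["date"]].rename(
    columns={"date": "error_date"}
)
print(bad)
```

The notna() check is what separates real parse failures from rows that were empty to begin with.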
The Error is caused by using something like 24:00.
Testing with (note the change in the second entry to 24:00):
df = pd.DataFrame({'date': ['6/3/2012 8:57','11/1/2012 24:00','20/1/2012 2:48','15/1/2012 0:00',np.nan]})
We receive the same error as in your big dataframe. Going through the column with a for loop may be a bit slower, but this way we can catch the errors.
wrong_datetime_list = []
for index, value in enumerate(df['date']):
    try:
        df.loc[index, 'date'] = pd.to_datetime(df.loc[index, 'date'])
    except (ValueError, TypeError):
        # remember the position and raw value of every entry that fails to parse
        wrong_datetime_list.append((index, value))

Pandas dtype('O') to Date format

I believe I have tried every answer I could find on this page. The problem I am still having is converting a column to date format, and deleting the time from it.
This is my data frame, mixed datetimes and integers in passed array is the error I am getting with most of the methods I tried. I also failed converting it to a string -> date. Not sure how to approach this, any help would be appreciated.
This way you can convert to date and can also add other conditions:
import datetime
def convert_to_date(x):
    return datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df.loc[:, 'Date'] = df['Date'].apply(convert_to_date)
# alternatively, use a lambda function
df.loc[:, 'Date'] = df['Date'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
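A vectorized variant that also drops the time part might look like the sketch below. The original data frame is not shown, so the sample values are assumptions, and errors='coerce' is used here so that an unparseable entry becomes NaT instead of raising.

```python
import pandas as pd

# Hypothetical sample; the last value is deliberately not a valid date
df = pd.DataFrame({
    "Date": ["2020-01-05 10:30:00", "2020-02-10 00:00:00", "not a date"]
})

# Parse the whole column at once, then keep only the date part
df["Date"] = pd.to_datetime(
    df["Date"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
).dt.date
print(df)
```

The resulting column has object dtype and holds datetime.date values. Use .dt.normalize() instead of .dt.date if a datetime64 column is needed.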
