pandas handle column with different date time formats gracefully - python

I have a column with a birthdate. Some values are N.A., some are 01.01.2016, but some contain 01.01.2016 01:01:01.
Filtering out the N.A. values works fine, but handling the different date formats seems clumsy. Is it possible to have pandas handle these gracefully, e.g. for a birthdate interpret only the date and not fail?

pd.to_datetime() will handle multiple formats:
>>> ser = pd.Series(['NaT', '01.01.2016', '01.01.2016 01:01:01'])
>>> pd.to_datetime(ser)
0 NaT
1 2016-01-01 00:00:00
2 2016-01-01 01:01:01
dtype: datetime64[ns]
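If you want only the date part of a birthdate regardless of whether a time is attached, one sketch (using the question's day-first data, with made-up format strings matching it) is to try each expected format explicitly and combine the results:

```python
import pandas as pd

ser = pd.Series(['NaT', '01.01.2016', '01.01.2016 01:01:01'])

# Parse each expected format separately; non-matching rows become NaT.
dates = pd.to_datetime(ser, format='%d.%m.%Y', errors='coerce')
stamps = pd.to_datetime(ser, format='%d.%m.%Y %H:%M:%S', errors='coerce')

# Prefer the date-only parse, fall back to the timestamped one with
# the time component dropped (normalize() sets it to midnight).
birthdates = dates.fillna(stamps.dt.normalize())
```

Trying each format explicitly also sidesteps version differences in how pd.to_datetime infers mixed formats.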

Related

Change date column to datetime

I am working on a stock market analysis where I look at past balance sheets and income statements, and I want to change the date column, which stores dates as strings of the form "2021-09-30", into datetimes. I am trying to use pd.to_datetime but it is giving me an error.
When I run
df['datekey'] = pd.to_datetime(df['datekey'], format='%Y-%m-%d')
I get
"ValueError: time data "2021-09-30" doesn't match format specified"
when it should (if I am doing this correctly).
This column doesn't have a time value in it. It is just (for all dates) "2021-09-30".
You have extra quotes and spaces in your data. Try:
df["datekey"] = pd.to_datetime(df["datekey"].str.replace(" ","").str.strip('"'), format="%Y-%m-%d")
>>> df["datekey"]
0 2021-09-30
1 2021-06-30
2 2021-03-31
3 2020-12-31
4 2020-09-30
5 2020-06-30
6 2020-03-31
7 2019-12-31
8 2019-09-30
9 2019-06-30
Name: datekey, dtype: datetime64[ns]
It seems the value itself is enclosed in double quotes, so you need to include the quotes in your format as well:
df['datekey'] = pd.to_datetime(df['datekey'], format='"%Y-%m-%d"')
Alternatively, you can strip off the quotes before converting to datetime; this is useful if some values are not enclosed in double quotes:
df['datekey'] = pd.to_datetime(df['datekey'].str.strip('"'), format='%Y-%m-%d')
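A small reproduction (with made-up values) showing both approaches; the literal quote characters either go into the format string or get stripped first:

```python
import pandas as pd

# Hypothetical data where each value carries literal double quotes
df = pd.DataFrame({'datekey': ['"2021-09-30"', '"2021-06-30"', '"2021-03-31"']})

# Option 1: include the quotes in the format itself
with_quotes = pd.to_datetime(df['datekey'], format='"%Y-%m-%d"')

# Option 2: strip the quotes first, then parse normally
stripped = pd.to_datetime(df['datekey'].str.strip('"'), format='%Y-%m-%d')
```

Both produce the same datetime64[ns] column; the second is more forgiving when only some values are quoted.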

How to check if a column has a particular Date format or not using DATETIME in python?

I am new to Python. I have a dataframe with a date column that contains dates in different formats. I would like to check whether each value follows a particular date format, and drop it if it does not. I have tried using try/except while iterating over the rows, but I am looking for a faster way to check whether the column follows a particular date format and to drop the rows that don't. Is there a faster way to do it, e.g. using the datetime library?
My code:
Date_format = '%Y%m%d'
df =
Date abc
0 2020-03-22 q
1 03-12-2020 w
2 55552020 e
3 25122020 r
4 12/25/2020 r
5 1212202033 y
Expected output:
Date abc
0 2020-03-22 q
You could try
pd.to_datetime(df.Date, errors='coerce')
0 2020-03-22
1 2020-03-12
2 NaT
3 NaT
4 2020-12-25
5 NaT
It's then easy to drop the null values.
EDIT:
For a given format you can still leverage pd.to_datetime:
datetimes = pd.to_datetime(df.Date, format='%Y-%m-%d', errors='coerce')
datetimes
0 2020-03-22
1 NaT
2 NaT
3 NaT
4 NaT
5 NaT
df.loc[datetimes.notnull()]
Also note that I am using the format %Y-%m-%d, which I think is the one you want based on your expected output (not the one you gave as Date_format).
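Putting the pieces together on the question's sample data, a sketch of the filter:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2020-03-22', '03-12-2020', '55552020',
                            '25122020', '12/25/2020', '1212202033'],
                   'abc': ['q', 'w', 'e', 'r', 'r', 'y']})

# Rows that don't match the strict format become NaT...
datetimes = pd.to_datetime(df['Date'], format='%Y-%m-%d', errors='coerce')

# ...and are then dropped, keeping only rows in %Y-%m-%d form.
result = df.loc[datetimes.notnull()]
```

This avoids iterating over the rows entirely, which is where the try/except approach gets slow.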

Convert YYYYMMDD into YYYY-MM-DD and HHMMSS into HH:MM:SS for candlestick plotting

I've been trying to find an answer for 4 hours, but no luck. Any help will be much appreciated.
Goal: convert 20170103 into 2017-01-03 and 022100 into 02:21:00 for candlestick plotting
date_int = 20170103
df = pd.DataFrame({'date':[date_int]*10})
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d'))
print(df['date'])
time_int = 020100
df = pd.DataFrame({'time':[time_int]*10})
df['time'] = df['time'].apply(lambda x: pd.to_datetime(str(x), format='%H:%M:%S'))
print(df['time'])
but the second snippet raises an 'invalid token' error.
I also notice that this code performs very slowly. If there is a more efficient way, please let me know. Thank you so much in advance for your help.
To expand on my comments, you have a few things wrong here. Firstly, as mentioned, the format in your second example is wrong. Your data has the format '%H%M%S', so that is the one you need to pass as the format argument.
When using pd.to_datetime, the specified format describes how the input data is actually laid out, so that it can be parsed correctly.
In order to further modify it, you need to add Series.dt.strftime:
date_int = 20170103
df = pd.DataFrame({'date':[date_int]*10})
df.date = pd.to_datetime(df.date, format='%Y%m%d').dt.strftime('%Y-%m-%d')
date
0 2017-01-03
1 2017-01-03
2 2017-01-03
3 2017-01-03
4 2017-01-03
5 2017-01-03
6 2017-01-03
7 2017-01-03
8 2017-01-03
9 2017-01-03
So similarly for your second example you need:
df.time = pd.to_datetime(df.time, format='%H%M%S').dt.strftime('%H:%M:%S')
Based on my comment above: the 'invalid token' error occurs because 020100 is not a valid integer literal (leading zeros are not allowed in Python 3), so make it a string, surrounded by single or double quotes:
time_int = '020100'
df = pd.DataFrame({'time':[time_int]*10})
df['time'] = df['time'].apply(lambda x: pd.to_datetime(str(x), format='%H%M%S'))
df['time'] = df['time'].dt.time
print(df['time'])
Output:
0 02:01:00
1 02:01:00
2 02:01:00
3 02:01:00
4 02:01:00
5 02:01:00
6 02:01:00
7 02:01:00
8 02:01:00
9 02:01:00
Looking at the question, it seems the original post was two test cases to get code using the pandas package debugged. The comment that the code ran slowly suggests that a file of dates and times is being read. Given that candlestick plots can work with a datetime object, perhaps this can all be solved simply:
Reading each line, pull the date and time out as a single string, say '20170103 022100'.
Use datetime to parse it directly to a datetime object.
import datetime as dt
ts='20170103 022100'
result=dt.datetime.strptime(ts,'%Y%m%d %H%M%S')
What's nice about strptime is that the single space in the format represents whitespace, so the multiple spaces in the string parse correctly.
Hope that simplifies things.
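If the file really does have separate integer date and time columns, a vectorized sketch (assuming the time ints lost their leading zeros, as 022100 would when stored as the integer 22100):

```python
import pandas as pd

df = pd.DataFrame({'date': [20170103, 20170103],
                   'time': [22100, 120000]})  # 022100 stored as 22100

# Zero-pad the time back to 6 digits, join with the date, and parse once.
ts = df['date'].astype(str) + ' ' + df['time'].astype(str).str.zfill(6)
df['datetime'] = pd.to_datetime(ts, format='%Y%m%d %H%M%S')
```

A single vectorized pd.to_datetime call over the joined strings should also be considerably faster than the per-row apply in the question.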

Setting a dataframe columns to type datetime when there are blanks in the columns

I have a dataframe (df) with two columns where the head looks like
name start end
0 John 2018-11-09 00:00:00 2012-03-01 00:00:00
1 Steve 1990-09-03 00:00:00
2 Debs 1977-09-07 00:00:00 2012-07-02 00:00:00
3 Mandy 2009-01-09 00:00:00
4 Colin 1993-08-22 00:00:00 2002-06-03 00:00:00
The start and end columns have type object. I want to change the type to datetime so I can use the following:
referenceError = DeptTemplate['start'] > DeptTemplate['end']
I am trying to change the type using:
df['start'].dt.strftime('%d/%m/%Y')
df['end'].dt.strftime('%d/%m/%Y')
but I think the rows with no date in those columns are causing a problem. How can I handle the blank values so I can change the type to datetime and run my analysis?
As shown in the .to_datetime docs, you can set the behavior for unparseable values (including blanks) using the errors kwarg. Note that the format kwarg describes how the input strings are laid out, so it must match the stored data:
# Blank or bad values will be NaT
df["start"] = pd.to_datetime(df["start"], errors='coerce', format='%Y-%m-%d %H:%M:%S')
As mentioned in the comments, you can prepare the column with replace if you absolutely must use strftime.
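A sketch on data shaped like the question's, with blanks in the date columns (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['John', 'Steve', 'Debs'],
    'start': ['2018-11-09 00:00:00', '', '1977-09-07 00:00:00'],
    'end':   ['2012-03-01 00:00:00', '1990-09-03 00:00:00', '2012-07-02 00:00:00'],
})

# Blanks can't be parsed, so errors='coerce' turns them into NaT.
for col in ('start', 'end'):
    df[col] = pd.to_datetime(df[col], errors='coerce')

# Comparisons against NaT are simply False, so this no longer fails.
referenceError = df['start'] > df['end']
```

Rows with a NaT on either side just evaluate to False in the comparison, so the analysis runs without special-casing them.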

How do I prevent pandas.to_datetime() function from converting 0001-01-01 to 2001-01-01

I have read-only access to a database that I query and read into a Pandas dataframe using pymssql. One of the variables contains dates, some of which are stored as midnight on 01 Jan 0001 (i.e. 0001-01-01 00:00:00.0000000). I've no idea why those dates should be included – as far as I know, they are not recognised as a valid date by SQL Server and they are probably due to some default data entry. Nevertheless, that's what I have to work with. This can be recreated as a dataframe as follows:
import numpy as np
import pandas as pd
tempDF = pd.DataFrame({ 'id': [0,1,2,3,4],
'date': ['0001-01-01 00:00:00.0000000',
'2015-05-22 00:00:00.0000000',
'0001-01-01 00:00:00.0000000',
'2015-05-06 00:00:00.0000000',
'2015-05-03 00:00:00.0000000']})
The dataframe looks like:
print(tempDF)
date id
0 0001-01-01 00:00:00.0000000 0
1 2015-05-22 00:00:00.0000000 1
2 0001-01-01 00:00:00.0000000 2
3 2015-05-06 00:00:00.0000000 3
4 2015-05-03 00:00:00.0000000 4
... with the following dtypes:
print(tempDF.dtypes)
date object
id int64
dtype: object
I routinely convert date fields in the dataframe to datetime format using:
tempDF['date'] = pd.to_datetime(tempDF['date'])
However, by chance, I've noticed that the 0001-01-01 date is converted to 2001-01-01.
print(tempDF)
date id
0 2001-01-01 0
1 2015-05-22 1
2 2001-01-01 2
3 2015-05-06 3
4 2015-05-03 4
I realise that the dates in the original database are incorrect because SQL Server doesn't see 0001-01-01 as a valid date. But at least in the 0001-01-01 format, such missing data are easy to identify within my Pandas dataframe. However, when pandas.to_datetime() changes these dates so they lie within a feasible range, it is very easy to miss such outliers.
How can I make sure that pd.to_datetime doesn't interpret the outlier dates incorrectly?
If you provide a format, these dates will not be recognized:
In [92]: pd.to_datetime(tempDF['date'], format="%Y-%m-%d %H:%M:%S.%f", errors='coerce')
Out[92]:
0 NaT
1 2015-05-22
2 NaT
3 2015-05-06
4 2015-05-03
Name: date, dtype: datetime64[ns]
By default it will raise an error, but by passing errors='coerce' these values are converted to NaT (use coerce=True for older pandas versions).
The reason pandas converts these "0001-01-01" dates to "2001-01-01" when no format is provided is that this is the behaviour of dateutil:
In [32]: import dateutil
In [33]: dateutil.parser.parse("0001-01-01")
Out[33]: datetime.datetime(2001, 1, 1, 0, 0)
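To confirm on the question's own data, a sketch showing that the strict format plus errors='coerce' keeps the out-of-range dates visible as NaT instead of silently shifting them to 2001:

```python
import pandas as pd

tempDF = pd.DataFrame({'id': [0, 1, 2],
                       'date': ['0001-01-01 00:00:00.0000000',
                                '2015-05-22 00:00:00.0000000',
                                '0001-01-01 00:00:00.0000000']})

# Year 1 is outside the datetime64[ns] range, so with an explicit format
# and errors='coerce' those rows become NaT rather than 2001-01-01.
tempDF['date'] = pd.to_datetime(tempDF['date'],
                                format='%Y-%m-%d %H:%M:%S.%f',
                                errors='coerce')
```

The NaT rows are then easy to spot (or drop) before the analysis, which was the original goal.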
