I have a date column with dates in multiple formats:
Date
2022-01-01 00:00:00
jan 20
january 19
How can I convert them, in a scalable way (without a dictionary), to a proper datetime format?
I tried:
df['Date_1'] = pd.to_datetime(df['Date'], errors='coerce').astype(str)
df['Date_2'] = pd.to_datetime(df['Date'], errors='coerce', yearfirst=False, format='%B %y').astype(str)
df['Date_1'] = df['Date_1'].str.replace('NaT', '')
df['Date_2'] = df['Date_2'].str.replace('NaT', '')
Then, I merged the two columns, with:
df['Date_3'] = df['Date_1'] + df['Date_2']
But this is not working, since I also need to handle another format (for the non-abbreviated month names).
When I add the same logic again but change %B to %b, some months are matched twice (like May, which is both an abbreviation and a full month name).
I would like to have the end result:
2022-01-01
2020-01-01
2019-01-01
There is no direct way to handle all formats at once.
What you can do is chain successive converters. Here I handled both "january 19" and "jan 20" by using a regex to truncate the month names to three letters, so a single '%b %y' format covers them. You can chain additional .fillna(<new_converter>) calls should you find more formats in the future, as sketched after the output below.
(pd
 .to_datetime(df['Date'], errors='coerce')
 .fillna(pd.to_datetime(df['Date'].str.replace('([a-z]{3})[a-z]+', r'\1', regex=True),
                        errors='coerce', yearfirst=False, format='%b %y')
         )
)
output:
0 2022-01-01
1 2020-01-01
2 2019-01-01
Name: Date, dtype: datetime64[ns]
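For instance, if a hypothetical extra format such as "20/01" (day/month, not present in the data above) turned up later, you could chain one more converter onto the same pipeline. A sketch, assuming pd and df as above:
parsed = (
    pd.to_datetime(df['Date'], errors='coerce')
    .fillna(pd.to_datetime(df['Date'].str.replace('([a-z]{3})[a-z]+', r'\1', regex=True),
                           errors='coerce', format='%b %y'))
    # hypothetical extra converter: a bare '%d/%m' value parses with a default year of 1900
    .fillna(pd.to_datetime(df['Date'], errors='coerce', format='%d/%m'))
)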
Use combine_first to try it with a variety of different date formats:
date = pd.to_datetime(df["Date"], errors="coerce")
for format in ["%b %y", "%B %y"]:
    date = date.combine_first(pd.to_datetime(df["Date"], format=format, errors="coerce"))
df["Date"] = date
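A self-contained version of the same loop, using the three example strings from the question (with a recent pandas, the first pass only matches the ISO-style value and the loop picks up the rest):
import pandas as pd

df = pd.DataFrame({"Date": ["2022-01-01 00:00:00", "jan 20", "january 19"]})

date = pd.to_datetime(df["Date"], errors="coerce")  # handles the ISO-style value
for fmt in ["%b %y", "%B %y"]:                      # abbreviated, then full month names
    date = date.combine_first(pd.to_datetime(df["Date"], format=fmt, errors="coerce"))
df["Date"] = date  # 2022-01-01, 2020-01-01, 2019-01-01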
I am new to pandas and I am trying to convert an int-type column to a date-type column.
The int in the df is something like 10712 (first the day, then the month, then the year).
I tried solving this with:
df_date = pd.to_datetime(df['Date'], format='%d%m%Y')
but I always get the following value error:
time data '10712' does not match format '%d%m%Y' (match)
Thank you for your help :)
You should use %y (2-digit year) instead of %Y (4-digit year). But that is not enough.
The format %d%m%y converts 10712 to 10-07-2012, not to 1-07-2012 as you expect.
That's because of the following feature of the underlying strptime:
When used with the strptime() method, the leading zero is optional for %m.
A workaround could be to convert to a format properly understandable by strptime (and to_datetime):
>>> df = pd.DataFrame({'date': [10712, 20813, 30914]})
>>> df
date
0 10712
1 20813
2 30914
>>> df1 = df.date.astype(str).str.replace(r'(\d+)(\d\d)(\d\d)', r'\2/\1/\3', regex=True)
>>> df1
0 07/1/12
1 08/2/13
2 09/3/14
>>> pd.to_datetime(df1)
0 2012-07-01
1 2013-08-02
2 2014-09-03
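Under the same assumption (the last four digits are always MMYY and whatever precedes them is the day), a shorter alternative sketch is to left-pad the string to six characters so that %d%m%y applies directly:
>>> pd.to_datetime(df.date.astype(str).str.zfill(6), format='%d%m%y')
0   2012-07-01
1   2013-08-02
2   2014-09-03
Name: date, dtype: datetime64[ns]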
Use the %y year specifier to parse a year without century digits:
In [654]: pd.to_datetime(10712, format='%d%m%y')
Out[654]: Timestamp('2012-07-10 00:00:00')
I'm trying to convert all the data in the column below to dates.
Event Date
2020-07-16 00:00:00
31/03/2022, 26/11/2018, 31/01/2028
This is just a small section of the data - there are more columns/rows.
I've tried to split out the cells with multiple values using the below:
df["Event Date"] = df["Event Date"].str.replace(' ', '')
df["Event Date"] = df["Event Date"].str.split(",")
df= df.explode("Event Date")
The issue with this is that it sets any cell without a ',' (e.g. '2020-07-16 00:00:00') to NaN.
Is there any way to separate the values with a ',' and set the entire column to date types?
You can use a combination of split and explode to separate the dates, then use infer_datetime_format to convert the mixed date formats:
df = df.assign(dates=df['dates'].str.split(',')).explode('dates')
df
Out[18]:
dates
0 2020-07-16 00:00:00
1 31/03/2022
1 26/11/2018
1 31/01/2028
df.dates = pd.to_datetime(df.dates, infer_datetime_format=True)
df.dates
Out[20]:
0 2020-07-16
1 2022-03-31
1 2018-11-26
1 2028-01-31
Name: dates, dtype: datetime64[ns]
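Note that infer_datetime_format was deprecated in pandas 2.0, where strict format inference is the default. If you are on pandas 2.0 or newer (an assumption about your environment), a rough equivalent of the conversion above is format='mixed', which parses each element individually:
df.dates = pd.to_datetime(df.dates, format='mixed', dayfirst=True)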
Here is an approach using pandas.Series.str.split and pandas.Series.explode:
s_dates = (
    df["Event Date"]
    .str.split(",")
    .explode(ignore_index=True)
    .apply(pd.to_datetime, dayfirst=True)
)
Output:
0 2020-07-16
1 2022-03-31
2 2018-11-26
3 2028-01-31
Name: Event Date, dtype: datetime64[ns]
Your example table shows mixed date formats across rows. The idea is to try one date-parsing technique and then try another if it fails. Loops and such wide variation in data formats are red flags in a script's design. I recommend using datetime and dateutil to handle the dates.
from datetime import datetime
from dateutil import parser
date_strings = ["2020-07-16 00:00:00", "31/03/2022, 26/11/2018, 31/01/2028"]  # Get these from your table.
parsed_dates = []
for date_string in date_strings:
    try:
        # strptime for the ISO-like values
        date_object = datetime.strptime(date_string, "%Y-%m-%d %H:%M:%S")
        parsed_dates.append(date_object)
    except ValueError:
        # split on commas and parser.parse() each part
        parts = date_string.split(",")
        for date_str in parts:
            date_str = date_str.strip()
            date_object = parser.parse(date_str, dayfirst=True)
            parsed_dates.append(date_object)
print(parsed_dates)
Try the code on Trinket: https://trinket.io/python3/95c0d14271
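If you then want the parsed values back in pandas rather than in a plain Python list, a minimal follow-up sketch (assuming the parsed_dates list built above):
import pandas as pd

event_dates = pd.Series(parsed_dates, name="Event Date")  # a list of datetime objects becomes a datetime64 Series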
I'm getting data from an API and putting it into a Pandas DataFrame. The date column needs formatting into date/time, which I am doing. However, the API sometimes returns dates without milliseconds, which doesn't match the format pattern. This results in an error:
time data '2020-07-30T15:57:37Z' does not match format '%Y-%m-%dT%H:%M:%S.%fZ' (match)
In this example, how can I format the date column to date/time, so all dates are formatted with milliseconds?
import pandas as pd
dates = {
'date': ['2020-07-30T15:57:37Z', '2020-07-30T15:57:37.1Z']
}
df = pd.DataFrame(dates)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%S.%fZ')
print(df)
Do it once with the milliseconds included and once without. Use errors='coerce' to return NaT when a ValueError occurs.
with_milliseconds = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%S.%fZ', errors='coerce')
without_milliseconds = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%SZ', errors='coerce')
The results would be something like this:
with milliseconds:
0 NaT
1 2020-07-30 15:57:37.100
Name: date, dtype: datetime64[ns]
without milliseconds:
0 2020-07-30 15:57:37
1 NaT
Name: date, dtype: datetime64[ns]
Then you can fill the NaTs of one Series with the values of the other, because they complement each other:
with_milliseconds.fillna(without_milliseconds)
0 2020-07-30 15:57:37.000
1 2020-07-30 15:57:37.100
Name: date, dtype: datetime64[ns]
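To finish, you can assign the combined result back to the DataFrame (same Series names as above):
df['date'] = with_milliseconds.fillna(without_milliseconds)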
To have a consistent format in your output DataFrame, you could run a regex replacement, before converting to a DataFrame, on all values without milliseconds:
import re

dates = {'date': [re.sub(r'Z', '.0Z', date) if '.' not in date else date for date in dates['date']]}
Since only the dates containing a '.' already have milliseconds, we can run the replacement on the others.
After that, everything else is the same as in your code.
Output:
date
0 2020-07-30 15:57:37.000
1 2020-07-30 15:57:37.100
As your date strings appear to be standard ISO 8601, you can simply omit the format parameter. The parser takes into account that milliseconds are optional.
import pandas as pd
dates = {
'date': ['2020-07-30T15:57:37Z', '2020-07-30T15:57:37.1Z']
}
df = pd.DataFrame(dates)
df['date'] = pd.to_datetime(df['date'])
print(df)
date
0 2020-07-30 15:57:37+00:00
1 2020-07-30 15:57:37.100000+00:00
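One difference with this approach: because of the trailing Z, the parsed values are timezone-aware (UTC), as the +00:00 in the output shows. If you prefer naive timestamps, like the format-based parse produces, one option (a sketch using tz_localize) is to drop the timezone afterwards:
df['date'] = df['date'].dt.tz_localize(None)  # drop the UTC offset, keep the wall-clock time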
For a date column I have data like this: 19.01.01, which means 2019-01-01. Is there a method to change the format from the former to the latter?
My idea is to add 20 to the start of the date and replace '.' with '-'. Is there a better way to do that?
Thanks.
If the format is YY.DD.MM, use %y.%d.%m; if the format is YY.MM.DD, use %y.%m.%d in to_datetime:
df = pd.DataFrame({'date':['19.01.01','19.01.02']})
#YY.DD.MM
df['date'] = pd.to_datetime(df['date'], format='%y.%d.%m')
print (df)
date
0 2019-01-01
1 2019-02-01
#YY.MM.DD
df['date'] = pd.to_datetime(df['date'], format='%y.%m.%d')
print (df)
date
0 2019-01-01
1 2019-01-02
I have a DataFrame that has dates stored in different formats in the same column, as shown below:
date
1-10-2018
2-10-2018
3-Oct-2018
4-10-2018
Is there any way I could make all of them have the same format?
Use to_datetime with specified formats and errors='coerce' to replace non-matching values with NaT. Then use combine_first to fill the missing values from the date2 Series.
date1 = pd.to_datetime(df['date'], format='%d-%m-%Y', errors='coerce')
date2 = pd.to_datetime(df['date'], format='%d-%b-%Y', errors='coerce')
df['date'] = date1.combine_first(date2)
print (df)
date
0 2018-10-01
1 2018-10-02
2 2018-10-03
3 2018-10-04
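Should more formats appear later, the same idea extends to any list of formats; a minimal sketch, assuming df['date'] still holds the raw strings:
from functools import reduce

formats = ['%d-%m-%Y', '%d-%b-%Y']  # extend this list as new formats show up
parsed = [pd.to_datetime(df['date'], format=f, errors='coerce') for f in formats]
df['date'] = reduce(lambda a, b: a.combine_first(b), parsed)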