Formatting really inconsistent dates with Python

I have some really messed-up dates that I'm trying to get into a consistent format, %Y-%m-%d, where that applies. Some of the dates lack the day; others are in the future or just plain impossible, and those I'll simply flag as incorrect. How might I tackle such inconsistencies with Python?
sample dates:
4-Jul-97
8/31/02
20-May-95
5/12/92
Jun-13
8/4/98
90/1/90
3/10/77
7-Dec
nan
4/3/98
Aug-76
Mar-90
Sep, 2020
Apr-74
10/10/03
Dec-00

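You can use the dateutil parser if you want:
from dateutil.parser import parse
bad_dates = [...]
for d in bad_dates:
    try:
        print(parse(d))
    except Exception as err:
        print("couldn't parse", d, err)
This outputs the following (dateutil fills missing fields from the current date, which is where the year 2015 and the day 30 below come from):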
1997-07-04 00:00:00
2002-08-31 00:00:00
1995-05-20 00:00:00
1992-05-12 00:00:00
2015-06-13 00:00:00
1998-08-04 00:00:00
couldn't parse 90/1/90 day is out of range for month
1977-03-10 00:00:00
2015-12-07 00:00:00
couldn't parse nan unknown string format
1998-04-03 00:00:00
1976-08-30 00:00:00
1990-03-30 00:00:00
2020-09-30 00:00:00
1974-04-30 00:00:00
2003-10-10 00:00:00
couldn't parse Dec-00 day is out of range for month
If you would like to flag any that aren't an easy parse, you can check whether a date splits into three parts; if it does, try to parse it, otherwise flag it, like so:
flagged, good = [], []
splitters = ['-', ',', '/']
for d in bad_dates:
    try:
        # only attempt strings that split into three parts (day, month, year)
        if not any(len(d.split(s)) == 3 for s in splitters):
            raise ValueError("not a three-part date")
        good.append(parse(d))
    except Exception:
        flagged.append(d)

Some of the values are ambiguous, so you can get different results depending on your priorities. For example, if you want all dates to be treated consistently, you could specify a list of formats to try:
#!/usr/bin/env python
import re
import sys
from datetime import datetime
for line in sys.stdin:
    date_string = " ".join(re.findall(r'\w+', line)) # normalize delimiters
    for date_format in ["%d %b %y", "%m %d %y", "%b %y", "%d %b", "%b %Y"]:
        try:
            print(datetime.strptime(date_string, date_format).date())
            break
        except ValueError:
            pass
    else: # no break
        sys.stderr.write("failed to parse " + line)
Example:
$ python . <input.txt
1997-07-04
2002-08-31
1995-05-20
1992-05-12
2013-06-01
1998-08-04
failed to parse 90/1/90
1977-03-10
1900-12-07
failed to parse nan
1998-04-03
1976-08-01
1990-03-01
2020-09-01
1974-04-01
2003-10-10
2000-12-01
You could use other criteria instead. For example, you could maximize the number of dates that are parsed successfully, even if some dates are then treated inconsistently (the dateutil and pandas solutions may fall into this category).
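The question also asks for future or impossible dates to be flagged. Here is a minimal sketch of how the format-list idea could be extended to do that. The name clean_date, its today parameter, and the choice to return None for anything flagged are all assumptions for illustration, not part of the answer above. The year-less "%d %b" format is left out on purpose, so values like "7-Dec" end up flagged instead of landing in 1900.

```python
import re
from datetime import date, datetime

# Formats tried in order; the first one that matches wins.
FORMATS = ("%d %b %y", "%m %d %y", "%b %y", "%b %Y")

def clean_date(raw, today=None):
    """Return the date as 'YYYY-MM-DD', or None if it should be flagged."""
    today = today or date.today()
    # Normalize delimiters: "4-Jul-97" -> "4 Jul 97", "Sep, 2020" -> "Sep 2020"
    text = " ".join(re.findall(r"\w+", str(raw)))
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt).date()
        except ValueError:
            continue
        if parsed > today:
            return None  # a date in the future is flagged
        return parsed.strftime("%Y-%m-%d")
    return None  # nothing matched: impossible or unsupported value

for sample in ["4-Jul-97", "8/31/02", "90/1/90", "nan", "Sep, 2020"]:
    print(sample, "->", clean_date(sample, today=date(2015, 1, 1)))
```

Passing today explicitly keeps the result reproducible; leave it out to compare against the real current date.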

pd.datetools.to_datetime will have a go at guessing for you. It seems to do OK with most of your dates, although you might want to add some rules of your own. (pd.datetools has been removed from later pandas releases; pd.to_datetime is the equivalent call there.)
df['sample'].map(lambda x : pd.datetools.to_datetime(x))
Out[52]:
0 1997-07-04 00:00:00
1 2002-08-31 00:00:00
2 1995-05-20 00:00:00
3 1992-05-12 00:00:00
4 2015-06-13 00:00:00
5 1998-08-04 00:00:00
6 90/1/90
7 1977-03-10 00:00:00
8 2015-12-07 00:00:00
9 NaN
10 1998-04-03 00:00:00
11 1976-08-01 00:00:00
12 1990-03-01 00:00:00
13 2015-09-01 00:00:00
14 1974-04-01 00:00:00
15 2003-10-10 00:00:00
16 Dec-00
Name: sample, dtype: object

Related

Inconsistency when parsing year-weeknum string to date

When parsing year-weeknum strings, I came across an inconsistency when comparing the results from %W and %U (docs):
What works:
from datetime import datetime
print("\nISO:") # for reference...
for i in range(1,8): # %u is 1-based
    print(datetime.strptime(f"2019-01-{i}", "%G-%V-%u"))
# ISO:
# 2018-12-31 00:00:00
# 2019-01-01 00:00:00
# 2019-01-02 00:00:00
# ...
# %U -> week start = Sun
# first Sunday 2019 was 2019-01-06
print("\n %U:")
for i in range(0,7):
    print(datetime.strptime(f"2019-01-{i}", "%Y-%U-%w"))
# %U:
# 2019-01-06 00:00:00
# 2019-01-07 00:00:00
# 2019-01-08 00:00:00
# ...
What is unexpected:
# %W -> week start = Mon
# first Monday 2019 was 2019-01-07
print("\n %W:")
for i in range(0,7):
    print(datetime.strptime(f"2019-01-{i}", "%Y-%W-%w"))
# %W:
# 2019-01-13 00:00:00 ## <-- ?! expected 2019-01-06
# 2019-01-07 00:00:00
# 2019-01-08 00:00:00
# 2019-01-09 00:00:00
# 2019-01-10 00:00:00
# 2019-01-11 00:00:00
# 2019-01-12 00:00:00
The date jumps from 2019-01-13 back to 2019-01-07. What's going on here? I don't see any ambiguities in the calendar for 2019. I also tried to parse the same dates in Rust with chrono, and it fails for the %W directive (playground example). A jump backwards in Python and an error in Rust: what am I missing here?

That week goes from Monday January 7 to Sunday January 13.
%w is documented as "Weekday as a decimal number, where 0 is Sunday and 6 is Saturday.". So 0 means Sunday (= January 13), and 1 means Monday (= January 7).
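To see this for yourself, here is a small sketch using only the standard library. It walks the %w values in Monday-first order (1 to 6, then 0); the variable names are just for illustration.

```python
from datetime import datetime

# %w numbering: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
# %W weeks start on Monday, so Sunday (0) is the LAST day of the week.
monday_first = (1, 2, 3, 4, 5, 6, 0)

week_days = [
    datetime.strptime(f"2019-01-{weekday}", "%Y-%W-%w").date()
    for weekday in monday_first
]

for weekday, day in zip(monday_first, week_days):
    print(weekday, day, day.strftime("%A"))
```

Iterated in this order, the dates run from Monday 2019-01-07 to Sunday 2019-01-13 without any jump.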

In your code, the string "2019-01-0" is not parsed as a day of the month: with the format "%Y-%W-%w" it means year 2019, week 01, weekday 0. Weekday 0 is Sunday, and since %W weeks start on Monday, Sunday is the last day of that week. That's why you're encountering an unexpected result when using the %W format code.
If you want the dates in calendar order, start at weekday 1 and put weekday 0 last.
Also, don't zero-pad the weekday in the f-string: %w accepts a single digit only, so
f"2019-01-{i:02d}"
which adds a leading 0 when necessary, would produce the following strings, and those do not match the format.
2019-01-00
2019-01-01
2019-01-02
2019-01-03
2019-01-04
2019-01-05
2019-01-06
Here is your modified code:
for i in [1, 2, 3, 4, 5, 6, 0]:
    print(datetime.strptime(f"2019-01-{i}", "%Y-%W-%w"))

Convert a column to a specific time format which contains different types of time formats in python

This is my data frame
df = pd.DataFrame({
    'Time': ['10:00PM', '15:45:00', '13:40:00AM', '5:00']
})
Time
0 10:00PM
1 15:45:00
2 13:40:00AM
3 5:00
I need to convert the times to one specific format; my expected output is given below.
Time
0 22:00:00
1 15:45:00
2 01:40:00
3 05:00:00
I tried using the split and endswith functions of str, but that makes for a complicated solution. Is there a better way to achieve this?
Thanks in advance!

Here you go. One thing to mention, though: 13:40:00AM will result in an error, because a) an hour of 13 is invalid with AM/PM, where hours only go from 1 to 12, and b) 13 o'clock would be PM, so it cannot be AM at the same time :)
Cheers
import pandas as pd
df = pd.DataFrame({'Time': ['10:00PM', '15:45:00', '01:40:00AM', '5:00']})
df['Time'] = pd.to_datetime(df['Time'])
print(df['Time'].dt.time)
<<< 22:00:00
<<< 15:45:00
<<< 01:40:00
<<< 05:00:00
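If you really do need to keep a value like 13:40:00AM instead of fixing it by hand, here is a rough sketch with plain datetime. The function name normalize_time and the fallback rule (trust the AM/PM marker and fold the hour into the 12-hour range, so 13:40:00AM becomes 01:40:00 as in the expected output of the question) are assumptions; whether that rule is right for your data is something only you can decide.

```python
from datetime import datetime

FORMATS_12H = ("%I:%M:%S%p", "%I:%M%p")
FORMATS_24H = ("%H:%M:%S", "%H:%M")

def normalize_time(text):
    """Return the time as 'HH:MM:SS', or None if it cannot be parsed."""
    text = text.strip().upper()
    # First try the well-formed 12-hour and 24-hour spellings.
    for fmt in FORMATS_12H + FORMATS_24H:
        try:
            return datetime.strptime(text, fmt).strftime("%H:%M:%S")
        except ValueError:
            pass
    # Assumption: for contradictory values such as "13:40:00AM", keep the
    # AM/PM marker and reduce the hour modulo 12.
    if text.endswith(("AM", "PM")):
        suffix, body = text[-2:], text[:-2]
        for fmt in FORMATS_24H:
            try:
                parsed = datetime.strptime(body, fmt)
            except ValueError:
                continue
            hour = parsed.hour % 12 + (12 if suffix == "PM" else 0)
            return parsed.replace(hour=hour).strftime("%H:%M:%S")
    return None

for value in ["10:00PM", "15:45:00", "13:40:00AM", "5:00"]:
    print(value, "->", normalize_time(value))
```

With pandas this could be applied through df['Time'].map(normalize_time).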

Pandas - "time data does not match format " error when the string does match the format?

I'm getting a ValueError saying my data does not match the format when it does. Not sure if this is a bug or if I'm missing something here. I'm referring to this documentation for the string format. The weird part is that if I write the 'data' DataFrame to a CSV, read it back in, and then call the function below, it converts the dates, so I'm not sure why it doesn't work without the round trip through a CSV.
Any ideas?
data['Date'] = pd.to_datetime(data['Date'], format='%d-%b-%Y')
I'm getting two errors
TypeError: Unrecognized value type: <class 'str'>
ValueError: time data '27‑Aug‑2018' does not match format '%d-%b-%Y' (match)
Example dates -
2‑Jul‑2018
27‑Aug‑2018
28‑May‑2018
19‑Jun‑2017
5‑Mar‑2018
15‑Jan‑2018
11‑Nov‑2013
23‑Nov‑2015
23‑Jun‑2014
18‑Jun‑2018
30‑Apr‑2018
14‑May‑2018
16‑Apr‑2018
26‑Feb‑2018
19‑Mar‑2018
29‑Jun‑2015
Is it because not all of them have double-digit days? What is the string format value for single-digit days? That looks like it could be the cause, but I'm not sure why it would error on the '27' though.
End solution (the hyphens were a Unicode character, not the ASCII '-'):
data['Date'] = data['Date'].apply(unidecode.unidecode)
data['Date'] = data['Date'].apply(lambda x: x.replace("-", "/"))
data['Date'] = pd.to_datetime(data['Date'], format="%d/%b/%Y")

There seems to be an issue with your date strings. I replicated your issue with your sample data, and if I remove the hyphens and retype them manually (for the first three dates), then the code works:
pd.to_datetime(df1['Date'] ,errors ='coerce')
output:
0 2018-07-02
1 2018-08-27
2 2018-05-28
3 NaT
4 NaT
5 NaT
6 NaT
7 NaT
8 NaT
9 NaT
10 NaT
11 NaT
12 NaT
13 NaT
14 NaT
15 NaT
Bottom line: your hyphens look like regular ones but are actually a different character. Just clean your source data and you're good to go.

You have a special character here; it is not '-':
df.iloc[0,0][2]
Out[287]: '‑'
Replace it with '-'
pd.to_datetime(df.iloc[:,0].str.replace('‑','-'),format='%d-%b-%Y')
Out[288]:
0 2018-08-27
1 2018-05-28
2 2017-06-19
3 2018-03-05
4 2018-01-15
5 2013-11-11
6 2015-11-23
7 2014-06-23
8 2018-06-18
9 2018-04-30
10 2018-05-14
11 2018-04-16
12 2018-02-26
13 2018-03-19
14 2015-06-29
Name: 2‑Jul‑2018, dtype: datetime64[ns]
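A quick way to find out what the odd character actually is, and to clean it without hard-coding it, is the unicodedata module. The sketch below assumes the character is U+2011 (NON-BREAKING HYPHEN), which is what it appears to be in the sample; normalize_dashes and parse_date are illustrative helper names, and the approach simply maps every character in the Unicode "dash punctuation" category to a plain '-'.

```python
import unicodedata
from datetime import datetime

raw = "27\u2011Aug\u20112018"  # assumed: the sample uses U+2011, not '-'

# Show every non-ASCII character together with its official Unicode name.
for ch in raw:
    if ord(ch) > 127:
        print(repr(ch), hex(ord(ch)), unicodedata.name(ch))

def normalize_dashes(text):
    """Replace any Unicode dash punctuation (category Pd) with ASCII '-'."""
    return "".join(
        "-" if unicodedata.category(ch) == "Pd" else ch for ch in text
    )

def parse_date(text):
    """Parse a 'day-Mon-year' string after cleaning its dashes."""
    return datetime.strptime(normalize_dashes(text), "%d-%b-%Y").date()

print(parse_date(raw))
```

On a DataFrame column the same cleaning could be applied with .map(normalize_dashes) before calling pd.to_datetime with format='%d-%b-%Y'.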

Cleaning date column imported from excel

So I have this data set:
1.0 20/20/1999
2.0 31/2014
3.0 2015
4.0 2008-01-01 00:00:00
5.0 1903-10-31 00:00:00
6.0 1900-01-20 00:00:00
7.0 2011-02-21 00:00:00
8.0 1999-10-11 00:00:00
Those dates are imported from Excel, but since the dataset is large and comes from multiple sources, I can have any number of yyyy-mm-dd permutations, with - or / or nothing as separators, and with missing months or days. It's a nightmare.
I want to keep the valid formats, while values that are not recognized as valid should return a year or nothing.
This is where I've got so far:
I import as is from Excel
df['date_col'].date_format('%Y-%m-%d')
I found a regex to match only the year field, but I'm stuck on what to use it on: ^[0-9]{2,2}$
I have tried dateutil without success. It refuses to parse examples with month only.

I'm not familiar with a DataFrame or Series method called date_format, and your regex doesn't seem to return the year for me. That aside I would suggest defining a function that can handle any of these formats and map it along the date column. Like so:
df
date
0 20/20/1999
1 31/2014
2 2015
3 2008-01-01 00:00:00
4 1903-10-31 00:00:00
5 1900-01-20 00:00:00
6 2011-02-21 00:00:00
7 1999-10-11 00:00:00
import re
import pandas as pd
def convert_dates(x):
    try:
        out = pd.to_datetime(x)
    except ValueError:
        # drop a leading field of up to two digits (e.g. "31/") and retry
        x = re.sub('^[0-9]{,2}/', '', x)
        out = pd.to_datetime(x)
    return out
df.date.map(convert_dates)
0 1999-01-01
1 2014-01-01
2 2015-01-01
3 2008-01-01
4 1903-10-31
5 1900-01-20
6 2011-02-21
7 1999-10-11
Name: date, dtype: datetime64[ns]
Granted this function doesn't handle strings that don't contain a year, but your sample fails to include an example of this.
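If you would rather get "a year or nothing" for the values that are not valid dates, as the question asks, something along these lines might work. It is only a sketch with the standard library: the name date_or_year, the list of accepted formats, and the rule "fall back to the first standalone four-digit number" are my assumptions, so adjust them to your data.

```python
import re
from datetime import datetime

# Formats accepted as complete dates; extend this tuple as needed.
FULL_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y")

def date_or_year(value):
    """Return 'YYYY-MM-DD' for a valid date, else a bare year, else None."""
    text = str(value).strip()
    for fmt in FULL_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass
    # Fallback: the first run of exactly four digits is taken as the year.
    match = re.search(r"(?<!\d)(\d{4})(?!\d)", text)
    return match.group(1) if match else None

for sample in ["20/20/1999", "31/2014", "2015", "2008-01-01 00:00:00", "n/a"]:
    print(repr(sample), "->", date_or_year(sample))
```

Unlike convert_dates above, this keeps invalid rows as a plain year string instead of turning them into January 1st of that year.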

python pandas parse date without delimiters 'time data '060116' does not match format '%dd%mm%YY' (match)'

I am trying to parse a date column that looks like the one below,
date
061116
061216
061316
061416
However, I cannot get pandas to accept the date format, as there is no delimiter (e.g. '/'). I have tried the code below but receive this error:
ValueError: time data '060116' does not match format '%dd%mm%YY' (match)
pd.to_datetime(df['Date'], format='%dd%mm%YY')

You need to add the parameter errors='coerce' to to_datetime, because months 13 and 14 do not exist, so these dates are converted to NaT:
print (pd.to_datetime(df['Date'], format='%d%m%y', errors='coerce'))
0 2016-11-06
1 2016-12-06
2 NaT
3 NaT
Name: Date, dtype: datetime64[ns]
Or maybe you need to swap months with days:
print (pd.to_datetime(df['Date'], format='%m%d%y'))
0 2016-06-11
1 2016-06-12
2 2016-06-13
3 2016-06-14
Name: Date, dtype: datetime64[ns]
EDIT:
print (df)
Date
0 0611160130
1 0612160130
2 0613160130
3 0614160130
print (pd.to_datetime(df['Date'], format='%m%d%y%H%M', errors='coerce'))
0 2016-06-11 01:30:00
1 2016-06-12 01:30:00
2 2016-06-13 01:30:00
3 2016-06-14 01:30:00
Name: Date, dtype: datetime64[ns]
Python's strftime directives.
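The difference between the two format strings can also be checked without pandas, since the directives are the same ones datetime.strptime uses. A small sketch (parse_compact is just an illustrative helper name):

```python
from datetime import datetime

def parse_compact(text, fmt):
    """Parse a delimiter-free date string, returning None if it doesn't fit."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None

for value in ["061116", "061216", "061316", "061416"]:
    day_first = parse_compact(value, "%d%m%y")    # day, month, 2-digit year
    month_first = parse_compact(value, "%m%d%y")  # month, day, 2-digit year
    print(value, "day-first:", day_first, "month-first:", month_first)
```

Each directive is a single letter after the percent sign, so '%dd%mm%YY' from the question asks for literal 'd', 'm' and 'Y' characters in the data, and the uppercase %Y expects a four-digit year.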

Your date format is wrong. You have days and months reversed, and a two-digit year needs the lowercase %y. It should be:
%m%d%y
