Confusing day and month when parsing dates in Python

I want to parse dates the same way every time. I wrote code that detects a date in a string, but it interprets the date differently depending on the format. My date is the 3rd of November 2020, and the parser mixes up the day and the month:
import datefinder
import time
sample_dates = ["this is my sample date 2020.11.03 yes yes",
                "this is my sample date 2020-11-03 yes yes",
                "this is my sample date 2020/11/03 yes yes",
                "this is my sample date 03.11.2020 yes yes",
                "this is my sample date 03.11.2020 yes yes",
                "this is my sample date 03/11/2020 yes yes"]
for sample in sample_dates:
    matches = datefinder.find_dates(sample)
    matches = list(matches)
    print(matches)
    print(matches[0].strftime("%Y-%m-%d"))
    print()
How can I fix this?
I want the date parsed the same way every time, regardless of its format in the string.
Can you show me how to do that? I do not mind which library I use.

As I stated in your other question about extracting dates, no universal date extractor can handle every date/time format. Cleaning the dates in your case requires a multi-pronged approach, which I have outlined in the examples below:
Updated 12-06-2020
I'm going to assume that some of the dates in sample_dates come from the German-language websites that you're scraping. The code below can parse those dates.
Please review the parameters for dateutil.parser.parse, in particular dayfirst.
import datefinder
import re as regex
import dateutil.parser as dparser
sample_dates = ["this is my sample date 2020.11.03 yes yes",
                "this is my sample date 2020-11-03 yes yes",
                "this is my sample date 2020/11/03 yes yes",
                "this is my sample date 03.11.2020 yes yes",
                "this is my sample date 03.11.2020 yes yes",
                "this is my sample date 03/11/2020 yes yes"]
for sample in sample_dates:
    # Year-first dates (YYYY<sep>MM<sep>DD) are handled by datefinder directly
    check_year_first = regex.search(r'\d{4}\W\d{1,2}\W\d{1,2}', sample)
    if check_year_first:
        date_strings = datefinder.find_dates(sample)
        for date_string in date_strings:
            # find_dates yields datetime objects, so they can be formatted directly
            reformatted_date = date_string.strftime('%Y-%m-%d')
            print(reformatted_date)
    else:
        # source=True yields (datetime, matched_text) tuples; re-parse the matched text day-first
        date_strings = datefinder.find_dates(sample, source=True)
        for date_string in date_strings:
            reformatted_date = dparser.parse(str(date_string[1]).replace('date', ''), dayfirst=True).strftime('%Y-%m-%d')
            print(reformatted_date)
I'm also going to assume that the dates in this question come from the output of your question "Extract date from multiple webpages with Python". If that is correct, add this code to the code that I provided for that question. As I said before, my other code can be modified to fit your data extraction and cleaning requirements.
import re as regex
import dateparser
from datetime import datetime
def reformatted_dates(date_string):
    # date format 3. Nov. 2020
    date_format_01 = regex.search(r'\d{1,2}\W\s\w+\W\s\d{4}', date_string)
    # date format 3. Dezember 2020
    date_format_02 = regex.search(r'\d{1,2}\W\s\w+\s\d{4}', date_string)
    if date_format_01:
        reformatted_date = dateparser.parse(date_string).strftime('%m.%d.%Y')
        return reformatted_date
    elif date_format_02:
        reformatted_date = dateparser.parse(date_string).strftime('%m.%d.%Y')
        return reformatted_date
    else:
        # date format 18.11.2020
        reformatted_date = datetime.strptime(date_string, '%d.%m.%Y').strftime('%m-%d-%Y')
        return reformatted_date
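For the purely numeric samples in the question, a library-free alternative is to locate the date with a regular expression and decide the field order from where the four-digit year sits. The sketch below is only an illustration under one stated assumption: a date that ends with the year is always day-first (DD.MM.YYYY), as on German sites. DATE_PATTERN and to_iso are hypothetical names, not part of any library.

```python
import re
from datetime import date

# Either YYYY<sep>MM<sep>DD or DD<sep>MM<sep>YYYY, where <sep> is ".", "-" or "/".
DATE_PATTERN = re.compile(
    r"\b(?:(?P<y1>\d{4})[./-](?P<m1>\d{1,2})[./-](?P<d1>\d{1,2})"
    r"|(?P<d2>\d{1,2})[./-](?P<m2>\d{1,2})[./-](?P<y2>\d{4}))\b"
)


def to_iso(text):
    """Return the first date found in text as 'YYYY-MM-DD', or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    if match.group("y1"):
        year, month, day = match.group("y1", "m1", "d1")
    else:
        # Assumption: a year-last date is always day-first (DD.MM.YYYY).
        day, month, year = match.group("d2", "m2", "y2")
    try:
        return date(int(year), int(month), int(day)).isoformat()
    except ValueError:
        # Matched the pattern but is not a real calendar date, e.g. 31.02.2020.
        return None


samples = [
    "this is my sample date 2020.11.03 yes yes",
    "this is my sample date 2020/11/03 yes yes",
    "this is my sample date 03.11.2020 yes yes",
    "this is my sample date 03/11/2020 yes yes",
]
for sample in samples:
    print(to_iso(sample))  # each line prints 2020-11-03
```

Because the field order is fixed by the position of the year rather than guessed per string, every sample is normalised the same way. Text containing month names would still need one of the libraries above.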

Related

Getting exception : Unknown string format:

I am trying to extract dates from text using the dateutil library (Python 3.7).
I want to extract all dates from the text using the code below.
import datetime
import dateutil.parser as dparser
text = 'First date is 10 JANUARY 2000 and second date is 31/3/2000'
dt = dparser.parse(text, fuzzy=True, dayfirst=True, default=datetime.datetime(1800, 1, 1))
But I get the following exception:
Unknown string format: First date is 10 JANUARY 2000 and second date is 31/3/2000
Please let me know of any way to extract multiple dates from the text.
How about using datefinder?
import datefinder
string_with_dates = '''
First date is 10 JANUARY 2000 and second date is 31/3/2000
'''
matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)
Output:
2000-01-10 00:00:00
2000-03-31 00:00:00
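If installing datefinder is not an option, a rough standard-library sketch for the two layouts in this example text is to pair a regular expression with a matching strptime format. PATTERNS and extract_dates are hypothetical names, only these two day-first layouts are covered, and %B assumes English month names (the default C locale).

```python
import re
from datetime import datetime

# Each entry: (regex that locates a candidate, strptime format that parses it).
PATTERNS = [
    (re.compile(r"\b\d{1,2} [A-Za-z]+ \d{4}\b"), "%d %B %Y"),  # 10 JANUARY 2000
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%d/%m/%Y"),    # 31/3/2000
]


def extract_dates(text):
    """Return all recognised dates in text, ordered by position."""
    found = []
    for pattern, fmt in PATTERNS:
        for match in pattern.finditer(text):
            try:
                parsed = datetime.strptime(match.group(), fmt)
            except ValueError:
                continue  # looked like a date but did not parse
            found.append((match.start(), parsed))
    return [parsed for _, parsed in sorted(found)]


text = "First date is 10 JANUARY 2000 and second date is 31/3/2000"
for parsed in extract_dates(text):
    print(parsed)
```

Unlike a single fuzzy parse call, which expects one date per string, this approach collects every match, and new layouts can be added as extra (regex, format) pairs.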

Convert Shell date format to Python date format

Can the shell date expression below be converted to Python?
date_used=$(date --date="$(date +%Y-%m-15) - 1 month" "+%Y_%m")
As I understand it, this takes the 15th of the current month, subtracts one month, and prints the result in year_month format.
Output of the above shell command:
echo $date_used prints 2022_05
Can this be done in Python?
Any help would be appreciated.
An equivalent would be:
from datetime import datetime, timedelta
# Take the current date, set the day to 15, subtract 30 days (which always lands in the previous month) and format
date_used = (datetime.now().replace(day=15) - timedelta(days=30)).strftime('%Y_%m')
print(date_used)
Output:
2022_05
You can use Python's datetime module to do this.
Its strptime function reads date and time data using format codes very similar to those of the Unix date command,
and strftime writes a date out in a given format.
You can check the functions and the format codes for any differences at
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
Example code
from datetime import datetime, timedelta
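# Start from the 15th before subtracting 30 days; subtracting from today's date can stay in the current month (e.g. on the 31st)
date = datetime.strptime((datetime.now().replace(day=15) - timedelta(days=30)).strftime("%Y-%m") + "-15", "%Y-%m-%d")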
print(date)
print(date.strftime("%Y_%m"))
output
2022-05-15 00:00:00
2022_05
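The same result can also be reached with plain year/month arithmetic, which avoids reasoning about how many days to subtract. This is a minimal sketch: previous_month_label is a hypothetical helper name, and the optional today parameter is there only to make the example reproducible.

```python
from datetime import date


def previous_month_label(today=None):
    """Return the month before today's month, formatted as 'YYYY_MM'."""
    today = today or date.today()
    year, month = today.year, today.month - 1
    if month == 0:
        # January rolls back to December of the previous year.
        year, month = year - 1, 12
    return "{:04d}_{:02d}".format(year, month)


print(previous_month_label(date(2022, 6, 20)))  # 2022_05
print(previous_month_label(date(2022, 1, 3)))   # 2021_12
print(previous_month_label())                   # depends on the current date
```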

Python regex for date and numbers, find the date format

How to extract dates alone from text file using regex in Python 3?
Below is my current code:
import datetime
from datetime import date
import re
s = "birthday on 20/12/2018 and wedding aniversry on 04/01/1997 and dob is on 09/07/1897"
match = re.search(r'\d{2}/\d{2}/\d{4}', s)
date = datetime.datetime.strptime(match.group(), '%Y-%m-%d').date()
print (date)
Expected Output is
20/12/2018
04/01/1997
09/07/1897
Building on DirtyBit's answer, I found that a minor change lets the pattern pick up multiple date formats: change the forward slash to a dot. Note that an unescaped dot in a regex matches any character, not only date separators.
import re
s = "birthday on 20.12.2018 and wedding anniversary on 04-01-1997 and dob is on 09/07/1897"
pattern = r'\d{2}.\d{2}.\d{4}'
print("\n".join(re.findall(pattern, s)))
Output
20.12.2018
04-01-1997
09/07/1897
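A stricter variant of this idea, sketched below, lists the allowed separators in a character class and uses a backreference so that both separators must be the same character. The extra "ref 12a34b5678" fragment in the sample string is added here purely to show what no longer matches.

```python
import re

s = ("birthday on 20.12.2018 and wedding anniversary on 04-01-1997 "
     "and dob is on 09/07/1897, ref 12a34b5678")

# ([./-]) captures the first separator; \1 requires the second one to be identical.
pattern = re.compile(r"\b\d{2}([./-])\d{2}\1\d{4}\b")

# finditer is used because findall would return only the captured separator.
dates = [m.group() for m in pattern.finditer(s)]
print("\n".join(dates))
```

This prints the same three dates as above while skipping the "12a34b5678" fragment, which the dot-based pattern would have accepted.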
Your date format is wrong: '%Y-%m-%d' should have been '%d/%m/%Y', given the date you provided: birthday on 20/12/2018 (dd/mm/yyyy)
Change this:
date = datetime.datetime.strptime(match.group(), '%Y-%m-%d').date()
With this:
date = datetime.datetime.strptime(match.group(), '%d/%m/%Y').date()
Your Fix:
import datetime
from datetime import date
import re
s = "birthday on 20/12/2018"
match = re.search(r'\d{2}/\d{2}/\d{4}', s)
date = datetime.datetime.strptime(match.group(), '%d/%m/%Y').date()
print (date)
But:
Why go to all that trouble when there are easier and more elegant ways out there?
Using dparser:
import dateutil.parser as dparser
dt_1 = "birthday on 20/12/2018"
print("Date: {}".format(dparser.parse(dt_1,fuzzy=True).date()))
OUTPUT:
Date: 2018-12-20
EDIT:
With your edited question which now has multiple dates, you could extract them using regex:
import re
s = "birthday on 20/12/2018 and wedding aniversry on 04/01/1997 and dob is on 09/07/1897"
pattern = r'\d{2}/\d{2}/\d{4}'
print("\n".join(re.findall(pattern,s)))
OUTPUT:
20/12/2018
04/01/1997
09/07/1897
OR
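Using dateutil (pass dayfirst=True so that ambiguous dates such as 04/01/1997 are read as dd/mm/yyyy, like the rest of your data):
from dateutil.parser import parse
for token in s.split():
    try:
        print(parse(token, dayfirst=True))
    except ValueError:
        pass
OUTPUT:
2018-12-20 00:00:00
1997-01-04 00:00:00
1897-07-09 00:00:00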
You are doing everything right except for the strptime line, which should be:
date = datetime.datetime.strptime(match.group(), '%d/%m/%Y').date()
The format you pass to datetime.strptime must match the format of your input.
'%Y-%m-%d' >> 2018-12-20
'%d/%m/%Y' >> 20/12/2018
Edit
If you do not need a datetime object, you can do it like this:
results = re.findall(r'\d{2}/\d{2}/\d{4}', s)
print('\n'.join(results))
Output
In [20]: results = re.findall(r'\d{2}/\d{2}/\d{4}', s)
In [21]: print('\n'.join(results))
20/12/2018
04/01/1997
09/07/1897

converting written date to date format in python

I am using Python 2.7.
I have an Adobe PDF form document with a date field, and I extract its value using pdfminer. The problem is that Adobe Acrobat Reader lets the user type strings such as april 3rd 2017, 3rd April 2017, Apr 3rd 2017, 04/04/2017 or 4 3 2017. The date field is set to mm/dd/yyyy format, so Adobe displays 04/03/2017, but clicking the field shows the text the user actually typed, and that raw text is what pdfminer returns. I think Adobe accepts the input and then does its own conversion to display the date as mm/dd/yyyy. Adobe can use JavaScript for more control, but that is not an option here: the users only have the PDF form, without any accompanying JavaScript file.
So I am looking for a way, using datetime in Python, to take a written date such as the examples above as a string and convert it to a true mm/dd/yyyy format. I have seen methods for converting long and short month names, but nothing that handles ordinal days such as 1st, 2nd, 3rd, 4th.
You could just try each possible format in turn. First remove any st, nd, rd or th suffixes that follow a digit to make the testing easier (a plain str.replace would also damage month names such as "august"):
import re
from datetime import datetime
formats = ["%B %d %Y", "%d %B %Y", "%b %d %Y", "%m/%d/%Y", "%m %d %Y"]
dates = ["april 3rd 2017", "3rd April 2017", "Apr 3rd 2017", "04/04/2017", "4 3 2017"]
for date in dates:
    # Strip the ordinal suffix only when it directly follows a digit
    date = re.sub(r"(\d)(st|nd|rd|th)", r"\1", date.lower())
    for fmt in formats:
        try:
            print(datetime.strptime(date, fmt).strftime("%m/%d/%Y"))
            break  # stop at the first format that matches
        except ValueError:
            pass
Which would display:
04/03/2017
04/03/2017
04/03/2017
04/04/2017
04/03/2017
This approach has the benefit of validating each date. For example a month greater than 12. You could flag any dates that failed all allowed formats.
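The flagging idea could look like the following sketch. It is written for Python 3 (the question uses 2.7), and normalize_date is a hypothetical helper that returns None when none of the allowed formats matches, so the caller decides how to report failures.

```python
import re
from datetime import datetime

FORMATS = ["%B %d %Y", "%d %B %Y", "%b %d %Y", "%m/%d/%Y", "%m %d %Y"]


def normalize_date(raw):
    """Return raw as 'mm/dd/yyyy', or None if no allowed format matches."""
    # Drop ordinal suffixes only when they follow a digit (3rd -> 3).
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", raw.strip().lower())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue  # try the next format
    return None


for raw in ["August 1st 2017", "3rd April 2017", "13/45/2017"]:
    result = normalize_date(raw)
    print(raw, "->", result if result else "INVALID")
```

Returning None instead of printing inside the loop keeps validation separate from reporting, so invalid entries can be collected and shown to the user in one place.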
Just write a regular expression to get the number out of the string.
import re
s = '30Apr'
n = re.match(r'[0-9]+', s).group()
print(n) # Will print 30
The other things should be easy.
Based on @MartinEvans's answer, but using the arrow library, which handles more cases than datetime, so you need neither replace() nor lower():
First install arrow:
pip install arrow
Then try each possible format:
import arrow
dates = ['april 3rd 2017', '3rd April 2017', 'Apr 3rd 2017', '04/04/2017', '4 3 2017']
formats = ['MMMM Do YYYY', 'Do MMMM YYYY', 'MMM Do YYYY', 'MM/DD/YYYY', 'M D YYYY']
def convert_datetime(date):
    for fmt in formats:
        try:
            print(arrow.get(date, fmt).format('MM/DD/YYYY'))
        except arrow.parser.ParserError:
            pass
for date in dates:
    convert_datetime(date)
Will output:
04/03/2017
04/03/2017
04/03/2017
04/04/2017
04/03/2017
If you are unsure what could be wrong with a date, you can also print an error message when none of the formats matches:
def convert_datetime(date):
    error = None
    for fmt in formats:
        try:
            print(arrow.get(date, fmt).format('MM/DD/YYYY'))
            break
        except (arrow.parser.ParserError, ValueError) as e:
            error = e  # keep the last error; Python 3 clears e when the except block ends
    else:
        # Runs only when the loop finished without break, i.e. no format matched
        print('For date: "{0}", {1}'.format(date, error))
convert_datetime('124 5 2017') # test invalid date
Will output the following error message:
'For date: "124 5 2017", month must be in 1..12'

Convert date to months and year

From the following dates, how do I get the month and year using Python?
The dates are:
"2011-07-01 09:26:11" // This should display as "This month"
"2011-06-07 09:26:11" // This should display as June
"2011-03-12 09:26:11" // This should display as March
"2010-07-25 09:26:11" // This should display as "Last year"
Please let me know how to produce these labels using Python 2.4.
Also, when I have a date column in the same format in a pandas DataFrame, how do I do this?
import time
import calendar
current = time.localtime()
dates = ["2011-07-01 09:26:11", "2011-06-07 09:26:11", "2011-03-12 09:26:11", "2010-07-25 09:26:11"]
for date in dates:
    print(date)
    date = time.strptime(date, "%Y-%m-%d %H:%M:%S")
    if date.tm_year == current.tm_year - 1:
        print("Last year")
    elif date.tm_year == current.tm_year and date.tm_mon == current.tm_mon:
        # Compare the year as well, so the same month of an older year is not "This month"
        print("This month")
    else:
        print(calendar.month_name[date.tm_mon])
Should do roughly what you ask for.
from datetime import datetime
a = datetime.strptime('2011-07-01 09:26:11', '%Y-%m-%d %H:%M:%S')
now = datetime.now()
if a.year == now.year and a.month == now.month:
    print('This month')
The other cases can be handled in the same way; see http://docs.python.org/library/datetime.html
Good site for python date/time conversions:
http://www.saltycrane.com/blog/2008/11/python-datetime-time-conversions/
You could use datetime.strptime to parse the date.
But for a simple case like yours, I would just use a regex.
>>> import re
>>> re.match(r'^(\d+)-(\d+)', "2011-07-01 09:26:11").groups()
('2011', '07')
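Putting the pieces from the answers above together, a small function can return the label for any timestamp. This sketch targets Python 3 rather than the 2.4 mentioned in the question. describe is a hypothetical name, the now parameter is injectable so the example is reproducible, and labelling only dates from exactly one year back as "Last year" is one reading of the question.

```python
import calendar
from datetime import datetime


def describe(stamp, now=None):
    """Label a 'YYYY-MM-DD HH:MM:SS' string relative to now."""
    now = now or datetime.now()
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    if when.year == now.year - 1:
        return "Last year"
    if (when.year, when.month) == (now.year, now.month):
        return "This month"
    # Anything else falls back to the month name.
    return calendar.month_name[when.month]


reference = datetime(2011, 7, 15)  # pretend "today" is mid-July 2011
stamps = ["2011-07-01 09:26:11", "2011-06-07 09:26:11",
          "2011-03-12 09:26:11", "2010-07-25 09:26:11"]
for stamp in stamps:
    print(stamp, "->", describe(stamp, reference))
```

With the fixed reference date, the four sample timestamps are labelled "This month", "June", "March" and "Last year" respectively.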
