I have a regex which detects the date of birth in the given paragraph text.
import re
dob = re.compile(r'(?:\bbirth\b|\bbirth(?:day|date)).{0,20}\n? \b((?:(?<!\:)(?<!\:\d)[0-3]?\d(?:st|nd|rd|th)?\s+(?:of\s+)?(?:jan\.?|january|feb\.?|february|mar\.?|march|apr\.?|april|may|jun\.?|june|jul\.?|july|aug\.?|august|sep\.?|september|oct\.?|october|nov\.?|november|dec\.?|december)|(?:jan\.?|january|feb\.?|february|mar\.?|march|apr\.?|april|may|jun\.?|june|jul\.?|july|aug\.?|august|sep\.?|september|oct\.?|october|nov\.?|november|dec\.?|december)\s+(?<!\:)(?<!\:\d)[0-3]?\d(?:st|nd|rd|th)?)(?:\,)?\s*(?:\d{4})?|\b[0-3]?\d[-\./][0-3]?\d[-\./]\d{2,4})\b',re.IGNORECASE | re.MULTILINE)
data = " Hi This is Goku and my birthday is on 6th Aug but to be clear it is on 1994-08-06."
l = dob.findall(data)
print(l)
o/p: ['6th Aug ']
I just want to add one more feature like if something in this format YYYY-MM-DD is present in the text, then that should also be the date of birth.
(where YYYY --> 19XX-20XX , MM --> 01-12 , DD --> 01-31)
For Ex:
data = " Hi This is Goku and my birthday is on 6th Aug but to be clear it is on 1994-08-06."
Then the output should be
output: ['6th Aug ', '1994-08-06']
Where can I add this to the regex so that it also detects the YYYY-MM-DD format?
This will detect YYYY-MM-DD (more precisely, any run of digits separated by hyphens; it does not validate the ranges):
re.search('([0-9]+-)+[0-9]+',data).group()
output:
'1994-08-06'
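The pattern above accepts any hyphen-separated digits. Below is a minimal sketch of a stricter matcher, assuming the ranges stated in the question (YYYY in 19XX-20XX, MM in 01-12, DD in 01-31). The names ISO_DATE and find_iso_dates are illustrative only; the results could be appended to whatever the existing dob pattern returns.

```python
import re

# Hypothetical helper pattern: year 1900-2099, month 01-12, day 01-31.
ISO_DATE = re.compile(
    r'\b(?:19|20)\d{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])\b'
)

def find_iso_dates(text):
    """Return every YYYY-MM-DD substring whose parts fall in the ranges above."""
    return ISO_DATE.findall(text)

data = " Hi This is Goku and my birthday is on 6th Aug but to be clear it is on 1994-08-06."
print(find_iso_dates(data))  # ['1994-08-06']
```

This only checks the ranges of the three parts, so an impossible date such as 1994-02-31 still matches; datetime.strptime could be used afterwards if calendar validity matters.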
I have tried a lot. Banning words doesn't help, removing certain characters doesn't help.
The datetime module doesn't have a directive for this. It has things like %d which will give you today's day, for example 24.
I have a date in the format of 'Tuesday 24th January' but I need it to be 'Tuesday 24 January'.
Is there a way to remove st, nd, rd and th? Or is there an even better way?
EDIT: even removing rd would remove it from Saturday, so that doesn't work either.
You can use a regex:
import re
d = 'Tuesday 24th January'
d = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', d) # \1 to restore the captured day
print(d)
# Output
Tuesday 24 January
For Saturday 21st January:
d = 'Saturday 21st January'
d = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', d)
print(d)
# Output
Saturday 21 January
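As a small variation on the answer above, here is a sketch that wraps the substitution in a function and adds a word boundary, so the suffix is removed only when it directly follows digits and ends the token. The names ORDINAL and strip_ordinal are made up for illustration.

```python
import re

# Suffix must directly follow one or more digits and end at a word boundary.
ORDINAL = re.compile(r'(\d+)(?:st|nd|rd|th)\b')

def strip_ordinal(text):
    """Remove st/nd/rd/th after a day number, leaving words like 'Saturday' alone."""
    return ORDINAL.sub(r'\1', text)

print(strip_ordinal('Tuesday 24th January'))  # Tuesday 24 January
print(strip_ordinal('Saturday 3rd January'))  # Saturday 3 January
```

The cleaned string can then be passed to datetime.strptime with the format '%A %d %B' if a datetime object is needed.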
I want to parse dates the same way every time. I wrote code that detects a date in a string, but it parses the date inconsistently. My date is the 3rd of November 2020, and it mixes up the days and months:
import datefinder
import time
sample_dates = ["this is my sample date 2020.11.03 yes yes",
"this is my sample date 2020-11-03 yes yes",
"this is my sample date 2020/11/03 yes yes",
"this is my sample date 03.11.2020 yes yes",
"this is my sample date 03.11.2020 yes yes",
"this is my sample date 03/11/2020 yes yes"]
for sample in sample_dates:
    matches = datefinder.find_dates(sample)
    matches = list(matches)
    print(matches)
    print(matches[0].strftime("%Y-%m-%d"))
    print()
How can I fix this?
I want to parse the date the same way every time, regardless of its format in the string.
Can you show me how I can do that? (I do not care which library I use.)
As I previously stated in your other question about extracting dates, there is no universal date extraction that can handle every date/time format. Date cleaning in your case will require a multi-pronged approach, which I have outlined in the examples below:
Updated 12-06-2020
I'm going to assume that some of the dates within sample_dates are from the German language websites that you're scraping. So this code below can parse those dates.
Please review the parameters for dateutil.parser.parse.
import datefinder
import re as regex
from datetime import datetime
import dateutil.parser as dparser
sample_dates = ["this is my sample date 2020.11.03 yes yes",
"this is my sample date 2020-11-03 yes yes",
"this is my sample date 2020/11/03 yes yes",
"this is my sample date 03.11.2020 yes yes",
"this is my sample date 03.11.2020 yes yes",
"this is my sample date 03/11/2020 yes yes"]
for sample in sample_dates:
    check_year_first = regex.search(r'\d{4}\W\d{1,2}\W\d{1,2}', sample)
    if check_year_first:
        date_strings = datefinder.find_dates(sample)
        for date_string in date_strings:
            reformatted_date = datetime.strptime(str(date_string), '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d')
            print(reformatted_date)
    else:
        date_strings = datefinder.find_dates(sample, source=True)
        for date_string in date_strings:
            reformatted_date = dparser.parse(str(date_string[1]).replace('date', ''), dayfirst=True).strftime('%Y-%m-%d')
            print(reformatted_date)
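For the six numeric samples alone, the same year-first versus day-first decision can be sketched with only the standard library. This is an illustration under the assumption that every date is either YYYY?MM?DD or DD?MM?YYYY with '.', '-' or '/' as separator; the names YEAR_FIRST, DAY_FIRST and normalize_date are hypothetical and not part of datefinder or dateutil.

```python
import re
from datetime import datetime

# Assumed formats: four-digit year first, or day first with the year last.
YEAR_FIRST = re.compile(r'\b(\d{4})[./-](\d{1,2})[./-](\d{1,2})\b')
DAY_FIRST = re.compile(r'\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b')

def normalize_date(text):
    """Return the first date in text as YYYY-MM-DD, or None if none is found."""
    match = YEAR_FIRST.search(text)
    if match:
        year, month, day = map(int, match.groups())
    else:
        match = DAY_FIRST.search(text)
        if not match:
            return None
        day, month, year = map(int, match.groups())
    # datetime() raises ValueError for impossible dates such as month 13.
    return datetime(year, month, day).strftime('%Y-%m-%d')

sample_dates = ["this is my sample date 2020.11.03 yes yes",
                "this is my sample date 2020-11-03 yes yes",
                "this is my sample date 2020/11/03 yes yes",
                "this is my sample date 03.11.2020 yes yes",
                "this is my sample date 03/11/2020 yes yes"]
for sample in sample_dates:
    print(normalize_date(sample))  # 2020-11-03 each time
```

Month names, two-digit years and month-first dates are outside this sketch; that is where the libraries above earn their keep.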
I'm also going to assume that the dates in this question are being produced from the output of your question Extract date from multiple webpages with Python. If that is correct then add this code to the other code that I provided for that question. Like I stated before my other code can be modified to fit your data extraction and cleaning requirements.
import re as regex
import dateparser
from datetime import datetime

def reformatted_dates(date_string):
    # date format 3. Nov. 2020
    date_format_01 = regex.search(r'\d{1,2}\W\s\w+\W\s\d{4}', date_string)
    # date format 3. Dezember 2020
    date_format_02 = regex.search(r'\d{1,2}\W\s\w+\s\d{4}', date_string)
    if date_format_01:
        reformatted_date = dateparser.parse(date_string).strftime('%m.%d.%Y')
        return reformatted_date
    elif date_format_02:
        reformatted_date = dateparser.parse(date_string).strftime('%m.%d.%Y')
        return reformatted_date
    else:
        # date format 18.11.2020
        reformatted_date = datetime.strptime(date_string, '%d.%m.%Y').strftime('%m-%d-%Y')
        return reformatted_date
I have a date string like Thursday, December 13, 2018 i.e., DAY, MONTH dd, yyyy and I need to validate it with a regular expression.
The regex should not validate incorrect day or month. For example, Muesday, December 13, 2018 and Thursday, December 32, 2018 should be marked invalid.
What I could do so far is write expressions for the ", ", "dd", and "yyyy". I don't understand how will I customize the regex in such a way that it would accept only correct day's and month's name.
My attempt:
^([something would come over here for day name]day)([\,]|[\, ])(something would come over here for month name)(0?[1-9]|[12][0-9]|3[01])([\,]|[\, ])([12][0-9]\d\d)$
Thanks.
EDIT: I have only included years starting from year 1000 - year 2999. Validating leap years does not matter.
You can try a library that implements the regexes for a "complex" case like yours. It is called datefinder.
Its author has already done the work of finding many kinds of dates in text:
https://github.com/akoumjian/datefinder
To install: pip install datefinder
import datefinder
string_with_dates = """entries are due by January 4th, 2017 at 8:00pm
created 01/15/2005 by ACME Inc. and associates."""
matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)
# Output
2017-01-04 20:00:00
2005-01-15 00:00:00
To detect misspelled words like "Muesday", you could filter your text with a spellchecker such as PyEnchant:
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> print(d.check("Monday"))
True
>>> print(d.check("Muesday"))
False
>>> print(d.suggest("Muesday"))
['Tuesday', 'Domesday', 'Muesli', 'Wednesday', 'Mesdames']
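If installing a spellchecker is not an option, a similar check restricted to weekday names can be sketched with the standard library's difflib. This is a swapped-in technique, not PyEnchant's API; the names DAY_NAMES and check_day_name are hypothetical, and the 0.6 cutoff is simply difflib's default.

```python
import difflib

# English weekday names, hard-coded so the check does not depend on the locale.
DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

def check_day_name(word):
    """Return (is_valid, suggestions) for a candidate weekday name."""
    if word in DAY_NAMES:
        return True, []
    # get_close_matches ranks candidates by similarity ratio, best first.
    return False, difflib.get_close_matches(word, DAY_NAMES, n=3, cutoff=0.6)

print(check_day_name('Monday'))
print(check_day_name('Muesday'))
```

Unlike a general dictionary, this only knows the seven names it is given, so it cannot be misled by unrelated words such as "Muesli".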
Regex is not the way to go to solve your problem!
But here is some example code showing how the "something would come over here for day name" section of your pattern could be written. I also added an example of how to use strptime(), which is a much better solution in your case:
import re
from datetime import datetime

s = """
Thursday, December 13, 2018
Muesday, December 13, 2018
Monday, January 13, 2018
Thursday, December 32, 2018
"""

# In verbose mode unescaped whitespace is ignored, so a literal space is written as [ ]
pat = r"""
^
(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)
,[ ]
(January|February|March|April|May|June|July|August|September|October|November|December)
[ ]
(0?[1-9]|[12][0-9]|3[01])
,[ ]
([12][0-9]\d\d)
$
"""

for match in re.finditer(pat, s, re.VERBOSE | re.MULTILINE):
    print(match.group(0))

for row in s.strip().split('\n'):
    try:
        match = datetime.strptime(row, '%A, %B %d, %Y')
        print(match)
    except ValueError:
        print("'%s' is not valid" % row)
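The strptime() approach can be wrapped in a small validator. The sketch below assumes an English locale; the function name is_valid_date and the optional check_weekday flag are illustrative additions. The flag exists because, as far as I know, strptime() accepts any valid weekday name without checking that it matches the calendar date.

```python
from datetime import datetime

def is_valid_date(text, check_weekday=False):
    """Validate strings such as 'Thursday, December 13, 2018'."""
    try:
        parsed = datetime.strptime(text, '%A, %B %d, %Y')
    except ValueError:
        # Unknown day or month name, or a day number out of range.
        return False
    if check_weekday:
        # Optionally require the weekday name to agree with the actual date.
        given = text.split(',')[0].strip().lower()
        return parsed.strftime('%A').lower() == given
    return True

print(is_valid_date('Thursday, December 13, 2018'))  # True
print(is_valid_date('Muesday, December 13, 2018'))   # False
print(is_valid_date('Thursday, December 32, 2018'))  # False
```

With check_weekday=True, 'Monday, January 13, 2018' is rejected as well, because 13 January 2018 fell on a Saturday.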
I need to find the phone bill due date from an SMS using Python 3.4. I have tried dateutil.parser and datefinder, but neither works for my use case.
Example: sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc#xyz.com. Pls check Inbox"
Code 1:
import datefinder
due_dates = datefinder.find_dates(sms_text)
for match in due_dates:
print(match)
Result: 2017-07-17 00:00:00
Code 2:
import dateutil.parser as dparser
due_date = dparser.parse(sms_text,fuzzy=True)
print(due_date)
Result: ValueError probably because of multiple dates in the text
How can I pick the due date from such texts? The date format is not fixed, but there would be 2 dates in the text: one is the month for which the bill is generated and the other is the due date, in that order. Even a regular expression to parse the text would be great.
More sample texts:
Hello! Your phone billed outstanding is 293.72 due date is 03rd Jul.
Bill dated 06-JUN-17 for Rs 219 is due today for your phone No. 1234567890
Bill dated 06-JUN-17 for Rs 219 is due on Jul 5 for your phone No. 1234567890
Bill dated 27-Jun-17 for your operator fixedline/broadband ID 1234567890 has been sent at abc#xyz.com from xyz#abc.com. Due amount: Rs 3,764.53, due date: 16-Jul-17.
Details of bill dated 21-JUN-2017 for phone no. 1234567890: Total Due: Rs 374.12, Due Date: 09-JUL-2017, Bill Delivery Date: 25-Jun-2017,
Greetings! Bill for your mobile 1234567890, dtd 18-Jun-17, payment due date 06-Jul-17 has been sent on abc#xyz.com
Dear customer, your phone bill of Rs.191.24 was due on 25-Jun-2017
Hi! Your phone bill for Rs. 560.41 is due on 03-07-2017.
An idea for using dateutil.parser:
from dateutil.parser import parse
for s in sms_text.split():
    try:
        print(parse(s))
    except ValueError:
        pass
Two things prevent datefinder from parsing your samples correctly:
the bill amount: numbers are interpreted as years, so if they have 3 or 4 digits, a date is created from them
characters defined as delimiters by datefinder might prevent it from finding a suitable date format (in this case ':')
The idea is to first sanitize the text by removing the parts that prevent datefinder from identifying all the dates. Unfortunately, this involves some trial and error, as the regex used by this package is too big for me to analyze thoroughly.
import re
import datefinder

def extract_duedate(text):
    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)
    return list(datefinder.find_dates(text))[-1]
Rs[\d,\. ]+ will remove the bill amount so it is not mistaken as part of a date. It will match strings of the form 'Rs[.][ ][12,]345[.67]' (actually more variations but this is just to illustrate).
Obviously, this is a raw example function.
Here are the results I get:
1 : 2017-07-03 00:00:00
2 : 2017-06-06 00:00:00 # Wrong result: first date instead of today
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00
There is one problem with sample 2: 'today' on its own is not recognized by datefinder.
Example:
>>> list(datefinder.find_dates('Rs 219 is due today'))
[datetime.datetime(219, 7, 13, 0, 0)]
>>> list(datefinder.find_dates('is due today'))
[]
So, to handle this case, we could simply replace the token 'today' by the current date as a first step. This would give the following function:
import datetime  # needed in addition to re and datefinder

def extract_duedate(text):
    if 'today' in text:
        text = text.replace('today', datetime.date.today().isoformat())
    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)
    return list(datefinder.find_dates(text))[-1]
Now the results are good for all samples:
1 : 2017-07-03 00:00:00
2 : 2017-07-18 00:00:00 # Well, this is the date of my test
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00
If you need, you can let the function return all dates and they should all be correct.
Why not just use a regex? If your input strings always contain the substrings due on ... has been, you can do something like this:
import re
from datetime import datetime

string = """Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been
sent to your regd email ID abc#xyz.com. Pls check Inbox"""

match_obj = re.search(r'due on (.*) has been', string)
if match_obj:
    date_str = match_obj.group(1)
    try:
        # DD-MM-YYYY
        print(datetime.strptime(date_str, "%d-%m-%Y"))
    except ValueError:
        # try another format
        try:
            print(datetime.strptime(date_str, "%Y-%m-%d"))
        except ValueError:
            try:
                print(datetime.strptime(date_str, "%m-%d"))
            except ValueError:
                ...
else:
    print("No match!!")
Given a text message like the example you provided:
sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc#xyz.com. Pls check Inbox"
You could use Python's built-in re module to split the string on the 'due on' and 'has been' parts.
import re
sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc#xyz.com. Pls check Inbox"
due_date = re.split('due on', re.split('has been', sms_text)[0])[1]
print(due_date)
Result: 15-07-2017
With this example the date format does not matter, but it is important that the words you are splitting the string on are consistent.
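Several of the other sample texts do not contain "due on ... has been", so here is a hedged sketch that instead takes the first hyphenated date token after the word "due". It assumes the due date is written as DD-MM-YYYY, DD-Mon-YYYY or DD-Mon-YY and that English month abbreviations are in use; DATE_TOKEN, FORMATS and extract_due_date are illustrative names only.

```python
import re
from datetime import datetime

# A day, then a numeric month or three-letter month, then a 4- or 2-digit year.
DATE_TOKEN = re.compile(r'\b\d{1,2}-(?:\d{1,2}|[A-Za-z]{3})-(?:\d{4}|\d{2})\b')
FORMATS = ('%d-%m-%Y', '%d-%b-%Y', '%d-%b-%y', '%d-%m-%y')

def extract_due_date(text):
    """Return the first date token found after the word 'due', or None."""
    due = re.search(r'\bdue\b', text, flags=re.IGNORECASE)
    if not due:
        return None
    token = DATE_TOKEN.search(text, due.end())
    if not token:
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(token.group(0), fmt).date()
        except ValueError:
            continue
    return None

sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc#xyz.com. Pls check Inbox"
print(extract_due_date(sms_text))  # 2017-07-15
```

Texts such as "due date is 03rd Jul", "due on Jul 5" or "due today" are not covered by this token pattern and would need extra handling.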
I have a string:
fmt_string2 = "I want to apply for leaves from 12/12/2017 to 12/18/2017"
I want to extract the dates from it. But my code needs to be robust, as the dates can be in any format, such as 12 January 2017 or 12 Jan 17, and their position can also change.
For the string above I have tried:
''.join(fmt_string2.split()[-1].split('.')[::-10])
But here I am relying on the position of the date, which I don't want.
Can anyone help me write robust code for extracting dates?
If 12/12/2017, 12 January 2017, and 12 Jan 17 are the only possible patterns then the following code that uses regex should be enough.
import re
string = 'I want to apply for leaves from 12/12/2017 to 12/18/2017 I want to apply for leaves from 12 January 2017 to ' \
'12/18/2017 I want to apply for leaves from 12/12/2017 to 12 Jan 17 '
matches = re.findall(r'(\d{2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})', string)
for match in matches:
print(match[0])
Output:
12/12/2017
12/18/2017
12 January 2017
12/18/2017
12/12/2017
12 Jan 17
To understand the regex, play with it on regex101.
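The month alternation does not have to be typed by hand. As a sketch, it can be built from the standard calendar module, assuming the default (English) locale; MONTHS, DATE_PATTERN and find_leave_dates are hypothetical names. A non-capturing group is used so that findall returns whole dates rather than tuples.

```python
import calendar
import re

# Full and abbreviated month names, e.g. January|Jan|February|Feb|...
MONTHS = '|'.join(
    name
    for index in range(1, 13)
    for name in (calendar.month_name[index], calendar.month_abbr[index])
)

DATE_PATTERN = re.compile(
    r'\b\d{1,2}[/ ](?:\d{1,2}|' + MONTHS + r')[/ ]\d{2,4}\b'
)

def find_leave_dates(text):
    """Return dates written like 12/12/2017, 12 January 2017 or 12 Jan 17."""
    return DATE_PATTERN.findall(text)

print(find_leave_dates('I want to apply for leaves from 12/12/2017 to 12/18/2017'))
# ['12/12/2017', '12/18/2017']
```

As with the pattern above, nothing here validates that the numbers form a real date.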
Using Regular Expressions
Rather than relying entirely on regex, I suggest the following approach:
import re
from dateutil.parser import parse
Sample text:
text = """
I want to apply for leaves from 12/12/2017 to 12/18/2017
then later from 12 January 2018 to 18 January 2018
then lastly from 12 Feb 2018 to 18 Feb 2018
"""
A regular expression finds anything of the form "from A to B". The advantage is that I don't have to handle each and every case by extending my regex; the approach is dynamic.
pattern = re.compile(r'from (.*) to (.*)')
matches = re.findall(pattern, text)
The matches from the above regex for the text are:
[('12/12/2017', '12/18/2017'), ('12 January 2018', '18 January 2018'), ('12 Feb 2018', '18 Feb 2018')]
For each match I parse the dates. An exception is thrown for a value that isn't a date, so in the except block we skip it.
for val in matches:
    try:
        dt_from = parse(val[0])
        dt_to = parse(val[1])
        print("Leave applied from", dt_from.strftime('%d/%b/%Y'), "to", dt_to.strftime('%d/%b/%Y'))
    except ValueError:
        print("skipping", val)
Output:
Leave applied from 12/Dec/2017 to 18/Dec/2017
Leave applied from 12/Jan/2018 to 18/Jan/2018
Leave applied from 12/Feb/2018 to 18/Feb/2018
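If dateutil is not available, the same "from A to B" idea can be sketched with datetime.strptime and an explicit list of formats. This assumes month-first numeric dates (as in 12/18/2017) and English month names; FORMATS, parse_with_formats and leave_ranges are illustrative names, not part of any library.

```python
import re
from datetime import datetime

# Assumed formats: 12/18/2017, 12 January 2018, 12 Feb 2018.
FORMATS = ('%m/%d/%Y', '%d %B %Y', '%d %b %Y')

def parse_with_formats(value, formats=FORMATS):
    """Try each format in turn; return a date, or None if none fits."""
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            pass
    return None

def leave_ranges(text):
    """Return (start, end) date pairs for every 'from A to B' line."""
    ranges = []
    for start, end in re.findall(r'from (.*) to (.*)', text):
        start_date, end_date = parse_with_formats(start), parse_with_formats(end)
        if start_date and end_date:
            ranges.append((start_date, end_date))
    return ranges

text = """
I want to apply for leaves from 12/12/2017 to 12/18/2017
then later from 12 January 2018 to 18 January 2018
then lastly from 12 Feb 2018 to 18 Feb 2018
"""
for start, end in leave_ranges(text):
    print("Leave applied from", start, "to", end)
```

The trade-off is that every accepted format must be listed explicitly, whereas dateutil guesses the format on its own.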
Using pyparsing
Regular expressions have the limitation that they can become very complex when made dynamic enough to handle less straightforward input, e.g.:
text = """
I want to apply for leaves from start 12/12/2017 to end date 12/18/2017 some random text
then later from 12 January 2018 to 18 January 2018 some random text
then lastly from 12 Feb 2018 to 18 Feb 2018 some random text
"""
So, Python's pyparsing module is the best fit here.
import pyparsing as pp
The approach here is to create a dictionary that can parse the entire text.
Create keywords for month names that can be used as pyparsing keywords:
import calendar

months_list = []
for month_idx in range(1, 13):
    months_list.append(calendar.month_name[month_idx])
    months_list.append(calendar.month_abbr[month_idx])
# join the list to use it as pyparsing keyword
month_keywords = " ".join(months_list)
Dictionary for parsing:
# date separator - can be one of '/', '.', or ' '
separator = pp.Word("/. ")
# Dictionary for numeric date e.g. 12/12/2018
numeric_date = pp.Combine(pp.Word(pp.nums, max=2) + separator + pp.Word(pp.nums, max=2) + separator + pp.Word(pp.nums, max=4))
# Dictionary for text date e.g. 12/Jan/2018
text_date = pp.Combine(pp.Word(pp.nums, max=2) + separator + pp.oneOf(month_keywords) + separator + pp.Word(pp.nums, max=4))
# Either numeric or text date
date_pattern = numeric_date | text_date
# Final dictionary - from x to y
pattern = pp.Suppress(pp.SkipTo("from") + pp.Word("from") + pp.Optional("start") + pp.Optional("date")) + date_pattern
pattern += pp.Suppress(pp.Word("to") + pp.Optional("end") + pp.Optional("date")) + date_pattern
# Group the pattern, also it can be multiple
pattern = pp.OneOrMore(pp.Group(pattern))
Parse the input text:
result = pattern.parseString(text)
# Print result
for match in result:
    print("from", match[0], "to", match[1])
Output:
from 12/12/2017 to 12/18/2017
from 12 January 2018 to 18 January 2018
from 12 Feb 2018 to 18 Feb 2018