I need to extract the phone bill due date from SMS text using Python 3.4. I have tried dateutil.parser and datefinder, but neither works for my use case.
Example: sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc#xyz.com. Pls check Inbox"
Code 1:
import datefinder
due_dates = datefinder.find_dates(sms_text)
for match in due_dates:
    print(match)
Result: 2017-07-17 00:00:00
Code 2:
import dateutil.parser as dparser
due_date = dparser.parse(sms_text,fuzzy=True)
print(due_date)
Result: ValueError probably because of multiple dates in the text
How can I pick the due date from such texts? The date format is not fixed, but there will be two dates in the text: the first is the month for which the bill was generated and the second is the due date, in that order. Even a regular expression to parse the text would be great.
More sample texts:
Hello! Your phone billed outstanding is 293.72 due date is 03rd Jul.
Bill dated 06-JUN-17 for Rs 219 is due today for your phone No. 1234567890
Bill dated 06-JUN-17 for Rs 219 is due on Jul 5 for your phone No. 1234567890
Bill dated 27-Jun-17 for your operator fixedline/broadband ID 1234567890 has been sent at abc#xyz.com from xyz#abc.com. Due amount: Rs 3,764.53, due date: 16-Jul-17.
Details of bill dated 21-JUN-2017 for phone no. 1234567890: Total Due: Rs 374.12, Due Date: 09-JUL-2017, Bill Delivery Date: 25-Jun-2017,
Greetings! Bill for your mobile 1234567890, dtd 18-Jun-17, payment due date 06-Jul-17 has been sent on abc#xyz.com
Dear customer, your phone bill of Rs.191.24 was due on 25-Jun-2017
Hi! Your phone bill for Rs. 560.41 is due on 03-07-2017.
An idea for using dateutil.parser:
from dateutil.parser import parse
for s in sms_text.split():
    try:
        print(parse(s))
    except ValueError:
        pass
Two things prevent datefinder from parsing your samples correctly:
the bill amount: numbers are interpreted as years, so a 3- or 4-digit amount produces a date
characters that datefinder treats as delimiters can prevent it from finding a suitable date format (in this case ':')
The idea is to first sanitize the text by removing the parts that prevent datefinder from identifying all the dates. Unfortunately, this involves some trial and error, as the regex used by this package is too big for me to analyze thoroughly.
import re
import datefinder

def extract_duedate(text):
    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non-delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)
    # The due date is the last date found in the text
    return list(datefinder.find_dates(text))[-1]
Rs[\d,\. ]+ will remove the bill amount so it is not mistaken as part of a date. It will match strings of the form 'Rs[.][ ][12,]345[.67]' (actually more variations but this is just to illustrate).
Obviously, this is a raw example function.
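To see what the substitution does before datefinder gets the text, here is a small sketch that uses only the re module on sample 8 from the question. The helper name sanitize is made up for illustration; the pattern is the one described above.

```python
import re

def sanitize(text):
    # Hypothetical helper: replaces ':' and 'Rs<amount>' with '|' so the
    # bill amount cannot be picked up as part of a date.
    return re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)

sample = "Hi! Your phone bill for Rs. 560.41 is due on 03-07-2017."
cleaned = sanitize(sample)
print(cleaned)  # Hi! Your phone bill for |is due on 03-07-2017.
```

One thing to watch: because of re.IGNORECASE and the space in the character class, the pattern also matches a lowercase 'rs' at the end of an ordinary word, so "offers end" becomes "offe|end". That is harmless for the samples here, but it may matter for other texts.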
Here are the results I get:
1 : 2017-07-03 00:00:00
2 : 2017-06-06 00:00:00 # Wrong result: first date instead of today
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00
There is one problem with sample 2: datefinder does not recognize 'today' on its own.
Example:
>>> list(datefinder.find_dates('Rs 219 is due today'))
[datetime.datetime(219, 7, 13, 0, 0)]
>>> list(datefinder.find_dates('is due today'))
[]
So, to handle this case, we could simply replace the token 'today' by the current date as a first step. This would give the following function:
import datetime
import re
import datefinder

def extract_duedate(text):
    # datefinder does not understand 'today', so substitute the current date
    if 'today' in text:
        text = text.replace('today', datetime.date.today().isoformat())
    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non-delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)
    return list(datefinder.find_dates(text))[-1]
Now the results are good for all samples:
1 : 2017-07-03 00:00:00
2 : 2017-07-18 00:00:00 # Well, this is the date of my test
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00
If you need, you can let the function return all dates and they should all be correct.
Why not just use a regex? If your input strings always contain the substrings due on ... has been, you can do something like this:
import re
from datetime import datetime

string = """Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been
sent to your regd email ID abc#xyz.com. Pls check Inbox"""

match_obj = re.search(r'due on (.*) has been', string)
if match_obj:
    date_str = match_obj.group(1)
    # Try each known format in turn: DD-MM-YYYY, YYYY-MM-DD, MM-DD
    # (add further formats here as needed)
    for fmt in ("%d-%m-%Y", "%Y-%m-%d", "%m-%d"):
        try:
            print(datetime.strptime(date_str, fmt))
            break
        except ValueError:
            continue
    else:
        print("No known date format matched")
else:
    print("No match!!")
Having a text message as the example you have provided:
sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc#xyz.com. Pls check Inbox"
You could use Python's built-in re module to split on the 'due on' and 'has been' parts of the string.
import re
sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc#xyz.com. Pls check Inbox"
# Keep the text between 'due on' and 'has been', minus surrounding spaces
due_date = re.split('due on', re.split('has been', sms_text)[0])[1].strip()
print(due_date)
Result: 15-07-2017
With this example the date format does not matter, but it is important that the words you are splitting the string on are consistent.
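Since the question also asks for a regular expression, here is a minimal sketch that anchors on the keyword instead of on both surrounding phrases. The keyword list ("due on", "due date", "due date is", "due date:") and the date shapes (dd-mm-yyyy, dd-Mon-yy, dd-Mon-yyyy) are assumptions taken from the sample texts only, and the names DUE_DATE_RE and find_due_date are made up for illustration. It also assumes an English locale for the month abbreviations.

```python
import re
from datetime import datetime

# Assumed keywords and date shapes, taken from the sample texts only.
DUE_DATE_RE = re.compile(
    r"due\s+(?:on|date)[\s:]*(?:is\s+)?"
    r"(\d{1,2}-(?:\d{1,2}|[A-Za-z]{3})-\d{2,4})",
    re.IGNORECASE,
)
FORMATS = ("%d-%m-%Y", "%d-%b-%Y", "%d-%b-%y")

def find_due_date(text):
    """Return the due date as a datetime.date, or None if nothing matched."""
    match = DUE_DATE_RE.search(text)
    if match is None:
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(match.group(1), fmt).date()
        except ValueError:
            continue
    return None

print(find_due_date("Rs.72.23 due on 15-07-2017 has been sent"))    # 2017-07-15
print(find_due_date("Due amount: Rs 3,764.53, due date: 16-Jul-17."))  # 2017-07-16
```

Texts such as "due date is 03rd Jul." or "is due today" do not fit these shapes, so the function returns None for them; they would need extra patterns or a fallback to a library such as datefinder.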
Related
I am trying to extract dates from text using dateutil library (Python 3.7)
I want to extract all dates from text using code below.
import datetime
import dateutil.parser as dparser
text = 'First date is 10 JANUARY 2000 and second date is 31/3/2000'
dt = dparser.parse(text, fuzzy=True, dayfirst=True, default=datetime.datetime(1800, 1, 1))
But I am getting the following exception:
Unknown string format: First date is 10 JANUARY 2000 and second date
is 31/1/2000
Please let me know any way to extract multiple dates in the text.
How about using datefinder?
import datefinder
string_with_dates = '''
First date is 10 JANUARY 2000 and second date is 31/3/2000
'''
matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)
Output:
2000-01-10 00:00:00
2000-03-31 00:00:00
I hope someone can help me with the following:
I'm trying to convert my data to daily averages using:
df['timestamp'] = pd.to_datetime(df['Datum WSM-09'])
df_daily_avg = df.groupby(pd.Grouper(freq='D', key='timestamp')).mean()
df['Datum WSM-09'] looks like this:
0 6-3-2020 12:30
1 6-3-2020 12:40
2 6-3-2020 12:50
3 6-3-2020 13:00
4 6-3-2020 13:10
...
106785 18-3-2022 02:00
106786 18-3-2022 02:10
106787 18-3-2022 02:20
106788 18-3-2022 02:30
106789 18-3-2022 02:40
Name: Datum WSM-09, Length: 106790, dtype: object
However, when executing the first line the data under "timestamp" is inconsistent. The last rows displayed in the picture are correct. For the first ones, it should be 2020-03-06 12:30. The month and the day are switched around.
Many thanks
Try using the "dayfirst" option:
df['timestamp'] = pd.to_datetime(df['Datum WSM-09'], dayfirst=True)
In https://xkcd.com/1179 Randall Munroe explains
that "you're doing it Wrong."
Your source column is apparently object / text.
The March 18th timestamps are unambiguous,
as there's fewer than 18 months in the year.
The ambiguous March 6th timestamps make the hair
on the back of the black cat stand on end.
You neglected to specify a timestamp format,
given that the source column is ambiguously formatted.
Please RTFM:
format : str, default None
The strftime to parse time, e.g. "%d/%m/%Y". Note that "%f" will parse all the way up to nanoseconds. See strftime documentation for more information on choices.
You tried offering a value of None,
which is not a good match to your business needs.
I don't know what all of your input data looks like,
but perhaps %d-%m-%Y %H:%M would better
match your needs.
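The ambiguity can be shown with the standard library alone, since the strftime-style directives quoted above are the same ones datetime.strptime uses. This is only an illustration of the reasoning with two values from the question, not a replacement for passing format= or dayfirst=True to pandas.

```python
from datetime import datetime

DAY_FIRST = "%d-%m-%Y %H:%M"    # the format suggested above
MONTH_FIRST = "%m-%d-%Y %H:%M"  # the misreading seen in the question

ambiguous = "6-3-2020 12:30"
print(datetime.strptime(ambiguous, DAY_FIRST))    # 2020-03-06 12:30:00
print(datetime.strptime(ambiguous, MONTH_FIRST))  # 2020-06-03 12:30:00

unambiguous = "18-3-2022 02:00"
print(datetime.strptime(unambiguous, DAY_FIRST))  # 2022-03-18 02:00:00
try:
    datetime.strptime(unambiguous, MONTH_FIRST)
except ValueError as err:
    # There is no month 18, so only the day-first reading is possible
    print("month-first fails:", err)
```

Both readings of "6-3-2020" are valid dates, which is why a parser that guesses per value can swap day and month for the early rows while still getting the later ones right.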
I want to parse the date the same way every time. I wrote code that detects a date in a string, but it parses the date inconsistently. My date is the 3rd of November 2020, and it mixes up the days and months:
import datefinder
import time
sample_dates = ["this is my sample date 2020.11.03 yes yes",
                "this is my sample date 2020-11-03 yes yes",
                "this is my sample date 2020/11/03 yes yes",
                "this is my sample date 03.11.2020 yes yes",
                "this is my sample date 03.11.2020 yes yes",
                "this is my sample date 03/11/2020 yes yes"]
for sample in sample_dates:
    matches = datefinder.find_dates(sample)
    matches = list(matches)
    print(matches)
    print(matches[0].strftime("%Y-%m-%d"))
    print()
How can I fix this?
I want to parse the date the same way every time, regardless of its format in the string.
Can you show me how to do that? (I do not care which library I have to use.)
As I previously stated in your other question about extracting dates, there is no universal date extraction that can handle every date/time format. Date cleaning in your case will require a multi-pronged approach, which I have outlined in the examples below:
Updated 12-06-2020
I'm going to assume that some of the dates within sample_dates are from the German language websites that you're scraping. So this code below can parse those dates.
Please review the parameters for dateutil.parser.parse.
import datefinder
import re as regex
from datetime import datetime
import dateutil.parser as dparser
sample_dates = ["this is my sample date 2020.11.03 yes yes",
                "this is my sample date 2020-11-03 yes yes",
                "this is my sample date 2020/11/03 yes yes",
                "this is my sample date 03.11.2020 yes yes",
                "this is my sample date 03.11.2020 yes yes",
                "this is my sample date 03/11/2020 yes yes"]
for sample in sample_dates:
    check_year_first = regex.search(r'\d{4}\W\d{1,2}\W\d{1,2}', sample)
    if check_year_first:
        date_strings = datefinder.find_dates(sample)
        for date_string in date_strings:
            reformatted_date = datetime.strptime(str(date_string), '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d')
            print(reformatted_date)
    else:
        date_strings = datefinder.find_dates(sample, source=True)
        for date_string in date_strings:
            reformatted_date = dparser.parse(str(date_string[1]).replace('date', ''), dayfirst=True).strftime('%Y-%m-%d')
            print(reformatted_date)
I'm also going to assume that the dates in this question are being produced from the output of your question Extract date from multiple webpages with Python. If that is correct then add this code to the other code that I provided for that question. Like I stated before my other code can be modified to fit your data extraction and cleaning requirements.
import re as regex
import dateparser
from datetime import datetime
def reformatted_dates(date_string):
    # date format 3. Nov. 2020
    date_format_01 = regex.search(r'\d{1,2}\W\s\w+\W\s\d{4}', date_string)
    # date format 3. Dezember 2020
    date_format_02 = regex.search(r'\d{1,2}\W\s\w+\s\d{4}', date_string)
    if date_format_01:
        reformatted_date = dateparser.parse(date_string).strftime('%m.%d.%Y')
        return reformatted_date
    elif date_format_02:
        reformatted_date = dateparser.parse(date_string).strftime('%m.%d.%Y')
        return reformatted_date
    else:
        # date format 18.11.2020
        reformatted_date = datetime.strptime(date_string, '%d.%m.%Y').strftime('%m-%d-%Y')
        return reformatted_date
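For the purely numeric samples in the question, the year-first check can also be done with the standard library alone. The sketch below assumes that a date is three digit groups joined by '.', '-' or '/', that a leading 4-digit group is the year, and that everything else is day-first; the names YEAR_FIRST, DAY_FIRST and to_iso are made up for illustration.

```python
import re
from datetime import datetime

# Assumption: separators are '.', '-' or '/', and the year has 4 digits.
YEAR_FIRST = re.compile(r"\b(\d{4})[./-](\d{1,2})[./-](\d{1,2})\b")
DAY_FIRST = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")

def to_iso(text):
    """Return the first date in text as 'YYYY-MM-DD', or None if none is found."""
    match = YEAR_FIRST.search(text)
    if match:
        year, month, day = map(int, match.groups())
    else:
        match = DAY_FIRST.search(text)
        if match is None:
            return None
        day, month, year = map(int, match.groups())
    # datetime() raises ValueError for impossible dates such as month 13
    return datetime(year, month, day).strftime("%Y-%m-%d")

for sample in ["this is my sample date 2020.11.03 yes yes",
               "this is my sample date 03/11/2020 yes yes"]:
    print(to_iso(sample))  # 2020-11-03 both times
```

This only works because the day-first assumption is stated up front; a month-first source such as US-style 11/03/2020 would need a different rule for that input.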
I want to check if the format of the date input by user matches the below:
Jan 5 2018 6:10 PM
Month: First letter should be uppercase, followed by 2 more in lowercase. (total 3 letters)
<Space>: single space, must exist
Date: For single digit it should not be 05, but 5
<Space>: single space, must exist
Hour: 0-12, for single digit it should not be 06, but 6
Minute: 00-59
AM/PM
I'm using the below regex and trying to match:
import re,sys
usr_date = str(input("Please enter the older date until which you want to scan ? \n[Date Format Example: Jan 5 2018 6:10 PM] : "))
valid_usr_date = re.search("^(\s+)*[A-Z]{1}[a-z]{2}\s{1}[1-31]{1}\s{1}[1-2]{1}[0-9]{1}[0-9]{1}[0-9]{1}\s{1}[0-12]{1}:[0-5]{1}[0-9]{1}\s{1}(A|P)M$",usr_date,re.M)
if not valid_usr_date:
    print ("The date format is incorrect. Please follow the exact date format as shown in example. Exiting Program!")
    sys.exit()
But even for input in the correct format, it reports the format as incorrect. What am I doing wrong?
I would not use regex for that, as you have no way to actually validate the date itself (eg, a regex will happily accept Abc 99 9876 9:99 PM).
Instead, use strptime:
from datetime import datetime
string = 'Jan 5 2018 6:10 PM'
datetime.strptime(string, '%b %d %Y %I:%M %p')
If the string is in the "wrong" format, you get a ValueError.
The only apparent "problem" with this approach is that for some reason you require the day and hour not to be zero-padded and strptime doesn't seem to have such directives.
A table with all available directives is here.
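One possible way around the zero-padding issue is a round trip: parse with strptime, rebuild the string in the exact required shape, and compare it with the input. This is only a sketch; the function name is_valid_user_date is made up, and it assumes an English locale so that %b and %p produce "Jan" and "PM".

```python
from datetime import datetime

def is_valid_user_date(s):
    """True only for strings shaped exactly like 'Jan 5 2018 6:10 PM'."""
    try:
        dt = datetime.strptime(s, '%b %d %Y %I:%M %p')
    except ValueError:
        return False  # not a real date/time at all
    hour12 = dt.hour % 12 or 12  # 12-hour clock without zero padding
    canonical = "{} {} {} {}:{} {}".format(
        dt.strftime('%b'), dt.day, dt.year, hour12,
        dt.strftime('%M'), dt.strftime('%p'))
    # Any padding, extra space or wrong letter case makes the strings differ
    return s == canonical

print(is_valid_user_date('Jan 5 2018 6:10 PM'))   # True
print(is_valid_user_date('Jan 05 2018 6:10 PM'))  # False
```

The comparison also rejects inputs that strptime accepts but that break the stated rules, such as a lowercase month name.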
You could use a function that parses the input string and tries to return a datetime object; if it can't, it raises an argparse.ArgumentTypeError:
import argparse
from datetime import datetime
def valid_date(s):
    try:
        return datetime.strptime(s, '%Y-%m-%d %H:%M')
    except ValueError:
        msg = "Not a valid date: '{0}'.".format(s)
        raise argparse.ArgumentTypeError(msg)
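The ArgumentTypeError suggests the function is meant to be passed as type= to argparse, which then reports the message and exits on bad input. A minimal sketch of that usage follows; the option name --since is hypothetical and chosen only for this example.

```python
import argparse
from datetime import datetime

def valid_date(s):
    try:
        return datetime.strptime(s, '%Y-%m-%d %H:%M')
    except ValueError:
        raise argparse.ArgumentTypeError("Not a valid date: '{0}'.".format(s))

parser = argparse.ArgumentParser()
# '--since' is a hypothetical option name used only for this sketch
parser.add_argument('--since', type=valid_date)

args = parser.parse_args(['--since', '2018-01-05 18:10'])
print(args.since)  # 2018-01-05 18:10:00
```

With an input that does not match the format, parse_args prints the error message and exits instead of returning.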
I am using Python 2.7.
I have an Adobe PDF form document that has a date field. I extract the values using pdfminer. The problem I need to solve is that the user in Adobe Acrobat Reader is allowed to type in strings like april 3rd 2017 or 3rd April 2017 or Apr 3rd 2017 or 04/04/2017 as well as 4 3 2017. The date field in Adobe is set to mm/dd/yyyy format, so when a user types in one of the values above, that is the actual value that pdfminer pulls, yet Adobe displays it as 04/03/2017; when you click on the field it shows you the actual value, like the ones above. Adobe allows this and then, I think, does its own conversion to display the date as mm/dd/yyyy. Adobe can use JavaScript for more control, but I can't do that: the users can only have and use the PDF form, without any accompanying JavaScript file.
So I am looking for a way, using datetime in Python, to accept a written date such as the examples above as a string and convert it into a true mm/dd/yyyy format. I saw methods for converting long and short month names, but nothing that would handle ordinal day numbers like 1st, 2nd, 3rd, 4th.
You could just try each possible format in turn. First remove any st nd rd specifiers to make the testing easier:
from datetime import datetime
formats = ["%B %d %Y", "%d %B %Y", "%b %d %Y", "%m/%d/%Y", "%m %d %Y"]
dates = ["april 3rd 2017", "3rd April 2017", "Apr 3rd 2017", "04/04/2017", "4 3 2017"]
for date in dates:
    date = date.lower().replace("rd", "").replace("nd", "").replace("st", "")
    for format in formats:
        try:
            print(datetime.strptime(date, format).strftime("%m/%d/%Y"))
            break  # stop at the first format that parses
        except ValueError:
            pass
Which would display:
04/03/2017
04/03/2017
04/03/2017
04/04/2017
04/03/2017
This approach has the benefit of validating each date: for example, a month greater than 12 is rejected. You could flag any dates that failed all allowed formats.
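A minimal sketch of the flagging idea: wrap the loop in a function that returns None when no format fits, then collect the inputs that came back as None. The names normalize and failed are made up for illustration. The sketch strips ordinal suffixes with a regex that only fires after a digit, so month names such as "August" keep their "st".

```python
import re
from datetime import datetime

FORMATS = ["%B %d %Y", "%d %B %Y", "%b %d %Y", "%m/%d/%Y", "%m %d %Y"]

def normalize(date_text):
    """Return the date as mm/dd/yyyy, or None if no format matches."""
    # Remove st/nd/rd/th only when they directly follow a digit
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", date_text.strip(),
                     flags=re.IGNORECASE)
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return None

inputs = ["april 3rd 2017", "3rd April 2017", "Apr 3rd 2017",
          "04/04/2017", "4 3 2017", "13 45 2017"]
results = {text: normalize(text) for text in inputs}
failed = [text for text, value in results.items() if value is None]
print(failed)  # ['13 45 2017']
```

Returning None rather than printing keeps the decision about what to do with bad input (log it, ask the user again, skip it) in the calling code.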
Just write a regular expression to get the number out of the string.
import re
s = '30Apr'
n = s[:re.match(r'[0-9]+', s).span()[1]]
print(n) # Will print 30
The other things should be easy.
Based on @MartinEvans's answer, but using the arrow library (because it handles more cases than datetime, so you don't have to use replace() or lower()):
First install arrow:
pip install arrow
Then try each possible format:
import arrow
dates = ['april 3rd 2017', '3rd April 2017', 'Apr 3rd 2017', '04/04/2017', '4 3 2017']
formats = ['MMMM Do YYYY', 'Do MMMM YYYY', 'MMM Do YYYY', 'MM/DD/YYYY', 'M D YYYY']
def convert_datetime(date):
    for format in formats:
        try:
            print(arrow.get(date, format).format('MM/DD/YYYY'))
        except arrow.parser.ParserError:
            pass
for date in dates:
    convert_datetime(date)
Will output:
04/03/2017
04/03/2017
04/03/2017
04/04/2017
04/03/2017
If you are unsure of what could be wrong in your date format, you can also output a nice error message if none of the date matches the format:
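def convert_datetime(date):
    last_error = None
    for format in formats:
        try:
            print(arrow.get(date, format).format('MM/DD/YYYY'))
            break
        except (arrow.parser.ParserError, ValueError) as e:
            # Keep the error: Python 3 unbinds `e` when the except block ends
            last_error = e
    else:
        # Reached only if no format matched (the loop never hit break)
        print('For date: "{0}", {1}'.format(date, last_error))
convert_datetime('124 5 2017') # test invalid date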
Will output the following error message:
'For date: "124 5 2017", month must be in 1..12'