Regular expression for extracting no. of days, months and years left - python

I'm trying to write a regular expression that extracts the amount of time left for something, which can be given in days, months or years.
For example, take the sentence "This product has a shell life of 21 days or 21 (twenty-one) days or twenty-one days or 21 months, 21 years or five days or 5 (five) days or five (5) days or five(5) days."
I know this is a funny sentence, but the point is that I want to extract the durations in it.
I have written the regular expression (?:\w*\-?\w*\s*\(\s*\d+\s*\w*\)\s*\w*|\b\d*\s+\w*\d*)\s*(?:year|month|day)s? but it's not extracting 5 (five) days, or any duration that has a digit followed by words in parentheses. Can anyone help with the regex?
Thanks in advance

If you want to match the parts in the example data, you might use
\w+(?:-\w+)?\s*(?:\(\w+(?:-\w+)?\)\s+)?(?:year|month|day)s?\b
The pattern matches:
\w+(?:-\w+)? Match 1+ word chars, optionally followed by - and 1+ word chars
\s* Match optional whitespace chars
(?:\(\w+(?:-\w+)?\)\s+)? Optionally match from ( to ), with word chars and an optional - and word chars in between, followed by 1+ whitespace chars
(?:year|month|day)s? Match any of the alternatives and an optional s
\b A word boundary to prevent a partial match
Example
import re
from pprint import pprint
regex = r"\w+(?:-\w+)?\s*(?:\(\w+(?:-\w+)?\)\s+)?(?:year|month|day)s?\b"
s = "This product has a shell life of 21 days or 21 (twenty-one) days or twenty-one days or 21 months, 21 years or five days or 5 (five) days or five (5) days or five(5) days."
pprint(re.findall(regex, s))
Output
['21 days',
'21 (twenty-one) days',
'twenty-one days',
'21 months',
'21 years',
'five days',
'5 (five) days',
'five (5) days',
'five(5) days']
Note that \s can also match a newline, and \w matches letters, digits and underscores, so the pattern can produce a broad range of matches.
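To illustrate the note above, here is a small sketch of how broad the matches can get, together with one possible way to tighten the pattern. The names LOOSE and STRICT and the sample sentence are made up for this illustration. LOOSE is a pattern in the style of the answer with a singular-or-plural unit, and STRICT simply assumes that at least one whitespace char must separate the quantity from the unit.

```python
import re

# Loose pattern in the style of the answer: whitespace before the unit is optional.
LOOSE = re.compile(r"\w+(?:-\w+)?\s*(?:\(\w+(?:-\w+)?\)\s+)?(?:year|month|day)s?\b")

# Hypothetical stricter variant: whitespace before the unit is mandatory.
STRICT = re.compile(r"\b\w+(?:-\w+)?(?:\s*\(\w+(?:-\w+)?\))?\s+(?:year|month|day)s?\b")

text = "Monday is a holiday, 5 days left"

loose_hits = LOOSE.findall(text)    # also picks up words that merely end in "day"
strict_hits = STRICT.findall(text)  # only the real duration

print(loose_hits)
print(strict_hits)
```

Whether the stricter form is appropriate depends on the data; it still accepts any word in front of the unit, so "some days" would match as well.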

Related

Extract n years m months and x days pattern using python regex

I am trying to match an "n years m months and x days" pattern using regex. Each of n years, m months, x days and the word "and" may or may not be in the string. For an exact match I am able to extract this using the regex:
re.search(r'(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))?', '2 years 25 days')
which returns 2 years 25 days, but if there is additional text in the string I don't get the match, for example:
re.search(r'(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))?', 'in 2 years 25 days')
returns ''
I tried this:
re.search(r'.*(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))?.*', 'in 2 years 25 days')
which returns the whole string, but I don't want the additional text.
You get an empty string because all the parts in the regex are optional, so the pattern also matches the empty string, and re.search finds that empty match at the very start of 'in 2 years 25 days'.
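A quick way to see this behaviour is to inspect the span of the match. The sketch below uses a slightly simplified version of the question's pattern (years? instead of year(s?)); the variable names are only for illustration.

```python
import re

# Every part is optional, so the empty string is a valid match.
all_optional = re.compile(r"(?:\d+ years?)?\s*(?:\d+ months?)?\s*(?:\d+ days?)?")

m = all_optional.search("in 2 years 25 days")
print(repr(m.group()), m.span())

# Scanning on with finditer shows that the real duration is only found later
# in the string; empty and whitespace-only matches are filtered out here.
non_empty = [x.group() for x in all_optional.finditer("in 2 years 25 days")
             if x.group().strip()]
print(non_empty)
```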
If all the parts are optional but you want to match at least 1 of them, you can use a leading assertion.
\b(?=\d+ (?:years?|months?|days?)\b)(?:\d+ years?)?(?:\s*\d+ months?)?(?:\s*\d+ days?)?\b
Explanation
\b A word boundary
(?=\d+ (?:years?|months?|days?)\b) Assert to the right 1+ digits and 1 of the alternatives
(?:\d+ years?)? Match 1+ digits, space and year or years
(?:\s*\d+ months?)? Same for months
(?:\s*\d+ days?)? Same for days
\b A word boundary
Example
import re
pattern = r'\b(?=\d+ (?:years?|months?|days?)\b)(?:\d+ years?)?(?:\s*\d+ months?)?(?:\s*\d+ days?)?\b'
m = re.search(pattern, 'in 2 years 25 days')
if m:
    print(m.group())
Output
2 years 25 days
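If the individual numbers are needed rather than the matched text, the same pattern can be given named groups. This is only a sketch: the group names and the helper parse_duration are made up for this example, and missing parts are assumed to mean 0.

```python
import re

# Same structure as the pattern above, with a named group around each number.
DURATION = re.compile(
    r"\b(?=\d+ (?:years?|months?|days?)\b)"
    r"(?:(?P<years>\d+) years?)?"
    r"(?:\s*(?P<months>\d+) months?)?"
    r"(?:\s*(?P<days>\d+) days?)?\b"
)

def parse_duration(text):
    """Return {'years': int, 'months': int, 'days': int}, or None if nothing matches."""
    m = DURATION.search(text)
    if m is None:
        return None
    # Groups that did not take part in the match are None; treat them as 0.
    return {unit: int(value or 0) for unit, value in m.groupdict().items()}

print(parse_duration('in 2 years 25 days'))
```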
Since years, months, days are temporal units, you could use the pint module for that.
Parse temporal units with pint
See the String parsing tutorial and related features used:
printing quantities, see String formatting
converting quantities, see Converting to different units
from pint import UnitRegistry
ureg = UnitRegistry()
temporal_strings = '2 years and 25 days'.split('and') # remove and split
quantities = [ureg(q) for q in temporal_strings] # parse quantities
# [<Quantity(2, 'year')>, <Quantity(25, 'day')>]
# print the quantities separately
for q in quantities:
    print(q)
# get the total days
print(f"total: {sum(quantities)}")
print(f"total days: {sum(quantities).to('days')}")
Output printed:
2 year
25 day
total: 2.0684462696783026 year
total days: 755.5 day
You can try this:
import re
match = re.search(r'(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))', 'in 2 years 25 days')
if match:
    print(match.group())
Output:
2 years 25 days

Regex: Match all characters in between an underscore and a period

I have a set of file names in which I need to extract their dates. The file names look like:
['1 120836_1_20210101.csv',
'1 120836_1_20210108.csv',
'1 120836_20210101.csv',
'1 120836_20210108.csv',
'10 120836_1_20210312.csv',
'10 120836_20210312.csv',
'11 120836_1_20210319.csv',
'11 120836_20210319.csv',
'12 120836_1_20210326.csv',
...
]
As an example, I would need to extract 20210101 from the first item in the list above.
Here is my code but it is not working - I'm not totally familiar with regex.
import re
dates = []
for file in files:
    dates.extend(re.findall("(?<=_)\d{}(?=\d*\.)", file))
You weren't that far off, but there were a few issues:
you extend dates with the result of re.findall, but you only expect one date per file name and are building the whole dates list, so a re.search in a list comprehension is a lot simpler
your regex has a few unneeded complications and a bug: \d{} is not a valid quantifier, so the braces are matched literally
This is what you were after:
import re
files = [
'1 120836_1_20210101.csv',
'1 120836_1_20210108.csv',
'1 120836_20210101.csv',
'1 120836_20210108.csv',
'10 120836_1_20210312.csv',
'10 120836_20210312.csv',
'11 120836_1_20210319.csv',
'11 120836_20210319.csv',
'12 120836_1_20210326.csv'
]
dates = [re.search(r"(?<=_)\d+(?=\.)", fn).group(0) for fn in files]
print(dates)
Output:
['20210101', '20210108', '20210101', '20210108', '20210312', '20210312', '20210319', '20210319', '20210326']
It keeps the lookbehind for an underscore, and changes the lookahead to look for a period. It just matches all digits (at least one, with +) in between the two.
Note that the r in front of the string avoids having to double up the backslashes in the regex. The backslashes in \d and \. are still required to indicate a digit and a literal period.
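One thing to keep in mind is that re.search returns None when a file name contains no date, so calling .group(0) directly would raise an AttributeError. A minimal sketch of a more defensive version follows; the helper name extract_date, the assumption that dates are always 8 digits in YYYYMMDD form, and the file name 'readme.txt' are all made up for this illustration.

```python
import re
from datetime import datetime

# Assumes the date is exactly 8 digits between an underscore and a period.
DATE_RE = re.compile(r"(?<=_)\d{8}(?=\.)")

def extract_date(filename):
    """Return the date embedded in filename as a datetime.date, or None."""
    m = DATE_RE.search(filename)
    if m is None:
        return None
    # strptime raises ValueError for impossible dates such as 20211301.
    return datetime.strptime(m.group(0), "%Y%m%d").date()

names = ['1 120836_1_20210101.csv', '10 120836_20210312.csv', 'readme.txt']
print([extract_date(n) for n in names])
```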

Find date from image/text

I have dates like these and I need a regex to find them:
12-23-2019
29 10 2019
1:2:2018
9/04/2019
22.07.2019
Here's what I did.
First I removed all spaces from the text, and here's what it looks like:
12-23-2019291020191:02:2018
And this is my regex:
re.findall(r'((\d{1,2})([.\/-])(\d{2}|\w{3,9})([.\/-])(\d{4}))',new_text)
It can find 12-23-2019, 9/04/2019 and 22.07.2019, but cannot find 29 10 2019 and 1:02:2018.
You may use
(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)
Details
(?<!\d) - no digit right before
\d{1,2} - 1 or 2 digits
([.:/ -]) - a dot, colon, slash, space or hyphen (captured in Group 1)
(?:\d{1,2}|\w{3,}) - 1 or 2 digits or 3 or more word chars
\1 - same value as in Group 1
\d{4} - four digits
(?!\d) - no digit allowed right after
Python sample usage:
import re
text = 'Aaaa 12-23-2019, bddd 29 10 2019 <=== 1:2:2018'
pattern = r'(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)'
results = [x.group() for x in re.finditer(pattern, text)]
print(results) # => ['12-23-2019', '29 10 2019', '1:2:2018']
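The backreference \1 is what forces both separators to be the same character. The following sketch wraps the pattern in a small helper (find_dates is a name chosen for this example, and the input strings are invented) to show that a date with mixed separators is skipped, while a month name is accepted through the \w{3,} alternative.

```python
import re

DATE = re.compile(r"(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)")

def find_dates(text):
    # finditer + group() returns whole matches; findall would return only Group 1.
    return [m.group() for m in DATE.finditer(text)]

# "12-23/2019" mixes "-" and "/", so \1 cannot match the second separator.
print(find_dates("12-23-2019 and 12-23/2019 and 22.07.2019"))
print(find_dates("due 5 March 2019"))
```

The pattern only checks the shape of the text, so values such as 99-99-2019 would still match; range checks are easier to do after the match.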

regex to match several string in Python

I have questions with regex in Python.
Possible variations can be
10 hours, 12 weeks or 7 business days.
I want to have my regex something like
string = "I have an 7 business day trip and 12 weeks vacation."
re.findall(r'\d+\s(business)?\s(hours|weeks|days)', string)
so that I expect to find "7 business day" and "12 weeks" but it returns None
string = "I have an 7 business day trip and 12 weeks vacation."
print(re.findall(r'\d+\s(?:business\s)?(?:hour|week|day)s?', string))
['7 business day', '12 weeks']
\d+\s(?:business\s)?(?:hour|week|day)s?
Yours didn't work because it requires two whitespace characters between the number and the unit even when business is absent, and it only accepts the plural units hours, weeks and days, so neither 7 business day nor 12 weeks can match.
Although if you don't want to accept business week/hour, you'll need to modify it further:
\d+\s(?:hour|week|(?:business )?day)s?
You need to tweak your regex to this:
>>> string = "I have an 7 business day trip and 12 weeks vacation."
>>> print(re.findall(r'(\d+)\s*(?:business day|hour|week)s?', string))
['7', '12']
This matches any number that is followed by business day, hour or week and an optional s at the end. Because the number is the only capture group, findall returns just the numbers.
Similar to @anubhava's answer, but this matches "7 business day" rather than just "7". Just move the closing parenthesis from after \d+ to the end:
re.findall(r'(\d+\s*(?:business day|hour|week)s?)', string)
\d+\s+(business\s)?(hour|week|day)s?
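Note that with capturing groups in the pattern, re.findall returns the groups rather than the whole match. The sketch below contrasts the capturing pattern above with a non-capturing variant and with re.finditer; the variable names are only for illustration.

```python
import re

s = "I have an 7 business day trip and 12 weeks vacation."

capturing = r"\d+\s+(business\s)?(hour|week|day)s?"
non_capturing = r"\d+\s+(?:business\s)?(?:hour|week|day)s?"

# With two capture groups, findall returns one tuple of groups per match.
groups_only = re.findall(capturing, s)

# Without capture groups, findall returns the full matched text.
full = re.findall(non_capturing, s)

# finditer gives access to the full match even when groups are present.
via_finditer = [m.group(0) for m in re.finditer(capturing, s)]

print(groups_only)
print(full)
print(via_finditer)
```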

python re.compile strings with vars and numbers

Hi I want to get a match for the following:
test = re.compile(r' [0-12](am|pm) [1-1000] days from (yesterday|today|tomorrow)')
with this match:
print(test.match(" 3pm 2 days from today"))
It returns None. What am I doing wrong? I am just getting into regex, and from reading the docs I thought this should work! Any help appreciated.
chrism
Here is my hat in the ring. Careful study of this regex will teach a few lessons:
import re
reobj = re.compile(
    r"""# Loosely match a date/time reference
    ^                      # Anchor to start of string.
    \s*                    # Optional leading whitespace.
    (?P<time>              # $time: military or AM/PM time.
      (?:                  # Group for military hours options.
        [2][0-3]           # Hour is either 20, 21, 22, 23,
      | [01]?[0-9]         # or 0-9, 00-09 or 10-19
      )                    # End group of military hours options.
      (?:                  # Group for optional minutes.
        :                  # Hours and minutes separated by ":"
        [0-5][0-9]         # 00-59 minutes
      )?                   # Military minutes are optional.
    |                      # or time is given in AM/PM format.
      (?:1[0-2]|0?[1-9])   # 1-12 or 01-12 AM/PM options (hour)
      (?::[0-5][0-9])?     # Optional minutes for AM/PM time.
      \s*                  # Optional whitespace before AM/PM.
      [ap]m                # Required AM or PM (case insensitive)
    )                      # End group of time options.
    \s+                    # Required whitespace.
    (?P<offset> \d+ )      # $offset: count of time increments.
    \s+                    # Required whitespace.
    (?P<units>             # $units: units of time increment.
      (?:sec(?:ond)?|min(?:ute)?|hour|day|week|month|year|decade|century)
      s?                   # Time units may have optional plural "s".
    )                      # End $units: units of time increment.
    \s+                    # Required whitespace.
    (?P<dir>from|before|after|since)  # $dir: Time offset direction.
    \s+                    # Required whitespace.
    (?P<base>yesterday|today|tomorrow|(?:right\s+)?now)  # $base: \s+ because VERBOSE ignores a literal space.
    \s*                    # Optional whitespace before end.
    $                      # Anchor to end of string.""",
    re.IGNORECASE | re.VERBOSE)
match = reobj.match(' 3 pm 2 days from today')
if match:
    print('Time: %s' % (match.group('time')))
    print('Offset: %s' % (match.group('offset')))
    print('Units: %s' % (match.group('units')))
    print('Direction: %s' % (match.group('dir')))
    print('Base time: %s' % (match.group('base')))
else:
    print("No match.")
Output:
Time: 3 pm
Offset: 2
Units: days
Direction: from
Base time: today
This regex illustrates a few lessons to be learned:
Regular expressions are very powerful (and useful)!
This regex does validate the numbers, but as you can see, doing so is cumbersome and difficult (and thus, not recommended - I'm showing it here to demonstrate why not to do it this way). It is much easier to simply capture the numbers with a regex then validate the ranges using procedural code.
Named capture groups ease the pain of plucking multiple data sub-strings from larger text.
Always write regexes using free-spacing, verbose mode with proper indentation of groups and lots of descriptive comments. This helps while writing the regex and later during maintenance.
Modern regular expressions comprise a rich and powerful language. Once you learn the syntax and develop a habit of writing verbose, properly indented, well-commented code, even complex regexes such as the one above are easy to write, easy to read and easy to maintain. It is unfortunate that they have acquired a reputation for being difficult, unwieldy and error-prone (and thus not recommendable for complex tasks).
Happy regexing!
What about
test = re.compile(r' ([0-9]|1[012])(am|pm) \d+ days from (yesterday|today|tomorrow)')
The hours part should match 0, 1, ..., 9 or 10, 11, 12, but not 13, 14, ..., 19.
You can limit the days part in a similar way to 1, ..., 1000, i.e. (1000|\d{1,3}). Use [1-9]\d{0,2} instead of \d{1,3} if 0 must be excluded.
Try this:
test = re.compile(r' \d+(am|pm) \d+ days from (yesterday|today|tomorrow)')
Try this:
import re
test = re.compile(r'^\s[0-1]?[0-9]{1}pm \d+ days from (today|yesterday|tomorrow)$')
print(test.match(" 12pm 2 days from today"))
The problem you're having is that a character class cannot express a multi-digit numeric range: [0-12] matches a single character that is 0, 1 or 2, and [1-1000] matches a single character that is 0 or 1. You have to describe the number digit by digit instead.
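The difference is easy to check with re.fullmatch. In this sketch the variable names are illustrative only; it lists which of the numbers 0 to 19 each pattern accepts as a whole string.

```python
import re

# A character class matches exactly one character: here 0, 1 or 2.
char_class = re.compile(r"[0-12]")
accepted_by_class = [n for n in range(20) if char_class.fullmatch(str(n))]

# Describing the number digit by digit: 10-12, or a single digit 0-9.
hours = re.compile(r"1[0-2]|[0-9]")
accepted_hours = [n for n in range(20) if hours.fullmatch(str(n))]

print(accepted_by_class)
print(accepted_hours)
```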
If you want to extract the parts of the match individually, you can label the groups with (?P<name>...). For example:
import re
pattern = re.compile(
    r'\s*(?P<time>1?[0-9])(?P<ampm>am|pm)\s+'
    r'(?P<days>[1-9]\d*)\s+days\s+from\s+'
    r'(?P<when>yesterday|today|tomorrow)\s*')
# Exhaustively check every combination the pattern is meant to accept.
for time in range(0, 13):
    for ampm in ('am', 'pm'):
        for days in range(1, 1000):
            for when in ('yesterday', 'today', 'tomorrow'):
                text = ' %d%s %d days from %s ' % (time, ampm, days, when)
                match = pattern.match(text)
                assert match is not None
                keys = sorted(match.groupdict().keys())
                assert keys == ['ampm', 'days', 'time', 'when']
text = ' 3pm 2 days from today '
print(pattern.match(text).groupdict())
Output:
{'time': '3', 'ampm': 'pm', 'days': '2', 'when': 'today'}
test = re.compile(r' 1?\d[ap]m \d{1,3} days? from (?:yesterday|today|tomorrow)')
EDIT
Having read the discussion between Rumple Stiltskin and Demian Brecht, I noticed that my proposition above is poor: it detects a certain string structure, but it doesn't validate that the string is a good "time-pattern" string. For example, it accepts " 18pm 2 days from today".
So I now propose a pattern that detects precisely the strings that meet your requirement, and that also flags every string with the same structure as a valid one but with out-of-range values:
import re
regx = re.compile(r"(?<= )"  # better than a blank as first character
                  r"(?:(1[012]|\d)([ap]m) (?!0 )(\d{1,3}|1000)"
                  r"|"
                  r"(\d+)([ap]m) (\d+))"
                  r" days? from (yesterday|today|tomorrow)")  # shared part
for ch in (" 12pm 2 days from today",
           " 4pm 1 day from today",
           " 12pm 0 days from today",
           " 12pm 1001 days from today",
           " 18pm 2 days from today",
           " 1212pm 2 days from today",
           " 12pm five days from today"):
    print(ch)
    mat = regx.search(ch)
    if mat:
        if mat.group(1):
            print(mat.group(1, 2, 3, 7), '\n# time-pattern-VALIDATED string #')
        else:
            print(mat.group(4, 5, 6, 7), '\n* SIMILI-time-pattern STRUCTURED string*')
    else:
        print('- NO STRUCTURED STRING in the text -')
    print()
Result:
12pm 2 days from today
('12', 'pm', '2', 'today')
# time-pattern-VALIDATED string #
4pm 1 day from today
('4', 'pm', '1', 'today')
# time-pattern-VALIDATED string #
12pm 0 days from today
('12', 'pm', '0', 'today')
* SIMILI-time-pattern STRUCTURED string*
12pm 1001 days from today
('12', 'pm', '1001', 'today')
* SIMILI-time-pattern STRUCTURED string*
18pm 2 days from today
('18', 'pm', '2', 'today')
* SIMILI-time-pattern STRUCTURED string*
1212pm 2 days from today
('1212', 'pm', '2', 'today')
* SIMILI-time-pattern STRUCTURED string*
12pm five days from today
- NO STRUCTURED STRING in the text -
If you only need a regex that detects a validated time-pattern string, use just:
regx = re.compile(r"(?<= )(1[012]|\d)([ap]m) (?!0 )(\d{1,3}|1000) days?"
                  r" from (yesterday|today|tomorrow)")
It is easier (and more readable) to check integer ranges after the match:
m = re.match(r' (\d+)(?:pm|am) (\d+) days from (yesterday|today|tomorrow)',
             " 3pm 2 days from today")
assert m and int(m.group(1)) <= 12 and 1 <= int(m.group(2)) <= 1000
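Or you could use an existing library, e.g. parsedatetime (pip install parsedatetime):
import parsedatetime as pdt  # very old releases used: import parsedatetime.parsedatetime as pdt
cal = pdt.Calendar()
print(cal.parse("3pm 2 days from today"))
Output (the exact value depends on the current date and the library version):
((2011, 4, 26, 15, 0, 0, 1, 116, -1), 3)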
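For completeness, the "match first, check ranges afterwards" approach can also be carried through to an actual datetime using only the standard library. This is a rough sketch under stated assumptions, not a replacement for a real parser: the names PHRASE, BASE_OFFSET and resolve are invented here, hours are assumed to be 1 to 12, days 1 to 1000, and the reference date is passed in explicitly so that the result does not depend on when the code runs.

```python
import re
from datetime import date, datetime, timedelta

PHRASE = re.compile(
    r"\s*(?P<hour>\d+)(?P<ampm>am|pm)\s+(?P<days>\d+)\s+days?\s+from\s+"
    r"(?P<base>yesterday|today|tomorrow)\s*"
)
BASE_OFFSET = {"yesterday": -1, "today": 0, "tomorrow": 1}

def resolve(phrase, today):
    """Return the datetime described by phrase relative to today, or None."""
    m = PHRASE.fullmatch(phrase)
    if m is None:
        return None
    hour, days = int(m.group("hour")), int(m.group("days"))
    # Range checks in plain code instead of in the regex (assumed limits).
    if not (1 <= hour <= 12 and 1 <= days <= 1000):
        return None
    # On a 12-hour clock, 12am is midnight and 12pm is noon.
    hour24 = hour % 12 + (12 if m.group("ampm") == "pm" else 0)
    day = today + timedelta(days=BASE_OFFSET[m.group("base")] + days)
    return datetime(day.year, day.month, day.day, hour24)

print(resolve(" 3pm 2 days from today", date(2011, 4, 24)))
```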
