Extract n years m months and x days pattern using python regex - python

I am trying to match an "n years m months and x days" pattern using regex. Any of the parts (n years, m months, x days, and the word "and") may or may not be in the string. For an exact match I am able to extract this using the regex:
re.search(r'(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))?', '2 years 25 days')
which returns 2 years 25 days, but if there is additional text in the string I don't get the match, for example:
re.search(r'(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))?', 'in 2 years 25 days')
returns ''
I tried this:
re.search(r'.*(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))?.*', 'in 2 years 25 days')
which returns the whole string, but I don't want the additional text.

You get an empty string with the second pattern because all the parts in the regex are optional, so it also matches the empty string at the very start of the input.
If all the parts are optional but you want to match at least 1 of them, you can use a leading assertion.
\b(?=\d+ (?:years?|months?|days?)\b)(?:\d+ years?)?(?:\s*\d+ months?)?(?:\s*\d+ days?)?\b
Explanation
\b A word boundary
(?=\d+ (?:years?|months?|days?)\b) Assert to the right 1+ digits and 1 of the alternatives
(?:\d+ years?)? Match 1+ digits, space and year or years
(?:\s*\d+ months?)? Same for months
(?:\s*\d+ days?)? Same for days
\b A word boundary
Example
import re
pattern = r'\b(?=\d+ (?:years?|months?|days?)\b)(?:\d+ years?)?(?:\s*\d+ months?)?(?:\s*\d+ days?)?\b'
m = re.search(pattern, 'in 2 years 25 days')
if m:
    print(m.group())
Output
2 years 25 days
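If the individual numbers are needed rather than the matched text, the same pattern can carry named groups. This is a minimal sketch, not part of the original answer; the group names and the helper `parse_duration` are made up for illustration, and missing units are assumed to count as 0.

```python
import re

# Same structure as the pattern above, with a named group around each number.
DURATION = re.compile(
    r'\b(?=\d+ (?:years?|months?|days?)\b)'
    r'(?:(?P<years>\d+) years?)?'
    r'(?:\s*(?P<months>\d+) months?)?'
    r'(?:\s*(?P<days>\d+) days?)?\b'
)

def parse_duration(text):
    """Return {'years': n, 'months': m, 'days': x}, or None if nothing matches."""
    m = DURATION.search(text)
    if m is None:
        return None
    # Groups that did not take part in the match are None; treat them as 0.
    return {unit: int(value) if value else 0
            for unit, value in m.groupdict().items()}

print(parse_duration('in 2 years 25 days'))
# {'years': 2, 'months': 0, 'days': 25}
```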

Since years, months, days are temporal units, you could use the pint module for that.
Parse temporal units with pint
See the String parsing tutorial and related features used:
printing quantities, see String formatting
converting quantities, see Converting to different units
from pint import UnitRegistry
ureg = UnitRegistry()
temporal_strings = '2 years and 25 days'.split('and') # remove and split
quantities = [ureg(q) for q in temporal_strings] # parse quantities
# [<Quantity(2, 'year')>, <Quantity(25, 'day')>]
# print the quantities separately
for q in quantities:
    print(q)
# get the total days
print(f"total: {sum(quantities)}")
print(f"total days: {sum(quantities).to('days')}")
Output printed:
2 year
25 day
total: 2.0684462696783026 year
total days: 755.5 day

You can try this:
import re
match = re.search(r'(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))', 'in 2 years 25 days')
if match:
    print(match.group())
Output:
2 years 25 days
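One caveat with this variant: the days part is no longer optional, so strings without a days component do not match at all. A short check, using a hypothetical helper `extract` and two extra inputs that are not from the question:

```python
import re

# The pattern from the answer above: years and months optional, days required.
pattern = re.compile(r'(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))')

def extract(text):
    """Return the matched duration text, or None when there is no match."""
    m = pattern.search(text)
    return m.group() if m else None

for text in ('in 2 years 25 days', 'in 2 years', 'in 3 months'):
    print(repr(text), '->', extract(text))
# 'in 2 years 25 days' -> 2 years 25 days
# 'in 2 years' -> None
# 'in 3 months' -> None
```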

Related

Expand currency sign with moving words inside string

How can I achieve this without using if operator and huge multiline constructions?
Example:
around $98 million last week
above $10 million next month
about €5 billion past year
after £1 billion this day
Convert to
around 98 million dollars last week
above 10 million dollars next month
about 5 billion euros past year
after 1 billion pounds this day
You probably have more cases than these, so this regular expression may be too specific, but re.sub can be used with a function to process each match and make the correct replacement. The code below solves the cases provided:
import re
text = '''\
around $98 million last week
above $10 million next month
about €5 billion past year
after £1 billion this day
'''
currency_name = {'$':'dollars', '€':'euros', '£':'pounds'}
def replacement(match):
    # group 2 is the digits and million/billion,
    # tack on the currency type afterward using a lookup dictionary
    return f'{match.group(2)} {currency_name[match.group(1)]}'
# capture the symbol followed by digits and million/billion
print(re.sub(r'([$€£])(\d+ [mb]illion)\b', replacement, text))
Output:
around 98 million dollars last week
above 10 million dollars next month
about 5 billion euros past year
after 1 billion pounds this day
You could make a dictionary mapping currency symbol to name, and then generate the regexes. Note that these regexes will only work for something in the form of a number and then a word.
import re
CURRENCIES = {
    r"\$": "dollars",  # Note the backslash; $ is a special regex character
    "€": "euros",
    "£": "pounds",
}
REGEXES = []
for symbol, name in CURRENCIES.items():
    REGEXES.append((re.compile(rf"{symbol}(\d+ [^\W\d_]+)"), rf"\1 {name}"))
text = """around $98 million last week
above $10 million next month
about €5 billion past year
after £1 billion this day"""
for regex, replacement in REGEXES:
    text = regex.sub(replacement, text)
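The two ideas above (a lookup dictionary and a replacement function) can be combined into one pattern built from the dictionary keys, with re.escape taking care of the special `$`. This is only a sketch: the optional decimal part and the list of magnitude words are assumptions that go beyond the examples in the question.

```python
import re

CURRENCY_NAMES = {'$': 'dollars', '€': 'euros', '£': 'pounds'}

# Build a character class from the symbols; re.escape handles the special "$".
SYMBOLS = ''.join(re.escape(symbol) for symbol in CURRENCY_NAMES)
PATTERN = re.compile(
    rf'([{SYMBOLS}])(\d+(?:\.\d+)?(?: (?:thousand|million|billion|trillion)\b)?)'
)

def expand_currency(text):
    """Move the currency after the amount, spelled out as a word."""
    return PATTERN.sub(
        lambda m: f'{m.group(2)} {CURRENCY_NAMES[m.group(1)]}', text)

print(expand_currency('about €5 billion past year'))
# about 5 billion euros past year
```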
It's useful in a case like this to remember that re.sub can accept a lambda rather than just a string.
The following requires Python 3.8+ for the := operator.
import re
s = "about €5 billion past year"
re.sub(r'([€$])(\d+)\s+([mb]illion)',
       lambda m: f"{(g := m.groups())[1]} {g[2]} {'euros' if g[0] == '€' else 'dollars'}",
       s)
# 'about 5 billion euros past year'

Regular expression for extracting no. of days, months and years left

I'm trying to write a regular expression that extracts the amount of time left for something, which can be given in days, months or years.
For example, take the sentence "This product has a shell life of 21 days or 21 (twenty-one) days or twenty-one days or 21 months, 21 years or five days or 5 (five) days or five (5) days or five(5) days."
I know this is a funny sentence, but the point is that I want to extract all the durations in it.
I have written the regular expression (?:\w*\-?\w*\s*\(\s*\d+\s*\w*\)\s*\w*|\b\d*\s+\w*\d*)\s*(?:year|month|day)s? but it's not extracting 5 (five) days, or any duration that has a digit followed by (words). Can anyone help with the regex?
Thanks in advance
If you want to match the parts in the example data, you might use
\w+(?:-\w+)?\s*(?:\(\w+(?:-\w+)?\)\s+)?(?:year|month|day)s?\b
The pattern matches:
\w+(?:-\w+)? Match 1+ word chars with an optional - and word chars
\s* Match optional whitespace chars
(?:\(\w+(?:-\w+)?\)\s+)? Optionally match from ( till ) where there can be word chars with an optional - and word chars in between
(?:year|month|day)s? Match any of the alternatives and an optional s
\b A word boundary to prevent a partial match
Example
import re
from pprint import pprint
regex = r"\w+(?:-\w+)?\s*(?:\(\w+(?:-\w+)?\)\s+)?(?:year|month|day)s?\b"
s = "This product has a shell life of 21 days or 21 (twenty-one) days or twenty-one days or 21 months, 21 years or five days or 5 (five) days or five (5) days or five(5) days."
pprint(re.findall(regex, s))
Output
['21 days',
'21 (twenty-one) days',
'twenty-one days',
'21 months',
'21 years',
'five days',
'5 (five) days',
'five (5) days',
'five(5) days']
Note that \s can also match a newline, and \w matches digits and underscores as well as letters, so the pattern can match a broad range of strings.
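To illustrate how broad \w+ is: it would also accept "few days". One possible tightening, sketched below, is to allow only digits or a fixed list of number words. The word list here is a made-up, deliberately tiny assumption that just covers the example sentence.

```python
import re

# Hypothetical, deliberately small list of number words (extend as needed).
NUMBER_WORDS = r'(?:one|two|three|four|five|six|seven|eight|nine|ten|twenty(?:-one)?)'
NUM = rf'(?:\d+|{NUMBER_WORDS})'

# A number, an optional parenthesised number, then the unit.
STRICT = re.compile(
    rf'\b{NUM}\s*(?:\(\s*{NUM}\s*\)\s*)?(?:year|month|day)s?\b',
    re.IGNORECASE,
)

print(STRICT.findall('a few days, 5 (five) days or twenty-one days'))
# ['5 (five) days', 'twenty-one days']
```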

Python Regular expression of group to match text before amount

I am trying to write a python regular expression which captures multiple values from a few columns in dataframe. Below regular expression attempts to do the same. There are 4 parts of the string.
group 1: Date - month and day
group 2: Date - month and day
group 3: description text before amount i.e. group 4
group 4: amount - this group is optional
Some peculiar conditions for group 3 (the text):
(1) The text itself might contain characters like "-" and "$", so we cannot use - and $ as the boundary of the text.
(2) The text (group 3) may sometimes not be followed by an amount.
(3) Whitespace between groups 3 and 4 is optional.
Below is the Python function, which takes a dataframe with 4 columns c1, c2, c3, c4 and adds the columns dt, txt and amt to it after processing.
import re
import pandas as pd

def parse_values(args):
    re_1='(([JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC]{3}\s{0,}[\d]{1,2})\s{0,}){2}(.*[\s]|.*[^\$]|.*[^-]){1}([-+]?\$[\d|,]+(?:\.\d+)?)?'
    srch=re.search(re_1, args[0])
    if srch is None:
        return args
    m = re.match(re_1, args[0])
    args['dt']=m.group(1)
    args['txt']=m.group(3)
    args['amt']=m.group(4)
    if m.group(4) is None:
        if pd.isnull(args['c3']):
            args['amt']=args.c2
        else:
            args['amt']=args.c3
    return args
And in order to test the results I have below 6 rows which needs to return a properly formatted amt column in return.
tt=[{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL ','c2':'$16.84'},
{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL','c2':'$16.84'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK-$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK $80,00,7770.70'}
]
t=pd.DataFrame(tt,columns=['c1','c2','c3','c4'])
t=t.apply(parse_values,1)
t
However, due to an error in my regular expression re_1, the amt and txt columns are not parsed properly: in some rows they come back as NaN or are missing words.
How about this:
(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)
As seen at regex101.com
Explanation:
First off, I've shortened the regex by changing a few minor details like using \s* instead of \s{0,}, which mean the exact same thing.
The whole [JAN|...|DEC]{3} part was using a character class, i.e. [], which only matches a single character from the entire set (so {3} accepts any three of those letters, not just month names). Using a non-capturing group with alternation is the correct way of selecting between multi-letter alternatives, which in your case are the months.
The meat of the regex: lookaheads
(?=[\-$]) tells the regex that the text before it in (.*?) should match as little as it can, stopping at the first position followed by a dash or a dollar sign from which the amount also matches. Lookaheads don't consume whatever they're looking for; they just assert that the lookahead's contents follow that position.
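Note that the pattern above requires an amount, so rows without one (the first two test rows in the question) do not match. A possible variant, sketched here with plain re and no pandas, makes the amount optional and anchors it to the end of the string instead of using a lookahead. The group names and the helper `parse_row` are hypothetical, and it assumes the amount, when present, is the last thing in the string.

```python
import re

MONTH = r'(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)'
ROW = re.compile(
    rf'(?P<start>{MONTH}\s*\d{{1,2}})\s*'      # first date, e.g. "MAR 15"
    rf'(?P<end>{MONTH}\s*\d{{1,2}})\s*'        # second date, e.g. "MAR 16"
    r'(?P<txt>.*?)\s*'                         # description, as short as possible
    r'(?P<amt>[-+]?\$[\d,]+(?:\.\d+)?)?\s*$'   # optional amount at the very end
)

def parse_row(text):
    """Return (start, end, txt, amt); amt is None when no amount is present."""
    m = ROW.match(text)
    if m is None:
        return None
    return m.group('start', 'end', 'txt', 'amt')

print(parse_row('MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70'))
# ('MAR 15', 'MAR 16', 'LOBLAWS FOODS INC - EAST YORK', '-$80,00,7770.70')
print(parse_row('OCT 7 OCT 8 HURRY CURRY THORNHILL '))
# ('OCT 7', 'OCT 8', 'HURRY CURRY THORNHILL', None)
```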

regex to match several string in Python

I have questions with regex in Python.
Possible variations can be
10 hours, 12 weeks or 7 business days.
I want to have my regex something like
string = "I have an 7 business day trip and 12 weeks vacation."
re.findall(r'\d+\s(business)?\s(hours|weeks|days)', string)
so that I expect to find "7 business day" and "12 weeks" but it returns None
string = "I have an 7 business day trip and 12 weeks vacation."
print(re.findall(r'\d+\s(?:business\s)?(?:hour|week|day)s?', string))
['7 business day', '12 weeks']
\d+\s(?:business\s)?(?:hour|week|day)s?
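Yours didn't match for two reasons: it always requires two whitespace characters after the number, even when business is absent (so 12 weeks fails), and it only accepts the plural units hours, weeks and days (so 7 business day fails).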
Although if you don't want to accept business week/hour, you'll need to modify it further:
\d+\s(?:hour|week|(?:business )?day)s?
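A quick check of this stricter variant; the extra "3 business weeks" phrase is added here only to show what gets rejected:

```python
import re

# "business" is only allowed in front of "day".
pattern = re.compile(r'\d+\s(?:hour|week|(?:business )?day)s?')

text = "I have an 7 business day trip, 12 weeks vacation and 3 business weeks off."
print(pattern.findall(text))
# ['7 business day', '12 weeks']
```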
You need to tweak your regex to this:
>>> string = "I have an 7 business day trip and 12 weeks vacation."
>>> print(re.findall(r'(\d+)\s*(?:business day|hour|week)s?', string))
['7', '12']
This matches any number that is followed by business day or hour or week and an optional s in the end.
Similar to @anubhava's answer, but this matches "7 business day" rather than just "7". Just move the closing parenthesis from after \d+ to the end:
re.findall(r'(\d+\s*(?:business day|hour|week)s?)', string)
\d+\s+(business\s)?(hour|week|day)s?
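One thing to keep in mind with this last pattern: it uses capturing groups, and re.findall returns the groups rather than the whole match when a pattern has any. A small sketch of the difference, using re.finditer to get the full matched text:

```python
import re

text = "I have an 7 business day trip and 12 weeks vacation."
pattern = re.compile(r'\d+\s+(business\s)?(hour|week|day)s?')

# With capturing groups, findall returns one tuple of groups per match.
print(pattern.findall(text))
# [('business ', 'day'), ('', 'week')]

# finditer gives match objects, so group(0) is the whole matched text.
print([m.group(0) for m in pattern.finditer(text)])
# ['7 business day', '12 weeks']
```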

python re.compile strings with vars and numbers

Hi, I want to get a match for the following:
test = re.compile(r' [0-12](am|pm) [1-1000] days from (yesterday|today|tomorrow)')
with this string:
print(test.match(" 3pm 2 days from today"))
It returns None. What am I doing wrong? I am just getting into regex, and from reading the docs I thought this should work. Any help appreciated.
Here is my hat in the ring. Careful study of this regex will teach a few lessons:
import re
reobj = re.compile(
    r"""# Loosely match a date/time reference
    ^                      # Anchor to start of string.
    \s*                    # Optional leading whitespace.
    (?P<time>              # $time: military or AM/PM time.
      (?:                  # Group for military hours options.
        [2][0-3]           # Hour is either 20, 21, 22, 23,
      | [01]?[0-9]         # or 0-9, 00-09 or 10-19
      )                    # End group of military hours options.
      (?:                  # Group for optional minutes.
        :                  # Hours and minutes separated by ":"
        [0-5][0-9]         # 00-59 minutes
      )?                   # Military minutes are optional.
    |                      # or time is given in AM/PM format.
      (?:1[0-2]|0?[1-9])   # 1-12 or 01-12 AM/PM options (hour)
      (?::[0-5][0-9])?     # Optional minutes for AM/PM time.
      \s*                  # Optional whitespace before AM/PM.
      [ap]m                # Required AM or PM (case insensitive)
    )                      # End group of time options.
    \s+                    # Required whitespace.
    (?P<offset> \d+ )      # $offset: count of time increments.
    \s+                    # Required whitespace.
    (?P<units>             # $units: units of time increment.
      (?:sec(?:ond)?|min(?:ute)?|hour|day|week|month|year|decade|century)
      s?                   # Time units may have optional plural "s".
    )                      # End $units: units of time increment.
    \s+                    # Required whitespace.
    (?P<dir>from|before|after|since)  # $dir: Time offset direction.
    \s+                    # Required whitespace.
    (?P<base>yesterday|today|tomorrow|(?:right\s+)?now)  # \s+ because VERBOSE ignores literal spaces.
    \s*                    # Optional whitespace before end.
    $                      # Anchor to end of string.""",
    re.IGNORECASE | re.VERBOSE)
match = reobj.match(' 3 pm 2 days from today')
if match:
    print('Time: %s' % (match.group('time')))
    print('Offset: %s' % (match.group('offset')))
    print('Units: %s' % (match.group('units')))
    print('Direction: %s' % (match.group('dir')))
    print('Base time: %s' % (match.group('base')))
else:
    print("No match.")
Output:
r"""
Time: 3 pm
Offset: 2
Units: days
Direction: from
Base time: today
"""
This regex illustrates a few lessons to be learned:
Regular expressions are very powerful (and useful)!
This regex does validate the numbers, but as you can see, doing so is cumbersome and difficult (and thus, not recommended - I'm showing it here to demonstrate why not to do it this way). It is much easier to simply capture the numbers with a regex then validate the ranges using procedural code.
Named capture groups ease the pain of plucking multiple data sub-strings from larger text.
Always write regexes using free-spacing, verbose mode with proper indentation of groups and lots of descriptive comments. This helps while writing the regex and later during maintenance.
Modern regular expressions comprise a rich and powerful language. Once you learn the syntax and develop a habit of writing verbose, properly indented, well-commented code, then even complex regexes such as the one above are easy to write, easy to read and are easy to maintain. It is unfortunate that they have acquired a reputation for being difficult, unwieldy and error-prone (and thus not recommendable for complex tasks).
Happy regexing!
What about
test = re.compile(r' ([0-9]|1[012])(am|pm) \d+ days from (yesterday|today|tomorrow)')
The hours part should match 0, 1, ..., 9 or 10, 11, 12, but not 13, 14, ..., 19.
You can limit the days part in a similar way for 1, ..., 1000, i.e. (1000|[1-9]\d{0,2}).
Try this:
test = re.compile(r' \d+(am|pm) \d+ days from (yesterday|today|tomorrow)')
Try this:
import re
test = re.compile(r'^\s[0-1]?[0-9]{1}pm \d+ days from (today|yesterday|tomorrow)$')
print(test.match(" 12pm 2 days from today"))
The problem that you're having is that you can't specify multiple digit numeric ranges in regex (afaik), so you have to treat them as individual characters.
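To make that concrete: inside square brackets, `0-12` is read as the character range 0-1 plus the single character 2, not as the numbers 0 to 12. A small demonstration, together with one way to spell "1 to 12" digit by digit (the helper `is_hour` is made up for this example):

```python
import re

# [0-12] is a character class: the range 0-1 plus the character 2.
print(re.findall(r'[0-12]', '0123456789'))
# ['0', '1', '2']

# [1-1000] is the range 1-1 plus three zeros, i.e. just the characters 0 and 1.
print(re.findall(r'[1-1000]', '0123456789'))
# ['0', '1']

# Spelling out 1..12 as digits instead: 10-12, or a single digit 1-9.
HOUR = re.compile(r'1[0-2]|[1-9]')

def is_hour(s):
    return HOUR.fullmatch(s) is not None

print([h for h in range(0, 15) if is_hour(str(h))])
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```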
If you want to extract the parts of the match individually, you can label the groups with (?P<name>[match]). For example:
import re
pattern = re.compile(
    r'\s*(?P<time>1?[0-9])(?P<ampm>am|pm)\s+'
    r'(?P<days>[1-9]\d*)\s+days\s+from\s+'
    r'(?P<when>yesterday|today|tomorrow)\s*')
for time in range(0, 13):
    for ampm in ('am', 'pm'):
        for days in range(1, 1000):
            for when in ('yesterday', 'today', 'tomorrow'):
                text = ' %d%s %d days from %s ' % (time, ampm, days, when)
                match = pattern.match(text)
                assert match is not None
                keys = sorted(match.groupdict().keys())
                assert keys == ['ampm', 'days', 'time', 'when']
text = ' 3pm 2 days from today '
print(pattern.match(text).groupdict())
Output:
{'time': '3', 'ampm': 'pm', 'days': '2', 'when': 'today'}
test = re.compile(r' 1?\d[ap]m \d{1,3} days? from (?:yesterday|today|tomorrow)')
EDIT
Having read the discussion between Rumple Stiltskin and Demian Brecht, I noticed that my proposition above is poor: it detects a certain string structure, but it doesn't validate that the string is a valid "time-pattern" string. For example, it accepts " 18pm 2 days from today".
So I now propose a pattern that precisely detects strings meeting your requirement, and that also flags strings with the same structure as a valid one but with values outside the required ranges:
import re
regx = re.compile(r"(?<= )"  # better than a blank as first character
                  r"(?:(1[012]|\d)([ap]m) (?!0 )(\d{1,3}|1000)"
                  r"|"
                  r"(\d+)([ap]m) (\d+))"
                  r" days? from (yesterday|today|tomorrow)")  # shared part
for ch in (" 12pm 2 days from today",
           " 4pm 1 day from today",
           " 12pm 0 days from today",
           " 12pm 1001 days from today",
           " 18pm 2 days from today",
           " 1212pm 2 days from today",
           " 12pm five days from today"):
    print(ch)
    mat = regx.search(ch)
    if mat:
        if mat.group(1):
            print(mat.group(1, 2, 3, 7), '\n# time-pattern-VALIDATED string #')
        else:
            print(mat.group(4, 5, 6, 7), '\n* SIMILI-time-pattern STRUCTURED string*')
    else:
        print('- NO STRUCTURED STRING in the text -')
    print()
result
12pm 2 days from today
('12', 'pm', '2', 'today')
# time-pattern-VALIDATED string #
4pm 1 day from today
('4', 'pm', '1', 'today')
# time-pattern-VALIDATED string #
12pm 0 days from today
('12', 'pm', '0', 'today')
* SIMILI-time-pattern STRUCTURED string*
12pm 1001 days from today
('12', 'pm', '1001', 'today')
* SIMILI-time-pattern STRUCTURED string*
18pm 2 days from today
('18', 'pm', '2', 'today')
* SIMILI-time-pattern STRUCTURED string*
1212pm 2 days from today
('1212', 'pm', '2', 'today')
* SIMILI-time-pattern STRUCTURED string*
12pm five days from today
- NO STRUCTURED STRING in the text -
If you only need a regex that detects a validated time-pattern string, use just the first alternative:
regx = re.compile(r"(?<= )(1[012]|\d)([ap]m) (?!0 )(\d{1,3}|1000) days?"
                  r" from (yesterday|today|tomorrow)")
It is easier (and more readable) to check integer ranges after the match:
m = re.match(r' (\d+)(?:pm|am) (\d+) days from (yesterday|today|tomorrow)',
" 3pm 2 days from today")
assert m and int(m.group(1)) <= 12 and 1 <= int(m.group(2)) <= 1000
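Or you could use an existing library, e.g. parsedatetime (pip install parsedatetime):
import parsedatetime as pdt  # older releases: import parsedatetime.parsedatetime as pdt
cal = pdt.Calendar()
print(cal.parse("3pm 2 days from today"))
Output (from a run in April 2011; the date depends on when the code is run)
((2011, 4, 26, 15, 0, 0, 1, 116, -1), 3)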
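For comparison, the narrow phrase format from the question can also be resolved with the standard library alone: capture the numbers with a regex, validate the ranges in code, then do the date arithmetic with datetime. This is a minimal sketch, not a substitute for a real parser; the helper `resolve`, its `now` parameter and the group names are all made up for this example.

```python
import re
from datetime import datetime, timedelta

PATTERN = re.compile(
    r'(?P<hour>\d+)(?P<ampm>am|pm)\s+(?P<days>\d+)\s+days?\s+from\s+'
    r'(?P<base>yesterday|today|tomorrow)'
)
BASE_OFFSET = {'yesterday': -1, 'today': 0, 'tomorrow': 1}

def resolve(text, now):
    """Turn e.g. '3pm 2 days from today' into a datetime relative to `now`."""
    m = PATTERN.fullmatch(text.strip())
    if m is None:
        raise ValueError(f'unrecognised phrase: {text!r}')
    hour, days = int(m.group('hour')), int(m.group('days'))
    # Range checks are done in code, not in the regex.
    if not (1 <= hour <= 12 and 1 <= days <= 1000):
        raise ValueError(f'value out of range in: {text!r}')
    # 12am is hour 0 and 12pm is hour 12 on a 24-hour clock.
    hour24 = hour % 12 + (12 if m.group('ampm') == 'pm' else 0)
    target = now + timedelta(days=BASE_OFFSET[m.group('base')] + days)
    return target.replace(hour=hour24, minute=0, second=0, microsecond=0)

print(resolve(' 3pm 2 days from today', datetime(2011, 4, 24, 9, 30)))
# 2011-04-26 15:00:00
```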
