python re.compile strings with vars and numbers - python

Hi I want to get a match for the following:
test = re.compile(r' [0-12](am|pm) [1-1000] days from (yesterday|today|tomorrow)')
with this match:
print test.match(" 3pm 2 days from today")
It returns None. What am I doing wrong? I am just getting into regex, and from reading the docs I thought this should work. Any help appreciated.
chrism
--------------------------------------------------------------------------------------
I am asking a new question about the design of a system using a process similar to the one above in NLP HERE

Here is my hat in the ring. Careful study of this regex will teach a few lessons:
import re
reobj = re.compile(
r"""# Loosely match a date/time reference
^ # Anchor to start of string.
\s* # Optional leading whitespace.
(?P<time> # $time: military or AM/PM time.
(?: # Group for military hours options.
[2][0-3] # Hour is either 20, 21, 22, 23,
| [01]?[0-9] # or 0-9, 00-09 or 10-19
) # End group of military hours options.
(?: # Group for optional minutes.
: # Hours and minutes separated by ":"
[0-5][0-9] # 00-59 minutes
)? # Military minutes are optional.
| # or time is given in AM/PM format.
(?:1[0-2]|0?[1-9]) # 1-12 or 01-12 AM/PM options (hour)
(?::[0-5][0-9])? # Optional minutes for AM/PM time.
\s* # Optional whitespace before AM/PM.
[ap]m # Required AM or PM (case insensitive)
) # End group of time options.
\s+ # Required whitespace.
(?P<offset> \d+ ) # $offset: count of time increments.
\s+ # Required whitespace.
(?P<units> # $units: units of time increment.
(?:sec(?:ond)?|min(?:ute)?|hour|day|week|month|year|decade|century)
s? # Time units may have optional plural "s".
) # End $units: units of time increment.
\s+ # Required whitespace.
(?P<dir>from|before|after|since) # $dir: Time offset direction.
\s+ # Required whitespace.
(?P<base>yesterday|today|tomorrow|(?:right )?now)
\s* # Optional whitespace before end.
$ # Anchor to end of string.""",
re.IGNORECASE | re.VERBOSE)
match = reobj.match(' 3 pm 2 days from today')
if match:
    print('Time: %s' % (match.group('time')))
    print('Offset: %s' % (match.group('offset')))
    print('Units: %s' % (match.group('units')))
    print('Direction: %s' % (match.group('dir')))
    print('Base time: %s' % (match.group('base')))
else:
    print("No match.")
Output:
r"""
Time: 3 pm
Offset: 2
Units: days
Direction: from
Base time: today
"""
This regex illustrates a few lessons to be learned:
Regular expressions are very powerful (and useful)!
This regex does validate the numbers, but as you can see, doing so is cumbersome and difficult (and thus, not recommended - I'm showing it here to demonstrate why not to do it this way). It is much easier to simply capture the numbers with a regex then validate the ranges using procedural code.
Named capture groups ease the pain of plucking multiple data sub-strings from larger text.
Always write regexes using free-spacing, verbose mode with proper indentation of groups and lots of descriptive comments. This helps while writing the regex and later during maintenance.
Modern regular expressions comprise a rich and powerful language. Once you learn the syntax and develop a habit of writing verbose, properly indented, well-commented code, then even complex regexes such as the one above are easy to write, easy to read and are easy to maintain. It is unfortunate that they have acquired a reputation for being difficult, unwieldy and error-prone (and thus not recommendable for complex tasks).
Happy regexing!

What about:
test = re.compile(r' ([0-9]|1[012])(am|pm) \d+ days from (yesterday|today|tomorrow)')
The hours part matches 0, 1, ..., 9 or 10, 11, 12, but not 13, 14, ..., 19.
You can limit the days part to 1, ..., 1000 in a similar way, e.g. (1000|[1-9]\d{0,2}).

Try this:
test = re.compile(r' \d+(am|pm) \d+ days from (yesterday|today|tomorrow)')

Try this:
import re
test = re.compile(r'^\s[0-1]?[0-9](?:am|pm) \d+ days from (today|yesterday|tomorrow)$')
print(test.match(" 12pm 2 days from today"))
The problem you're having is that a character class such as [0-12] matches a single character (here 0, 1 or 2), not a multi-digit numeric range, so you have to build the number digit by digit.
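To make the point concrete, here is a small sketch (the names naive and hours are only for illustration) showing that [0-12] behaves like [012], plus one possible way to spell the range 1 to 12 with alternation:

```python
import re

# [0-12] is a character class: the range 0-1 plus the single character 2.
naive = re.compile(r'[0-12]')
print([c for c in '0123456789' if naive.fullmatch(c)])  # ['0', '1', '2']
print(naive.fullmatch('12'))  # None: a class matches exactly one character

# A numeric range has to be spelled out digit by digit.
hours = re.compile(r'1[0-2]|[1-9]')
print([n for n in range(0, 20) if hours.fullmatch(str(n))])
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```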

If you want to extract the parts of the match individually, you can label the groups with (?P<name>...). For example:
import re
pattern = re.compile(
r'\s*(?P<time>1?[0-9])(?P<ampm>am|pm)\s+'
r'(?P<days>[1-9]\d*)\s+days\s+from\s+'
r'(?P<when>yesterday|today|tomorrow)\s*')
for time in range(0, 13):
    for ampm in ('am', 'pm'):
        for days in range(1, 1000):
            for when in ('yesterday', 'today', 'tomorrow'):
                text = ' %d%s %d days from %s ' % (time, ampm, days, when)
                match = pattern.match(text)
                assert match is not None
                keys = sorted(match.groupdict().keys())
                assert keys == ['ampm', 'days', 'time', 'when']
text = ' 3pm 2 days from today '
print(pattern.match(text).groupdict())
Output:
{'time': '3', 'ampm': 'pm', 'days': '2', 'when': 'today'}

test = re.compile(' 1?\d[ap]m \d{1,3} days? from (?:yesterday|today|tomorrow)')
EDIT
Having read the discussion between Rumple Stiltskin and Demian Brecht, I noticed that my proposal above is poor: it detects a certain string structure, but it does not validate that the string is a good "time-pattern" string. For example, it accepts " 18pm 2 days from today".
So I now propose a pattern that precisely detects a string meeting your requirements, and that also flags every string that has the same structure as a valid one but whose values are out of range:
import re
regx = re.compile(r"(?<= )"  # better than a blank as first character
                  r"(?:(1[012]|\d)([ap]m) (?!0 )(\d{1,3}|1000)"
                  r"|"
                  r"(\d+)([ap]m) (\d+))"
                  r" days? from (yesterday|today|tomorrow)")  # shared part
for ch in (" 12pm 2 days from today",
           " 4pm 1 day from today",
           " 12pm 0 days from today",
           " 12pm 1001 days from today",
           " 18pm 2 days from today",
           " 1212pm 2 days from today",
           " 12pm five days from today"):
    print(ch)
    mat = regx.search(ch)
    if mat:
        if mat.group(1):
            print(mat.group(1, 2, 3, 7), '\n# time-pattern-VALIDATED string #')
        else:
            print(mat.group(4, 5, 6, 7), '\n* SIMILI-time-pattern STRUCTURED string*')
    else:
        print('- NO STRUCTURED STRING in the text -')
    print()
result
12pm 2 days from today
('12', 'pm', '2', 'today')
# time-pattern-VALIDATED string #
4pm 1 day from today
('4', 'pm', '1', 'today')
# time-pattern-VALIDATED string #
12pm 0 days from today
('12', 'pm', '0', 'today')
* SIMILI-time-pattern STRUCTURED string*
12pm 1001 days from today
('12', 'pm', '1001', 'today')
* SIMILI-time-pattern STRUCTURED string*
18pm 2 days from today
('18', 'pm', '2', 'today')
* SIMILI-time-pattern STRUCTURED string*
1212pm 2 days from today
('1212', 'pm', '2', 'today')
* SIMILI-time-pattern STRUCTURED string*
12pm five days from today
- NO STRUCTURED STRING in the text -
If you only need a regex that detects a validated time-pattern string, use just:
regx = re.compile(r"(?<= )(1[012]|\d)([ap]m) (?!0 )(\d{1,3}|1000) days?"
                  r" from (yesterday|today|tomorrow)")

It is easier (and more readable) to check integer ranges after the match:
m = re.match(r' (\d+)(?:pm|am) (\d+) days from (yesterday|today|tomorrow)',
" 3pm 2 days from today")
assert m and int(m.group(1)) <= 12 and 1 <= int(m.group(2)) <= 1000
Or you could use an existing library e.g., pip install parsedatetime:
import parsedatetime.parsedatetime as pdt
cal = pdt.Calendar()
print cal.parse("3pm 2 days from today")
Output
((2011, 4, 26, 15, 0, 0, 1, 116, -1), 3)
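The capture-then-validate idea from this answer can be sketched as a small function. This is only an illustration under stated assumptions: the name parse_reference, the BASE_OFFSET table, the explicit now argument and the accepted ranges (hours 1 to 12, days 1 to 1000) are choices made for the sketch, not part of any library.

```python
import re
from datetime import datetime, timedelta

# Capture loosely; the numeric ranges are checked in plain Python below.
PATTERN = re.compile(
    r'\s*(\d+)(am|pm)\s+(\d+)\s+days?\s+from\s+(yesterday|today|tomorrow)\s*$')
BASE_OFFSET = {'yesterday': -1, 'today': 0, 'tomorrow': 1}  # hypothetical helper table


def parse_reference(text, now):
    """Return the datetime the phrase refers to, or None if it is not valid."""
    m = PATTERN.match(text)
    if m is None:
        return None
    hour, days = int(m.group(1)), int(m.group(3))
    if not (1 <= hour <= 12 and 1 <= days <= 1000):
        return None
    # 12am is hour 0 and 12pm is hour 12 on a 24-hour clock.
    hour24 = hour % 12 + (12 if m.group(2) == 'pm' else 0)
    day = now.date() + timedelta(days=days + BASE_OFFSET[m.group(4)])
    return datetime(day.year, day.month, day.day, hour24)


print(parse_reference(" 3pm 2 days from today", datetime(2011, 4, 24, 10, 0)))
# 2011-04-26 15:00:00
```

Passing now explicitly keeps the function deterministic, which makes it easy to test.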

Related

How to convert strings with UTC-# at the end to a DateTimeField in Python?

I have a list of strings formatted as follows: '2/24/2021 3:37:04 PM UTC-6'
How would I convert this?
I have tried
datetime.strptime(my_date_object, '%m/%d/%Y %I:%M:%s %p %Z')
but I get an error saying "unconverted data remains: -6"
Is this because of the UTC-6 at the end?
The approach that @MrFuppes mentioned in their comment is the easiest way to do this.
OK, it seems you need to split the string on 'UTC' and parse the offset separately. You can then set the tzinfo from a timedelta:
import datetime

input_string = '2/24/2021 3:37:04 PM UTC-6'
try:
    dtm_string, utc_offset = input_string.split("UTC", maxsplit=1)
except ValueError:
    # Couldn't split the string, so no "UTC" in the string
    print("Warning! No timezone!")
    dtm_string = input_string
    utc_offset = "0"
dtm_string = dtm_string.strip()  # Remove leading/trailing whitespace: '2/24/2021 3:37:04 PM'
utc_offset = int(utc_offset)  # Convert UTC offset to integer: -6
tzinfo = datetime.timezone(datetime.timedelta(hours=utc_offset))
result_datetime = datetime.datetime.strptime(dtm_string, '%m/%d/%Y %I:%M:%S %p').replace(tzinfo=tzinfo)
print(result_datetime)
# prints 2021-02-24 15:37:04-06:00
Alternatively, you can avoid datetime.strptime and extract the relevant components with a regular expression:
import datetime
import re

rex = r"(\d{1,2})\/(\d{1,2})\/(\d{4}) (\d{1,2}):(\d{2}):(\d{2}) (AM|PM) UTC(\+|-)(\d{1,2})"
input_string = '2/24/2021 3:37:04 PM UTC-6'
r = re.findall(rex, input_string)
# gives: [('2', '24', '2021', '3', '37', '04', 'PM', '-', '6')]
mm = int(r[0][0])
dd = int(r[0][1])
yy = int(r[0][2])
hrs = int(r[0][3]) % 12  # 12 AM is hour 0; 12 PM becomes 12 after the adjustment below
mins = int(r[0][4])
secs = int(r[0][5])
if r[0][6].upper() == "PM":
    hrs = hrs + 12
tzoffset = int(f"{r[0][7]}{r[0][8]}")
tzinfo = datetime.timezone(datetime.timedelta(hours=tzoffset))
result_datetime = datetime.datetime(yy, mm, dd, hrs, mins, secs, tzinfo=tzinfo)
print(result_datetime)
# prints 2021-02-24 15:37:04-06:00
The regular expression (\d{1,2})\/(\d{1,2})\/(\d{4}) (\d{1,2}):(\d{2}):(\d{2}) (AM|PM) UTC(\+|-)(\d{1,2})
Explanation:
(\d{1,2}): One or two digits. Surrounding parentheses indicate that this is a capturing group. A similar construct is used to get the month, date and hours, and UTC offset
\/: A forward slash
(\d{4}): Exactly four digits. Also a capturing group. A similar construct is used for minutes and seconds.
(AM|PM): Either "AM" or "PM"
UTC(\+|-)(\d{1,2}): "UTC", followed by a plus or minus sign, followed by one or two digits.
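A third variant, not taken from the answers above, is to rewrite the trailing "UTC-6" into the "-0600" form that strptime's %z directive understands and then parse everything in one call. This is a sketch that assumes whole-hour offsets; the function name parse_with_utc_offset is hypothetical.

```python
import re
from datetime import datetime, timedelta


def parse_with_utc_offset(text):
    """Parse strings such as '2/24/2021 3:37:04 PM UTC-6' (whole-hour offsets only)."""
    # Turn a trailing "UTC-6" / "UTC+10" into "-0600" / "+1000" so %z can read it.
    normalized = re.sub(
        r'UTC([+-])(\d{1,2})$',
        lambda m: '%s%02d00' % (m.group(1), int(m.group(2))),
        text.strip())
    return datetime.strptime(normalized, '%m/%d/%Y %I:%M:%S %p %z')


result = parse_with_utc_offset('2/24/2021 3:37:04 PM UTC-6')
print(result)  # 2021-02-24 15:37:04-06:00
```

If the string has no "UTC" suffix, strptime raises ValueError because %z finds nothing to parse, so callers that expect naive timestamps would need to handle that case separately.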

joining multiple regular expression for readability

I need to handle dates that can be in either of the following formats:
mm/dd/yyyy or dd Mon YYYY
A few examples are shown below:
04/20/2009 and 24 Jan 2001
To handle this I have written the regular expression below.
A few text scenarios are shown below:
txt1 = 'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox
+ fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical
Review of Systems Constitutional:'
txt2 = "s The patient is a 44 year old married Caucasian woman,
unemployed Decorator, living with husband and caring for two young
children, who is referred by Capitol Hill Hospital PCP, Dr. Heather
Zubia, for urgent evaluation/treatment till first visit with Dr. Toney
Winkler IN EIGHT WEEKS on 24 Jan 2001."
date = re.findall(r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}', txtData)
I am not getting 24 Jan 2001 where as if I run individually (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}' I am able to get output.
Question 1: What is bug in above expression?
Question 2: I want to combine both to make more readable as I have to parse any other formats so I used join as shown below
RE1 = '(?:\b(?<!\.)[\d{0,2}]+) (?:[/-]\d{0,}[/-]\d{2,4})'
RE2 = '(?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
regex_all = '|'.join([RE1, RE2])
regex_all = re.compile(regex_all)
date = regex_all.findall(txtData) // notice here txtData can be any one of the above string.
With the joined pattern I get NaN as the output for date.
Please suggest what the mistake is when I join them.
Thanks for your help.
Note that it is a very bad idea to join such long patterns when they can also match at the same location within the string. That causes the regex engine to backtrack too much and can lead to slowdowns or crashes. If there is a way to rewrite the alternations so that they can only match at different locations, or to get rid of them completely, do it.
Besides, you should use grouping constructs (...) to group sequences of patterns, and only use [...] character classes when you need to match specific chars.
Also, your alternatives overlap, so you can combine them easily. See the fixed regex:
\b(?<!\.)\d{1,2}(?:[/-]\d+[/-]|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b
Details
\b - a word boundary
(?<!\.) - no . immediately to the left of the current location
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing alternation group:
[/-]\d+[/-] - / or -, 1+ digits, - or /
| - or
(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)) - th, st, nd or rd (optionally), followed by 0+ whitespaces and then a month name; the final ) closes the alternation group
\s* - 0+ whitespaces
(?:\d{4}|\d{2}) - 2 or 4 digits
\b - trailing word boundary.
Another note: if you want to match the date-like strings with two matching delimiters, you will need to capture the first one, and use a backreference to match the second one, see this regex demo. In Python, you would need a re.finditer to get those matches.
See this Python demo:
import re
rx = r"\b(?<!\.)\d{1,2}(?:([/-])\d+\1|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b"
s = "Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox\nfluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical\nReview of Systems Constitutional:\n\nThe patient is a 44 year old married Caucasian woman, unemployed Decorator, living with husband and caring for two young children, who is referred by Capitol Hill Hospital PCP, Dr. Heather Zubia, for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001"
print([x.group(0) for x in re.finditer(rx, s, re.I)])
# => ['7/11/77', '24 Jan 2001']
I think your approach is too complicated. I suggest using a combination of a simple regex and strptime().
import re
from datetime import datetime
date_formats = ['%m/%d/%Y', '%d %b %Y']
pattern = re.compile(r'\b(\d\d?/\d\d?/\d{4}|\d\d? \w{3} \d{4})\b')
data = "... your string ..."
for match in re.findall(pattern, data):
    print("Trying to parse '%s'" % match)
    for fmt in date_formats:
        try:
            date = datetime.strptime(match, fmt)
            print(" OK:", date)
            break
        except ValueError:
            # This format does not fit (or the date does not exist); try the next one
            pass
The advantage of this approach is, besides a much more manageable regex, that it won't pick dates that look plausible but do not exist, like 2/29/2001 (whereas 2/29/2004 works).
r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
You should use raw strings (r'foo') for every piece of the pattern, not only the first one. That way backslashes (\) reach the re library unchanged instead of being interpreted as Python string escapes.
[abc|def] matches any single one of the characters listed between the brackets (including |), while (one|two|three) matches any one of the whole alternatives one, two or three.
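The difference between a character class and a group is easy to see on a deliberately odd input. The following sketch uses made-up test strings; the names as_class and as_group are only for illustration.

```python
import re

# Character class: [th|st|nd]* means "any run of the characters t, h, |, s, n, d".
as_class = re.compile(r'\d{1,2}[th|st|nd]*')
print(as_class.match('24thhh|dd').group(0))  # 24thhh|dd

# Group with alternation: matches one whole suffix, at most once.
as_group = re.compile(r'\d{1,2}(?:th|st|nd|rd)?')
print(as_group.match('24thhh|dd').group(0))  # 24th

# The same applies to month names: [Jan|Feb] is a single character.
print(bool(re.fullmatch(r'[Jan|Feb][a-z]*', 'nope')))    # True
print(bool(re.fullmatch(r'(?:Jan|Feb)[a-z]*', 'nope')))  # False
```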

regex to match several string in Python

I have questions with regex in Python.
Possible variations can be
10 hours, 12 weeks or 7 business days.
I want to have my regex something like
string = "I have an 7 business day trip and 12 weeks vacation."
re.findall(r'\d+\s(business)?\s(hours|weeks|days)', string)
so that I expect to find "7 business day" and "12 weeks" but it returns None
string = "I have an 7 business day trip and 12 weeks vacation."
print re.findall(r'\d+\s(?:business\s)?(?:hour|week|day)s?', string)
['7 business day', '12 weeks']
\d+\s(?:business\s)?(?:hour|week|day)s?
Yours did not match for two reasons: (business)?\s still requires a second whitespace character when "business" is absent (so "12 weeks" fails), and it only accepts the plural units (so "7 business day" fails).
Although if you don't want to accept business week/hour, you'll need to modify it further:
\d+\s(?:hour|week|(?:business )?day)s?
You need to tweak your regex to this:
>>> string = "I have an 7 business day trip and 12 weeks vacation."
>>> print re.findall(r'(\d+)\s*(?:business day|hour|week)s?', string)
['7', '12']
This matches any number that is followed by business day or hour or week and an optional s in the end.
Similar to @anubhava's answer but matches "7 business day" rather than just "7". Just move the closing parenthesis from after \d+ to the end:
re.findall(r'(\d+\s*(?:business day|hour|week)s?)', string)
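The reason the position of the parentheses matters is that re.findall returns different things depending on how many capturing groups the pattern has. A short sketch using the question's sentence:

```python
import re

s = "I have an 7 business day trip and 12 weeks vacation."

# No capturing groups: findall returns the whole match.
whole = re.findall(r'\d+\s(?:business\s)?(?:hour|week|day)s?', s)
print(whole)   # ['7 business day', '12 weeks']

# One capturing group: findall returns only that group.
single = re.findall(r'(\d+)\s(?:business\s)?(?:hour|week|day)s?', s)
print(single)  # ['7', '12']

# Several capturing groups: findall returns one tuple per match,
# with '' for groups that did not take part in the match.
several = re.findall(r'(\d+)\s(business\s)?(hour|week|day)s?', s)
print(several)  # [('7', 'business ', 'day'), ('12', '', 'week')]
```

Using non-capturing groups (?:...) for everything that is only there for grouping keeps findall returning the full matched text.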
\d+\s+(business\s)?(hour|week|day)s?

Group naming with group and nested regex (unit conversion from text file)

Basic Question:
How can you name a python regex group with another group value and nest this within a larger regex group?
Origin of Question:
Given a string such as 'Your favorite song is 1 hour 23 seconds long. My phone only records for 1 h 30 mins and 10 secs.'
What is an elegant solution for extracting the times and converted to a given unit?
Attempted Solution:
My best guess at a solution would be to create a dictionary and then perform operations on the dictionary to convert to the desired unit.
i.e. convert the given string to this:
string[0]:
{'time1': {'day':0, 'hour':1, 'minutes':0, 'seconds':23, 'milliseconds':0}, 'time2': {'day':0, 'hour':1, 'minutes':30, 'seconds':10, 'milliseconds':0}}
string[1]:
{'time1': {'day':4, 'hour':2, 'minutes':3, 'seconds':6, 'milliseconds':30}}
I have a regex solution, but it isn't doing what I would like:
import re
test_string = ['Your favorite song is 1 hour 23 seconds long. My phone only records for 1h 30 mins and 10 secs.',
'This video is 4 days 2h 3min 6sec 30ms']
year_units = ['year', 'years', 'y']
day_units = ['day', 'days', 'd']
hour_units = ['hour', 'hours', 'h']
min_units = ['minute', 'minutes', 'min', 'mins', 'm']
sec_units = ['second', 'seconds', 'sec', 'secs', 's']
millisec_units = ['millisecond', 'milliseconds', 'millisec', 'millisecs', 'ms']
all_units = '|'.join(year_units + day_units + hour_units + min_units + sec_units + millisec_units)
print((all_units))
# pattern = r"""(?P<time> # time group beginning
# (?P<value>[\d]+) # value of time unit
# \s* # may or may not be space between digit and unit
# (?P<unit>%s) # unit measurement of time
# \s* # may or may not be space between digit and unit
# )
# \w+""" % all_units
pattern = r""".*(?P<time> # time group beginning
(?P<value>[\d]+) # value of time unit
\s* # may or may not be space between digit and unit
(?P<unit>%s) # unit measurement of time
\s* # may or may not be space between digit and unit
).* # may be words in between the times
""" % (all_units)
regex = re.compile(pattern)
for val in test_string:
match = regex.search(val)
print(match)
print(match.groupdict())
This fails miserably due to not being able to properly deal with nested groupings and not being able to assign a name with the value of a group.
First of all, you can't just write a multiline regex with comments and expect it to match anything if you don't use the re.VERBOSE flag:
regex = re.compile(pattern, re.VERBOSE)
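The effect of the flag can be checked with a reduced version of the pattern. In this sketch the unit list is shortened to two hypothetical entries to keep it small:

```python
import re

pattern = r"""
    (?P<value>\d+)   # numeric value
    \s*              # optional whitespace
    (?P<unit>h|min)  # unit, shortened list for the example
"""

# Without re.VERBOSE the newlines, indentation and "# ..." text are literal
# characters that the input would have to contain.
plain = re.compile(pattern)
print(plain.search('2h 3min'))  # None

# With re.VERBOSE, whitespace and comments in the pattern are ignored.
verbose = re.compile(pattern, re.VERBOSE)
print(verbose.search('2h 3min').groupdict())  # {'value': '2', 'unit': 'h'}
```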
Like you said, the best solution is probably to use a dict
for val in test_string:
    while True:  # find all times
        match = regex.search(val)  # find the first unit
        if not match:
            break
        matches = {}  # keep track of all units and their values
        while True:
            matches[match.group('unit')] = int(match.group('value'))  # add the match to the dict
            val = val[match.end():]  # remove part of the string so subsequent matches must start at index 0
            m = regex.search(val)
            if not m or m.start() != 0:  # if there are no more matches or there's text between this match and the next, abort
                break
            match = m
        print(matches)  # the finished dict
        # output will be like {'h': 1, 'secs': 10, 'mins': 30}
However, the code above won't work just yet. We need to make two adjustments:
The pattern cannot allow just any text between matches. To allow only whitespace and the word "and" between two matches, you can use
pattern = r"""(?P<time> # time group beginning
(?P<value>[\d]+) # value of time unit
\s* # may or may not be space between digit and unit
(?P<unit>%s) # unit measurement of time
\s* # may or may not be space between digit and unit
(?:\band\s+)? # allow the word "and" between numbers
) # time group end
""" % (all_units)
You have to change the order of your units like so:
year_units = ['years', 'year', 'y'] # yearS before year
day_units = ['days', 'day', 'd'] # dayS before day, etc...
Why? Because the alternation tries its alternatives from left to right. Given a text like 3 years and 1 day, year would match first, so the regex would consume 3 year instead of 3 years and.
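Reordering each list by hand is easy to get wrong, because spellings from different lists also collide (for example m from the minute list is tried before ms). One possible alternative, shown here as a sketch rather than as the answer's method, is to sort all spellings longest first before joining them and to end the unit with a word boundary. The names UNIT_ALIASES, CANONICAL and find_pairs are hypothetical.

```python
import re

# Hypothetical alias table: canonical unit name -> accepted spellings.
UNIT_ALIASES = {
    'day': ['day', 'days', 'd'],
    'hour': ['hour', 'hours', 'h'],
    'minute': ['minute', 'minutes', 'min', 'mins', 'm'],
    'second': ['second', 'seconds', 'sec', 'secs', 's'],
    'millisecond': ['millisecond', 'milliseconds', 'millisec', 'millisecs', 'ms'],
}
CANONICAL = {alias: name
             for name, aliases in UNIT_ALIASES.items()
             for alias in aliases}

# Longest spellings first, so "ms" is tried before "m" and "days" before "day".
alternation = '|'.join(sorted(map(re.escape, CANONICAL), key=len, reverse=True))
PAIR = re.compile(r'(?P<value>\d+)\s*(?P<unit>%s)\b' % alternation)


def find_pairs(text):
    """Return (value, canonical unit) for every number/unit pair in text."""
    return [(int(m.group('value')), CANONICAL[m.group('unit')])
            for m in PAIR.finditer(text)]


print(find_pairs('This video is 4 days 2h 3min 6sec 30ms'))
# [(4, 'day'), (2, 'hour'), (3, 'minute'), (6, 'second'), (30, 'millisecond')]
```

Mapping every spelling to a canonical name also avoids ending up with mixed keys such as 'h' and 'hours' in the result.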

Python regex: How to match a number with a non-number?

I want to separate a number from the non-digit characters next to it with a space.
Example
Input:
we spend 100year
Output:
we spend 100 year
Input:
today i'm200 pound
Output
today i'm 200 pound
Input:
he maybe have212cm
Output:
he maybe have 212 cm
I tried re.sub(r'(?<=\S)\d', ' \d', string) and re.sub(r'\d(?=\S)', '\d ', string), which doesn't work.
This will do it:
import re

ins = '''\
we spend 100year
today i'm200 pound
he maybe have212cm'''
for line in ins.splitlines():
    line = re.sub(r'\s*(\d+)\s*', r' \1 ', line)
    print(line)
Prints:
we spend 100 year
today i'm 200 pound
he maybe have 212 cm
Same syntax for multiple matches in the same line of text:
>>> re.sub(r'\s*(\d+)\s*',r' \1 ', "we spend 100year + today i'm200 pound")
"we spend 100 year + today i'm 200 pound"
The capturing groups (generally) are numbered left to right and the \number refers to each numbered group in the match:
>>> re.sub(r'(\d)(\d)(\d)',r'\2\3\1','567')
'675'
If it is easier to read, you can name your capturing groups rather than using the \1 \2 notation:
>>> line="we spend 100year today i'm200 pound"
>>> re.sub(r'\s*(?P<nums>\d+)\s*',r' \g<nums> ',line)
"we spend 100 year today i'm 200 pound"
This takes care of one case:
>>> re.sub(r'([a-zA-Z])(?=\d)',r'\1 ',s)
'he maybe have 212cm'
And this takes care of the other:
>>> re.sub(r'(?<=\d)([a-zA-Z])',r' \1',s)
'he maybe have212 cm'
Hopefully someone with more regex experience than me can figure out how to combine them ...
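One way the two substitutions might be combined is a single pattern made of two zero-width alternatives, one for each direction, so that the replacement simply inserts a space at every letter/digit boundary. This is a sketch; the names BOUNDARY and space_out_numbers are made up for the example.

```python
import re

# Zero-width match wherever a letter is directly followed by a digit,
# or a digit is directly followed by a letter.
BOUNDARY = re.compile(r'(?<=[a-zA-Z])(?=\d)|(?<=\d)(?=[a-zA-Z])')


def space_out_numbers(text):
    """Insert a space at every letter/digit boundary."""
    return BOUNDARY.sub(' ', text)


for s in ("we spend 100year", "today i'm200 pound", "he maybe have212cm"):
    print(space_out_numbers(s))
# we spend 100 year
# today i'm 200 pound
# he maybe have 212 cm
```

Because both alternatives are lookarounds, nothing is consumed, so existing spaces are left alone and no trailing space is added.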
