Python regex: How to match a number with a non-number? - python

I want to split a number from the characters next to it.
Example
Input:
we spend 100year
Output:
we spend 100 year
Input:
today i'm200 pound
Output:
today i'm 200 pound
Input:
he maybe have212cm
Output:
he maybe have 212 cm
I tried re.sub(r'(?<=\S)\d', ' \d', string) and re.sub(r'\d(?=\S)', '\d ', string), which doesn't work.

This will do it:
import re
ins='''\
we spend 100year
today i'm200 pound
he maybe have212cm'''
for line in ins.splitlines():
    line=re.sub(r'\s*(\d+)\s*',r' \1 ', line)
    print(line)
Prints:
we spend 100 year
today i'm 200 pound
he maybe have 212 cm
Same syntax for multiple matches in the same line of text:
>>> re.sub(r'\s*(\d+)\s*',r' \1 ', "we spend 100year + today i'm200 pound")
"we spend 100 year + today i'm 200 pound"
The capturing groups (generally) are numbered left to right and the \number refers to each numbered group in the match:
>>> re.sub(r'(\d)(\d)(\d)',r'\2\3\1','567')
'675'
If it is easier to read, you can name your capturing groups rather than using the \1 \2 notation:
>>> line="we spend 100year today i'm200 pound"
>>> re.sub(r'\s*(?P<nums>\d+)\s*',r' \g<nums> ',line)
"we spend 100 year today i'm 200 pound"

This takes care of one case:
>>> s = 'he maybe have212cm'
>>> re.sub(r'([a-zA-Z])(?=\d)',r'\1 ',s)
'he maybe have 212cm'
And this takes care of the other:
>>> re.sub(r'(?<=\d)([a-zA-Z])',r' \1',s)
'he maybe have212 cm'
Hopefully someone with more regex experience than me can figure out how to combine them ...
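One possible way to combine the two substitutions, as a sketch: alternate the two zero-width positions (letter followed by digit, digit followed by letter) in a single pattern and replace each with a space. The helper name split_numbers is made up for illustration, and the pattern only considers ASCII letters, like the two patterns above.

```python
import re

# Zero-width alternation: the position between a letter and a digit,
# or between a digit and a letter. Nothing is consumed, so the
# substitution only inserts a space at that position.
BOUNDARY = re.compile(r'(?<=[a-zA-Z])(?=\d)|(?<=\d)(?=[a-zA-Z])')

def split_numbers(text):
    """Insert a space wherever an ASCII letter touches a digit (hypothetical helper)."""
    return BOUNDARY.sub(' ', text)

for s in ("we spend 100year", "today i'm200 pound", "he maybe have212cm"):
    print(split_numbers(s))
```

Because existing spaces are never matched, text that is already separated is left untouched.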

Related

Dealing with comma and fullstops as per convention

I have various instance of strings such as:
- hello world,i am 2000to -> hello world, i am 2000 to
- the state was 56,869,12th -> the state was 56,869, 12th
- covering.2% -> covering. 2%
- fiji,295,000 -> fiji, 295,000
For dealing with first case, I came up with two step regex:
re.sub(r"(?<=[,])(?=[^\s])(?=[^0-9])", r" ", text) # hello world, i am 20,000to
re.sub(r"(?<=[0-9])(?=[.^[a-z])", r" ", text) # hello world, i am 20,000 to
But this breaks the text in some different ways and other cases are not covered as well. Can anyone suggest a more general regex that solves all cases properly. I've tried using replace, but it does some unintended replacements which in turn raise some other problems. I'm not an expert in regex, would appreciate pointers.
This approach covers your cases above by breaking the text into tokens:
import re
in_list = [
    'hello world,i am 2000to',
    'the state was 56,869,12th',
    'covering.2%',
    'fiji,295,000',
    'and another example with a decimal 12.3not4,5 is right out',
    'parrot,, is100.00% dead',
    'Holy grail runs for this portion of 100 minutes,!, 91%. Fascinating'
]
tokenizer = re.compile(r'[a-zA-Z]+[\.,]?|(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?(?:%|st|nd|rd|th)?[\.,]?')
for s in in_list:
    print(' '.join(re.findall(pattern=tokenizer, string=s)))
# hello world, i am 2000 to
# the state was 56,869, 12th
# covering. 2%
# fiji, 295,000
# and another example with a decimal 12.3 not 4, 5 is right out
# parrot, is 100.00% dead
# Holy grail runs for this portion of 100 minutes, 91%. Fascinating
Breaking up the regex, each token is the longest available substring with:
- Only letters, with or without a trailing period or comma: [a-zA-Z]+[\.,]?
- OR |
- A number-ish expression, which could be:
  - 1 to 3 digits \d{1,3} followed by one or more groups of comma + 3 digits (?:,\d{3})+
  - OR | any number of comma-free digits \d+
  - optionally a decimal point followed by at least one digit (?:\.\d+)?
  - optionally a suffix (percent, 'st', 'nd', 'rd', 'th') (?:%|st|nd|rd|th)?
  - optionally a period or comma [\.,]?
Note the (?:blah) is used to suppress re.findall's natural desire to tell you how every parenthesized group matches up on an individual basis. In this case we just want it to walk forward through the string, and the ?: accomplishes this.
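To illustrate the point about (?:...), here is a small sketch comparing what re.findall returns with capturing and with non-capturing groups. The sample text is made up for the demonstration.

```python
import re

text = "fiji,295,000 and 12th"

# With capturing groups, findall returns a tuple of the groups for each
# match instead of the matched text (a repeated group keeps its last repeat).
capturing = re.findall(r'(\d{1,3})(,\d{3})+', text)

# With non-capturing groups, findall returns the whole match.
non_capturing = re.findall(r'\d{1,3}(?:,\d{3})+', text)

print(capturing)      # [('295', ',000')]
print(non_capturing)  # ['295,000']
```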

regex to match several string in Python

I have questions with regex in Python.
Possible variations can be
10 hours, 12 weeks or 7 business days.
I want to have my regex something like
string = "I have an 7 business day trip and 12 weeks vacation."
re.findall(r'\d+\s(business)?\s(hours|weeks|days)', string)
so that I expect to find "7 business day" and "12 weeks" but it returns None
string = "I have an 7 business day trip and 12 weeks vacation."
print(re.findall(r'\d+\s(?:business\s)?(?:hour|week|day)s?', string))
['7 business day', '12 weeks']
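The reason yours didn't work is that (business)? is followed by a mandatory \s, so without "business" the pattern needs two whitespace characters between the number and the unit ("12 weeks" has only one). The unit alternatives are also all plural, so "day" never matches. On top of that, the capturing groups make re.findall return the groups instead of the whole match.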
Although if you don't want to accept business week/hour, you'll need to modify it further:
\d+\s(?:hour|week|(?:business )?day)s?
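A quick check of the stricter pattern, as a sketch: "business" is only accepted in front of "day", so a phrase such as "3 business weeks" (added here as a made-up test case) is not matched.

```python
import re

# "business " may only precede "day"; hour and week must follow the number directly.
pattern = re.compile(r'\d+\s(?:hour|week|(?:business )?day)s?')

text = "7 business day trip, 12 weeks vacation, 3 business weeks notice"
matches = pattern.findall(text)
print(matches)  # ['7 business day', '12 weeks']
```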
You need to tweak your regex to this:
>>> string = "I have an 7 business day trip and 12 weeks vacation."
>>> print(re.findall(r'(\d+)\s*(?:business day|hour|week)s?', string))
['7', '12']
This matches any number that is followed by business day or hour or week and an optional s in the end.
Similar to @anubhava's answer, but this matches "7 business day" rather than just "7". Just move the closing parenthesis from after \d+ to the end:
re.findall(r'(\d+\s*(?:business day|hour|week)s?)', string)
\d+\s+(business\s)?(hour|week|day)s?

Match to second regular expression if first has no matches

I'm attempting to extract the text between HTML tags using regex in python. The catch is that sometimes there are no HTML tags in the string, so I want my regex to match the entire string. So far, I've got the part that matches the inner text of the tag:
(?<=>).*(?=<\/)
This would match to Russia in the tag below
<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>
Alternately, the entire string would be matched:
Typhoon Vongfong prompted ANA to cancel 101 flights, affecting about 16,600 passengers, the airline said in a faxed statement. Japan Airlines halted 31 flights today and three tomorrow, it said by fax. The storm turned northeast after crossing Okinawa, Japan’s southernmost prefecture, with winds gusting to 75 knots (140 kilometers per hour), according to the U.S. Navy’s Joint Typhoon Warning Center.
Otherwise I want it to return all the text in the string.
I've read a bit about regex conditionals online, but I can't seem to get them to work. If anyone can point me in the right direction, that would be great. Thanks in advance.
You could do this with a single regex. You don't need to go for any workaround.
>>> import re
>>> s='<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>'
>>> re.findall(r'(?<=>)[^<>]+(?=</)|^(?!.*?>.*?</).*', s, re.M)
['Russia']
>>> s='This is Russia Today'
>>> re.findall(r'(?<=>)[^<>]+(?=</)|^(?!.*?>.*?</).*', s, re.M)
['This is Russia Today']
Here is a work-around. Instead of adjusting the regex, we adjust the string:
>>> s='<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>'
>>> re.findall(r'(?<=>)[^<>]*(?=<\/)', s if '>' in s else '>%s</' % s)
['Russia']
>>> s='This is Russia Today'
>>> re.findall(r'(?<=>)[^<>]*(?=<\/)', s if '>' in s else '>%s</' % s)
['This is Russia Today']
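Another option, sketched here, is to keep the regex simple and do the fallback in plain Python: search for the tag text and return the whole string when nothing is found. The function name inner_text_or_all is made up for illustration, and like the answers above it only handles the simple single-tag case.

```python
import re

# Text that sits between a ">" and the next "</".
TAG_TEXT = re.compile(r'(?<=>)[^<>]+(?=</)')

def inner_text_or_all(s):
    """Return the text inside the first tag, or the whole string if there is no tag."""
    m = TAG_TEXT.search(s)
    return m.group(0) if m else s

print(inner_text_or_all('<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>'))
print(inner_text_or_all('This is Russia Today'))
```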

Extracting sub-string after the first space in Python

I need help in regex or Python to extract a substring from a set of string. The string consists of alphanumeric. I just want the substring that starts after the first space and ends before the last space like the example given below.
Example 1:
A:01 What is the date of the election ?
BK:02 How long is the river Nile ?
Results:
What is the date of the election
How long is the river Nile
While I am at it, is there an easy way to extract strings before or after a certain character? For example, I want to extract the date or day like from a string like the ones given in Example 2.
Example 2:
Date:30/4/2013
Day:Tuesday
Results:
30/4/2013
Tuesday
I have actually read about regex but it's very alien to me. Thanks.
I recommend using split
>>> s="A:01 What is the date of the election ?"
>>> " ".join(s.split()[1:-1])
'What is the date of the election'
>>> s="BK:02 How long is the river Nile ?"
>>> " ".join(s.split()[1:-1])
'How long is the river Nile'
>>> s="Date:30/4/2013"
>>> s.split(":")[1:][0]
'30/4/2013'
>>> s="Day:Tuesday"
>>> s.split(":")[1:][0]
'Tuesday'
>>> s="A:01 What is the date of the election ?"
>>> s.split(" ", 1)[1].rsplit(" ", 1)[0]
'What is the date of the election'
There's no need to dig into regex if this is all you need; you can use str.partition
s = "A:01 What is the date of the election ?"
before,sep,after = s.partition(' ') # could be, eg, a ':' instead
If all you want is the last part, you can use _ as a placeholder for 'don't care':
_,_,theReallyAwesomeDay = s.partition(':')
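Applied to the examples from the question, the partition approach might look like the sketch below. The variable names are chosen for illustration only.

```python
# Everything after the first ':' (Example 2 in the question).
values = []
for line in ("Date:30/4/2013", "Day:Tuesday"):
    _, _, value = line.partition(':')
    values.append(value)
print(values)  # ['30/4/2013', 'Tuesday']

# Everything between the first and the last space (Example 1).
s = "A:01 What is the date of the election ?"
middle = s.partition(' ')[2].rpartition(' ')[0]
print(middle)  # What is the date of the election

# partition never raises: if the separator is missing, the whole string
# comes back as the first element and the other two are empty.
print("no colon here".partition(':'))  # ('no colon here', '', '')
```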

python re.compile strings with vars and numbers

Hi I want to get a match for the following:
test = re.compile(r' [0-12](am|pm) [1-1000] days from (yesterday|today|tomorrow)')
with this match:
print test.match(" 3pm 2 days from today")
It returns None. What am I doing wrong? I am just getting into regex, and from reading the docs I thought this should work! Any help appreciated.
chrism
--------------------------------------------------------------------------------------
I am asking a new question about the design of a system using a similar process to the above in NLP HERE
Here is my hat in the ring. Careful study of this regex will teach a few lessons:
import re
reobj = re.compile(
    r"""# Loosely match a date/time reference
    ^                      # Anchor to start of string.
    \s*                    # Optional leading whitespace.
    (?P<time>              # $time: military or AM/PM time.
      (?:                  # Group for military hours options.
        [2][0-3]           # Hour is either 20, 21, 22, 23,
      | [01]?[0-9]         # or 0-9, 00-09 or 10-19
      )                    # End group of military hours options.
      (?:                  # Group for optional minutes.
        :                  # Hours and minutes separated by ":"
        [0-5][0-9]         # 00-59 minutes
      )?                   # Military minutes are optional.
    |                      # or time is given in AM/PM format.
      (?:1[0-2]|0?[1-9])   # 1-12 or 01-12 AM/PM options (hour)
      (?::[0-5][0-9])?     # Optional minutes for AM/PM time.
      \s*                  # Optional whitespace before AM/PM.
      [ap]m                # Required AM or PM (case insensitive)
    )                      # End group of time options.
    \s+                    # Required whitespace.
    (?P<offset> \d+ )      # $offset: count of time increments.
    \s+                    # Required whitespace.
    (?P<units>             # $units: units of time increment.
      (?:sec(?:ond)?|min(?:ute)?|hour|day|week|month|year|decade|century)
      s?                   # Time units may have optional plural "s".
    )                      # End $units: units of time increment.
    \s+                    # Required whitespace.
    (?P<dir>from|before|after|since)  # $dir: Time offset direction.
    \s+                    # Required whitespace.
    # The space in "right now" is escaped because VERBOSE mode ignores bare spaces.
    (?P<base>yesterday|today|tomorrow|(?:right\ )?now)
    \s*                    # Optional whitespace before end.
    $                      # Anchor to end of string.""",
    re.IGNORECASE | re.VERBOSE)
match = reobj.match(' 3 pm 2 days from today')
if match:
    print('Time: %s' % (match.group('time')))
    print('Offset: %s' % (match.group('offset')))
    print('Units: %s' % (match.group('units')))
    print('Direction: %s' % (match.group('dir')))
    print('Base time: %s' % (match.group('base')))
else:
    print("No match.")
Output:
r"""
Time: 3 pm
Offset: 2
Units: days
Direction: from
Base time: today
"""
This regex illustrates a few lessons to be learned:
- Regular expressions are very powerful (and useful)!
- This regex does validate the numbers, but as you can see, doing so is cumbersome and difficult (and thus, not recommended - I'm showing it here to demonstrate why not to do it this way). It is much easier to simply capture the numbers with a regex then validate the ranges using procedural code.
- Named capture groups ease the pain of plucking multiple data sub-strings from larger text.
- Always write regexes using free-spacing, verbose mode with proper indentation of groups and lots of descriptive comments. This helps while writing the regex and later during maintenance.
Modern regular expressions comprise a rich and powerful language. Once you learn the syntax and develop a habit of writing verbose, properly indented, well-commented code, then even complex regexes such as the one above are easy to write, easy to read and easy to maintain. It is unfortunate that they have acquired a reputation for being difficult, unwieldy and error-prone (and thus not recommendable for complex tasks).
Happy regexing!
what about
test = re.compile(r' ([0-9]|1[012])(am|pm) \d+ days from (yesterday|today|tomorrow)')
the hours part should match 0, 1, ..., 9 or 10, 11, 12
but not 13, 14, ..., 19.
you can limit days part in similar way for 1, ..., 1000, i.e. (1000|\d{1,3}).
Try this:
test = re.compile(r' \d+(am|pm) \d+ days from (yesterday|today|tomorrow)')
Try this:
import re
test = re.compile(r'^\s[0-1]?[0-9]{1}pm \d+ days from (today|yesterday|tomorrow)$')
print(test.match(" 12pm 2 days from today"))
The problem you're having is that a character class matches exactly one character, so [0-12] means "0 to 1, or 2" rather than the numbers 0 to 12. Regex has no multi-digit numeric ranges (afaik), so you have to describe the number digit by digit.
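A short sketch of the difference: the first pattern is the character class from the question, the second spells out the hours digit by digit. The variable names are made up for the demonstration.

```python
import re

# [0-12] is a character class: the range 0-1 plus the character 2,
# so it matches exactly one of the characters 0, 1 or 2.
single = re.compile(r'[0-12]')
print(bool(single.fullmatch('2')))   # True
print(bool(single.fullmatch('3')))   # False
print(bool(single.fullmatch('12')))  # False

# Numbers 0 to 12 have to be described per digit: 10-12, or a single digit.
hours = re.compile(r'1[0-2]|[0-9]')
accepted = [h for h in map(str, range(0, 15)) if hours.fullmatch(h)]
print(accepted)  # '0' through '12'; '13' and '14' are rejected
```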
If you want to extract the parts of the match individually, you can label the groups with (?P<name>[match]). For example:
import re
pattern = re.compile(
    r'\s*(?P<time>1?[0-9])(?P<ampm>am|pm)\s+'
    r'(?P<days>[1-9]\d*)\s+days\s+from\s+'
    r'(?P<when>yesterday|today|tomorrow)\s*')
# Exhaustively check every combination the pattern is meant to accept.
for time in range(0, 13):
    for ampm in ('am', 'pm'):
        for days in range(1, 1000):
            for when in ('yesterday', 'today', 'tomorrow'):
                text = ' %d%s %d days from %s ' % (time, ampm, days, when)
                match = pattern.match(text)
                assert match is not None
                keys = sorted(match.groupdict().keys())
                assert keys == ['ampm', 'days', 'time', 'when']
text = ' 3pm 2 days from today '
print(pattern.match(text).groupdict())
Output:
{'time': '3', 'ampm': 'pm', 'days': '2', 'when': 'today'}
test = re.compile(r' 1?\d[ap]m \d{1,3} days? from (?:yesterday|today|tomorrow)')
EDIT
Having read the discussion between Rumple Stiltskin and Demian Brecht, I noticed that my proposal above is poor: it detects strings with a certain structure, but it doesn't validate that they are well-formed "time-pattern" strings. For example, it accepts " 18pm 2 days from today".
So I now propose a pattern that precisely detects strings meeting your requirements, and that also flags every string having the same structure as a valid one but with out-of-range values:
import re
regx = re.compile(r"(?<= )"  # better than a blank as first character
                  r"(?:(1[012]|\d)([ap]m) (?!0 )(\d{1,3}|1000)"
                  r"|"
                  r"(\d+)([ap]m) (\d+))"
                  r" days? from (yesterday|today|tomorrow)")  # shared part
for ch in (" 12pm 2 days from today",
           " 4pm 1 day from today",
           " 12pm 0 days from today",
           " 12pm 1001 days from today",
           " 18pm 2 days from today",
           " 1212pm 2 days from today",
           " 12pm five days from today"):
    print(ch)
    mat = regx.search(ch)
    if mat:
        if mat.group(1):
            # Groups 1-3 are only set when the validating alternative matched.
            print(mat.group(1, 2, 3, 7), '\n# time-pattern-VALIDATED string #')
        else:
            print(mat.group(4, 5, 6, 7), '\n* SIMILI-time-pattern STRUCTURED string*')
    else:
        print('- NO STRUCTURED STRING in the text -')
    print()
result
12pm 2 days from today
('12', 'pm', '2', 'today')
# time-pattern-VALIDATED string #
4pm 1 day from today
('4', 'pm', '1', 'today')
# time-pattern-VALIDATED string #
12pm 0 days from today
('12', 'pm', '0', 'today')
* SIMILI-time-pattern STRUCTURED string*
12pm 1001 days from today
('12', 'pm', '1001', 'today')
* SIMILI-time-pattern STRUCTURED string*
18pm 2 days from today
('18', 'pm', '2', 'today')
* SIMILI-time-pattern STRUCTURED string*
1212pm 2 days from today
('1212', 'pm', '2', 'today')
* SIMILI-time-pattern STRUCTURED string*
12pm five days from today
- NO STRUCTURED STRING in the text -
If you only need a regex that detects validated time-pattern strings, use just:
regx = re.compile(r"(?<= )(1[012]|\d)([ap]m) (?!0 )(\d{1,3}|1000) days?"
                  r" from (yesterday|today|tomorrow)")
It is easier (and more readable) to check integer ranges after the match:
m = re.match(r' (\d+)(?:pm|am) (\d+) days from (yesterday|today|tomorrow)',
" 3pm 2 days from today")
assert m and int(m.group(1)) <= 12 and 1 <= int(m.group(2)) <= 1000
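The same idea wrapped in a reusable function, as a sketch: the regex only checks the structure, and plain Python checks the ranges. The name parse_offset, the returned dictionary layout, and the choice of 1-12 for hours and 1-1000 for days are assumptions made for illustration.

```python
import re

# Structure only: the numeric ranges are checked in Python afterwards.
PATTERN = re.compile(
    r'\s*(\d+)\s*(am|pm)\s+(\d+)\s+days?\s+from\s+(yesterday|today|tomorrow)\s*$')

def parse_offset(text):
    """Return the parsed fields, or None if the text is malformed or out of range."""
    m = PATTERN.match(text)
    if not m:
        return None
    hour, ampm, days, base = m.groups()
    hour, days = int(hour), int(days)
    # Assumed limits: 12-hour clock, 1 to 1000 days.
    if not (1 <= hour <= 12 and 1 <= days <= 1000):
        return None
    return {'hour': hour, 'ampm': ampm, 'days': days, 'base': base}

print(parse_offset(" 3pm 2 days from today"))
print(parse_offset(" 18pm 2 days from today"))
```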
Or you could use an existing library e.g., pip install parsedatetime:
import parsedatetime.parsedatetime as pdt
cal = pdt.Calendar()
print(cal.parse("3pm 2 days from today"))
Output (depends on the date the code is run):
((2011, 4, 26, 15, 0, 0, 1, 116, -1), 3)
