Python regex "OR" gives empty string when using findall - python

I'm using a simple regex (.*?)(\d+[.]\d+)|(.*?)(\d+) to match int/float/double value in a string. When doing findall the regex shows empty strings in the output. The empty strings gets removed when I remove the | operator and do an individual match. I had also tried this on regex101 it doesn't show any empty string. How can I remove this empty strings ? Here's my code:
>>>import re
>>>match_float = re.compile('(.*?)(\d+[.]\d+)|(.*?)(\d+)')
>>>match_float.findall("CA$1.90")
>>>match_float.findall("RM1")
Output:
>>>[('CA$', '1.90', '', '')]
>>>[('', '', 'RM', '1')]

Since you defined 4 capturing groups in the pattern, they will always be part of the re.findall output unless you remove them (say, by using filter(None, ...).
However, in the current situation, you may "shrink" your pattern to
r'(.*?)(\d+(?:\.\d+)?)'
See the regex demo
Now, it will only have 2 capturing groups, and thus, findall will only output 2 items per tuple in the resulting list.
Details:
(.*?) - Capturing group 1 matching any zero or more chars other than line break chars, as few as possible up to the first occurrence of ...
(\d+(?:\.\d+)?) - Capturing group 2:
\d+ - one of more digits
(?:\.\d+)? - an optional *non-*capturing group that matches 1 or 0 occurrences of a . and 1+ digits.
See the Python demo:
import re
rx = r"(.*?)(\d+(?:[.]\d+)?)"
ss = ["CA$1.90", "RM1"]
for s in ss:
print(re.findall(rx, s))
# => [('CA$', '1.90')] [('RM', '1')]

Related

Regex | Empty String match | Python 3.4.0

My Code:
import re
#Phone Number regex
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)? # separator
(\d{3}) # first 3 digits
(\s|-|\.) # separator
(\d{4}) # last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))? # extension
)''', re.VERBOSE)
phoneRegex.findall('Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)')
Output:
[('800.420.7240', '800', '.', '420', '.', '7240', '', '', ''), ('415.863.9900', '415', '.', '863', '.', '9900', '', '', '')]
Questions:
Why are empty strings included in the match?
The empty strings are matched from what positions of the string?
What are the conditions for empty strings to be matched?
P.S.
The empty strings are not included in the match when I use the same regex on https://regex101.com/
Also, I just started learning regex a few days ago, so I'm sorry if my questions aren't good enough.
The ? operator means that it will return zero or one matches. In this case, you've made some capturing groups optional with ?, and python is returning a zero-length match for each of the three optional capturing groups you created.
If you remove the first two ? you'll eliminate some zero-length matches. To deal with the last one, you need to change the extension pattern. It accounts for two, again because you use a zero or one operator (*).
If you don't care about the internal capturing groups and just want the full match, you could filter out the zero-length matches with something like
>>> [match.group(0) for match in phoneRegex.finditer('Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)')]
['800.420.7240', '415.863.9900']
You could make the extension capture group match conditional on there being a match for a preceding phone number. Also, I think you may need to escape the . in the third alternative ext.. As written it matches any character, but I think what you meant was ext\..
for reference:
zero-length regex matches
python re
Why are empty strings included in the match?
Because you have used various groups in your regex. The engine captures the part of the match that you have put into a group.
The empty strings are matched from what positions of the string?
From this regex: (\s*(ext|x|ext.)\s*(\d{2,5}))?
It has three groups (you can count the opening parentheses). The engine does not find a match with extensions and the 3 groups that try to capture information return empty strings.
What are the conditions for empty strings to be matched?
If you group your regex in a way that the engine catches an empty substring in the string matched, it will return empty strings :-)
I think you are following the exercise from "automate the boring stuff with python". In the regex in VERBOSE mode on page 178, try to look for the opening parentheses. Where is the closing parentheses? The number of groups is equivalent with the number of opening parentheses. The whole regex is group number zero.
Groups are useful if you want to extract certain parts of a matched string. If you only want to extract full phone numbers, leave the groups away.
You may try just this:
phoneRegex = re.compile(r'\d{3}[\.|-|\/]\d{3}[\.|-|\/]\d{4}')
Is this what you want to achieve?
If you want to stick with your regex in VERBOSE mode, you could also use non-capturing groups. This only captures a full match:
phoneRegex = re.compile(r'''(
(?:\d{3}|\(?:\d{3}\))?
(?:\s|-|\.)? # separator
(?:\d{3}) # first 3 digits
(?:\s|-|\.) # separator
(?:\d{4}) # last 4 digits
(?:\s*(?:ext|x|ext.)\s*(?:\d{2,5}))? # extension
)''', re.VERBOSE)

Regex : matching integers inside of brackets

I am trying to take off bracketed ends of strings such as version = 10.9.8[35]. I am trying to substitute the integer within brackets pattern
(so all of [35], including brackets) with an empty string using the regex [\[+0-9*\]+] but this also matches with numbers not surrounded by brackets. Am I not using the + quantifier properly?
You could match the format of the number and then match one or more digits between square brackets.
In the replacement using the first capturing group r'\1'
\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]
\b Word boundary
( Capture group 1
[0-9]+ Match 1+ digits
(?:\.[0-9]+)+ Match a . and 1+ digits and repeat that 1 or more times
) Close group
\[[0-9]+\] Match 1+ digits between square brackets
Regex demo
For example
import re
regex = r"\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]"
test_str = "version = 10.9.8[35]"
result = re.sub(regex, r'\1', test_str)
print (result)
Output
version = 10.9.8
No need for regex
s = '10.9.8[35]'
t = s[:s.rfind("[")]
print(t)
But if you insist ;-)
import re
s = '10.9.8[35]'
t = re.sub(r"^(.*?)[[]\d+[]]$", r"\1", s)
print(t)
Breakdown of regex:
^ - begins with
() - Capture Group 1 you want to keep
.*? - Any number of chars (non-greedy)
[[] - an opening [
\d+ 1+ digit
[]] - closing ]
$ - ends with
\1 - capture group 1 - used in replace part of regex replace. The bit you want to keep.
Output in both cases:
10.9.8
Use regex101.com to familiarise yourself more. If you click on any of the regex samples at bottom right of the website, it will give you more info. You can also use it to generate regex code in a variety of languages too. (not good for Java though!).
There's also a great series of Python regex videos on Youtube by PyMoondra.
A simpler regex solution:
import re
pattern = re.compile(r'\[\d+\]$')
s = '10.9.8[35]'
r = pattern.sub('', s)
print(r) # 10.9.8
The pattern matches square brackets at the end of a string with one or more number inside. The sub then replaces the square brackets and number with an empty string.
If you wanted to use the number in the square brackets just change the sub expression such as:
import re
pattern = re.compile(r'\[(\d+)\]$')
s = '10.9.8[35]'
r = pattern.sub(r'.\1', s)
print(r) # 10.9.8.35
Alternatively as said by the other answer you can just find it and splice to get rid of it.

regex: combine capturing group with OR condition

Let's say I have a string :
s = "id_john, num847, id_000, num___"
I know how to retrieve either of 2 patterns with |:
re.findall("id_[a-z]+|num[0-9]+", s)
#### ['id_john', 'num847'] # OK
I know how to capture a portion only of a match with parenthesis:
re.findall("id_([a-z]+)", s)
#### ['john']
But I fail when i try to combine those two features, this is my desired output:
#### ['john', '847']
Thanks for your help.. (I work with python)
No need for lookaheads or complex patterns.
Consider this:
>>> re.findall('id_([a-z]+)|num([0-9]+)', s)
[('john', ''), ('', '847')]
When the first pattern matches, the first group will contain the match, and the second group will be empty. When the second pattern matches, the first group is empty, and the second group contains the match.
Since one of the two groups will always be empty, joining them couldn't hurt.
>>> [a+b for a,b in re.findall('id_([a-z]+)|num([0-9]+)', s)]
['john', '847']
You may use this code in Python with lookaheads:
>>> s = "id_john, num847, id_000, num___"
>>> print re.findall(r'(?:id_(?=[a-z]+\b)|num(?=\d+\b))([a-z\d]+)', s)
['john', '847']
RegEx Details:
(?:: Start non-capture group
id_(?=[a-z]+\b): Match id_ with a lookahead assertion to make sure we have [a-z]+ characters ahead followed by word boundary
|: OR
num(?=\d+\b))([a-z\d]+: Matchnum` with a lookahead assertion to make sure we have digits ahead followed by word boundary
): End non-capture group
([a-z\d]+): Match 1+ characters with lowercase letters or digits

Python regex, capture groups that are not specific

Consider the following example strings:
abc1235abc53abcXX
123abc098YXabc
I want to capture the groups that occur between the abc,
e.g. I should get the following groups:
1235, 53, XX
123, 098YX
I'm trying this regex, but somehow it does not capture the in-between text:
(abc(.*?))+
What am I doing wrong?
EDIT: I need to do it using regex, no string splitting, since I need to apply further rules on the captured groups.
re.findall() approach with specific regex pattern:
import re
strings = ['abc1235abc53abcXX', '123abc098YXabc']
pat = re.compile(r'(?:abc|^)(.+?)(?=abc|$)') # prepared pattern
for s in strings:
items = pat.findall(s)
print(items)
# further processing
The output:
['1235', '53', 'XX']
['123', '098YX']
(?:abc|^) - non-captured group to match either abc substring OR start of the string ^
(.+?) - captured group to match any character sequence as few times as possible
(?=abc|$) - lookahead positive assertion, ensures that the previous matched item is followed by either abc sequence OR end of the string $
Use re.split:
import re
s = 'abc1235abc53abcXX'
re.split('abc', s)
# ['', '1235', '53', 'XX']
Note that you get an empty string, representing the match before the first 'abc'.
Try splitting the string by abc and then remove the empty results by using if statement inside list comprehension as below:
[r for r in re.split('abc', s) if r]

python regular expression : how to remove all punctuation characters from a string but keep those between numbers?

I am working on a Chinese NLP project. I need to remove all punctuation characters except those characters between numbers and remain only Chinese character(\u4e00-\u9fff),alphanumeric characters(0-9a-zA-Z).For example,the
hyphen in 12-34 should be kept while the equal mark after 123 should be removed.
Here is my python script.
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)
the expected output should be
中国中国foo中国bar中123国中国12-34中国
but the result is
中国中国foo中国bar中123=国中国12-34中国
I can't figure out why there is an extra equal sign in the output?
Your regex will first check "=" against [^\u4e00-\u9fff0-9a-zA-Z]+. This will succeed. It will then check the lookbehind and lookahead, which must both fail. Ie: If one of them succeeds, the character is kept. This means your code actually keeps any non-alphanumeric, non-Chinese characters which have numbers on any side.
You can try the following regex:
u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'
You can use it as such:
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
print(res.join(''))
I suggest matching and capturing these characters in between digits (to restore them later in the output), and just match them in other contexts.
In Python 2, it will look like
import re
s = u"中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
pat_block = u'[^\u4e00-\u9fff0-9a-zA-Z]+';
pattern = u'([0-9]+{0}[0-9]+)|{0}'.format(pat_block)
res = re.sub(pattern, lambda x: x.group(1) if x.group(1) else u"" ,s)
print(res.encode("utf8")) # => 中国中国foo中国bar中123国中国12-34中国
See the Python demo
If you need to preserve those symbols inside any Unicode digits, you need to replace [0-9] with \d and pass the re.UNICODE flag to the regex.
The regex will look like
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)|[^\u4e00-\u9fff0-9a-zA-Z]+
It will works like this:
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+) - Group 1 capturing
[0-9]+ - 1+ digits
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
[0-9]+ - 1+ digits
| - or
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
In Python 2.x, when a group is not matched in re.sub, the backreference to it is None, that is why a lambda expression is required to check if Group 1 matched first.

Categories

Resources