python re.split() doubtful result

I'm a Python beginner, and I'm puzzled by the output of re.split():
text='alpha, beta,,,gamma dela'
In [9]: re.split('(,)+',text)
Out[9]: ['alpha', ',', ' beta', ',', 'gamma dela']
In [11]: re.split('(,+)',text)
Out[11]: ['alpha', ',', ' beta', ',,,', 'gamma dela']
In [7]: re.split('[,]+',text)
Out[7]: ['alpha', ' beta', 'gamma dela']
Why are these outputs different?
Please help me, thank you very much!

As is specified in the documentation of re.split:
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern. If capturing parentheses
are used in pattern, then the text of all groups in the pattern
are also returned as part of the resulting list. If maxsplit is
nonzero, at most maxsplit splits occur, and the remainder of the
string is returned as the final element of the list.
A capture group is usually written with parentheses ((..)) that do not start with ?: or a lookahead/lookbehind marker. So the first two regexes have capture groups:
(,)+
# ^ ^
(,+)
# ^ ^
In the first case the capture group matches a single comma, and the + repeats the whole group. When a group is repeated, only its last iteration is kept, so a single comma is returned even where several commas were matched. In the second case ((,+)) the + is inside the group, so the group itself can capture multiple commas (and since quantifiers are greedy and match as much as possible, it captures all of them).
In the last case there is no capture group, so the string is split and the text matched by the pattern is discarded.
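The three cases can be compared side by side. This is a minimal sketch using the string from the question; the variable names are only illustrative, and the extra (?:,)+ pattern is added here to show that a non-capturing group behaves like the pattern without any group.

```python
import re

text = 'alpha, beta,,,gamma dela'

# Group repeated from the outside: only the last iteration (one comma) is kept.
repeated_group = re.split(r'(,)+', text)
# Quantifier inside the group: the whole run of commas is captured.
greedy_group = re.split(r'(,+)', text)
# No capturing group: the separators are dropped entirely.
no_group = re.split(r'[,]+', text)
# Non-capturing group: behaves like the no-group case.
non_capturing = re.split(r'(?:,)+', text)

print(repeated_group)  # ['alpha', ',', ' beta', ',', 'gamma dela']
print(greedy_group)    # ['alpha', ',', ' beta', ',,,', 'gamma dela']
print(no_group)        # ['alpha', ' beta', 'gamma dela']
print(non_capturing)   # ['alpha', ' beta', 'gamma dela']
```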

Related

What's the regular expression pattern that matches the part of a comma-separated number starting from leftmost comma to the last whole number digit

Given a comma-delimited number "123,456,789" as a string, I am attempting to build a regular expression pattern that matches from (and including) the left-most comma ',' to the last whole-number (units place) digit '9'. For the number in the above string, ",456,789" should be matched.
My code is as follows:
import re
print(re.findall(r"(,\d{3})*", "123,456,789"))
# The above regular expression pattern is actually part of a much larger
# regular expression pattern to match a number that may or may not be
# comma delimited or be in scientific notation. The pattern is:
# r"([-+]?\d+){1}(,\d{3})*(\.\d+)?([Ee][+-]?([-+]?\d+){1}(,\d{3})*)?"
However, the above code does not behave as I expect: only the right-most comma group is returned, as if the match were minimal (non-greedy). The output is as follows:
In [0]: print(re.findall(r"(,\d{3})*", "123,456")) # Expected output: ',456'
Out[0]: [',456', '']
In [1]: print(re.findall(r"(,\d{3})*", "123,456,789")) # Expected output: ',456,789'
Out[1]: [',789', '']
In [2]: print(re.findall(r"(,\d{3})*", "123,456,789,000")) # Expected output: ',456,789,000'
Out[2]: [',000', '']
Please help me identify my mistake.

Use the start-of-string anchor \A so that only the leading digits and the first comma are replaced:
import re
number = '123,456,789'
all_after_first_comma = re.sub(r'\A\d{1,3},', ',', number)
This gives ',456,789'.

You can simply add ?: to the group to make it non-capturing, so the pattern becomes (?:,\d{3})* and findall returns the whole match instead of the last iteration of the group:
import re
for result in filter(None, re.findall(r"(?:,\d{3})*", "123,456,789")):
    print(result)
Output:
,456,789
The filter is there to remove the empty strings that the * quantifier also matches.
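An alternative sketch, not taken from the answers above: if at least one comma group is required, using + instead of * avoids the empty matches, so no filtering is needed. The helper name comma_tail is hypothetical.

```python
import re

# (?:...)+ requires at least one ",ddd" block, so there are no empty matches.
COMMA_GROUPS = re.compile(r"(?:,\d{3})+")

def comma_tail(number):
    """Return the text from the left-most comma group onwards, or None if there is none."""
    match = COMMA_GROUPS.search(number)
    return match.group(0) if match else None

print(comma_tail("123,456"))          # ,456
print(comma_tail("123,456,789,000"))  # ,456,789,000
print(comma_tail("123"))              # None
```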

Regex | Empty String match | Python 3.4.0

My Code:
import re
#Phone Number regex
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)? # separator
(\d{3}) # first 3 digits
(\s|-|\.) # separator
(\d{4}) # last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))? # extension
)''', re.VERBOSE)
phoneRegex.findall('Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)')
Output:
[('800.420.7240', '800', '.', '420', '.', '7240', '', '', ''), ('415.863.9900', '415', '.', '863', '.', '9900', '', '', '')]
Questions:
Why are empty strings included in the match?
The empty strings are matched from what positions of the string?
What are the conditions for empty strings to be matched?
P.S.
The empty strings are not included in the match when I use the same regex on https://regex101.com/
Also, I just started learning regex a few days ago, so I'm sorry if my questions aren't good enough.

The ? quantifier makes the preceding item optional: it matches zero or one occurrence. findall returns one tuple entry per capturing group, and Python reports a capturing group that did not take part in the match as an empty string.
In your output the area code and the first separator did match, so the three empty strings all come from the extension pattern: the optional outer group plus the two groups nested inside it. To get rid of them you need to change the extension pattern or make those groups non-capturing.
If you don't care about the internal capturing groups and just want the full match, you can take group(0) from each match object instead:
>>> [match.group(0) for match in phoneRegex.finditer('Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)')]
['800.420.7240', '415.863.9900']
You could make the extension capture group match conditional on there being a match for a preceding phone number. Also, I think you need to escape the . in the third alternative, ext. As written, the . matches any character, but I think what you meant was ext\. (a literal dot).
for reference:
zero-length regex matches
python re
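To make the difference between the two APIs concrete, here is a small sketch with a simplified, made-up pattern (not the one from the question): a group that does not participate in the match is None on the match object, while findall reports it as an empty string.

```python
import re

# Hypothetical simplified pattern: 3 digits, 4 digits, optional " x<ext>".
OPTIONAL = re.compile(r'(\d{3})-(\d{4})( x(\d+))?')

# Match object: groups that did not participate are None.
match_groups = OPTIONAL.search('555-1234').groups()
# findall: the same groups show up as empty strings.
findall_rows = OPTIONAL.findall('555-1234')

print(match_groups)  # ('555', '1234', None, None)
print(findall_rows)  # [('555', '1234', '', '')]
```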

Why are empty strings included in the match?
Because you have used various groups in your regex. The engine captures the part of the match that you have put into a group.
The empty strings are matched from what positions of the string?
From this regex: (\s*(ext|x|ext.)\s*(\d{2,5}))?
It has three groups (you can count the opening parentheses). The engine does not find a match with extensions and the 3 groups that try to capture information return empty strings.
What are the conditions for empty strings to be matched?
If you group your regex in such a way that a group captures an empty substring, or does not take part in the match at all, findall will return an empty string for it :-)
I think you are following the exercise from "Automate the Boring Stuff with Python". In the regex in VERBOSE mode on page 178, look for the opening parentheses. Where is each matching closing parenthesis? The number of groups is equal to the number of opening parentheses. The whole regex is group number zero.
Groups are useful if you want to extract certain parts of a matched string. If you only want to extract full phone numbers, leave the groups out.
You may try just this:
phoneRegex = re.compile(r'\d{3}[-./]\d{3}[-./]\d{4}')
Is this what you want to achieve?
If you want to stick with your regex in VERBOSE mode, you could also use non-capturing groups. This only captures a full match:
phoneRegex = re.compile(r'''(
(?:\d{3}|\(\d{3}\))?
(?:\s|-|\.)? # separator
(?:\d{3}) # first 3 digits
(?:\s|-|\.) # separator
(?:\d{4}) # last 4 digits
(?:\s*(?:ext|x|ext\.)\s*(?:\d{2,5}))? # extension
)''', re.VERBOSE)
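As a quick check of the non-capturing idea, the sketch below runs such a pattern on the text from the question. It assumes a literal dot was intended in "ext.", lists that alternative first, and uses the snake_case name phone_regex purely for illustration; the second input string is made up.

```python
import re

# Only the outer group captures, so findall returns plain strings, not tuples.
phone_regex = re.compile(r'''(
    (?:\d{3}|\(\d{3}\))?               # area code, optionally in parentheses
    (?:\s|-|\.)?                       # separator
    \d{3}                              # first 3 digits
    (?:\s|-|\.)                        # separator
    \d{4}                              # last 4 digits
    (?:\s*(?:ext\.|ext|x)\s*\d{2,5})?  # extension (literal dot assumed)
)''', re.VERBOSE)

text = 'Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)'
found = phone_regex.findall(text)
with_ext = phone_regex.findall('Call 415-555-1234 ext. 42')

print(found)     # ['800.420.7240', '415.863.9900']
print(with_ext)  # ['415-555-1234 ext. 42']
```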

Python regex, capture groups that are not specific

Consider the following example strings:
abc1235abc53abcXX
123abc098YXabc
I want to capture the groups that occur between the abc,
e.g. I should get the following groups:
1235, 53, XX
123, 098YX
I'm trying this regex, but somehow it does not capture the in-between text:
(abc(.*?))+
What am I doing wrong?
EDIT: I need to do it using regex, no string splitting, since I need to apply further rules on the captured groups.

re.findall() approach with specific regex pattern:
import re
strings = ['abc1235abc53abcXX', '123abc098YXabc']
pat = re.compile(r'(?:abc|^)(.+?)(?=abc|$)') # prepared pattern
for s in strings:
    items = pat.findall(s)
    print(items)
    # further processing
The output:
['1235', '53', 'XX']
['123', '098YX']
(?:abc|^) - non-capturing group that matches either the abc substring OR the start of the string ^
(.+?) - capturing group that matches one or more characters, as few as possible
(?=abc|$) - positive lookahead assertion, ensures that the captured text is followed by either the abc sequence OR the end of the string $

Use re.split:
import re
s = 'abc1235abc53abcXX'
re.split('abc', s)
# ['', '1235', '53', 'XX']
Note that you get an empty string, representing the (empty) text before the first 'abc'.

Try splitting the string on abc and then removing the empty results with an if clause inside a list comprehension, as below:
[r for r in re.split('abc', s) if r]
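Applied to both example strings from the question, the split-and-filter approach can be sketched as follows. The helper name parts_between and its sep parameter are hypothetical; re.escape is used so the separator is treated as literal text.

```python
import re

strings = ['abc1235abc53abcXX', '123abc098YXabc']

def parts_between(s, sep='abc'):
    """Split s on the literal separator and drop the empty pieces."""
    return [piece for piece in re.split(re.escape(sep), s) if piece]

results = [parts_between(s) for s in strings]
print(results)  # [['1235', '53', 'XX'], ['123', '098YX']]
```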

Python regex "OR" gives empty string when using findall

I'm using a simple regex (.*?)(\d+[.]\d+)|(.*?)(\d+) to match int/float/double values in a string. When I call findall, the output contains empty strings. The empty strings disappear when I remove the | operator and match each alternative individually. I also tried this on regex101, and it doesn't show any empty strings. How can I remove these empty strings? Here's my code:
>>> import re
>>> match_float = re.compile(r'(.*?)(\d+[.]\d+)|(.*?)(\d+)')
>>> match_float.findall("CA$1.90")
>>> match_float.findall("RM1")
Output:
[('CA$', '1.90', '', '')]
[('', '', 'RM', '1')]

Since you defined 4 capturing groups in the pattern, they will always be part of the re.findall output unless you remove the empty entries afterwards (say, by using filter(None, ...)).
However, in the current situation, you may "shrink" your pattern to
r'(.*?)(\d+(?:\.\d+)?)'
Now, it will only have 2 capturing groups, and thus, findall will only output 2 items per tuple in the resulting list.
Details:
(.*?) - Capturing group 1 matching any zero or more chars other than line break chars, as few as possible, up to the first occurrence of ...
(\d+(?:\.\d+)?) - Capturing group 2:
\d+ - one or more digits
(?:\.\d+)? - an optional non-capturing group that matches 1 or 0 occurrences of a . followed by 1+ digits.
See the Python demo:
import re
rx = r"(.*?)(\d+(?:[.]\d+)?)"
ss = ["CA$1.90", "RM1"]
for s in ss:
    print(re.findall(rx, s))
# => [('CA$', '1.90')]
# => [('RM', '1')]
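If the original four-group pattern has to stay as it is, the filter(None, ...) idea mentioned in the answer can be sketched like this. The helper name non_empty_groups is hypothetical.

```python
import re

# The original pattern from the question, with its four capturing groups.
match_float = re.compile(r'(.*?)(\d+[.]\d+)|(.*?)(\d+)')

def non_empty_groups(text):
    """Run findall and drop the empty entries from each result tuple."""
    return [tuple(filter(None, groups)) for groups in match_float.findall(text)]

print(non_empty_groups("CA$1.90"))  # [('CA$', '1.90')]
print(non_empty_groups("RM1"))      # [('RM', '1')]
```

Note that this also drops a prefix that is legitimately empty (for example, the input "1.90" yields a one-element tuple), which is one reason the two-group pattern above may be the simpler choice.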

regex in python, can this be improved upon?

I have this piece of code that finds words that begin with @ or #,
p = re.findall(r'@\w+|#\w+', str)
Now what irks me about this is repeating \w+. I am sure there is a way to do something like
p = re.findall(r'(@|#)\w+', str)
that will produce the same result, but it doesn't; it instead returns only # and @. How can that regex be changed so that I am not repeating the \w+? This code comes close,
p = re.findall(r'((@|#)\w+)', str)
but it returns [('@many', '@'), ('@this', '@'), ('#tweet', '#')] (notice the extra '@', '@', and '#').
Also, if I'm repeating this re.findall code 500,000 times, can this be compiled into a pattern so that it runs faster?

The solution
You have two options:
Use a non-capturing group: (?:@|#)\w+
Or even better, a character class: [@#]\w+
References
regular-expressions.info/Character Class and Groups
Understanding findall
The problem you were having is due to how findall returns matches depending on how many capturing groups are present.
Let's take a closer look at this pattern (annotated to show the groups):
((@|#)\w+)
|\___/   |
|group 2 |    # Read about groups to understand
\________/    # how they're defined and numbered/named
 group 1
Capturing groups allow us to save the matches in the subpatterns within the overall patterns.
p = re.compile(r'((@|#)\w+)')
m = p.match('@tweet')
print(m.group(1))
# @tweet
print(m.group(2))
# @
Now let's take a look at the Python documentation for the re module:
findall: Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
This explains why you're getting the following:
str = 'lala @tweet boo #this &that @foo#bar'
print(re.findall(r'((@|#)\w+)', str))
# [('@tweet', '@'), ('#this', '#'), ('@foo', '@'), ('#bar', '#')]
As specified, since the pattern has more than one group, findall returns a list of tuples, one for each match. Each tuple gives you what was captured by the groups for the given match.
The documentation also explains why you're getting the following:
print(re.findall(r'(@|#)\w+', str))
# ['@', '#', '@', '#']
Now the pattern only has one group, and findall returns a list of matches for that group.
In contrast, the patterns given above as solutions don't have any capturing groups, which is why they work according to your expectation:
print(re.findall(r'(?:@|#)\w+', str))
# ['@tweet', '#this', '@foo', '#bar']
print(re.findall(r'[@#]\w+', str))
# ['@tweet', '#this', '@foo', '#bar']
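On the question about running the search 500,000 times: a pattern can be compiled once with re.compile and reused. The sketch below assumes the two prefixes are @ and #; the names TAG_PATTERN and find_tags are hypothetical. The re module also caches recently used patterns internally, so whether precompiling is noticeably faster is worth measuring rather than assuming.

```python
import re

# Compile once, outside any loop; the character class needs no capturing group.
TAG_PATTERN = re.compile(r'[@#]\w+')

def find_tags(text):
    """Return every word that starts with @ or #."""
    return TAG_PATTERN.findall(text)

tags = find_tags('lala @tweet boo #this &that @foo#bar')
print(tags)  # ['@tweet', '#this', '@foo', '#bar']
```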
References
docs.python.org - Regular Expression HOWTO
Compiling Regular Expressions
Grouping | Non-capturing and Named Groups
docs.python.org - re module
