regex in python, can this be improved upon?

regex in python, can this be improved upon? - python

I have this piece of code that finds words that begin with # or #,
p = re.findall(r'#\w+|#\w+', str)
Now what irks me about this is repeating \w+. I am sure there is a way to do something like
p = re.findall(r'(#|#)\w+', str)
That will produce the same result but it doesn't, it instead returns only # and #. How can that regex be changed so that I am not repeating the \w+? This code comes close,
p = re.findall(r'((#|#)\w+)', str)
But it returns [('#many', '#'), ('#this', '#'), ('#tweet', '#')] (notice the extra '#', '#', and '#'.
Also, if I'm repeating this re.findall code 500,000 times, can this be compiled and to a pattern and then be faster?

The solution
You have two options:
Use non-capturing group: (?:#|#)\w+
Or even better, a character class: [##]\w+
References
regular-expressions.info/Character Class and Groups
Understanding findall
The problem you were having is due to how findall return matches depending on how many capturing groups are present.
Let's take a closer look at this pattern (annotated to show the groups):
((#|#)\w+)
|\___/ |
|group 2 | # Read about groups to understand
\________/ # how they're defined and numbered/named
group 1
Capturing groups allow us to save the matches in the subpatterns within the overall patterns.
p = re.compile(r'((#|#)\w+)')
m = p.match('#tweet')
print m.group(1)
# #tweet
print m.group(2)
# #
Now let's take a look at the Python documentation for the re module:
findall: Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
This explains why you're getting the following:
str = 'lala #tweet boo #this &that #foo#bar'
print(re.findall(r'((#|#)\w+)', str))
# [('#tweet', '#'), ('#this', '#'), ('#foo', '#'), ('#bar', '#')]
As specified, since the pattern has more than one group, findall returns a list of tuples, one for each match. Each tuple gives you what were captured by the groups for the given match.
The documentation also explains why you're getting the following:
print(re.findall(r'(#|#)\w+', str))
# ['#', '#', '#', '#']
Now the pattern only has one group, and findall returns a list of matches for that group.
In contrast, the patterns given above as solutions doesn't have any capturing groups, which is why they work according to your expectation:
print(re.findall(r'(?:#|#)\w+', str))
# ['#tweet', '#this', '#foo', '#bar']
print(re.findall(r'[##]\w+', str))
# ['#tweet', '#this', '#foo', '#bar']
References
docs.python.org - Regular Expression HOWTO
Compiling Regular Expressions
Grouping | Non-capturing and Named Groups
docs.python.org - re module
Attachments
Snippet with output on ideone.com

Related

python regex - findall not returning output as expected

I am having trouble understanding findall, which says...
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Why doesn't this basic IP regex work with findall as expected? The matches are not overlapping, and regexpal confirms that pattern is highlighted in re_str.
Expected: ['1.2.2.3', '123.345.34.3']
Actual: ['2.', '34.']
re_str = r'(\d{1,3}\.){3}\d{1,3}'
line = 'blahblah -- 1.2.2.3 blah 123.345.34.3'
matches = re.findall(re_str, line)
print(matches) # ['2.', '34.']

When you use parentheses in your regex, re.findall() will return only the parenthesized groups, not the entire matched string. Put a ?: after the ( to tell it not to use the parentheses to extract a group, and then the results should be the entire matched string.

This is because capturing groups return only the last match if they're repeated.
Instead, you should make the repeating group non-capturing, and use a non-repeated capture at an outer layer:
re_str = r'((?:\d{1,3}\.){3}\d{1,3})'
Note that for findall, if there is no capturing group, the whole match is automatically selected (like \0), so you could drop the outer capture:
re_str = r'(?:\d{1,3}\.){3}\d{1,3}'

Python regex, capture groups that are not specific

Consider the following example strings:
abc1235abc53abcXX
123abc098YXabc
I want to capture the groups that occur between the abc,
e.g. I should get the following groups:
1235, 53, XX
123, 098YX
I'm trying this regex, but somehow it does not capture the in-between text:
(abc(.*?))+
What am I doing wrong?
EDIT: I need to do it using regex, no string splitting, since I need to apply further rules on the captured groups.

re.findall() approach with specific regex pattern:
import re
strings = ['abc1235abc53abcXX', '123abc098YXabc']
pat = re.compile(r'(?:abc|^)(.+?)(?=abc|$)') # prepared pattern
for s in strings:
items = pat.findall(s)
print(items)
# further processing
The output:
['1235', '53', 'XX']
['123', '098YX']
(?:abc|^) - non-captured group to match either abc substring OR start of the string ^
(.+?) - captured group to match any character sequence as few times as possible
(?=abc|$) - lookahead positive assertion, ensures that the previous matched item is followed by either abc sequence OR end of the string $

Use re.split:
import re
s = 'abc1235abc53abcXX'
re.split('abc', s)
# ['', '1235', '53', 'XX']
Note that you get an empty string, representing the match before the first 'abc'.

Try splitting the string by abc and then remove the empty results by using if statement inside list comprehension as below:
[r for r in re.split('abc', s) if r]

Regular expression matching all but a string

I need to find all the strings matching a pattern with the exception of two given strings.
For example, find all groups of letters with the exception of aa and bb. Starting from this string:
-a-bc-aa-def-bb-ghij-
Should return:
('a', 'bc', 'def', 'ghij')
I tried with this regular expression that captures 4 strings. I thought I was getting close, but (1) it doesn't work in Python and (2) I can't figure out how to exclude a few strings from the search. (Yes, I could remove them later, but my real regular expression does everything in one shot and I would like to include this last step in it.)
I said it doesn't work in Python because I tried this, expecting the exact same result, but instead I get only the first group:
>>> import re
>>> re.search('-(\w.*?)(?=-)', '-a-bc-def-ghij-').groups()
('a',)
I tried with negative look ahead, but I couldn't find a working solution for this case.

You can make use of negative look aheads.
For example,
>>> re.findall(r'-(?!aa|bb)([^-]+)', string)
['a', 'bc', 'def', 'ghij']
- Matches -
(?!aa|bb) Negative lookahead, checks if - is not followed by aa or bb
([^-]+) Matches ony or more character other than -
Edit
The above regex will not match those which start with aa or bb, for example like -aabc-. To take care of that we can add - to the lookaheads like,
>>> re.findall(r'-(?!aa-|bb-)([^-]+)', string)

You need to use a negative lookahead to restrict a more generic pattern, and a re.findall to find all matches.
Use
res = re.findall(r'-(?!(?:aa|bb)-)(\w+)(?=-)', s)
or - if your values in between hyphens can be any but a hyphen, use a negated character class [^-]:
res = re.findall(r'-(?!(?:aa|bb)-)([^-]+)(?=-)', s)
Here is the regex demo.
Details:
- - a hyphen
(?!(?:aa|bb)-) - if there is aaa- or bb- after the first hyphen, no match should be returned
(\w+) - Group 1 (this value will be returned by the re.findall call) capturing 1 or more word chars OR [^-]+ - 1 or more characters other than -
(?=-) - there must be a - after the word chars. The lookahead is required here to ensure overlapping matches (as this hyphen will be a starting point for the next match).
Python demo:
import re
p = re.compile(r'-(?!(?:aa|bb)-)([^-]+)(?=-)')
s = "-a-bc-aa-def-bb-ghij-"
print(p.findall(s)) # => ['a', 'bc', 'def', 'ghij']

Although a regex solution was asked for, I would argue that this problem can be solved easier with simpler python functions, namely string splitting and filtering:
input_list = "-a-bc-aa-def-bb-ghij-"
exclude = set(["aa", "bb"])
result = [s for s in input_list.split('-')[1:-1] if s not in exclude]
This solution has the additional advantage that result could also be turned into a generator and the result list does not need to be constructed explicitly.

Python regex: finding sequence of chars inside a string

I'm using python and regex (new to both) to find sequence of chars in a string as follows:
Grab the first instance of p followed by any number (It'll always be in the form of p_ _ where _ and _ will be integers). Then either find an 's' or a 'go' then all integers till the end of the string. For example:
ascjksdcvyp12nbvnzxcmgonbmbh12hjg23
should yield p12 go 12 23.
ascjksdcvyp12nbvnzxcmsnbmbh12hjg23
should yield p12 s 12 23.
I've only managed to get the p12 part of the string and this is what I've tried so far to extract the 'go' or 's':
decoded = (re.findall(r'([p][0-9]*)',myStr))
print(decoded) //prints p12
I know by doing something like
re.findall(r'[s]|[go]',myStr)
will give me all occurrences of s and g and o, but something like that is not what I'm looking for. And I'm not sure how I'd combine these regexes to get the desired output.

Use re.findall with pattern grouping:
>>> string = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 'go', '12', '23')]
>>> string = 'ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 's', '12', '23')]
With re.findall we are only willing to get what are matched by pattern grouping ()
p\d{2} matches any two digits after p
After that .* matches anything
Then, s|go matches either s or go
\D* matches any number of non-digits
\d+ indicates one or more digits
(?:) is a non-capturing group i.e. the match inside won't show up in the output, it is only for the sake of grouping tokens
Note:
>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+?', string)
[('p12', 's', '12')]
>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+', string)
[('p12', 's', '23')]
I would like to use one of the above two as matching later digits is kind of a repeated task but there are problems with both non-greedy and greedy matches, hence we need to match the digits after s or go well, kind of explicitly.

First, try to match your line with a minimal pattern, as a test. Use (grouping) and (?:nongrouping) parens to capture the interesting parts and not capture the uninteresting parts. Store away what you care about,
then chop off the remainder of the string and search for numbers as a second step.
import re
simple_test = r'^.*p(\d{2}).*?(?:s|go).*?(\d+)'
m = re.match(simple_test, line)
if m is not None:
p_num = m.group(1)
trailing_numbers = [m.group(2)]
remainder = line[m.end()+1:]
trailing_numbers.extend( # extend list by appending
map( # list from applying
lambda m: m.group(1), # get group(1) from match
re.finditer(r"(\d+)", remainder) # of each number in string
)
)
print("P:", p_num, "Numbers:", trailing_numbers)

re.findall not returning full match?

I have a file that includes a bunch of strings like "size=XXX;". I am trying Python's re module for the first time and am a bit mystified by the following behavior: if I use a pipe for 'or' in a regular expression, I only see that bit of the match returned. E.g.:
>>> myfile = open('testfile.txt', 'r').read()
>>> re.findall('size=50;', myfile)
['size=50;', 'size=50;', 'size=50;', 'size=50;']
>>> re.findall('size=51;', myfile)
['size=51;', 'size=51;', 'size=51;']
>>> re.findall('size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
>>> re.findall(r'size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
The "size=" part of the match is gone (Yet it is certainly used in the search, otherwise there would be more results). What am I doing wrong?

The problem you have is that if the regex that re.findall tries to match captures groups (i.e. the portions of the regex that are enclosed in parentheses), then it is the groups that are returned, rather than the matched string.
One way to solve this issue is to use non-capturing groups (prefixed with ?:).
>>> import re
>>> s = 'size=50;size=51;'
>>> re.findall('size=(?:50|51);', s)
['size=50;', 'size=51;']
If the regex that re.findall tries to match does not capture anything, it returns the whole of the matched string.
Although using character classes might be the simplest option in this particular case, non-capturing groups provide a more general solution.

When a regular expression contains parentheses, they capture their contents to groups, changing the behaviour of findall() to only return those groups. Here's the relevant section from the docs:
(...)
Matches whatever regular expression is inside the parentheses,
and indicates the start and end of a group; the contents of a group
can be retrieved after a match has been performed, and can be matched
later in the string with the \number special sequence, described
below. To match the literals '(' or ')', use \( or \), or enclose them
inside a character class: [(] [)].
To avoid this behaviour, you can use a non-capturing group:
>>> re.findall(r'size=(?:50|51);',myfile)
['size=51;', 'size=51;', 'size=51;', 'size=50;', 'size=50;', 'size=50;', 'size=50;']
Again, from the docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

In some cases, the non-capturing group is not appropriate, for example with regex which detects repeated words (example from python docs)
r'(\b\w+)\s+\1'
In this situation to get whole match one can use
[groups[0] for groups in re.findall(r'((\b\w+)\s+\2)', text)]
Note that \1 has changed to \2.

As others mentioned, the "problem" with re.findall is that it returns a list of strings/tuples-of-strings depending on the use of capture groups. If you don't want to change the capture groups you're using (not to use character groups [] or non-capturing groups (?:)), you can use finditer instead of findall. This gives an iterator of Match objects, instead of just strings. So now you can fetch the full match, even when using capture groups:
import re
s = 'size=50;size=51;'
for m in re.finditer('size=(50|51);', s):
print(m.group())
Will give:
size=50;
size=51;
And if you need a list, similar to findall, you can use a list-comprehension:
>>> [m.group() for m in re.finditer('size=(50|51);', s)]
['size=50;', 'size=51;']

'size=(50|51);' means you are looking for size=50 or size=51 but only matching the 50 or 51 part (note the parentheses), therefore it does not return the sign=.
If you want the sign= returned, you can do:
re.findall('(size=50|size=51);',myfile)

I think what you want is using [] instead of (). [] indicates a set of characters while () indicates a group match. Try something like this:
re.findall('size=5[01];', myfile)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

regex in python, can this be improved upon? - python

Related

python regex - findall not returning output as expected

Python regex, capture groups that are not specific

Regular expression matching all but a string

Python regex: finding sequence of chars inside a string

re.findall not returning full match?

Categories

Resources