Python regex, capture groups that are not specific

Consider the following example strings:
abc1235abc53abcXX
123abc098YXabc
I want to capture the groups that occur between the abc,
e.g. I should get the following groups:
1235, 53, XX
123, 098YX
I'm trying this regex, but somehow it does not capture the in-between text:
(abc(.*?))+
What am I doing wrong?
EDIT: I need to do it using regex, no string splitting, since I need to apply further rules on the captured groups.

re.findall() approach with specific regex pattern:
import re
strings = ['abc1235abc53abcXX', '123abc098YXabc']
pat = re.compile(r'(?:abc|^)(.+?)(?=abc|$)') # prepared pattern
for s in strings:
    items = pat.findall(s)
    print(items)
    # further processing
The output:
['1235', '53', 'XX']
['123', '098YX']
(?:abc|^) - non-capturing group that matches either the abc substring OR the start of the string ^
(.+?) - capturing group that matches one or more characters, as few as possible
(?=abc|$) - positive lookahead assertion, ensures that the captured text is followed by either abc OR the end of the string $
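As a rough illustration of why the pattern from the question captured nothing: a lazy .*? with nothing after it is satisfied by the empty string, so each abc is matched with an empty inner group. The sketch below uses only the sample string from the question and contrasts that behaviour with the lookahead pattern.

```python
import re

s = 'abc1235abc53abcXX'

# The lazy (.*?) has nothing after it that forces it to grow,
# so it always settles for the empty string.
original = re.findall(r'(abc(.*?))+', s)
print(original)  # [('abc', ''), ('abc', ''), ('abc', '')]

# The lookahead gives the lazy group a stopping point:
# the next 'abc' or the end of the string.
fixed = re.findall(r'(?:abc|^)(.+?)(?=abc|$)', s)
print(fixed)  # ['1235', '53', 'XX']
```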

Use re.split:
import re
s = 'abc1235abc53abcXX'
re.split('abc', s)
# ['', '1235', '53', 'XX']
Note that the result starts with an empty string: it is the (empty) text before the first 'abc'.

Try splitting the string on abc and then dropping the empty results with an if clause inside a list comprehension:
[r for r in re.split('abc', s) if r]

Related

Replace part of string using a regular expression

s = 'scores:start:4.2.1.3.4.3:final:55.34.13.63.44.34'
I'm trying to reset the start scores to 0.0.0.0.0.0 so that it reads
scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34
This works:
re.sub('start\:(\d+\.\d+\.\d+\.\d+\.\d+\.\d+)','0.0.0.0.0.0',s)
But I was looking for something more flexible to use in case the amount of scores change.
EDIT:
Actually my example does not work, because start is removed from the string
scores:0.0.0.0.0.0:final:55.34.13.63.44.34
but I would like to keep it:
scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34

You may use this re.sub with a lambda if you want to replace dot-separated numbers with the same number of zeroes:
>>> import re
>>> s = 'scores:start:4.2.1.3.4.3:final:55.34.13.63.44.34'
>>> re.sub(r'start:\d+(?:\.\d+)+', lambda m: re.sub(r'\d+', '0', m.group()), s)
'scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34'
The regex start:\d+(?:\.\d+)+ matches text that begins with start: followed by numbers separated by dots.
The lambda then replaces each run of one or more digits with 0, so the output has as many zeroes as the input had numbers.
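To check that this approach really adapts to a changing number of scores, here is a minimal sketch that wraps the same re.sub call in a helper. The name reset_start_scores and the three-score input are made up for illustration; only the pattern comes from the answer.

```python
import re

def reset_start_scores(s):
    """Zero every number in the start block, however many there are.

    reset_start_scores is a hypothetical helper name, not part of any library.
    """
    return re.sub(
        r'start:\d+(?:\.\d+)+',                    # start: plus dotted numbers
        lambda m: re.sub(r'\d+', '0', m.group()),  # each number becomes 0
        s,
    )

six = reset_start_scores('scores:start:4.2.1.3.4.3:final:55.34.13.63.44.34')
three = reset_start_scores('scores:start:10.20.30:final:1.2.3')
print(six)    # scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34
print(three)  # scores:start:0.0.0:final:1.2.3
```

Note that (?:\.\d+)+ requires at least two numbers; changing the trailing + to * would presumably be needed if a single start score can occur.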

Could replace all numbers before final with a zero:
re.sub(r'\d+(?=.*final)', '0', s)
Or perhaps more efficient if there were many more scores:
re.sub(r'\d+|(final.*)', lambda m: m[1] or '0', s)

Another lambda in re.sub:
import re
s = 'scores:start:4.2.1.3.4.3:final:55.34.13.63.44.34'
pat = r'(?<=:start:)([\d.]+)(?=:final:)'
print(re.sub(pat, lambda m: '.'.join(['0'] * len(m.group(1).split('.'))), s))
# scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34

A variant using the regex module from PyPI:
(?:\bstart:|\G(?!^))\.?\K\d+
The pattern matches
(?: Non capture group
\bstart: Match start:
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start of the string
) Close the non capture group
\.?\K Match an optional dot, clear the current match buffer (forget what is matched so far)
\d+ Match 1 or more digits (to be replaced with a 0)
import regex
s = 'scores:start:4.2.1.3.4.3:final:55.34.13.63.44.34'
pattern = r"(?:\bstart:|\G(?!^))\.?\K\d+"
print(regex.sub(pattern, '0', s))
Output
scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34

Python regex "OR" gives empty string when using findall

I'm using a simple regex (.*?)(\d+[.]\d+)|(.*?)(\d+) to match an int/float/double value in a string. With findall, the output contains empty strings. The empty strings disappear when I remove the | operator and match each alternative individually. I also tried this on regex101, where no empty strings are shown. How can I remove these empty strings? Here's my code:
>>> import re
>>> match_float = re.compile(r'(.*?)(\d+[.]\d+)|(.*?)(\d+)')
>>> match_float.findall("CA$1.90")
[('CA$', '1.90', '', '')]
>>> match_float.findall("RM1")
[('', '', 'RM', '1')]

Since you defined 4 capturing groups in the pattern, they will always be part of the re.findall output unless you remove the empty ones (say, by using filter(None, ...)).
However, in the current situation, you may "shrink" your pattern to
r'(.*?)(\d+(?:\.\d+)?)'
Now, it will only have 2 capturing groups, and thus, findall will only output 2 items per tuple in the resulting list.
Details:
(.*?) - Capturing group 1 matching any zero or more chars other than line break chars, as few as possible up to the first occurrence of ...
(\d+(?:\.\d+)?) - Capturing group 2:
\d+ - one or more digits
(?:\.\d+)? - an optional non-capturing group that matches 1 or 0 occurrences of a . and 1+ digits.
See the Python demo:
import re
rx = r"(.*?)(\d+(?:[.]\d+)?)"
ss = ["CA$1.90", "RM1"]
for s in ss:
    print(re.findall(rx, s))
# => [('CA$', '1.90')]
# => [('RM', '1')]
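For completeness, the filter(None, ...) idea mentioned at the start of this answer could look roughly like the sketch below, which keeps the original four-group pattern and drops the empty strings afterwards. The helper name non_empty_groups is an assumption for illustration.

```python
import re

# The original four-group pattern from the question.
match_float = re.compile(r'(.*?)(\d+[.]\d+)|(.*?)(\d+)')

def non_empty_groups(text):
    """Return findall results with the empty group values removed.

    non_empty_groups is a hypothetical helper name.
    """
    return [tuple(filter(None, groups)) for groups in match_float.findall(text)]

price = non_empty_groups("CA$1.90")
count = non_empty_groups("RM1")
print(price)  # [('CA$', '1.90')]
print(count)  # [('RM', '1')]
```

One caveat with this variant: a legitimately empty prefix (a string that starts with the number) is filtered out as well, so the tuples can vary in length. The two-group pattern above avoids that.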

Regular expression matching all but a string

I need to find all the strings matching a pattern with the exception of two given strings.
For example, find all groups of letters with the exception of aa and bb. Starting from this string:
-a-bc-aa-def-bb-ghij-
Should return:
('a', 'bc', 'def', 'ghij')
I tried with this regular expression that captures 4 strings. I thought I was getting close, but (1) it doesn't work in Python and (2) I can't figure out how to exclude a few strings from the search. (Yes, I could remove them later, but my real regular expression does everything in one shot and I would like to include this last step in it.)
I said it doesn't work in Python because I tried this, expecting the exact same result, but instead I get only the first group:
>>> import re
>>> re.search('-(\w.*?)(?=-)', '-a-bc-def-ghij-').groups()
('a',)
I tried with negative look ahead, but I couldn't find a working solution for this case.

You can make use of negative look aheads.
For example,
>>> re.findall(r'-(?!aa|bb)([^-]+)', string)
['a', 'bc', 'def', 'ghij']
- Matches -
(?!aa|bb) Negative lookahead, checks if - is not followed by aa or bb
([^-]+) Matches one or more characters other than -
Edit
The above regex also skips items that merely start with aa or bb, such as -aabc-. To handle that, add - to the lookahead alternatives:
>>> re.findall(r'-(?!aa-|bb-)([^-]+)', string)
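The difference between the two lookaheads only shows up on input that contains an item beginning with aa or bb. The sketch below uses a made-up test string (the question's string with -aabc- added) to compare them.

```python
import re

# Hypothetical input: like the question's string, but with 'aabc' added.
string = '-a-bc-aa-aabc-bb-ghij-'

# Rejects anything that merely starts with aa or bb, so 'aabc' is lost.
loose = re.findall(r'-(?!aa|bb)([^-]+)', string)
print(loose)   # ['a', 'bc', 'ghij']

# Rejects only the exact items aa and bb (followed by a hyphen).
strict = re.findall(r'-(?!aa-|bb-)([^-]+)', string)
print(strict)  # ['a', 'bc', 'aabc', 'ghij']
```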

You need to use a negative lookahead to restrict a more generic pattern, and a re.findall to find all matches.
Use
res = re.findall(r'-(?!(?:aa|bb)-)(\w+)(?=-)', s)
or - if your values in between hyphens can be any but a hyphen, use a negated character class [^-]:
res = re.findall(r'-(?!(?:aa|bb)-)([^-]+)(?=-)', s)
Details:
- - a hyphen
(?!(?:aa|bb)-) - if there is aa- or bb- after the first hyphen, no match should be returned
(\w+) - Group 1 (this value will be returned by the re.findall call) capturing 1 or more word chars OR [^-]+ - 1 or more characters other than -
(?=-) - there must be a - after the word chars. A lookahead is required here so that the hyphen is not consumed and consecutive items can all match (this hyphen is the starting point of the next match).
Python demo:
import re
p = re.compile(r'-(?!(?:aa|bb)-)([^-]+)(?=-)')
s = "-a-bc-aa-def-bb-ghij-"
print(p.findall(s)) # => ['a', 'bc', 'def', 'ghij']
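To see why the trailing hyphen has to sit in a lookahead, the following sketch compares a pattern that consumes the closing hyphen with one that only asserts it. The exclusion of aa and bb is left out here to keep the comparison small.

```python
import re

s = "-a-bc-def-ghij-"

# Consuming the closing hyphen means it cannot open the next match,
# so every second item is skipped.
consuming = re.findall(r'-([^-]+)-', s)
print(consuming)  # ['a', 'def']

# Asserting the hyphen without consuming it leaves it available
# as the opening hyphen of the next match.
asserting = re.findall(r'-([^-]+)(?=-)', s)
print(asserting)  # ['a', 'bc', 'def', 'ghij']
```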

Although a regex solution was asked for, I would argue that this problem can be solved more easily with simpler Python functions, namely string splitting and filtering:
input_list = "-a-bc-aa-def-bb-ghij-"
exclude = set(["aa", "bb"])
result = [s for s in input_list.split('-')[1:-1] if s not in exclude]
This solution has the additional advantage that result could also be turned into a generator and the result list does not need to be constructed explicitly.
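A minimal sketch of the generator variant mentioned above, assuming the same input and exclusion set. The function name iter_allowed is invented for this example.

```python
def iter_allowed(text, exclude):
    """Lazily yield the hyphen-separated items that are not excluded.

    iter_allowed is a hypothetical helper name.
    """
    # [1:-1] drops the empty strings produced by the leading and trailing hyphen.
    return (part for part in text.split('-')[1:-1] if part not in exclude)

items = iter_allowed("-a-bc-aa-def-bb-ghij-", {"aa", "bb"})
first = next(items)     # items are produced one at a time, on demand
remaining = list(items)
print(first, remaining)  # a ['bc', 'def', 'ghij']
```

Note that str.split itself still builds the full list of parts; only the filtered result is produced lazily.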

re.Pattern.findall works wrong

I am trying to find all matches of a pattern in a string with pattern.findall, but it only partly works.
Code:
# -*- coding: UTF-8 -*-
import re
import pprint
regex = r"(19|20|21)\d{2}"
text = "1912 2013 2134"
def main():
    pattern = re.compile(regex)
    print(pattern.findall(text))
if __name__ == '__main__':
    main()
and it prints:
['19', '20', '21']
but it should print ['1912', '2013', '2134'].

Quoting from the re.findall docs,
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
Since your original RegEx had one capturing group ((19|20|21)), the value captured in that alone was returned. You can play with that like this
regex = r"(19|20|21)(\d{2})"
Now we have two capturing groups ((19|20|21) and (\d{2})). Then the result would have been
[('19', '12'), ('20', '13'), ('21', '34')]
To fix this, you can use non-capturing group, like this
regex = r"(?:19|20|21)\d{2}"
which gives the following output
['1912', '2013', '2134']
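The three cases described in this answer can be run side by side. The sketch below uses the text from the question and only changes the grouping in the pattern.

```python
import re

text = "1912 2013 2134"

# One capturing group: findall returns only what that group captured.
one_group = re.findall(r"(19|20|21)\d{2}", text)
print(one_group)   # ['19', '20', '21']

# Two capturing groups: findall returns a tuple per match.
two_groups = re.findall(r"(19|20|21)(\d{2})", text)
print(two_groups)  # [('19', '12'), ('20', '13'), ('21', '34')]

# No capturing groups: findall returns the whole match.
no_groups = re.findall(r"(?:19|20|21)\d{2}", text)
print(no_groups)   # ['1912', '2013', '2134']
```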

It's working correctly, you're only capturing 19,20,21 in the capturing group of (19|20|21).
You need a non-capturing group by changing it to (?:19|20|21), as from the documentation.
Source: https://docs.python.org/2/howto/regex.html#non-capturing-and-named-groups

Round brackets create capturing groups. In your regex, the group captures only the two-digit prefix, which is 19, 20 or 21, and that is what findall returns.
Perhaps you need this regex:
r'19\d{2}|20\d{2}|21\d{2}'
This looks for any number starting with 19 followed by two digits or 20 followed by two digits or a 21 followed by two digits.
Demo:
In [1]: import re
In [2]: regex = r'19\d{2}|20\d{2}|21\d{2}'
In [3]: text = "1912 2013 2134"
In [4]: pattern = re.compile(regex)
In [5]: pattern.findall(text)
Out[5]: ['1912', '2013', '2134']

Another alternative could be to refrain from findall() and instead do
print([i.group(0) for i in pattern.finditer(text)])
finditer() gives you an iterable producing Match objects. They can be queried about the properties of each match.
The other solutions make more elegant use of what regexps can do, but this one is more flexible because it does not depend on the implicit rule about which groups findall returns.
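As a small sketch of that flexibility, each Match object from finditer gives access to the whole match, the individual groups, and the position, even though the pattern still contains the capturing group from the question.

```python
import re

pattern = re.compile(r"(19|20|21)\d{2}")
text = "1912 2013 2134"

# Collect (whole match, captured prefix, span) for every match.
details = [(m.group(0), m.group(1), m.span()) for m in pattern.finditer(text)]
for whole, prefix, span in details:
    print(whole, prefix, span)
# 1912 19 (0, 4)
# 2013 20 (5, 9)
# 2134 21 (10, 14)
```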

Split a string using an integer as a delimiter

I have a rather long txt file filled with strings of the format {letter}{number}{letter}. For instance, the first few lines of my file are:
A123E
G234W
R3L
H4562T
I am having difficulty finding the correct regex pattern to separate each line by alpha and numeric.
For instance, in the first line, I would like an array with the results:
print(first_line[0])  # A
print(first_line[1])  # 123
print(first_line[2])  # E
It seems like regex would be the way to go, but I'm still a regex novice. Could someone help point me in the correct direction on how to do this?
Then I plan to iterate over each of the lines and use the info as necessary.

Split on \d+:
import re
re.split(r'(\d+)', line)
\d is the character class matching the digits 0 through to 9, and we want to match at least 1 of them. By putting a capturing group around the \d+, re.split() will include the match in the output:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
Demo:
>>> import re
>>> re.split(r'(\d+)', 'A123E')
['A', '123', 'E']
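Applied to every line, this could look like the sketch below. The list lines stands in for the contents of the text file, which is an assumption; in practice the lines would be read from the file and stripped of their newline.

```python
import re

# Stand-in for the lines of the text file (sample values from the question).
lines = ["A123E", "G234W", "R3L", "H4562T"]

# The capturing group keeps the number itself in the result.
parts = [re.split(r'(\d+)', line.strip()) for line in lines]
for letter_before, number, letter_after in parts:
    print(letter_before, number, letter_after)

# If a line starts (or ends) with digits, the split yields an empty string there.
edge = re.split(r'(\d+)', '12AB')
print(edge)  # ['', '12', 'AB']
```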
