I have a file that includes a bunch of strings like "size=XXX;". I am trying Python's re module for the first time and am a bit mystified by the following behavior: if I use a pipe for 'or' in a regular expression, I only see that bit of the match returned. E.g.:
>>> myfile = open('testfile.txt', 'r').read()
>>> re.findall('size=50;', myfile)
['size=50;', 'size=50;', 'size=50;', 'size=50;']
>>> re.findall('size=51;', myfile)
['size=51;', 'size=51;', 'size=51;']
>>> re.findall('size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
>>> re.findall(r'size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
The "size=" part of the match is gone (Yet it is certainly used in the search, otherwise there would be more results). What am I doing wrong?
The problem is that when the regex passed to re.findall contains capturing groups (i.e. portions of the pattern enclosed in parentheses), findall returns the contents of those groups rather than the whole matched string.
One way to solve this issue is to use non-capturing groups (prefixed with ?:).
>>> import re
>>> s = 'size=50;size=51;'
>>> re.findall('size=(?:50|51);', s)
['size=50;', 'size=51;']
If the regex that re.findall tries to match does not capture anything, it returns the whole of the matched string.
Although using character classes might be the simplest option in this particular case, non-capturing groups provide a more general solution.
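To make the rule concrete, here is a minimal sketch of how the return value of re.findall changes with the number of capturing groups. The sample string and variable names are made up for illustration:

```python
import re

s = 'size=50;size=51;size=52;'  # made-up sample data

# No capturing group: findall returns the full matches.
no_group = re.findall(r'size=(?:50|51);', s)

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r'size=(50|51);', s)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'(size)=(50|51);', s)

print(no_group)    # ['size=50;', 'size=51;']
print(one_group)   # ['50', '51']
print(two_groups)  # [('size', '50'), ('size', '51')]
```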
When a regular expression contains parentheses, they capture their contents to groups, changing the behaviour of findall() to only return those groups. Here's the relevant section from the docs:
(...)
Matches whatever regular expression is inside the parentheses,
and indicates the start and end of a group; the contents of a group
can be retrieved after a match has been performed, and can be matched
later in the string with the \number special sequence, described
below. To match the literals '(' or ')', use \( or \), or enclose them
inside a character class: [(] [)].
To avoid this behaviour, you can use a non-capturing group:
>>> re.findall(r'size=(?:50|51);',myfile)
['size=51;', 'size=51;', 'size=51;', 'size=50;', 'size=50;', 'size=50;', 'size=50;']
Again, from the docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
In some cases a non-capturing group is not appropriate, for example with a regex that detects repeated words (example from the Python docs):
r'(\b\w+)\s+\1'
In this situation, to get the whole match one can wrap the entire pattern in an extra group and take the first element of each tuple:
[groups[0] for groups in re.findall(r'((\b\w+)\s+\2)', text)]
Note that \1 has changed to \2, because the new outer group is now group 1.
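As a rough, self-contained illustration of the wrapped pattern (the sample sentence is invented, and `pairs` / `whole` are just names chosen here):

```python
import re

text = 'Paris in the the spring, it is is warm'  # made-up sample text

# The outer parentheses are group 1 (the whole match); the word itself
# is group 2, so the backreference has to be \2.
pairs = re.findall(r'((\b\w+)\s+\2)', text)
whole = [groups[0] for groups in pairs]

print(pairs)  # [('the the', 'the'), ('is is', 'is')]
print(whole)  # ['the the', 'is is']
```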
As others mentioned, the "problem" with re.findall is that it returns a list of strings or a list of tuples of strings, depending on the use of capture groups. If you don't want to change the capture groups you're using (that is, you'd rather not switch to character classes [] or non-capturing groups (?:...)), you can use finditer instead of findall. This gives an iterator of Match objects instead of just strings, so you can fetch the full match even when using capture groups:
import re
s = 'size=50;size=51;'
for m in re.finditer('size=(50|51);', s):
    print(m.group())
Will give:
size=50;
size=51;
And if you need a list, similar to findall, you can use a list-comprehension:
>>> [m.group() for m in re.finditer('size=(50|51);', s)]
['size=50;', 'size=51;']
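If both the full match and the captured part are needed, a small sketch along these lines should work, since each Match object exposes group 0 and group 1 side by side (`results` is just an illustrative name):

```python
import re

s = 'size=50;size=51;'

# group(0) is the whole match, group(1) is the first capturing group,
# so the capturing pattern can stay unchanged.
results = [(m.group(0), m.group(1)) for m in re.finditer(r'size=(50|51);', s)]
print(results)  # [('size=50;', '50'), ('size=51;', '51')]
```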
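'size=(50|51);' means you are looking for size=50; or size=51; but only capturing the 50 or 51 part (note the parentheses), therefore it does not return the size= part.
If you want the size= part returned, you can do the following (the trailing ; is still matched, but it sits outside the group and is not returned):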
re.findall('(size=50|size=51);',myfile)
I think what you want is using [] instead of (). [] indicates a set of characters while () indicates a group match. Try something like this:
re.findall('size=5[01];', myfile)
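For comparison, here is a sketch of where the character class is enough and where it is not. It assumes a made-up input in which one of the wanted values (100) differs from the other by more than a single character:

```python
import re

s = 'size=50;size=51;size=100;'  # made-up sample data

# A character class works when the alternatives differ in one character.
by_class = re.findall(r'size=5[01];', s)

# Alternatives of different shape need a (non-capturing) group instead.
by_group = re.findall(r'size=(?:50|100);', s)

print(by_class)  # ['size=50;', 'size=51;']
print(by_group)  # ['size=50;', 'size=100;']
```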
I have a small issue I am running into. I need a regular expression that splits a given string into numbers, chunks of characters within square brackets, and remaining runs of characters, each as a separate item.
For example, if I have a string that resembles
s = 2[abc]3[cd]ef
I need a list like lst = ['2','abc','3','cd','ef']
I have this code so far:
import re
s = "2[abc]3[cd]ef"
s_final = ""
res = re.findall("(\d+)\[([^[\]]*)\]", s)
print(res)
This is outputting a list of tuples that looks like this.
[('2', 'abc'), ('3', 'cd')]
I am very new to regular expression and learning.. Sorry if this is an easy one.
Thanks!
The immediate fix is getting rid of the capturing groups and using alternation to match either digits or chars other than square bracket chars:
import re
s = "2[abc]3[cd]ef"
res = re.findall(r"\d+|[^][]+", s)
print(res)
# => ['2', 'abc', '3', 'cd', 'ef']
Details:
\d+ - one or more digits
| - or
[^][]+ - one or more chars other than [ and ]
Other solutions that might help are:
re.findall(r'\w+', s)
re.findall(r'\d+|[^\W\d_]+', s)
where \w+ matches one or more word characters (letters, digits and underscores; for str patterns in Python 3 this includes Unicode letters and digits) and [^\W\d_]+ matches one or more letters only, since it removes digits and the underscore from what \w accepts.
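The two patterns only differ when digits and letters are adjacent. A quick sketch with a made-up variant of the input shows the difference:

```python
import re

s = "2[abc]3cd"  # made-up variant: digits directly followed by letters

words = re.findall(r'\w+', s)              # digits and letters stay glued together
split = re.findall(r'\d+|[^\W\d_]+', s)    # digits and letters become separate items

print(words)  # ['2', 'abc', '3cd']
print(split)  # ['2', 'abc', '3', 'cd']
```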
Don't try to write one regex that matches the whole string; instead write a regex that matches each block. \w (equivalent to [a-zA-Z0-9_] for ASCII text) fits well:
import re
s = "2[abc]3[cd]ef"
print(re.findall(r"\w+", s)) # ['2', 'abc', '3', 'cd', 'ef']
Or split on brackets:
print(re.split(r"[\[\]]", s)) # ['2', 'abc', '3', 'cd', 'ef']
Regex is intended to be used as a Regular Expression; your string is irregular.
Regex is mostly used to find a specific pattern in a long text, to validate text, or to extract things from text.
For example, to find a phone number in a string I would use a regex, but to build a calculator that needs to extract operators and digits I would not; I would rather write plain Python code for that.
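For what it's worth, a regex-free version for the bracket example could look roughly like this. `tokenize` is a hypothetical helper written for this illustration, not something from the standard library:

```python
def tokenize(s):
    """Split s into runs of digits and runs of other characters, dropping brackets."""
    tokens = []
    current = ''
    for ch in s:
        if ch in '[]':
            # A bracket ends the current token and is itself discarded.
            if current:
                tokens.append(current)
            current = ''
        elif current and ch.isdigit() != current[-1].isdigit():
            # Switching between digits and non-digits starts a new token.
            tokens.append(current)
            current = ch
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(tokenize("2[abc]3[cd]ef"))  # ['2', 'abc', '3', 'cd', 'ef']
```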
Consider the following example strings:
abc1235abc53abcXX
123abc098YXabc
I want to capture the groups that occur between the abc,
e.g. I should get the following groups:
1235, 53, XX
123, 098YX
I'm trying this regex, but somehow it does not capture the in-between text:
(abc(.*?))+
What am I doing wrong?
EDIT: I need to do it using regex, no string splitting, since I need to apply further rules on the captured groups.
re.findall() approach with specific regex pattern:
import re
strings = ['abc1235abc53abcXX', '123abc098YXabc']
pat = re.compile(r'(?:abc|^)(.+?)(?=abc|$)') # prepared pattern
for s in strings:
    items = pat.findall(s)
    print(items)
    # further processing
The output:
['1235', '53', 'XX']
['123', '098YX']
(?:abc|^) - non-capturing group that matches either the abc substring OR the start of the string ^
(.+?) - capturing group that matches any character sequence, taking as few characters as possible
(?=abc|$) - positive lookahead assertion; ensures that the captured item is followed by either the abc sequence OR the end of the string $, without consuming it
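The lookahead matters because findall resumes searching where the previous match ended. As a sketch, compare it with a variant that consumes the trailing abc (that second pattern is only here for contrast):

```python
import re

s = 'abc1235abc53abcXX'

# The lookahead leaves the next 'abc' in place for the following match.
with_lookahead = re.findall(r'(?:abc|^)(.+?)(?=abc|$)', s)

# Consuming the trailing 'abc' swallows the delimiter the next item needs.
consuming = re.findall(r'(?:abc|^)(.+?)(?:abc|$)', s)

print(with_lookahead)  # ['1235', '53', 'XX']
print(consuming)       # ['1235', 'XX']
```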
Use re.split:
import re
s = 'abc1235abc53abcXX'
re.split('abc', s)
# ['', '1235', '53', 'XX']
Note that you get an empty string as the first item, representing the (empty) text before the first 'abc'.
Try splitting the string on abc and then removing the empty results with an if clause inside a list comprehension, as below:
[r for r in re.split('abc', s) if r]
Suppose I want to match a string like this:
123(432)123(342)2348(34)
I can match digits like 123 with [\d]* and (432) with \([\d]+\).
How can I match the whole string by repeating either of the 2 patterns?
I tried [[\d]* | \([\d]+\)]+, but this is incorrect.
I am using python re module.
I think you need this regex:
"^(\d+|\(\d+\))+$"
and to avoid catastrophic backtracking you need to change it to a regex like this:
"^(\d|\(\d+\))+$"
You can use a character class to match the whole of string :
[\d()]+
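One caveat worth checking: a character class does not enforce balanced or non-empty parentheses. The sketch below anchors both patterns for a fair comparison and uses an invented malformed input; `strict` and `loose` are illustrative names, and the strict pattern is the alternation from above with a non-capturing group:

```python
import re

strict = re.compile(r'^(?:\d|\(\d+\))+$')  # digits or parenthesised digits
loose = re.compile(r'^[\d()]+$')           # any mix of digits and parentheses

good = '123(432)123(342)2348(34)'
bad = '12((3'  # made-up malformed input

print(bool(strict.match(good)), bool(loose.match(good)))  # True True
print(bool(strict.match(bad)), bool(loose.match(bad)))    # False True
```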
But if you want to match the separate parts in separate groups, you can use re.findall with a regex tailored to your needs, for example:
>>> import re
>>> s="123(432)123(342)2348(34)"
>>> re.findall(r'\d+\(\d+\)',s)
['123(432)', '123(342)', '2348(34)']
Or :
>>> re.findall(r'(\d+)\((\d+)\)',s)
[('123', '432'), ('123', '342'), ('2348', '34')]
Or you can just use \d+ to get all the numbers :
>>> re.findall(r'\d+',s)
['123', '432', '123', '342', '2348', '34']
If you want to match the pattern \d+\(\d+\) repeatedly, you can use the following regex:
(?:\d+\(\d+\))+
You can achieve it with this pattern:
^(?=.)\d*(?:\(\d+\)\d*)*$
(?=.) ensures there is at least one character (if you want to allow empty strings, remove it).
\d*(?:\(\d+\)\d*)* is an unrolled sub-pattern. Explanation: with a backtracking regex engine, when you have a sub-pattern like (A|B)* where A and B are mutually exclusive (or at least when the end of A or B doesn't match respectively the beginning of B or A), you can rewrite the sub-pattern like this: A*(BA*)* or B*(AB*)*. For your example, it replaces (?:\d+|\(\d+\))*
This new form is more efficient: it reduces the number of steps needed to obtain a match and avoids a large part of the potential backtracking.
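A quick sanity check that the unrolled form accepts the same strings as the alternation it replaces can be sketched like this. The sample list is made up, and the check says nothing about speed:

```python
import re

alternation = re.compile(r'^(?=.)(?:\d+|\(\d+\))*$')
unrolled = re.compile(r'^(?=.)\d*(?:\(\d+\)\d*)*$')

# Made-up samples: three valid strings, then empty, unclosed and empty-parens cases.
samples = ['123(432)123(342)2348(34)', '(1)(2)', '12', '', '12(', '()']

# Both forms should agree on every sample.
for sample in samples:
    assert bool(alternation.match(sample)) == bool(unrolled.match(sample))

results = [bool(unrolled.match(sample)) for sample in samples]
print(results)  # [True, True, True, False, False, False]
```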
Note that you can improve it more, if you emulate an atomic group (?>....) with this trick (?=(....))\1 that uses the fact that a lookahead is naturally atomic:
^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$
(In a regex debugger, compare the number of steps needed with the previous version to see what happens.)
Note: if you don't want two consecutive numbers enclosed in parentheses, you only need to replace the quantifier * of the \d* inside the non-capturing group with + and add (?:\(\d+\))? at the end of the pattern, before the anchor $:
^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$
or
^(?=.)(?=(\d*(?:\(\d+\)\d+)*(?:\(\d+\))?))\1$
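To see the effect of that variant, a small sketch using the first of the two patterns (the test strings other than the original are invented, and `no_adjacent` is just a name picked here):

```python
import re

no_adjacent = re.compile(r'^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$')

ok_original = bool(no_adjacent.match('123(432)123(342)2348(34)'))
ok_adjacent = bool(no_adjacent.match('12(3)(4)'))  # two parenthesised numbers in a row
ok_single = bool(no_adjacent.match('(3)'))

print(ok_original)  # True
print(ok_adjacent)  # False
print(ok_single)    # True
```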
I have a bunch of lines in a file with either one or two occurrences of the same pattern (id=):
Linetype1 : ...id=1234...id=4321...value=5678... # "..." means whatever
Linetype2 : ...id=7890...value=8765
I thought I could write such a regex to grep all my ids and associated values:
>>> l="...id=1234...id=4321...value=5678...\n...id=7890...value=8765\n"
>>> re.findall('(id=[0-9]+).*?(id=[0-9]+)*.*?(value=[0-9]+)',l)
[('id=1234', '', 'value=5678'), ('id=7890', '', 'value=8765')]
I can't get the second "id=4321" part.
This looks very strange to me since I use the non-greedy .*? between first id=[0-9]+ and second.
The middle of your regex has
(id=[0-9]+)*
The empty string matches this, since it is under the Kleene star *. So the regex engine proceeds through the string as follows:
find the first id=[0-9]+ group
expand .*? to the empty string, since it matches
expand (id=[0-9]+)* to the empty string, since it matches
expand .*? to the rest of the string
If you replace the middle group's quantifier with +, or just remove it entirely, then the second id is captured. Note that in that case lines containing only a single id no longer match.
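If both line types need to match, one possible approach (a sketch, not the only way) is to make the lazy gap and the second id optional together, so the engine must try to find a second id before giving up on it:

```python
import re

l = "...id=1234...id=4321...value=5678...\n...id=7890...value=8765\n"

# '+' on the middle group: only lines with two ids match.
plus = re.findall(r'(id=[0-9]+).*?(id=[0-9]+)+.*?(value=[0-9]+)', l)

# Optional non-capturing wrapper around gap + second id: both line types match,
# and the second id is '' when it is absent.
optional = re.findall(r'(id=[0-9]+)(?:.*?(id=[0-9]+))?.*?(value=[0-9]+)', l)

print(plus)      # [('id=1234', 'id=4321', 'value=5678')]
print(optional)  # [('id=1234', 'id=4321', 'value=5678'), ('id=7890', '', 'value=8765')]
```

This relies on . not matching a newline, so the optional part cannot reach into the next line.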
I am trying to match all occurrences of a pattern in a string with pattern.findall, but it only partly works.
Code:
# -*- coding: UTF-8 -*-
import re
import pprint
regex = r"(19|20|21)\d{2}"
text = "1912 2013 2134"
def main():
    pattern = re.compile(regex)
    print(pattern.findall(text))
if __name__ == '__main__':
    main()
It prints:
['19', '20', '21']
but I expected it to print ['1912', '2013', '2134'].
Quoting from the re.findall docs,
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
Since your original RegEx had one capturing group ((19|20|21)), the value captured in that alone was returned. You can play with that like this
regex = r"(19|20|21)(\d{2})"
Now we have two capturing groups ((19|20|21) and (\d{2})). Then the result would have been
[('19', '12'), ('20', '13'), ('21', '34')]
To fix this, you can use non-capturing group, like this
regex = r"(?:19|20|21)\d{2}"
which gives the following output
['1912', '2013', '2134']
It's working correctly, you're only capturing 19,20,21 in the capturing group of (19|20|21).
You need a non-capturing group by changing it to (?:19|20|21), as from the documentation.
Source: https://docs.python.org/2/howto/regex.html#non-capturing-and-named-groups
Round brackets create capturing groups. In your regex, only the two-digit prefix (19, 20 or 21) is captured, so that is all findall returns.
Perhaps you need this regex:
r'19\d{2}|20\d{2}|21\d{2}'
This looks for any number starting with 19 followed by two digits or 20 followed by two digits or a 21 followed by two digits.
Demo:
In [1]: import re
In [2]: regex = r'19\d{2}|20\d{2}|21\d{2}'
In [3]: text = "1912 2013 2134"
In [4]: pattern = re.compile(regex)
In [5]: pattern.findall(text)
Out[5]: ['1912', '2013', '2134']
Another alternative is to avoid findall() and instead do
print([i.group(0) for i in pattern.finditer(text)])
finditer() gives you an iterable producing Match objects, which can be queried about the properties of each match.
The other solutions are more elegant in their use of what regexps are capable of, but this one is more flexible, as it does not rely on an implicit assumption about which groups should be returned.
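As an example of the properties a Match object offers, this sketch collects the full match, the captured prefix and the position of each hit (`details` is just an illustrative name):

```python
import re

pattern = re.compile(r"(19|20|21)\d{2}")
text = "1912 2013 2134"

# group(0): whole match, group(1): captured prefix, span(): (start, end) offsets.
details = [(m.group(0), m.group(1), m.span()) for m in pattern.finditer(text)]
print(details)
# [('1912', '19', (0, 4)), ('2013', '20', (5, 9)), ('2134', '21', (10, 14))]
```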