re.Pattern.findall works wrong - python

I am trying to match every occurrence of a pattern in a string with pattern.findall, but it only partly works.
Code:
# -*- coding: UTF-8 -*-
import re
regex = r"(19|20|21)\d{2}"
text = "1912 2013 2134"
def main():
    pattern = re.compile(regex)
    print(pattern.findall(text))
if __name__ == '__main__':
    main()
It prints:
['19', '20', '21']
I expected it to print ['1912', '2013', '2134'].

Quoting from the re.findall docs,
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
Your original regex has one capturing group, (19|20|21), so only the text captured by that group is returned. You can see the effect by adding a second group:
regex = r"(19|20|21)(\d{2})"
Now we have two capturing groups ((19|20|21) and (\d{2})). Then the result would have been
[('19', '12'), ('20', '13'), ('21', '34')]
To fix this, you can use non-capturing group, like this
regex = r"(?:19|20|21)\d{2}"
which gives the following output
['1912', '2013', '2134']
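The three variants above can be compared side by side. This is only a minimal sketch using the sample text from the question; the variable names are made up for illustration.

```python
import re

text = "1912 2013 2134"

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r"(19|20|21)\d{2}", text)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"(19|20|21)(\d{2})", text)

# Non-capturing group: findall returns the whole match.
no_groups = re.findall(r"(?:19|20|21)\d{2}", text)

print(one_group)   # ['19', '20', '21']
print(two_groups)  # [('19', '12'), ('20', '13'), ('21', '34')]
print(no_groups)   # ['1912', '2013', '2134']
```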

It's working correctly, you're only capturing 19,20,21 in the capturing group of (19|20|21).
You need a non-capturing group by changing it to (?:19|20|21), as from the documentation.
Source: https://docs.python.org/2/howto/regex.html#non-capturing-and-named-groups

Round brackets indicate capturing groups. In your regex, the group captures only the two-digit prefix, which is 19, 20 or 21.
Perhaps you need this regex:
r'19\d{2}|20\d{2}|21\d{2}'
This looks for any number starting with 19 followed by two digits or 20 followed by two digits or a 21 followed by two digits.
Demo:
In [1]: import re
In [2]: regex = r'19\d{2}|20\d{2}|21\d{2}'
In [3]: text = "1912 2013 2134"
In [4]: pattern = re.compile(regex)
In [5]: pattern.findall(text)
Out[5]: ['1912', '2013', '2134']

Another alternative could be to refrain from findall() and instead do
print([i.group(0) for i in pattern.finditer(text)])
finditer() gives you an iterable producing Match objects. They can be queried about the properties of each match.
The other solutions make more elegant use of what regular expressions can do, but this one is more flexible, because it does not depend on the implicit rule about which groups findall() returns.
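As a rough illustration of that flexibility, the sketch below keeps the original capturing group and reads the whole match, the group, and the position from each Match object. The name `details` is just for this example.

```python
import re

pattern = re.compile(r"(19|20|21)\d{2}")
text = "1912 2013 2134"

# Each Match object exposes the whole match, the captured group, and the span.
details = [(m.group(0), m.group(1), m.span()) for m in pattern.finditer(text)]

for whole, prefix, span in details:
    print(whole, prefix, span)
```

Here group(0) is always the full match, regardless of how many capturing groups the pattern contains.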

Related

Python regex, capture groups that are not specific

Consider the following example strings:
abc1235abc53abcXX
123abc098YXabc
I want to capture the groups that occur between the abc,
e.g. I should get the following groups:
1235, 53, XX
123, 098YX
I'm trying this regex, but somehow it does not capture the in-between text:
(abc(.*?))+
What am I doing wrong?
EDIT: I need to do it using regex, no string splitting, since I need to apply further rules on the captured groups.
re.findall() approach with specific regex pattern:
import re
strings = ['abc1235abc53abcXX', '123abc098YXabc']
pat = re.compile(r'(?:abc|^)(.+?)(?=abc|$)') # prepared pattern
for s in strings:
    items = pat.findall(s)
    print(items)
    # further processing
The output:
['1235', '53', 'XX']
['123', '098YX']
(?:abc|^) - non-capturing group that matches either the abc substring OR the start of the string ^
(.+?) - capturing group that matches any character sequence, as few characters as possible
(?=abc|$) - positive lookahead assertion, ensures that the captured text is followed by either the abc sequence OR the end of the string $
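For what it's worth, the reason the pattern in the question captures nothing seems to be that a lazy group with nothing after it is satisfied by the empty string. A small sketch comparing both patterns on the first sample string (variable names are illustrative only):

```python
import re

s = 'abc1235abc53abcXX'

# A lazy group with nothing after it is satisfied by the empty string,
# so the inner group captures '' on every match.
lazy = re.findall(r'(abc(.*?))+', s)

# The lookahead gives the lazy group a stopping point it has to reach.
bounded = re.findall(r'(?:abc|^)(.+?)(?=abc|$)', s)

print(lazy)     # [('abc', ''), ('abc', ''), ('abc', '')]
print(bounded)  # ['1235', '53', 'XX']
```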
Use re.split:
import re
s = 'abc1235abc53abcXX'
re.split('abc', s)
# ['', '1235', '53', 'XX']
Note that you get an empty string, representing the match before the first 'abc'.
Try splitting the string on abc and then remove the empty results with an if clause inside a list comprehension, as below:
[r for r in re.split('abc', s) if r]

python regex find text with digits

I have text like:
sometext...one=1290...sometext...two=12985...sometext...three=1233...
How can I find one=1290 and two=12985 but not three or four or five? There can be 4 or 5 digits after =. I tried this:
import re
pattern = r"(one|two)+=+(\d{4,5})+\D"
found = re.findall(pattern, sometext, flags=re.IGNORECASE)
print(found)
It gives me results like: [('one', '1290')].
If I use pattern = r"((one|two)+=+(\d{4,5})+\D)" it gives me [('one=1290', 'one', '1290')]. How can I get just one=1290?
You were close. You need to use a single capture group (or none for that matter):
((?:one|two)+=+\d{4,5})+
Full code:
import re
string = 'sometext...one=1290...sometext...two=12985...sometext...three=1233...'
pattern = r"((?:one|two)+=+\d{4,5})+"
found = re.findall(pattern, string, flags=re.IGNORECASE)
print(found)
# ['one=1290', 'two=12985']
Make the inner groups non capturing: ((?:one|two)+=+(?:\d{4,5})+\D)
The reason that you are getting results like [('one', '1290')] rather than one=1290 is because you are using capture groups. Use:
r"(?:one|two)=(?:\d{4,5})(?=\D)"
I have removed the additional + repeaters, as they were (I think?) unnecessary. You don't want to match things like oneonetwo===1234, right?
Using (?:...) rather than (...) defines a non-capture group. This prevents the result of the capture from being returned, and you instead get the whole match.
Similarly, using (?=\D) defines a look-ahead - so this is excluded from the match result.
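To make the difference concrete, here is a short sketch that runs the same pattern once with a consuming \D and once with the look-ahead, on the sample text from the question. The variable names are only for illustration.

```python
import re

text = 'sometext...one=1290...sometext...two=12985...sometext...three=1233...'

# \D is part of the match, so the following character is included.
consuming = re.findall(r"(?:one|two)=\d{4,5}\D", text, flags=re.IGNORECASE)

# (?=\D) only checks the following character without consuming it.
lookahead = re.findall(r"(?:one|two)=\d{4,5}(?=\D)", text, flags=re.IGNORECASE)

print(consuming)  # ['one=1290.', 'two=12985.']
print(lookahead)  # ['one=1290', 'two=12985']
```

Note that (?=\D) needs a non-digit character to follow, so it will not match a value that sits at the very end of the string; a negative look-ahead such as (?!\d) would also accept that case.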

Modify regular expression

I am trying to get first pair of numbers from "09_135624.jpg"
My code now:
import re
string = "09_135624.jpg"
pattern = r"(?P<pair>(.*))_135624.jpg"
match = re.findall(pattern, string)
print(match)
Output:
[('09', '09')]
Why do I get a tuple in the output?
Can you help me modify my code to get this:
['09']
Or:
'09'
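re.findall returns different results depending on the number of capturing groups in the pattern. Your pattern has two, the named group and the plain group nested inside it, so each match comes back as a tuple. With a single group: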
>>> re.findall(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg")
['09']
According to the documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Alternative using re.search:
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg")
<_sre.SRE_Match object at 0x00000000025D0D50>
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg").group('pair')
'09'
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg").group(1)
'09'
UPDATE
To match . literally, you need to escape it: \..
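A quick way to see why the escape matters, as a sketch: the file name "09_135624xjpg" below is made up purely to show what the unescaped dot lets through.

```python
import re

unescaped = re.compile(r"(?P<pair>.*)_135624.jpg")
escaped = re.compile(r"(?P<pair>.*)_135624\.jpg")

# Hypothetical name with an 'x' where the dot should be.
bad_name = "09_135624xjpg"

unescaped_hit = unescaped.search(bad_name) is not None
escaped_hit = escaped.search(bad_name) is not None
pair = escaped.search("09_135624.jpg").group("pair")

print(unescaped_hit)  # True: the bare . matches any character
print(escaped_hit)    # False: \. only matches a literal dot
print(pair)           # 09
```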
(?P<pair>(?:.*))_135624.jpg
Try this. You are getting two results because you are capturing them twice. I have modified it to capture only once:
http://regex101.com/r/lS5tT3/62

re.findall not returning full match?

I have a file that includes a bunch of strings like "size=XXX;". I am trying Python's re module for the first time and am a bit mystified by the following behavior: if I use a pipe for 'or' in a regular expression, I only see that bit of the match returned. E.g.:
>>> myfile = open('testfile.txt', 'r').read()
>>> re.findall('size=50;', myfile)
['size=50;', 'size=50;', 'size=50;', 'size=50;']
>>> re.findall('size=51;', myfile)
['size=51;', 'size=51;', 'size=51;']
>>> re.findall('size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
>>> re.findall(r'size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
The "size=" part of the match is gone (Yet it is certainly used in the search, otherwise there would be more results). What am I doing wrong?
The problem you have is that if the regex that re.findall tries to match captures groups (i.e. the portions of the regex that are enclosed in parentheses), then it is the groups that are returned, rather than the matched string.
One way to solve this issue is to use non-capturing groups (prefixed with ?:).
>>> import re
>>> s = 'size=50;size=51;'
>>> re.findall('size=(?:50|51);', s)
['size=50;', 'size=51;']
If the regex that re.findall tries to match does not capture anything, it returns the whole of the matched string.
Although using character classes might be the simplest option in this particular case, non-capturing groups provide a more general solution.
When a regular expression contains parentheses, they capture their contents to groups, changing the behaviour of findall() to only return those groups. Here's the relevant section from the docs:
(...)
Matches whatever regular expression is inside the parentheses,
and indicates the start and end of a group; the contents of a group
can be retrieved after a match has been performed, and can be matched
later in the string with the \number special sequence, described
below. To match the literals '(' or ')', use \( or \), or enclose them
inside a character class: [(] [)].
To avoid this behaviour, you can use a non-capturing group:
>>> re.findall(r'size=(?:50|51);',myfile)
['size=51;', 'size=51;', 'size=51;', 'size=50;', 'size=50;', 'size=50;', 'size=50;']
Again, from the docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
In some cases, the non-capturing group is not appropriate, for example with regex which detects repeated words (example from python docs)
r'(\b\w+)\s+\1'
In this situation to get whole match one can use
[groups[0] for groups in re.findall(r'((\b\w+)\s+\2)', text)]
Note that \1 has changed to \2.
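A small sketch of the renumbering, using a sample sentence picked for illustration. It also shows finditer as an alternative that leaves the backreference as \1.

```python
import re

# Sample sentence chosen for illustration.
text = "Paris in the the spring"

# findall with the original pattern returns only group 1.
only_group = re.findall(r'(\b\w+)\s+\1', text)

# Wrapping everything in an outer group shifts the backreference to \2.
wrapped = [groups[0] for groups in re.findall(r'((\b\w+)\s+\2)', text)]

# finditer keeps the pattern unchanged and still exposes the whole match.
via_finditer = [m.group(0) for m in re.finditer(r'(\b\w+)\s+\1', text)]

print(only_group)    # ['the']
print(wrapped)       # ['the the']
print(via_finditer)  # ['the the']
```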
As others mentioned, the "problem" with re.findall is that it returns a list of strings or a list of tuples of strings, depending on the capture groups used. If you don't want to change the capture groups you're using (that is, you'd rather not switch to character classes [] or non-capturing groups (?:)), you can use finditer instead of findall. This gives an iterator of Match objects instead of just strings, so you can fetch the full match even when using capture groups:
import re
s = 'size=50;size=51;'
for m in re.finditer('size=(50|51);', s):
    print(m.group())
Will give:
size=50;
size=51;
And if you need a list, similar to findall, you can use a list-comprehension:
>>> [m.group() for m in re.finditer('size=(50|51);', s)]
['size=50;', 'size=51;']
'size=(50|51);' means you are looking for size=50; or size=51; but capturing only the 50 or 51 part (note the parentheses), therefore it does not return the size=.
If you want the size= returned, you can put it inside the group (the trailing ; stays outside the group, so it is not returned):
re.findall('(size=50|size=51);',myfile)
I think what you want is using [] instead of (). [] indicates a set of characters while () indicates a group match. Try something like this:
re.findall('size=5[01];', myfile)

Regex and a sequences of patterns?

Is there a way to match a pattern (e\d\d) several times, capturing each one into a group? For example, given the string..
blah.s01e24e25
..I wish to get four groups:
1 -> blah
2 -> 01
3 -> 24
4 -> 25
The obvious regex to use is (in Python regex):
import re
re.match("(\w+).s(\d+)e(\d+)e(\d+)", "blah.s01e24e25").groups()
..but I also want to match either of the following:
blah.s01e24
blah.s01e24e25e26
You can't seem to do (e\d\d)+, or rather you can, but it only captures the last occurrence:
>>> re.match("(\w+).s(\d+)(e\d\d){2}", "blah.s01e24e25e26").groups()
('blah', '01', 'e25')
>>> re.match("(\w+).s(\d+)(e\d\d){3}", "blah.s01e24e25e26").groups()
('blah', '01', 'e26')
I want to do this in a single regex because I have multiple patterns to match TV episode filenames, and do not want to duplicate each expression to handle multiple episodes:
\w+\.s(\d+)\.e(\d+) # matches blah.s01e01
\w+\.s(\d+)\.e(\d+)\.e(\d+) # matches blah.s01e01e02
\w+\.s(\d+)\.e(\d+)\.e(\d+)\.e(\d+) # matches blah.s01e01e02e03
\w - \d+x\d+ # matches blah - 01x01
\w - \d+x\d+\d+ # matches blah - 01x01x02
\w - \d+x\d+\d+\d+ # matches blah - 01x01x02x03
..and so on for numerous other patterns.
Another thing to complicate matters: I wish to store these regexes in a config file, so a solution using multiple regexes and function calls is not desired. If this proves impossible, I'll just allow the user to add simple regexes.
Basically, is there a way to capture a repeating pattern using regex?
Do it in two steps, one to find all the numbers, then one to split them:
import re
def get_pieces(s):
    # Error checking omitted!
    whole_match = re.search(r'\w+\.(s\d+(?:e\d+)+)', s)
    return re.findall(r'\d+', whole_match.group(1))
print(get_pieces(r"blah.s01e01"))
print(get_pieces(r"blah.s01e01e02"))
print(get_pieces(r"blah.s01e01e02e03"))
# prints:
# ['01', '01']
# ['01', '01', '02']
# ['01', '01', '02', '03']
The number of captured groups equals the number of parenthesized groups in the pattern, no matter how often a group repeats. Look at findall or finditer to solve your problem.
Use non-capturing parentheses:
(?:asdfasdg)
which can be made optional:
(?:adsfasdf)?
c = re.compile(r"""(\w+).s(\d+)
    (?:
        e(\d+)
        (?:
            e(\d+)
        )?
    )?
    """, re.X)
or
c = re.compile(r"""(\w+).s(\d+)(?:e(\d+)(?:e(\d+))?)?""", re.X)
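One thing worth knowing about this approach is that optional groups that did not take part in the match come back as None. The sketch below uses the one-line pattern with the dot escaped (my assumption about the intent) and a hypothetical helper, `episode_groups`, to show that.

```python
import re

# Same shape as the pattern above; the dot is escaped here (an assumption).
c = re.compile(r"(\w+)\.s(\d+)(?:e(\d+)(?:e(\d+))?)?")

def episode_groups(name):
    """Hypothetical helper: return the captured groups, or None if no match."""
    m = c.match(name)
    return m.groups() if m else None

one_episode = episode_groups("blah.s01e24")
two_episodes = episode_groups("blah.s01e24e25")

print(one_episode)   # ('blah', '01', '24', None)
print(two_episodes)  # ('blah', '01', '24', '25')

# Drop the unused optional groups to get just the episode numbers.
episodes = [g for g in one_episode[2:] if g is not None]
print(episodes)      # ['24']
```

The pattern still handles at most two episodes; a third one (as in blah.s01e24e25e26) is simply left unmatched at the end.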
After thinking about the problem, I think I have a simpler solution, using named groups.
The simplest regex a user (or I) could use is:
(\w+)\.s(\d+)\.e(\d+)
The filename parsing class will take the first group as the show name, second as season number, third as episode number. This covers a majority of files.
I'll allow a few different named groups for these:
(?P<showname>\w+)\.s(?P<seasonnumber>\d+)\.e(?P<episodenumber>\d+)
To support multiple episodes, I'll support two named groups, something like startingepisodenumber and endingepisodenumber to support things like showname.s01e01-03:
(?P<showname>\w+)\.s(?P<seasonnumber>\d+)\.e(?P<startingepisodenumber>\d+)-(?P<endingepisodenumber>e\d+)
And finally, allow named groups with names matching episodenumber\d+ (episodenumber1, episodenumber2 etc):
(?P<showname>\w+)\.
s(?P<seasonnumber>\d+)\.
e(?P<episodenumber1>\d+)
e(?P<episodenumber2>\d+)
e(?P<episodenumber3>\d+)
It still requires possibly duplicating the patterns for different amounts of e01s, but there will never be a file with two non-consecutive episodes (like show.s01e01e03e04), so using the starting/endingepisodenumber groups should solve this, and for weird cases users come across, they can use the episodenumber\d+ group names
This doesn't really answer the sequence-of-patterns question, but it solves the problem that led me to ask it! (I'll still accept another answer that shows how to match s01e23e24...e27 in one regex - if someone works this out!)
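A rough sketch of how the episodenumber\d+ idea could be consumed on the parsing side. Everything here is hypothetical: the `parse` helper, the exact pattern (written without the dot before e so that it matches names like blah.s01e24e25), and the optional second and third groups are assumptions for illustration, not the actual parser.

```python
import re

# Hypothetical user-supplied pattern following the naming scheme above.
pattern = re.compile(
    r"(?P<showname>\w+)\.s(?P<seasonnumber>\d+)"
    r"e(?P<episodenumber1>\d+)"
    r"(?:e(?P<episodenumber2>\d+))?"
    r"(?:e(?P<episodenumber3>\d+))?"
)

def parse(filename):
    """Return (showname, seasonnumber, [episodes]) or None if no match."""
    m = pattern.match(filename)
    if m is None:
        return None
    groups = m.groupdict()
    # Collect every episodenumber<N> group that took part in the match,
    # ordered by N.
    numbered = sorted(
        (int(name[len("episodenumber"):]), value)
        for name, value in groups.items()
        if re.match(r"episodenumber\d+$", name) and value is not None
    )
    return groups["showname"], groups["seasonnumber"], [v for _, v in numbered]

print(parse("blah.s01e24"))        # ('blah', '01', ['24'])
print(parse("blah.s01e01e02e03"))  # ('blah', '01', ['01', '02', '03'])
```

Because the parser only looks at group names, users can add or remove episodenumber groups in their own patterns without any change to the code.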
Perhaps something like that?
import re
def episode_matcher(filename):
    m1 = re.match(r"(?i)(.*?)\.s(\d+)((?:e\d+)+)", filename)
    if m1:
        m2 = re.findall(r"\d+", m1.group(3))
        return m1.group(1), m1.group(2), m2
    # auto return None here
>>> episode_matcher("blah.s01e02")
('blah', '01', ['02'])
>>> episode_matcher("blah.S01e02E03")
('blah', '01', ['02', '03'])
