Clarification on Python regexes and findall() - python

I came across this problem as I was working on the Python Challenge. Number 10 to be exact. I decided to try and solve it using regexes - pulling out the repeating sequences, counting their length, and building the next item in the sequence off of that.
So the regex I developed was: '(\d)\1*'
It worked well on the online regex tester, but when using it in my script it didn't perform the same:
regex = re.compile('(\d)\1*')
text = '111122223333'
re.findall(regex, text)
> ['1', '1', '1', '1', '2', '2', '2',...]
And so on and so forth. So I learn about raw type in the re module for Python. Which is my first question: can someone please explain what exactly this does? The doc described it as reducing the need to escape backslashes, but it doesn't appear that it's required for simpler regexes such as \d+ and I don't understand why.
So I change my regex to r'(\d)\1*' and now try and use findall() to make a list of the sequences. And I get
> ['1', '2', '3']
Very confused again. I still don't understand this. Help please?
I decided to do this to get around this:
[m.group() for m in regex.finditer(text)]
> ['1111', '2222', '3333']
And get what I've been looking for. Then, based off of this thread, I try doing findall() adding a group to the whole regex -> r'((\d)\2*)'.
I end up getting:
> [('1111', '1'), ('2222', '2'), ('3333', '3')]
At this point I'm all kinds of confused. I know that this result has something to do with multiple groups, but I'm just not sure.
Also, this is my first time posting so I apologize if my etiquette isn't correct. Please feel free to correct me on that as well. Thanks!

Since this is the challenge I won't give you a complete answer. You are on the right track however.
The finditer method returns MatchObject instances. You want to look at the .group() method on these and read the documentation carefully. Think about what the difference is between .group(0) and .group(1) there; plain .group() is the same as .group(0).
As for the \d escape: because that combination has no meaning as a Python string escape, Python leaves it alone as a backslash followed by the letter d. The \1 in your first pattern is different: Python does recognize it, as an octal escape for the character with code 1, so the regex engine never saw a backreference at all. That is why it is better to use the r'' raw string literal format; it prevents nasty surprises when a regular expression sequence also happens to be an escape sequence Python recognizes. See the Python documentation on string literals for more information.
Your .findall() with the r'((\d)\2*)' expression returns 2 elements per match as you have 2 groups in your pattern; the outer, whole group matching (\d)\2* and the inner group matching \d. From the .findall() documentation:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
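A small sketch of the points above, using the sample text from the question. The variable names are only illustrative; the calls are plain re.findall and re.finditer.

```python
import re

text = '111122223333'

# In a normal string literal, '\1' is an octal escape for chr(1);
# in a raw string the backslash and the digit both survive.
plain_backref = '\1'
raw_backref = r'\1'

# No groups in the pattern: findall returns the whole matches.
whole = re.findall(r'\d+', '12 345')

# One group: findall returns only what that group captured.
one_group = re.findall(r'(\d)\1*', text)

# Two groups: findall returns one tuple per match, one item per group.
two_groups = re.findall(r'((\d)\2*)', text)

# finditer yields match objects: group(0) is the whole match,
# group(1) is the first parenthesized group.
runs = [(m.group(0), m.group(1)) for m in re.finditer(r'(\d)\1*', text)]

print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('1111', '1'), ('2222', '2'), ('3333', '3')]
```
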

Related

Python re.sub: replace part of matching string that contains an arbitrary number of capturing groups

I know that there are other questions which deal with the problem of replacing only part of a matching string using re.sub, but the answers revolve around referring back to capturing groups. My situation is a bit different:
I'm generating regexes like '(?:i|æ|ʏ|ɞ).(?:i|æ|ʏ|ɞ)' and ^. in another part of the application. If I have the string 'abcd', and the pair ('b', 'c'), I want to replace all instances of b where the regex matches at the period character (.).
For example, if I have the rule '(?:x|y|z).(?:h|i|j)', and the desired change is a to b, the following should occur:
xah -> xbh
yai -> ybi
zaz -> zaz (no change)
I've tried using re.sub, replacing the . with my target in the search string and with my replacement in the replacement string, but this replaces the whole match in the target string, when in reality I only want to change a small part. My problem with using match groups and referring back to them in the replacement is that I don't know how many there will be, or what order they'll be in - there might not even be any - so I'm trying to find a flexible solution.
Any help is very appreciated! It's quite difficult to explain, so if further clarification is needed please ask :).
You could use "lookahead" and "lookbehind" assertions, like so:
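import re

tests = (
    ('xah', 'xbh'),
    ('yai', 'ybi'),
    ('zaz', 'zaz'),
)

for test_in, test_out in tests:
    # The lookbehind and lookahead check the context without consuming it,
    # so only the 'a' itself is replaced.
    out = re.sub('(?<=x|y|z)a(?=h|i|j)', 'b', test_in)
    assert test_out == out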
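Since the rules are generated elsewhere, the lookaround pattern could be built from the rule itself. The following is only a sketch: replace_at_period is a hypothetical helper, and it assumes each rule contains exactly one period marking the position of the target. Note that Python's re module requires the lookbehind part to have a fixed width, so this would not work for rules whose left-hand context can vary in length.

```python
import re

def replace_at_period(rule, target, replacement, text):
    """Replace target with replacement wherever rule matches around it.

    Hypothetical helper: assumes rule contains exactly one period and
    that the period stands for the position of target.
    """
    left, _, right = rule.partition('.')
    pattern = ''
    if left:
        pattern += '(?<=' + left + ')'   # context before, not consumed
    pattern += re.escape(target)          # the part that actually gets replaced
    if right:
        pattern += '(?=' + right + ')'    # context after, not consumed
    # A function as replacement keeps the replacement text literal
    # (no backslash processing).
    return re.sub(pattern, lambda m: replacement, text)

rule = '(?:x|y|z).(?:h|i|j)'
results = [replace_at_period(rule, 'a', 'b', s) for s in ('xah', 'yai', 'zaz')]
print(results)  # ['xbh', 'ybi', 'zaz']
```
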

Python Regex instantly replace groups

Is there any way to directly replace all groups using regex syntax?
The normal way:
re.match(r"(?:aaa)(_bbb)", string1).group(1)
But I want to achieve something like this:
re.match(r"(\d.*?)\s(\d.*?)", "(CALL_GROUP_1) (CALL_GROUP_2)")
I want to build the new string instantaneously from the groups the Regex just captured.
Have a look at re.sub:
result = re.sub(r"(\d.*?)\s(\d.*?)", r"\1 \2", string1)
This is Python's regex substitution (replace) function. The replacement string can be filled with so-called backreferences (backslash, group number) which are replaced with what was matched by the groups. Groups are counted the same as by the group(...) function, i.e. starting from 1, from left to right, by opening parentheses.
The accepted answer is perfect. I would add that group reference is probably better achieved by using this syntax:
r"\g<1> \g<2>"
for the replacement string. This way, you work around syntax limitations where a group may be followed by a digit. Again, this is all present in the doc, nothing new, just sometimes difficult to spot at first sight.
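To illustrate both answers, here is a short sketch with made-up input strings. It shows numbered backreferences, the \g<1> form for when a digit follows the group reference, and Match.expand, which builds a string from a match you already have.

```python
import re

# Swap two words using numbered backreferences in the replacement.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')

# Append a literal "0" right after group 1. r'\10' would be read as
# group 10, so \g<1> marks where the group number ends.
padded = re.sub(r'(\d+)', r'\g<1>0', 'item 42')

# If you already hold a match object, expand() fills a template
# with its groups without running a substitution.
m = re.match(r'(\w+) (\w+)', 'hello world')
expanded = m.expand(r'\g<2>, \g<1>')

print(swapped)   # world hello
print(padded)    # item 420
print(expanded)  # world, hello
```
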

"." and "+" not working properly

I have a string, where
text='<tr align="right"><td>12</td><td>John</td>
and I would like to extract the tuple ('12', 'John'). It is working fine when I am using
m=re.findall(r'align.{13}(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
print m
but I am getting ('2', 'John'), when I am using
m=re.findall(r'align.+(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
print m
Why is it going wrong? I mean why .{13} works fine, but .+ fails to work in my re?
Thank you!
You should really be using a proper HTML parser library for this, e.g.:
>>> import lxml.html
>>> a = '<tr align="right"><td>12</td><td>John</td>'
>>> p = lxml.html.fromstring(a)
>>> p.text_content()
'12John'
>>> p.xpath('//td/text()')
['12', 'John']
Obviously you'd need to work this better for multiple occurrences...
I can't actually test this with the sample text and regexps you provided, because as written they clearly should find no matches, and in fact do find no matches in both 2.7 and 3.3.
But I'm guessing that you want a non-greedy match, and changing .+ to .+? will fix whatever your problem is.
As Jon Clements points out in his answer, you really shouldn't be using regular expressions here. Regexps cannot actually parse non-regular languages like XML. Of course, despite what the purists say, regexps can still be a useful hack for non-regular languages in quick&dirty cases. But as soon as you run into something that isn't working, the first thing you ought to do is consider that maybe this isn't one of those quick&dirty cases, and you should look for a real parser. Even if you've never used the ElementTree API or XPath before, they're pretty easy to learn, and the time spent learning is definitely not wasted, as it will come in handy many times in the future.
But anyway… let's reduce your sample to something that works as you describe, and see what this does:
>>> text='<tr align="right"><td>12</td><td>John</td>
SyntaxError: EOL while scanning string literal
>>> text='<tr align="right"><td>12</td><td>John</td>'
>>> re.findall(r'align.{13}(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
[]
>>> re.findall(r'align.{13}(\d+).*([A-Z]\w+)', text)
[('12', 'John')]
>>> re.findall(r'align.+(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
[]
>>> re.findall(r'align.+(\d+).*([A-Z]\w+)', text)
[('2', 'John')]
I think this is what you were complaining about. Well, .+ is not "not working properly"; it's doing exactly what you asked it to: match at least one character, and as many as possible, up to the point where the rest of the expression still has something to match. Which includes matching the 1, because the rest of the expression still matches.
If you want it to instead stop matching as soon as the rest of the expression can take over, that's a non-greedy match, not a greedy match, so you want +? rather than +. Let's try it:
>>> re.findall(r'align.+?(\d+).*([A-Z]\w+)', text)
[('12', 'John')]
Tada.
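To see exactly what the .+ swallowed in each case, one option is to wrap it in a group of its own. This is just a diagnostic sketch on the same sample text; the extra group is not part of the original patterns.

```python
import re

text = '<tr align="right"><td>12</td><td>John</td>'

# Greedy: .+ grabs as much as it can and gives back only what it must.
greedy = re.search(r'align(.+)(\d+).*([A-Z]\w+)', text).groups()

# Non-greedy: .+? stops as soon as the rest of the pattern can match.
lazy = re.search(r'align(.+?)(\d+).*([A-Z]\w+)', text).groups()

print(greedy)  # ('="right"><td>1', '2', 'John')
print(lazy)    # ('="right"><td>', '12', 'John')
```
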
When you use .+, it will match as many characters as it can. Since the \d+ only needs to match at least one digit, the .+ will match "="right"><td>1" and leave only the "2" to be matched by the \d+.
Your original example is working for your sample data. If you need to write a regex that works on other data, you'll need to explain what the format of that data is and how you want to decide what parts to extract.
Also, given that you seem to be parsing HTML, you're probably better off using something like BeautifulSoup instead of regexes.
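If installing lxml or BeautifulSoup is not an option, the standard library's html.parser can do the same job for simple cases. This is a minimal sketch, not what the answers above used; CellCollector is a made-up class name, and it only collects the text of td elements.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text content of every <td> element."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_cell = True
            self.cells.append('')   # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data  # text may arrive in several pieces

parser = CellCollector()
parser.feed('<tr align="right"><td>12</td><td>John</td>')
parser.close()
cells = tuple(parser.cells)
print(cells)  # ('12', 'John')
```
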

Regular expression capturing entire match consisting of repeated groups

I've looked through the forums but could not find exactly how to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've come up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not group the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than a last digit of the last group (3 in this case).
Any help would be greatly appreciated.
First, you need a non-greedy .*?.
Second, you don't need to escape some chars in [ ].
Third, why not just treat it as one sequence of digits and allowed characters? And why is there a \d{1,3} when the number contains 4454?
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'
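The reason the original pattern kept only the tail end is that a capturing group under a quantifier is overwritten on every repetition, so it ends up holding just the last one. A small sketch of that behavior, followed by the pattern from this answer applied to the sample string (s is assumed to hold the string from the question):

```python
import re

s = 'UDK .636.32/38.082.4454.2(575.3)'

# The group is repeated, so each repetition overwrites the previous capture.
last_only = re.search(r'(\d)+', '123').group(1)

# Moving the repetition inside the group captures the whole run.
whole_run = re.search(r'((?:\d)+)', '123').group(1)

# Same idea applied to the question: one group around the repeated class.
number = re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)

print(last_only)  # 3
print(whole_run)  # 123
print(number)     # .636.32/38.082.4454.2(575.3)
```
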
Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')

How can I translate the following filename to a regular expression in Python?

I am battling regular expressions now as I type.
I would like to determine a pattern for the following example file: b410cv11_test.ext. I want to be able to do a search for files that match the pattern of the example file aforementioned. Where do I start (so lost and confused) and what is the best way of arriving at a solution that best matches the file pattern? Thanks in advance.
Further clarification of question:
I would like the pattern to be as follows: must start with 'b', followed by three digits, followed by 'cv', followed by two digits, then an underscore, followed by 'release', followed by .'ext'
Now that you have a human readable description of your file name, it's quite straight forward to translate it into a regular expression (at least in this case ;)
must start with
The caret (^) anchors a regular expression to the beginning of what you want to match, so your re has to start with this symbol.
'b',
Any non-special character in your re will match literally, so you just use "b" for this part: ^b.
followed by [...] digits,
This depends a bit on which flavor of re you use:
The most general way of expressing this is to use brackets ([]). Those mean "match any one of the characters listed within". [ASDF], for example, would match either A or S or D or F; [0-9] would match any digit from 0 to 9.
Your re library probably has a shortcut for "any digit". In sed and awk you could use [[:digit:]] [sic!], in python and many other languages you can use \d.
So now your re reads ^b\d.
followed by three [...]
The most simple way to express this would be to just repeat the atom three times like this: \d\d\d.
Again, your language might provide a shortcut: braces ({}). Sometimes you have to escape them with a backslash (if you are using sed or awk, read about "extended regular expressions"). They also give you a way to say "at least x, but no more than y occurrences of the previous atom": {x,y}.
Now you have: ^b\d{3}
followed by 'cv',
Literal matching again, now we have ^b\d{3}cv
followed by two digits,
We already covered this: ^b\d{3}cv\d{2}.
then an underscore, followed by 'release', followed by .'ext'
Again, this should all match literally, but the dot (.) is a special character. This means you have to escape it with a backslash: ^b\d{3}cv\d{2}_release\.ext
Leaving out the backslash would mean that a filename like "b410cv11_release_ext" would also match, which may or may not be a problem for you.
Finally, if you want to guarantee that nothing else follows ".ext", anchor the re to the end of the thing to match with the dollar sign ($).
Thus the complete regular expression for your specific problem would be:
^b\d{3}cv\d{2}_release\.ext$
Easy.
Whatever language or library you use, there has to be a reference somewhere in the documentation that will show you what the exact syntax in your case should be. Once you have learned to break down the problem into a suitable description, understanding the more advanced constructs will come to you step by step.
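In Python, the finished expression could be used like this. The file names in the list are invented purely to exercise each part of the pattern.

```python
import re

# The pattern built up step by step above.
pattern = re.compile(r'^b\d{3}cv\d{2}_release\.ext$')

# Made-up names: only the first one fits every part of the description.
names = [
    'b410cv11_release.ext',
    'b410cv11_release_ext',      # underscore instead of the dot
    'b41cv11_release.ext',       # only two digits after the b
    'xb410cv11_release.ext',     # does not start with b
    'b410cv11_release.ext.bak',  # extra text after .ext
]

matching = [name for name in names if pattern.match(name)]
print(matching)  # ['b410cv11_release.ext']
```
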
To avoid confusion, read the following, in order.
First, you have the glob module, which handles file name wildcard patterns just like the Windows and unix shells.
Second, you have the fnmatch module, which just does pattern matching using the unix shell rules.
Third, you have the re module, which is the complete set of regular expressions.
Then ask another, more specific question.
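For comparison, here is a sketch of the shell-style route with fnmatch, assuming the 'release' description from the clarified question. Shell patterns have no "repeat three times" operator, so each digit is spelled out as [0-9]. The file names are invented.

```python
import fnmatch

# Shell-style pattern: [0-9] matches one digit; the dot is literal here.
glob_pattern = 'b[0-9][0-9][0-9]cv[0-9][0-9]_release.ext'

# Made-up names to filter.
names = [
    'b410cv11_release.ext',
    'b410cv11_test.ext',
    'b4cv11_release.ext',
]

matches = fnmatch.filter(names, glob_pattern)
print(matches)  # ['b410cv11_release.ext']

# To search the current directory instead of a list, the same pattern
# can be passed to glob.glob(glob_pattern).
```
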
I would like the pattern to be as follows: must start with 'b', followed by three digits, followed by 'cv', followed by two digits, then an underscore, followed by 'release', followed by .'ext'
^b\d{3}cv\d{2}_release\.ext$
Your question is a bit unclear. You say you want a regular expression, but could it be that you want a glob-style pattern you can use with commands like ls? Glob expressions and regular expressions are similar in concept but different in practice (regular expressions are considerably more powerful; glob-style patterns are easier for the most common cases when looking for files).
Also, what do you consider to be the pattern? Certainly, * (glob) or .* (regex) will match the pattern. Also, the *_test.ext (glob) or .*_test\.ext (regexp) pattern would match, as would many other variations.
Can you be more specific about the pattern? For example, you might describe it as "b, followed by digits, followed by cv, followed by digits ..."
Once you can precisely explain the pattern in your native language (and that must be your first step), it's usually a fairly straight-forward task to translate that into a glob or regular expression pattern.
If the letters are unimportant, you could try \w\d\d\d\w\w\d\d_test\.ext, which would match the letter/number pattern, or b\d\d\dcv\d\d_test\.ext, or some mix of the two.
When working with regexes I find the Mochikit regex example to be a great help.
/^b\d\d\dcv\d\d_test\.ext$/
Then use the python re (regex) module to do the match. This is of course assuming regex is really what you need and not glob as the others mentioned.
