I'm trying to find ways to do it other than these two:
# match last occurrence of \d+, 24242 in this case
>>> test = "123_4242_24242lj.:"
>>> obj = re.search(r"\d+(?!.*\d)", test)
>>> obj.group()
'24242'
>>> re.findall(r"\d+", test)[-1]
'24242'
I'm sure you can find more clever regular expressions that will do this, but I think you should stick with findall().
Regular expressions are hard to read. Not just by others: let 10 days go by since the time you wrote one, and you'll find it hard to read too. This makes them hard to maintain.
Unless performance is critical, it's always best to minimize the work done by regular expressions. This line...
re.findall(r"\d+", test)[-1]
... is clean, concise and immediately obvious.
This lookahead-based regex matches the last run of digits in a string:
\d+(?=\D*$)
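A minimal sketch of how this pattern could be wrapped for reuse. The helper name last_digits is hypothetical, and returning None when the string has no digits is an assumption about the desired behavior:

```python
import re

# A run of digits followed only by non-digits up to the end of the string.
LAST_DIGITS = re.compile(r"\d+(?=\D*$)")

def last_digits(text):
    """Return the last run of digits in text, or None if there is none."""
    match = LAST_DIGITS.search(text)
    return match.group() if match else None

print(last_digits("123_4242_24242lj.:"))  # 24242
```

Unlike re.findall(r"\d+", test)[-1], this does not raise IndexError when the input contains no digits.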
I'm trying to find ways to do it other than these two:
A slight modification to your first approach. Capture the digits followed by anything that is not a digit at the end of the string.
>>> import re
>>> test = "123_4242_24242lj.:"
>>> print(re.findall(r'(\d+)\D*$', test))
['24242']
>>>
Another alternative would be to substitute:
>>> re.sub(r'.*?(\d+)\D*$', "\\1", test)
'24242'
I have a list of proc names on Linux. Some have slash, some don't. For example,
kworker/23:1
migration/39
qmgr
I need to extract just the proc name without the slash and the rest. I tried a few different ways but still won't get it completely correct. What's wrong with my regex? Any help would be much appreciated.
>>> str='kworker/23:1'
>>> match=re.search(r'^(.+)\/*',str)
>>> match.group(1)
'kworker/23:1'
The problem with the regex is that the greedy .+ runs all the way to the end of the string: everything after it is optional, so \/* is free to match nothing. To fix this, replace the . with a character class that matches anything except a /.
([^\/]+)\/?.*
works. In case it is new to you, [^\/] matches any character except a slash, because the ^ at the beginning of the class inverts which characters are matched.
Alternatively, you can also use split as suggested by Moses Koledoye. split is often better for simple string manipulation, while regex enables you to perform very complex tasks with rather little code.
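A minimal sketch of the negated character class applied to the three names from the question. The helper name proc_name is hypothetical, and returning an empty string for input that starts with a slash is an assumption:

```python
import re

# One or more characters that are not a slash; match() anchors at the start.
PROC_NAME = re.compile(r"[^/]+")

def proc_name(entry):
    """Return the part of entry before the first slash."""
    match = PROC_NAME.match(entry)
    return match.group() if match else ""

for entry in ["kworker/23:1", "migration/39", "qmgr"]:
    print(proc_name(entry))
```

Because the class cannot cross a slash, there is nothing for the engine to backtrack over, and the trailing \/?.* from the pattern above is not needed when only the name is wanted.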
An alternative to regex is to split on slash and take the first item:
>>> s ='kworker/23:1'
>>> s.split('/')[0]
'kworker'
This also works when the string does not contain a slash:
>>> s = 'qmgr'
>>> s.split('/')[0]
'qmgr'
But if you're going to stick to re, I think re.sub is what you want, as you won't need to fetch the matching group:
>>> import re
>>> s ='kworker/23:1'
>>> re.sub(r'/.*$', '', s)
'kworker'
On a side note, assigning the name str shadows the built-in str type, which you don't want.
I'm looking for a list of strings and their variations within a very large string.
What I want to do is find even the implicit matches between two strings.
For example, if my start string is foo-bar, I want the matching to find Foo-bAr, foo Bar, or even foo(bar.... Of course, foo-bar should also return a match.
EDIT: More specifically, I need the following matches.
The string itself, case insensitive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
How do I write an expression to meet these conditions?
I realize this might require some tricky regex. The thing is, I have a large list of strings I need to search for, and I feel regex is just the tool for making this as robust as I need.
Perhaps regex isn't the best solution?
Thanks for your help guys. I'm still learning to think in regex.
>>> def findString(inputStr, targetStr):
...     if convertToStringSoup(targetStr).find(convertToStringSoup(inputStr)) != -1:
...         return True
...     return False
...
>>> def convertToStringSoup(testStr):
...     testStr = testStr.lower()
...     testStr = testStr.replace(" ", "")
...     testStr = testStr.replace("(", "")
...     testStr = testStr.replace(")", "")
...     return testStr
...
>>>
>>> findString("hello", "hello")
True
>>> findString("hello", "hello1")
True
>>> findString("hello", "hell!o1")
False
>>> findString("hello", "hell( o)1")
True
This should work according to your specification, though it could obviously be optimized. You asked about regex, which I'm still thinking hard about, and I will hopefully edit this answer soon with something good. Unless this turns out to be too slow, though, keep in mind that regexps can be miserable, and readable code is often better!
I noticed that you're repeatedly looking in the same big haystack. Obviously, you only have to convert that to "string soup" once!
Edit: I've been thinking about regex, and any regex would either need many clauses or require the text to be modified before matching, as I did in this answer. I haven't benchmarked str.find() against re.search(), but I imagine the former would be faster in this case.
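A minimal sketch of the "convert the haystack once" idea, assuming the same normalization as above (lowercase, with spaces and parentheses removed). The names to_string_soup and find_needles are hypothetical:

```python
def to_string_soup(text):
    """Lowercase text and drop spaces and parentheses."""
    return text.lower().translate(str.maketrans("", "", " ()"))

def find_needles(haystack, needles):
    """Return the needles that occur in haystack after normalization."""
    soup = to_string_soup(haystack)  # the large string is converted only once
    return [needle for needle in needles if to_string_soup(needle) in soup]

print(find_needles("say hell( o)1 and f O o-bar", ["hello", "foo-bar", "jane-doe"]))
```

Each needle is still normalized on every call, but needles are short; the expensive conversion of the large string happens once.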
I'm going to assume that your rules are right, and your examples are wrong, mainly since you added the rules later, as a clarification, after a bunch of questions. So:
EDIT: More specifically, I need the following matches.
The string itself, case insensitive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
The simplest way to do this is to just remove spaces and parens, then do a case-insensitive search on the result. You don't even need regex for that. For example:
haystack.replace(' ', '').replace('(', '').replace(')', '').upper().find(needle.upper())
Try this regex:
[fF][oO]{2}[- ()][bB][aA][rR]
Test:
>>> import re
>>> pattern = re.compile("[fF][oO]{2}[- ()][bB][aA][rR]")
>>> m = pattern.match("foo-bar")
>>> m.group(0)
'foo-bar'
With a regex, a case-insensitive search matches upper- and lower-case variants, '[]' matches any one of the contained characters, and '|' lets you test several alternatives at once. Putting it all together, you can try:
import re
pairs = ['foo-bar', 'jane-doe']
regex = '|'.join(r'%s[- ()]%s' % tuple(p.split('-')) for p in pairs)
print(regex)
results = re.findall(regex, your_text_here, re.IGNORECASE)
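If the asker's stated rules are taken literally (case-insensitive, with spaces or parentheses allowed between any two characters), one possible sketch is to generate the pattern from the search string. The helper name variation_pattern is hypothetical, and under these rules the hyphen itself remains required, so "foo bar" is deliberately not matched:

```python
import re

def variation_pattern(needle):
    """Build a case-insensitive pattern that tolerates spaces and
    parentheses between the characters of needle."""
    chars = [re.escape(ch) for ch in needle if ch not in " ()"]
    return re.compile(r"[ ()]*".join(chars), re.IGNORECASE)

pattern = variation_pattern("foo-bar")
for candidate in ["foo-bar", "Foo-bAr", "f o o - b a r", "foo(-)bar", "foo bar"]:
    print(candidate, bool(pattern.search(candidate)))
```

re.escape keeps characters such as the hyphen or a dot from being interpreted as regex syntax.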
I have a string, where
text='<tr align="right"><td>12</td><td>John</td>
and I would like to extract the tuple ('12', 'John'). It is working fine when I am using
m=re.findall(r'align.{13}(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
print(m)
but I am getting ('2', 'John'), when I am using
m=re.findall(r'align.+(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
print(m)
Why is it going wrong? I mean why .{13} works fine, but .+ fails to work in my re?
Thank you!
You should really be using a proper HTML parser library for this, e.g.:
>>> import lxml.html
>>> a = '<tr align="right"><td>12</td><td>John</td>'
>>> p = lxml.html.fromstring(a)
>>> p.text_content()
'12John'
>>> p.xpath('//td/text()')
['12', 'John']
Obviously you'd need to refine this to handle multiple occurrences...
I can't actually test this with the sample text and regexps you provided, because as written they clearly should find no matches, and in fact do find no matches in both 2.7 and 3.3.
But I'm guessing that you want a non-greedy match, and changing .+ to .+? will fix whatever your problem is.
As Jon Clements points out in his answer, you really shouldn't be using regular expressions here. Regexps cannot actually parse non-regular languages like XML. Of course, despite what the purists say, regexps can still be a useful hack for non-regular languages in quick&dirty cases. But as soon as you run into something that isn't working, the first thing you ought to do is consider that maybe this isn't one of those quick&dirty cases, and you should look for a real parser. Even if you'd never used the ElementTree API before, or XPath, they're pretty easy to learn, and the time spent learning is definitely not wasted, as it will come in handy many times in the future.
But anyway… let's reduce your sample to something that works as you describe, and see what this does:
>>> text='<tr align="right"><td>12</td><td>John</td>
SyntaxError: EOL while scanning string literal
>>> text='<tr align="right"><td>12</td><td>John</td>'
>>> re.findall(r'align.{13}(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
[]
>>> re.findall(r'align.{13}(\d+).*([A-Z]\w+)', text)
[('12', 'John')]
>>> re.findall(r'align.+(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
[]
>>> re.findall(r'align.+(\d+).*([A-Z]\w+)', text)
[('2', 'John')]
I think this is what you were complaining about. Well, .+ is not "not working properly"; it's doing exactly what you asked it to: match at least one character, and as many as possible, up to the point where the rest of the expression still has something to match. Which includes matching the 1, because the rest of the expression still matches.
If you want it to instead stop matching as soon as the rest of the expression can take over, that's a non-greedy match, not a greedy match, so you want +? rather than +. Let's try it:
>>> re.findall(r'align.+?(\d+).*([A-Z]\w+)', text)
[('12', 'John')]
Tada.
When you use .+, it will match as many characters as it can. Since the \d+ only needs to match at least one digit, the .+ will match "="right"><td>1" and leave only the "2" to be matched by the \d+.
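To see exactly what the wildcard consumed, one option is to wrap it in a group of its own. This is only an illustrative sketch on the sample string; the extra group is not part of the original pattern:

```python
import re

text = '<tr align="right"><td>12</td><td>John</td>'

# Wrap the wildcard in a group so we can inspect what it matched.
greedy = re.search(r'align(.+)(\d+)', text)
lazy = re.search(r'align(.+?)(\d+)', text)

print(repr(greedy.group(1)), greedy.group(2))  # the greedy .+ takes the 1
print(repr(lazy.group(1)), lazy.group(2))      # the lazy .+? stops before 12
```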
Your original example is working for your sample data. If you need to write a regex that works on other data, you'll need to explain what the format of that data is and how you want to decide what parts to extract.
Also, given that you seem to be parsing HTML, you're probably better off using something like BeautifulSoup instead of regexes.
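If installing a third-party parser is not an option, a rough sketch with the standard library's html.parser is also possible. The class name CellCollector and the helper extract_cells are hypothetical, and the sketch assumes that only the text inside td elements is wanted:

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text content of every <td> element."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self.in_cell:
            self.cells[-1] += data

def extract_cells(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells

print(extract_cells('<tr align="right"><td>12</td><td>John</td>'))
```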
EDIT: remove original example because it provoked ancillary answers. also fixed the title.
The question is why the presence of the "$" in the regular expression affects the greediness of the expression:
Here is a simpler example:
>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'
The "?" seems to be doing nothing. Note that when the "$" is removed, however, the "?" is respected:
>>> m = re.search(r"a+?", str)
>>> m.group()
'a'
EDIT:
In other words, "a+?$" is matching ALL of the a's instead of just the last one, this is not what I expected. Here is the description of the regex "+?" from the python docs:
"Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."
This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaaa" just a single a (the rightmost one)?
Matches are "ordered" by "left-most, then longest"; however "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being left-most is more important than the number of repetitions. Thus, "a+?$" will not match the last A in "baaaaa" because matching at the first A starts earlier in the string.
(Answer changed after OP clarification in comments. See history for previous text.)
The non-greedy modifier only affects where the match stops, never where it starts. If you want the part you care about to start as late as possible, put a greedy .* in front of the pattern and capture that part in a group, as in .*(a+?)$.
Without the $, your pattern is allowed to be less greedy and stop sooner, because it doesn't have to match to the end of the string.
EDIT:
More details... In this case:
re.search(r"a+?$", "baaaaaaaa")
the regex engine will ignore everything up until the first 'a', because that's how re.search works. It will match the first a, and would "want" to return a match, except it doesn't match the pattern yet because it must reach a match for the $. So it just keeps eating the a's one at a time and checking for $. If it were greedy, it wouldn't check for the $ after each a, but only after it couldn't match any more a's.
But in this case:
re.search(r"a+?", "baaaaaaaa")
the regex engine will check if it has a complete match after eating the first a (because it's non-greedy) and succeed because there is no $ in this case.
The presence of the $ in the regular expression does not affect the greediness of the expression. It merely adds another condition which must be met for the overall match to succeed.
Both a+ and a+? are required to consume the first a they find. If that a is followed by more a's, a+ goes ahead and consumes them too, while a+? is content with just the one. If there were anything more to the regex, a+ would be willing to settle for fewer a's, and a+? would consume more, if that's what it took to achieve a match.
With a+$ and a+?$, you've added another condition: match at least one a followed by the end of the string. a+ still consumes all of the a's initially, then it hands off to the anchor ($). That succeeds on the first try, so a+ is not required to give back any of its a's.
On the other hand, a+? initially consumes just the one a before handing off to $. That fails, so control is returned to a+?, which consumes another a and hands off again. And so it goes, until a+? consumes the last a and $ finally succeeds. So yes, a+?$ does match the same number of a's as a+$, but it does so reluctantly, not greedily.
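The hand-off described above can be checked by looking at match spans. This is a small sketch on the string from the question; the last pattern, with a greedy .* prefix and a capturing group, is an added illustration of one way to get only the rightmost a:

```python
import re

s = "baaaaaaaa"

greedy = re.search(r"a+$", s)
lazy = re.search(r"a+?$", s)

# Both start at index 1, the leftmost position where a match can succeed,
# and both must reach the end of the string to satisfy the $ anchor.
print(greedy.span(), lazy.span())

# A greedy prefix pushes the start of the captured group as far right as possible.
rightmost = re.search(r".*(a+?)$", s)
print(rightmost.span(1), rightmost.group(1))
```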
As for the leftmost-longest rule that was mentioned elsewhere, that never did apply to Perl-derived regex flavors like Python's. Even without reluctant quantifiers, they could always return a less-than-maximal match thanks to ordered alternation. I think Jan's got the right idea: Perl-derived (or regex-directed) flavors should be called eager, not greedy.
I believe the leftmost-longest rule only applies to POSIX NFA regexes, which use NFA engines under the hood, but are required to return the same results a DFA (text-directed) regex would.
Answer to original question:
Why does the first search() span multiple "/"s rather than taking the shortest match?
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding. In your example, the last subpattern is $, so the previous ones need to stretch out to the end of the string.
Answer to revised question:
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding.
Another way of looking at it: A non-greedy subpattern will initially match the shortest possible match. However if this causes the whole pattern to fail, it will be retried with an extra character. This process continues until the subpattern fails (causing the whole pattern to fail) or the whole pattern matches.
There are two issues going on, here. You used group() without specifying a group, and I can tell you are getting confused between the behavior of regular expressions with an explicitly parenthesized group and without a parenthesized group. This behavior without parentheses that you are observing is just a shortcut that Python provides, and you need to read the documentation on group() to understand it fully.
>>> import re
>>> string = "baaa"
>>>
>>> # Here you're searching for one or more `a`s until the end of the line.
>>> pattern = re.search(r"a+$", string)
>>> pattern.group()
'aaa'
>>>
>>> # This means the same thing as above, since the presence of the `$`
>>> # cancels out any meaning that the `?` might have.
>>> pattern = re.search(r"a+?$", string)
>>> pattern.group()
'aaa'
>>>
>>> # Here you remove the `$`, so it matches the least amount of `a` it can.
>>> pattern = re.search(r"a+?", string)
>>> pattern.group()
'a'
Bottom line is that the string a+? matches one a, period. However, a+?$ matches a's until the end of the line. Note that without explicit grouping, you'll have a hard time getting the ? to mean anything at all, ever. In general, it's better to be explicit about what you're grouping with parentheses, anyway. Let me give you an example with explicit groups.
>>> # This is close to the example pattern with `a+?$` and therefore `a+$`.
>>> # It matches `a`s until the end of the line. Again the `?` can't do anything.
>>> pattern = re.search(r"(a+?)$", string)
>>> pattern.group(1)
'aaa'
>>>
>>> # In order to get the `?` to work, you need something else in your pattern
>>> # and outside your group that can be matched that will allow the selection
>>> # of `a`s to be lazy. In this case, the `.*` is greedy and will gobble up
>>> # everything that the lazy `a+?` doesn't want to.
>>> pattern = re.search(r"(a+?).*$", string)
>>> pattern.group(1)
'a'
Edit: Removed text related to old versions of the question.
Unless your question is leaving out some important information, you don't need, and shouldn't use, regex for this task.
>>> import os
>>> p = "/we/shant/see/this/butshouldseethis"
>>> os.path.basename(p)
'butshouldseethis'
I have a regular expression to find :ABC:`hello` pattern. This is the code.
format =r".*\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
m = patt.match(l.rstrip())
if m:
    ...
It works well when the pattern happens once in a line, but with an example ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`". It finds only the last one.
How can I find all the three patterns?
EDIT
Based on Paul Z's answer, I could get it working with this code
format = r"\:([^:]*)\:\`([^`]*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
    tag, value = m.groups()
    print(tag, ":::", value)
Result
tagbox ::: Verilog
tagbox ::: Multiply
tagbox ::: VHDL
Yeah, dcrosta suggested looking at the re module docs, which is probably a good idea, but I'm betting you actually wanted the finditer function. Try this:
format = r"\:([^:]*)\:\`([^`]*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
    tag, value = m.groups()
    ....
Your current solution always finds the last one because the initial .* eats as much as it can while still leaving a valid match (the last one). Incidentally, this is probably also making your program much slower than it needs to be, because .* first tries to eat the entire string, then backs up character by character as the remaining expression tells it "that was too much, go back". Using finditer should be much more performant.
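A short sketch of the difference on the sample line from the question. The variable names are illustrative, and the second pattern uses negated character classes so that no group can run past its closing delimiter:

```python
import re

line = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"

# Leading greedy .* swallows the earlier tags, so only the last one is captured.
greedy = re.compile(r".*:(.*):`(.*)`")
print(greedy.match(line).groups())

# Negated classes stop at the next ':' or backtick, so every tag is found.
tight = re.compile(r":([^:]*):`([^`]*)`")
print([m.groups() for m in tight.finditer(line)])
```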
A good place to start is the re module docs. In addition to re.match (which matches only at the beginning of the string), there is re.findall (finds all non-overlapping occurrences of the pattern), and the methods match and search of compiled RegexObjects, both of which accept start and end positions to limit the portion of the string being considered. See also split, which returns a list of substrings, split by the pattern. Depending on how you want your output, one of these may help.
re.findall or even better regex.findall can do that for you in a single line:
import regex as re #or just import re
s = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"
format = r"\:([^:]*)\:\`([^`]*)\`"
re.findall(format,s)
result is:
[('tagbox', 'Verilog'), ('tagbox', 'Multiply'), ('tagbox', 'VHDL')]