"." and "+" not working properly - python

I have a string, where
text='<tr align="right"><td>12</td><td>John</td>
and I would like to extract the tuple ('12', 'John'). It is working fine when I am using
m=re.findall(r'align.{13}(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
print m
but I am getting ('2', 'John'), when I am using
m=re.findall(r'align.+(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
print m
Why is it going wrong? I mean why .{13} works fine, but .+ fails to work in my re?
Thank you!

You should really be using a proper HTML parser library for this, e.g.:
>>> import lxml.html
>>> a = '<tr align="right"><td>12</td><td>John</td>'
>>> p = lxml.html.fromstring(a)
>>> p.text_content()
'12John'
>>> p.xpath('//td/text()')
['12', 'John']
Obviously you'd need to work this better for multiple occurrences...
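If lxml isn't available, the same extraction can be sketched with the standard library's html.parser (Python 3 spelling; this is an illustration, not the original answer's approach) and it naturally handles multiple cells or rows:

```python
from html.parser import HTMLParser

class TdCollector(HTMLParser):
    """Collects the text content of every <td> cell."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells.append(data)

parser = TdCollector()
parser.feed('<tr align="right"><td>12</td><td>John</td>')
print(parser.cells)  # ['12', 'John']
```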

I can't actually test this with the sample text and regexps you provided, because as written they clearly should find no matches, and in fact do find no matches in both 2.7 and 3.3.
But I'm guessing that you want a non-greedy match, and changing .+ to .+? will fix whatever your problem is.
As Jon Clements points out in his answer, you really shouldn't be using regular expressions here. Regexps cannot actually parse non-regular languages like XML. Of course, despite what the purists say, regexps can still be a useful hack for non-regular languages in quick&dirty cases. But as soon as you run into something that isn't working, the first thing you ought to do is consider that maybe this isn't one of those quick&dirty cases, and you should look for a real parser. Even if you'd never used the ElementTree API before, or XPath, they're pretty easy to learn, and the time spent learning is definitely not wasted, as it will come in handy many times in the future.
But anyway… let's reduce your sample to something that works as you describe, and see what this does:
>>> text='<tr align="right"><td>12</td><td>John</td>
SyntaxError: EOL while scanning string literal
>>> text='<tr align="right"><td>12</td><td>John</td>'
>>> re.findall(r'align.{13}(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
[]
>>> re.findall(r'align.{13}(\d+).*([A-Z]\w+)', text)
[('12', 'John')]
>>> re.findall(r'align.+(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
[]
>>> re.findall(r'align.+(\d+).*([A-Z]\w+)', text)
[('2', 'John')]
I think this is what you were complaining about. Well, .+ is not "not working properly"; it's doing exactly what you asked it to: match at least one character, and as many as possible, up to the point where the rest of the expression still has something to match. Which includes matching the 1, because the rest of the expression still matches.
If you want it to instead stop matching as soon as the rest of the expression can take over, that's a non-greedy match, not a greedy match, so you want +? rather than +. Let's try it:
>>> re.findall(r'align.+?(\d+).*([A-Z]\w+)', text)
[('12', 'John')]
Tada.

When you use .+, it will match as many characters as it can. Since the \d+ only needs to match at least one digit, the .+ will match '="right"><td>1' and leave only the '2' to be matched by the \d+.
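The backtracking can be made visible by capturing what the dot actually consumes in each case (a small illustration of the behaviour described above):

```python
import re

text = '<tr align="right"><td>12</td><td>John</td>'

# Greedy: .+ grabs as much as it can, so \d+ is left with only '2'
g = re.search(r'align(.+)(\d+)', text)
print(g.group(1), '|', g.group(2))   # ="right"><td>1 | 2

# Non-greedy: .+? gives back as soon as \d+ can match, so \d+ gets '12'
n = re.search(r'align(.+?)(\d+)', text)
print(n.group(1), '|', n.group(2))   # ="right"><td> | 12
```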
Your original example is working for your sample data. If you need to write a regex that works on other data, you'll need to explain what the format of that data is and how you want to decide what parts to extract.
Also, given that you seem to be parsing HTML, you're probably better off using something like BeautifulSoup instead of regexes.

Related

How can I speed up an email-finding regular expression when searching through a massive string?

I have a massive string. It looks something like this:
hej34g934gj93gh398gie foo#bar.com e34y9u394y3h4jhhrjg bar#foo.com hge98gej9rg938h9g34gug
Except that it's much longer (1,000,000+ characters).
My goal is to find all the email addresses in this string.
I've tried a number of solutions, including this one:
#matches foo#bar.com and bar#foo.com
re.findall(r'[\w\.-]{1,100}#[\w\.-]{1,100}', line)
Although the above code technically works, it takes an insane amount of time to execute. I'm not sure if it counts as catastrophic backtracking or if it's just really inefficient, but whatever the case, it's not good enough for my use case.
I suspect that there's a better way to do this. For example, if I use this regex to only search for the latter part of the email addresses:
#matches #bar.com and #foo.com
re.findall(r'#[\w-]{1,256}[\.]{1}[a-z.]{1,64}', line)
It executes in just a few milliseconds.
I'm not familiar enough with regex to write the rest, but I assume that there's some way to find the #x.x part first and then check the first part afterwards? If so, then I'm guessing that would be a lot quicker.
You can use the PyPI regex module by Matthew Barnett, which is much more powerful and stable when it comes to parsing long texts. This regex library has some basic checks for pathological cases implemented. The library author mentions in his post:
The internal engine no longer interprets a form of bytecode but
instead follows a linked set of nodes, and it can work breadth-wise as
well as depth-first, which makes it perform much better when faced
with one of those 'pathological' regexes.
However, there is yet another trick you may implement in your regex: Python re (and regex, too) optimize matching at word boundary locations. Thus, if your pattern is supposed to match at a word boundary, always start your pattern with it. In your case, r'\b[\w.-]{1,100}#[\w.-]{1,100}' or r'\b\w[\w.-]{0,99}#[\w.-]{1,100}' should also work much better than the original pattern without a word boundary.
Python test:
import re, regex, timeit
text='your_long_string'
re_pattern=re.compile(r'\b\w[\w.-]{0,99}#[\w.-]{1,100}')
regex_pattern=regex.compile(r'\b\w[\w.-]{0,99}#[\w.-]{1,100}')
timeit.timeit("p.findall(text)", 'from __main__ import text, re_pattern as p', number=100000)
# => 6034.659449000001
timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern as p', number=100000)
# => 218.1561693
Don't use a regex on the whole string. Regexes are slow. Avoiding them is your best bet for better overall performance.
My first approach would look like this:
Split the string on spaces.
Filter the result down to the parts that contain #.
Create a pre-compiled regex.
Use regex on the remaining parts only to remove false positives.
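A rough sketch of that split-and-filter idea (the # separator and the 100-character limits are taken from the question; token-based splitting will miss addresses glued to punctuation):

```python
import re

EMAIL = re.compile(r'[\w.-]{1,100}#[\w.-]{1,100}')

def find_emails(line):
    # Cheap pre-filter: only run the pre-compiled regex on
    # whitespace-separated tokens that contain '#' at all.
    return [t for t in line.split() if '#' in t and EMAIL.fullmatch(t)]

text = ('hej34g934gj93gh398gie foo#bar.com e34y9u394y3h4jhhrjg '
        'bar#foo.com hge98gej9rg938h9g34gug')
print(find_emails(text))  # ['foo#bar.com', 'bar#foo.com']
```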
Another idea:
in a loop....
use .index("#") to find the position of the next candidate
extend e.g. 100 characters to the left, 50 to the right to cover name and domain
adapt the range depending on the last email address you found so you don't overlap
check the range with a regex, if it matches, yield the match
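The loop above might be sketched like this (the window sizes and the exact extraction around the '#' are assumptions; real data may need the overlap handling mentioned in the last step):

```python
import re

def emails_by_index(text, left=100, right=100):
    # left/right window sizes are guesses; tune them to your data
    found = []
    start = 0
    while True:
        i = text.find('#', start)
        if i == -1:
            return found
        # trailing run of address characters just before the '#' ...
        lm = re.search(r'[\w.-]{1,100}$', text[max(0, i - left):i])
        # ... and the leading run just after it
        rm = re.match(r'[\w.-]{1,100}', text[i + 1:i + 1 + right])
        if lm and rm:
            found.append(lm.group() + '#' + rm.group())
        start = i + 1

text = ('hej34g934gj93gh398gie foo#bar.com e34y9u394y3h4jhhrjg '
        'bar#foo.com hge98gej9rg938h9g34gug')
print(emails_by_index(text))  # ['foo#bar.com', 'bar#foo.com']
```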

Regular expression for 'b' not preceded by an odd number of 'a's [duplicate]

I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:
oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")
In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:
oAcctNum = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textBlock[indexVal])
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9])", textBlock[indexVal]).group(1)
Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.
Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.
Notice that if you can use groups, you generally do not need lookbehinds. So how about
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
    text = match.group(1)
In practice:
>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 14), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
You need to use a capture group in the case you described:
"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"
will become
r"ORIG\s?:\s?/\s?([A-Z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
Here is a sample code:
import re
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print re.search(p, test_str).group(1)
IDEONE demo
Unless you need overlapping matches, using a capturing group instead of a look-behind is rather straightforward. You can directly use findall, which returns all the groups if any are present:
print re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)",test_str)

regex match proc name without slash

I have a list of proc names on Linux. Some have slash, some don't. For example,
kworker/23:1
migration/39
qmgr
I need to extract just the proc name, without the slash and what follows it. I tried a few different ways but still can't get it completely right. What's wrong with my regex? Any help would be much appreciated.
>>> str='kworker/23:1'
>>> match=re.search(r'^(.+)\/*',str)
>>> match.group(1)
'kworker/23:1'
The problem with the regex is that the greedy .+ goes all the way to the end of the string, because everything after it is optional and is therefore kept as short as possible (essentially empty). To fix this, replace the . with anything but a /.
([^\/]+)\/?.*
works. You can test this regex here. In case it is new to you, [^\/] matches anything but a slash, as the ^ at the beginning of the character class inverts which characters are matched.
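A quick check of the negated-class pattern against the three sample names from the question:

```python
import re

names = ['kworker/23:1', 'migration/39', 'qmgr']
# group(1) is everything up to (but not including) the first slash
procs = [re.match(r'([^\/]+)\/?.*', s).group(1) for s in names]
print(procs)  # ['kworker', 'migration', 'qmgr']
```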
Alternatively, you can also use split as suggested by Moses Koledoye. split is often better for simple string manipulation, while regex enables you to perform very complex tasks with rather little code.
An alternative to regex is to split on slash and take the first item:
>>> s ='kworker/23:1'
>>> s.split('/')[0]
'kworker'
This also works when the string does not contain a slash:
>>> s = 'qmgr'
>>> s.split('/')[0]
'qmgr'
But if you're going to stick to re, I think re.sub is what you want, as you won't need to fetch the matching group:
>>> import re
>>> s ='kworker/23:1'
>>> re.sub(r'/.*$', '', s)
'kworker'
On a side note, assigning the name str shadows the built-in string type, which you don't want.

Find string in possibly multiple parentheses?

I am looking for a regular expression that discriminates between a string that contains a numerical value enclosed in parentheses, and a string that contains one outside of them. The problem is, parentheses may be nested inside each other:
So, for example the expression should match the following strings:
hey(example1)
also(this(onetoo2(hard)))
but(here(is(a(harder)one)maybe23)Hehe)
But it should not match any of the following:
this(one)is22misleading
how(to(go)on)with(multiple)3parent(heses(around))
So far I've tried
\d[A-Za-z] \)
and easy things like this one. The problem with this one is that it does not match example 2, because there the digit is followed by a ( rather than a letter.
How could I solve this one?
The problem is not one of pattern matching. That means regular expressions are not the right tool for this.
Instead, you need lexical analysis and parsing. There are many libraries available for that job.
You might try the parsing or pyparsing libraries.
These types of regexes are not always easy, but sometimes it's possible to come up with a way, provided the input remains somewhat consistent. A pattern generally like this should work:
(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)
Code:
import re
p = re.compile(ur'(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)', re.MULTILINE)
result = re.findall(p, searchtext)
print(result)
Result:
https://regex101.com/r/aL8bB8/1
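As a quick sanity check, the pattern can be run against the sample strings from the question (using a plain Python 3 raw string rather than the ur'' literal):

```python
import re

p = re.compile(r'(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)')

should_match = ['hey(example1)',
                'also(this(onetoo2(hard)))',
                'but(here(is(a(harder)one)maybe23)Hehe)']
should_not_match = ['this(one)is22misleading',
                    'how(to(go)on)with(multiple)3parent(heses(around))']

for s in should_match:
    assert p.search(s) is not None, s
for s in should_not_match:
    assert p.search(s) is None, s
print('all samples behave as expected')
```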

Clarification on Python regexes and findall()

I came across this problem as I was working on the Python Challenge. Number 10 to be exact. I decided to try and solve it using regexes - pulling out the repeating sequences, counting their length, and building the next item in the sequence off of that.
So the regex I developed was: '(\d)\1*'
It worked well on the online regex tester, but when using it in my script it didn't perform the same:
regex = re.compile('(\d)\1*')
text = '111122223333'
re.findall(regex, text)
> ['1', '1', '1', '1', '2', '2', '2',...]
And so on and so forth. So I learned about raw string literals in Python. Which is my first question: can someone please explain what exactly they do? The docs describe them as reducing the need to escape backslashes, but it doesn't appear that they're required for simpler regexes such as \d+ and I don't understand why.
So I change my regex to r'(\d)\1*' and now try and use findall() to make a list of the sequences. And I get
> ['1', '2', '3']
Very confused again. I still don't understand this. Help please?
I decided to do this to get around this:
[m.group() for m in regex.finditer(text)]
> ['1111', '2222', '3333']
And get what I've been looking for. Then, based off of this thread, I try doing findall() adding a group to the whole regex -> r'((\d)\2*)'.
I end up getting:
> [('1111', '1'), ('2222', '2'), ('3333', '3')]
At this point I'm all kinds of confused. I know that this result has something to do with multiple groups, but I'm just not sure.
Also, this is my first time posting so I apologize if my etiquette isn't correct. Please feel free to correct me on that as well. Thanks!
Since this is the challenge I won't give you a complete answer. You are on the right track however.
The finditer method returns MatchObject instances. You want to look at the .group() method on these and read the documentation carefully. Think about what the difference is between .group(0) and .group(1) there; plain .group() is the same as .group(0).
As for the \d escape sequence: because that particular combination has no meaning as a Python string escape, Python leaves it as a backslash followed by the letter d. It would indeed be better to use the r'' literal string format, as it prevents nasty surprises when you do want to use a regular-expression character set that also happens to be an escape sequence Python does recognize. See the Python documentation on string literals for more information.
Your .findall() with the r'((\d)\2*)' expression returns 2 elements per match as you have 2 groups in your pattern; the outer, whole group matching (\d)\2* and the inner group matching \d. From the .findall() documentation:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
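Putting the behaviours from the question side by side makes that rule concrete:

```python
import re

text = '111122223333'

# One group: findall returns just that group's text per match.
print(re.findall(r'(\d)\1*', text))    # ['1', '2', '3']

# Two groups: findall returns a tuple (group 1, group 2) per match.
print(re.findall(r'((\d)\2*)', text))  # [('1111', '1'), ('2222', '2'), ('3333', '3')]

# finditer + .group() (i.e. .group(0)) gives the full match text.
print([m.group() for m in re.finditer(r'(\d)\1*', text)])  # ['1111', '2222', '3333']
```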
