Regex quantifiers - python

I'm new to regex and this is stumping me.
In the following example, I want to extract facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info. I've read up on lazy quantifiers and lookbehinds but I still can't piece together the right regex. I'd expect facebook.com\/.*?sk=info to work but it captures too much. Can you guys help?
<i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_96df30"></i></span><span class="fbProfileBylineLabel"><span itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress">7508 15th Avenue, Brooklyn, New York 11228</span></span></span><span class="fbProfileBylineFragment"><span class="fbProfileBylineIconContainer"><i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_9f18df"></i></span><span class="fbProfileBylineLabel"><span itemprop="telephone">(718) 837-9004</span></span></span></div></div></div><a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info" aria-label="About Dr. Morris Westfried - Dermatologist">

As much as I love regex, this is an HTML parsing task:
>>> from bs4 import BeautifulSoup
>>> html = .... # that whole text in the question
>>> soup = BeautifulSoup(html, 'html.parser')
>>> pred = lambda tag: tag.get('href', '').endswith('sk=info')
>>> [tag['href'] for tag in filter(pred, soup.find_all('a'))]
['https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info']
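If installing BeautifulSoup is not an option, roughly the same filter can be sketched with the standard library's html.parser. The class name HrefCollector and the shortened sample markup are made up for illustration; only the href value is taken from the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags that end with a given suffix."""

    def __init__(self, suffix):
        super().__init__()
        self.suffix = suffix
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.endswith(self.suffix):
                self.hrefs.append(href)


# Shortened stand-in for the markup in the question (assumption: one <a> tag).
sample = (
    '<span itemprop="telephone">(718) 837-9004</span>'
    '<a class="title" href="https://www.facebook.com/pages/'
    'Dr-Morris-Westfried-Dermatologist/176363502456825'
    '?id=176363502456825&sk=info" aria-label="About">'
)

collector = HrefCollector("sk=info")
collector.feed(sample)
collector.close()
print(collector.hrefs)
```

Unlike the BeautifulSoup version, this does not build a tree; it only reacts to start tags as they are parsed.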

This works :)
facebook\.com\/[^>]*?sk=info
With only .* the match starts at the first facebook.com and then continues until sk=info. Since there's another facebook.com in between, the match spans both of them.
The one character that never appears inside the part you want is a > (or <, among other characters), so changing "any character" to "any character but >" finds the facebook.com closest to sk=info, as you want.
And yes, regex should only be used on HTML for basic tasks. Otherwise, use a parser.
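A minimal sketch of the difference, run on a made-up and much shorter string than the one in the question (the first URL and the shortened path are assumptions for illustration):

```python
import re

# Shortened, made-up stand-in for the question's HTML: two facebook.com URLs,
# only the second one ends in sk=info.
html = (
    '<a href="https://www.facebook.com/other">x</a>'
    '<a class="title" href="https://www.facebook.com/pages/Dr/1?id=1&sk=info">'
)

lazy = re.search(r"facebook\.com/.*?sk=info", html).group()
bounded = re.search(r"facebook\.com/[^>]*?sk=info", html).group()

print(lazy)     # starts at the first facebook.com and runs across both tags
print(bounded)  # facebook.com/pages/Dr/1?id=1&sk=info
```

The lazy quantifier only shortens the right-hand end of the match; the negated class is what stops the first facebook.com from being a valid starting point.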

Why your pattern doesn't work:
Your pattern doesn't work because the regex engine tries your pattern from left to right in the string.
When the regex engine meets the first facebook.com\/ in the string, and since you use .*? after it, the engine adds to the (possible) match result all the characters (including ", > and spaces) until it finds sk=info (since . can match any character except a newline).
This is why fejese suggests replacing the dot with [^"] and aliteralmind suggests replacing it with [^>]: either one makes the pattern fail at that first position in the string.
Using an HTML parser is the easiest way if you want to deal with HTML. However, for a one-off match or search/replace, note that while an HTML parser provides safety and simplicity, it has a cost in terms of performance, since you need to load the whole tree of your document for a single task.

The problem is that you have another facebook.com part. You can restrict the .* so that it does not match ", which keeps the match within one attribute:
facebook\.com\/[^"]*sk=info

Related

Beautiful soup if class not like "string" or regex

I know that beautiful soup has a function to match classes based on regex that contains certain strings, based on a post here. Below is a code example from that post:
import re
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all("div", {"class" : regex}):
    print(EachPart.get_text())
Now, is it possible to do the opposite? Basically, find classes that do not contain a certain regex. In SQL language, it's like:
where class not like '%test%'
Thanks in advance!
This can actually be done by using a negative lookahead.
A negative lookahead has the syntax (?!«pattern») and matches if pattern does not match what comes after the current location in the input string.
In your case, you could use the following regex to match all classes that don’t contain listing-col- in their name:
regex = re.compile('^((?!listing-col-).)*$')
Here’s the pretty simple and straightforward explanation of this regex ^((?!listing-col-).)*$:
^ asserts position at start of a line
Capturing Group ((?!listing-col-).)*
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed
Negative Lookahead (?!listing-col-).
Assert that the Regex below does not match.
listing-col- matches the characters listing-col- literally (case sensitive)
. matches any character
$ asserts position at the end of a line
Also, you may find the https://regex101.com site useful
It will help you test your patterns and show you a detailed explanation of each step. It's your best friend in writing regular expressions.
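As a quick check of the pattern on its own, here is a small sketch that applies it to plain strings with the re module (not through BeautifulSoup). The class names in the list are invented for illustration.

```python
import re

regex = re.compile(r"^((?!listing-col-).)*$")

# Hypothetical class names, not taken from any real page.
classes = ["listing-col-left", "sidebar", "my-listing-col-2", "footer"]

# Keep only the names in which "listing-col-" does not occur anywhere.
kept = [name for name in classes if regex.match(name)]
print(kept)  # ['sidebar', 'footer']
```

The lookahead is checked before every single character, which is why a name such as my-listing-col-2 is rejected even though it does not start with listing-col-.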
One possible solution is utilizing regex directly.
You can refer to Regular expression to match a line that doesn't contain a word.
Or you can introduce a function to implement the logic and pass it to find_all as a parameter.
You can refer to https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#find-all
You can use CSS selector syntax with the :not() pseudo-class and the *= (contains) attribute operator:
data = [i.get_text() for i in soup.select('div[class]:not([class*="listing-col-"])')]

Python regex: Difference between (.+) and (.+?)

I am new to regex and Python's urllib. I went through an online tutorial on web scraping and it had the following code. After studying up on regular expressions, it seemed to me that I could use (.+) instead of the (.+?) in my regex, but whoa was I wrong. I ended up printing way more html code than I wanted. I thought I was getting the hang of regex, but now I am confused. Please explain to me the difference between these two expressions and why it is grabbing so much html. Thanks!
ps. this is a starbucks stock quote scraper.
import urllib
import re
url = urllib.urlopen("http://finance.yahoo.com/q?s=SBUX")
htmltext = url.read()
regex = re.compile('<span id="yfs_l84_sbux">(.+?)</span>')
found = re.findall(regex, htmltext)
print found
.+ is greedy -- it matches until it can't match any more and gives back only as much as needed.
.+? is not -- it stops at the first opportunity.
Examples:
Assume you have this HTML:
<span id="yfs_l84_sbux">foo bar</span><span id="yfs_l84_sbux2">foo bar</span>
This regex matches the whole thing:
<span id="yfs_l84_sbux">(.+)<\/span>
It goes all the way to the end, then "gives back" one </span>, but the rest of the regex matches that last </span>, so the complete regex matches the entire HTML chunk.
But this regex stops at the first </span>:
<span id="yfs_l84_sbux">(.+?)<\/span>
? is a non-greedy modifier. * by default is a greedy repetition operator - it will gobble up everything it can; when modified by ? it becomes non-greedy and will eat up only as much as will satisfy it.
Thus for
<span id="yfs_l84_sbux">want</span>text<span id="somethingelse">dontwant</span>
.*?</span> will eat up want, then hit </span> - and this satisfies the regexp with minimal repetitions of ., resulting in <span id="yfs_l84_sbux">want</span> being the match. However, .* will try to see if it can eat more - it will go and find the other </span>, with .* matching want</span>text<span id="somethingelse">dontwant, resulting in what you got - much more than you wanted.
(.+) is greedy. It takes what it can and gives back when needed.
(.+?) is ungreedy. It takes as few as possible.
See:
delegate
[delegate] /^(.+)e/
[de]legate /^(.+?)e/
Also, comparing the "Regex debugger log" for the two patterns will show you in more detail what the ungreedy modifier does.
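To see both behaviours side by side, here is a small sketch that runs the greedy and lazy patterns on the two-span sample quoted earlier in this thread, followed by the delegate example. It uses only the re module; the variable names are arbitrary.

```python
import re

html = ('<span id="yfs_l84_sbux">foo bar</span>'
        '<span id="yfs_l84_sbux2">foo bar</span>')

greedy = re.findall(r'<span id="yfs_l84_sbux">(.+)</span>', html)
lazy = re.findall(r'<span id="yfs_l84_sbux">(.+?)</span>', html)

print(greedy)  # ['foo bar</span><span id="yfs_l84_sbux2">foo bar']
print(lazy)    # ['foo bar']

# The delegate example: the greedy group backs off only as far as the last "e",
# the lazy group stops at the first one.
greedy_word = re.match(r"^(.+)e", "delegate").group()
lazy_word = re.match(r"^(.+?)e", "delegate").group()

print(greedy_word)  # delegate
print(lazy_word)    # de
```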

Regular Expressions: Find Names in String using Python

I have never had a very hard time with regular expressions up until now. I am hoping the solution is not obvious because I have probably spent a few hours on this problem.
This is my string:
<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'
I want to extract 'Soko' and 'Jacob Escobedo' as individual strings. If it takes two different patterns for the extractions, that is okay with me.
I have tried "\s([A-Za-z0-9]{1}.+?)," and other alterations of that regex to get the data I want but I have had no success. Any help is appreciated.
The names never follow the same tag or the same symbol. The only thing that consistently precedes the names is a space (\s).
Here is another string as an example:
<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>
An alternative approach would be to parse the string with an HTML parser, like lxml.
For example, you can use the xpath to find everything between a b tag with Carson Daly text and br tag by checking preceding and following siblings:
from lxml.html import fromstring

l = [
    """<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""
]

for html in l:
    tree = fromstring(html)
    results = ''
    for element in tree.xpath('//node()[preceding-sibling::b="Carson Daly" and following-sibling::br]'):
        if not isinstance(element, str):
            results += element.text.strip()
        else:
            text = element.strip(':')
            if text:
                results += text.strip()
    print(results.split(', '))
It prints:
['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
If you want to do it in regex (and with all the disclaimers on that topic), the following regex works with your strings. However, do note that you need to retrieve your matches from capture Group 1. In the online demo, make sure you look at the Group 1 captures in the bottom right pane. :)
<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))
Basically, with the left alternations (separated by |) we match everything we don't want, then the final parentheses on the right capture what we do want.
This is an application of this question about matching a pattern except in certain situations (read that for implementation details including links to Python code).
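A minimal sketch of how the Group 1 captures could be collected in Python with re.findall, using the two strings from the question. The variable names are arbitrary, and dropping the empty strings is one way (an assumption, not the only one) to discard the matches that came from the "unwanted" alternatives.

```python
import re

pattern = re.compile(r"<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))")

lines = [
    "<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'",
    "<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>",
]

results = []
for line in lines:
    # findall returns Group 1 for every match; it is empty when one of the
    # alternatives on the left matched, so drop the empty strings.
    names = [name for name in pattern.findall(line) if name]
    results.append(names)
    print(names)
```

Note that the first name on each line is captured as well, so slice the list (for example names[1:]) if only the later names are wanted.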

file cleaner using regex

So I am trying to write my own scripts that will take in html files and return errors as well as clean them (doing this to learn regex and because I find it useful)
I am starting by having a quick function that takes the document, and grabs all of the tags in the correct order so I can check to make sure that they are all closed...I use the following:
>>> s = """<a>link</a>
... <div id="something">
... <p style="background-color:#f00">paragraph</p>
... </div>"""
>>> re.findall('(?m)<.*>',s)
['<a>link</a>', '<div id="something">', '<p style="background-color:#f00">paragraph</p>', '</div>']
I understand that it grabs everything between the first and last angle brackets, so the match becomes the whole line. What would I use to return the following:
['<a>','</a>', '<div id="something">', '<p style="background-color:#f00">','</p>', '</div>']
re.findall('(?m)<.*?>',s)
-- or --
re.findall('(?m)<[^>]*>',s)
The question mark after the * causes it to be a non-greedy match, meaning that it only takes as much as it needs, as opposed to normal, where it takes as much as possible.
The second form is used more often, and it uses a negated character class to match everything but >, since that character will never appear inside the tag except at its end.
Although you really shouldn't be parsing HTML with regex, I understand that this is a learning exercise.
You only need to add one more character:
>>> re.findall('(?m)<.*?>',s) # See the ? after .*
['<a>', '</a>', '<div id="something">', '<p style="background-color:#f00">', '</p>', '</div>']
*? matches 0 or more of the preceding value (in this case, .). This is a lazy match, and will match as few characters as possible.
re.findall('(?m)<[^<>]+>',s)
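Since the stated goal is to check that every tag is closed, here is a rough sketch of how the extracted tags could be fed into a stack. The names TAG and unclosed_tags are made up, and the check is deliberately naive: it does not know about void elements such as <br>, comments, self-closing syntax, or > inside attribute values.

```python
import re

# Group 1 is "/" for closing tags, group 2 is the tag name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def unclosed_tags(html):
    """Return the names of tags that were opened but never closed."""
    stack = []
    for slash, name in TAG.findall(html):
        if not slash:
            stack.append(name)          # opening tag: remember it
        elif stack and stack[-1] == name:
            stack.pop()                 # closing tag that matches the last opener
    return stack


s = """<a>link</a>
<div id="something">
<p style="background-color:#f00">paragraph</p>
</div>"""

print(unclosed_tags(s))               # []
print(unclosed_tags("<div><p>text"))  # ['div', 'p']
```

For anything beyond a learning exercise, a real HTML parser handles these edge cases far more reliably.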

Can I have a non-greedy regex with dotall?

I would like to match dotall and non-greedy. This is what I have:
img(.*?)(onmouseover)+?(.*?)a
However, this is not being non-greedy. This data is not matching as I expected:
<img src="icon_siteItem.gif" alt="siteItem" title="A version of this resource is available on siteItem" border="0"></a><br><br></td><td rowspan="4" width="20"></td></tr><tr><td>An activity in which students find other more specific adjectives to
describe a range of nouns, followed by writing a postcard to describe a
nice holiday without using the word 'nice'.</td></tr><tr><td>From the resource collection: Drafting </td></tr><tr><td><abbr style="border-bottom:0px" title="Key Stage 3">thing</abbr> | <abbr style="border-bottom:0px" title="Key Stage 4">hello</abbr> | <abbr style="border-bottom:0px" title="Resources">Skills</abbr></td></tr></tbody></table></div></div></td></tr><tr><td><div style="padding-left: 30px"><div><table style="" bgcolor="#DFE7EE" border="0" cellpadding="0" cellspacing="5" width="100%"><tbody><tr valign="top"><td rowspan="4" width="60"><img name="/attachments/3700.pdf" onmouseover="ChangeImageOnRollover(this,'/application/files/images/attach_icons/rollover_pdf.gif')" onmouseout="ChangeImageOnRollover(this,'/application/files/images/attach_icons/small_pdf.gif')" src="small_pdf.gif" alt="Download Recognising and avoiding ambiguity in PDF format" title="Download in PDF format" style="vertical-align: middle;" border="0"><br>790.0 k<br>
and I cannot understand why.
What I think I am stating in the above regex is:
start with "img", then allow 0 or more any character including new line, then look for at least 1 "onmouseover", then allow 0 or more any character including new line, then an "a"
Why doesn't this work as I expected?
KEY POINT: dotall must be enabled
It is being non-greedy.
It is your understanding of non-greedy that is not correct.
A regex will always try to match.
Let me show a simplified example of what non-greedy actually means (as suggested by a comment):
re.findall(r'a*?bc*?', 'aabcc', re.DOTALL)
This will match:
as few repetitions of 'a' as possible (in this case 2)
followed by a 'b'
and as few repetitions of 'c' as possible (in this case 0)
so the only match is 'aab'.
And just to conclude:
Don't use regex to parse HTML. There are libraries that were made for the job. re is not one of them.
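The same rule explains the result in the question: the engine always takes the leftmost position where a match can start, and the lazy quantifier only decides where that match ends. A small sketch on a made-up, shortened stand-in for the question's markup (the attribute values here are invented):

```python
import re

# The first <img> has no onmouseover attribute; the second one, on the next
# line, does.
html = '<img src="a.gif"></a>\n<td><img name="x.pdf" onmouseover="Change(this)">'

lazy = re.search(r"img(.*?)onmouseover", html, re.DOTALL).group()
bounded = re.search(r"img([^>]*?)onmouseover", html).group()

print(repr(lazy))     # begins at the first img and crosses the newline
print(repr(bounded))  # 'img name="x.pdf" onmouseover'
```

Replacing the dot with [^>] keeps the match inside a single tag, so the first img can no longer serve as the starting point.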
First of all, your regex looks a little funky: you're saying match "img", then any number of characters, "onmouseover" at least once, but possibly repeated (e.g. "onmouseoveronmouseoveronmouseover"), followed by any number of characters, followed by "a".
This should match from img src="icon_ all the way to onmouseover="Cha. That's probably not what you want, but it's what you asked for.
Second, and this is significantly more important:
DON'T USE REGULAR EXPRESSIONS TO PARSE HTML.
And in case you didn't understand it the first time, let me repeat it in italics:
DON'T USE REGULAR EXPRESSIONS TO PARSE HTML.
Finally, let me link you to the canonical grimoire on the subject:
You can't parse [X]HTML with a regex
