I would like to match dotall and non-greedy. This is what I have:
img(.*?)(onmouseover)+?(.*?)a
However, this is not being non-greedy. This data is not matching as I expected:
<img src="icon_siteItem.gif" alt="siteItem" title="A version of this resource is available on siteItem" border="0"></a><br><br></td><td rowspan="4" width="20"></td></tr><tr><td>An activity in which students find other more specific adjectives to
describe a range of nouns, followed by writing a postcard to describe a
nice holiday without using the word 'nice'.</td></tr><tr><td>From the resource collection: Drafting </td></tr><tr><td><abbr style="border-bottom:0px" title="Key Stage 3">thing</abbr> | <abbr style="border-bottom:0px" title="Key Stage 4">hello</abbr> | <abbr style="border-bottom:0px" title="Resources">Skills</abbr></td></tr></tbody></table></div></div></td></tr><tr><td><div style="padding-left: 30px"><div><table style="" bgcolor="#DFE7EE" border="0" cellpadding="0" cellspacing="5" width="100%"><tbody><tr valign="top"><td rowspan="4" width="60"><img name="/attachments/3700.pdf" onmouseover="ChangeImageOnRollover(this,'/application/files/images/attach_icons/rollover_pdf.gif')" onmouseout="ChangeImageOnRollover(this,'/application/files/images/attach_icons/small_pdf.gif')" src="small_pdf.gif" alt="Download Recognising and avoiding ambiguity in PDF format" title="Download in PDF format" style="vertical-align: middle;" border="0"><br>790.0 k<br>
and I cannot understand why.
What I think I am stating in the above regex is:
start with "img", then allow 0 or more any character including new line, then look for at least 1 "onmouseover", then allow 0 or more any character including new line, then an "a"
Why doesn't this work as I expected?
KEY POINT: dotall must be enabled
It is being non-greedy.
It is your understanding of non-greedy that is not correct.
A regex will always try to match.
Let me show a simplified example of what non-greedy actually means(as suggested by a comment):
re.findall(r'a*?bc*?', 'aabcc', re.DOTALL)
This will match:
as few repetitions of 'a' as possible (in this case 2)
followed by a 'b'
and as few repetitions of 'c' as possible (in this case 0)
so the only match is 'aab'.
And just to conclude:
Don't use regex to parse HTML. There are libraries that were made for the job. re is not one of them.
First of all, your regex looks a little funky: you're saying match "img", then any number of characters, "onmouseover" at least once, but possibly repeated (e.g. "onmouseoveronmouseoveronmouseover"), followed by any number of characters, followed by "a".
This should match from img src="icon_ all the way to onmouseover="Cha. That's probably not what you want, but it's what you asked for.
Second, and this is significanly more important:
DON'T USE REGULAR EXPESSIONS TO PARSE HTML.
And in case you didn't understand it the first time, let me repeat it in italics:
DON'T USE REGULAR EXPESSIONS TO PARSE HTML.
Finally, let me link you to the canonical grimoire on the subject:
You can't parse [X]HTML with a regex
Related
I'm new to regex and this is stumping me.
In the following example, I want to extract facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info. I've read up on lazy quantifiers and lookbehinds but I still can't piece together the right regex. I'd expect facebook.com\/.*?sk=info to work but it captures too much. Can you guys help?
<i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_96df30"></i></span><span class="fbProfileBylineLabel"><span itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress">7508 15th Avenue, Brooklyn, New York 11228</span></span></span><span class="fbProfileBylineFragment"><span class="fbProfileBylineIconContainer"><i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_9f18df"></i></span><span class="fbProfileBylineLabel"><span itemprop="telephone">(718) 837-9004</span></span></span></div></div></div><a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info" aria-label="About Dr. Morris Westfried - Dermatologist">
As much as I love regex, this is an html parsing task:
>>> from bs4 import BeautifulSoup
>>> html = .... # that whole text in the question
>>> soup = BeautifulSoup(html)
>>> pred = lambda tag: tag.attrs['href'].endswith('sk=info')
>>> [tag.attrs['href'] for tag in filter(pred, soup.find_all('a'))]
['https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info']
This works :)
facebook\.com\/[^>]*?sk=info
Debuggex Demo
With only .* it finds the first facebook.com, and then continues until the sk=info. Since there's another facebook.com between, you overlap them.
The unique thing between that you don't want is a > (or <, among other characters), so changing anything to anything but a > finds the facebook.com closest to the sk=info, as you want.
And yes, using regex for HTML should only be used in basic tasks. Otherwise, use a parser.
Why your pattern doesn't work:
You pattern doesn't work because the regex engine try your pattern from left to right in the string.
When the regex engine meets the first facebook.com\/ in the string, and since you use .*? after, the regex engine will add to the (possible) match result all the characters (including " or > or spaces) until it finds sk=info (since . can match any characters except newlines).
This is the reason why fejese suggests to replace the dot with [^"] or aliteralmind suggests to replace it with [^>] to make the pattern fail at this position in the string (the first).
Using an html parser is the easiest way if you want to deal with html. However, for a ponctual match or search/replace, note that if an html parser provide security, simplicity, it has a cost in term of performance since you need to load the whole tree of your document for a single task.
The problem is that you have an other facebook.com part. You can restrict the .* not to match " so it needs to stay within one attribute:
facebook\.com\/[^"]*;sk=info
Sorry about the confusing title. I am trying to figure out a simple Regex problem, but cannot figure out what the solution is.
I have a HTML snippet from a larger HTML document.
<td class="grade">100.0</td>
<td class="teacher">Mathias, Jordan</td>
Other Regex separates the two, giving them those class-names. I use a positive look-ahead to check for a . or a , (period or comma), and assign them the class of grade or teacher (respectively).
The problem comes later, when I want to check if the code in-between these tags is blank.
i.e. : <td class="grade"></td>
I would like to use a positive look-behind to check if the class is either grade or teacher (grade|teacher). In addition, I would like to check that there is truly nothing in between the >< (conjunction of the empty tags).
So-far, this is what I have: (?<=.*(teacher|grade)*.+>?)[^.](?=</td>)
NOTE: This is in Python
Instead of pre-processing your HTML, trust in BeautifulSoup and use regular expression searches:
soup.find_all('td', text=re.compile(','))
finds all <td> elements with the direct text in the tag containing a comma.
So I am trying to write my own scripts that will take in html files and return errors as well as clean them (doing this to learn regex and because I find it useful)
I am starting by having a quick function that takes the document, and grabs all of the tags in the correct order so I can check to make sure that they are all closed...I use the following:
>>> s = """<a>link</a>
... <div id="something">
... <p style="background-color:#f00">paragraph</p>
... </div>"""
>>> re.findall('(?m)<.*>',s)
['<a>link</a>', '<div id="something">', '<p style="background-color:#f00">paragraph</p>', '</div>']
I understand that it grabs everything between the two carrot brackets, and that that becomes the whole line. What would I use to return the following:
['<a>','</a>', '<div id="something">', '<p style="background-color:#f00">','</p>', '</div>']
re.findall('(?m)<.*?>',s)
-- or --
re.findall('(?m)<[^>]*>',s)
The question mark after the * causes it to be a non-greedy match, meaning that it only takes as much as it needs, as opposed to normal, where it takes as much as possible.
The second form is used more often, and it uses a character class to match everything but <, since that will never exist anywhere inside the tag excepting the end.
Although you really shouldn't be parsing HTML with regex, I understand that this is a learning exercise.
You only need to add one more character:
>>> re.findall('(?m)<.*?>',s) # See the ? after .*
['<a>', '</a>', '<div id="something">', '<p style="background-color:#f00">', '</p>', '</div>']
*? matches 0 or more of the preceeding value (in this case, .). This is a lazy match, and will match as few characters as possible.
re.findall('(?m)<[^<^>.]+>',s)
I have the following regex to detect start and end script tags in the html file:
<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
meaning in short it will catch: <script "NOT THIS</s" > "NOT THIS</s" </script>
it works but needs really long time to detect <script>,
even minutes or hours for long strings
The lite version works perfectly even for long string:
<script[^<]*>[^<]*</script>
however, the extended pattern I use as well for other tags like <a> where < and > are possible to appears also as values of attributes.
python test:
import re
pattern = re.compile('<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:^s]))*)</script>', re.I + re.DOTALL)
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()
how can I fix it?
The inner part of regex (after <script>) should be changed and simplified.
PS :) Anticipate your answers about the wrong approach like using regex in html parsing,
I know very well many html/xml parsers, and what I can expect in often broken html code, and regex is really useful here.
comment:
well, I need to handle:
each <a < document like this.border="5px;">
and approach is to use parsers and regex together
BeautifulSoup is only 2k lines, which not handling every html and just extends regex from sgmllib.
and the main reason is that I must know exact the position where every tag starts and stop. and every broken html must be handled.
BS is not perfect, sometimes happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []
#Cylian:
atomic grouping as you know is not available in python's re.
so non-geedy everything .*? until <\s/\stag\s*>** is a winner at this time.
I know that is not perfect in that case:
re.search('<\sscript.?<\s*/\sscript\s>','< script </script> shit </script>').group()
but I can handle refused tail in the next parsing.
It's pretty obvious that html parsing with regex is not one battle figthing.
Use an HTML parser like beautifulsoup.
See the great answers for "Can I remove script tags with beautifulsoup?".
If your only tool is a hammer, every problem starts looking like a nail. Regular expressions are a powerful hammer but not always the best solution for some problems.
I guess you want to remove scripts from HTML posted by users for security reasons. If security is the main concern, regular expressions are hard to implement because there are so many things a hacker can modify to fool your regex, yet most browsers will happily evaluate... An specialized parser is easier to use, performs better and is safer.
If you are still thinking "why can't I use regex", read this answer pointed by mayhewr's comment. I could not put it better, the guy nailed it, and his 4433 upvotes are well deserved.
I don't know python, but I know regular expressions:
if you use the greedy/non-greedy operators you get a much simpler regex:
<script.*?>.*?</script>
This is assuming there are no nested scripts.
The problem in pattern is that it is backtracking. Using atomic groups this issue could be solved. Change your pattern to this**
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
^^^^^ ^^^^^
Explanation
<!--
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
Match the characters “<script” literally «<script»
Python does not support atomic grouping «(?>[^<]+?|<(?:[^/]|/(?:[^s])))*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+?»
Match any character that is NOT a “<” «[^<]+?»
Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the character “>” literally «>»
Python does not support atomic grouping «(?>[^<]+|<(?:[^/]|/(?:[^s]))*)»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+»
Match any character that is NOT a “<” «[^<]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))*»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the characters “</script>” literally «</script>»
-->
I want to extract data from such regex:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
I've found related question extract contents of regex
but in my case I shoud iterate somehow.
As paprika mentioned in his/her comment, you need to identify the desired parts of any matched text using ()'s to set off the capture groups. To get the contents from within the td tags, change:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
to:
<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>
^^^^^^^^^ ^^^^^^^^^^^ ^^^^^ ^^^^^^^^^^^
group 1 group 2 group 3 group 4
And then access the groups by number. (Just the first line, the line with the '^'s and the one naming the groups are just there to help you see the capture groups as specified by the parentheses.)
dataPattern = re.compile(r"<td>[a-zA-Z]+</td>... etc.")
match = dataPattern.find(htmlstring)
field1 = match.group(1)
field2 = match.group(2)
and so on. But you should know that using re's to crack HTML source is one of the paths toward madness. There are many potential surprises that will lurk in your input HTML, that are perfectly working HTML, but will easily defeat your re:
"<TD>" instead of "<td>"
spaces between tags, or between data and tags
" " spacing characters
Libraries like BeautifulSoup, lxml, or even pyparsing will make for more robust web scrapers.
As the poster clarified, the <td> tags should be removed from the string.
Note that the string you've shown us is just that: a string. Only if used in the context of regular expression functions is it a regular expression (a regexp object can be compiled from it).
You could remove the <td> tags as simply as this (assuming your string is stored in s):
s.replace('<td>','').replace('</td>','')
Watch out for the gotchas however: this is really of limited use in the context of real HTML, just as others pointed out.
Further, you should be aware that whatever regular expression [string] is left, what you can parse with that is probably not what you want, i.e. it's not going to automatically match anything that it matched before without <td> tags!