file cleaner using regex - python

So I am trying to write my own scripts that will take in html files and return errors as well as clean them (doing this to learn regex and because I find it useful)
I am starting by having a quick function that takes the document, and grabs all of the tags in the correct order so I can check to make sure that they are all closed...I use the following:
>>> s = """<a>link</a>
... <div id="something">
... <p style="background-color:#f00">paragraph</p>
... </div>"""
>>> re.findall('(?m)<.*>',s)
['<a>link</a>', '<div id="something">', '<p style="background-color:#f00">paragraph</p>', '</div>']
I understand that it grabs everything between the first and last angle brackets, and that this ends up being the whole line. What would I use to return the following:
['<a>','</a>', '<div id="something">', '<p style="background-color:#f00">','</p>', '</div>']

re.findall('(?m)<.*?>',s)
-- or --
re.findall('(?m)<[^>]*>',s)
The question mark after the * causes it to be a non-greedy match, meaning that it only takes as much as it needs, as opposed to normal, where it takes as much as possible.
The second form is used more often. It uses a negated character class to match everything but >, since that character normally does not appear anywhere inside the tag except at the end.
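To see the three variants side by side, here is a minimal sketch that reuses the sample string from the question. The variable names greedy, lazy and negated are mine, and the (?m) flag is left out because none of the patterns use ^ or $.

```python
import re

s = """<a>link</a>
<div id="something">
<p style="background-color:#f00">paragraph</p>
</div>"""

# Greedy: on each line, runs from the first "<" to the last ">".
greedy = re.findall(r'<.*>', s)

# Lazy: stops at the first ">" that follows each "<".
lazy = re.findall(r'<.*?>', s)

# Negated class: can never run past a ">", so it finds the same tags.
negated = re.findall(r'<[^>]*>', s)

print(greedy)
print(lazy)
print(lazy == negated)
```

One difference worth keeping in mind: [^>] also matches newlines, while . does not unless DOTALL is set, so the two forms can diverge on tags that span several lines.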

Although you really shouldn't be parsing HTML with regex, I understand that this is a learning exercise.
You only need to add one more character:
>>> re.findall('(?m)<.*?>',s) # See the ? after .*
['<a>', '</a>', '<div id="something">', '<p style="background-color:#f00">', '</p>', '</div>']
*? matches 0 or more of the preceding value (in this case, .). This is a lazy match, and will match as few characters as possible.

re.findall('(?m)<[^<>]+>',s)

Related

Python regex: Difference between (.+) and (.+?)

I am new to regex and Python's urllib. I went through an online tutorial on web scraping and it had the following code. After studying up on regular expressions, it seemed to me that I could use (.+) instead of the (.+?) in my regex, but whoa was I wrong. I ended up printing way more html code than I wanted. I thought I was getting the hang of regex, but now I am confused. Please explain to me the difference between these two expressions and why it is grabbing so much html. Thanks!
ps. this is a starbucks stock quote scraper.
import urllib
import re
url = urllib.urlopen("http://finance.yahoo.com/q?s=SBUX")
htmltext = url.read()
regex = re.compile('<span id="yfs_l84_sbux">(.+?)</span>')
found = re.findall(regex, htmltext)
print found
.+ is greedy -- it matches until it can't match any more and gives back only as much as needed.
.+? is not -- it stops at the first opportunity.
Examples:
Assume you have this HTML:
<span id="yfs_l84_sbux">foo bar</span><span id="yfs_l84_sbux2">foo bar</span>
This regex matches the whole thing:
<span id="yfs_l84_sbux">(.+)<\/span>
It goes all the way to the end, then "gives back" one </span>, but the rest of the regex matches that last </span>, so the complete regex matches the entire HTML chunk.
But this regex stops at the first </span>:
<span id="yfs_l84_sbux">(.+?)<\/span>
? is a non-greedy modifier. * by default is a greedy repetition operator - it will gobble up everything it can; when modified by ? it becomes non-greedy and will eat up only as much as will satisfy it.
Thus for
<span id="yfs_l84_sbux">want</span>text<span id="somethingelse">dontwant</span>
.*?</span> will eat up want, then hit </span> - and this satisfies the regexp with minimal repetitions of ., resulting in <span id="yfs_l84_sbux">want</span> being the match. However, .* will try to see if it can eat more - it will go on and find the other </span>, with .* matching want</span>text<span id="somethingelse">dontwant, resulting in what you got - much more than you wanted.
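A quick way to check this is to run both patterns against the sample line above. This is only a sketch: the variable names are mine, and the HTML is the two-span example from this answer, not the real page.

```python
import re

html = ('<span id="yfs_l84_sbux">want</span>text'
        '<span id="somethingelse">dontwant</span>')

# Greedy: the group runs up to the LAST </span> in the string.
greedy = re.findall(r'<span id="yfs_l84_sbux">(.+)</span>', html)

# Lazy: the group stops at the FIRST </span>.
lazy = re.findall(r'<span id="yfs_l84_sbux">(.+?)</span>', html)

print(greedy)  # ['want</span>text<span id="somethingelse">dontwant']
print(lazy)    # ['want']
```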
(.+) is greedy. It takes what it can and gives back when needed.
(.+?) is ungreedy. It takes as few as possible.
See:
delegate
[delegate] /^(.+)e/
[de]legate /^(.+?)e/
Also, comparing the "Regex debugger log" here and here will show you what the ungreedy modifier does more effectively.
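The delegate example can also be reproduced in Python. A small sketch, with variable names of my own choosing; group(0) is the whole match and group(1) is what the parentheses captured.

```python
import re

word = 'delegate'

greedy = re.match(r'^(.+)e', word)   # .+ grabs everything, then backs off one "e"
lazy = re.match(r'^(.+?)e', word)    # .+? takes one character, then the first "e"

print(greedy.group(0), greedy.group(1))  # delegate delegat
print(lazy.group(0), lazy.group(1))      # de d
```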

Regex quantifiers

I'm new to regex and this is stumping me.
In the following example, I want to extract facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info. I've read up on lazy quantifiers and lookbehinds but I still can't piece together the right regex. I'd expect facebook.com\/.*?sk=info to work but it captures too much. Can you guys help?
<i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_96df30"></i></span><span class="fbProfileBylineLabel"><span itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress">7508 15th Avenue, Brooklyn, New York 11228</span></span></span><span class="fbProfileBylineFragment"><span class="fbProfileBylineIconContainer"><i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_9f18df"></i></span><span class="fbProfileBylineLabel"><span itemprop="telephone">(718) 837-9004</span></span></span></div></div></div><a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info" aria-label="About Dr. Morris Westfried - Dermatologist">
As much as I love regex, this is an html parsing task:
>>> from bs4 import BeautifulSoup
>>> html = .... # that whole text in the question
>>> soup = BeautifulSoup(html)
>>> pred = lambda tag: tag.attrs['href'].endswith('sk=info')
>>> [tag.attrs['href'] for tag in filter(pred, soup.find_all('a'))]
['https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info']
This works :)
facebook\.com\/[^>]*?sk=info
With only .* it finds the first facebook.com, and then continues until the sk=info. Since there's another facebook.com between, you overlap them.
The unique thing between that you don't want is a > (or <, among other characters), so changing anything to anything but a > finds the facebook.com closest to the sk=info, as you want.
And yes, using regex for HTML should only be used in basic tasks. Otherwise, use a parser.
Why your pattern doesn't work:
Your pattern doesn't work because the regex engine tries your pattern from left to right in the string.
When the regex engine meets the first facebook.com\/ in the string, and since you use .*? after it, the engine adds to the (possible) match all the following characters (including ", > and spaces) until it finds sk=info (since . can match any character except a newline).
This is why fejese suggests replacing the dot with [^"] and aliteralmind suggests replacing it with [^>]: either one makes the pattern fail at that first position in the string.
Using an HTML parser is the easiest way if you want to deal with HTML. However, for a one-off match or search/replace, note that while an HTML parser provides safety and simplicity, it has a cost in terms of performance, since you need to load the whole tree of your document for a single task.
The problem is that you have another facebook.com part earlier in the text. You can restrict the .* so that it cannot match ", which forces the match to stay within one attribute:
facebook\.com\/[^"]*sk=info
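Here is a rough sketch of the difference. The html string is a hypothetical, shortened stand-in for the page source (two links, only the second ending in sk=info), and the variable names are mine.

```python
import re

# Hypothetical sample: an earlier facebook.com link precedes the one we want.
html = ('<a href="https://www.facebook.com/help">Help</a>'
        '<a class="title" '
        'href="https://www.facebook.com/pages/Example/123?id=123&sk=info">')

# Lazy dot: starts at the FIRST facebook.com and crosses quotes and tags.
too_much = re.search(r'facebook\.com/.*?sk=info', html).group(0)

# [^"]* cannot leave the attribute, so the first link fails and the second matches.
bounded = re.search(r'facebook\.com/[^"]*sk=info', html).group(0)

print(too_much)
print(bounded)  # facebook.com/pages/Example/123?id=123&sk=info
```

Laziness only controls how far the match extends to the right. It does not move the starting point, which is why the lazy version still begins at the earlier link.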

Removing TAGS in a document

I need to find all the tags in .txt format (SEC filing) and remove from the filing.
Well, as a beginner of Python, I used the following code to find the tags, but it returns None, None, ... and I don't know how to remove all the tags. My question is how to find all the tags <....> and remove all the tags so that the document contains everything but tags.
import re
tags = [re.search(r'<.+>', line) for line in mylist]
#mylist is the filename opened by open(filename, 'rU').readlines()
Thanks for your time.
Use something like this:
re.sub(r'<[^>]+>', '', open(filename, 'r').read())
Your current code is getting a None for each line that does not include angle-bracketed tags.
You probably want to use [^>] to make sure it matches only up to the first >.
re.sub(r'<.*?>', '', line)
Use re.sub and <.*?> expression
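Putting the pieces together, a minimal sketch might look like this. The contents of mylist are made up to stand in for the lines of a filing; in the real script they would come from readlines().

```python
import re

# Hypothetical stand-in for the lines read from the filing.
mylist = ['<TYPE>10-K\n',
          'Plain text line\n',
          '<b>Total</b> assets: <i>100</i>\n']

# re.search returns None for any line without a tag, hence the Nones.
found = [re.search(r'<.+>', line) for line in mylist]

# re.sub replaces every tag with the empty string and keeps the rest.
cleaned = [re.sub(r'<[^>]+>', '', line) for line in mylist]

print(found[1])   # None
print(cleaned)
```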
Well, for starters, you're going to need a different regex. The one you have will select everything between the first '<' and the last '>'. So the string:
I can type in <b>BOLD</b>
would render the match:
<b>BOLD</b>
The way to fix this is to use a lazy quantifier; this site has a good explanation of why you should be using
<.+?>
to match HTML tags. And ultimately, you should be substituting, so:
re.sub(r'<.+?>', '', line)
Though, I suspect what you'd actually like to match is between the tags. Here's where a good lookahead can do wonders!
(?<=>).+?(?=<)
Looks crazy, but it breaks down pretty easy. Let's start with what you know:
.+?
matches a string of arbitrary length; the ? means it will match the shortest string possible (the laziness we added before).
(?<=...)
is a lookbehind. It checks the text just before the current position without including it in the match.
(?=...)
is a lookahead. It works the same way as a lookbehind, but checks the text that follows instead. Then with a little findall:
re.findall(r'(?<=>).+?(?=<)', line)
Now, you can iterate over the list and trim any unnecessary spaces that got left behind and make for some really nice output! Or, if you'd really like to use a substitution method (I know I would):
re.sub(r'\s*(?:<.+?>\s*)+', ' ', line)
the
\s*
will match any amount of whitespace attached to a tag, which you can then replace with one space, whittling down those unnerving double and triple spaces that often result from overly careful tagging. As a bonus, the
(?: ... )
is known as a non-capturing group (it won't give you smaller sub-matches in your result). It's not really necessary in this situation for your purposes, but groups are always useful things to think about, and it's good practice to only capture the ones you need. Tacking a + onto the end of that (as I did) will match as many tags as are right next to each other, collapsing them into a single space. So if the file has
This is <b> <i> overemphasized </b> </i>!
you'd get
This is overemphasized !
instead of
This is   overemphasized  !
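The whole answer can be tried out in a few lines. This is a sketch under one assumption: the tag pattern inside the non-capturing group is taken to be <.+?>, the same lazy tag pattern introduced earlier. The variable names are mine.

```python
import re

line = 'This is <b> <i> overemphasized </b> </i>!'

# Plain removal: the tags go, the spaces around them stay.
stripped = re.sub(r'<.+?>', '', line)

# Each run of tags plus surrounding whitespace becomes a single space.
collapsed = re.sub(r'\s*(?:<.+?>\s*)+', ' ', line)

# Lookbehind/lookahead: grab the text sitting between ">" and "<".
between = re.findall(r'(?<=>).+?(?=<)', 'I can type in <b>BOLD</b>')

print(repr(stripped))   # 'This is   overemphasized  !'
print(repr(collapsed))  # 'This is overemphasized !'
print(between)          # ['BOLD']
```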

Can I have a non-greedy regex with dotall?

I would like to match dotall and non-greedy. This is what I have:
img(.*?)(onmouseover)+?(.*?)a
However, this is not being non-greedy. This data is not matching as I expected:
<img src="icon_siteItem.gif" alt="siteItem" title="A version of this resource is available on siteItem" border="0"></a><br><br></td><td rowspan="4" width="20"></td></tr><tr><td>An activity in which students find other more specific adjectives to
describe a range of nouns, followed by writing a postcard to describe a
nice holiday without using the word 'nice'.</td></tr><tr><td>From the resource collection: Drafting </td></tr><tr><td><abbr style="border-bottom:0px" title="Key Stage 3">thing</abbr> | <abbr style="border-bottom:0px" title="Key Stage 4">hello</abbr> | <abbr style="border-bottom:0px" title="Resources">Skills</abbr></td></tr></tbody></table></div></div></td></tr><tr><td><div style="padding-left: 30px"><div><table style="" bgcolor="#DFE7EE" border="0" cellpadding="0" cellspacing="5" width="100%"><tbody><tr valign="top"><td rowspan="4" width="60"><img name="/attachments/3700.pdf" onmouseover="ChangeImageOnRollover(this,'/application/files/images/attach_icons/rollover_pdf.gif')" onmouseout="ChangeImageOnRollover(this,'/application/files/images/attach_icons/small_pdf.gif')" src="small_pdf.gif" alt="Download Recognising and avoiding ambiguity in PDF format" title="Download in PDF format" style="vertical-align: middle;" border="0"><br>790.0 k<br>
and I cannot understand why.
What I think I am stating in the above regex is:
start with "img", then allow 0 or more any character including new line, then look for at least 1 "onmouseover", then allow 0 or more any character including new line, then an "a"
Why doesn't this work as I expected?
KEY POINT: dotall must be enabled
It is being non-greedy.
It is your understanding of non-greedy that is not correct.
A regex will always try to match.
Let me show a simplified example of what non-greedy actually means(as suggested by a comment):
re.findall(r'a*?bc*?', 'aabcc', re.DOTALL)
This will match:
as few repetitions of 'a' as possible (in this case 2)
followed by a 'b'
and as few repetitions of 'c' as possible (in this case 0)
so the only match is 'aab'.
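Both points can be checked directly. In the sketch below, text is a hypothetical, much shorter stand-in for the page in the question (one img without onmouseover, then one with it); the pattern is the one from the question.

```python
import re

# The simplified example: the lazy a*? still has to take both a's to reach "b".
simple = re.findall(r'a*?bc*?', 'aabcc', re.DOTALL)
print(simple)  # ['aab']

# Hypothetical sample: the first img has no onmouseover, the second one does.
text = '<img src="x">\n<img onmouseover="f()">\n<a>'
m = re.search(r'img(.*?)(onmouseover)+?(.*?)a', text, re.DOTALL)

# The match starts at the FIRST img; group 1 stretches over the newline
# and the second "<img " until "onmouseover" can finally match.
print(m.start())          # 1
print(repr(m.group(1)))   # ' src="x">\n<img '
```

So the lazy group is as short as it can be for that starting point, but the engine never skips ahead to a later img just to get a shorter overall match.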
And just to conclude:
Don't use regex to parse HTML. There are libraries that were made for the job. re is not one of them.
First of all, your regex looks a little funky: you're saying match "img", then any number of characters, "onmouseover" at least once, but possibly repeated (e.g. "onmouseoveronmouseoveronmouseover"), followed by any number of characters, followed by "a".
This should match from img src="icon_ all the way to onmouseover="Cha. That's probably not what you want, but it's what you asked for.
Second, and this is significantly more important:
DON'T USE REGULAR EXPRESSIONS TO PARSE HTML.
And in case you didn't understand it the first time, let me repeat it in italics:
DON'T USE REGULAR EXPRESSIONS TO PARSE HTML.
Finally, let me link you to the canonical grimoire on the subject:
You can't parse [X]HTML with a regex

python re.search (regex) to search words who have pattern like {{world}} only

I have an HTML file in which I have inserted custom tags like {{name}}, {{surname}}. Now I want to find the tags that exactly match the pattern {{world}}, and not {world}}, {{world}, {world}, { word }, {{ world }}, etc.
I wrote this small piece of code for it:
re.findall(r'\{(\w.+?)\}', html_string)
It returns the words that follow the pattern {{world}}, but also those in {world} and {world}},
which I don't want. I want to match exactly {{world}}. Can anybody please guide me?
Um, shouldn't the regex be:
'\{\{(\w.+?)\}\}'
Ok, after the comments, I understand your requirements more:
'\{\{\w+?\}\}'
should work for you.
Basically, you want {{any number of word characters including underscore}}. You don't even need the lazy match in this case, so you may remove the ? from the expression.
Something like {{keyword1}} other stuff {{keyword2}} will not match as a whole now.
To get only the keyword without getting the {{}} use below:
'(?<=\{\{)\w+?(?=\}\})'
How about this?
re.findall(r'{{(\w+)}}', html_string)
Or, if you want the curly braces included in the results:
re.findall(r'({{\w+}})', html_string)
If you're trying to accomplish html templating, though, I recommend using a good template engine.
This will not allow any curly braces inside the captured result. Is that what you want?
'\{\{(\w[^\{\}]+?)\}\}'
http://rubular.com/r/79YwR13MS0
If you want to match doubled curly brackets, you should specify them in your regex:
re.findall(r'\{\{(\w[^}]*?)\}\}', html_string)
You say the other answers don't work, but they seem to for me:
>>> import re
>>> html_string = '{{realword}} {fake1}} {{fake2} {fake3} fake4'
>>> re.findall(r'\{\{(\w.+?)\}\}', html_string)
['realword']
If it doesn't work for you, you'll need to give more details.
Edit: How about the following? Getting rid of the dot (.) and using only \w also allows you to use greedy quantifiers and works for the example HTML from your comment:
>>> html_string = 'html>\n <head>\n </head>\n <title>\n </title>\n <body>\n <h1>\n T - Shirts\n </h1>\n <img src="March-Tshirts/skull_headphones_tshirt.jpg" />\n <img src="/March-Tshirts/star-wars-t-shirts-6.jpeg" />\n <h2>\n we - we - we\n </h2>\n {{unsubscribe}} -- {{tracking_beacon} -- {web_url}} -- {name} \n </body>\n</html>\n'
>>> re.findall(r'\{\{(\w+)\}\}', html_string)
['unsubscribe']
The \w matches alphanumeric characters and the underscore; if you need to match more characters you could add it to a set (e.g., [\w\+] to also match the plus sign).
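For comparison, here is a small sketch that runs the pattern from the question and the stricter ones from the answers against the same test string used above. The variable names are mine.

```python
import re

html_string = '{{realword}} {fake1}} {{fake2} {fake3} fake4'

# Pattern from the question: a single brace on each side is enough to match.
loose = re.findall(r'\{(\w.+?)\}', html_string)

# Both braces required on both sides, word characters only in between.
strict = re.findall(r'\{\{(\w+)\}\}', html_string)

# Same test, but the braces are part of the result.
with_braces = re.findall(r'\{\{\w+\}\}', html_string)

# Lookarounds: the braces are checked but left out of the match.
lookaround = re.findall(r'(?<=\{\{)\w+(?=\}\})', html_string)

print(loose)        # ['realword', 'fake1', 'fake2', 'fake3']
print(strict)       # ['realword']
print(with_braces)  # ['{{realword}}']
print(lookaround)   # ['realword']
```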
