Parse malformed attribute using BeautifulSoup - python

I'm trying to extract an attribute that contains an invalid unescaped quote:
<meta content="mal"formed">
When using BeautifulSoup like this:
soup.find('meta')['content']
As expected, the result is just mal.
Is there a way to make BeautifulSoup treat the unescaped quote as a part of the attribute, so the result will be mal"formed?

After some trial and error using regex, this is my best solution so far:
html = re.sub('(content="[^"=]+)"([^"=]+")', r'\1&quot;\2', html)
soup = BeautifulSoup(html)
soup.find('meta')['content']
Explanation: at first I tried to run the regex only on the desired element. However, when calling str(element), BeautifulSoup doesn't return the original HTML but a reformatted version, which has already dropped the formed part (the invalid half) of the attribute.
So my solution searches the entire HTML for this exact kind of malformed attribute and fixes it with the regex. Of course it's very specific to my case.
A better (and hopefully less hackish) solution would be much appreciated.

Here is what I've tried to fix that broken HTML:
different BeautifulSoup parsers - html.parser, html5lib, lxml
lxml.html with and without recover=True
from lxml.html import HTMLParser, fromstring, tostring
data = """<meta content="mal"formed">"""
parser = HTMLParser(recover=True)
print tostring(fromstring(data, parser=parser))
Prints:
<html><head><meta content="mal" formed></head></html>
firing up Firefox and Chrome via selenium and feeding them the broken meta tag:
from selenium import webdriver
data = """<meta content="mal"formed">"""
driver = webdriver.Chrome() # or webdriver.Firefox
driver.get("about:blank")
driver.execute_script("document.head.innerHTML = '{html}';".format(html=data))
data = driver.page_source
driver.close()
print data
Prints:
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta content="mal" formed"="" /></head><body></body></html>
Different tools interpreted the HTML differently, but no tool produced the desired output.
I guess, depending on how much you know about the data, pre-processing it with regular expressions might be a practical solution in this case.
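For reference, a minimal end-to-end sketch of that pre-processing approach, assuming the only breakage is a single unescaped quote inside a content="..." attribute; the &quot; entity survives parsing and BeautifulSoup unescapes it back to a literal quote:
import re
from bs4 import BeautifulSoup

html = '<meta content="mal"formed">'
# escape the stray inner quote as an entity so the attribute parses whole
fixed = re.sub(r'(content="[^"=]+)"([^"=]+")', r'\1&quot;\2', html)
soup = BeautifulSoup(fixed, 'html.parser')
print(soup.find('meta')['content'])  # mal"formed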

Related

Using BeautifulSoup to Extract CData

I'm trying to use BeautifulSoup from bs4/Python 3 to extract CData. However, whenever I search for it using the following, it returns an empty result. Can anyone point out what I'm doing wrong?
from bs4 import BeautifulSoup, CData
txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''
soup = BeautifulSoup(txt)
for cd in soup.findAll(text=True):
    if isinstance(cd, CData):
        print('CData contents: %r' % cd)
The problem appears to be that the default parser doesn't parse CDATA properly. If you specify the correct parser, the CDATA shows up:
soup = BeautifulSoup(txt,'html.parser')
For more information on parsers, see the docs.
I got onto this by using the diagnose function, which the docs recommend:
If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.
Using the diagnose() function gives you output of how the different parsers see your html, which enables you to choose the right parser for your use case.
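For example, a quick sketch of running it on the snippet above; diagnose() prints how every parser installed on your machine handles the markup:
from bs4.diagnose import diagnose

txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''
diagnose(txt)  # prints the parse tree produced by each installed parser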

How to copy all the text from url (like [Ctrl+A][Ctrl+C] with webbrowser) in python?

I know there is an easy way to copy the entire source of a URL, but that's not my task. I need to save exactly the text, just as a web browser user would copy it, to a *.txt file.
Is it unavoidable to parse the HTML source for this, or is there a better way?
I think it is impossible without parsing at all. I guess you could use HTMLParser (http://docs.python.org/2/library/htmlparser.html) and keep just the data between tags, but you will most likely get many more elements than you want.
Getting exactly what [Ctrl-C] gives would be very difficult without parsing, because of things like style="display: none;" which hide text; doing that properly means fully parsing the HTML, JavaScript, and CSS of both the document and its resource files.
Parsing is required; I don't know of a library method that does exactly this. A simple regex:
import re
text = re.sub(r"<[^>]+>", " ", html)
this requires many improvements, but it's a starting point.
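One possible improvement, as a sketch rather than a complete solution: drop <script> and <style> bodies before stripping tags, then collapse the leftover whitespace:
import re

def html_to_text(html):
    # remove script/style elements, whose bodies are code rather than prose
    html = re.sub(r'(?is)<(script|style)\b.*?</\1>', ' ', html)
    # strip the remaining tags
    text = re.sub(r'<[^>]+>', ' ', html)
    # collapse runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()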
With python, the BeautifulSoup module is great for parsing HTML, and well worth a look. To get the text from a webpage, it's just a case of:
#!/usr/bin/env python
#
import urllib2
from bs4 import BeautifulSoup
url = 'http://python.org'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
# you can refine this even further if needed... ie. soup.body.div.get_text()
text = soup.body.get_text()
print text
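For anyone on Python 3, where urllib2 became urllib.request, roughly the same sketch looks like this; the separator and strip arguments keep the output a bit closer to what a browser copy would give:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://python.org').read()
soup = BeautifulSoup(html, 'html.parser')
text = soup.body.get_text(separator='\n', strip=True)
print(text)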

How to prevent BeautifulSoup4 from adding extra <html><body> tags to the soup? [duplicate]

This question already has answers here:
Don't put html, head and body tags automatically, beautifulsoup
(9 answers)
Closed 9 years ago.
In BeautifulSoup version 3 I could take any chunk of HTML and get a string representation in this way:
from BeautifulSoup import BeautifulSoup
soup3 = BeautifulSoup('<div><b>soup 3</b></div>')
print unicode(soup3)
'<div><b>soup 3</b></div>'
However with BeautifulSoup4 the same operation creates additional tags:
from bs4 import BeautifulSoup
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
print unicode(soup4)
'<html><body><div><b>soup 4</b></div></body></html>'
^^^^^^^^^^^^ ^^^^^^^^^^^^^^
I don't need the outer <html><body>..</body></html> tags that BS4 is adding. I have looked through the BS4 docs and also searched inside the class, but could not find any setting for suppressing the extra tags in the output. How do I do it? Downgrading to v3 is not an option, since the SGML parser used in BS3 is not nearly as good as the lxml or html5lib parsers available with BS4.
If you want your code to work on everyone's machine, no matter which parser(s) they have installed, etc. (the same lxml version built on libxml2 2.9 vs. 2.8 acts very differently, the stdlib html.parser had some radical changes between 2.7.2 and 2.7.3, …), you pretty much need to handle all of the legitimate results.
If you know you have a fragment, something like this will give you exactly that fragment:
def fragment(markup):
    soup4 = BeautifulSoup(markup)
    if soup4.body:
        return soup4.body.next
    elif soup4.html:
        return soup4.html.next
    else:
        return soup4

fragment('<div><b>soup 4</b></div>')
Of course if you know your fragment is a single div, it's even easier, though it's not as easy to think of a use case where you'd know that:
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
soup4.div
If you want to know why this happens:
BeautifulSoup is intended for parsing HTML documents. An HTML fragment is not a valid document. It's pretty close to a document, but that's not good enough to guarantee that you'll get back exactly what you give it.
As Differences between parsers says:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will give different results.
So, while this exact difference isn't documented, it's just a special case of something that is.
As was noted in the old BeautifulStoneSoup documentation:
The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML doesn't have a fixed tag set, so those heuristics don't apply. So BeautifulSoup doesn't do XML very well.
Use the BeautifulStoneSoup class to parse XML documents. It's a general class with no special knowledge of any XML dialect and very simple rules about tag nesting...
And in the BeautifulSoup4 docs:
There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.
Perhaps that will yield what you want.
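For example (a quick sketch; the 'xml' parser requires lxml to be installed), parsing the fragment as XML adds no html/body wrapper, only an XML declaration:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><b>soup 4</b></div>', 'xml')
print(soup)
# <?xml version="1.0" encoding="utf-8"?>
# <div><b>soup 4</b></div>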

BeautifulSoup drops children when parent is in an implied namespace

I am building a screen scraper for Nordstrom's website using selenium and BeautifulSoup. The website does not actually put every tag in a namespace, but the Firefox webdriver creates one to avoid ambiguity (specifically, the site has an <html xmlns> tag that seems to confuse the driver).
Thus, everything is put in the namespace a0. However, Beautiful Soup only returns the parent element and (sometimes) one level of children when find() is called.
Take this html for example:
<div class='division'>
  <a href='#'>
    <img />
  </a>
</div>
Everything is in the implied a0 namespace, so we can get the image with:
soup.find('a0:div',{'class':'division'}).find('a0:img')
However, this returns None. I have looked through soup.prettify() and can say with certainty that the a0:img is within the a0:div. Is this an intended feature (in which case I need to find a new way of doing it) or a bug (in which case I need a workaround)?
EDIT:
To avoid confusion, this is an example demonstrating the entire workflow:
from selenium import webdriver
from BeautifulSoup import BeautifulSoup # Note that this is BeautifulSoup 3
b = webdriver.Firefox()
b.get("http://shop.nordstrom.com/c/womens-skirts")
borscht = BeautifulSoup(b.page_source)
theImageThatCannotBeFound = borscht.find('a0:div',{'class':'fashion-item'}).find('a0:img')
The above code sets theImageThatCannotBeFound to None, which I believe is incorrect. I hope this clarifies.
This worked for me.
import urllib
from BeautifulSoup import BeautifulSoup
url = 'http://shop.nordstrom.com/c/womens-skirts'
fp = urllib.urlopen(url)
soup = BeautifulSoup(fp)
print soup.find('div',{'class':'fashion-item'}).findAll('img') # also tried .find
Try excluding a0:. That seems to be your problem.
EDIT:
Using both Chrome and Firefox, in and out of Selenium, xmlns is set to an empty string when I view it, which is why the code above works for me. It seems that, through some version mismatch in a component somewhere, we're not getting the same results, and you're getting the a0: namespace.
Because I cannot reproduce the situation, the only solution I can find (albeit very hacky), is to manually replace the namespace:
source = browser.page_source.replace('a0:div','div')
soup = BeautifulSoup(source)
print soup.find('div',{'class':'fashion-item'}).find('img')
I'll admit it's not exactly an ideal solution. I'll keep looking and update my answer if I find a more elegant fix.
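One broader variant of the same workaround, as a sketch: the replace above only touches div tags, so a0:img would still carry the prefix; a regex can strip a0: from every opening and closing tag before souping. Shown here with bs4 for illustration (browser is assumed to be the selenium webdriver instance from above); the same regex works with BeautifulSoup 3:
import re
from bs4 import BeautifulSoup

# strip the a0: prefix from every opening and closing tag
source = re.sub(r'(</?)a0:', r'\1', browser.page_source)
soup = BeautifulSoup(source, 'html.parser')
print(soup.find('div', {'class': 'fashion-item'}).find('img'))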

Beautifulsoup, Python and HTML automatic page truncating?

I'm using Python and BeautifulSoup to parse HTML pages. Unfortunately, for some pages (> 400K) BeautifulSoup is truncating the HTML content.
I use the following code to get the set of "div"s:
from BeautifulSoup import BeautifulSoup, SoupStrainer

findSet = SoupStrainer('div')
set = BeautifulSoup(htmlSource, parseOnlyThese=findSet)
for it in set:
    print it
At a certain point, the output looks like:
correct string, correct string, incomplete/truncated string ("So, I")
although htmlSource contains the string "So, I am bored", and many others. I would also mention that when I prettify() the tree, I see the HTML source truncated.
Do you have an idea how can I fix this issue?
Thanks!
Try using lxml.html. It is a faster, better HTML parser, and it deals with broken HTML better than the latest BeautifulSoup. It works fine for your example page, parsing the entire page.
import lxml.html
doc = lxml.html.parse('http://voinici.ceata.org/~sana/test.html')
print len(doc.findall('//div'))
The code above returns 131 divs.
I found a solution to this problem using BeautifulSoup at beautifulsoup-where-are-you-putting-my-html, because I think it is easier than lxml.
The only thing you need to do is install:
pip install html5lib
and add it as a parameter to BeautifulSoup:
soup = BeautifulSoup(html, 'html5lib')
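A rough sketch mirroring the lxml div count from the earlier answer (the test URL comes from the original question and may no longer resolve):
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://voinici.ceata.org/~sana/test.html').read()
soup = BeautifulSoup(html, 'html5lib')
# html5lib parses the whole page, so the count should match lxml's 131
print(len(soup.find_all('div')))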
