Using prettify() I can print HTML nicely formatted, and I have read that it prints even broken HTML properly (for example, if a tag is opened but never closed, prettify() fixes that). But is it only this function that does the repair, or does loading the data into a BeautifulSoup object, like:
soup = BeautifulSoup(data)
already mean that the soup holds repaired markup?
For example, if I have this broken code:
<body>
<p><b>Paragraph.</p>
</body>
and I load it into a BeautifulSoup object, is it stored inside the soup as written above, or as the fixed version:
<body>
<p><b>Paragraph.</b></p>
</body>
The HTML markup is corrected at the time of creating the soup, not at the time of pretty-printing it. This is needed so that BeautifulSoup can navigate the document correctly.
As you can see below, the string representation of the soup contains corrected markup:
>>> from bs4 import BeautifulSoup
>>> text="""<body>
... <p><b>Paragraph.</p>
... </body>
... """
>>> soup = BeautifulSoup(text, 'html.parser')
>>> str(soup)
'<body>\n<p><b>Paragraph.</b></p>\n</body>\n'
>>>
If you read the source for class BeautifulStoneSoup, you will find the following comment which addresses your broken markup:
This class contains the basic parser and search code. It defines
a parser that knows nothing about tag behavior except for the
following:
You can't close a tag without closing all the tags it encloses.
That is, "<foo><bar></foo>" actually means
"<foo><bar></bar></foo>".
And then further down the source, you can see that BeautifulSoup inherits from BeautifulStoneSoup.
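To illustrate why the repair has to happen at parse time, here is a minimal sketch that navigates the tree built from the broken snippet in the question. It assumes bs4 is installed and uses Python's built-in html.parser; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# The broken markup from the question: <b> is opened but never closed.
broken = "<body>\n<p><b>Paragraph.</p>\n</body>\n"
soup = BeautifulSoup(broken, "html.parser")

# The tree is already well formed, so navigation works before any printing.
bold = soup.find("b")
print(bold.parent.name)   # the <b> tag sits inside <p>
print(bold.get_text())

# Both plain serialization and prettify() emit the closing tag,
# because both only write out the tree that was built during parsing.
print("</b>" in str(soup))
print("</b>" in soup.prettify())
```

In this sketch prettify() only changes the whitespace of the output; the closing tag comes from the parse tree itself.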
I'm trying to scrape a website with BeautifulSoup and have written the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://gematsu.com/tag/media-create-sales")
soup = BeautifulSoup(page.text, 'html.parser')
try:
    content = soup.find('div', id='main')
    print(content)
except Exception:
    print("Exception")
However, this returns a NoneType, even though the div exists with the correct ID on the website. Is there anything I'm doing wrong?
I'm seeing the div with the id main on the page:
I also find the div main when I print soup:
This is briefly covered in BeautifulSoup's documentation
Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers
[ ... ]
Here’s the same document parsed with Python’s built-in HTML parser:
BeautifulSoup("<a></p>", "html.parser")
Like html5lib, this parser ignores the closing </p> tag. Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
The issue you are experiencing is likely due to malformed HTML that html.parser cannot handle appropriately. This resulted in the element with id="main" being stripped when BeautifulSoup parsed the HTML. Changing the parser to either html5lib or lxml should help, because those parsers handle malformed HTML differently than html.parser.
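One way to check whether the parser is the culprit is to run the same lookup with each parser that happens to be installed. The sketch below is only an illustration: the helper name find_with_any_parser and the sample markup are made up for this example, and lxml and html5lib are skipped if they are not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parsers to try, in order. Only html.parser ships with Python itself.
PARSERS = ("html.parser", "lxml", "html5lib")

def find_with_any_parser(markup, name, attrs, parsers=PARSERS):
    """Return (parser_name, tag) for the first parser whose tree contains the tag."""
    for parser in parsers:
        try:
            soup = BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; move on to the next one.
            continue
        tag = soup.find(name, attrs=attrs)
        if tag is not None:
            return parser, tag
    return None, None

# Hypothetical, well-formed sample; a real page would come from requests.
sample = '<div id="main"><p>Sales</p></div>'
parser_used, main_div = find_with_any_parser(sample, "div", {"id": "main"})
print(parser_used, main_div)
```

If the first parser that finds the element is not html.parser, that suggests the page contains markup that html.parser builds into a different tree.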
I use lxml.html to parse various HTML pages. I have now noticed that, at least for some pages, it doesn't find the body tag even though it is present and Beautiful Soup finds it (even though it uses lxml as its parser).
Example page: https://plus.google.com/ (what remains of it)
import lxml.html
import bs4
html_string = """
... source code of https://plus.google.com/ (manually copied) ...
"""
# lxml fails (body is None)
body = lxml.html.fromstring(html_string).find('body')
# Beautiful soup using lxml parser succeeds
body = bs4.BeautifulSoup(html_string, 'lxml').find('body')
Any guess about what is happening here is welcome :)
Update:
The problem seems to be related to the encoding.
# working version
body = lxml.html.document_fromstring(html_string.encode('unicode-escape')).find('body')
You can use something like this:
import requests
import lxml.html
html_string = requests.get("https://plus.google.com/").content
body = lxml.html.document_fromstring(html_string).find('body')
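The body variable now contains the <body> HTML element. Note that .content returns the raw bytes of the response, so lxml receives undecoded data here.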
I am trying to parse some data in an XML file that contains HTML in its description field.
For example, the data looks like:
<xml>
<description>
<body>
HTML I want
</body>
</description>
<description>
<body>
- more data I want -
</body>
</description>
</xml>
So far, what I've come up with is this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(myfile, 'html.parser')
descContent = soup.find_all('description')
for i in descContent:
    bodies = i.find_all('body')
    # This will return an object of type 'ResultSet'
    for n in bodies:
        print(n)
        # Nothing prints here.
I'm not sure where I'm going wrong; when I enumerate the entries in descContent it shows the content I'm looking for; the tricky part is getting into the nested entries for <body>. Thanks for looking!
EDIT: After further playing around, it seems that BeautifulSoup doesn't recognize that there is HTML in the <description> tag - it appears as just text, hence the problem. I'm thinking of saving the results as an HTML file and reparsing that, but not sure if that will work, as saving contains the literal strings for all the carriage returns and new lines...
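Use the XML parser provided by lxml. You can install lxml with:
pip install lxml
Then parse the file with the 'xml' parser:
from bs4 import BeautifulSoup
with open("file.html") as fp:
    soup = BeautifulSoup(fp, 'xml')
for description in soup.find_all('description'):
    for body in description.find_all('body'):
        # strip the dashes, newlines and leading spaces around the text
        print(body.text.replace('-', '').replace('\n', '').lstrip(' '))
Or you can just print the text as-is:
print(body.text)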
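If, as the EDIT in the question suggests, the markup inside <description> reaches the parser as plain text, a second parsing pass over that text may be needed. The following is a minimal sketch under the assumption that the inner HTML is entity-escaped; the sample feed is hypothetical, and html.parser is used so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the HTML inside <description> is entity-escaped,
# so the first parse sees it as text rather than as nested tags.
feed = """<xml>
<description>&lt;body&gt;HTML I want&lt;/body&gt;</description>
<description>&lt;body&gt;more data I want&lt;/body&gt;</description>
</xml>"""

outer = BeautifulSoup(feed, "html.parser")

body_texts = []
for description in outer.find_all("description"):
    # get_text() yields the unescaped string, which is markup again,
    # so it can be handed to a second BeautifulSoup object.
    inner = BeautifulSoup(description.get_text(), "html.parser")
    for body in inner.find_all("body"):
        body_texts.append(body.get_text(strip=True))

print(body_texts)
```

This avoids saving the intermediate result to a file; the inner markup is reparsed directly from the string.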
I am trying to parse an HTML document using BeautifulSoup with Python.
But it stops parsing at special characters, like here:
from bs4 import BeautifulSoup
doc = '''
<html>
<body>
<div>And I said «What the %&##???»</div>
<div>some other text</div>
</body>
</html>'''
soup = BeautifulSoup(doc, 'html.parser')
print(soup)
This code should output the whole document. Instead, it prints only
<html>
<body>
<div>And I said «What the %</div></body></html>
The rest of the document is apparently lost. It was stopped by the combination '&#'.
The question is how to either set up BS or preprocess the document to avoid such problems while losing as little text (which may be informative) as possible.
I use bs4 of version 4.6.0 with Python 3.6.1 on Windows 10.
Update. The method soup.prettify() does not work, because the soup is already broken.
You need to use "html5lib" as the parser instead of "html.parser" in your BeautifulSoup object. For example:
from bs4 import BeautifulSoup
doc = '''
<html>
<body>
<div>And I said «What the %&##???»</div>
<div>some other text</div>
</body>
</html>'''
soup = BeautifulSoup(doc, 'html5lib')
# different parser ^
Now, if you print soup, it will display your desired string:
>>> print(soup)
<html><head></head><body>
<div>And I said «What the %&##???»</div>
<div>some other text</div>
</body></html>
From the Difference Between Parsers document:
Unlike html5lib, html.parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
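The question also asks about preprocessing. If installing html5lib is not an option, one possible workaround is to escape stray ampersands before parsing. This is a heuristic sketch, not something taken from the Beautiful Soup documentation: the regular expression and the helper name escape_stray_ampersands are assumptions made for this example.

```python
import re
from bs4 import BeautifulSoup

# Matches "&" that does not start a complete entity such as &amp; or &#171;
STRAY_AMPERSAND = re.compile(
    r"&(?!(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)"
)

def escape_stray_ampersands(markup):
    """Replace ampersands that are not part of an entity with &amp;."""
    return STRAY_AMPERSAND.sub("&amp;", markup)

doc = """
<html>
<body>
<div>And I said «What the %&##???»</div>
<div>some other text</div>
</body>
</html>"""

soup = BeautifulSoup(escape_stray_ampersands(doc), "html.parser")
div_texts = [div.get_text() for div in soup.find_all("div")]
print(div_texts)
```

Because the parser turns &amp; back into a plain ampersand, the text of the first div should come out unchanged, and the second div is no longer lost.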