I do the following:
from BeautifulSoup import *
html = u'<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(html)
soup.contents
As a result I get:
[<body><b>In Body</b><b>Second level</b></body>]
It looks strange to me, since this is not the original markup. Originally I have a tag <b> that contains some text (In Body) and then contains another tag <b>. However, BeautifulSoup "thinks" that I have a tag <b> and after it (after it is closed) another tag <b>. So, the tags are not perceived as nested into each other. Why is that?
ADDED
For those who complain about the validity of the HTML in my example, I made the following example:
xml = u'<aaa><bbb>In Body<bbb>Second level</bbb></bbb></aaa>'
soup = BeautifulSoup(xml)
soup.contents
which returns:
[<aaa><bbb>In Body</bbb><bbb>Second level</bbb></aaa>]
ADDED 2
If I use:
xml = u'<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(xml, ['lxml', 'xml'])
I get:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/sgmllib.py", line 138, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/sgmllib.py", line 296, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.7/sgmllib.py", line 338, in finish_starttag
    self.unknown_starttag(tag, attrs)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1344, in unknown_starttag
    and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
AttributeError: 'list' object has no attribute 'text'
Note that you're using the obsolete package, BeautifulSoup:
This package is OBSOLETE. It has been replaced by the beautifulsoup4
package. You should use Beautiful Soup 4 for all new projects.
BeautifulSoup 3 contained some XML parsing features (BeautifulStoneSoup) that really did not understand the same tag being nested within itself (as noted by 7stud in his answer); thus for all XML parsing needs it should be considered completely replaced by BeautifulSoup 4. Note that the two packages can coexist even within one application: BeautifulSoup.BeautifulSoup for BS3, and bs4.BeautifulSoup for BS4.
BeautifulSoup 4 parses using HTML rules by default; you need to tell it explicitly to use XML (this requires lxml to be installed). Thus, an example with BeautifulSoup 4 (PyPI beautifulsoup4):
>>> from bs4 import BeautifulSoup
>>> xml = u'<body><b>In Body<b>Second level</b></b></body>'
>>> soup = BeautifulSoup(xml, 'xml')
>>> soup.contents
[<body><b>In Body<b>Second level</b></b></body>]
>>> import bs4
>>> bs4.__version__
'4.1.3'
Note that the document must then be well-formed XML; there is no leniency.
If you do not use the 'xml' argument, you will get incorrectly parsed documents:
>>> bs4.BeautifulSoup('<p><p></p></p>')
<html><body><p></p><p></p></body></html>
and with
>>> bs4.BeautifulSoup('<p><p></p></p>', 'xml')
<?xml version="1.0" encoding="utf-8"?>
<p><p/></p>
So, the tags are not perceived as nested into each other. Why is that?
According to the comments in the BeautifulSoup source code:
Tag nesting rules:
Most tags can't be nested at all. For instance, the occurrence of a
<p> tag should implicitly close the previous <p> tag.
<p>Para1<p>Para2
should be transformed into:
<p>Para1</p><p>Para2
The source code then specifies several lists containing tag names that, according to the HTML standard, are allowed to nest within themselves -- and <b> isn't one of them.
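A quick sketch of that rule in action with the legacy BS3 package (assuming it is installed); <div> is one of the tags BS3 lists as self-nestable, while <b> is not:

from BeautifulSoup import BeautifulSoup

# <b> is not self-nestable: the second <b> implicitly closes the first
print BeautifulSoup(u'<b>In Body<b>Second level</b></b>')
# -> <b>In Body</b><b>Second level</b>

# <div> is in BS3's NESTABLE_BLOCK_TAGS, so its nesting is preserved
print BeautifulSoup(u'<div>In Body<div>Second level</div></div>')
# -> <div>In Body<div>Second level</div></div>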
If I use:
xml = u'<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(xml, ['lxml', 'xml'])
I get:
AttributeError: 'list' object has no attribute 'text'
You are getting that error because you can't pass a list of parsers to BeautifulSoup() in BS3: its second positional argument is parseOnlyThese, which is expected to be a SoupStrainer, so your list ends up there and the parser fails when it tries to access its text attribute (as the traceback shows).
In order to alert BeautifulSoup that you aren't parsing html, you need to use BeautifulStoneSoup(). Unfortunately, my tests show that BeautifulStoneSoup() produces the same xml, so it appears that BeautifulStoneSoup() applies a similar nesting rule to your <b> tag.
If you aren't locked into using BeautifulSoup 3, you should use lxml or BeautifulSoup 4. lxml is considered by many to be the superior package (e.g. it's faster, you can use xpaths), but it can be tough to install. So I suggest you try to install lxml, and if that works, then great. Otherwise, install BeautifulSoup 4.
I've been using BeautifulSoup for so many years, that I prefer it; but I also use lxml when I want to use xpaths to search a document.
lxml example:
from lxml import etree

xml = '<body><b>In Body<b>Second level</b></b></body>'
tree = etree.fromstring(xml)   # parse with real XML rules
print etree.tostring(tree)     # the nesting is preserved

matching_tags = tree.xpath('/body/b/b')  # select the inner <b> with an xpath
inner_b_tag = matching_tags[0]
print inner_b_tag.text
--output:--
<body><b>In Body<b>Second level</b></b></body>
Second level
bs4 example:
from bs4 import BeautifulSoup
xml = '<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(xml, 'xml') #In BeautifulSoup 4, you pass a second argument to BeautifulSoup() to indicate that you are parsing xml.
print(soup)
body = soup.find('body')
inner_b_tag = body.b.b
print(inner_b_tag.string)
--output:--
<?xml version="1.0" encoding="utf-8"?>
<body><b>In Body<b>Second level</b></b></body>
Second level
Related
I'm trying to scrape a website with BeautifulSoup and have written the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://gematsu.com/tag/media-create-sales")
soup = BeautifulSoup(page.text, 'html.parser')
try:
    content = soup.find('div', id='main')
    print(content)
except:
    print("Exception")
However, this returns a NoneType, even though the div exists with the correct ID on the website. Is there anything I'm doing wrong?
I'm seeing the div with the id main on the page, and I also find the div main when I print soup.
This is briefly covered in BeautifulSoup's documentation:
Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers
[ ... ]
Here’s the same document parsed with Python’s built-in HTML parser:
BeautifulSoup("<a></p>", "html.parser")
Like html5lib, this parser ignores the closing </p> tag. Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.
The issue you are experiencing is likely due to malformed HTML that html.parser is not able to handle appropriately. This resulted in id="main" being stripped when BeautifulSoup parsed the HTML. By changing the parser to either html5lib or lxml, BeautifulSoup handles the malformed HTML differently than html.parser does, as in the sketch below.
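A minimal sketch (assuming html5lib is installed; 'lxml' works the same way here):

import requests
from bs4 import BeautifulSoup

page = requests.get("https://gematsu.com/tag/media-create-sales")

# html5lib rebuilds the tree the way a browser would, so the malformed
# markup no longer swallows the div
soup = BeautifulSoup(page.text, "html5lib")
print(soup.find("div", id="main") is not None)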
I am trying to parse some data in an XML file that contains HTML in its description field.
For example, the data looks like:
<xml>
  <description>
    <body>
      HTML I want
    </body>
  </description>
  <description>
    <body>
      - more data I want -
    </body>
  </description>
</xml>
So far, what I've come up with is this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(myfile, 'html.parser')
descContent = soup.find_all('description')
for i in descContent:
    bodies = i.find_all('body')
    # This will return an object of type 'ResultSet'
    for n in bodies:
        print n
        # Nothing prints here.
I'm not sure where I'm going wrong; when I enumerate the entries in descContent it shows the content I'm looking for; the tricky part is getting in to the nested entries for <body>. Thanks for looking!
EDIT: After further playing around, it seems that BeautifulSoup doesn't recognize that there is HTML inside the <description> tag - it appears as just text, hence the problem. I'm thinking of saving the results as an HTML file and re-parsing that, but I'm not sure that will work, since the saved output contains the literal strings for all the carriage returns and newlines...
Use the xml parser, which requires lxml. You can install it with:
pip install lxml
with open("file.html") as fp:
soup = BeautifulSoup(fp, 'xml')
for description in soup.find_all('description'):
for body in description.find_all('body'):
print body.text.replace('-', '').replace('\n', '').lstrip(' ')
Or you can just use:
print body.text
I'm using Beautiful Soup to get some information out of an XML file that looks like this:
<name>Ted</name>
<link>example.com/rss</link>
<link>example2.com/rss</link>
That is the entirety of the XML file that I am trying to read in at the moment, for test purposes.
When I try to use find_all('link') it returns a list that consists of this:
[ <link/>, <link/> ]
I can't seem to find any mention of something like this in any documentation, anyone able to tell me what I'm doing wrong?
EDIT: Including the code for parsing:
import glob
from bs4 import BeautifulSoup

for file in glob.glob("*.xml"):
    if file.endswith(".xml"):
        f = open(file, 'r')
        # Reads in all information about the bot from the file
        botFile = f.read()
        soup = BeautifulSoup(botFile)
        name = soup.find('name').get_text()
        links = soup.find_all('link')
        for link in links:
            print link
To parse XML with BeautifulSoup you need to use the XML parser; make sure you have lxml installed and tell BeautifulSoup to use XML:
soup = BeautifulSoup(document, 'xml')
otherwise the elements are parsed as HTML <link> tags, which are empty by definition.
Demo:
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <root>
... <name>Ted</name>
... <link>example.com/rss</link>
... <link>example2.com/rss</link>
... </root>
... '''
>>> soup = BeautifulSoup(sample)
>>> soup.find_all('link')
[<link/>, <link/>]
>>> soup = BeautifulSoup(sample, 'xml')
>>> soup.find_all('link')
[<link>example.com/rss</link>, <link>example2.com/rss</link>]
Note that without the second argument 'xml' the results are empty tag objects, but with 'xml' set the tag contents are there.
See Installing a parser and Parsing XML in the documentation.
The Beautiful Soup 3 documentation mentions that BeautifulSoup itself can't handle XML files properly; there is a separate class called BeautifulStoneSoup that handles XML. It is a basic module and there is nothing fancy about it, but if your file is simple XML it may very well do the job; see the Beautiful Soup 3 documentation for details.
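A minimal sketch with that legacy BS3 class (assuming the old BeautifulSoup package is installed; the sample markup is made up):

from BeautifulSoup import BeautifulStoneSoup

sample = '<root><link>example.com/rss</link></root>'
soup = BeautifulStoneSoup(sample)

# BeautifulStoneSoup carries no HTML tag table, so <link> is not treated
# as self-closing and its text survives
print soup.find('link')
# -> <link>example.com/rss</link>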
I'm pretty new to interacting with xml, python, and scraping data so bear with me please:
I've got an XML file with my notes saved from Evernote. I have been able to load BeautifulSoup and lxml into my Python environment, and I have also been able to load the XML file and print it.
Here's my code up until the print:
from bs4 import BeautifulSoup
from xml.dom.minidom import parseString
file = open('myNotes.xml','r')
data = file.read()
dom = parseString(data)
print dom.toxml()  # note: toxml() belongs to the parsed dom, not the raw string
I didn't include the actual printed file as it contains lots of base 64 code.
What I am trying to accomplish is to extract select xml tags and print them to a new file... help!
This is how to use BeautifulSoup to print the XML:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myNotes.xml','r'))
print(soup.prettify())
And to write it to a file:
with open("file.txt", "w") as f:
f.write(soup.prettify())
Now, to extract all of a certain type of tag to a list:
# Extract all of the <a> tags:
tags = soup.find_all('a')
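And to write just those extracted tags to a new file, a minimal sketch (the <a> tag name is carried over from the example above; substitute whatever tag you actually need):

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('myNotes.xml', 'r'))

with open('extracted.txt', 'w') as out:
    for tag in soup.find_all('a'):
        # use str(tag) instead of get_text() to keep the markup itself;
        # encode for Python 2 file output in case of non-ASCII notes
        out.write(tag.get_text().encode('utf-8') + '\n')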
I wrote the line below:
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
The data is obtained by urllib.urlopen(XXX).read() in Python 2.7.
It works well when XXX is a page that consists entirely of English characters, such as http://python.org. But when it is used on a page with some Chinese characters, it fails:
there will be a KeyError, and [x for ...] returns an empty list.
What's more, if there is no parseOnlyThese=SoupStrainer('a'), both cases are OK.
Is this a bug in SoupStrainer?
from BeautifulSoup import BeautifulSoup, SoupStrainer
import urllib
data = urllib.urlopen('http://tudou.com').read()
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
gives the traceback:
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    [x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
  File "F:\ActivePython27\lib\site-packages\beautifulsoup-3.2.1-py2.7.egg\BeautifulSoup.py", line 613, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
There are <a> tags on that page that do not have an href attribute. Use the following instead:
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a')) if x.has_key('href')]
For example, it is perfectly normal to declare a link target with <a name="something" />; you are selecting those tags too, but they do not have an href attribute, and your code fails on that.
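For what it's worth, BeautifulSoup 4 lets you push that filter into the strainer itself. A sketch, assuming bs4 is available (the sample markup here is a stand-in for the page source):

from bs4 import BeautifulSoup, SoupStrainer

data = '<a href="/x">link</a> <a name="anchor" />'  # stand-in for the page

# href=True keeps only <a> tags that actually carry an href attribute
soup = BeautifulSoup(data, 'html.parser', parse_only=SoupStrainer('a', href=True))
print [a['href'] for a in soup.find_all('a')]
# -> ['/x']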