BeautifulSoup parse XML with HTML content - python

I have an XML file (formally XBRL) in which some of the tags contain escaped HTML. I'd like to parse the document as XML and then extract the HTML from these tags.
However, it appears that the escaped characters are somehow deleted by BeautifulSoup, so when I access mytag.text the escaped characters (e.g. &lt;) are no longer present. For instance:
'<' in raw_text # True
'<' in str(BeautifulSoup(raw_text, 'xml')) # False
I have tried to create a simple example to reproduce the issue, but I haven't been able to do that, in the sense that the simple example I wanted to provide is working without any issue:
raw_text = '<xmltag><t><p>test</p></t></xmltag>'
soup = BeautifulSoup(raw_text, 'xml')
'<' in str(soup) # True
So you can find the file that I am parsing here: https://drive.google.com/open?id=1lQz1Tfy8u7TBvatP8-QjlnzUi6rNUR79
The code I am using is:
from bs4 import BeautifulSoup

with open('test.xml', 'r') as fp:
    raw_text = fp.read()

soup = BeautifulSoup(raw_text, 'xml')
mytag = soup.find('QuarterlyFinancialInformationTextBlock')
print(mytag.text[:100])
# prints: div div style="margin-left:0pt;margin-righ
# original file: <div> <div style=
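For the simple case, the intended workflow can be sketched on a small made-up document (the tag name and content below are invented, not from the real XBRL file): parse the outer document as XML, take the tag's text (in which the lxml-based parser has already resolved entities such as &lt;), and re-parse that string as HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical XBRL-like document: the <block> tag holds escaped HTML.
raw = '<root><block>&lt;p&gt;hello&lt;/p&gt;</block></root>'

outer = BeautifulSoup(raw, 'xml')      # parse the outer document as XML
inner_html = outer.find('block').text  # entities resolved to '<p>hello</p>'
inner = BeautifulSoup(inner_html, 'html.parser')  # re-parse the text as HTML

print(inner.p.text)
```

Whether this works on the full file depends on the escaping actually used there; the sketch only shows the two-pass parse itself.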

Solution using SimplifiedDoc
from simplified_scrapy.simplified_doc import SimplifiedDoc
doc = SimplifiedDoc('<xmltag><t><p>test</p></t></xmltag>')
print(doc.t.html)
print(doc.xmltag.t.html)
print(doc.t.unescape())
result:
<p>test</p>
<p>test</p>
<p>test</p>

Alternatively, try a parser designed for XBRL, e.g. python-xbrl.
Check this link: Xbrl parser written in Python

Related

How to build subtree with Python, BeautifulSoup?

I'm attempting to use BeautifulSoup to compose a webpage.
When I set a tag's inner content via a string, it automatically escapes the string. I have yet to find a technique, such as an html method or attribute, where BS won't auto-escape everything.
from bs4 import BeautifulSoup
f = open("template.html", "r")
soup = BeautifulSoup(f.read(), 'html.parser')
f.close()
x = soup.find("div", id="example")
x.string = "<div>example</div>"  # this string gets auto-escaped
# desired contents of x:
# <div id="example"><div>example</div></div>
It's apparent that BS is more often used for scraping HTML than building HTML – is there a common library for building out?
You should try Jinja. Then you can render templates like this:
from jinja2 import Template
t = Template('<div id="example">{{example_div}}</div>')
t.render(example_div='<div>example</div>')
Resulting in:
'<div id="example"><div>example</div></div>'
Of course, you can also read the template from a file:
with open('template.html', 'r') as f:
    t = Template(f.read())
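If you'd rather stay within BeautifulSoup, one sketch (the markup here is invented to mirror the question) is to parse the fragment into its own tree and append the resulting tag; appending a real Tag avoids the auto-escaping that happens when you assign a plain string:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="example"></div>', 'html.parser')
target = soup.find('div', id='example')

# Parse the fragment into real tags instead of assigning a raw string,
# which BeautifulSoup would escape.
fragment = BeautifulSoup('<div>example</div>', 'html.parser')
target.append(fragment.div)

print(str(soup))  # <div id="example"><div>example</div></div>
```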

Parsing HTML nested within XML file (using BeautifulSoup)

I am trying to parse some data in an XML file that contains HTML in its description field.
For example, the data looks like:
<xml>
<description>
<body>
HTML I want
</body>
</description>
<description>
<body>
- more data I want -
</body>
</description>
</xml>
So far, what I've come up with is this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(myfile, 'html.parser')
descContent = soup.find_all('description')
for i in descContent:
    bodies = i.find_all('body')
    # This will return an object of type 'ResultSet'
    for n in bodies:
        print(n)
        # Nothing prints here.
I'm not sure where I'm going wrong; when I enumerate the entries in descContent it shows the content I'm looking for; the tricky part is getting into the nested <body> entries. Thanks for looking!
EDIT: After further playing around, it seems that BeautifulSoup doesn't recognize that there is HTML in the <description> tag - it appears as just text, hence the problem. I'm thinking of saving the results as an HTML file and reparsing that, but I'm not sure that will work, as the saved output contains literal strings for all the carriage returns and newlines...
Use the xml parser from lxml.
You can install the lxml parser with
pip install lxml
with open("file.html") as fp:
    soup = BeautifulSoup(fp, 'xml')

for description in soup.find_all('description'):
    for body in description.find_all('body'):
        print(body.text.replace('-', '').replace('\n', '').lstrip(' '))
or you can simply use
print(body.text)
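As a self-contained check of the approach above, with the question's sample document inlined as a string (note this needs lxml installed for the 'xml' feature):

```python
from bs4 import BeautifulSoup

xml = """<xml>
<description><body>HTML I want</body></description>
<description><body>- more data I want -</body></description>
</xml>"""

# With the XML parser, <description> and <body> are real elements,
# not unknown HTML tags, so find_all() locates them as expected.
soup = BeautifulSoup(xml, 'xml')
texts = [body.text for body in soup.find_all('body')]
print(texts)  # ['HTML I want', '- more data I want -']
```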

Delete a certain tag with a certain id content from an HTML using python BeautifulSoup

I got a suggestion to use BeautifulSoup to delete a tag with a certain id from an HTML file, for example deleting <div id=needDelete>...</div>. Below is my code, but it doesn't seem to be working correctly:
import os, re
from bs4 import BeautifulSoup

cwd = os.getcwd()
print('Now you are at this directory: \n' + cwd)

# find files that have an extension with HTML
Files = os.listdir(cwd)
print(Files)

def func(file):
    for file in os.listdir(cwd):
        if file.endswith('.html'):
            print('HTML files are \n' + file)
            f = open(file, "r+")
            soup = BeautifulSoup(f, 'html.parser')
            matches = str(soup.find_all("div", id="jp-post-flair"))
            # The soup.find_all part should be correct, as I tested it to
            # print the matches and the result matches the texts I want to delete.
            f.write(f.read().replace(matches, ''))
            # maybe the above line isn't correct
            f.close()

func(file)
Would you help check which part has the wrong code and maybe how should I approach it?
Thank you very much!!
You can use the .decompose() method to remove the element/tag:
f = open(file, "r+")
soup = BeautifulSoup(f, 'html.parser')
elements = soup.find_all("div", id="jp-post-flair")
for element in elements:
    element.decompose()
f.seek(0)  # rewind before overwriting, since parsing consumed the file
f.write(str(soup))
f.truncate()
f.close()
It's also worth mentioning that you can probably just use the .find() method because an id attribute should be unique within a document (which means that there will likely only be one element in most cases):
f = open(file, "r+")
soup = BeautifulSoup(f, 'html.parser')
element = soup.find("div", id="jp-post-flair")
if element:
    element.decompose()
f.seek(0)
f.write(str(soup))
f.truncate()
f.close()
As an alternative, based on the comments below:
If you only want to parse and modify part of the document, BeautifulSoup has a SoupStrainer class that allows you to selectively parse parts of the document.
You mentioned that the indentation and formatting in the HTML file were being changed. Rather than converting the soup object directly into a string, you can check out the relevant output formatting section in the documentation.
Depending on the desired output, here are a few potential options:
soup.prettify(formatter="minimal")
soup.prettify(formatter="html")
soup.prettify(formatter=None)
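The SoupStrainer idea mentioned above can be sketched like this (the id value follows the question; the sample markup is invented). Note that a strainer selects what gets parsed into the tree in the first place, so it is most useful when you only need the matching part of the document:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<p>keep me</p><div id="jp-post-flair">sharing widget</div><p>me too</p>'

# Only the matching <div> is parsed into the tree at all.
only_flair = SoupStrainer('div', id='jp-post-flair')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_flair)

print(str(soup))  # <div id="jp-post-flair">sharing widget</div>
```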

Get all HTML tags with Beautiful Soup

I am trying to get a list of all HTML tags from Beautiful Soup.
I see find_all, but I have to know the name of the tag before I search.
If there is text like
html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""
How would I get a list like
list_of_tags = ["<div>", "<div>", "<div class='magical'>", "<p>"]
I know how to do this with regex, but am trying to learn BS4
You don't have to specify any arguments to find_all() - in this case, BeautifulSoup would find you every tag in the tree, recursively.
Sample:
from bs4 import BeautifulSoup
html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>
"""
soup = BeautifulSoup(html, "html.parser")
print([tag.name for tag in soup.find_all()])
# ['div', 'div', 'div', 'p']
print([str(tag) for tag in soup.find_all()])
# ['<div>something</div>', '<div>something else</div>', '<div class="magical">hi there</div>', '<p>ok</p>']
Please try the below:
for tag in soup.find_all(True):
    print(tag.name)
I thought I'd share my solution to a very similar question for those that find themselves here, later.
Example
I needed to find all tags quickly but only wanted unique values. I'll use the Python calendar module to demonstrate.
We'll generate an html calendar then parse it, finding all and only those unique tags present.
The structure below is very similar to the above, collecting the unique tag names in a set:
from bs4 import BeautifulSoup
import calendar
html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
set(tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all())
# Result
# {'table', 'td', 'th', 'tr'}
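If tag frequencies matter as well as uniqueness, the same pattern works with collections.Counter (the sample markup is borrowed from the question):

```python
from collections import Counter
from bs4 import BeautifulSoup

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

# Count how many times each tag name appears in the tree.
counts = Counter(tag.name for tag in BeautifulSoup(html, 'html.parser').find_all())
print(counts)  # Counter({'div': 3, 'p': 1})
```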
If you want to find some specific HTML tags, then try this:
html = driver.page_source  # e.g. obtained via Selenium
# driver.page_source: "<div>something</div>\n<div>something else</div>\n<div class='magical'>hi there</div>\n<p>ok</p>\n"
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all(['a', 'div']):  # mention HTML tag names here
    print(tag.text)
# Result:
# something
# something else
# hi there
Here is an efficient function that I use to parse different HTML and text documents:
from bs4 import BeautifulSoup
from tqdm import tqdm

def parse_docs(path, format, tags):
    """
    Parse the different files in path, having html or txt format, and extract the text content.
    Returns a list of strings, where every string is a text document content.
    :param path: str
    :param format: str
    :param tags: list
    :return: list
    """
    docs = []
    if format == "html":
        for document in tqdm(get_list_of_files(path)):
            soup = BeautifulSoup(open(document, encoding='utf-8').read(), 'html.parser')
            text = '\n'.join([''.join(s.find_all(text=True)) for s in
                              soup.find_all(tags)])  # parse all <p>, <div>, and <h> tags
            docs.append(text)
    else:
        for document in tqdm(get_list_of_files(path)):
            text = open(document, encoding='utf-8').read()
            docs.append(text)
    return docs
A simple call, parse_docs('/path/to/folder', 'html', ['p', 'h', 'div']), will return a list of text strings.

Beautiful Soup find_all() returns odd tags instead of results

I'm using Beautiful Soup to get some information out of an XML file that looks like this:
<name>Ted</name>
<link>example.com/rss</link>
<link>example2.com/rss</link>
That is the entirety of the XML file that I am trying to read in at the moment, for test purposes.
When I try to use find_all('link') it returns a list that consists of this:
[ <link/>, <link/> ]
I can't seem to find any mention of something like this in any documentation, anyone able to tell me what I'm doing wrong?
EDIT: Including the code for parsing:
for file in glob.glob("*.xml"):
    if file.endswith(".xml"):
        f = open(file, 'r')
        # Reads in all information about the bot from the file
        botFile = f.read()
        soup = BeautifulSoup(botFile)
        name = soup.find('name').get_text()
        links = soup.find_all('link')
        for link in links:
            print(link)
To parse XML with BeautifulSoup you need to use the XML parser; make sure you have lxml installed and tell BeautifulSoup to use XML:
soup = BeautifulSoup(document, 'xml')
otherwise the elements are parsed as HTML <link> tags, which are empty by definition.
Demo:
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <root>
... <name>Ted</name>
... <link>example.com/rss</link>
... <link>example2.com/rss</link>
... </root>
... '''
>>> soup = BeautifulSoup(sample)
>>> soup.find_all('link')
[<link/>, <link/>]
>>> soup = BeautifulSoup(sample, 'xml')
>>> soup.find_all('link')
[<link>example.com/rss</link>, <link>example2.com/rss</link>]
Note that without the second argument 'xml' the results are empty tag objects, but with 'xml' set the tag contents are there.
See Installing a parser and Parsing XML in the documentation.
The Beautiful Soup documentation mentions that it can't handle XML files properly with the default HTML parsers. Beautiful Soup 3 had a module called BeautifulStoneSoup that handles XML files. It is a basic module with nothing fancy about it, but if your file is simple XML it may very well do the job.
Here is the link to its documentation.
