Why does BeautifulSoup remove all the formatting from my HTML?

I have an HTML file which looks more or less like
<body>
    <div>
        <aside class="bg">
            Home
        </aside>
    </div>
</body>
But after I parse it with BeautifulSoup and then write it to file, all my formatting is gone. My code looks like:
from bs4 import BeautifulSoup

with open('contact.html', 'r') as f:
    soup = BeautifulSoup(f, "html.parser")

elem = soup.find("aside")
new_html = "Support"
new_soup = BeautifulSoup(new_html, "html.parser")
elem.insert(1, new_soup)

with open('contact.html', 'w') as f:
    f.write(str(soup))
The resulting html file looks like
<body>
<div>
<aside class="bg">
Support
Home
</aside>
</div>
</body>
I don't want to use prettify because I dislike the formatting of it. I just want to keep my formatting the same. Any way I can do that?

There is a discussion of this in the thread Maintaining the indentation of an XML file when parsed with Beautifulsoup. Hope this helps.
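As a minimal sketch of the idea from that thread (using the question's HTML inline, so it is self-contained): html.parser preserves the file's original whitespace, so inserting a NavigableString that carries its own newline leaves the surrounding lines untouched.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<body>
<div>
<aside class="bg">
Home
</aside>
</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")  # html.parser keeps the original whitespace
elem = soup.find("aside")
# Insert the new text with its own leading newline so existing lines stay intact
elem.insert(0, NavigableString("\nSupport"))
print(str(soup))
```

Writing str(soup) back to the file then reproduces every line you did not touch exactly as it was read.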

Related

Python 3.8 - BeautifulSoup 4 - unwrap() does not remove all tags

I've been googling through SO for quite some time, but I couldn't find a solution for this one. Sorry if it's a duplicate.
I'm trying to remove all the HTML tags from a snippet, but I don't want to use get_text() because there might be some other tags, like img, that I'd like to use later. BeautifulSoup doesn't quite behave as I expect it to:
from bs4 import BeautifulSoup
html = """
<div>
    <div class="somewhat">
        <div class="not quite">
        </div>
        <div class="here">
            <blockquote>
                <span>
                    <br />content<br />
                </span>
            </blockquote>
        </div>
        <div class="not here either">
        </div>
    </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class": "somewhat"}):  # in all the "somewhat" divs
    for y in x.find_all('div', {"class": "here"}):  # find all the "here" divs
        for inp in y.find_all("blockquote"):  # in a "here" div find all blockquote tags for the relevant content
            for newlines in inp('br'):
                inp.br.replace_with("\n")  # replace br tags
            for link in inp('a'):
                inp.a.unwrap()  # unwrap all a tags
            for quote in inp('span'):
                inp.span.unwrap()  # unwrap all span tags
            for block in inp('blockquote'):
                inp.blockquote.unwrap()  # <----- should unwrap blockquote
            la_lista.append(inp)
print(la_lista)
The result is as follows:
[<blockquote>
content
</blockquote>]
Any ideas?
Each object returned by y.find_all("blockquote") is a bs4.element.Tag, and inp is already the blockquote tag itself, so calling inp('blockquote') searches inside it and can never match the tag you are iterating over.
The solution for you is to remove:
for block in inp('blockquote'):
    inp.blockquote.unwrap()
and replace:
la_lista.append(inp)
with:
la_lista.append(inp.decode_contents())
The answer is based on the following answer BeautifulSoup innerhtml
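Putting both changes together, a minimal sketch of the corrected loop on a trimmed version of the question's HTML (iterating over the found tags directly, rather than through inp.br / inp.span, is also a safer variant of the original loops):

```python
from bs4 import BeautifulSoup

html = """
<div>
  <div class="somewhat">
    <div class="here">
      <blockquote><span><br />content<br /></span></blockquote>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
la_lista = []
for x in soup.find_all("div", {"class": "somewhat"}):
    for y in x.find_all("div", {"class": "here"}):
        for inp in y.find_all("blockquote"):
            for br in inp.find_all("br"):
                br.replace_with("\n")  # replace each <br> directly
            for span in inp.find_all("span"):
                span.unwrap()  # unwrap every span
            # inp *is* the blockquote, so take its inner HTML
            # instead of trying to unwrap it from within itself
            la_lista.append(inp.decode_contents())
print(la_lista)
```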

Delete block in HTML based on text

I have the HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I could do this with decompose() from BeautifulSoup using the class name sample, but I can't rely on decompose that way because the blocks have different attributes and tag names, while the text within follows the same pattern. Is there anything in bs4 that can solve this?
<div id="container">
    <div class="sample">
        Name:<br>
        <b>John</b>
    </div>
<div>
result:
<div id="container"><div>
To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
import re
from bs4 import BeautifulSoup

html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
sample = soup.find(text=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())
Side note: notice that I fixed the closing tag for your "container" div!

How can I find a comment with specified text string

I'm using robobrowser to parse some HTML content. It has BeautifulSoup inside. How can I find a comment with a specified string inside?
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
In fact I need to get TEXT_1 if I know ANY_ID
Thanks
Use the text argument and check that the node's type is Comment. Then load the comment's contents with BeautifulSoup again and find the desired element by id:
from bs4 import BeautifulSoup
from bs4 import Comment
data = """
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
"""
soup = BeautifulSoup(data, "html.parser")
comment = soup.find(text=lambda text: isinstance(text, Comment) and "ANY_ID" in text)
soup_comment = BeautifulSoup(comment, "html.parser")
text = soup_comment.find("div", id="ANY_ID").get_text()
print(text)
Prints TEXT_1.

Parsing Html with python using lxml

I have this html page:
<html>
<head></head>
<body>
Some Text
<a href="aLink">
Other Text
</a>
<a href="aLink2.html">
Another Text
</a>
</body>
</html>
I am interested in capturing the 3 texts in the file. With the code below I get the texts from the two links as output:
from lxml import html
from lxml import etree
import requests
page = requests.get('myUrl')
tree = html.fromstring(page.text)
aLink = tree.xpath('//a')
for link in aLink:
    print link.text  # it works
But I am not able to obtain the text from the body section, since the following code doesn't work:
body = tree.xpath('//body')
print body.text
What am I supposed to do? Thanks for the answers
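The likely issue is that tree.xpath('//body') returns a list of elements, not a single element, so you need to index into it; the element's .text attribute then holds the text before its first child, and itertext() yields every text node. A sketch against the question's HTML (inlined here, since the real page comes from myUrl):

```python
from lxml import html

page = """<html>
<head></head>
<body>
Some Text
<a href="aLink">
Other Text
</a>
<a href="aLink2.html">
Another Text
</a>
</body>
</html>"""

tree = html.fromstring(page)
body = tree.xpath('//body')[0]  # xpath() returns a list; take the element
print(body.text)                # text before the first child: "Some Text"
print([t.strip() for t in body.itertext() if t.strip()])  # all three texts
```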

python html parsing

I need to do some HTML parsing with Python. Say I have an HTML file like the one below:
<body>
<div class="mydiv">
<p>i want got it</p>
<div>
<p> good </p>
<a> boy </a>
</div>
</div>
</body>
How can I get the content of <div class="mydiv">? That is, I want to get:
<p>i want got it</p>
<div>
<p> good </p>
<a> boy </a>
</div>
I have tried HTMLParser, but I found it can't do this. Is there another way? Thanks!
With BeautifulSoup it is as simple as:
from BeautifulSoup import BeautifulSoup
html = """
<body>
<div class="mydiv">
<p>i want got it</p>
<div>
<p> good </p>
<a> boy </a>
</div>
</div>
</body>
"""
soup = BeautifulSoup(html)
result = soup.findAll('div', {'class': 'mydiv'})
tag = result[0]
print tag.contents
[u'\n', <p>i want got it</p>, u'\n', <div>
<p> good </p>
<a> boy </a>
</div>, u'\n']
Use lxml. Or BeautifulSoup.
I would prefer lxml.html.
import lxml.html as H

doc = H.fromstring(html)
node = doc.xpath("//div[@class='mydiv']")
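A self-contained sketch of that lxml approach, inlining the question's HTML; note the class predicate is [@class='mydiv'], and xpath() returns a list:

```python
import lxml.html as H
from lxml import etree

html = """
<body>
<div class="mydiv">
<p>i want got it</p>
<div>
<p> good </p>
<a> boy </a>
</div>
</div>
</body>
"""

doc = H.fromstring(html)
node = doc.xpath("//div[@class='mydiv']")[0]  # xpath() returns a list
# Serialize each child of the matched div to get its inner markup
for child in node:
    print(etree.tostring(child, encoding="unicode"))
```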
