Preserve the formatting in beautifulsoup

import bs4
foo = """<!DOCTYPE html>
<html>
<body>
<h1>This is heading1</h1>
<p>
This is a paragraph1
</p>
<h2>
This is heading2
</h2>
</body>
</html>"""
def remove_p(text):
    obj = bs4.BeautifulSoup(text, features="html.parser")
    for tag in obj.find_all("p"):
        tag.decompose()
    return str(obj)
foo = remove_p(foo)
print(foo)
beautifulsoup4 4.11.0
bs4 0.0.1
bs4 inserts blank lines corresponding to <p>. I expected entries corresponding to <p> tag to be deleted - no blank lines.
bs4 removes the leading spaces for opening tags. However, it doesn't remove leading spaces for closing tags </h2> and text.
I would like the function to return text with <p> entries removed without modifying the formatting. Please suggest.
Actual output
<!DOCTYPE html>
<html>
<body>
<h1>This is heading1</h1>

<h2>
This is heading2
</h2>
</body>
</html>
Expected Output
<!DOCTYPE html>
<html>
<body>
<h1>This is heading1</h1>
<h2>
This is heading2
</h2>
</body>
</html>
EDIT:
Thanks for all the suggestions to use prettify(). I have already tried using prettify() but it completely changes the formatting of the document. Excuse me for not mentioning it to start with.
To add some context, we receive these documents from our upstream, and we are supposed to just delete some nodes without changing the formatting.

This is not exactly what you want, but there is a way to prettify the code: use obj.prettify() instead of str(obj)

You can use the prettify() method that is built into BeautifulSoup.
Here is an example from the BeautifulSoup documentation:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Related

Scraping string from HTML with python3-beautifulsoup3

I'm trying to get strings from table rows using BeautifulSoup.
The strings I want are 'SANDAL' and 'SHORTS', from the second and third rows.
I know this can be solved with regular expressions or with string functions, but I want to learn BeautifulSoup and do as much as possible with it.
Clipped python code
soup=BeautifulSoup(page,'html.parser')
table=soup.find('table')
row=table.find_next('tr')
row=row.find_next('tr')
HTML
<html>
<body>
<div id="body">
<div class="data">
<table id="products">
<tr><td>PRODUCT<td class="ole1">ID<td class="c1">TYPE<td class="ole1">WHEN<td class="ole4">ID<td class="ole4">ID</td></tr>
<tr><td>SANDAL<td class="ole1">77313<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878717</td></tr>
<tr><td>SHORTS<td class="ole1">77314<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878718</td></tr>
</table>
</div>
</div>
</body>
</html>

To get the text from the first column of the table (without the header row), you can use this script:
from bs4 import BeautifulSoup
txt = '''
<html>
<body>
<div id="body">
<div class="data">
<table id="products">
<tr><td>PRODUCT<td class="ole1">ID<td class="c1">TYPE<td class="ole1">WHEN<td class="ole4">ID<td class="ole4">ID</td></tr>
<tr><td>SANDAL<td class="ole1">77313<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878717</td></tr>
<tr><td>SHORTS<td class="ole1">77314<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878718</td></tr>
</table>
</div>
</div>
</body>
</html>'''
soup = BeautifulSoup(txt, 'lxml') # <-- lxml is important here (to parse the HTML code correctly)
for tr in soup.find('table', id='products').find_all('tr')[1:]: # <-- [1:] because we want to skip the header
    print(tr.td.text) # <-- print contents of first <td> tag
Prints:
SANDAL
SHORTS
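To see why the parser matters here: as far as I can tell, html.parser does not close the unclosed <td> tags implicitly, so each <td> ends up nested inside the previous one and .text on the first cell returns the whole row. If lxml is not available, a rough workaround is to take only the first text node of the first cell. This is a sketch on a shortened copy of the table (two columns instead of six), and the variable names are mine:

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (fewer columns).
txt = """<table id="products">
<tr><td>PRODUCT<td class="ole1">ID</td></tr>
<tr><td>SANDAL<td class="ole1">77313</td></tr>
<tr><td>SHORTS<td class="ole1">77314</td></tr>
</table>"""

soup = BeautifulSoup(txt, "html.parser")
rows = soup.find("table", id="products").find_all("tr")[1:]  # skip the header row

# With html.parser the second <td> is nested inside the first one,
# so .get_text() on the first cell also contains the second cell's text.
whole_cells = [row.td.get_text() for row in rows]

# find(string=True) returns only the first text node inside the cell.
first_cells = [row.td.find(string=True).strip() for row in rows]

print(whole_cells)
print(first_cells)
```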

Python - %paste

I am following a tutorial where the teacher pastes HTML inline into the Scrapy shell via %paste (the HTML is below):
html_doc = " "
<html>
<head>
<title>Title of hte page </title>
</head>
<body>
<h1>H1 Tag</h1>
<h2> H2 Tag with link</h2>
<p> First Paragraph </p>
<p>Second Paragraph </p>
</body>
</html>
" "
but I get this error:
<html>
File "<console>", line 1
<html>
SyntaxError: invalid syntax
I have imported tkinter and looked at other resources, but I can't figure out how to get the HTML inline.

Use triple quotes (""") so that the string can span several lines:
html_doc = """
<html>
<head>
<title>Title of hte page </title>
</head>
<body>
<h1>H1 Tag</h1>
<h2> H2 Tag with link</h2>
<p> First Paragraph </p>
<p>Second Paragraph </p>
</body>
</html>
"""

beautifulsoup, find text using re.compile

Why is this not finding anything? I'm looking to extract the id out of this html.
from bs4 import BeautifulSoup
import re
a="""
<html lang="en-US">
<head>
<title>
Coverage
</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="2017-07-12T08:12:00.0000000" name="created"/>
</head>
<body data-absolute-enabled="true" style="font-family:Calibri;font-size:11pt">
<div id="div:{1586118a-0184-027e-07fc-99debbfc309f}{35}" style="position:absolute;left:1030px;top:449px;width:624px">
<p id="p:{dd73b86c-408c-4068-a1e7-769ad024cf2e}{40}" style="margin-top:5.5pt;margin-bottom:5.5pt">
{FB} 2 Facebook 465.8 /
<span style="color:green">
12
</span>
<span style="color:green">
5
</span>
<span style="color:green">
10
</span>
<span style="color:red">
-3
</span>
/ updated
</p>
</div>
</body>
</html>
"""
soup=BeautifulSoup(a,'html.parser')
ticker='{FB}'
target= soup.find('p', text = re.compile(ticker))
There is more than one p; I just omitted the rest. I need the text= part.
I've also tried wildcards (.*) but still can't get it to work.
I must get the id by ticker. I don't know anything else, and the rest of the page is dynamic.

This would get the "id" value for <p> tags which contains the text "{FB}":
ticker='{FB}'
target= soup.find_all('p')
for items in target:
    check=items.text
    if '{FB}' in check:
        print (items.get("id"))
More compact way:
for elem in soup(text=re.compile(ticker)):
    print (elem.parent.get("id"))
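As for why the original find('p', text=...) returns nothing: when a tag name and a text filter are combined, the filter is compared against the tag's .string, and that is None for a <p> with several children (text plus <span> tags). A function filter can test the full text of each <p> instead. The sketch below uses a shortened version of the markup; the second <p> and its id are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened markup; the second <p> and its id are hypothetical.
html = """<div>
<p id="p:{dd73b86c-408c-4068-a1e7-769ad024cf2e}{40}">
{FB} 2 Facebook 465.8 /
<span style="color:green">12</span>
/ updated
</p>
<p id="p:{hypothetical-second-id}{41}">{XX} 3 Other 1.0</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
ticker = "{FB}"

# string= (the newer name for text=) is compared against tag.string,
# which is None for the first <p> because it has several children.
direct = soup.find("p", string=re.compile(re.escape(ticker)))

# A function filter can look at the complete text of each <p>.
target = soup.find(lambda tag: tag.name == "p" and ticker in tag.get_text())

print(direct)
print(target["id"])
```

re.escape() is not strictly required for this ticker, but it keeps other tickers that contain regex metacharacters from being misread as patterns.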

Wrap found tag inside new tag in bs4

How can I wrap a tag in a new tag in bs4?
For example, I have HTML like this:
<html>
<body>
<p>Demo</p>
<p>world</p>
</body>
</html>
I want to convert it to this:
<html>
<body>
<b><p>Demo</p></b>
<b> <p>world</p> </b>
</body>
</html>
Here is what I have so far:
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p>Demo</p>
<p>world</p>
</body>
</html>"""
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('p'):
    pass  # wrap tag with '<b>'

See wrap() in the BeautifulSoup documentation:
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p>Demo</p>
<p>world</p>
</body>
</html>"""
soup = BeautifulSoup(html, 'html.parser')
for p in soup('p'): # shortcut for soup.find_all('p')
    p.wrap(soup.new_tag("b"))
print(soup)
Output:
<html>
<body>
<b><p>Demo</p></b>
<b><p>world</p></b>
</body>
</html>

Adding elements to BeautifulSoup's find_all list as a string

I am testing a web scraping concept with BeautifulSoup's find_all() function. I'm trying to get the contents of the p tags that have class='first' inside the div with class='dinner'.
from bs4 import BeautifulSoup
import urllib2
html_doc="""
<html>
<head>
<title>The practice html document</title>
</head>
<body>
<div class='dinner'>
<p class='first'>I like pizza</p>
<p class='second'>I really like pizza</p>
<p class='first'>pizza is good</p>
</div>
<div class='breakfast'>
<p class='first'>pancake</p>
</div>
<div class='lunch'>
<p> This is a paragraph</p>
</div>
</body>
</html>
"""
soup=BeautifulSoup(html_doc)
div_stuff=soup.find("div", attrs={'class':'dinner'})
print div_stuff
print '\n'
#This prints the paragraphs only in the div with the class dinner
div_paragraphs=unicode(div_stuff.find_all('p', attrs={'class':'first'}))
print div_paragraphs
The find_all() function returns each paragraph it finds as an element in a list. This is the output of the code:
<div class="dinner">
<p class="first">I like pizza</p>
<p class="second">I really like pizza</p>
<p class="first">pizza is good</p>
</div>
[<p class="first">I like pizza</p>, <p class="first">pizza is good</p>]
The goal is to get just the content of the paragraphs as strings in the list. Like this:
[I like pizza,pizza is good]
I could write some code that goes through each element and replaces it after all instances have been found, but I wanted to see if there is a way to turn them into strings before find_all() stores each one in the list.

.find_all() returns the matching elements; you searched for elements, not for the text they contain (which would be a very different search).
You can easily extract the text in a list comprehension:
[elem.get_text() for elem in soup.select('div.dinner p.first')]
I used a CSS selector here to match the p tags in context of their div parents.
Demo:
>>> from bs4 import BeautifulSoup
>>> html_doc="""
... <html>
... <head>
... <title>The practice html document</title>
... </head>
... <body>
... <div class='dinner'>
... <p class='first'>I like pizza</p>
... <p class='second'>I really like pizza</p>
... <p class='first'>pizza is good</p>
... </div>
... <div class='breakfast'>
... <p class='first'>pancake</p>
... </div>
... <div class='lunch'>
... <p> This is a paragraph</p>
... </div>
... </body>
... </html>
... """
>>> soup = BeautifulSoup(html_doc)
>>> [elem.get_text() for elem in soup.select('div.dinner p.first')]
[u'I like pizza', u'pizza is good']
