How to parse HTML to a string template in Python?

How to parse HTML to a string template in Python? - python

I want to parse HTML and turn them into string templates. In the example below, I seeked out elements marked with x-inner and they became template placeholders in the final string. Also x-attrsite also became a template placeholder (with a different command of course).
Input:
<div class="x,y,z" x-attrsite>
<div x-inner></div>
<div>
<div x-inner></div>
</div>
</div>
Desired output:
<div class="x,y,z" {attrsite}>{inner}<div>{inner}</div></div>
I know there is HTMLParser and BeautifulSoup, but I am at a loss on how to extract the strings before and after the x-* markers and to escape those strings for templating.
Existing curly braces are handled sanely, like this sample:
<div x-maybe-highlighted> The template string "there are {n} message{suffix}" can be used.</div>

BeautifulSoup can handle the case:
find all div elements with x-attrsite attribute, remove the attribute and add {attrsite} attribute with a value None (produces an attribute with no value)
find all div elements with x-inner attribute and use replace_with() to replace the element with a text {inner}
Implementation:
from bs4 import BeautifulSoup
data = """
<div class="x,y,z" x-attrsite>
<div x-inner></div>
<div>
<div x-inner></div>
</div>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
for div in soup.find_all('div', {'x-attrsite': True}):
del div['x-attrsite']
div['{attrsite}'] = None
for div in soup.find_all('div', {'x-inner': True}):
div.replace_with('{inner}')
print(soup.prettify())
Prints:
<div class="x,y,z" {attrsite}>
{inner}
<div>
{inner}
</div>
</div>

Related

beautifulsoup filtering descendents using regex

I'm trying to use beautiful-soup to return elements of the DOM that contain children that match filtering criteria.
In the example below,I want to return both divs based on finding a regex match in a child element.
<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="randomclass">regexmatch2</span>
<h2>title</h2>
</div>
</body>
The basic code setup is as follows:
from bs4 import BeautifulSoup as soup
page = soup(html)
Results = page.find_all('div')
How do I add a regex test that evaluates the children of the target div? I.e, how would I add the regex call below to the 'find' or 'find_all' functions of beautiful-soup?
re.compile('regexmatch\d')

The approach I landed with was find_parent, which will return the parent element of the beautifulsoup results regardless of the method used to find the original result (regex or otherwise). For the example above:
childOfResults = page.find_all('span', string=re.compile('regexmatch\d'))
Results = childOfResult[0].find_parent()
...modified with the loop of your choice to cycle through all the members of childOfResult

Get the first div then run for loop on all div's
Example
from bs4 import BeautifulSoup
html = """<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="randomclass">regexmatch2</span>
<h2>title</h2>
</div>
</body>"""
page_soup = BeautifulSoup(html, features='html.parser')
elements = page_soup.select('body > div')
for element in elements:
print(element.select("span:nth-child(1)")[0].text)
it prints out
regexmatch1
regexmatch2

Python 3.8 - BeautifulSoup 4 - unwrap() does not remove all tags

I've been googling through SO for quite some time, but I couldn't find a solution for this one. Sorry if it's a duplicate.
I'm trying to remove all the HTML tags from a snippet, but I don't want to use get_text() because there might be some other tags, like img, that I'd like to use later. BeautifulSoup doesn't quite behave as I expect it to:
from bs4 import BeautifulSoup
html = """
<div>
<div class="somewhat">
<div class="not quite">
</div>
<div class="here">
<blockquote>
<span>
<br />content<br />
</span>
</blockquote>
</div>
<div class="not here either">
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class":"somewhat"}): # in all the "somewhat" divs
for y in x.find_all('div', {"class":"here"}): # find all the "here" divs
for inp in y.find_all("blockquote"): # in a "here" div find all blockquote tags for the relevant content
for newlines in inp('br'):
inp.br.replace_with("\n") # replace br tags
for link in inp('a'):
inp.a.unwrap() # unwrap all a tags
for quote in inp('span'):
inp.span.unwrap() # unwrap all span tags
for block in inp('blockquote'):
inp.blockquote.unwrap() # <----- should unwrap blockquote
la_lista.append(inp)
print(la_lista)
The result is as follows:
[<blockquote>
content
</blockquote>]
Any ideas?

The type that return from y.find_all("blockquote") is a bs4.element.Tag upon him you can't call the tag himself with inp('blockquote').
The solution for you is to remove:
for block in inp('blockquote'):
inp.blockquote.unwrap()
and replace:
la_lista.append(inp)
with:
la_lista.append(inp.decode_contents())
The answer is based on the following answer BeautifulSoup innerhtml

Delete block in HTML based on text

I have an HTML snippet below and I need to delete a block based on its text for example Name: John. I know I can do this with decompose() from BeautifulSoup using the class name sample but I cannot applied decompose because I have different block attributes as well as tag name but the text within has the same pattern. Is there any modules in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>

To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
sample = soup.find(text=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())
Side note: notice that I fixed the closing tag for your "container" div!

Python Beautifulsoup Find_all except

I'm struggling to find a simple to solve this problem and hope you might be able to help.
I've been using Beautifulsoup's find all and trying some regex to find all the items except the 'emptyLine' line in the html below:
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 last">...</div>
<div class="product_item2 emptyItem">...</div>
Is there a simple way to find all the items except one including the 'emptyItem'?

Just skip elements containing the emptyItem class. Working sample:
from bs4 import BeautifulSoup
data = """
<div>
<div class="product_item0">test0</div>
<div class="product_item1">test1</div>
<div class="product_item2">test2</div>
<div class="product_item2 emptyItem">empty</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for elm in soup.select("div[class^=product_item]"):
if "emptyItem" in elm["class"]: # skip elements having emptyItem class
continue
print(elm.get_text())
Prints:
test0
test1
test2
Note that the div[class^=product_item] is a CSS selector that would match all div elements with a class starting with product_item.

BS4 Searching by Class_ Returning Empty

I currently am successfully scraping the data I need by chaining bs4 .contents together following a find_all('div'), but that seems inherently fragile. I'd like to go directly to the tag I need by class, but my "class_=" search is returning None.
I ran the following code on the html below, which returns None:
soup = BeautifulSoup(text) # this works fine
tag = soup.find(class_ = "loan-section-content") # this returns None
Also tried soup.find('div', class_ = "loan-section-content") - also returns None.
My html is:
<div class="loan-section">
<div class="loan-section-title">
<span class="text-light"> Some Text </span>
</div>
<div class="loan-section-content">
<div class="row">
<div class="col-sm-6">
<strong>More text</strong>
<br/>
<strong>
Dakar, Senegal
</strong>

try this
soup.find(attrs={'class':'loan-section-content'})
or
soup.find('div','loan-section-content')
attrs will search on attributes
Demo:

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to parse HTML to a string template in Python? - python

Related

beautifulsoup filtering descendents using regex

Python 3.8 - BeautifulSoup 4 - unwrap() does not remove all tags

Delete block in HTML based on text

Python Beautifulsoup Find_all except

BS4 Searching by Class_ Returning Empty

Categories

Resources