Delete block in HTML based on text - python

I have an HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I could do this with decompose() from BeautifulSoup using the class name sample, but I cannot apply decompose() here because the blocks have different attributes and tag names; only the text within follows the same pattern. Is there a module in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>

To find tags based on inner text, see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
import re
from bs4 import BeautifulSoup

html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
sample = soup.find(string=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())
Side note: notice that I fixed the closing tag for your "container" div!
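If several blocks share the text pattern, the same idea extends to a loop. A minimal sketch, with a second block whose tag and attributes are made up purely to show that they don't matter:

```python
import re
from bs4 import BeautifulSoup

# hypothetical markup: two blocks with different tags/attributes
# but the same "Name:" text pattern
html = '''<div id="container">
<div class="sample">Name:<br><b>John</b></div>
<section data-x="1">Name:<br><b>Jane</b></section>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# find_all(string=...) matches text nodes regardless of the parent's
# tag name or attributes; decompose each parent block
for text_node in soup.find_all(string=re.compile('Name')):
    text_node.parent.decompose()
print(soup.prettify())
```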

Related

Get hrefs from <a> Tags Located in the Divs with a Specific Classes Using BeautifulSoup

I need to get hrefs from <a> tags on a website, but not all of them, only the ones that are in the spans located in the <div>s with class arm:
<html>
<body>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="footnote">
    <span>
      <a href="...">anotherLink</a>
      <a href="...">anotherLink</a>
      <a href="...">anotherLink</a>
    </span>
  </div>
</body>
</html>
import requests
from bs4 import BeautifulSoup as bs

request = requests.get("url")
html = bs(request.content, 'html.parser')
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    print("anchor['href']")
But my code doesn't print anything
Your code looks fine until you get to the print("anchor['href']") line, which I assume is meant to be print(anchor['href']).
Now, anchor is a ResultSet, which means you will need another loop to get the hrefs. Here is how those final lines should look if you want minimal modification to your code:
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    for x in anchor:
        print(x.attrs['href'])
We basically add:
for x in anchor:
    print(x.attrs['href'])
And you should get the hrefs. All the best.
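As a side note, the nested loops can be collapsed into a single CSS selector. A minimal sketch with made-up hrefs, since the real URLs are not in the question:

```python
from bs4 import BeautifulSoup

# hypothetical markup standing in for the page; hrefs are invented
html = '''<div class="arm"><span><a href="/a1">link</a><a href="/a2">link</a></span></div>
<div class="footnote"><span><a href="/f1">anotherLink</a></span></div>'''

soup = BeautifulSoup(html, 'html.parser')
# one selector: anchors that are direct children of spans inside .arm divs
hrefs = [a['href'] for a in soup.select('div.arm span > a')]
print(hrefs)  # ['/a1', '/a2']
```

The footnote link is skipped because its span is not inside a div with class arm.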
Try using the find_all() method to obtain the values in specific tags and classes.
I have replicated your HTML file and obtained the values in the span tags. Please see my sample code below.
Replicated HTML file:
# Creating the HTML file
file_html = open("demo.html", "w")
# Adding the input data to the HTML file
file_html.write('''<html>
<body>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="footnote">
    <span>
      <a href="...">anotherLink</a>
      <a href="...">anotherLink</a>
      <a href="...">anotherLink</a>
    </span>
  </div>
</body>
</html>''')
# Saving the data into the HTML file
file_html.close()
code:
from bs4 import BeautifulSoup as bs

# Read the replicated HTML file
demo = open("demo.html", "r")
results = bs(demo, 'html.parser')
# Use find_all to match a specific tag and class
job_elements = results.find_all("div", class_="arm")
for job_element in job_elements:
    links = job_element.find_all("a")
    for link in links:
        print(link['href'])
reference:
https://realpython.com/beautiful-soup-web-scraper-python/

beautifulsoup filtering descendents using regex

I'm trying to use beautiful-soup to return elements of the DOM that contain children that match filtering criteria.
In the example below, I want to return both divs based on finding a regex match in a child element.
<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="randomclass">regexmatch2</span>
<h2>title</h2>
</div>
</body>
The basic code setup is as follows:
from bs4 import BeautifulSoup as soup
page = soup(html, 'html.parser')
Results = page.find_all('div')
How do I add a regex test that evaluates the children of the target div? I.e, how would I add the regex call below to the 'find' or 'find_all' functions of beautiful-soup?
re.compile(r'regexmatch\d')
The approach I landed with was find_parent, which will return the parent element of the beautifulsoup results regardless of the method used to find the original result (regex or otherwise). For the example above:
childOfResults = page.find_all('span', string=re.compile(r'regexmatch\d'))
Results = childOfResults[0].find_parent()
...modified with the loop of your choice to cycle through all the members of childOfResults
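That loop might look like the following sketch, reusing the markup from the question:

```python
import re
from bs4 import BeautifulSoup

html = '''<body>
<div class="randomclass1"><span class="randomclass">regexmatch1</span><h2>title</h2></div>
<div class="randomclass2"><span class="randomclass">regexmatch2</span><h2>title</h2></div>
</body>'''

page = BeautifulSoup(html, 'html.parser')
# find each span whose text matches, then step up to its enclosing div
childOfResults = page.find_all('span', string=re.compile(r'regexmatch\d'))
Results = [span.find_parent('div') for span in childOfResults]
print([div['class'] for div in Results])  # [['randomclass1'], ['randomclass2']]
```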
Select the divs under body, then run a for loop over all of them.
Example
from bs4 import BeautifulSoup
html = """<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="randomclass">regexmatch2</span>
<h2>title</h2>
</div>
</body>"""
page_soup = BeautifulSoup(html, features='html.parser')
elements = page_soup.select('body > div')
for element in elements:
    print(element.select("span:nth-child(1)")[0].text)
It prints out:
regexmatch1
regexmatch2
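Another option, closer to what the question literally asks for, is to pass a function to find_all() so the regex test on the children happens inside the search itself. A minimal sketch:

```python
import re
from bs4 import BeautifulSoup

html = '''<body>
<div class="randomclass1"><span class="randomclass">regexmatch1</span></div>
<div class="randomclass3"><span class="randomclass">no match</span></div>
</body>'''

page = BeautifulSoup(html, 'html.parser')
pattern = re.compile(r'regexmatch\d')

# keep a tag only if it is a div with at least one span whose text matches
def has_matching_span(tag):
    return tag.name == 'div' and tag.find('span', string=pattern) is not None

results = page.find_all(has_matching_span)
print([div['class'] for div in results])  # [['randomclass1']]
```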

how to find second div from html in python beautifulsoup

I'm trying to find the second div (container) with BeautifulSoup, but it shows nothing.
<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div> <!-- this is the div I try to select -->
My code shows nothing in the terminal.
header = soup.find_all('div', attrs={'class': 'container'})[1]
for text in header.find_all("p"):
    print(text)
driver.close()
Your code first finds all the container divs and picks the second one, which is the one you are trying to select. You are then searching for <p> tags within it, but your example HTML does not contain any.
The HTML would need to contain <p> tags for it to find anything, for example:
from bs4 import BeautifulSoup
html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""
soup = BeautifulSoup(html, 'html.parser')
div_2 = soup.find_all('div', attrs={'class': 'container'})[1]
for p in div_2.find_all("p"):
    print(p.text)  # Display the text inside each p tag
This would display:
Hello 1
Hello 2
If you print(div_2) you would see that it contains:
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>
If you are trying to display any text inside div_2 you could try:
print(div_2.text)
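The same element can also be reached with a CSS selector; select() returns matches in document order, so indexing works the same way as with find_all(). A small sketch:

```python
from bs4 import BeautifulSoup

html = '''<div class="section-heading-page">
<div class="container"></div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>'''

soup = BeautifulSoup(html, 'html.parser')
# select('div.container') lists both containers in document order,
# so index 1 is the second one
div_2 = soup.select('div.container')[1]
print(div_2.get_text())  # Hello 1Hello 2
```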

Python 3.8 - BeautifulSoup 4 - unwrap() does not remove all tags

I've been googling through SO for quite some time, but I couldn't find a solution for this one. Sorry if it's a duplicate.
I'm trying to remove all the HTML tags from a snippet, but I don't want to use get_text() because there might be some other tags, like img, that I'd like to use later. BeautifulSoup doesn't quite behave as I expect it to:
from bs4 import BeautifulSoup
html = """
<div>
<div class="somewhat">
<div class="not quite">
</div>
<div class="here">
<blockquote>
<span>
<br />content<br />
</span>
</blockquote>
</div>
<div class="not here either">
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class": "somewhat"}):  # in all the "somewhat" divs
    for y in x.find_all('div', {"class": "here"}):  # find all the "here" divs
        for inp in y.find_all("blockquote"):  # in a "here" div find all blockquote tags for the relevant content
            for newlines in inp('br'):
                inp.br.replace_with("\n")  # replace br tags
            for link in inp('a'):
                inp.a.unwrap()  # unwrap all a tags
            for quote in inp('span'):
                inp.span.unwrap()  # unwrap all span tags
            for block in inp('blockquote'):
                inp.blockquote.unwrap()  # <----- should unwrap blockquote
            la_lista.append(inp)
print(la_lista)
The result is as follows:
[<blockquote>
content
</blockquote>]
Any ideas?
The objects returned by y.find_all("blockquote") are bs4.element.Tag instances: inp is already the blockquote tag itself, and inp('blockquote') searches only its descendants, so it never finds a blockquote to unwrap.
The solution for you is to remove:
for block in inp('blockquote'):
    inp.blockquote.unwrap()
and replace:
la_lista.append(inp)
with:
la_lista.append(inp.decode_contents())
The answer is based on the following answer BeautifulSoup innerhtml
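Putting the fix together on a reduced version of the snippet, as a sketch:

```python
from bs4 import BeautifulSoup

html = '<blockquote><span><br/>content<br/></span></blockquote>'
soup = BeautifulSoup(html, 'html.parser')
inp = soup.blockquote  # what y.find_all("blockquote") yields

for newlines in inp('br'):
    inp.br.replace_with('\n')  # replace br tags
for quote in inp('span'):
    inp.span.unwrap()  # unwrap span tags
# decode_contents() returns the inner HTML, so the outer
# blockquote tag is dropped without needing unwrap()
print(repr(inp.decode_contents()))  # '\ncontent\n'
```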

BS4 Searching by Class_ Returning Empty

I currently am successfully scraping the data I need by chaining bs4 .contents together following a find_all('div'), but that seems inherently fragile. I'd like to go directly to the tag I need by class, but my "class_=" search is returning None.
I ran the following code on the html below, which returns None:
soup = BeautifulSoup(text) # this works fine
tag = soup.find(class_ = "loan-section-content") # this returns None
Also tried soup.find('div', class_ = "loan-section-content") - also returns None.
My html is:
<div class="loan-section">
<div class="loan-section-title">
<span class="text-light"> Some Text </span>
</div>
<div class="loan-section-content">
<div class="row">
<div class="col-sm-6">
<strong>More text</strong>
<br/>
<strong>
Dakar, Senegal
</strong>
Try this:
soup.find(attrs={'class': 'loan-section-content'})
or:
soup.find('div', 'loan-section-content')
attrs= searches on attributes; the bare string in the second form is shorthand for a CSS class.
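A minimal demo of both forms on a cut-down version of the HTML above; both return the same tag:

```python
from bs4 import BeautifulSoup

html = '''<div class="loan-section">
<div class="loan-section-title"><span class="text-light"> Some Text </span></div>
<div class="loan-section-content"><div class="row"><strong>More text</strong></div></div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
tag1 = soup.find(attrs={'class': 'loan-section-content'})
tag2 = soup.find('div', 'loan-section-content')  # string -> class shorthand
print(tag1 is tag2)  # True
```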
