Beautiful Soup - Ignore child divs with same name as parent div - python

the html is structured as so:
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
...
Basically, there are many divs with the same name as their child divs, and ultimately, I want to find the "important text" which is found under the partent div only.
When I try to find all divs with class="my_class", I obviously get both the parents and the childs. How can I only get the parent divs?
Here is my code for getting all divs with class = "my_class" and finding the important text:
my_div_list = soup.find_all('div', attrs={'class': 'my_class'})
for my_div in my_div_list:
text_item = my_div.find('div') # to get to the div that contains the important text
print(text_item.getText())
Obviously, the output is:
important text
not important
important text
not important
...
When I want:
important text
important text
...

From the findall() documentation:
recursive is a boolean argument (defaulting to True) which tells Beautiful Soup
whether to go all the way down the parse tree, or whether to only look at the
immediate children of the Tag or the parser object.
So, given the first level of divs is for example under the tags <head> and <body>, you can set
soup.html.body.find_all('div', attrs={'class': 'my_class'},
recursive=False)
Output:
['important text', 'important text']

You can iterate over soup.contents:
from bs4 import BeautifulSoup as soup
r = [i.div.text for i in soup(html, 'html.parser').contents if i != '\n']
Output:
['important text', 'important text']

with bs4 4.7.1 you can use :has and :first-child
from bs4 import BeautifulSoup as bs
html = '''<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>'''
soup = bs(html, 'lxml')
print([i.text for i in soup.select('.my_class:has(>.my_class) > div:first-child')])

Related

beautifulsoup filtering descendents using regex

I'm trying to use beautiful-soup to return elements of the DOM that contain children that match filtering criteria.
In the example below,I want to return both divs based on finding a regex match in a child element.
<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="randomclass">regexmatch2</span>
<h2>title</h2>
</div>
</body>
The basic code setup is as follows:
from bs4 import BeautifulSoup as soup
page = soup(html)
Results = page.find_all('div')
How do I add a regex test that evaluates the children of the target div? I.e, how would I add the regex call below to the 'find' or 'find_all' functions of beautiful-soup?
re.compile('regexmatch\d')
The approach I landed with was find_parent, which will return the parent element of the beautifulsoup results regardless of the method used to find the original result (regex or otherwise). For the example above:
childOfResults = page.find_all('span', string=re.compile('regexmatch\d'))
Results = childOfResult[0].find_parent()
...modified with the loop of your choice to cycle through all the members of childOfResult
Get the first div then run for loop on all div's
Example
from bs4 import BeautifulSoup
html = """<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="randomclass">regexmatch2</span>
<h2>title</h2>
</div>
</body>"""
page_soup = BeautifulSoup(html, features='html.parser')
elements = page_soup.select('body > div')
for element in elements:
print(element.select("span:nth-child(1)")[0].text)
it prints out
regexmatch1
regexmatch2

how to find second div from html in python beautifulsoup

there i'm finding a second div(container) with beautifulsoup but it show nothing.
<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div>//this div i try to select
My code its show nothing in terminal.
header = soup.find_all('div', attrs={'class': 'container'})[1]
for text in header.find_all("p"):
print(text)
driver.close()
Your code first finds all the container divs and picks the second one which is what you are trying to select. You are then searching for <p> tags within it. Your example HTML though does not containing any.
The HTML would need to contain <p> tags for it to find anything, for example:
from bs4 import BeautifulSoup
html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""
soup = BeautifulSoup(html, 'html.parser')
div_2 = soup.find_all('div', attrs={'class': 'container'})[1]
for p in div_2.find_all("p"):
print(p.text) # Display the text inside any p tag
This would display:
Hello 1
Hello 2
If you print(div_2) you would see that it contains:
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>
If you are trying to display any text inside div_2 you could try:
print(div_2.text)

Python 3.8 - BeautifulSoup 4 - unwrap() does not remove all tags

I've been googling through SO for quite some time, but I couldn't find a solution for this one. Sorry if it's a duplicate.
I'm trying to remove all the HTML tags from a snippet, but I don't want to use get_text() because there might be some other tags, like img, that I'd like to use later. BeautifulSoup doesn't quite behave as I expect it to:
from bs4 import BeautifulSoup
html = """
<div>
<div class="somewhat">
<div class="not quite">
</div>
<div class="here">
<blockquote>
<span>
<br />content<br />
</span>
</blockquote>
</div>
<div class="not here either">
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class":"somewhat"}): # in all the "somewhat" divs
for y in x.find_all('div', {"class":"here"}): # find all the "here" divs
for inp in y.find_all("blockquote"): # in a "here" div find all blockquote tags for the relevant content
for newlines in inp('br'):
inp.br.replace_with("\n") # replace br tags
for link in inp('a'):
inp.a.unwrap() # unwrap all a tags
for quote in inp('span'):
inp.span.unwrap() # unwrap all span tags
for block in inp('blockquote'):
inp.blockquote.unwrap() # <----- should unwrap blockquote
la_lista.append(inp)
print(la_lista)
The result is as follows:
[<blockquote>
content
</blockquote>]
Any ideas?
The type that return from y.find_all("blockquote") is a bs4.element.Tag upon him you can't call the tag himself with inp('blockquote').
The solution for you is to remove:
for block in inp('blockquote'):
inp.blockquote.unwrap()
and replace:
la_lista.append(inp)
with:
la_lista.append(inp.decode_contents())
The answer is based on the following answer BeautifulSoup innerhtml

BeautifulSoup get p contents from a div

I want to extract the content of p tags from a webpage. The way it's structured is like this
<div property="pas:description">
<p>content</p>
<p>content</p>
</div>
I don't just want to use getText() because there's other content on the page I don't want. I've looked through documentation, but I'm still not sure how to to get the content from the p tags here
EDIT: I don't want to get all content from p tags, as there's other content in p tags on this page. I specifically only want to get the content that's in a div with the property 'pas:description'
You can use
soup.find('div', {'property': "pas:description"})
to find div with property and later you can search p inside this div
from bs4 import BeautifulSoup as BS
text = '''<p>without div 1</p>
<div property="pas:description">
<p>content 1</p>
<p>content 2</p>
</div>
<div>
<p>content in div without property </p>
</div>
<p>without div 2</p>'''
soup = BS(text, 'html.parser')
div = soup.find('div', {'property': "pas:description"})
for p in div.find_all('p'):
print(p.string)
Result
content 1
content 2
Below is code for extracting "content"
from bs4 import BeautifulSoup
test_html= '''
<div property="pas:description">
<p>content</p>
<p>content</p>
</div>
'''
soup4 = BeautifulSoup(test_html, 'html.parser')
print(soup4.find('div').p.text)

Delete block in HTML based on text

I have an HTML snippet below and I need to delete a block based on its text for example Name: John. I know I can do this with decompose() from BeautifulSoup using the class name sample but I cannot applied decompose because I have different block attributes as well as tag name but the text within has the same pattern. Is there any modules in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>
To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
sample = soup.find(text=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())
Side note: notice that I fixed the closing tag for your "container" div!

Categories

Resources