How to find the second div in HTML with Python BeautifulSoup - python

I'm trying to find the second div (class "container") with BeautifulSoup, but it shows nothing.
<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div> <!-- this is the div I am trying to select -->
My code prints nothing in the terminal:
header = soup.find_all('div', attrs={'class': 'container'})[1]
for text in header.find_all("p"):
    print(text)
driver.close()

Your code first finds all the container divs and picks the second one, which is the one you are trying to select. It then searches for <p> tags within it. Your example HTML, though, does not contain any.
The HTML would need to contain <p> tags for it to find anything, for example:
from bs4 import BeautifulSoup
html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""
soup = BeautifulSoup(html, 'html.parser')
div_2 = soup.find_all('div', attrs={'class': 'container'})[1]
for p in div_2.find_all("p"):
    print(p.text)  # Display the text inside any p tag
This would display:
Hello 1
Hello 2
If you print(div_2) you would see that it contains:
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>
If you are trying to display any text inside div_2 you could try:
print(div_2.text)

Related

Missing parts in Beautiful Soup results

I'm trying to retrieve the table in the ul tag in the following html code, which mostly looks like this:
<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
but just with more entries. It's from this website.
So far I have this (for specifically getting the win rates):
from bs4 import BeautifulSoup
import requests
r = requests.get("https://moba.163.com/m/wx/ss/")
soup = BeautifulSoup(r.content, 'html5lib')
win_rates = soup.find_all('div', class_ = "winrate")
But this returns empty and it seems like the farthest Beautiful Soup was able to get was the ul tag, but none of the information under it. Is this a parsing issue? Or is there JavaScript source code that I'm missing?
I think the issue is the way you are selecting the div by its class attribute. I was able to pull the winrate div with this:
soup.find('div',attrs={'class':'winrate'})
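Note that class_="winrate" and attrs={'class': 'winrate'} are equivalent filters in Beautiful Soup, so switching between them should not by itself change the result. The sketch below runs both forms against a trimmed copy of the snippet from the question (not the live page) to show that they return the same elements. If the same call returns nothing against the downloaded page, one possibility, as the question itself suggests, is that the list is filled in by JavaScript and is missing from the HTML that requests receives. That is an assumption about the site, not something verified here.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the static snippet from the question, not the live page.
html = """<ul class='list' id='js_list'>
<li class="first">
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

# Two spellings of the same class filter.
by_keyword = soup.find_all("div", class_="winrate")
by_attrs = soup.find_all("div", attrs={"class": "winrate"})

rates = [div.get_text(strip=True) for div in by_keyword]
print(rates)
print(by_keyword == by_attrs)
```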

Python 3.8 - BeautifulSoup 4 - unwrap() does not remove all tags

I've been googling through SO for quite some time, but I couldn't find a solution for this one. Sorry if it's a duplicate.
I'm trying to remove all the HTML tags from a snippet, but I don't want to use get_text() because there might be some other tags, like img, that I'd like to use later. BeautifulSoup doesn't quite behave as I expect it to:
from bs4 import BeautifulSoup
html = """
<div>
<div class="somewhat">
<div class="not quite">
</div>
<div class="here">
<blockquote>
<span>
<br />content<br />
</span>
</blockquote>
</div>
<div class="not here either">
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class":"somewhat"}): # in all the "somewhat" divs
    for y in x.find_all('div', {"class":"here"}): # find all the "here" divs
        for inp in y.find_all("blockquote"): # in a "here" div find all blockquote tags for the relevant content
            for newlines in inp('br'):
                inp.br.replace_with("\n") # replace br tags
            for link in inp('a'):
                inp.a.unwrap() # unwrap all a tags
            for quote in inp('span'):
                inp.span.unwrap() # unwrap all span tags
            for block in inp('blockquote'):
                inp.blockquote.unwrap() # <----- should unwrap blockquote
            la_lista.append(inp)
print(la_lista)
The result is as follows:
[<blockquote>
content
</blockquote>]
Any ideas?
Each item returned by y.find_all("blockquote") is a bs4.element.Tag. Calling inp('blockquote') on it only searches the tag's descendants, so it never matches the blockquote tag itself and the loop body never runs.
The solution for you is to remove:
for block in inp('blockquote'):
    inp.blockquote.unwrap()
and replace:
la_lista.append(inp)
with:
la_lista.append(inp.decode_contents())
This answer is based on the answer to the question "BeautifulSoup innerhtml".
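A minimal sketch of that approach, assuming a shortened version of the question's HTML: the br tags are replaced, the span is unwrapped, and decode_contents() returns the inner markup of each blockquote without the blockquote tag itself. The img tag and its pic.png source are hypothetical additions, included only to show that other tags survive.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's HTML; the <img> is a hypothetical addition.
html = """<div class="here">
<blockquote><span><br />content<br /><img src="pic.png" /></span></blockquote>
</div>"""

soup = BeautifulSoup(html, "html.parser")

inner_html = []
for quote in soup.find_all("blockquote"):
    for br in quote.find_all("br"):
        br.replace_with("\n")  # turn each <br> into a newline
    for span in quote.find_all("span"):
        span.unwrap()  # drop the <span> wrapper, keep its children
    # decode_contents() serialises only the children, not the <blockquote> tag
    inner_html.append(quote.decode_contents())

print(inner_html)
```

The list holds plain strings rather than Tag objects, so parse them again if you need to navigate the remaining tags later.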

BeautifulSoup get p contents from a div

I want to extract the content of p tags from a webpage. The way it's structured is like this
<div property="pas:description">
<p>content</p>
<p>content</p>
</div>
I don't just want to use getText() because there's other content on the page I don't want. I've looked through the documentation, but I'm still not sure how to get the content from the p tags here.
EDIT: I don't want to get all content from p tags, as there's other content in p tags on this page. I specifically only want to get the content that's in a div with the property 'pas:description'
You can use
soup.find('div', {'property': "pas:description"})
to find the div with that property, and then search for p tags inside that div:
from bs4 import BeautifulSoup as BS
text = '''<p>without div 1</p>
<div property="pas:description">
<p>content 1</p>
<p>content 2</p>
</div>
<div>
<p>content in div without property </p>
</div>
<p>without div 2</p>'''
soup = BS(text, 'html.parser')
div = soup.find('div', {'property': "pas:description"})
for p in div.find_all('p'):
    print(p.string)
Result
content 1
content 2
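As an alternative sketch, the same lookup can be written as a single CSS selector with select(), assuming a bs4 version that ships with Soup Sieve (4.7 or later). The attribute value is quoted because it contains a colon. The variable name paragraphs is only illustrative.

```python
from bs4 import BeautifulSoup

text = """<p>without div 1</p>
<div property="pas:description">
<p>content 1</p>
<p>content 2</p>
</div>
<p>without div 2</p>"""

soup = BeautifulSoup(text, "html.parser")

# Match only <p> tags nested inside the div with the given property attribute.
selector = 'div[property="pas:description"] p'
paragraphs = [p.get_text() for p in soup.select(selector)]
print(paragraphs)
```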
The code below extracts the text of the first p tag inside the div:
from bs4 import BeautifulSoup
test_html= '''
<div property="pas:description">
<p>content</p>
<p>content</p>
</div>
'''
soup4 = BeautifulSoup(test_html, 'html.parser')
print(soup4.find('div').p.text)

Delete block in HTML based on text

I have the HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I can do this with decompose() from BeautifulSoup using the class name sample, but I cannot rely on that because the blocks have different attributes and tag names; only the text inside follows the same pattern. Is there anything in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>
To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
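import re
from bs4 import BeautifulSoup
html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# string= is the current name of the older text= argument (bs4 4.4.0+)
sample = soup.find(string=re.compile('Name'))
sample.parent.decompose()  # remove the tag that encloses the matched text
print(soup.prettify())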
Side note: notice that I fixed the closing tag for your "container" div!
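If several blocks share the "Name:" label and only one should go, a possible extension is to check the whole text of each enclosing tag against a pattern before removing it. The sketch below assumes a second, hypothetical block (the p tag with Mary) to show that the tag name and attributes do not matter; the pattern and variable names are illustrative too.

```python
import re
from bs4 import BeautifulSoup

# The <p class="other"> block is a hypothetical addition for illustration.
html = """<div id="container">
<div class="sample">Name:<br><b>John</b></div>
<p class="other">Name:<br><b>Mary</b></p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"Name:\s*John")

# Find every "Name:" label, then test the full text of its enclosing tag.
for label in soup.find_all(string=re.compile(r"Name:")):
    block = label.parent
    if pattern.search(block.get_text(" ", strip=True)):
        block.decompose()  # remove the whole block, whatever its tag or class

print(soup.prettify())
```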

Beautiful Soup - Ignore child divs with same name as parent div

The HTML is structured like so:
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
...
Basically, there are many divs with the same class as their child divs, and ultimately I want to find the "important text", which is found under the parent div only.
When I try to find all divs with class="my_class", I obviously get both the parents and the children. How can I get only the parent divs?
Here is my code for getting all divs with class = "my_class" and finding the important text:
my_div_list = soup.find_all('div', attrs={'class': 'my_class'})
for my_div in my_div_list:
    text_item = my_div.find('div') # to get to the div that contains the important text
    print(text_item.getText())
Obviously, the output is:
important text
not important
important text
not important
...
When I want:
important text
important text
...
From the find_all() documentation:
recursive is a boolean argument (defaulting to True) which tells Beautiful Soup
whether to go all the way down the parse tree, or whether to only look at the
immediate children of the Tag or the parser object.
So, assuming the first level of divs sits directly under the <html> and <body> tags, you can set
soup.html.body.find_all('div', attrs={'class': 'my_class'},
recursive=False)
Output:
['important text', 'important text']
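A runnable sketch of the recursive=False idea, under the assumption that the HTML is a bare fragment parsed with html.parser. That parser does not add html or body tags, so the outer divs are direct children of the soup object and find_all() can be called on it directly.

```python
from bs4 import BeautifulSoup

html = '''<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# recursive=False limits the search to direct children of the soup object,
# so the nested my_class divs are skipped.
parents = soup.find_all("div", class_="my_class", recursive=False)
texts = [parent.find("div").get_text() for parent in parents]
print(texts)
```

With a parser that wraps the fragment in html and body tags, such as lxml, the call would start from soup.html.body as in the answer above.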
You can iterate over soup.contents:
from bs4 import BeautifulSoup as soup
r = [i.div.text for i in soup(html, 'html.parser').contents if i != '\n']
Output:
['important text', 'important text']
With bs4 4.7.1 or later you can use the :has and :first-child CSS selectors:
from bs4 import BeautifulSoup as bs
html = '''<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>'''
soup = bs(html, 'lxml')
print([i.text for i in soup.select('.my_class:has(>.my_class) > div:first-child')])
