I want to extract the content of p tags from a webpage. The way it's structured is like this
<div property="pas:description">
<p>content</p>
<p>content</p>
</div>
I don't just want to use getText() because there's other content on the page I don't want. I've looked through documentation, but I'm still not sure how to to get the content from the p tags here
EDIT: I don't want to get all content from p tags, as there's other content in p tags on this page. I specifically only want to get the content that's in a div with the property 'pas:description'
You can use
soup.find('div', {'property': "pas:description"})
to find div with property and later you can search p inside this div
from bs4 import BeautifulSoup as BS
text = '''<p>without div 1</p>
<div property="pas:description">
<p>content 1</p>
<p>content 2</p>
</div>
<div>
<p>content in div without property </p>
</div>
<p>without div 2</p>'''
soup = BS(text, 'html.parser')
div = soup.find('div', {'property': "pas:description"})
for p in div.find_all('p'):
print(p.string)
Result
content 1
content 2
Below is code for extracting "content"
from bs4 import BeautifulSoup
test_html= '''
<div property="pas:description">
<p>content</p>
<p>content</p>
</div>
'''
soup4 = BeautifulSoup(test_html, 'html.parser')
print(soup4.find('div').p.text)
Related
I'm trying to retrieve the table in the ul tag in the following html code, which mostly looks like this:
<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
but just with more entries. It's from this website.
So far I have this (for specifically getting the win rates):
from bs4 import BeautifulSoup
import requests
r = requests.get("https://moba.163.com/m/wx/ss/")
soup = BeautifulSoup(r.content, 'html5lib')
win_rates = soup.find_all('div', class_ = "winrate")
But this returns empty and it seems like the farthest Beautiful Soup was able to get was the ul tag, but none of the information under it. Is this a parsing issue? Or is there JavaScript source code that I'm missing?
I think your issue is that your format is incorrect for pulling the div with the attribute. I was able to pull the winrate div with this:
soup.find('div',attrs={'class':'winrate'})
there i'm finding a second div(container) with beautifulsoup but it show nothing.
<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div>//this div i try to select
My code its show nothing in terminal.
header = soup.find_all('div', attrs={'class': 'container'})[1]
for text in header.find_all("p"):
print(text)
driver.close()
Your code first finds all the container divs and picks the second one which is what you are trying to select. You are then searching for <p> tags within it. Your example HTML though does not containing any.
The HTML would need to contain <p> tags for it to find anything, for example:
from bs4 import BeautifulSoup
html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""
soup = BeautifulSoup(html, 'html.parser')
div_2 = soup.find_all('div', attrs={'class': 'container'})[1]
for p in div_2.find_all("p"):
print(p.text) # Display the text inside any p tag
This would display:
Hello 1
Hello 2
If you print(div_2) you would see that it contains:
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>
If you are trying to display any text inside div_2 you could try:
print(div_2.text)
I've been googling through SO for quite some time, but I couldn't find a solution for this one. Sorry if it's a duplicate.
I'm trying to remove all the HTML tags from a snippet, but I don't want to use get_text() because there might be some other tags, like img, that I'd like to use later. BeautifulSoup doesn't quite behave as I expect it to:
from bs4 import BeautifulSoup
html = """
<div>
<div class="somewhat">
<div class="not quite">
</div>
<div class="here">
<blockquote>
<span>
<br />content<br />
</span>
</blockquote>
</div>
<div class="not here either">
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class":"somewhat"}): # in all the "somewhat" divs
for y in x.find_all('div', {"class":"here"}): # find all the "here" divs
for inp in y.find_all("blockquote"): # in a "here" div find all blockquote tags for the relevant content
for newlines in inp('br'):
inp.br.replace_with("\n") # replace br tags
for link in inp('a'):
inp.a.unwrap() # unwrap all a tags
for quote in inp('span'):
inp.span.unwrap() # unwrap all span tags
for block in inp('blockquote'):
inp.blockquote.unwrap() # <----- should unwrap blockquote
la_lista.append(inp)
print(la_lista)
The result is as follows:
[<blockquote>
content
</blockquote>]
Any ideas?
The type that return from y.find_all("blockquote") is a bs4.element.Tag upon him you can't call the tag himself with inp('blockquote').
The solution for you is to remove:
for block in inp('blockquote'):
inp.blockquote.unwrap()
and replace:
la_lista.append(inp)
with:
la_lista.append(inp.decode_contents())
The answer is based on the following answer BeautifulSoup innerhtml
i'm using python 3 and what i want to do is analyze an HTML page and extract some informations from specific tag.
This operation must be done multiple time. To get the HTML page i'm using beautifulsoup module and i can get correctly the html code by this way:
import urllib.request as req
import bs4
url = 'http://myurl.com'
reqq = req.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
reddit_file = req.urlopen(reqq)
reddit_data = reddit_file.read().decode('utf-8')
soup = bs4.BeautifulSoup(reddit_data, 'lxml')
my html structure is the following:
<div class="first_div" id="12345">
<div class="second_div">
<div class="third_div">
<div class="fourth_div">
<div class="fifth_div">
<a id="dealImage" class="checked_div" href="http://www.myurl.com/">
What i want to extract is the href value, and so http://www.myurl.com/
I tried using the find() function like this way and it works:
div = soup.find("div", {"class" : "first_div"})
But if i try to find directly the second div:
div = soup.find("div", {"class" : "second_div"})
it returns empty value
Thanks
EDIT:
the source html page is the following:
view-source:https://www.amazon.it/gp/goldbox/ref=gbps_ftr_s-5_2d1d_page_1?gb_f_deals1=dealTypes:LIGHTNING_DEAL%252CBEST_DEAL%252CDEAL_OF_THE_DAY,sortOrder:BY_SCORE&pf_rd_p=82dc915a-4dd2-4943-b59f-dbdbc6482d1d&pf_rd_s=slot-5&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A11IL2PNWYJU7H&pf_rd_r=5Q5APCV900GSWS51A6QJ&ie=UTF8
What i have to extract is the href value from the a-row dealContainer dealTile div class
find Return only the first child of this Tag matching the given criteria.
But findAll Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.
Here if you want to extract all href so you need to use for loop:
href = soup.findAll("div", {"class" : "first_div"})
for item in href:
print(img.get('href'))
Use Css selector which is much faster.
from bs4 import BeautifulSoup
reddit_data='''<div class="first_div" id="12345">
<div class="second_div">
<div class="third_div">
<div class="fourth_div">
<div class="fifth_div">
<a id="dealImage" class="checked_div" href="http://www.myurl.com/">
</div>
</div>
</div>
</div>
</div>'''
soup = BeautifulSoup(reddit_data, 'lxml')
for item in soup.select(".first_div a[href]"):
print(item['href'])
the html is structured as so:
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
...
Basically, there are many divs with the same name as their child divs, and ultimately, I want to find the "important text" which is found under the partent div only.
When I try to find all divs with class="my_class", I obviously get both the parents and the childs. How can I only get the parent divs?
Here is my code for getting all divs with class = "my_class" and finding the important text:
my_div_list = soup.find_all('div', attrs={'class': 'my_class'})
for my_div in my_div_list:
text_item = my_div.find('div') # to get to the div that contains the important text
print(text_item.getText())
Obviously, the output is:
important text
not important
important text
not important
...
When I want:
important text
important text
...
From the findall() documentation:
recursive is a boolean argument (defaulting to True) which tells Beautiful Soup
whether to go all the way down the parse tree, or whether to only look at the
immediate children of the Tag or the parser object.
So, given the first level of divs is for example under the tags <head> and <body>, you can set
soup.html.body.find_all('div', attrs={'class': 'my_class'},
recursive=False)
Output:
['important text', 'important text']
You can iterate over soup.contents:
from bs4 import BeautifulSoup as soup
r = [i.div.text for i in soup(html, 'html.parser').contents if i != '\n']
Output:
['important text', 'important text']
with bs4 4.7.1 you can use :has and :first-child
from bs4 import BeautifulSoup as bs
html = '''<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>'''
soup = bs(html, 'lxml')
print([i.text for i in soup.select('.my_class:has(>.my_class) > div:first-child')])