How to find a child of nested elements using Python BeautifulSoup

I want to get all the information within the <span> tags that sit inside <p> tags inside the <div> tag, using Python and BeautifulSoup. Specifically, I am looking for the information within the 'Data that I want to read' <span>.
<body>
  <div id='output'>
    <p style="overflow-wrap: break-word">CONNECTED</p>
    <p style="overflow-wrap: break-word">SENT</p>
    <p style="overflow-wrap: break-word">
      <span style="color: blue">
        Data that I want to read
      <span/>
    </p>
  <div/>
<body/>
I have the following, which finds the text within the <div> tags and nothing else.
req = requests.get(url, headers)
soup = BeautifulSoup(req.content, 'html.parser')
websiteData = soup.find_all("div")
for someData in websiteData:
    childElement = someData.findChildren("p", recursive=True)
    for child in childElement:
        childElementofChildElement = child.findChildren("span", recursive=True)
        for child in childElementofChildElement:
            print(child)

You can use a CSS selector for the task:
from bs4 import BeautifulSoup
html_doc = '''\
<body>
<div id='output'>
<p style="overflow-wrap: break-word">CONNECTED</p>
<p style="overflow-wrap: break-word">SENT</p>
<p style="overflow-wrap: break-word">
<span style="color: blue">
Data that I want to read
</span>
</p>
<div/>
<body/>'''
soup = BeautifulSoup(html_doc, 'html.parser')
for t in soup.select('div#output p span'):
    print(t.text.strip())
Prints:
Data that I want to read
The CSS selector div#output p span means: select all <span> tags that are descendants of a <p> tag, where that <p> tag is itself a descendant of a <div> tag with id="output".
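If you only need the first matching <span>, select_one() is a handy shortcut. A minimal sketch, reusing the html_doc string from above:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

# select_one() returns the first match, or None if nothing matches
span = soup.select_one('div#output p span')
if span is not None:
    print(span.get_text(strip=True))  # Data that I want to read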

Related

Get hrefs from <a> Tags Located in the Divs with Specific Classes Using BeautifulSoup

I need to get the hrefs from the <a> tags on a website, but not all of them, only the ones that are inside the spans located in the <div>s with class arm:
<html>
<body>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="footnote">
    <span>
      <a href="...">anotherLink</a>
      <a href="...">anotherLink</a>
      <a href="...">anotherLink</a>
    </span>
  </div>
</body>
</html>
import requests
from bs4 import BeautifulSoup as bs
request = requests.get("url")
html = bs(request.content, 'html.parser')
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    print("anchor['href']")
But my code doesn't print anything
Your code looks fine until you get to the print("anchor['href']") line, which I assume is meant to be print(anchor['href']).
Now, anchor is a ResultSet, which means you will need another loop to get the hrefs. Here is how those final lines should look if you want minimal modification to your code:
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    for x in anchor:
        print(x.attrs['href'])
We basically add:
for x in anchor:
    print(x.attrs['href'])
And you should get the hrefs. All the best.
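As a side note, the whole thing can also be done in one pass with a single CSS selector instead of two nested loops. A minimal sketch, assuming the html variable from the question:
# "div.arm span > a" matches <a> tags that are direct children of a <span>
# inside any <div class="arm">, so the "footnote" div is skipped entirely
hrefs = [a['href'] for a in html.select("div.arm span > a")]
print(hrefs)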
Try using the find_all() method to obtain the values in specific tags and classes.
I have replicated your HTML file and obtained the values in the span tags. Please see my sample code below.
Replicated HTML file:
# Creating the HTML file
file_html = open("demo.html", "w")
# Adding the input data to the HTML file
file_html.write('''<html>
<body>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="arm">
    <span>
      <a href="...">link</a>
      <a href="...">link</a>
      <a href="...">link</a>
    </span>
  </div>
  <div class="footnote">
    <span>
      <a href="...">anotherLink</a>
      <a href="...">anotherLink</a>
      <a href="...">anotherLink</a>
    </span>
  </div>
</body>
</html>''')
# Saving the data into the HTML file
file_html.close()
Code:
import requests
from bs4 import BeautifulSoup as bs
# Reading the replicated HTML file
demo = open("demo.html", "r")
results = bs(demo, 'html.parser')

# Using the find_all method to find specific tags and a class
job_elements = results.find_all("div", class_="arm")
for job_element in job_elements:
    links = job_element.find_all("a")
    for link in links:
        print(link['href'])
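A small aside: opening the file with a with block closes the handle automatically. A sketch of the same read, assuming demo.html was created as above:
from bs4 import BeautifulSoup as bs

# The file handle is closed as soon as the block ends
with open("demo.html", "r") as demo:
    results = bs(demo, 'html.parser')

for link in results.select("div.arm span > a"):
    print(link.get('href'))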
reference:
https://realpython.com/beautiful-soup-web-scraper-python/

Find the parent tag of the most occurring tag - BeautifulSoup 4

While working on a scraper with BeautifulSoup, I ran into a problem where I needed to find the parent tag of the most occurring <p> tag on a page. For example:
<div class="cls1">
  <p>
  <p>
  <p>
</div>
<div class="cls2">
  <p>
  <P>
</div>
I need to get the tag which has the most direct children that are <p> elements. In the above example, it would be <div class="cls1">, since it contains 3 <p> tags as opposed to .cls2, which only contains 2.
Any suggestions on how I would approach this or if this is entirely possible?
You can use the built-in max() function with a custom key=:
data = '''<div class="cls1">
<p>
<p>
<p>
</div>
<div class="cls2">
<p>
<P>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html5lib')
print(max(soup.select('div:has(> p)'), key=lambda k: len(k.findChildren('p', recursive=False))))
Prints:
<div class="cls1">
<p>
</p><p>
</p><p>
</p></div>
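If the parent is not necessarily a <div>, the same idea works over every tag in the tree. A minimal sketch, assuming the same data string and parser as above:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html5lib')

# Consider every tag and count only its direct <p> children
best = max(soup.find_all(True), key=lambda t: len(t.find_all('p', recursive=False)))
print(best.get('class'))  # ['cls1']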

How to get all hrefs (inside <a> tags) and assign them to a variable?

I need all the hrefs present in the <a> tags and to assign them to a variable.
I did this, but only got the first link:
soup_level1 = BeautifulSoup(driver.page_source, 'lxml')
userName = soup_level1.find(class_='_32mo')
link1 = (userName.get('href'))
And the output I get is:
print(link1)
https://www.facebook.com/xxxxxx?ref=br_rs
But I need at least the top 3 or top 5 links.
The structure of the webpage is:
<div>
  <a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
  <a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
  <a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
I need those hrefs
from bs4 import BeautifulSoup
html="""
<div>
  <a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
  <a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
  <a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
"""
soup=BeautifulSoup(html,'lxml')
my_links = soup.findAll("a", {"class": "_32mo"})
for link in my_links:
    print(link.get('href'))
Output
https://www.facebook.com/xxxxx?ref=br_rs
https://www.facebook.com/yyyyy?ref=br_rs
https://www.facebook.com/zzzzz?ref=br_rs
To get the top n links, you can use:
max_num_of_links=2
for link in my_links[:max_num_of_links]:
    print(link.get('href'))
Output
https://www.facebook.com/xxxxx?ref=br_rs
https://www.facebook.com/yyyyy?ref=br_rs
You can also save the top n links to a list:
link_list=[]
max_num_of_links=2
for link in my_links[:max_num_of_links]:
    link_list.append(link.get('href'))
print(link_list)
Output
['https://www.facebook.com/xxxxx?ref=br_rs', 'https://www.facebook.com/yyyyy?ref=br_rs']
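The same list can also be built more compactly with a list comprehension. A minimal sketch, assuming my_links from above:
max_num_of_links = 2
link_list = [link.get('href') for link in my_links[:max_num_of_links]]
print(link_list)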
EDIT:
If you need the driver to get the links one by one:
max_num_of_links=3
for link in my_links[:max_num_of_links]:
    driver.get(link.get('href'))
    # rest of your code ...
If for some reason you want the links in separate variables like link1, link2, etc.:
from bs4 import BeautifulSoup
html="""
<div>
  <a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
  <a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
  <a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
"""
soup=BeautifulSoup(html,'lxml')
my_links = soup.findAll("a", {"class": "_32mo"})
link1=my_links[0].get('href')
link2=my_links[1].get('href')
link3=my_links[2].get('href')
# and so on, but be careful here you don't want to try to access a link which is not there or you'll get index error
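Sequence unpacking does the same thing in one line. A minimal sketch; note that it still raises a ValueError if fewer than three links were found:
# Unpack the first three hrefs into separate variables
link1, link2, link3 = [a.get('href') for a in my_links[:3]]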

How to remove surplus tags from beautiful soup result

I want to get only the content in the <p> tag and remove the surplus div tags.
My code is:
page = """
<p style="text-align: justify">content that I want
<div ><!-- /316485075/agk_116000_pos_3_sidebar_mobile -->
<div id="agk_116000_pos_3_sidebar_mobile">
<script>
script code
</script>
</div>
<div class="nopadding clearfix hidden-print">
<div align="center" class="col-md-12">
<!-- /316485075/agk_116000_pos_4_conteudo_desktop -->
<div id="agk_116000_pos_4_conteudo_desktop" style="height:90px; width:728px;">
<script>
script code
</script>
</div>
</div>
</div>
</div>
</p>
"""
soup = BeautifulSoup(page, 'html.parser')
p = soup.find_all('p', {'style' : 'text-align: justify'})
And I just want to get the string <p>content that I want</p> and remove all the divs
You can use the replace_with() function to remove the tags along with their contents.
soup = BeautifulSoup(page, 'html.parser')  # page is the HTML you've provided in the question
soup.find('div').replace_with('')
print(soup)
Output:
<p style="text-align: justify">content that I want
</p>
Note: I'm using soup.find('div') here as all the unwanted tags are inside the first div tag. Hence, if you remove that tag, all the others are removed too. But if you want to remove all the tags other than the p tags in an HTML document where the structure is not like this, you'll have to use this:
for tag in soup.find_all():
    if tag.name == 'p':
        continue
    tag.replace_with('')
Which is equivalent to:
[tag.replace_with('') for tag in soup.find_all(lambda t: t.name != 'p')]
If you simply want the 'content that I want' text, you can use this:
print(soup.find('p').contents[0])
# content that I want
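As an aside, decompose() is another common way to drop a tag together with everything inside it. A minimal sketch on a fresh soup of the same page string:
soup = BeautifulSoup(page, 'html.parser')

# decompose() removes the tag and all of its contents from the tree
soup.find('div').decompose()
print(soup.find('p').get_text(strip=True))  # content that I want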
You can also use a regex; capture group 2 contains your content: <(.*?)(?:\s.+?>)(.*?)</\1[>]?
See https://regex101.com/r/m8DQic/1

Removing style from specific tags BeautifulSoup/Python

Let's say I have a soup and I'd like to remove the style attribute from all the paragraphs. So I'd like to turn <p style='blah' id='bla' class=...> into <p id='bla' class=...> in the entire soup. But I don't want to touch, say, <img style='...'> tags. How would I do this?
The idea is to iterate over all p tags using find_all('p') and remove the style attribute:
from bs4 import BeautifulSoup
data = """
<body>
<p style='blah' id='bla1'>paragraph1</p>
<p style='blah' id='bla2'>paragraph2</p>
<p style='blah' id='bla3'>paragraph3</p>
<img style="awesome_image"/>
</body>"""
soup = BeautifulSoup(data, 'html.parser')
for p in soup.find_all('p'):
    if 'style' in p.attrs:
        del p.attrs['style']

print(soup.prettify())
Prints:
<body>
<p id="bla1">
paragraph1
</p>
<p id="bla2">
paragraph2
</p>
<p id="bla3">
paragraph3
</p>
<img style="awesome_image"/>
</body>
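If the same cleanup is ever needed for other tags or attributes, the loop generalizes easily. A minimal sketch; the helper name and parameters are just examples:
def strip_attribute(soup, tag_name, attr):
    # Remove attr from every tag_name tag, leaving other tags untouched
    for tag in soup.find_all(tag_name):
        if attr in tag.attrs:
            del tag.attrs[attr]

strip_attribute(soup, 'p', 'style')  # same effect as the loop above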
