BeautifulSoup.find_all for nested divs without class attribute - python

I am working with Python 2 and want to get the content of a div in an HTML page.
<div class="lts-txt2">
Some Content
</div>
If the div class is like above then I can get the content using
BeautifulSoup.find_all('div', attrs={"class": 'lts-txt2'})
But if the div is like,
<div class="lts-txt2">
<div align="justify">
Some Content
</div>
</div>
then using
BeautifulSoup.find_all('div', attrs={"class": 'lts-txt2'})
does not return the content.
So I tried with
BeautifulSoup.find_all('div', attrs={"align": 'justify'})
But that did not work either.
How can I solve this problem?

You can extract all text from the node including nested nodes with the Element.get_text() method:
[el.get_text() for el in soup.find_all('div', attrs={"class": 'lts-txt2'})]
This produces a list with the textual content of each such div, whether or not there is a nested div inside.
You could also use the Element.select() method with a CSS selector to select the nested div:
soup.select('div.lts-txt2 > div')
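Both approaches can be tried on the markup from the question. This is a minimal sketch, assuming bs4 is installed and using the built-in html.parser; the variable names texts and nested are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question: the text sits inside a nested div.
html = """<div class="lts-txt2">
<div align="justify">
Some Content
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# get_text() gathers text from the div and all of its descendants;
# strip=True trims the surrounding whitespace and newlines.
texts = [el.get_text(strip=True)
         for el in soup.find_all("div", attrs={"class": "lts-txt2"})]

# The child combinator ">" selects only divs directly inside div.lts-txt2.
nested = soup.select("div.lts-txt2 > div")

print(texts)
print(nested[0]["align"])
```

Without strip=True the returned strings keep the newlines that surround the text in the source.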

Related

python iterate multiple tags using beautiful soup

I'm using Python 3, and what I want to do is analyze an HTML page and extract some information from a specific tag.
This operation must be done multiple times. To get the HTML page I'm using the beautifulsoup module, and I can get the HTML code correctly this way:
import urllib.request as req
import bs4
url = 'http://myurl.com'
reqq = req.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
reddit_file = req.urlopen(reqq)
reddit_data = reddit_file.read().decode('utf-8')
soup = bs4.BeautifulSoup(reddit_data, 'lxml')
My HTML structure is the following:
<div class="first_div" id="12345">
<div class="second_div">
<div class="third_div">
<div class="fourth_div">
<div class="fifth_div">
<a id="dealImage" class="checked_div" href="http://www.myurl.com/">
What I want to extract is the href value, i.e. http://www.myurl.com/
I tried using the find() function like this, and it works:
div = soup.find("div", {"class" : "first_div"})
But if I try to find the second div directly:
div = soup.find("div", {"class" : "second_div"})
it returns an empty value.
Thanks
EDIT:
The source HTML page is the following:
view-source:https://www.amazon.it/gp/goldbox/ref=gbps_ftr_s-5_2d1d_page_1?gb_f_deals1=dealTypes:LIGHTNING_DEAL%252CBEST_DEAL%252CDEAL_OF_THE_DAY,sortOrder:BY_SCORE&pf_rd_p=82dc915a-4dd2-4943-b59f-dbdbc6482d1d&pf_rd_s=slot-5&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A11IL2PNWYJU7H&pf_rd_r=5Q5APCV900GSWS51A6QJ&ie=UTF8
What I have to extract is the href value from the div with the class a-row dealContainer dealTile.
find returns only the first descendant of this Tag matching the given criteria.
findAll (find_all in bs4), by contrast, returns a list of all Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.
If you want to extract every href, you need a for loop:
for div in soup.findAll("div", {"class" : "first_div"}):
    # The href lives on the nested a tags, not on the div itself.
    for item in div.findAll("a", href=True):
        print(item.get('href'))
You can also use a CSS selector, which is more concise.
from bs4 import BeautifulSoup
reddit_data='''<div class="first_div" id="12345">
<div class="second_div">
<div class="third_div">
<div class="fourth_div">
<div class="fifth_div">
<a id="dealImage" class="checked_div" href="http://www.myurl.com/">
</div>
</div>
</div>
</div>
</div>'''
soup = BeautifulSoup(reddit_data, 'lxml')
for item in soup.select(".first_div a[href]"):
    print(item['href'])

How to extract anchor elements nested in multiple division elements

I am trying to extract anchor elements from my Beautiful Soup object. They share a common class attribute, and each one is nested in multiple divs. The divs are repeated and separated by some scripts.
I have tried to use the common class attribute of the anchor elements to extract them.
The HTML I got:
<div id='container'>
<div class='nested'>
<a href='some url' class='link'>
</a>
</div>
</div>
#some scripts ....
<div id='container'>
<div class='nested'>
<a href='some url' class='link'>
</a>
</div>
</div>
What I tried:
import requests, bs4, webbrowser
webpage=requests.get('some url')
webpage.raise_for_status()
soup=bs4.BeautifulSoup(webpage.text)
links=soup.select('.link a')
for i in range(0,5):
    webbrowser.open('initial site url'+links[i].get('href'))
print(links)
No tabs were opened. Printing links gave an empty list.
Replace this line of your code:
links=soup.select('.link a')
with
links=soup.find_all('a',{'class':'link'})
print(links)
Output:
[<a class="link" href="some url">
</a>, <a class="link" href="some url">
</a>]
To get the href from each a tag:
for link in links:
    href = link['href']
    print(href)
.link a selects all a tags that are descendants of an element with class link. The space between them is the CSS descendant combinator, which means the left-hand side is an ancestor and the right-hand side is a descendant. Remove the space to apply both conditions to the same element. Note that you also need to extract the href attribute from the matched tags.
links = [item['href'] for item in soup.select('a.link')]
If you need to specify the parent div by class then it is
.nested a.link
or more simply
.nested .link
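The difference between the selectors can be checked on a small document. This is a sketch, assuming bs4 with the built-in html.parser; the markup is modelled on the question and the href values /page-1 and /page-2 are placeholders, not real URLs.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; the href values are placeholders.
html = """
<div id="container"><div class="nested"><a href="/page-1" class="link"></a></div></div>
<script></script>
<div id="container"><div class="nested"><a href="/page-2" class="link"></a></div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# ".link a": a tags located INSIDE an element with class "link" (none exist here).
descendant = soup.select(".link a")

# "a.link": a tags that themselves carry class "link".
same_element = [a["href"] for a in soup.select("a.link")]

# ".nested a.link": the same tags, restricted to those inside a .nested element.
scoped = [a["href"] for a in soup.select(".nested a.link")]

print(len(descendant), same_element, scoped)
```

The first selector comes back empty for the same reason as in the question: no a tag sits inside an element that has the class link.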

Div Class Text not saving

I am trying to collect prices for films on Vudu. However, when I try to pull data from the relevant div container, it returns as empty.
from requests import get
from bs4 import BeautifulSoup
url = "https://www.vudu.com/content/movies/details/title/835625"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
price_container = html_soup.find_all('div', class_ = 'row nr-p-0 nr-mb-10')
Result:
In [43]: price_container
Out[43]: []
As you can see here, the price information is contained in the div with the class I specified:
If you take a look at the page source, the <body> contains the following HTML:
<div id="loadingScreen">
<div class="loadingScreenViewport">
<div class="loadingScreenBody">
<div id="loadingIconClock">
<div class="loadingIconBox">
<div></div><div></div>
<div></div><div></div>
</div>
</div>
</div>
</div>
</div>
Everything else is <script> tags (JavaScript). This website is heavily driven by JavaScript; that is, all the other content is added dynamically.
As you can see, there is no div tag with class="row nr-p-0 nr-mb-10" in the page source (which is what requests.get(...) returns). This is why price_container is an empty list.
You need to use another tool, such as Selenium, to scrape this page.
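One way to confirm this kind of diagnosis is to run the same search against the static source and against the rendered DOM. The sketch below assumes bs4 with html.parser; static_html is a trimmed copy of the loading-screen markup, while rendered_html and the helper count_price_divs are hypothetical stand-ins for what a browser would produce, not content taken from the real page.

```python
from bs4 import BeautifulSoup

target = "row nr-p-0 nr-mb-10"

# Trimmed version of the static page source: only the loading screen.
static_html = '<div id="loadingScreen"><div class="loadingScreenViewport"></div></div>'

# Hypothetical markup standing in for the DOM after JavaScript has run.
rendered_html = static_html + '<div class="row nr-p-0 nr-mb-10">price placeholder</div>'

def count_price_divs(html):
    """Count divs whose class attribute is exactly the target string."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_=target))

static_count = count_price_divs(static_html)
rendered_count = count_price_divs(rendered_html)
print(static_count, rendered_count)
```

If the count is zero for response.text but the element is visible in the browser's inspector, the content is most likely injected by JavaScript.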
Thanks for the tip to use Selenium. I was able to get the price information with the following code.
browser.get("https://www.vudu.com/content/movies/details/title/835625")
price_element = browser.find_elements_by_xpath("//div[@class='row nr-p-0 nr-mb-10']")
prices = [x.text for x in price_element]

Get all links from DOM except from a certain div tag selenium python

How can I get all links in the DOM except those inside a certain div tag?
This is the div I don't want links from:
<div id="yii-debug-toolbar">
<div class="yii-debug-toolbar_bar">
<div class="yii-debug-toolbar_block">
<a>...</a>
</div>
<div class="yii-debug-toolbar_block">
<a>...</a>
</div>
</div>
</div>
I get the links in my code like this:
links = driver.find_elements_by_xpath("//a[@href]")
But I don't want to get the ones from that div, how can I do that?
I'm not sure if there is a simple way to do this with just Selenium's XPath capabilities. However, a simple solution could be to parse the HTML with something like BeautifulSoup, get rid of all the <div id="yii-debug-toolbar">...</div> elements, and then select the remaining links.
from bs4 import BeautifulSoup
...
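soup = BeautifulSoup(wd.page_source, 'html.parser')  # wd is the Selenium WebDriver instance
# Remove the debug toolbar, and every link inside it, from the parsed tree.
for div in soup.find_all("div", {'id':'yii-debug-toolbar'}):
    div.decompose()
# Only links outside the toolbar remain.
links = soup.find_all('a', href=True)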
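The same idea can be tried without a browser. This is a sketch on a hypothetical static page, assuming bs4 with html.parser; the href values and the surrounding links are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two ordinary links plus the debug toolbar from the question.
html = """
<a href="/home">Home</a>
<div id="yii-debug-toolbar">
  <div class="yii-debug-toolbar_bar">
    <div class="yii-debug-toolbar_block"><a href="/debug/1">...</a></div>
    <div class="yii-debug-toolbar_block"><a href="/debug/2">...</a></div>
  </div>
</div>
<p><a href="/about">About</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# decompose() removes the toolbar and everything nested inside it.
for div in soup.find_all("div", {"id": "yii-debug-toolbar"}):
    div.decompose()

# Collect the href of every link that is still in the tree.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

XPath 1.0 also has an ancestor axis, so an expression such as //a[@href][not(ancestor::div[@id='yii-debug-toolbar'])] may do the same filtering directly in Selenium; that variant is not tested here.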

Scrapy scrape content having same class name

I am using Scrapy to crawl and scrape data from a particular website. The crawler works fine, but I'm having an issue when scraping content from divs that share the same class name. For example:
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
I want to retrieve only "this is the 1st div". The code I've used is:
desc = hxs.select('//div[@class = "same_name"]/text()').extract()
But it returns all the contents. Any help would be appreciated!
OK, this one worked for me:
print desc[0]
It returned "this is the 1st div", which is what I wanted.
You can use BeautifulSoup. It's a great HTML parser.
from BeautifulSoup import BeautifulSoup
html = """
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
"""
soup = BeautifulSoup(html)
print soup.text
That should do the work.
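To get only the first div rather than the text of all three, find() can be used in place of printing soup.text. This is a sketch that assumes the newer bs4 package (rather than the BeautifulSoup 3 import above) and the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = """
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match, unlike find_all(), which returns every match.
first = soup.find("div", class_="same_name")

# strip=True removes the newlines around the text.
first_text = first.get_text(strip=True)
print(first_text)
```

In XPath, a comparable expression would likely be (//div[@class="same_name"])[1]/text(), which selects the text of the first matching div in document order.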
With XPath you get all the divs with the same class; you can then loop over them to get the result (for Scrapy):
divs = response.xpath('//div[@class="full class name"]')
for div in divs:
    if div.css("div.class"):
        pass  # handle the matching div here
