I am trying to collect prices for films on Vudu. However, when I try to pull data from the relevant div container, it returns as empty.
from bs4 import BeautifulSoup
from requests import get  # needed for get(url) below
url = "https://www.vudu.com/content/movies/details/title/835625"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
price_container = html_soup.find_all('div', class_ = 'row nr-p-0 nr-mb-10')
Result:
In [43]: price_container
Out[43]: []
As you can see here, the price information is contained in the div class I specified:
If you take a look at the page source, the <body> contains the following HTML:
<div id="loadingScreen">
<div class="loadingScreenViewport">
<div class="loadingScreenBody">
<div id="loadingIconClock">
<div class="loadingIconBox">
<div></div><div></div>
<div></div><div></div>
</div>
</div>
</div>
</div>
</div>
Everything else is <script> tags (JavaScript). This website is heavily driven by JavaScript; all the other content is added dynamically.
As you can see, there is no div tag with class="row nr-p-0 nr-mb-10" in the page source (which is what requests.get(...) returns). This is why price_container is an empty list.
You need to use other tools like Selenium to scrape this page.
Thanks for the tip to use Selenium. I was able to get the price information with the following code.
from selenium import webdriver

browser = webdriver.Firefox()  # any WebDriver works here
browser.get("https://www.vudu.com/content/movies/details/title/835625")
price_element = browser.find_elements_by_xpath("//div[@class='row nr-p-0 nr-mb-10']")
prices = [x.text for x in price_element]
I'm trying to retrieve the table in the ul tag in the following html code, which mostly looks like this:
<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
but just with more entries. It's from this website.
So far I have this (for specifically getting the win rates):
from bs4 import BeautifulSoup
import requests
r = requests.get("https://moba.163.com/m/wx/ss/")
soup = BeautifulSoup(r.content, 'html5lib')
win_rates = soup.find_all('div', class_ = "winrate")
But this returns an empty list, and it seems the farthest Beautiful Soup was able to get was the ul tag, with none of the information under it. Is this a parsing issue, or is there JavaScript source code that I'm missing?
I think the issue is the format you are using to pull the div by its attribute. I was able to pull the winrate div with this:
soup.find('div',attrs={'class':'winrate'})
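For what it's worth, both call styles behave the same against a static copy of the markup from the question, so the selector itself can be sanity-checked locally (a minimal sketch using the question's snippet):

```python
from bs4 import BeautifulSoup

# Static copy of the fragment from the question.
html = """<ul class='list' id='js_list'>
  <li class="first">
    <div class="rates">
      <div class="winrate">56.11%</div>
      <div class="pickrate">7.44%</div>
    </div>
  </li>
</ul>"""

soup = BeautifulSoup(html, 'html.parser')

# The class_ keyword and the attrs dict find the same element here.
by_keyword = soup.find_all('div', class_='winrate')
by_attrs = soup.find_all('div', attrs={'class': 'winrate'})
print([d.text for d in by_keyword])  # ['56.11%']
```

If a selector works on the static snippet but not on the live response, the content is most likely rendered by JavaScript, as in the Vudu question above.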
I'm trying to find a second div (container) with BeautifulSoup, but it shows nothing.
<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div> <!-- this is the div I try to select -->
My code shows nothing in the terminal.
header = soup.find_all('div', attrs={'class': 'container'})[1]
for text in header.find_all("p"):
    print(text)
driver.close()
Your code first finds all the container divs and picks the second one, which is what you are trying to select. You are then searching for <p> tags within it. Your example HTML, though, does not contain any.
The HTML would need to contain <p> tags for it to find anything, for example:
from bs4 import BeautifulSoup
html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""
soup = BeautifulSoup(html, 'html.parser')
div_2 = soup.find_all('div', attrs={'class': 'container'})[1]
for p in div_2.find_all("p"):
    print(p.text)  # Display the text inside any p tag
This would display:
Hello 1
Hello 2
If you print(div_2) you would see that it contains:
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>
If you are trying to display any text inside div_2 you could try:
print(div_2.text)
I want to extract the content of p tags from a webpage. The way it's structured is like this
<div property="pas:description">
<p>content</p>
<p>content</p>
</div>
I don't just want to use getText() because there's other content on the page I don't want. I've looked through the documentation, but I'm still not sure how to get the content from the p tags here.
EDIT: I don't want to get all content from p tags, as there's other content in p tags on this page. I specifically only want to get the content that's in a div with the property 'pas:description'
You can use
soup.find('div', {'property': "pas:description"})
to find the div with that property, and then search for the p tags inside it:
from bs4 import BeautifulSoup as BS
text = '''<p>without div 1</p>
<div property="pas:description">
<p>content 1</p>
<p>content 2</p>
</div>
<div>
<p>content in div without property </p>
</div>
<p>without div 2</p>'''
soup = BS(text, 'html.parser')
div = soup.find('div', {'property': "pas:description"})
for p in div.find_all('p'):
    print(p.string)
Result
content 1
content 2
Below is code for extracting "content"
from bs4 import BeautifulSoup
test_html= '''
<div property="pas:description">
<p>content</p>
<p>content</p>
</div>
'''
soup4 = BeautifulSoup(test_html, 'html.parser')
print(soup4.find('div').p.text)
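Note that the .p shortcut descends to the first <p> only; if both paragraphs are wanted, find_all on the same div returns every match (same test_html as above):

```python
from bs4 import BeautifulSoup

test_html = '''
<div property="pas:description">
<p>content</p>
<p>content</p>
</div>
'''

soup4 = BeautifulSoup(test_html, 'html.parser')
# .p stops at the first <p>; find_all collects all of them.
texts = [p.text for p in soup4.find('div').find_all('p')]
print(texts)  # ['content', 'content']
```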
I'm using Python 3, and what I want to do is analyze an HTML page and extract some information from a specific tag.
This operation must be done multiple times. To parse the HTML page I'm using the beautifulsoup module, and I can get the HTML code correctly this way:
import urllib.request as req
import bs4
url = 'http://myurl.com'
reqq = req.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
reddit_file = req.urlopen(reqq)
reddit_data = reddit_file.read().decode('utf-8')
soup = bs4.BeautifulSoup(reddit_data, 'lxml')
My HTML structure is the following:
<div class="first_div" id="12345">
<div class="second_div">
<div class="third_div">
<div class="fourth_div">
<div class="fifth_div">
<a id="dealImage" class="checked_div" href="http://www.myurl.com/">
What I want to extract is the href value, i.e. http://www.myurl.com/
I tried using the find() function this way, and it works:
div = soup.find("div", {"class" : "first_div"})
But if I try to find the second div directly:
div = soup.find("div", {"class" : "second_div"})
it returns an empty value.
Thanks
EDIT:
the source html page is the following:
view-source:https://www.amazon.it/gp/goldbox/ref=gbps_ftr_s-5_2d1d_page_1?gb_f_deals1=dealTypes:LIGHTNING_DEAL%252CBEST_DEAL%252CDEAL_OF_THE_DAY,sortOrder:BY_SCORE&pf_rd_p=82dc915a-4dd2-4943-b59f-dbdbc6482d1d&pf_rd_s=slot-5&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A11IL2PNWYJU7H&pf_rd_r=5Q5APCV900GSWS51A6QJ&ie=UTF8
What I have to extract is the href value from the div with class a-row dealContainer dealTile.
find returns only the first Tag matching the given criteria, while findAll extracts a list of all Tag objects that match. You can specify the name of the Tag and any attributes you want the Tag to have.
Here, if you want to extract every href, you need to use a for loop:
divs = soup.findAll("div", {"class": "first_div"})
for item in divs:
    print(item.a.get('href'))
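The difference between the two calls is easy to see on a reduced, hypothetical version of the question's structure with two links:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one first_div containing two links,
# so the contrast between find and find_all is visible.
html = '''<div class="first_div" id="12345">
  <a class="checked_div" href="http://www.myurl.com/first">first</a>
  <a class="checked_div" href="http://www.myurl.com/second">second</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')       # only the first matching Tag
every = soup.find_all('a')   # a list of every matching Tag

print(first['href'])                # http://www.myurl.com/first
print([a['href'] for a in every])   # both URLs
```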
Use a CSS selector, which is much faster.
from bs4 import BeautifulSoup
reddit_data='''<div class="first_div" id="12345">
<div class="second_div">
<div class="third_div">
<div class="fourth_div">
<div class="fifth_div">
<a id="dealImage" class="checked_div" href="http://www.myurl.com/">
</div>
</div>
</div>
</div>
</div>'''
soup = BeautifulSoup(reddit_data, 'lxml')
for item in soup.select(".first_div a[href]"):
    print(item['href'])
How do I get all links in the DOM except those from a certain div tag?
This is the div I don't want links from:
<div id="yii-debug-toolbar">
<div class="yii-debug-toolbar_bar">
<div class="yii-debug-toolbar_block>
<a>...</a>
</div>
<div class="yii-debug-toolbar_block>
<a>...</a>
</div>
</div>
</div>
I get the links in my code like this:
links = driver.find_elements_by_xpath("//a[@href]")
But I don't want to get the ones from that div, how can I do that?
I'm not sure if there is a simple way to do this with just Selenium's XPath capabilities. However, a simple solution could be to parse the HTML with something like BeautifulSoup, remove all the <div id="yii-debug-toolbar">...</div> elements, and then select the remaining links.
from bs4 import BeautifulSoup
...
soup = BeautifulSoup(wd.page_source, 'html.parser')
for div in soup.find_all("div", {'id': 'yii-debug-toolbar'}):
    div.decompose()
links = soup.find_all('a', href=True)
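Put together as a self-contained sketch (with a hypothetical link outside the toolbar, so something survives the removal):

```python
from bs4 import BeautifulSoup

# Hypothetical page: one link inside the debug toolbar, one outside it.
html = '''<div id="yii-debug-toolbar">
  <div class="yii-debug-toolbar_block"><a href="/debug">debug</a></div>
</div>
<a href="/home">home</a>'''

soup = BeautifulSoup(html, 'html.parser')

# Drop every toolbar div (and the links inside it), then collect the rest.
for div in soup.find_all("div", {'id': 'yii-debug-toolbar'}):
    div.decompose()

links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)  # ['/home']
```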