Missing parts in Beautiful Soup results

I'm trying to retrieve the table inside the ul tag in the following HTML, which looks roughly like this:
<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
but just with more entries. It's from this website.
So far I have this (for specifically getting the win rates):
from bs4 import BeautifulSoup
import requests
r = requests.get("https://moba.163.com/m/wx/ss/")
soup = BeautifulSoup(r.content, 'html5lib')
win_rates = soup.find_all('div', class_ = "winrate")
But this returns an empty list. It seems the farthest Beautiful Soup got was the ul tag, with none of the content under it. Is this a parsing issue, or is the content generated by JavaScript that I'm missing?

I think the issue is the way you are selecting the div by its class attribute. I was able to pull the winrate div with this:
soup.find('div',attrs={'class':'winrate'})
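For what it's worth, class_="winrate" and attrs={'class': 'winrate'} are two spellings of the same filter in bs4, so switching between them should not change the result. A minimal sketch on a static copy of the snippet from the question (the html string below is pasted in, not fetched from the site) shows both finding the same tag. If both come back empty on the live page, the rows are probably added by JavaScript after load, which requests does not run.

```python
from bs4 import BeautifulSoup

# Static copy of the snippet from the question (not fetched from the live site).
html = """
<ul class='list' id='js_list'>
  <li class="first">
    <div class="rates">
      <div class="winrate">56.11%</div>
      <div class="pickrate">7.44%</div>
    </div>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("div", class_="winrate")
by_attrs = soup.find_all("div", attrs={"class": "winrate"})

# Both filters match the same elements.
print([d.get_text() for d in by_keyword])
print(by_keyword == by_attrs)
```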

Related

how to find second div from html in python beautifulsoup

I'm trying to find the second div (container) with BeautifulSoup, but it shows nothing.
<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div> <!-- this is the div I am trying to select -->
My code shows nothing in the terminal.
header = soup.find_all('div', attrs={'class': 'container'})[1]
for text in header.find_all("p"):
    print(text)
driver.close()

Your code first finds all the container divs and picks the second one, which is the one you are trying to select. You are then searching for <p> tags within it. Your example HTML, though, does not contain any.
The HTML would need to contain <p> tags for it to find anything, for example:
from bs4 import BeautifulSoup
html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""
soup = BeautifulSoup(html, 'html.parser')
div_2 = soup.find_all('div', attrs={'class': 'container'})[1]
for p in div_2.find_all("p"):
    print(p.text) # Display the text inside any p tag
This would display:
Hello 1
Hello 2
If you print(div_2) you would see that it contains:
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>
If you are trying to display any text inside div_2 you could try:
print(div_2.text)
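To see why the original code printed nothing, here is a small sketch run against the HTML exactly as posted in the question. It also guards the [1] index, on the assumption that the real page might contain fewer containers than expected. If the real page does show paragraphs in the browser but not here, they may be inserted by JavaScript (the driver.close() call suggests Selenium was in use), in which case the HTML handed to BeautifulSoup needs to come from the rendered page.

```python
from bs4 import BeautifulSoup

# The HTML from the question: the second container is empty.
html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div>"""

soup = BeautifulSoup(html, "html.parser")
containers = soup.find_all("div", class_="container")

# Guard against pages that have fewer containers than expected.
if len(containers) > 1:
    second = containers[1]
    paragraphs = second.find_all("p")
    print(len(paragraphs))                    # number of <p> tags found
    print(repr(second.get_text(strip=True)))  # text content, if any
else:
    print("second container not found")
```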

how to get attribute data using python beautiful soup

Hi, I am trying to use Python and Beautiful Soup to scrape data from IMDb. Following the documentation online, I am able to retrieve all the data using this code:
from requests import get
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_ = 'image')
print(movie_containers)
With the above code I am able to retrieve a list of all the divs with the class image, as shown below:
<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>
But I am trying to get the values of the data-const attribute. I want to display just those values instead of the whole HTML result. Expected result: tt1486497, tt1485650

Instead, search by the class name of the div that carries the data-const attribute.
from bs4 import BeautifulSoup
html = """<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>"""
soup = BeautifulSoup(html, "lxml")
for div in soup.find_all("div", attrs={"class":"hover-over-image zero-z-index"}):
    print(div["data-const"])
Output:
tt1486497
tt1485650
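Matching on the full class string works as long as the classes appear in exactly that order. A slightly more tolerant variation, sketched below on a trimmed copy of the markup from the question, selects any div that carries a data-const attribute using a CSS attribute selector. This assumes a bs4 version with CSS selector support via select().

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<div class="image">
  <a href="/title/tt1486497/" title="Pilot">
    <div class="hover-over-image zero-z-index" data-const="tt1486497"><div>S1, Ep1</div></div>
  </a>
</div>
<div class="image">
  <a href="/title/tt1485650/" title="The Night of the Comet">
    <div class="hover-over-image zero-z-index" data-const="tt1485650"><div>S1, Ep2</div></div>
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [data-const] matches any div that carries the attribute, whatever its classes.
ids = [div["data-const"] for div in soup.select("div.image div[data-const]")]
print(", ".join(ids))
```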

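Try something along the lines of:
# movie_containers is a list of tags (the result of find_all), so call select() on the soup itself
for dc in html_soup.select('div.image div.hover-over-image'):
    print(dc['data-const'])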
output:
tt1486497
tt1485650

I recommend using requests-html. I find it more intuitive than Beautiful Soup on its own.
Example:
from requests_html import HTMLSession
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
session = HTMLSession()
response = session.get(url)
html = response.html
# find() takes a CSS selector and returns a list; first=True returns a single element
imageContainers = html.find("div.image")
# data-const sits on the inner div, not on the <a> tag
dataConsts = [c.find("div.hover-over-image", first=True).attrs["data-const"] for c in imageContainers]
This should do what you need, but I couldn't test it.
Good luck!

Delete block in HTML based on text

I have the HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I can do this with decompose() from BeautifulSoup using the class name sample, but I cannot rely on that because the blocks have different attributes and tag names; only the text inside follows the same pattern. Is there anything in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>

To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
import re
from bs4 import BeautifulSoup
html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
sample = soup.find(text=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())
Side note: notice that I fixed the closing tag for your "container" div!
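Since the question mentions blocks with different tag names and attributes, here is a rough sketch of one way to generalize the idea: test each direct child of the container against its combined text and remove the ones that match. The extra p and section blocks are made up for illustration, and the pattern Name:\s*John is an assumption about what the text looks like once the pieces are joined.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical input: blocks with different tags and attributes, same text pattern.
html = """<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<p id="other">
Name:<br>
<b>Jane</b>
</p>
<section>
Name:<br>
<b>John</b>
</section>
</div>"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find(id="container")
pattern = re.compile(r"Name:\s*John")

# Look only at direct children so each block is tested as a whole.
for block in container.find_all(True, recursive=False):
    # Join the text pieces with spaces: "Name:" + "John" -> "Name: John"
    if pattern.search(block.get_text(" ", strip=True)):
        block.decompose()

print(soup.prettify())
```

Only the block containing Jane should remain inside the container. Matching on the block's combined text rather than on a single string avoids the problem that "Name:" and "John" live in separate text nodes.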

Div Class Text not saving

I am trying to collect prices for films on Vudu. However, when I try to pull data from the relevant div container, it returns as empty.
from bs4 import BeautifulSoup
from requests import get
url = "https://www.vudu.com/content/movies/details/title/835625"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
price_container = html_soup.find_all('div', class_ = 'row nr-p-0 nr-mb-10')
Result:
In [43]: price_container
Out[43]: []
As you can see here, the price information is contained in the div class I specified:

If you take a look at the page source, the <body> contains the following HTML:
<div id="loadingScreen">
<div class="loadingScreenViewport">
<div class="loadingScreenBody">
<div id="loadingIconClock">
<div class="loadingIconBox">
<div></div><div></div>
<div></div><div></div>
</div>
</div>
</div>
</div>
</div>
Everything else is <script> tags (JavaScript). This website is heavily driven by JavaScript; that is, all the other content is added dynamically after the page loads.
As you can see, there is no div tag with class="row nr-p-0 nr-mb-10" in the page source (which is what requests.get(...) returns). This is why price_container is an empty list.
You need a tool that runs the page's JavaScript, such as Selenium, to scrape this page.
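A quick way to confirm this kind of diagnosis before reaching for a browser is to check whether the raw HTML contains the tag at all. The sketch below uses a hypothetical helper, has_static_match, and a trimmed stand-in for response.text modeled on the loading-screen markup above, so it runs without touching the network.

```python
from bs4 import BeautifulSoup

def has_static_match(html, tag, css_class):
    """Return True if the raw HTML already contains a matching tag."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(tag, class_=css_class) is not None

# Stand-in for response.text: a trimmed copy of the loading-screen markup above.
page_source = """<body>
<div id="loadingScreen">
<div class="loadingScreenViewport"></div>
</div>
<script>/* content is built here at runtime */</script>
</body>"""

# The price row is absent from the static source; the loading screen is present.
print(has_static_match(page_source, "div", "row nr-p-0 nr-mb-10"))
print(has_static_match(page_source, "div", "loadingScreenViewport"))
```

If the first check is False for the real response as well, no choice of parser or filter syntax will help, because the element only exists after the scripts have run.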

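Thanks for the tip to use Selenium. I was able to get the price information with the following code.
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()  # assumes Chrome and a matching driver are installed
browser.get("https://www.vudu.com/content/movies/details/title/835625")
price_element = browser.find_elements(By.XPATH, "//div[@class='row nr-p-0 nr-mb-10']")
prices = [x.text for x in price_element]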

BS4 Searching by Class_ Returning Empty

I currently am successfully scraping the data I need by chaining bs4 .contents together following a find_all('div'), but that seems inherently fragile. I'd like to go directly to the tag I need by class, but my "class_=" search is returning None.
I ran the following code on the html below, which returns None:
soup = BeautifulSoup(text) # this works fine
tag = soup.find(class_ = "loan-section-content") # this returns None
Also tried soup.find('div', class_ = "loan-section-content") - also returns None.
My html is:
<div class="loan-section">
<div class="loan-section-title">
<span class="text-light"> Some Text </span>
</div>
<div class="loan-section-content">
<div class="row">
<div class="col-sm-6">
<strong>More text</strong>
<br/>
<strong>
Dakar, Senegal
</strong>

Try this:
soup.find(attrs={'class':'loan-section-content'})
or
soup.find('div','loan-section-content')
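Both forms filter on the class attribute: attrs takes a dictionary of attribute names and values, and a string passed as the second positional argument is treated as a CSS class.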
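A small sketch against the HTML from the question, cut off where the question cuts it off, shows that all three spellings find the same tag on a current bs4. The class_ keyword was only added in Beautiful Soup 4.1.2, so an older install is one possible explanation for the None in the question; that is a guess, since the question does not state the version.

```python
from bs4 import BeautifulSoup

# Snippet from the question, truncated as in the question;
# html.parser tolerates the unclosed tags.
text = """<div class="loan-section">
<div class="loan-section-title">
<span class="text-light"> Some Text </span>
</div>
<div class="loan-section-content">
<div class="row">
<div class="col-sm-6">
<strong>More text</strong>
<br/>
<strong>
Dakar, Senegal
</strong>"""

soup = BeautifulSoup(text, "html.parser")

by_attrs = soup.find(attrs={"class": "loan-section-content"})
by_position = soup.find("div", "loan-section-content")
by_keyword = soup.find(class_="loan-section-content")

# All three lookups return the very same Tag object.
print(by_attrs is by_position is by_keyword)
# First <strong> inside the matched div.
print(by_attrs.strong.get_text(strip=True))
```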
