Missing parts in Beautiful Soup results - python

I'm trying to retrieve the table in the ul tag in the following html code, which mostly looks like this:
<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
but just with more entries. It's from this website.
So far I have this (for specifically getting the win rates):
from bs4 import BeautifulSoup
import requests
r = requests.get("https://moba.163.com/m/wx/ss/")
soup = BeautifulSoup(r.content, 'html5lib')
win_rates = soup.find_all('div', class_ = "winrate")
But this returns empty and it seems like the farthest Beautiful Soup was able to get was the ul tag, but none of the information under it. Is this a parsing issue? Or is there JavaScript source code that I'm missing?

I think your issue is that your format is incorrect for pulling the div with the attribute. I was able to pull the winrate div with this:
soup.find('div',attrs={'class':'winrate'})
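For what it's worth, class_= and attrs={'class': ...} are two spellings of the same filter in Beautiful Soup, so a small check like the one below may help tell a lookup problem from markup that simply is not in the response. This is only a sketch: rendered is the snippet from the question, and shell is an assumed stand-in for a list that JavaScript fills in later, not the actual server response.

```python
from bs4 import BeautifulSoup

# Markup from the question (one entry only).
rendered = """
<ul class='list' id='js_list'>
<li class="first">
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
"""

# Assumption: an empty list of the kind a script might populate later.
shell = "<ul class='list' id='js_list'></ul>"


def win_rates(html, use_attrs):
    """Return the win-rate strings found in html, using either filter spelling."""
    soup = BeautifulSoup(html, "html.parser")
    if use_attrs:
        tags = soup.find_all("div", attrs={"class": "winrate"})
    else:
        tags = soup.find_all("div", class_="winrate")
    return [tag.get_text(strip=True) for tag in tags]


print(win_rates(rendered, use_attrs=False))  # ['56.11%']
print(win_rates(rendered, use_attrs=True))   # ['56.11%']
print(win_rates(shell, use_attrs=False))     # []
print(win_rates(shell, use_attrs=True))      # []
```

If both spellings come back empty on the downloaded page, printing the ul tag itself should show whether the li entries are present in the response at all.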

Related

how to find second div from html in python beautifulsoup

I'm trying to find the second div (container) with BeautifulSoup, but it shows nothing.
<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div> <!-- this is the div I am trying to select -->
My code prints nothing in the terminal.
header = soup.find_all('div', attrs={'class': 'container'})[1]
for text in header.find_all("p"):
    print(text)
driver.close()
Your code first finds all the container divs and picks the second one, which is the one you are trying to select. You are then searching for <p> tags within it. Your example HTML, though, does not contain any.
The HTML would need to contain <p> tags for it to find anything, for example:
from bs4 import BeautifulSoup
html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""
soup = BeautifulSoup(html, 'html.parser')
div_2 = soup.find_all('div', attrs={'class': 'container'})[1]
for p in div_2.find_all("p"):
    print(p.text)  # Display the text inside any p tag
This would display:
Hello 1
Hello 2
If you print(div_2) you would see that it contains:
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>
If you are trying to display any text inside div_2 you could try:
print(div_2.text)

how to get attribute data using python beautiful soup

Hi, I am trying to use Python and Beautiful Soup to scrape data from IMDb. I have followed the documentation online and am able to retrieve all the data using this code:
from requests import get
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_ = 'image')
print(movie_containers)
With the above code I am able to retrieve a list of all the data in the divs with class image, as shown below:
<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>
But what I am trying to get is the value of the data-const attribute. I want to display just the values of data-const instead of the whole HTML result. Expected result: tt1486497, tt1485650
Instead use the class name that div is using.
from bs4 import BeautifulSoup
html = """<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>"""
soup = BeautifulSoup(html, "lxml")
for div in soup.find_all("div", attrs={"class":"hover-over-image zero-z-index"}):
    print(div["data-const"])
Output:
tt1486497
tt1485650
Try something along the lines of:
# select() is called on the soup itself; the list returned by find_all() has no select()
for dc in html_soup.select('div.hover-over-image'):
    print(dc['data-const'])
output:
tt1486497
tt1485650
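Another option, if the class names are not stable, might be to select on the attribute itself. A minimal sketch, assuming a trimmed copy of the markup from the question and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (img tags omitted for brevity).
html = """<div class="image">
<a href="/title/tt1486497/" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<div>S1, Ep2</div>
</div>
</a> </div>"""

soup = BeautifulSoup(html, "html.parser")

# The CSS attribute selector matches every div that carries data-const,
# whatever its class happens to be.
ids = [div["data-const"] for div in soup.select("div[data-const]")]
print(", ".join(ids))  # tt1486497, tt1485650
```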
I recommend using requests-html. It's more intuitive than just using beautiful soup.
Example:
from requests_html import HTMLSession
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
session = HTMLSession()
response = session.get(url)
html = response.html
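# requests-html exposes find() with a CSS selector; data-const sits on the inner div, not on the <a>
imageContainers = html.find("div.image")
dataConsts = list(map(lambda x: x.find("div.hover-over-image", first=True).attrs["data-const"], imageContainers))
This should do exactly what you need, but I couldn't test it.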
Good luck!

Delete block in HTML based on text

I have the HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I can do this with decompose() from BeautifulSoup using the class name sample, but I cannot apply decompose() that way because the blocks have different attributes and tag names, while the text within follows the same pattern. Is there anything in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>
To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
import re
from bs4 import BeautifulSoup
html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
sample = soup.find(text=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())
Side note: notice that I fixed the closing tag for your "container" div!
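Since the question mentions blocks with different tag names and attributes, here is a rough sketch of the same idea that matches on the whole block text rather than on a class. The section block with Mary and the helper name remove_block are made up for illustration; only the div with John comes from the question.

```python
import re
from bs4 import BeautifulSoup

html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<section class="other">
Name:<br>
<i>Mary</i>
</section>
</div>'''


def remove_block(soup, label, value):
    """Decompose each block whose whitespace-normalised text is '<label> <value>'."""
    removed = 0
    for node in soup.find_all(string=re.compile(re.escape(label))):
        # A node inside an already decomposed block may have lost its parent.
        block = getattr(node, "parent", None)
        if block is None:
            continue
        text = " ".join(block.get_text(" ").split())
        if text == f"{label} {value}":
            block.decompose()
            removed += 1
    return removed


soup = BeautifulSoup(html, "html.parser")
removed = remove_block(soup, "Name:", "John")
print(removed)  # 1
print(soup.prettify())
```

The tag name and class never enter the comparison, so the John block is dropped and the Mary block stays in place.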

Div Class Text not saving

I am trying to collect prices for films on Vudu. However, when I try to pull data from the relevant div container, it returns as empty.
from requests import get
from bs4 import BeautifulSoup
url = "https://www.vudu.com/content/movies/details/title/835625"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
price_container = html_soup.find_all('div', class_ = 'row nr-p-0 nr-mb-10')
Result:
In [43]: price_container
Out[43]: []
As you can see here, the price information is contained in the div class I specified:
If you take a look at the page source, the <body> contains the following HTML:
<div id="loadingScreen">
<div class="loadingScreenViewport">
<div class="loadingScreenBody">
<div id="loadingIconClock">
<div class="loadingIconBox">
<div></div><div></div>
<div></div><div></div>
</div>
</div>
</div>
</div>
</div>
Everything else is <script> tags (JavaScript). This website is heavily driven by JavaScript. That is, all the other content is added dynamically.
As you can see, there is no div tag with class="row nr-p-0 nr-mb-10" in the page source (which is what requests.get(...) returns). This is why price_container is an empty list.
You need to use other tools like Selenium to scrape this page.
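Before reaching for a browser, a quick check along these lines may confirm that the element really is absent from the static HTML. It is only a sketch: page_source is a shortened copy of the loading-screen markup quoted above, the app.js script name is a placeholder, and is_in_static_html is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for what requests.get(...) returns for this page.
page_source = """<body>
<div id="loadingScreen">
<div class="loadingScreenViewport">
<div class="loadingScreenBody"></div>
</div>
</div>
<script src="app.js"></script>
</body>"""


def is_in_static_html(html, selector):
    """Return True if the CSS selector matches anything in the raw HTML."""
    return bool(BeautifulSoup(html, "html.parser").select(selector))


print(is_in_static_html(page_source, "div.row.nr-p-0.nr-mb-10"))  # False
print(is_in_static_html(page_source, "#loadingScreen"))           # True
```

A False for the target selector next to a True for the loading screen points to content that is rendered by scripts after the page loads.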
Thanks for the tip to use Selenium. I was able to get the price information with the following code.
browser.get("https://www.vudu.com/content/movies/details/title/835625")
price_element = browser.find_elements_by_xpath("//div[@class='row nr-p-0 nr-mb-10']")
prices = [x.text for x in price_element]

BS4 Searching by Class_ Returning Empty

I currently am successfully scraping the data I need by chaining bs4 .contents together following a find_all('div'), but that seems inherently fragile. I'd like to go directly to the tag I need by class, but my "class_=" search is returning None.
I ran the following code on the html below, which returns None:
soup = BeautifulSoup(text) # this works fine
tag = soup.find(class_ = "loan-section-content") # this returns None
Also tried soup.find('div', class_ = "loan-section-content") - also returns None.
My html is:
<div class="loan-section">
<div class="loan-section-title">
<span class="text-light"> Some Text </span>
</div>
<div class="loan-section-content">
<div class="row">
<div class="col-sm-6">
<strong>More text</strong>
<br/>
<strong>
Dakar, Senegal
</strong>
Try this:
soup.find(attrs={'class':'loan-section-content'})
or
soup.find('div','loan-section-content')
The attrs argument searches on attributes.
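As a rough check, the different spellings can be compared side by side on the question's snippet. Two assumptions here: the closing tags at the end were added so the fragment is complete (the HTML in the question is cut off), and the parser is named explicitly as html.parser.

```python
from bs4 import BeautifulSoup

# Snippet from the question; the trailing closing tags are an assumption.
text = """<div class="loan-section">
<div class="loan-section-title">
<span class="text-light"> Some Text </span>
</div>
<div class="loan-section-content">
<div class="row">
<div class="col-sm-6">
<strong>More text</strong>
<br/>
<strong>
Dakar, Senegal
</strong>
</div>
</div>
</div>
</div>"""

soup = BeautifulSoup(text, "html.parser")

by_keyword = soup.find(class_="loan-section-content")
by_attrs = soup.find(attrs={"class": "loan-section-content"})
by_positional = soup.find("div", "loan-section-content")
by_css = soup.select_one("div.loan-section-content")

# All four lookups should land on the same Tag object.
print(by_keyword is by_attrs is by_positional is by_css)  # True
print([s.get_text(strip=True) for s in by_css.find_all("strong")])
# ['More text', 'Dakar, Senegal']
```

If all of these return None on the real page, the tag is probably not in the text being parsed, and printing soup.prettify() may show what the parser actually received.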