BS4 Searching by Class_ Returning Empty - python

I currently am successfully scraping the data I need by chaining bs4 .contents together following a find_all('div'), but that seems inherently fragile. I'd like to go directly to the tag I need by class, but my "class_=" search is returning None.
I ran the following code on the html below, which returns None:
soup = BeautifulSoup(text) # this works fine
tag = soup.find(class_ = "loan-section-content") # this returns None
Also tried soup.find('div', class_ = "loan-section-content") - also returns None.
My html is:
<div class="loan-section">
<div class="loan-section-title">
<span class="text-light"> Some Text </span>
</div>
<div class="loan-section-content">
<div class="row">
<div class="col-sm-6">
<strong>More text</strong>
<br/>
<strong>
Dakar, Senegal
</strong>

Try this:
soup.find(attrs={'class':'loan-section-content'})
or
soup.find('div','loan-section-content')
The attrs argument matches on tag attributes, and a string passed as the second positional argument is treated as a CSS class.
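To compare the call styles side by side, here is a minimal sketch run against a trimmed copy of the HTML from the question (the closing tags are added here so the snippet is complete). It assumes Beautiful Soup 4.1.2 or later, where the class_ keyword is available; on such a version all three lookups should find the same tag.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, with closing tags added for the sketch.
html = """<div class="loan-section">
<div class="loan-section-title"><span class="text-light"> Some Text </span></div>
<div class="loan-section-content"><div class="row"><strong>More text</strong></div></div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Three equivalent ways to look a tag up by its CSS class.
by_attrs = soup.find(attrs={"class": "loan-section-content"})
by_positional = soup.find("div", "loan-section-content")
by_keyword = soup.find("div", class_="loan-section-content")

print(by_attrs is by_positional is by_keyword)
print(by_keyword.strong.get_text())
```

If class_ returns None while the attrs form works on the same soup, an older Beautiful Soup release is one possible explanation worth checking.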

Related

Missing parts in Beautiful Soup results

I'm trying to retrieve the table in the ul tag in the following html code, which mostly looks like this:
<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
but just with more entries. It's from this website.
So far I have this (for specifically getting the win rates):
from bs4 import BeautifulSoup
import requests
r = requests.get("https://moba.163.com/m/wx/ss/")
soup = BeautifulSoup(r.content, 'html5lib')
win_rates = soup.find_all('div', class_ = "winrate")
But this returns empty and it seems like the farthest Beautiful Soup was able to get was the ul tag, but none of the information under it. Is this a parsing issue? Or is there JavaScript source code that I'm missing?
I think your issue is that your format is incorrect for pulling the div with the attribute. I was able to pull the winrate div with this:
soup.find('div',attrs={'class':'winrate'})
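As a quick check, the sketch below parses the sample markup from the question (not the live page) and compares the class_ keyword with the attrs dictionary. The variable names and the list_has_items check are illustrative additions, and html.parser is used here only to avoid the extra html5lib dependency.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the live page is not fetched here.
sample = """<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>"""

soup = BeautifulSoup(sample, "html.parser")

# Both call styles should select the same div elements.
by_keyword = [d.get_text(strip=True) for d in soup.find_all("div", class_="winrate")]
by_attrs = [d.get_text(strip=True) for d in soup.find_all("div", attrs={"class": "winrate"})]

# Hypothetical sanity check: does the list contain any entries at all?
list_has_items = soup.find("ul", id="js_list").find("li") is not None

print(by_keyword, by_attrs, list_has_items)
```

If the same check on r.content reports no li entries under the ul, the entries are probably missing from the downloaded HTML itself, which would fit the JavaScript suspicion raised in the question.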

how to find second div from html in python beautifulsoup

I'm trying to find the second div (container) with BeautifulSoup, but it shows nothing.
<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div> <!-- the div I am trying to select -->
My code shows nothing in the terminal:
header = soup.find_all('div', attrs={'class': 'container'})[1]
for text in header.find_all("p"):
    print(text)
driver.close()
Your code first finds all the container divs and picks the second one, which is the one you are trying to select. It then searches for <p> tags within it. Your example HTML, though, does not contain any.
The HTML would need to contain <p> tags for it to find anything, for example:
from bs4 import BeautifulSoup
html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""
soup = BeautifulSoup(html, 'html.parser')
div_2 = soup.find_all('div', attrs={'class': 'container'})[1]
for p in div_2.find_all("p"):
    print(p.text) # Display the text inside any p tag
This would display:
Hello 1
Hello 2
If you print(div_2) you would see that it contains:
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>
If you are trying to display any text inside div_2 you could try:
print(div_2.text)
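A slightly more defensive variant is sketched below. It assumes the same example HTML and guards the [1] index, since indexing fails with IndexError when fewer than two container divs are present. The names containers, paragraphs and all_text are illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""

soup = BeautifulSoup(html, "html.parser")

# find_all searches recursively, so the nested container comes first
# and the top-level one is at index 1.
containers = soup.find_all("div", class_="container")

if len(containers) > 1:
    second = containers[1]
    paragraphs = [p.get_text(strip=True) for p in second.find_all("p")]
    all_text = second.get_text(" ", strip=True)  # all text, space separated
else:
    paragraphs, all_text = [], ""

print(paragraphs)
print(all_text)
```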

Delete block in HTML based on text

I have the HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I can do this with decompose() from BeautifulSoup using the class name sample, but I cannot rely on that because the blocks have different attributes and tag names; only the text inside follows the same pattern. Is there anything in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>
To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
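import re
from bs4 import BeautifulSoup
html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# Find the first text node containing 'Name' (newer Beautiful Soup releases spell this argument string=)
sample = soup.find(text=re.compile('Name'))
# Remove the tag that encloses the matched text
sample.parent.decompose()
print(soup.prettify())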
Side note: notice that I fixed the closing tag for your "container" div!
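Since the question mentions blocks with different tag names and attributes, here is a rough sketch that extends the same idea to several blocks. It assumes Beautiful Soup 4.4.0 or later (for the string= argument), and the section block and the Jane entry are made-up test data, not part of the original snippet.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two blocks with different tags and attributes
# that share the same "Name:" text pattern.
html = """<div id="container">
<div class="sample">Name:<br><b>John</b></div>
<section data-kind="other">Name:<br><b>Jane</b></section>
<p>Keep me</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Visit every text node matching the pattern, whatever tag encloses it.
for text_node in soup.find_all(string=re.compile(r"Name:")):
    block = text_node.parent
    # Only delete the block whose text also mentions John.
    if block is not None and "John" in block.get_text():
        block.decompose()

remaining = soup.get_text(" ", strip=True)
print(remaining)
```

Matching on the enclosing parent of the text node keeps the code independent of tag names and classes, at the cost of assuming the label sits directly inside the block to be removed.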

Scraping from Same-named Tags with Python / BeautifulSoup

I'm getting my toes wet with BeautifulSoup and am hung up on scraping some particular info. The HTML looks like the following, for example:
<div class="row">
::before
<div class="course-short clearfix">
::before
<div class="course-meta col-sm-12">
<dl>
<dt>Language:</dt>
<dd>English</dd>
<dt>Author:</dt>
<dd>John Doe</dd>
<dt>Institution:</dt>
<dd>American University</dd>
</dl>
</div>
...
<div class="row">
::before
<div class="course-short clearfix">
::before
<div class="course-meta col-sm-12">
<dl>
<dt>Language:</dt>
<dd>English</dd>
<dt>Author:</dt>
<dd>John Doe, Jr.</dd>
<dt>Institution:</dt>
<dd>Another University</dd>
</dl>
</div>
...
Each page has about 10 <div class="row"> tags, each with the same <dt> and <dd> pattern (e.g., Language, Author, Institution).
I am trying to scrape the <dd>American University</dd> info, ultimately to create a loop so that I can get that info specific to each <div class="row"> tag.
I've managed the following so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.oeconsortium.org/courses/search/?search=statistics")
bsObj = BeautifulSoup(html.read(), "html.parser")
institutions = [x.text.strip() for x in bsObj.find_all('div', 'course-meta col-sm-12', 'dd')]
But that only gives me the following mess for each respective <div class="row"> :
Language:\n\n\t\t\t\t\t\t\t\t\t\t\tEnglish\t\t\t\t\t\t\t\t\t\nAuthor:\nJohn Doe\nInstitution:\nAmerican University\n
Language:\n\n\t\t\t\t\t\t\t\t\t\t\tEnglish\t\t\t\t\t\t\t\t\t\nAuthor:\nJohn Doe, Jr.\nInstitution:\nAnother University\n
...
(I know how to .strip(); that's not the issue.)
I can't figure out how to target that third <dd></dd> for each respective <div class="row">. I feel like it may be possible by also targeting the <dt>Institution:</dt> tag (which is "Institution" in every respective case), but I can't figure it out.
Any help is appreciated. I am trying to make a LOOP so that I can loop over, say, ten <div class="row"> instances and just pull out the info specific to the "Institution" <dd> tag.
Thank you!
I can't figure out how to target that third <dd></dd> for each respective <div class="row">
find_all returns a list of all occurrences, so you could just take the third element of the result, although you may want to wrap the lookup in try/except to guard against an IndexError.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
I'd make a function out of this code to reuse it for different pages:
soup = BeautifulSoup(html.read(), "html.parser")
result = []
for meta in soup.find_all('div', {'class': 'course-meta'}):
    try:
        institution = meta.find_all('dd')[2].text.strip()
        result.append(institution) # or whatever you like to store it
    except IndexError:
        pass
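The question also suggests targeting the <dt>Institution:</dt> label rather than a fixed position. A possible sketch of that alternative is below; it assumes Beautiful Soup 4.4.0 or later for the string= argument, uses a shortened inline copy of the question's markup instead of the live page, and institutions_from is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (two rows instead of ten).
html = """<div class="row">
<div class="course-meta col-sm-12"><dl>
<dt>Language:</dt><dd>English</dd>
<dt>Author:</dt><dd>John Doe</dd>
<dt>Institution:</dt><dd>American University</dd>
</dl></div></div>
<div class="row">
<div class="course-meta col-sm-12"><dl>
<dt>Language:</dt><dd>English</dd>
<dt>Author:</dt><dd>John Doe, Jr.</dd>
<dt>Institution:</dt><dd>Another University</dd>
</dl></div></div>"""


def institutions_from(markup):
    """Return the dd text that follows each 'Institution:' dt label."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for meta in soup.find_all("div", class_="course-meta"):
        label = meta.find("dt", string="Institution:")
        value = label.find_next_sibling("dd") if label else None
        if value is not None:
            found.append(value.get_text(strip=True))
    return found


institutions = institutions_from(html)
print(institutions)
```

Looking the value up by its label keeps working if a row omits or reorders one of the other fields, which the positional [2] index would not.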

How to parse HTML to a string template in Python?

I want to parse HTML and turn it into string templates. In the example below, I sought out elements marked with x-inner, and they became template placeholders in the final string. The x-attrsite attribute also became a template placeholder (with a different command, of course).
Input:
<div class="x,y,z" x-attrsite>
<div x-inner></div>
<div>
<div x-inner></div>
</div>
</div>
Desired output:
<div class="x,y,z" {attrsite}>{inner}<div>{inner}</div></div>
I know there is HTMLParser and BeautifulSoup, but I am at a loss on how to extract the strings before and after the x-* markers and to escape those strings for templating.
Existing curly braces are handled sanely, like this sample:
<div x-maybe-highlighted> The template string "there are {n} message{suffix}" can be used.</div>
BeautifulSoup can handle the case:
find all div elements with x-attrsite attribute, remove the attribute and add {attrsite} attribute with a value None (produces an attribute with no value)
find all div elements with x-inner attribute and use replace_with() to replace the element with a text {inner}
Implementation:
from bs4 import BeautifulSoup
data = """
<div class="x,y,z" x-attrsite>
<div x-inner></div>
<div>
<div x-inner></div>
</div>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
for div in soup.find_all('div', {'x-attrsite': True}):
    del div['x-attrsite']
    div['{attrsite}'] = None
for div in soup.find_all('div', {'x-inner': True}):
    div.replace_with('{inner}')
print(soup.prettify())
Prints:
<div class="x,y,z" {attrsite}>
{inner}
<div>
{inner}
</div>
</div>
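The question also asks for existing curly braces to survive templating. One possible way to handle that, sketched below, is to double the braces in every text node before inserting the placeholders, so that str.format treats them as literals. The sample markup and the values passed to format are made up for illustration, and braces inside attribute values are not escaped by this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical input: one placeholder attribute, one placeholder element,
# and a text node that already contains literal braces.
data = '<div class="x" x-attrsite><div x-inner></div><p>Use {n} here</p></div>'
soup = BeautifulSoup(data, "html.parser")

# Step 1: escape literal braces in existing text so str.format keeps them.
for text_node in soup.find_all(string=True):
    escaped = text_node.replace("{", "{{").replace("}", "}}")
    if escaped != text_node:
        text_node.replace_with(escaped)

# Step 2: insert the placeholders, as in the answer above.
for div in soup.find_all("div", {"x-attrsite": True}):
    del div["x-attrsite"]
    div["{attrsite}"] = None
for div in soup.find_all("div", {"x-inner": True}):
    div.replace_with("{inner}")

template = str(soup)
rendered = template.format(attrsite='id="main"', inner="<b>hi</b>")
print(template)
print(rendered)
```

Escaping has to happen before the placeholders are inserted; otherwise the {inner} text nodes added in step 2 would be doubled as well.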
