BeautifulSoup - how to call on a nested element

I need a little help finding an element in my Python script with Beautiful Soup.
Below is the HTML:
<div class="lg-7 lg-offset-1 md-24 sm-24 cols">
<div class="row pr__prices">
<div class="lg-24 md-12 cols">
<input id="analytics_prodPrice_47187" type="hidden" value="2.91">
<div class="pr__pricepoint">
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
</div>
I am trying to get the product price (£3.49). Looking at the HTML above, it appears to sit within this section:
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
My issue is that when I try to get the price with Beautiful Soup like so:
pound = soup.find('span',attrs={'class':'pound'})
pence = soup.find('span',attrs={'class':'pence'})
prices.append(pound.text + pence.text)
I get this exception:
prices.append(pound.text + pence.text)
AttributeError: 'NoneType' object has no attribute 'text'
So to me it looks like it's returning a None or null. Does anybody have an idea on how I can get to the element?
EDIT
Looking at the answers below, I tried to replicate them, but instead of using static HTML I request the website URL. What I noticed is that the code works for the static HTML, but it doesn't work when I fetch the page at the URL that contains that HTML.
CODE:
from bs4 import BeautifulSoup
import pandas as pd
import requests
data = requests.get('https://www.screwfix.com/p/no-nonsense-sanitary-silicone-white-310ml/47187').text
soup = BeautifulSoup(data, 'html.parser')
currency = soup.select_one('span.pound')
currency_str = next(currency.strings).strip()
pound_str = currency.nextSibling
pence = soup.select_one('span.pence')
pence_str = next(pence.strings).strip()
print(f"{currency_str}{pound_str}{pence_str}") # £3.49
Error:
currency_str = next(currency.strings).strip()
AttributeError: 'NoneType' object has no attribute 'strings'

I have taken your data as html. One approach is to get the text within that div, using strip to remove the surrounding whitespace. The resulting main_div string still contains letters, so use re to pull out the digit groups and join the first two to get your desired output.
from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html,"html.parser")
main_div=soup.find("div",attrs={"class":"pr__price"}).get_text(strip=True)
lst=re.findall(r"\d+", main_div)  # raw string avoids the invalid "\d" escape warning
print(".".join(lst[:2]))
Output:
3.49

Here's a different approach.
from bs4 import BeautifulSoup
data = '''\
<div class="lg-7 lg-offset-1 md-24 sm-24 cols">
<div class="row pr__prices">
<div class="lg-24 md-12 cols">
<input id="analytics_prodPrice_47187" type="hidden" value="2.91">
<div class="pr__pricepoint">
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')
currency = soup.select_one('span.pound')
currency_str = next(currency.strings).strip()
pound_str = currency.next_sibling
pence = soup.select_one('span.pence')
pence_str = next(pence.strings).strip()
print(f"{currency_str}{pound_str}{pence_str}") # £3.49
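When the same selectors run against the live page and return None, as in the question's edit, it can help to guard each lookup before touching .text or .strings. One possible cause, not confirmed here, is that the HTML served to requests differs from what the browser shows. The following is a minimal sketch, assuming the markup matches the snippet in the question; extract_price is a hypothetical helper name, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Static copy of the snippet from the question (assumed to match the real page).
html = '''
<div class="pr__pricepoint">
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
</div>
'''


def extract_price(markup):
    """Return the price as a string such as '£3.49', or None if any part is missing."""
    soup = BeautifulSoup(markup, "html.parser")
    price_div = soup.select_one("div#product_price")
    if price_div is None:
        return None
    pound = price_div.select_one("span.pound")
    pence = price_div.select_one("span.pence")
    if pound is None or pence is None or pound.next_sibling is None:
        return None
    currency = pound.get_text(strip=True)      # the currency symbol
    whole = str(pound.next_sibling).strip()    # text node right after span.pound
    fraction = next(pence.strings).strip()     # first text node inside span.pence
    return f"{currency}{whole}{fraction}"


print(extract_price(html))
print(extract_price("<p>no price here</p>"))
```

If the helper returns None for the downloaded page, printing part of the response text is a quick way to see whether the price markup is present in it at all.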

Related

find() in beautifulsoup4 in python

I am trying to get the text ("INC000001") from the HTML below. I have tried many different approaches and still cannot get it.
Here is what I tried:
soup.find("div", {"data-itrac-control-cd":"CS_ID"}).find("span").text()
I get this error:
AttributeError: 'NoneType' object has no attribute 'find'
I have also tried .get_text(), with no luck.
<div id="186164592" data-itrac-item-id="186164592" data-itrac-control-cd="CS_ID" class="ui-controlgroup-controls itrac-displayonly">
<div>
<span class="display_only ui-body-j itrac-label-nobodybg" id="186164592">INC000001
</span>
</div>
</div>

This works:
from bs4 import BeautifulSoup
html = '<div id="186164592" data-itrac-item-id="186164592" data-itrac-control-cd="CS_ID" class="ui-controlgroup-controls itrac-displayonly"><div><span class="display_only ui-body-j itrac-label-nobodybg" id="186164592">INC000001</span></div></div>'
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
soup.div.div.span.text
Output:
'INC000001'
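The attribute chain above depends on the span being the first nested element in the document. A sketch of an alternative, assuming the markup is exactly as posted in the question, is to target the data-itrac-control-cd attribute with a CSS selector and check for None before reading the text:

```python
from bs4 import BeautifulSoup

html = """
<div id="186164592" data-itrac-item-id="186164592" data-itrac-control-cd="CS_ID" class="ui-controlgroup-controls itrac-displayonly">
<div>
<span class="display_only ui-body-j itrac-label-nobodybg" id="186164592">INC000001
</span>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the span nested anywhere inside the div that carries the CS_ID control code.
span = soup.select_one('div[data-itrac-control-cd="CS_ID"] span')

# get_text is a method; .text is a property and must not be called like a function.
incident_id = span.get_text(strip=True) if span is not None else None
print(incident_id)
```

If span is None on the real page, the div is probably not present in the HTML that was parsed, which would also explain the AttributeError in the question.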

Extract html with no span class attribute and same div class attributes

I have found similar questions but none that directly address my issue. I have worked on this for about a week now with no luck.
I am trying to scrape data from this link: https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070
The issue is, the value I am looking for has no span-class attribute but when using the div class attributes, it shares the same name as other values on the page. I want my code to return $22,807 but anything I try either returns $25,195 or []. See the following HTML:
<div class="text-right col-3 col-sm-4 col-md-6">
<div class="label-block label-block-1 label-block-sm-2 text-muted" data-qa="vehicle-header-msrp"
data-test="vehicleHeaderMsrp">
<div class="label-block-title" data-qa="LabelBlock-title" data-test="labelBlockTitle"></div>
<div class="label-block-subtitle" data-qa="LabelBlock-subTitle" data-test="labelBlockSubTitle"></div>
<div data-qa="LabelBlock-text" class="label-block-text" data-test="labelBlockText">
<span class="pricing-block-amount-strikethrough">$25,195</span>
</div>
</div>
</div>
<div class="text-right col-3 col-sm-4 col-md-6">
<div class="label-block label-block-1 label-block-sm-2" data-qa="vehicle-header-average-market-price"
data-test="vehicleHeaderAverageMarketPrice">
<div class="label-block-title" data-qa="LabelBlock-title" data-test="labelBlockTitle"></div>
<div class="label-block-subtitle" data-qa="LabelBlock-subTitle" data-test="labelBlockSubTitle"></div>
<div data-qa="LabelBlock-text" class="label-block-text" data-test="labelBlockText">
<span class="">$22,807</span>
</div>
</div>
</div>
I can easily get the $25,195 returned with the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:70.0) Gecko/20190101 Firefox/70.0"
}
url = "https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070"
print(url)
page = requests.get(
url,
headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
test = soup.find('span', {'class': 'pricing-block-amount-strikethrough'})
print(test.get_text())
But no combination of calls that I try will return the $22,807 that I need.
What's interesting is that I can get the $25 value if I use
test = soup.find('div', {'class': 'label-block label-block-1 label-block-sm-2 text-muted'})
So I assumed that I could simply delete the "text-muted" part like:
test = soup.find('div', {'class': 'label-block label-block-1 label-block-sm-2'})
to get the $22 number but it just returns [ ].
Disclaimer: the dollar amount that I need changes frequently so if you help with this and end up getting a number slightly different than $22,807 it might still be correct. If you click on the link, the number I am looking for is the "Market Average" not the "MSRP."
Thank you!

If you browse the page, you will see that it takes time for the second value to appear. The requests module fetches the content immediately and doesn't wait for the page to finish loading. This is where you add selenium alongside bs4: wait for the site to load, then get the page content.
You will also need to download geckodriver.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070"
driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)
time.sleep(7)
soup = BeautifulSoup(driver.page_source, 'html.parser')
div = soup.find_all('div', {'class': 'label-block-text'})
# Print the text of the span inside each price block
for x in div:
    span = x.find('span')
    print(span.get_text())
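The loop above prints every price block. To pick out only the market average once the page has rendered, one option is to select on the data-test attribute shown in the question's HTML. This is a sketch run against a static, trimmed copy of that HTML; it assumes the rendered page keeps the same attribute names, and rendered_html stands in for driver.page_source:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's HTML, standing in for driver.page_source.
rendered_html = """
<div class="label-block label-block-1 label-block-sm-2 text-muted" data-qa="vehicle-header-msrp" data-test="vehicleHeaderMsrp">
  <div data-qa="LabelBlock-text" class="label-block-text" data-test="labelBlockText">
    <span class="pricing-block-amount-strikethrough">$25,195</span>
  </div>
</div>
<div class="label-block label-block-1 label-block-sm-2" data-qa="vehicle-header-average-market-price" data-test="vehicleHeaderAverageMarketPrice">
  <div data-qa="LabelBlock-text" class="label-block-text" data-test="labelBlockText">
    <span class="">$22,807</span>
  </div>
</div>
"""

soup = BeautifulSoup(rendered_html, "html.parser")

# The data-test value distinguishes the market-average block from the MSRP block,
# even though both share the same CSS classes.
span = soup.select_one(
    'div[data-test="vehicleHeaderAverageMarketPrice"] div.label-block-text span'
)
market_average = span.get_text(strip=True) if span is not None else None
print(market_average)
```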

Beautifulsoup find_all() captures too much text

I have some HTML I am parsing in Python using the BeautifulSoup package. Here's the HTML:
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
I am capturing the results using this code chunk:
names = soup3.find_all('div', {'class': "n"})
contact = soup3.find_all('div', {'class': "x"})
other = soup3.find_all('div', {'class': "x c"})
Right now, both classes 'x' and 'x c' are being captured in the 'contact' variable. How can I prevent this from happening?

Try:
soup.select('div[class="x"]')
Output:
[<div class="x">Address</div>, <div class="x">Phone</div>]

from bs4 import BeautifulSoup
html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, 'html.parser')
contact = soup.find_all("div", class_="x")[1]
print(contact)
Output:
<div class="x">Phone</div>

What about using sets?
others = set(soup.find_all('div', {'class': "x c"}))
contacts = set(soup.find_all('div', {'class': "x"})) - others
others will be {<div class="x c">Other</div>}
and
contacts will be {<div class="x">Phone</div>, <div class="x">Address</div>}
Note that this will only work for this specific combination of classes. It may not work in general, depending on the combinations of classes you have in the HTML.
See BeautifulSoup webscraping find_all( ): finding exact match for more details on how .find_all() works.
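Another option is to pass find_all() a function that compares the tag's full class list, since BeautifulSoup stores class as a list of values. The sketch below uses the HTML from the question; has_exact_classes is a hypothetical helper name, not a BeautifulSoup API:

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, "html.parser")


def has_exact_classes(*wanted):
    """Build a filter matching <div> tags whose class list equals `wanted` exactly."""
    wanted_list = list(wanted)

    def matcher(tag):
        # tag.get("class") is a list such as ["x", "c"], or None if the attribute is absent.
        return tag.name == "div" and tag.get("class") == wanted_list

    return matcher


contact = soup.find_all(has_exact_classes("x"))
other = soup.find_all(has_exact_classes("x", "c"))

print([tag.get_text() for tag in contact])
print([tag.get_text() for tag in other])
```

The comparison is order-sensitive, so class="c x" would not match has_exact_classes("x", "c"); comparing sets instead of lists would relax that.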

Print URL from two different BeautifulSoup outputs

I am scraping a few URLs in batch using BeautifulSoup.
Here is my script (only the relevant parts):
import urllib2
from bs4 import BeautifulSoup
quote_page = 'https://example.com/foo/bar'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
url_box = soup.find('div', attrs={'class': 'player'})
print url_box
This gives two different kinds of output depending on the HTML at the URL (about half the pages give the first kind and the rest give the second).
Here's the first kind of output:
<div class="player">
<video class="video-js vjs-fluid video-player" height="100%" id="some-player" poster="https://example.com/path/to/jpg/random.jpg" width="100%"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>
And here's the other:
<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>
I want to extract the image URL which is poster in first and src in second.
Any ideas how I can do that so same script extracts that URL from either kind of print?
P.S The first print also has a mp4 link which I do not need.

You can use the get() method to read the value of an attribute from the targeted tag.
You should be able to do something like this:
if url_box.find('video'):
    url = url_box.find('video').get('poster')
    mp4 = url_box.find('span').get('data-url')
if url_box.find('img'):
    url = url_box.find('img').get('src')

Decide which version you are dealing with and split accordingly:
firstVersion = '''<div class="player">
<video class="video-js vjs-fluid video-player" height="100%" id="some-player" poster="https://example.com/path/to/jpg/random.jpg" width="100%"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>'''
secondVersion = '''<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>'''
def extractImageUrl(htmlInput):
    imageUrl = ""
    if "poster" in htmlInput:
        imageUrl = htmlInput.split('poster="')[1].split('"')[0]
    elif "src" in htmlInput:
        imageUrl = htmlInput.split('src="')[1].split('"')[0]
    return imageUrl
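String splitting depends on attribute order and quoting. A sketch that reads the attributes through the parser instead, assuming the two markup variants shown in the question; extract_image_url is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

first_version = '''<div class="player">
<video class="video-js vjs-fluid video-player" height="100%" id="some-player" poster="https://example.com/path/to/jpg/random.jpg" width="100%"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>'''

second_version = '''<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>'''


def extract_image_url(markup):
    """Return the poster URL of a <video>, else the src of an <img>, else None."""
    soup = BeautifulSoup(markup, "html.parser")
    player = soup.find("div", class_="player")
    if player is None:
        return None
    video = player.find("video")
    if video is not None and video.get("poster"):
        return video.get("poster")
    img = player.find("img")
    if img is not None:
        return img.get("src")
    return None


print(extract_image_url(first_version))
print(extract_image_url(second_version))
```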

Isolate SRC attribute from soup return in python

I am using Python3 with BeautifulSoup to get a certain div from a webpage. My end goal is to get the img src's url from within this div so I can pass it to pytesseract to get the text off the image.
The img doesn't have any classes or unique identifiers so I am not sure how to use BeautifulSoup to get just this image every time. There are several other images and their order changes from day to day. So instead, I just got the entire div that surrounds the image. The div information doesn't change and is unique, so my code looks like this:
weather_today = soup.find("div", {"id": "weather_today_content"})
thus my script currently returns the following:
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
Now I just need to figure out how to pull just the src into a string so I can then pass it to pytesseract to download and use ocr to pull further information.
I am unfamiliar with regex but have been told this is the best method. Any assistance would be greatly appreciated. Thank you.

Find the 'img' element, in the 'div' element you found, then read the attribute 'src' from it.
from bs4 import BeautifulSoup
html ="""
<html><body>
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
weather_today = soup.find("div", {"id": "weather_today_content"})
print (weather_today.find('img')['src'])
Outputs:
/database/img/weather_today.jpg?ver=2018-08-01

You can use a CSS selector, which is built into BeautifulSoup (methods select() and select_one()):
data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('div#weather_today_content img')['src'])
Prints:
/database/img/weather_today.jpg?ver=2018-08-01
The selector div#weather_today_content img means select the <div> with id=weather_today_content and, within this <div>, select an <img>.
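The src value here is a site-relative path, so it likely needs to be joined with the page address before the image can be downloaded. A minimal sketch using urllib.parse.urljoin from the standard library; page_url is a hypothetical placeholder, since the question does not give the real address:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address; replace with the URL the HTML was fetched from.
page_url = "https://example.com/weather/today.html"

data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""

soup = BeautifulSoup(data, "html.parser")
img = soup.select_one("div#weather_today_content img")

# Resolve the relative src against the page URL to get an absolute image URL.
image_url = urljoin(page_url, img["src"]) if img is not None else None
print(image_url)
```

The resulting absolute URL can then be fetched and the image bytes handed to the OCR step.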
