find() in beautifulsoup4 in python

I am trying to get the text ("INC000001") from the HTML below. I have tried many different approaches and still cannot get it.
Here is what I tried:
soup.find("div", {"data-itrac-control-cd":"CS_ID"}).find("span").text()
I get this error:
AttributeError: 'NoneType' object has no attribute 'find'
I have also tried .get_text(), with no luck. The HTML is:
<div id="186164592" data-itrac-item-id="186164592" data-itrac-control-cd="CS_ID" class="ui-controlgroup-controls itrac-displayonly">
<div>
<span class="display_only ui-body-j itrac-label-nobodybg" id="186164592">INC000001
</span>
</div>
</div>

This works:
from bs4 import BeautifulSoup
html = '<div id="186164592" data-itrac-item-id="186164592" data-itrac-control-cd="CS_ID" class="ui-controlgroup-controls itrac-displayonly"><div><span class="display_only ui-body-j itrac-label-nobodybg" id="186164592">INC000001</span></div></div>'
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
soup.div.div.span.text
Output:
'INC000001'
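A minimal sketch of the attribute-based lookup from the question, with a check for None at each step. It assumes the HTML being parsed really contains the div; when find() returns None, the usual suspect is that the parsed document differs from what the browser shows. Note that .text is a property, not a method, so it is read without parentheses. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''<div id="186164592" data-itrac-item-id="186164592" data-itrac-control-cd="CS_ID" class="ui-controlgroup-controls itrac-displayonly">
<div>
<span class="display_only ui-body-j itrac-label-nobodybg" id="186164592">INC000001
</span>
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so check before chaining.
container = soup.find("div", {"data-itrac-control-cd": "CS_ID"})
span = container.find("span") if container is not None else None

# get_text(strip=True) removes the trailing newline inside the span.
incident_id = span.get_text(strip=True) if span is not None else None
print(incident_id)  # INC000001
```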

Related

BeautifulSoup - how to call on a nested element

I just need a little help finding an element in my python script with Beautiful Soup.
Below is the html:
<div class="lg-7 lg-offset-1 md-24 sm-24 cols">
<div class="row pr__prices">
<div class="lg-24 md-12 cols">
<input id="analytics_prodPrice_47187" type="hidden" value="2.91">
<div class="pr__pricepoint">
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
</div>
I am trying to get the product price (£3.49), which appears to be in this section of the HTML above:
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
I use Beautiful Soup to try to get the price like so:
pound = soup.find('span',attrs={'class':'pound'})
pence = soup.find('span',attrs={'class':'pence'})
prices.append(pound.text + pence.text)
I get this exception:
prices.append(pound.text + pence.text)
AttributeError: 'NoneType' object has no attribute 'text'
So it looks like find() is returning None. Does anybody know how I can get to the element?
EDIT
Looking at the answers below, I tried to replicate them, but instead of static HTML I fetch the page from the website URL. The code works on the static HTML, but not when I request the URL of the page that contains that HTML.
CODE:
from bs4 import BeautifulSoup
import pandas as pd
import requests
data = requests.get('https://www.screwfix.com/p/no-nonsense-sanitary-silicone-white-310ml/47187').text
soup = BeautifulSoup(data, 'html.parser')
currency = soup.select_one('span.pound')
currency_str = next(currency.strings).strip()
pound_str = currency.nextSibling
pence = soup.select_one('span.pence')
pence_str = next(pence.strings).strip()
print(f"{currency_str}{pound_str}{pence_str}") # £3.49
Error:
currency_str = next(currency.strings).strip()
AttributeError: 'NoneType' object has no attribute 'strings'
I have taken your data as the string html. One approach is to get the text inside that div, using strip=True to remove the surrounding whitespace. The resulting main_div still contains letters and symbols, so use re to pull out the digits and you get your desired output:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, "html.parser")
main_div = soup.find("div", attrs={"class": "pr__price"}).get_text(strip=True)
lst = re.findall(r"\d+", main_div)  # raw string avoids an invalid-escape warning
print(".".join(lst[:2]))
Output:
3.49
Here's a different approach.
from bs4 import BeautifulSoup
data = '''\
<div class="lg-7 lg-offset-1 md-24 sm-24 cols">
<div class="row pr__prices">
<div class="lg-24 md-12 cols">
<input id="analytics_prodPrice_47187" type="hidden" value="2.91">
<div class="pr__pricepoint">
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')
currency = soup.select_one('span.pound')
currency_str = next(currency.strings).strip()
pound_str = currency.next_sibling  # next_sibling is the current spelling of nextSibling
pence = soup.select_one('span.pence')
pence_str = next(pence.strings).strip()
print(f"{currency_str}{pound_str}{pence_str}") # £3.49
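The error in the question's EDIT arises because select_one() returns None when the fetched page does not contain the expected spans. A minimal sketch of a defensive version follows, assuming the same markup as above; extract_price is a hypothetical helper name, not part of Beautiful Soup, and the sketch does not explain why the live page lacks the markup.

```python
from bs4 import BeautifulSoup


def extract_price(html):
    """Return a price such as '£3.49', or None if the expected spans are absent."""
    soup = BeautifulSoup(html, "html.parser")
    currency = soup.select_one("span.pound")
    pence = soup.select_one("span.pence")
    if currency is None or pence is None:
        return None  # markup not present in this document

    currency_str = currency.get_text(strip=True)
    # The pounds digit is a bare text node right after <span class="pound">.
    pound_node = currency.next_sibling
    pound_str = pound_node.strip() if isinstance(pound_node, str) else ""
    # First string inside span.pence is ".49"; the nested "INC VAT" is skipped.
    pence_str = next(pence.strings, "").strip()
    return f"{currency_str}{pound_str}{pence_str}"


sample = '''<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
</div>'''

print(extract_price(sample))
print(extract_price("<html><body></body></html>"))
```

The first call prints the price and the second prints None instead of raising AttributeError, which makes it easier to tell a missing element apart from a parsing mistake.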

Isolate SRC attribute from soup return in python

I am using Python3 with BeautifulSoup to get a certain div from a webpage. My end goal is to get the img src's url from within this div so I can pass it to pytesseract to get the text off the image.
The img doesn't have any classes or unique identifiers so I am not sure how to use BeautifulSoup to get just this image every time. There are several other images and their order changes from day to day. So instead, I just got the entire div that surrounds the image. The div information doesn't change and is unique, so my code looks like this:
weather_today = soup.find("div", {"id": "weather_today_content"})
thus my script currently returns the following:
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
Now I just need to figure out how to pull just the src into a string so I can then pass it to pytesseract to download and use ocr to pull further information.
I am unfamiliar with regex but have been told this is the best method. Any assistance would be greatly appreciated. Thank you.
Find the 'img' element, in the 'div' element you found, then read the attribute 'src' from it.
from bs4 import BeautifulSoup
html ="""
<html><body>
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
weather_today = soup.find("div", {"id": "weather_today_content"})
print(weather_today.find('img')['src'])
Outputs:
/database/img/weather_today.jpg?ver=2018-08-01
You can use a CSS selector, which is built into BeautifulSoup (the select() and select_one() methods):
data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('div#weather_today_content img')['src'])
Prints:
/database/img/weather_today.jpg?ver=2018-08-01
The selector div#weather_today_content img means: select the <div> with id=weather_today_content and, within this <div>, select an <img>.
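Since the src here is a site-relative path, it usually has to be combined with the page URL before the image can be downloaded. A minimal sketch using urljoin from the standard library; PAGE_URL is a made-up placeholder, not the real site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL, for illustration only; substitute the real one.
PAGE_URL = "https://example.com/weather/index.html"

html = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""

soup = BeautifulSoup(html, "html.parser")
img = soup.select_one("div#weather_today_content img")

# .get() returns None instead of raising KeyError when the attribute is missing.
src = img.get("src") if img is not None else None
image_url = urljoin(PAGE_URL, src) if src else None
print(image_url)
```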

BeautifulSoup locating iframe and its attribute

I need to get the iframe src with Beautiful Soup:
<div class="divclass">
<div id="simpleid">
<iframe width="300" height="300" src="http://google.com">
I could use selenium with code:
iframe1 = driver.find_element_by_class_name("divclass")
iframe = iframe1.find_element_by_tag_name("iframe").get_attribute("src")
but Selenium is too slow for this task.
I've been looking for a solution here on Stack Overflow and tried several snippets, but I always get error 403 while using urllib (changing the browser agent does not help; still a 403 error) or I get None.
Use soup.find_all('tag you want to search')
>>> from bs4 import BeautifulSoup
>>> html = '''
... <div class="divclass">
... <div id="simpleid">
... <iframe width="300" height="300" src="http://google.com">
... '''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find_all('iframe')
[<iframe height="300" src="http://google.com" width="300">
</iframe>]
>>> soup.find_all('iframe')[0]['src']
u'http://google.com'
>>>
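As a sketch of the same idea scoped to the enclosing div, mirroring the Selenium lookup in the question: select_one() takes a CSS selector and returns None when nothing matches. The HTML below is the question's snippet with the tags closed; it is not the real page.

```python
from bs4 import BeautifulSoup

html = '''
<div class="divclass">
<div id="simpleid">
<iframe width="300" height="300" src="http://google.com"></iframe>
</div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# "div.divclass iframe" matches an <iframe> anywhere inside <div class="divclass">.
iframe = soup.select_one("div.divclass iframe")
src = iframe.get("src") if iframe is not None else None
print(src)  # http://google.com
```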
Very good question.
Looking at the site you're trying to get that iframe from with that library, you have to get the contents of the tag in that div and then base64-decode it, and you should be done.
Seeing how you do things, don't stop! You're going to be a great programmer.

Error nonetype object has no attribute text while scraping via beautiful soup 4 python

I am trying to extract some information from a web page using Beautiful Soup in Python. Here is the section:
<div class="result-value" data-reactid=".0.0.3.0.0.3.$0.1.1">
<span data-reactid=".0.0.3.0.0.3.$0.1.1.0">751</span>
<span class="result-value-unit" data-reactid=".0.0.3.0.0.3.$0.1.1.1">KB</span>
</div
Snap: https://www.dropbox.com/s/d349tb3f22o0wyf/4.png?dl=0
The code I am using is this:
Sizeofweb = ""
try:
    Sizeofweb = soup.find('span', {'data-reactid': ".0.0.3.0.0.3.$0.1.1.0"}).text
    print(Sizeofweb)
except Exception as e:
    converted_date = "Error was {0}".format(e)
    print(converted_date)
Error
nonetype object has no attribute text
I have tried this but it didn't work. Where am I going wrong?
This code works for me -
from bs4 import BeautifulSoup
html_str = """
<div class="result-value" data-reactid=".0.0.3.0.0.3.$0.1.1">
<span data-reactid=".0.0.3.0.0.3.$0.1.1.0">751</span>
<span class="result-value-unit" data-reactid=".0.0.3.0.0.3.$0.1.1.1">KB</span>
</div>
"""
soup = BeautifulSoup(html_str,"lxml")
Sizeofweb = soup.find('span', {'data-reactid': ".0.0.3.0.0.3.$0.1.1.0"}).text
print(Sizeofweb)
Output
751
One thing I noticed is that the last closing div tag is missing its closing angle bracket (">").
Dunno how you've done it but this works for me...
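A minimal sketch of an alternative lookup that goes through the class names rather than the exact data-reactid string, and checks for None before reading text. This assumes the class names are stable on the real page, which the question does not confirm; if the element is missing from the fetched HTML altogether, the result is simply None.

```python
from bs4 import BeautifulSoup

html_str = """
<div class="result-value" data-reactid=".0.0.3.0.0.3.$0.1.1">
<span data-reactid=".0.0.3.0.0.3.$0.1.1.0">751</span>
<span class="result-value-unit" data-reactid=".0.0.3.0.0.3.$0.1.1.1">KB</span>
</div>
"""

soup = BeautifulSoup(html_str, "html.parser")

container = soup.find("div", class_="result-value")
# The first <span> holds the number; the unit span is identified by its class.
value_span = container.find("span") if container is not None else None
unit_span = container.find("span", class_="result-value-unit") if container is not None else None

size = value_span.get_text(strip=True) if value_span is not None else None
unit = unit_span.get_text(strip=True) if unit_span is not None else None
print(size, unit)  # 751 KB
```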

Targeting <a> with specific attribute using BeautifulSoup

I'm attempting to scrape a page that has a section like this:
<a name="id_631"></a>
<hr>
<div class="store-class">
<div>
<span><strong>Store City</strong></span>
</div>
<div class="store-class-content">
<p>Event listing</p>
<p>Event listing2</p>
<p>Event listing3</p>
</div>
<div>
Stuff about contact info
</div>
</div>
The page is a list of sections like that and the only way to differentiate them is by the name attribute in the <a> tag.
So I'm thinking I want to target that then go to the next_sibling to get the <hr> then again to the next sibling to get the <div class="store-class"> section. All I want is the info in that div tag.
I'm not sure how to target that <a> tag to move down two siblings though. When I try print(soup.find_all('a', {"name":"id_631"})) that just gives me what's in the tag, which is nothing.
Here's my script:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.tandyleather.com/en/leathercraft-classes")
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find("a", id="id_631").find_next_sibling("div", class_="store-class"))
But I get the error:
Traceback (most recent call last):
File "tandy.py", line 8, in <module>
print(soup.find("a", id="id_631").find_next_sibling("div", class_="store-class"))
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'
find_next_sibling() to the rescue:
soup.find("a", attrs={"name": "id_631"}).find_next_sibling("div", class_="store-class")
Also, html.parser has to be replaced with either lxml or html5lib.
See also:
Differences between parsers
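A self-contained sketch of the lookup on a static copy of the section. The HTML name attribute has to be passed through attrs because name is already the first parameter of find() (the tag name). The snippet below is well-formed, so html.parser is enough here; for the live page, the parser advice above still applies.

```python
from bs4 import BeautifulSoup

html = '''
<a name="id_631"></a>
<hr>
<div class="store-class">
<div>
<span><strong>Store City</strong></span>
</div>
<div class="store-class-content">
<p>Event listing</p>
<p>Event listing2</p>
<p>Event listing3</p>
</div>
<div>
Stuff about contact info
</div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Match the anchor by its name attribute, then jump to the following store div.
anchor = soup.find("a", attrs={"name": "id_631"})
store = anchor.find_next_sibling("div", class_="store-class") if anchor is not None else None

# Collect the text of each <p> inside the content div.
events = [p.get_text(strip=True) for p in store.select("div.store-class-content p")] if store is not None else []
print(events)
```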
