I'm attempting to scrape a page that has a section like this:
<a name="id_631"></a>
<hr>
<div class="store-class">
<div>
<span><strong>Store City</strong></span>
</div>
<div class="store-class-content">
<p>Event listing</p>
<p>Event listing2</p>
<p>Event listing3</p>
</div>
<div>
Stuff about contact info
</div>
</div>
The page is a list of sections like that and the only way to differentiate them is by the name attribute in the <a> tag.
So I'm thinking I want to target that then go to the next_sibling to get the <hr> then again to the next sibling to get the <div class="store-class"> section. All I want is the info in that div tag.
I'm not sure how to target that <a> tag to move down two siblings though. When I try print(soup.find_all('a', {"name":"id_631"})) that just gives me what's in the tag, which is nothing.
Here's my script:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.tandyleather.com/en/leathercraft-classes")
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find("a", id="id_631").find_next_sibling("div", class_="store-class"))
But I get the error:
Traceback (most recent call last):
File "tandy.py", line 8, in <module>
print(soup.find("a", id="id_631").find_next_sibling("div", class_="store-class"))
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'
find_next_sibling() to the rescue:
soup.find("a", attrs={"name": "id_631"}).find_next_sibling("div", class_="store-class")
Also, html.parser has to be replaced with either lxml or html5lib.
See also:
Differences between parsers
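To see why the id= lookup returned None while the attrs lookup works, here is a minimal sketch run against a static copy of the snippet from the question. The static HTML is an assumption (the live page may be structured differently and may need lxml or html5lib, as noted above), and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Static copy of the section from the question (assumption: the live
# page may differ from this).
html = """
<a name="id_631"></a>
<hr>
<div class="store-class">
  <div><span><strong>Store City</strong></span></div>
  <div class="store-class-content">
    <p>Event listing</p>
    <p>Event listing2</p>
    <p>Event listing3</p>
  </div>
  <div>Stuff about contact info</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The anchor carries name=, not id=, so an id lookup finds nothing.
by_id = soup.find("a", id="id_631")  # None

# "name" is also find()'s first parameter, so the attribute goes in attrs.
anchor = soup.find("a", attrs={"name": "id_631"})

# Skips the <hr> and whitespace and stops at the first matching sibling div.
store = anchor.find_next_sibling("div", class_="store-class")

content = store.find("div", class_="store-class-content")
events = [p.get_text(strip=True) for p in content.find_all("p")]
print(events)
```

Checking the result of find() for None before chaining another call avoids the AttributeError from the traceback.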
I am trying to get the text ("INC000001") from the html code. I have tried many different approaches without success.
Here is what I tried:
soup.find("div", {"data-itrac-control-cd":"CS_ID"}).find("span").text()
I get this error:
AttributeError: 'NoneType' object has no attribute 'find'
I have also tried .get_text(), with no luck.
<div id="186164592" data-itrac-item-id="186164592" data-itrac-control-cd="CS_ID" class="ui-controlgroup-controls itrac-displayonly">
<div>
<span class="display_only ui-body-j itrac-label-nobodybg" id="186164592">INC000001
</span>
</div>
</div>
This works:
from bs4 import BeautifulSoup
html = '<div id="186164592" data-itrac-item-id="186164592" data-itrac-control-cd="CS_ID" class="ui-controlgroup-controls itrac-displayonly"><div><span class="display_only ui-body-j itrac-label-nobodybg" id="186164592">INC000001</span></div></div>'
soup = BeautifulSoup(html, 'html.parser')
soup.div.div.span.text
Output:
'INC000001'
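The AttributeError in the question means that soup.find() returned None, so the div was not found in whatever HTML was actually parsed. As a hedged sketch, assuming the static snippet from the question, the lookup can be done by the data attribute with a None check at each step. The name case_id is hypothetical.

```python
from bs4 import BeautifulSoup

html = '''<div id="186164592" data-itrac-item-id="186164592" data-itrac-control-cd="CS_ID" class="ui-controlgroup-controls itrac-displayonly">
<div>
<span class="display_only ui-body-j itrac-label-nobodybg" id="186164592">INC000001
</span>
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# Attribute names containing hyphens cannot be keyword arguments,
# so they are passed through the attrs dictionary.
container = soup.find("div", attrs={"data-itrac-control-cd": "CS_ID"})

if container is None:
    case_id = None  # the div is not in the parsed HTML
else:
    span = container.find("span")
    # .text and .get_text() both work; strip=True drops the trailing newline.
    case_id = span.get_text(strip=True) if span is not None else None

print(case_id)
```

If case_id comes back as None on the real page, printing the HTML that was parsed is a quick way to check whether the div is there at all.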
I just need a little help finding an element in my python script with Beautiful Soup.
Below is the html:
<div class="lg-7 lg-offset-1 md-24 sm-24 cols">
<div class="row pr__prices">
<div class="lg-24 md-12 cols">
<input id="analytics_prodPrice_47187" type="hidden" value="2.91">
<div class="pr__pricepoint">
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
</div>
What I am trying to do is get the product price (£3.49). Looking at the html above, it is in this section:
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
My issue is that even though I use Beautiful Soup to try and get the price like so:
pound = soup.find('span',attrs={'class':'pound'})
pence = soup.find('span',attrs={'class':'pence'})
prices.append(pound.text + pence.text)
I get this exception stating:
prices.append(pound.text + pence.text)
AttributeError: 'NoneType' object has no attribute 'text'
So to me it looks like it's returning a None or null. Does anybody have an idea on how I can get to the element?
EDIT
Looking at the answers below, I tried to replicate them, but instead of using static HTML I requested the website URL. The code works for the static HTML, but not when I fetch the page that contains that HTML from the URL.
CODE:
from bs4 import BeautifulSoup
import pandas as pd
import requests
data = requests.get('https://www.screwfix.com/p/no-nonsense-sanitary-silicone-white-310ml/47187').text
soup = BeautifulSoup(data, 'html.parser')
currency = soup.select_one('span.pound')
currency_str = next(currency.strings).strip()
pound_str = currency.nextSibling
pence = soup.select_one('span.pence')
pence_str = next(pence.strings).strip()
print(f"{currency_str}{pound_str}{pence_str}") # £3.49
Error:
currency_str = next(currency.strings).strip()
AttributeError: 'NoneType' object has no attribute 'strings'
Taking the HTML from your question as the input string html, one approach is to get the text of the pr__price div with get_text(strip=True). That text still contains letters and symbols, so use re to pull out the runs of digits; joining the first two with a dot gives the desired output:
from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html,"html.parser")
main_div=soup.find("div",attrs={"class":"pr__price"}).get_text(strip=True)
lst=re.findall(r"\d+", main_div)
print(".".join(lst[:2]))
Output:
3.49
Here's a different approach.
from bs4 import BeautifulSoup
data = '''\
<div class="lg-7 lg-offset-1 md-24 sm-24 cols">
<div class="row pr__prices">
<div class="lg-24 md-12 cols">
<input id="analytics_prodPrice_47187" type="hidden" value="2.91">
<div class="pr__pricepoint">
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')
currency = soup.select_one('span.pound')
currency_str = next(currency.strings).strip()
pound_str = currency.next_sibling
pence = soup.select_one('span.pence')
pence_str = next(pence.strings).strip()
print(f"{currency_str}{pound_str}{pence_str}") # £3.49
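The EDIT in the question shows the same selectors returning None on the downloaded page, which means the HTML that requests received did not contain span.pound. Why it differs is not known from the question; it might be filled in by JavaScript or served differently to scripts, but that is only an assumption. The sketch below wraps the lookup in a function (the name extract_price is hypothetical) that returns None instead of raising, so the same code can be run on both the static snippet and the fetched page.

```python
from bs4 import BeautifulSoup, NavigableString


def extract_price(html):
    """Return a string such as '£3.49', or None if the markup is missing."""
    soup = BeautifulSoup(html, "html.parser")
    pound = soup.select_one("span.pound")
    pence = soup.select_one("span.pence")
    if pound is None or pence is None:
        return None

    # The whole-pound digit is the bare text node right after span.pound.
    whole = pound.next_sibling
    if not isinstance(whole, NavigableString):
        return None

    currency = pound.get_text(strip=True)
    # First string inside span.pence is ".49"; later strings are "INC VAT".
    fraction = next(pence.strings, "").strip()
    return f"{currency}{str(whole).strip()}{fraction}"


sample = (
    '<div id="product_price" class="pr__price">'
    '<span class="pound">\n£</span>3'
    '<span class="pence">.49<span class="incvat">INC VAT</span></span>'
    '</div>'
)

print(extract_price(sample))
print(extract_price("<html></html>"))
```

A None result for the fetched page suggests looking at the downloaded text itself before changing the selectors.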
I am trying to pull out some content inside of a div tag declaration:
<div class="search-listing font-size-10 my-3 my-md-0 py-0 py-md-4" listing_id="5327969" latitude="28.92327" longitude="-27.0365">
.
.
.
</div>
What I want is the latitude and longitude, but I can't seem to access the attributes of the div itself; I can only get at its child elements. I'm using html.parser.
If I try:
line.select('div[class*="py-md-4"]')[0]
I get an index error.
This was never going to work:
coords = soup.find_all("longitude")
I've tried:
divisions = soup.select('div[class*=search-listing]')
for line in divisions:
    print(line.select('div[class*=py-md-4]')[0])
but each time I try to extract items from line - it gives me the children of the div.
I am expecting to be able to pull out both the longitude & latitude from the Div - but to no avail. Surely it must be possible?
You can use the CSS selector [latitude][longitude]. It selects every tag that has both a latitude and a longitude attribute:
data = '''<div class="search-listing font-size-10 my-3 my-md-0 py-0 py-md-4" listing_id="5327969" latitude="28.92327" longitude="-27.0365">
<p>Some text</p>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for tag in soup.select('[latitude][longitude]'):
    print('lat={} lon={}'.format(tag['latitude'], tag['longitude']))
Prints:
lat=28.92327 lon=-27.0365
Further reading:
CSS Selectors Reference
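The attempts in the question came back empty because select() called on a tag searches only that tag's descendants, while latitude and longitude are attributes of the tag itself. A minimal sketch, assuming the snippet from the question: it selects the listing divs by class and reads the attributes with .get(), which returns None when an attribute is absent. The name coords is illustrative.

```python
from bs4 import BeautifulSoup

data = '''<div class="search-listing font-size-10 my-3 my-md-0 py-0 py-md-4" listing_id="5327969" latitude="28.92327" longitude="-27.0365">
<p>Some text</p>
</div>'''

soup = BeautifulSoup(data, "html.parser")

coords = []
for div in soup.select("div.search-listing"):
    # Attributes belong to the tag itself, not to its children.
    lat = div.get("latitude")
    lon = div.get("longitude")
    if lat is not None and lon is not None:
        coords.append((float(lat), float(lon)))

print(coords)
```

div.attrs gives the full attribute dictionary if you want to inspect everything the tag carries.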
I am using Python3 with BeautifulSoup to get a certain div from a webpage. My end goal is to get the img src's url from within this div so I can pass it to pytesseract to get the text off the image.
The img doesn't have any classes or unique identifiers so I am not sure how to use BeautifulSoup to get just this image every time. There are several other images and their order changes from day to day. So instead, I just got the entire div that surrounds the image. The div information doesn't change and is unique, so my code looks like this:
weather_today = soup.find("div", {"id": "weather_today_content"})
thus my script currently returns the following:
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
Now I just need to figure out how to pull just the src into a string so I can then pass it to pytesseract to download and use ocr to pull further information.
I am unfamiliar with regex but have been told this is the best method. Any assistance would be greatly appreciated. Thank you.
Find the 'img' element, in the 'div' element you found, then read the attribute 'src' from it.
from bs4 import BeautifulSoup
html ="""
<html><body>
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
weather_today = soup.find("div", {"id": "weather_today_content"})
print(weather_today.find('img')['src'])
Outputs:
/database/img/weather_today.jpg?ver=2018-08-01
You can use a CSS selector, which is built into BeautifulSoup (methods select() and select_one()):
data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('div#weather_today_content img')['src'])
Prints:
/database/img/weather_today.jpg?ver=2018-08-01
The selector div#weather_today_content img means select the <div> with id=weather_today_content and, within this <div>, select an <img>.
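No regex is needed for this. Since the src is site-relative, it has to be joined with the page address before the image can be downloaded. The sketch below does that with urljoin from the standard library. The page_url value is a made-up placeholder, not the real site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical address of the page being scraped (placeholder only).
page_url = "https://example.com/weather/today.html"

html = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""

soup = BeautifulSoup(html, "html.parser")

img = soup.select_one("div#weather_today_content img")
# .get() returns None instead of raising if the attribute is missing.
src = img.get("src") if img is not None else None

# A leading "/" makes the path relative to the site root.
image_url = urljoin(page_url, src) if src else None
print(image_url)
```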
The following code works:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
news_uri = 'http://www3.nhk.or.jp/news/easy/k10011356641000/k10011356641000.html'
r = requests.get(news_uri)
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')
body = soup.find('div', attrs={'id':'newsarticle'})
#body.div.unwrap()
for match in body.find_all('span'):
    match.unwrap()
for match in body.find_all('a'):
    match.unwrap()
print(str(body))
However, if you uncomment body.div.unwrap() it results in the following error:
Traceback (most recent call last):
File "test_div.py", line 13, in <module>
body.div.unwrap()
AttributeError: 'NoneType' object has no attribute 'unwrap'
I have done a test using the plain text output from:
body = soup.find('div', attrs={'id':'newsarticle'})
This then works as expected and removes the outer div. Any suggestions?
After body = soup.find('div', attrs={'id':'newsarticle'}), the body variable contains the following HTML:
<div id="newsarticle">
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<p></p>
<p></p>
</div>
That means the <div> contains only <p> tags. body.div is shorthand for body.find('div'), which looks for a div tag among the descendants of the current div tag. Since no such tag is present, body.div evaluates to None.
Because of this, body.div.unwrap() evaluates to None.unwrap() which as you can see will throw the error AttributeError: 'NoneType' object has no attribute 'unwrap'.
If you want to remove the div tag, simply use this:
body = soup.find('div', attrs={'id':'newsarticle'})
body.unwrap()
or
soup.find('div', attrs={'id':'newsarticle'}).unwrap()
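A small sketch of both points, using a made-up two-paragraph document rather than the real NHK page: body.div is None because there is no nested div, and unwrap() on the found tag removes the div while leaving its contents in the surrounding tree.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the article markup (assumption, not the real page).
html = '<section><div id="newsarticle"><p>one</p><p>two</p></div></section>'
soup = BeautifulSoup(html, "html.parser")

body = soup.find("div", attrs={"id": "newsarticle"})

# No div inside the div, so the attribute shortcut yields None.
nested = body.div

# Replace the div with its own children inside the parent.
body.unwrap()

# The paragraphs now sit directly in <section>; inspect the tree,
# not the unwrapped tag, to see them.
result = str(soup)
print(result)
```

After unwrap() the paragraphs live in the parent, so it is the parent or the soup that should be printed.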