BS4 cannot select correct 'span' - python

I have tried to scrape a price from a certain website, a small sample of the HTML code is below:
</div>
</div>
<div class="right custom">
<div class="description custom">
<aside>
<h4>Availability:</h4>
<div>
<span class="label green">In Stock</span>
</div>
</aside>
<aside>
<h4>Price:</h4>
<div>
<span class="label">£65.40</span>
</div>
</aside>
<aside>
<h4>Ex Tax:</h4>
<div>
<span class="label">£54.50</span>
</div>
</aside>
<div class="price">
£65.40 </div>
<section class="custom-order">
<div class="options">
<div class="option" id="option-276">
<span class="required">*</span>
<label>Type & Extras:</label><br/>
<select name="option[276]">
<option value=""> --- Please Select --- </option>
<option value="146">Each </option>
</select>
</div>
</div>
<div class="quantity custom">
<label>Quantity:</label><br/>
<input name="quantity" size="2" type="text" value="1"/>
</div>
</section>
<!-- -->
<div class="cart">
<div>
I am trying to select the price of £54.50 (which is the price without UK tax).
The code I have used is below:
import requests
from bs4 import BeautifulSoup
import pandas as pd
var1 = requests.get("https://www.website.co.uk",
headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
var2 = var1.content
soup=BeautifulSoup(var2, "html.parser")
span = soup.find("span", {"class":"label"})
price = span.text
price
Output: 'In Stock'
This 'In Stock' is located a few lines earlier in the HTML code.
<div>
<span class="label green">In Stock</span>
Can somebody please point me in the direction of picking up the correct span?

soup.find("span", {"class":"label"}) returns the first span whose class list contains label, and <span class="label green"> qualifies, so you got 'In Stock'. To get the expected value, take the third match instead: span = soup.find_all("span", {"class":"label"}, limit=3)[2]
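Indexing into the result list works only while the page keeps the same order. A minimal sketch of an alternative, assuming the "Ex Tax:" heading text stays stable: find the h4 by its text and take the span inside the same aside. The HTML here is a trimmed copy of the question's sample.

```python
from bs4 import BeautifulSoup

html = """
<div class="description custom">
<aside><h4>Availability:</h4><div><span class="label green">In Stock</span></div></aside>
<aside><h4>Price:</h4><div><span class="label">£65.40</span></div></aside>
<aside><h4>Ex Tax:</h4><div><span class="label">£54.50</span></div></aside>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a multi-valued attribute, so "label" also matches class="label green"
first = soup.find("span", {"class": "label"}).text

# Locate the heading by its exact text, then the span in the same <aside>
heading = soup.find("h4", string="Ex Tax:")
ex_tax = heading.find_parent("aside").find("span", class_="label").text

print(first)
print(ex_tax)
```

If the heading is missing, soup.find returns None, so real code should check heading before calling find_parent.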

You can use a CSS Selector nth-child():
from bs4 import BeautifulSoup
txt = """THE ABOVE HTML"""
soup = BeautifulSoup(txt, "html.parser")
print(soup.select_one("aside:nth-child(3) > div > span").text)
Output:
£54.50
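Note that nth-child counts every sibling element, so the selector above breaks if another element is inserted before the asides. As a rough sketch of a different filter, the stock badge can be excluded by its extra class with :not(); this still relies on the ex-tax price being the last remaining span, which is an assumption about the page.

```python
from bs4 import BeautifulSoup

html = """
<aside><h4>Availability:</h4><div><span class="label green">In Stock</span></div></aside>
<aside><h4>Price:</h4><div><span class="label">£65.40</span></div></aside>
<aside><h4>Ex Tax:</h4><div><span class="label">£54.50</span></div></aside>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep spans with class "label" but drop the one that also carries "green"
prices = [s.get_text(strip=True) for s in soup.select("span.label:not(.green)")]

# Assumption: the ex-tax price is the last price span on the page
ex_tax = prices[-1]
print(prices)
```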

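Another method, using the simplified_scrapy library. Note that start="Price:" returns the tax-inclusive price; to get the £54.50 asked for, the start marker would presumably need to be "Ex Tax:".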
from simplified_scrapy.spider import SimplifiedDoc
html = '''your html
'''
doc = SimplifiedDoc(html) # create doc
span = doc.getElement('span', start="Price:")
print (span.text)
Result:
£65.40

Related

Scrape values inside span class webpage with beautifulsoup python

Hello everyone. I'm trying to scrape a webpage that has a large number of span elements, most of which hold information I don't need. Below is the section of span data that I do need; I can't simply use find_all on span because there are hundreds of others that aren't needed.
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
I need the span titles:
File Number, Location, Date
and then the values that match:
"A-21-897274", "Ohio", "07/01/2022"
I need this printed out so I can make a pandas DataFrame, but I can't seem to get only the specific spans printed with their values.
What I've tried:
import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(..., 'lxml')
for title_tag in soup.find_all('span', class_='text-muted'):
    # get the last sibling
    *_, value_tag = title_tag.next_siblings
    title = title_tag.text.strip()
    if isinstance(value_tag, bs4.element.Tag):
        value = value_tag.text.strip()
    else:  # it's a navigable string element
        value = value_tag.strip()
    print(title, value)
output:
File Number "A-21-897274"
Location "Ohio"
Operations_Manager "Joanna"
Date "07/01/2022"
Type "Transfer"
Status "Open"
ETC "ETC"
ETC "ETC"
This will print out everything I need BUT it also prints out 100's of other values I don't want/need.
You can pass a function to soup.find_all to select only the wanted elements, and then use .find_next_sibling() to get the value. For example:
from bs4 import BeautifulSoup
html_doc = """
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
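def correct_tag(tag):
    # keep only <span> tags whose text is one of the wanted titles
    return tag.name == "span" and tag.get_text(strip=True) in {
        "File Number",
        "Location",
        "Date",
    }
for t in soup.find_all(correct_tag):
    # text=True matches the next text-node sibling (newer bs4 versions spell it string=True)
    print(f"{t.text}: {t.find_next_sibling(text=True).strip()}")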
Prints:
File Number: A-21-897274
Location: Ohio
Date: 07/01/2022
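Since the goal is a pandas DataFrame, the same filtering can collect the pairs into a dict instead of printing them. This is a sketch under the assumption that each value is the first text node following its span; the "Type" entry is included only to show an unwanted field being skipped.

```python
from bs4 import BeautifulSoup

html = """
<div class="col-md-4"><p><span class="text-muted">File Number</span><br>
A-21-897274
</p></div>
<div class="col-md-4"><p><span class="text-muted">Type</span><br>
Transfer
</p></div>
<div class="col-md-4"><p><span class="text-muted">Location</span><br>
Ohio
</p></div>
<div class="col-md-4"><p><span class="text-muted">Date</span><br>
07/01/2022
</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

wanted = {"File Number", "Location", "Date"}
record = {}
for span in soup.find_all("span", class_="text-muted"):
    title = span.get_text(strip=True)
    if title not in wanted:
        continue  # skip the spans we do not need
    value = span.find_next_sibling(string=True)  # first text node after the span
    record[title] = value.strip() if value else None

print(record)
```

A list of such dicts, one per page or record, can then be passed to pandas.DataFrame.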

how to get text of the latest post with BeautifulSoup, select()

I'd like to get the latest post's text using BeautifulSoup and the select() method.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}
url = "https:// "
req = requests.get(url, headers=headers)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
link = soup.select('#flagList > div.clear.ab-webzine > div > a')
title = soup.select('#flagList > div.clear.ab-webzine > div > div.wz-item-header > a > span')
latest_link = link[0]['href'] # link of latest post (assumes href holds an absolute URL)
latest_title = title[0].text # title of latest post
# to get the text of latest post
t_url = latest_link
t_req = requests.get(t_url, headers=headers)
t_html = t_req.text
t_soup = BeautifulSoup(t_html, 'html.parser')
maintext = t_soup.select('#flagArticle > div.document_1234567_0.rhymix_content.xe_content')
print(maintext)
It returns [].
I copied the selector #flagArticle > div.document_1234567_0.rhymix_content.xe_content from Chrome developer tools on one post, so it contains that specific post's number, "1234567".
But I want the text of the latest post, not of one particular post.
So I changed the selector to just #flagArticle.
It returns the following:
[<article id="flagArticle">
<!--BeforeDocument(1234567,0)-->
<div class="document_1234567_0 rhymix_content xe_content"><p>TEXTTEXTTEXT 1</p>
<p>TEXTTEXTTEXT 2</p>
<p>TEXTTEXTTEXT 3</p></div><!--AfterDocument(1234567,0)-->
<!--
-- color class --
vb-white
vb-green
vb-blue
vb-skyblue
vb-orange
vb-red
-->
<div class="vote">
<button class="vb-btn vb-orange" onclick="vote_doVote('Up','1234567');return false;" type="button">
<span class="lang">
<i class="fas fa-star fa-spin fa-fw"></i>
recommended </span>
<span class="num" id="vm_v_count">
4 </span>
</button> <button class="vb-btn vb-skyblue" onclick="vote_doVote('Declare','1234567');return false;" type="button">
<span class="lang">
<i class="fa fa-times-circle"></i>
report </span>
<span class="num" id="vm_d_count">
</span>
</button></div> </article>]
But I want to get
TEXTTEXTTEXT 1
TEXTTEXTTEXT 2
TEXTTEXTTEXT 3
What should I change?
(I can't share the URL because it's a private site.)
Just get the first div.
from bs4 import BeautifulSoup
data = '''\
<article id="flagArticle">
<!--BeforeDocument(1234567,0)-->
<div class="document_1234567_0 rhymix_content xe_content"><p>TEXTTEXTTEXT 1</p>
<p>TEXTTEXTTEXT 2</p>
<p>TEXTTEXTTEXT 3</p></div><!--AfterDocument(1234567,0)-->
<!--
-- color class --
vb-white
vb-green
vb-blue
vb-skyblue
vb-orange
vb-red
-->
<div class="vote">
<button class="vb-btn vb-orange" onclick="vote_doVote('Up','1234567');return false;" type="button">
<span class="lang">
<i class="fas fa-star fa-spin fa-fw"></i>
recommended </span>
<span class="num" id="vm_v_count">
4 </span>
</button> <button class="vb-btn vb-skyblue" onclick="vote_doVote('Declare','1234567');return false;" type="button">
<span class="lang">
<i class="fa fa-times-circle"></i>
report </span>
<span class="num" id="vm_d_count">
</span>
</button></div> </article>
'''
soup = BeautifulSoup(data, 'html.parser')
div = soup.select_one('#flagArticle div.xe_content.rhymix_content')
for p in div.select('p'):
    print(p.text)
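The original selector most likely returned [] because the class document_1234567_0 embeds one post's number, which would differ on the latest post. A short sketch that selects on the class without a number and joins the paragraphs in one call, assuming xe_content marks the post body on every post:

```python
from bs4 import BeautifulSoup

html = """
<article id="flagArticle">
<!--BeforeDocument(1234567,0)-->
<div class="document_1234567_0 rhymix_content xe_content"><p>TEXTTEXTTEXT 1</p>
<p>TEXTTEXTTEXT 2</p>
<p>TEXTTEXTTEXT 3</p></div><!--AfterDocument(1234567,0)-->
<div class="vote"><button type="button">recommended 4</button></div>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# Match on the class that does not contain the post number
body = soup.select_one("#flagArticle > div.xe_content")

# One string per line; whitespace-only strings are dropped by strip=True
text = body.get_text("\n", strip=True) if body else ""
print(text)
```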

Finding the Img src beautifulsoup

I'm trying to scrape the image src values from the HTML below. Could anyone help? I want the link for each image, but it doesn't seem to work. At the moment the code prints the surrounding markup instead of just the link. I tried adding src=True, but that doesn't fix it: it prints None. I've searched this platform for ideas and still haven't been able to solve the problem, so maybe I'm doing something wrong. Any help would be appreciated.
The code
import requests, lxml.html
from bs4 import BeautifulSoup
url = requests.get("https://www.carsireland.ie/used-cars/bmw")
content = url.content
pri = lxml.html.fromstring(url.content)
soup = BeautifulSoup(content, 'lxml')
rows = soup.find_all("article", {"class": "listing"})
for row in rows:
    img1 = row.find('div', {"class": "listing__images--main"}, 'img')
    img2 = row.find('div', {"class": "listing__images--small"}, 'img')
    img3 = row.find('div', {"class": "listing__images--small"}, 'img')
    print(img2)
HTML CODE
<article about="/2739145" class="listing" role="article">
<div class="listing__images--main">
<img alt="BMW 316 2007" loading="lazy" src="https://c0.carsie.ie/d43864c90df075c94489ddbe4ca5ffe9f6541083f25076da0bf4218f7baa03f6.jpg" />
</div>
<div class="listing__images--small">
<img alt="BMW 316 2007" loading="lazy" src="https://c0.carsie.ie/d43864c90df075c94489ddbe4ca5ffe92d2f6814d6ed7098b46525ba484aca27.jpg" />
<img alt="BMW 316 2007" loading="lazy" src="https://c0.carsie.ie/d43864c90df075c94489ddbe4ca5ffe905915c7415aa1ce328840dfdbfd7d9cd.jpg" />
</div>
<div class="listing__details listing__details--desktop">
<div class="listing__details-location">
Meath
</div>
<div class="listing__details-vehicle">
<h2>BMW 316</h2>
<p>316I ES Z3SQ 4DR E90 SALOON N45 1.6</p>
</div>
<div class="listing__details-data">
<div class="listing__details-data-year">
<p>2007</p>
</div>
<div class="listing__details-data-mileage">
309 km
</div>
</div>
<div class="listing__details-pricing">
€900
<div class="listing__details-private-seller">Private</div>
</div>
<div class="listing__details-color">
<span class="" style="background-color: black;"></span>
<p>BLACK</p>
</div>
</div>
</article>
I would start at the parent article tag level, loop over the articles, and extract every img inside each one:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.carsireland.ie/used-cars/bmw')
soup = bs(r.content,'html.parser')
listings = soup.select('article')
for listing in listings:
    print([i['src'] for i in listing.select('img')])
EDIT:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.carsireland.ie/used-cars/bmw')
soup = bs(r.content,'html.parser')
listings = soup.select('article')
for listing in listings:
    large = listing.select_one('.listing__images--main img')['src']
    small_top = listing.select_one('.listing__images--small img')['src']
    small_btm = listing.select_one('.listing__images--small img + img')['src']
    print(large, small_top, small_btm)
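For context on why the question's code printed the div: the third positional argument of find() is recursive, not a nested tag name, so 'img' was never used as a filter. The sketch below shows this on a cut-down listing and reads src with .get() so a missing image yields None instead of an exception. The example.com URLs are placeholders, not the site's real image links.

```python
from bs4 import BeautifulSoup

html = """
<article class="listing">
  <div class="listing__images--main">
    <img alt="car" src="https://example.com/main.jpg" />
  </div>
  <div class="listing__images--small">
    <img alt="car" src="https://example.com/small-1.jpg" />
    <img alt="car" src="https://example.com/small-2.jpg" />
  </div>
</article>
"""
soup = BeautifulSoup(html, "html.parser")
row = soup.find("article", class_="listing")

# "img" lands in the `recursive` parameter, so this still returns the <div>
div = row.find("div", {"class": "listing__images--main"}, "img")
print(div.name)

# Search inside the div for the image, guarding against a missing tag
main_img = div.find("img")
main_src = main_img.get("src") if main_img else None

small_srcs = [img.get("src") for img in row.select(".listing__images--small img")]
print(main_src, small_srcs)
```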

BeautifulSoup: get multiple elements for all img in a div with a specific id

I am trying to get the links in image-file attribute (relative link as it is) in img tags under div with id previewImages (I don't want the src link).
Here is the sample HTML:
<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>
I tried the following but it only gives me the first link and not all:
import sys
import urllib2
from bs4 import BeautifulSoup
quote_page = sys.argv[1] # this should be the first argument on the command line
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
images_box = soup.find('div', attrs={'id': 'previewImages'})
if images_box.find('img'):
    imagesurl = images_box.find('img').get('image-file')
    print(imagesurl)
How can I get all the links in the image-file attribute for the img tags in the div with id previewImages?
Use .findAll
Ex:
from bs4 import BeautifulSoup
html = """<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
images_box = soup.find('div', attrs={'id': 'previewImages'})
for link in images_box.findAll("img"):
    print(link.get('image-file'))
Output:
/image/15.jpg
/image/2.jpg
/image/0.jpg
/image/3.jpg
/image/4.jpg
I think it is quicker to pass an id selector combined with an attribute selector to select:
from bs4 import BeautifulSoup as bs
html = '''
<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>
'''
soup = bs(html, 'lxml')
links = [item['image-file'] for item in soup.select('#previewImages [image-file]')]
print(links)
BeautifulSoup has method .find_all() - check the docs. This is how you can use it in your code:
import sys
import urllib2
from bs4 import BeautifulSoup
quote_page = sys.argv[1] # this should be the first argument on the command line
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
images_box = soup.find('div', attrs={'id': 'previewImages'})
links = [img['image-file'] for img in images_box('img')]
print links # in Python 3: print(links)
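The snippets above target Python 2 (urllib2, print statement). A Python 3 sketch of the same idea follows, with the page fetch replaced by an inline sample so it runs on its own; in a real script the HTML would come from urllib.request.urlopen or requests. The second img without an image-file attribute is added here only to show the guard.

```python
from bs4 import BeautifulSoup

html = """<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/x.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /></a> </div>
</div>"""
soup = BeautifulSoup(html, "html.parser")

images_box = soup.find("div", id="previewImages")

# find() stops at the first match; find_all() returns every <img> in the box
links = []
if images_box is not None:
    for img in images_box.find_all("img"):
        if img.has_attr("image-file"):  # skip images without the attribute
            links.append(img["image-file"])

print(links)
```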
To add to this, the same can be done with lxml:
import lxml.html
# sample holds the HTML string shown in the question
tree = lxml.html.fromstring(sample)
images = tree.xpath("//img/@image-file")
print(images)
Output
['/image/15.jpg', '/image/2.jpg', '/image/0.jpg', '/image/3.jpg', '/image/4.jpg']

Python 3, BeautifulSoup - multiple div tags with the same name

soup = BeautifulSoup(html, "html.parser") # BeautifulSoup(markup, "lxml")
items = soup.find_all("div","_3u1 _gli _uvb", recursive=True)
for item in items:
    abouts = item.find_all("div", {"class":"_glo"}, recursive = True)[0].text
    print (abouts)
HTML page:
<div class="_glo">
<div>
<div class="_ajw">
<div class="_52eh">
"text
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
</div>
</div>
Good afternoon. I am trying to scrape a webpage using BeautifulSoup in Python. I need each of the "text" strings in a separate variable. When I print abouts I get "text text text", but I want the strings separated.
Kind regards
Try this:
items = soup.find_all('div', attrs={'class':'_ajw'})
result = {}  # avoid shadowing the built-in dict
for i in range(len(items)):
    result['text'+str(i+1)] = items[i].find('div', attrs={'class':'_52eh'}).text
print(result)
This will give you something like this:
{'text1': text, 'text2': text, 'text3': text}
I'd use soup.select to apply a class selector to the HTML. It is a quick way to get a list of the matching elements by class:
from bs4 import BeautifulSoup as bs
html = '''
<div class="_glo">
<div>
<div class="_ajw">
<div class="_52eh">
"text
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
</div>
</div>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('._52eh')]
print(items)
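To tie this back to the question: .text on the outer _glo div concatenates every nested string, which is why a single combined string was printed. Once the inner divs are selected individually, the list can be unpacked into separate variables. A small sketch, using the placeholder texts first, second and third in place of the question's "text" and assuming exactly three items:

```python
from bs4 import BeautifulSoup

html = """
<div class="_glo"><div>
  <div class="_ajw"><div class="_52eh">first</div></div>
  <div class="_ajw"><div class="_52eh">second</div></div>
  <div class="_ajw"><div class="_52eh">third</div></div>
</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Text of the outer container: all nested strings joined together
combined = soup.find("div", class_="_glo").get_text(" ", strip=True)

# One entry per inner div
items = [div.get_text(strip=True) for div in soup.select("div._glo div._52eh")]

# Unpacking raises ValueError if the count is not exactly three
first, second, third = items
print(combined)
print(first, second, third)
```

When the number of items is not known in advance, keeping the list (or a dict as in the other answer) is safer than unpacking.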
