I'm looking to scrape the image src from this HTML code, could anyone help me? I want to get the link for each image, but it doesn't seem to work: at the moment it won't display the image link. I tried adding src=True, but that doesn't fix it, it just prints None. I've looked on this platform for ideas but haven't been able to solve the problem, so maybe I'm doing something wrong. Any help would be appreciated.
The code
import requests, lxml.html
from bs4 import BeautifulSoup
url = requests.get("https://www.carsireland.ie/used-cars/bmw")
content = url.content
pri = lxml.html.fromstring(url.content)
soup = BeautifulSoup(content, 'lxml')
rows = soup.find_all("article", {"class": "listing"})
for row in rows:
    img1 = row.find('div', {"class": "listing__images--main"}, 'img')
    img2 = row.find('div', {"class": "listing__images--small"}, 'img')
    img3 = row.find('div', {"class": "listing__images--small"}, 'img')
    print(img2)
HTML CODE
<article about="/2739145" class="listing" role="article">
<div class="listing__images--main">
<img alt="BMW 316 2007" loading="lazy" src="https://c0.carsie.ie/d43864c90df075c94489ddbe4ca5ffe9f6541083f25076da0bf4218f7baa03f6.jpg" />
</div>
<div class="listing__images--small">
<img alt="BMW 316 2007" loading="lazy" src="https://c0.carsie.ie/d43864c90df075c94489ddbe4ca5ffe92d2f6814d6ed7098b46525ba484aca27.jpg" />
<img alt="BMW 316 2007" loading="lazy" src="https://c0.carsie.ie/d43864c90df075c94489ddbe4ca5ffe905915c7415aa1ce328840dfdbfd7d9cd.jpg" />
</div>
<div class="listing__details listing__details--desktop">
<div class="listing__details-location">
Meath
</div>
<div class="listing__details-vehicle">
<h2>BMW 316</h2>
<p>316I ES Z3SQ 4DR E90 SALOON N45 1.6</p>
</div>
<div class="listing__details-data">
<div class="listing__details-data-year">
<p>2007</p>
</div>
<div class="listing__details-data-mileage">
309 km
</div>
</div>
<div class="listing__details-pricing">
€900
<div class="listing__details-private-seller">Private</div>
</div>
<div class="listing__details-color">
<span class="" style="background-color: black;"></span>
<p>BLACK</p>
</div>
</div>
</article>
I would go from the parent article tag level, loop over those, and extract all the img tags:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.carsireland.ie/used-cars/bmw')
soup = bs(r.content,'html.parser')
listings = soup.select('article')
for listing in listings:
    print([i['src'] for i in listing.select('img')])
EDIT:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.carsireland.ie/used-cars/bmw')
soup = bs(r.content,'html.parser')
listings = soup.select('article')
for listing in listings:
    large = listing.select_one('.listing__images--main img')['src']
    small_top = listing.select_one('.listing__images--small img')['src']
    small_btm = listing.select_one('.listing__images--small img + img')['src']
    print(large, small_top, small_btm)
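If a listing is missing one of those images, select_one() returns None and indexing ['src'] raises a TypeError. A minimal defensive sketch (the safe_src helper name is mine, not part of the answer above):
import requests
from bs4 import BeautifulSoup as bs

def safe_src(parent, css):
    # Return the src of the first match, or None when the element is absent.
    tag = parent.select_one(css)
    return tag['src'] if tag else None

r = requests.get('https://www.carsireland.ie/used-cars/bmw')
soup = bs(r.content, 'html.parser')
for listing in soup.select('article'):
    large = safe_src(listing, '.listing__images--main img')
    small_top = safe_src(listing, '.listing__images--small img')
    small_btm = safe_src(listing, '.listing__images--small img + img')
    print(large, small_top, small_btm)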
Related
I am trying to do web scraping using BeautifulSoup. The code I have written is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(type(questions[0]))
When I run the code, I get the error message below:
print(type(questions[10]))
IndexError: list index out of range
Then I tried to print the list, like below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(questions)
And then I get an empty list: []
What am I doing wrong?
Thanks for your answers.
.question-summary is an incorrect locator: it isn't a class, it's a portion of the id, and each id value starts with question-summary. With the selector below it works.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select('[id^="question-summary"]')
print(questions)
Output:
1" data-post-type-id="1" id="question-summary-71715531">
<div class="s-post-summary--stats js-post-summary-stats">
<div class="s-post-summary--stats-item s-post-summary--stats-item__emphasized" title="Score of 0">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">votes</span>
</div>
<div class="s-post-summary--stats-item" title="0 answers">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">answers</span>
</div>
<div class="s-post-summary--stats-item" title="5 views">
<span class="s-post-summary--stats-item-number">5</span>
<span class="s-post-summary--stats-item-unit">views</span>
</div>
</div>
<div class="s-post-summary--content">
<h3 class="s-post-summary--content-title">
<a class="s-link" href="/questions/71715531/is-it-possible-to-draw-a-logistic-regression-graph-with-multiple-x-variable">Is it possible to draw a
logistic regression graph with multiple x variable?</a>
</h3>
<div class="s-post-summary--content-excerpt">
Currently, this is my X and V value. May I know is it possible to draw a logistic regression curve with X that has multiple column? Or I am required to draw multiple graphs to do so?
X = df1.drop(['...
</div>
<div class="s-post-summary--meta">
<div class="s-post-summary--meta-tags tags js-tags t-python-3ûx t-machine-learning">
<a class="post-tag flex--item mt0 js-tagname-python-3ûx" href="/questions/tagged/python-3.x" rel="tag" title="show questions tagged 'python-3.x'">python-3.x</a> <a class="post-tag flex--item mt0 js-tagname-machine-learning" href="/questions/tagged/machine-learning" rel="tag" title="show questions tagged 'machine-learning'">machine-learning</a>
</div>
<div class="s-user-card s-user-card__minimal">
<a class="s-avatar s-avatar__16 s-user-card--avatar" href="/users/14128881/christopher-chua"> <div class="gravatar-wrapper-16" data-user-id="14128881">
<img ,="" alt="user avatar" class="s-avatar--image" height="16" src="https://lh6.googleusercontent.com/-Sn3B_E5hiJc/AAAAAAAAAAI/AAAAAAAAAAA/AMZuucl1oyfdhJiXhrx73JLYqzKAK9icag/photo.jpg?sz=32" width="16"/>
</div>
</a>
<div class="s-user-card--info">
<div class="s-user-card--link d-flex gs4">
<a class="flex--item" href="/users/14128881/christopher-chua">Christopher Chua</a>
</div>
<ul class="s-user-card--awards">
<li class="s-user-card--rep"><span class="todo-no-class-here" dir="ltr" title="reputation score ">7</span></li>
</ul>
</div>
<time class="s-user-card--time">asked <span class="relativetime" title="2022-04-02 07:03:06Z">13 mins ago</span></time>
.. so on
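If only the question titles are needed rather than the full elements, here is a minimal sketch built on the markup visible in the output above (the .s-post-summary--content-title class is taken from that output, not from any documented API):
import requests
from bs4 import BeautifulSoup

response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")

for summary in soup.select('[id^="question-summary"]'):
    # The title link sits inside the h3 shown in the output above.
    title = summary.select_one(".s-post-summary--content-title a")
    if title:
        print(title.get_text(strip=True))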
I have tried to scrape a price from a certain website; a small sample of the HTML code is below:
</div>
</div>
<div class="right custom">
<div class="description custom">
<aside>
<h4>Availability:</h4>
<div>
<span class="label green">In Stock</span>
</div>
</aside>
<aside>
<h4>Price:</h4>
<div>
<span class="label">£65.40</span>
</div>
</aside>
<aside>
<h4>Ex Tax:</h4>
<div>
<span class="label">£54.50</span>
</div>
</aside>
<div class="price">
£65.40 </div>
<section class="custom-order">
<div class="options">
<div class="option" id="option-276">
<span class="required">*</span>
<label>Type & Extras:</label><br/>
<select name="option[276]">
<option value=""> --- Please Select --- </option>
<option value="146">Each </option>
</select>
</div>
</div>
<div class="quantity custom">
<label>Quantity:</label><br/>
<input name="quantity" size="2" type="text" value="1"/>
</div>
</section>
<!-- -->
<div class="cart">
<div>
I am trying to select the price of £54.50 (which is the price without UK tax).
The code I have used is below:
import requests
from bs4 import BeautifulSoup
import pandas as pd
var1 = requests.get("https://www.website.co.uk",
headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
var2 = var1.content
soup=BeautifulSoup(var2, "html.parser")
span = soup.find("span", {"class":"label"})
price = span.text
price
Output: 'In Stock'
This 'In Stock' is located a few lines earlier in the HTML code.
<div>
<span class="label green">In Stock</span>
Can somebody please point me in the direction of picking up the correct span?
With span = soup.find("span", {"class":"label"}) you selected the first span whose class contains label, and that's what you got. You get the expected value with span = soup.find_all("span", {"class":"label"}, limit=3)[2].
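Put together, a minimal sketch against a trimmed copy of the fragment from the question (only the three label spans matter here):
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question.
html = """
<aside><h4>Availability:</h4><div><span class="label green">In Stock</span></div></aside>
<aside><h4>Price:</h4><div><span class="label">£65.40</span></div></aside>
<aside><h4>Ex Tax:</h4><div><span class="label">£54.50</span></div></aside>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first span whose classes include "label" ("label green" matches too),
# which is why the question's code printed "In Stock".
print(soup.find("span", {"class": "label"}).text)                   # In Stock

# find_all() with limit=3 collects the first three matches; index 2 is the ex-tax price.
print(soup.find_all("span", {"class": "label"}, limit=3)[2].text)   # £54.50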
You can use a CSS Selector nth-child():
from bs4 import BeautifulSoup
txt = """THE ABOVE HTML"""
soup = BeautifulSoup(txt, "html.parser")
print(soup.select_one("aside:nth-child(3) > div > span").text)
Output:
£54.50
Another method.
from simplified_scrapy.spider import SimplifiedDoc
html = '''your html
'''
doc = SimplifiedDoc(html) # create doc
span = doc.getElement('span', start="Price:")
print (span.text)
Result:
£65.40
I am trying to get the links in the image-file attribute (the relative link, as it is) of the img tags under the div with id previewImages (I don't want the src link).
Here is the sample HTML:
<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>
I tried the following but it only gives me the first link and not all:
import sys
import urllib2
from bs4 import BeautifulSoup
quote_page = sys.argv[1] # this should be the first argument on the command line
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
images_box = soup.find('div', attrs={'id': 'previewImages'})
if images_box.find('img'):
    imagesurl = images_box.find('img').get('image-file')
    print imagesurl
How can I get all the links in the image-file attribute for the img tags in the div with id previewImages?
Use .findAll
Ex:
from bs4 import BeautifulSoup
html = """<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
images_box = soup.find('div', attrs={'id': 'previewImages'})
for link in images_box.findAll("img"):
    print link.get('image-file')
Output:
/image/15.jpg
/image/2.jpg
/image/0.jpg
/image/3.jpg
/image/4.jpg
I think it's faster to use the id with an attribute selector passed to select():
from bs4 import BeautifulSoup as bs
html = '''
<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>
'''
soup = bs(html, 'lxml')
links = [item['image-file'] for item in soup.select('#previewImages [image-file]')]
print(links)
BeautifulSoup has the method .find_all() - check the docs. This is how you can use it in your code:
import sys
import urllib2
from bs4 import BeautifulSoup
quote_page = sys.argv[1] # this should be the first argument on the command line
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
images_box = soup.find('div', attrs={'id': 'previewImages'})
links = [img['image-file'] for img in images_box('img')]
print links # in Python 3: print(links)
To add on, in case we need to do the same scenario with lxml:
import lxml.html
sample = '''the HTML from the question'''
tree = lxml.html.fromstring(sample)
images = tree.xpath("//img/@image-file")  # "@" selects an attribute value in XPath
print(images)
Output
['/image/15.jpg', '/image/2.jpg', '/image/0.jpg', '/image/3.jpg', '/image/4.jpg']
I'm trying to extract tags with a given class from an HTML file, but only if they are located before a given stopping point. What I have is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
This works, but it finds all instances of myclass, and I only want those that appear before the following block in the soup:
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
What makes this block unique is the Title text N lines, especially the Title text N2. line. There are many cat-title tags before it, so I can't use that class alone as a stopping condition.
The code surrounding this block looks like this:
...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...
continuing both above and below.
How can I do this?
Try using find_all_previous():
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55') # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")
This will stop at the first <h4 class="cat-title" id="55"> tag in the event that there are multiple.
Reference: Beautiful Soup Documentation
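One thing to keep in mind: find_all_previous() walks backwards from the stop tag, so the spans come out in reverse document order. A small follow-up sketch, assuming the same soup and stop_at as above:
# Nearest-preceding match comes first, so reverse to get page order.
for span in reversed(stop_at.find_all_previous("span", class_="myclass")):
    print(span.get_text(strip=True))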
How about this:
page = requests.get("https://mysite")
# Split your page and unwanted string, then parse with BeautifulSoup
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
You can try something like this:
from bs4 import BeautifulSoup
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')
for i in soup.find_all():
    if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
        if i.find("small") and i.find("small").text.strip() == "Title text N2.":
            break
    elif i.name == 'span' and i.has_attr('class') and i['class'][0] == 'myclass':
        print(i)
Outputs:
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
I found the HTML that contains all the information I want. I need to extract the title, href and src from it.
HTML:
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3090" class="main">
<img src="/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop" alt="apple" title="apple" />
</a>
</div>
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3091" class="main">
<img src="/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop" alt="banana" title="banana" />
</a>
</div>
Code:
import requests
from bs4 import BeautifulSoup
res = requests.get('http://www.cad.com/')
soup = BeautifulSoup(res.text,"lxml")
for a in soup.findAll('div', {"id":"home"}):
    for b in a.select(".main"):
        print("http://www.cad.com" + b.get('href'))
        print(b.get('title'))
I can successfully get the href from this, but since the title and src are on another line, I don't know how to extract them. After this I want to save them in Excel, so maybe I need to finish the first step before doing the second.
Expected output:
/slim?p=3090
apple
/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop
/slim?p=3091
banana
/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop
My own solution:
import requests
from bs4 import BeautifulSoup
res = requests.get('http://www.cad.com/')
soup = BeautifulSoup(res.text,"lxml")
for a in soup.findAll('div', {"id": "home"}):
    divs = a.findAll('div', {"class": "home-hot-thumb"})
    for div in divs:
        title = div.img.get('title')
        print(title)
        href = 'http://www.cad.com/' + div.a.get('href')
        print(href)
        src = 'http://www.cad.com/' + div.img.get('src')
        print(src.replace('?w=70&h=70&mode=crop', ''))
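Since the question mentions saving the results to Excel, here is a minimal sketch that collects the same three values per thumbnail and writes them with pandas (this assumes pandas plus an Excel writer such as openpyxl are installed; cars.xlsx is just a placeholder filename):
import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get('http://www.cad.com/')
soup = BeautifulSoup(res.text, "lxml")

rows = []
for a in soup.findAll('div', {"id": "home"}):
    for div in a.findAll('div', {"class": "home-hot-thumb"}):
        rows.append({
            "title": div.img.get('title'),
            "href": 'http://www.cad.com/' + div.a.get('href'),
            "src": ('http://www.cad.com/' + div.img.get('src')).replace('?w=70&h=70&mode=crop', ''),
        })

# One row per thumbnail: title, href, src.
pd.DataFrame(rows).to_excel('cars.xlsx', index=False)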