Scraping images from websites in a list - python

I would like to know whether it is possible to scrape images from websites with code that works for all types of websites (that is, independently of the HTML format).
I have a list of websites and I need to get all the images related to each link.
For instance:
list_of_links = ['https://www.bbc.co.uk/programmes/articles/5nxMx7d1K8S6nhjkPBFhHSM/withering-wit-and-words-of-wisdom-oscar-wildes-best-quotes', 'https://www.lastampa.it/torino/2020/03/31/news/coronavirus-il-lockdown-ha-gia-salvato-almeno-400-vite-umane-1.38659569']  # and so on
In general, I would use:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
link='...'
html = urlopen(link)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src':re.compile('.jpg')})
for image in images:
    print(image['src']+'\n')
but I have doubts about the HTML (can this be used for every website?) and about the image format (.jpg; would it be the same for all websites?).
Thank you for all your comments and suggestions.

Assuming all the images are referenced through the src attribute of img tags and those elements aren't added dynamically by JavaScript (no virtual DOM), modifying your code a little would work. Dropping the .jpg filter returns every img tag regardless of image format:
from urllib.request import urlopen
from bs4 import BeautifulSoup
link = '...'
html = urlopen(link)
bs = BeautifulSoup(html, 'html.parser')
# src=True keeps only the img tags that actually have a src attribute
images = bs.find_all('img', src=True)
for image in images:
    print(image['src'] + '\n')
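To run this over a whole list of pages, one option is to wrap the extraction in a function and resolve relative src values against each page's URL. The sketch below is a minimal example, not a universal scraper: the helper names `extract_image_urls` and `scrape_all` are made up for illustration, and images injected by JavaScript or referenced only from CSS will still be missed.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_image_urls(html, page_url):
    """Return absolute URLs for every <img> with a usable src, any format."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for img in soup.find_all('img', src=True):
        src = img['src'].strip()
        # Skip empty values and inline data: URIs, which are not links.
        if not src or src.startswith('data:'):
            continue
        # urljoin leaves absolute URLs alone and resolves relative ones.
        urls.append(urljoin(page_url, src))
    return urls


def scrape_all(links):
    """Map each page URL to its image URLs (hypothetical helper)."""
    results = {}
    for link in links:
        with urlopen(link) as response:
            results[link] = extract_image_urls(response.read(), link)
    return results
```

Calling `scrape_all(list_of_links)` would then return a dictionary keyed by page URL. Some sites reject requests without browser-like headers, so individual pages may still need special handling.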

Related

How to scrape main headings of a website using python in colab?

Hi, I am a beginner and would like to get the list of all datasets from the website 'https://www.kaggle.com/datasets' based on the filters 'csv' and 'only datasets with tasks'.
I applied the filters and inspected the element. My attempt returns an empty list. This is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.kaggle.com/datasets?sort=usability&fileType=csv&tasks=true'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
titles = soup.find_all('li')
print(titles)
Can anyone help?

Can't extract src attribute from "img" tag with BeautifulSoup

I'm working on a project and I'm trying to extract the pictures' URLs from a website. I'm new to this, so please bear with me. Based on the HTML code, the class of the pictures that I want is "fotorama__img". However, when I execute my code, it finds nothing. Does anyone know why that's the case? Also, why doesn't the src attribute contain the whole URL, just a part of it? Example: the link to the image is https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_front.jpg but the src attribute of the img tag is "/files_SYS/images/System/sysThumb/SYS-120U-TNR_main.png".
Here is my code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.supermicro.com/en/products/system/Ultra/1U/SYS-120U-TNR")
soup = BeautifulSoup(page.content,'lxml')
images = soup.find_all("img", {"class": "fotorama__img"})
for image in images:
    print(image.get("src"))
And here is the picture of the HTML code for the page
Thank you for your help!
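The class is added dynamically via JavaScript, so BeautifulSoup doesn't see it in the HTML that requests downloads. The static page does contain links to the images inside the element with class "fotorama", so to extract the images from this site, you can do: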
import requests
from bs4 import BeautifulSoup
page = requests.get(
    "https://www.supermicro.com/en/products/system/Ultra/1U/SYS-120U-TNR"
)
soup = BeautifulSoup(page.content, "lxml")
images = [
    "https://www.supermicro.com" + a["href"]
    for a in soup.select(".fotorama > a")
]
print(*images, sep="\n")
Prints:
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_main.png
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_angle.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_top.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_front.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_rear.jpg
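On the second part of the question: a src that starts with "/" is a relative URL, which the browser resolves against the address of the page. `urllib.parse.urljoin` from the standard library performs the same resolution, which is a little more general than prepending the host by hand. A small sketch using the URLs quoted in the question:

```python
from urllib.parse import urljoin

page_url = "https://www.supermicro.com/en/products/system/Ultra/1U/SYS-120U-TNR"
src = "/files_SYS/images/System/sysThumb/SYS-120U-TNR_main.png"

# A leading "/" means "relative to the site root", so only the scheme
# and host are taken from the page URL.
absolute = urljoin(page_url, src)
print(absolute)

# An already absolute URL is returned unchanged.
unchanged = urljoin(
    page_url,
    "https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_front.jpg",
)
print(unchanged)
```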

Having issues scraping the image from a website using bs4

Hey I can't seem to scrape the images from this website
https://www.nike.com/gb/w/new-mens-shoes-3n82yznik1zy7ok
I am using the following code
product.find('img', {'class': 'css-1fxh5tw product-card__hero-image'})['src']
It returns this
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Your code is not wrong. Some of the img tags carry an inline data: URI in src instead of a link; skipping those leaves the real image URLs:
import requests
from bs4 import BeautifulSoup
url ="https://www.nike.com/gb/w/new-mens-shoes-3n82yznik1zy7ok"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
images = soup.find_all('img', {'class':'css-1fxh5tw product-card__hero-image'},src=True)
for i in images:
    if 'data:image' not in i['src']:
        print(i['src'])
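The value returned in the question is not an image link but an inline data: URI. Decoding it with the standard library suggests it is a 1x1 GIF, the kind of placeholder that pages often put in src before the real image is loaded by JavaScript, which would explain why such values are worth skipping. A quick check:

```python
import base64
import struct

src = "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"

# Split "data:<mime>;base64,<payload>" into its header and payload.
header, payload = src.split(",", 1)
raw = base64.b64decode(payload)

# A GIF file starts with a 6-byte signature followed by the width and
# height as little-endian 16-bit integers.
signature = raw[:6]
width, height = struct.unpack("<HH", raw[6:10])
print(header, signature, width, height)
```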

Extracting the count of linked images

I am trying to find the number of images (extensions .jpg, .png, .jpeg) linked in a page using Python. I can use any library, such as BeautifulSoup. How do I do it?
I am using following code :
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('HTMLS%5C110k_Source.htm'), "html.parser")
img_links = len(soup.find_all('.jpg'))
print("Number of Images : ", img_links)
But all in vain.
You can try lxml.html, as below:
from lxml import html
with open('HTMLS%5C110k_Source.htm', 'r') as f:
    source = html.fromstring(f.read())
print(len(source.xpath('//img[contains(@src, ".jpg") or contains(@src, ".jpeg") or contains(@src, ".png")]')))
This is as easy as writing a loop, if you read the docs:
import bs4
import requests
url = 'http://somefoobar.net'  # placeholder; requests needs the scheme
file_types = ('jpg', 'jpeg', 'png')
page = requests.get(url).text
soup = bs4.BeautifulSoup(page, 'lxml')
# src=True skips img elements that have no src attribute
images = soup.find_all('img', src=True)
# loop through all img elements found and store the urls with matching extensions
urls = [x['src'] for x in images if x['src'].split('.')[-1].lower() in file_types]
print(urls)
print(len(urls))
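Splitting on '.' works for plain file names but can miscount when a src carries a query string (for example `photo.jpg?w=300`). One way to make the check sturdier, sketched here with the standard library only, is to look at the path component of each URL. The function name `count_images_by_extension` and the sample values are invented for illustration; in practice the src values would come from the img elements found above.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse

WANTED = {'.jpg', '.jpeg', '.png'}


def count_images_by_extension(srcs, wanted=WANTED):
    """Count src values whose URL path ends in one of the wanted extensions."""
    counts = Counter()
    for src in srcs:
        # urlparse separates the path from any ?query or #fragment.
        suffix = PurePosixPath(urlparse(src).path).suffix.lower()
        if suffix in wanted:
            counts[suffix] += 1
    return counts


sample = ['a.jpg', '/img/b.PNG', 'https://example.com/c.jpeg?w=300', 'logo.svg', 'd.jpg#top']
counts = count_images_by_extension(sample)
print(counts, sum(counts.values()))
```

The total number of matching images is `sum(counts.values())`, and the per-extension breakdown comes for free.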

How to save graph / image from CGI website in python?

http://www.wunderground.com/history/airport/KMDW/2014/11/17/MonthlyHistory.html?req_city=NA&req_state=NA&req_statename=NA
On the page linked above, I am trying to save the "Monthly Weather History Graph" from a Python script. I have tried everything I can think of using BeautifulSoup and urllib.
What I have been able to do is get to the point below, which I can extract, but I can not figure out how to save that graph as an image/HTML/PDF/anything. I am really not familiar with CGI, so any guidance here is much appreciated.
<div id="history-graph-image">
<img src="/cgi-bin/histGraphAll?day=17&year=2014&month=11&ID=KMDW&type=1&width=614" alt="Monthly Weather History Graph" />
Get the page with requests, parse the HTML with BeautifulSoup, find the img tag inside div with id="history-graph-image" and get the src attribute value:
from urllib.parse import urljoin  # Python 3 location of urljoin
from bs4 import BeautifulSoup
import requests
base_url = 'http://www.wunderground.com'
url = 'http://www.wunderground.com/history/airport/KMDW/2014/11/17/MonthlyHistory.html?req_city=NA&req_state=NA&req_statename=NA'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
image_relative_url = soup.find('div', id='history-graph-image').img.get('src')
image_url = urljoin(base_url, image_relative_url)
print(image_url)
Prints:
http://www.wunderground.com/cgi-bin/histGraphAll?day=17&year=2014&month=11&ID=KMDW&type=1&width=614
Then, download the file with urllib.request.urlretrieve():
from urllib.request import urlretrieve
urlretrieve(image_url, "image.gif")
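Because the CGI URL has no file extension, the name "image.gif" above is a guess. If the server sends a Content-Type header, the extension can be derived from it instead. A minimal sketch, assuming the header is present and accurate; `filename_for` is a made-up helper, and with requests the header value would come from something like `response.headers.get('Content-Type')`:

```python
import mimetypes


def filename_for(content_type, stem="image", default=".bin"):
    """Build a file name whose extension matches a Content-Type value."""
    # Drop parameters such as "; charset=binary" and normalise case.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    extension = mimetypes.guess_extension(mime) if mime else None
    # Fall back to a neutral extension when the type is missing or unknown.
    return stem + (extension or default)


print(filename_for("image/gif"))
print(filename_for("image/png; charset=binary"))
print(filename_for(None))
```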
