I am trying to find the number of images (extensions .jpg, .jpeg, .png), along with their links, through Python. I can use any library, such as BeautifulSoup. But how do I do it?
I am using following code :
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('HTMLS\110k_Source.htm'), "html.parser")
img_links = len(soup.find_all('.jpg'))
print("Number of Images : ", img_links)
But all in vain.
You can try to use lxml.html as below:
from lxml import html
with open('HTMLS\110k_Source.htm', 'r') as f:
    source = html.fromstring(f.read())

print(len(source.xpath('//img[contains(@src, ".jpg") or contains(@src, ".jpeg") or contains(@src, ".png")]')))
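For comparison, a minimal sketch of the same count with BeautifulSoup itself, assuming the file path from the question is valid:

from bs4 import BeautifulSoup

# Sketch: count <img> tags whose src ends in an image extension.
with open(r'HTMLS\110k_Source.htm', 'r') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

srcs = [img.get('src', '') for img in soup.find_all('img')]
count = sum(1 for src in srcs if src.lower().endswith(('.jpg', '.jpeg', '.png')))
print("Number of Images : ", count)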
This is as easy as writing a loop once you've read the docs:
import bs4
import requests

file_types = ('jpg', 'jpeg', 'png')  # extensions to treat as images

url = 'http://somefoobar.net'
page = requests.get(url).text
soup = bs4.BeautifulSoup(page, 'lxml')
images = soup.find_all('img')
# loop through all img elements found and keep the urls with matching extensions
urls = [x['src'] for x in images if x.get('src', '').split('.')[-1] in file_types]
print(urls)
print(len(urls))
Is there a way to get all the flags from https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags using python code?
I tried with pd.read_html and did not succeed. I tried scraping but it got so messy and I couldn't do it.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
# Scrape the webpage
soup = BeautifulSoup(page.content, 'html.parser')
flags = soup.find_all('a', attrs={'class': "image"})
Would be nice if I can download them to a specific folder too!
Thanks in advance!
Just as an alternative to yours and the well-described approach of MattieTK, you could also use CSS selectors to select your elements more specifically:
soup.select('img[src*="/Flag_of"]')
Iterate the ResultSet, pick the src and use a function to download the images:
for e in soup.select('img[src*="/Flag_of"]'):
    download_file('https:'+e.get('src'))
Example
import requests
from bs4 import BeautifulSoup
def download_file(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        file_name = url.split('/')[-1]
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    else:
        print('Image Couldn\'t be retrieved', url)

page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
soup = BeautifulSoup(page.content, 'html.parser')

for e in soup.select('img[src*="/Flag_of"]'):
    download_file('https:'+e.get('src'))
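Note that stream=True together with iter_content() writes each image to disk in chunks of 8192 bytes instead of loading the whole file into memory at once.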
In your example, flags is a list of anchor tags, each containing an img tag.
What you want is a way to get each individual src attribute from the image tag.
You can achieve this by looping over the results of your soup.find_all like so. Each flag is separate, which allows you to get the contents of the flag (the image tag) and then the value of the src attribute.
for flag in soup.find_all('a', attrs={'class': "image"}):
    src = flag.contents[0]['src']
You can then work on downloading each of these to a file inside the loop.
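A minimal sketch of that download step, reusing requests; the target folder name here is just an assumption:

import os
import requests
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
soup = BeautifulSoup(page.content, 'html.parser')

os.makedirs('flags', exist_ok=True)  # hypothetical target folder

for flag in soup.find_all('a', attrs={'class': "image"}):
    src = flag.contents[0]['src']  # the <img> inside the anchor
    url = 'https:' + src           # Wikipedia srcs are protocol-relative
    r = requests.get(url)
    if r.status_code == 200:
        with open(os.path.join('flags', url.split('/')[-1]), 'wb') as f:
            f.write(r.content)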
I have code that works fine, but I want to find image URLs over a wider range. How can I do that?
import requests
import random
from bs4 import BeautifulSoup
img = []
word = 'dog'
url = 'https://www.google.com/search?q={0}&tbm=isch'.format(word)
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
images = soup.find_all('img')
for image in images:
    img.append(image.get('src'))
print(img[random.randint(1, 21)])
Hi, I've been trying all day to find a way to download some images from this
URL: https://omgcheckitout.com/these-trypophobia-photos-will
but when I run this code, I always get only the URLs for the small images in the corner, not the ones found in the article.
(I've also tried other ways, but I always get the same result.)
import requests, os
from bs4 import BeautifulSoup as bs

url = 'https://omgcheckitout.com/these-trypophobia-photos-will'
r = requests.get(url)
soup = bs(r.text, "html.parser")
images = soup.find_all('img')
for image in images:
    print(image['src'])
Converting my comment to an answer.
original comment:
"I believe what is happening here is that the page that you are seeing in the browser is being loaded dynamically with javascript. Try typing in '.html' to the page url and see what happens. The images in the redirect are what are being downloaded with your code. I recommend taking a look at this thread https://stackoverflow.com/questions/52687372/beautifulsoup-not-returning-complete-html-of-the-page"
Try to download them to your disk:
import requests
from os.path import basename
from bs4 import BeautifulSoup

r = requests.get("xxx")
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("img")  # collect the img tags; the original snippet left links undefined
for link in links:
    if "http" in link.get('src', ''):
        lnk = link.get('src')
        with open(basename(lnk), "wb") as f:
            f.write(requests.get(lnk).content)
You can also use a select to filter your tags to only get the ones with http links:
for link in soup.select('img[src^="http"]'):
    lnk = link["src"]
    with open(basename(lnk), "wb") as f:
        f.write(requests.get(lnk).content)
I would like to know if it is possible to scrape images from websites with code that works for all types of websites (I mean, independently of the HTML format).
I have a list of websites, and I would need to get all the images related to each link.
For instance:
list_of_links = ['https://www.bbc.co.uk/programmes/articles/5nxMx7d1K8S6nhjkPBFhHSM/withering-wit-and-words-of-wisdom-oscar-wildes-best-quotes', 'https://www.lastampa.it/torino/2020/03/31/news/coronavirus-il-lockdown-ha-gia-salvato-almeno-400-vite-umane-1.38659569', and so on]
In general, I would use:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
link = '...'
html = urlopen(link)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src': re.compile('.jpg')})
for image in images:
    print(image['src']+'\n')
but I have doubts about the HTML parsing (can the same code be used for every website?) and about the image format (.jpg; would it be the same for all websites?).
Thank you for all your comments and suggestions.
Assuming all the images are inside img tags with a src attribute, and those image elements aren't added dynamically (no virtual DOM), modifying your code a little would work:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
link = '...'
html = urlopen(link)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img')
for image in images:
    print(image['src']+'\n')
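One more caveat for arbitrary sites: src values are often relative paths. A sketch that resolves them against the page URL with urljoin, using the first link from the question:

from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

link = 'https://www.bbc.co.uk/programmes/articles/5nxMx7d1K8S6nhjkPBFhHSM/withering-wit-and-words-of-wisdom-oscar-wildes-best-quotes'
html = urlopen(link)
bs = BeautifulSoup(html, 'html.parser')

for image in bs.find_all('img'):
    src = image.get('src')
    if src:
        # urljoin turns a relative src into an absolute URL; absolute srcs pass through unchanged.
        print(urljoin(link, src))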
I'm trying to scrape the src of an img, but the code I found returns many img src values, not the one I want. I can't figure out what I am doing wrong. I am scraping TripAdvisor at "https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html"
So this is the HTML snippet I'm trying to extract from:
<div class="restaurants-detail-overview-cards-LocationOverviewCard__cardColumn--2ALwF"><h6>Placering og kontaktoplysninger</h6><span><div><span data-test-target="staticMapSnapshot" class=""><img class="restaurants-detail-overview-cards-LocationOverviewCard__mapImage--22-Al" src="https://trip-raster.citymaps.io/staticmap?scale=1&zoom=15&size=347x137&language=da&center=55.687988,12.596316&markers=icon:http%3A%2F%2Fc1.tacdn.com%2F%2Fimg2%2Fmaps%2Ficons%2Fcomponent_map_pins_v1%2FR_Pin_Small.png|55.68799,12.596316"></span></div></span>
I want the code to return: (a sub-string from src)
55.68799,12.596316
I have tried:
import pandas as pd
pd.options.display.max_colwidth = 200
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import re
web_url = "https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html"
url = urlopen(web_url)
url_html = url.read()
soup = bs(url_html, 'lxml')
soup.find_all('img')
for link in soup.find_all('img'):
    print(link.get('src'))
The output is along the lines of this, BUT NOT the src that I need:
https://static.tacdn.com/img2/branding/rebrand/TA_logo_secondary.svg
https://static.tacdn.com/img2/branding/rebrand/TA_logo_primary.svg
https://static.tacdn.com/img2/branding/rebrand/TA_logo_secondary.svg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
You can do this with just requests and re. Only the coordinates part of the src varies with the location.
import requests, re
p = re.compile(r'"coords":"(.*?)"')
r = requests.get('https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html')
coords = p.findall(r.text)[1]
src = f'https://trip-raster.citymaps.io/staticmap?scale=1&zoom=15&size=347x137&language=da&center={coords}&markers=icon:http://c1.tacdn.com//img2/maps/icons/component_map_pins_v1/R_Pin_Small.png|{coords}'
print(src)
print(coords)
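Note that p.findall(r.text)[1] relies on the coordinates being the second "coords" match in the page source; if TripAdvisor changes its markup, that index (and the regex) may need adjusting.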
Selenium is a workaround; I tested it and it works like a charm. Here you are:
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html")
links = driver.find_elements_by_xpath("//*[@src]")
urls = []
for link in links:
    url = link.get_attribute('src')
    if '|' in url:
        urls.append(url.split('|')[1])  # saves in the list only the numbers you want, i.e. 55.68799,12.596316
    print(url)
print(urls)
Result of the above:
['55.68799,12.596316']
If you haven't used Selenium before, you can find a webdriver here: https://chromedriver.storage.googleapis.com/index.html?path=2.46/
or here
https://sites.google.com/a/chromium.org/chromedriver/downloads