Web crawler doesn't print every image source - python

I am trying to write a web crawler that prints the links to all the images on a given URL. However, many of the images that I can find in the page source (by searching it with CTRL+F) are not printed in the output.
My code is:
import requests
from bs4 import BeautifulSoup
import urllib
import os
print ("Which website would you like to crawl?")
website_url = raw_input("--> ")
i = 0
while i < 1:
    source_code = requests.get(website_url)  # The response holds the page source (<html>...</html>)
    plain_text = source_code.text  # Gets only the text from the response
    soup = BeautifulSoup(plain_text, "html5lib")
    for link in soup.findAll('img'):  # A loop which looks for all the images on the page
        src = link.get('src')  # I want the image URL, and it is located under 'src' in the HTML
        if 'http://' not in src and 'https://' not in src:
            if src[0] != '/':
                src = '/' + src
            src = website_url + src
        print src
    i += 1
How can I make my code print every image that appears in an <img> tag in the HTML page source?
For example, the website has this HTML code:
<img src="http://shippuden.co.il/wp-content/uploads/newkadosh21.jpg" *something* >
But the script didn't print its src.
The script does print the src of tags like <img .... src="...">
How should I improve my code to find all the images?

Looking at the main page of the domain in your example, I see that the image you refer to is not in the src attribute but in the data-lazy-src attribute.
So you should read both attributes, like this:
src = link.get('src')
lazy_load_src = link.get('data-lazy-src')
In fact, when I run the code you posted, the img src for the image newkadosh21 is printed, but it is a base64-encoded value, like:
src=""
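A minimal sketch of reading both attributes, for illustration only. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages, and it resolves relative paths with urljoin rather than string concatenation. The names ImageSourceParser and collect_image_urls are hypothetical, and preferring data-lazy-src over src is an assumption about how this kind of lazy loading is usually marked up:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSourceParser(HTMLParser):
    """Collect one URL per <img> tag, preferring the lazy-load attribute."""

    # Checked in order. Only data-lazy-src comes from the answer above;
    # trying it before src is an assumption.
    CANDIDATE_ATTRS = ("data-lazy-src", "src")

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        for name in self.CANDIDATE_ATTRS:
            value = attr_map.get(name)
            # Skip missing values and inline base64 data (data: URIs).
            if value and not value.startswith("data:"):
                # urljoin leaves absolute URLs alone and resolves relative ones.
                self.urls.append(urljoin(self.base_url, value))
                break


def collect_image_urls(html, base_url):
    """Return the image URLs found in html, resolved against base_url."""
    parser = ImageSourceParser(base_url)
    parser.feed(html)
    return parser.urls
```

With BeautifulSoup the same idea would be link.get('data-lazy-src') or link.get('src') inside the existing loop.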

Related

Python Web scraping Hidden jpg images that I can't figure out how to download from this internet site

Hi, I've been trying all day to find a way to download some images from this URL: https://omgcheckitout.com/these-trypophobia-photos-will
But when I run this code, I only ever get the URLs of the small images in the corner, not the ones found in the article.
(I've also tried other approaches, but I always get the same result.)
import requests, os
from bs4 import BeautifulSoup as bs
url = 'https://omgcheckitout.com/these-trypophobia-photos-will'
r = requests.get(url)
soup = bs(r.text, "html.parser")
images = soup.find_all('img')
for image in images:
    print(image['src'])
Converting my comment to an answer. Original comment:
"I believe what is happening here is that the page you see in the browser is being loaded dynamically with JavaScript. Try adding '.html' to the page URL and see what happens. The images on the page you are redirected to are the ones your code is picking up. I recommend taking a look at this thread: https://stackoverflow.com/questions/52687372/beautifulsoup-not-returning-complete-html-of-the-page"
Try downloading them to your disk:
import requests
from bs4 import BeautifulSoup
from os.path import basename
r = requests.get("xxx")
soup = BeautifulSoup(r.content, "html.parser")
# only img tags that actually have a src attribute
links = soup.find_all("img", src=True)
for link in links:
    lnk = link["src"]
    if "http" in lnk:
        # save each image under the last segment of its URL
        with open(basename(lnk), "wb") as f:
            f.write(requests.get(lnk).content)
You can also use a CSS selector to keep only the tags whose src starts with http:
for link in soup.select("img[src^=http]"):
    lnk = link["src"]
    with open(basename(lnk), "wb") as f:
        f.write(requests.get(lnk).content)
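One caveat with basename(lnk): it keeps everything after the last slash, so a URL with a query string would give a file name that includes the "?..." part. A small sketch of deriving the name from the URL path only; filename_from_url and its default fallback are hypothetical names, not part of requests or bs4:

```python
from posixpath import basename  # URL paths always use forward slashes
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="image"):
    """Return the last path segment of url, ignoring query and fragment."""
    path = urlparse(url).path
    # Decode percent-escapes such as %20 before taking the last segment.
    name = basename(unquote(path))
    # A URL ending in "/" has no last segment, so fall back to a default.
    return name or default
```

In the loops above it could stand in for basename(lnk) when opening the output file.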

how to read all images alt in html page using selenium?

I am trying to test my website using Selenium, and I want to check whether every image has a filled-in alt attribute. How can I check this?
<img src="/media/images/biren.png" alt="N. Biren" class ="img-fluid" >
I'm not well versed in Selenium, but this can be done easily using requests with bs4 (simple web scraping).
Please find example code below:
import requests, bs4
url = 'HTML URL HERE!'
# get the page HTML as text
page = requests.get(url).text
# parse the HTML text
soup = bs4.BeautifulSoup(page, 'html.parser')
# go through all img tags
for image in soup.find_all('img'):
    try:
        # print the alt text if the attribute exists
        print(image['alt'])
    except KeyError:
        # otherwise print the complete img tag
        print(image)
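Since the question asks whether alt is filled, an empty alt="" probably needs to count as a failure too. Here is a rough sketch of that check. It uses the standard library's html.parser rather than bs4 or Selenium, purely so that it has no dependencies, and the names MissingAltParser and images_missing_alt are made up for this example:

```python
from html.parser import HTMLParser


class MissingAltParser(HTMLParser):
    """Record the src of every <img> whose alt is absent or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        alt = attr_map.get("alt")
        # alt is None when the attribute is absent or has no value.
        if alt is None or not alt.strip():
            self.missing.append(attr_map.get("src"))


def images_missing_alt(html):
    """Return the src values of images without a usable alt text."""
    parser = MissingAltParser()
    parser.feed(html)
    return parser.missing
```

In a test, the page HTML could come from requests as above (or from Selenium's page source), and the test would assert that the returned list is empty.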

How to get full image url with python

I am having trouble parsing the links to the images (photos of the house) on this site (https://kvartiry-bolgarii.ru/trekhkomnatnaya-kvartira-v-blagoustroennom-i-spokoynom-kurortnom-poselke-o26252).
How can I get the data from src (for all images) and combine it with the site domain into a full link?
I tried the following, but I can't get the full link because I don't know how to read the link in src:
import requests
from bs4 import BeautifulSoup
rs = requests.get('https://kvartiry-bolgarii.ru/neveroyatnaya-kvartira-s-vidom-na-more-tip-pentkhaus-o26253')
root = BeautifulSoup(rs.content, 'html.parser')
urls = root.select('#slider > li > img[src]')
print(urls)
# [<img alt="" src="/photos/5e2c79b4-7da2-478e-a783-ad8f010d0b15.jpg"/>, , <img alt="" src="/photos/90f58624-1f32-46a2-afc9-ad8f010e2703.jpg"/>]
I don't understand how you could get that far and not know how to get the src attribute:
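import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
base = 'https://kvartiry-bolgarii.ru/neveroyatnaya-kvartira-s-vidom-na-more-tip-pentkhaus-o26253'
rs = requests.get(base)
root = BeautifulSoup(rs.content, 'html.parser')
urls = root.select('#slider > li > img[src]')
for url in urls:
    # urljoin resolves the site-relative src against the page URL;
    # plain base + src would append it to the page path instead of the domain
    print(urljoin(base, url['src']))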
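For reference, here is how src values resolve against a page URL with urllib.parse.urljoin. The first src is taken from the question's output; the other two are hypothetical values added to show the other cases:

```python
from urllib.parse import urljoin

page_url = "https://kvartiry-bolgarii.ru/neveroyatnaya-kvartira-s-vidom-na-more-tip-pentkhaus-o26253"

srcs = [
    # site-relative path, as printed in the question
    "/photos/5e2c79b4-7da2-478e-a783-ad8f010d0b15.jpg",
    # hypothetical page-relative path: replaces the last segment of the page URL
    "thumb.jpg",
    # hypothetical absolute URL: returned unchanged
    "https://cdn.example.com/x.jpg",
]

full_urls = [urljoin(page_url, src) for src in srcs]
for full_url in full_urls:
    print(full_url)
```

A leading slash means "from the site root", so the page path is dropped and only the scheme and domain are kept.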

How to get image from webpage

I am editing a Python script which gets images from a webpage (which needs a private login, so there is no point in me posting a link to it). It uses the BeautifulSoup library, and the original script is here.
What I would like to do is to customize this script to get a single image, the HTML tag of which has the id attribute id="fimage". It has no class. Here is the code:
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import urllib.error
from urllib.request import urlopen
# use this image scraper from the location that
# you want to save scraped images to
def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)
def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element tags
    images = [img for img in soup.find(id="fimage")]
    print (images)
    print (str(len(images)) + " images found.")
    # print 'Downloading images to current working directory.'
    # compile our unicode list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        filename=each.split('/')[-1]
        urlretrieve(each, filename)
    return image_links
get_images('http://myurl');
# a standard call looks like this
# get_images('http://www.wookmark.com')
For some reason, this doesn't seem to work. When run on the command line, it produces the output:
[]
0 images found.
UPDATE:
Okay so I have changed the code and now the script seems to find the image I'm trying to download, but it throws another error when run and can't download it.
Here is the updated code:
from bs4 import BeautifulSoup
from urllib import request
import urllib.parse
import urllib.error
from urllib.request import urlopen
def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)
def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element tags
    image = soup.find(id="logo", src=True)
    if image is None:
        print('No images found.')
        return
    image_link = image['src']
    filename = image_link.split('/')[-1]
    request.urlretrieve(filename)
    return image_link
try:
    get_images('https://pypi.python.org/pypi/ClientForm/0.2.10');
except ValueError as e:
    print("File could not be retrieved.", e)
else:
    print("It worked!")
# a standard call looks like this
# get_images('http://www.wookmark.com')
When run on the command line the output is:
File could not be retrieved. unknown url type: 'python-logo.png'
soup.find(id="fimage") returns one result, not a list. You are trying to loop over that one element, which means it'll try and list the child nodes, and there are none.
Simply adjust your code to take into account you only have one result; remove all the looping:
image = soup.find(id="fimage", src=True)
if image is None:
    print('No matching image found')
    return
image_link = image['src']
filename = image_link.split('/')[-1]
urlretrieve(image_link, filename)
I refined the search a little; by adding src=True you only match a tag if it has a src attribute.
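As for the "unknown url type" error in the update: urlretrieve expects the URL first and the target file name second, but the updated code passes only the bare file name. A relative src also has to be resolved against the page URL before it can be fetched. Here is a sketch of preparing both values without downloading anything; download_target is a hypothetical helper, and the src in the usage lines is a made-up example path:

```python
from urllib.parse import urljoin, urlparse


def download_target(page_url, src):
    """Return (absolute_url, filename) for an image src found on page_url."""
    # Resolve relative or site-relative src values against the page URL.
    absolute_url = urljoin(page_url, src)
    # Use the last path segment as the local file name.
    filename = urlparse(absolute_url).path.split("/")[-1]
    return absolute_url, filename


url, name = download_target(
    "https://pypi.python.org/pypi/ClientForm/0.2.10",
    "/static/python-logo.png",  # hypothetical src, for illustration only
)
# The pair can then be passed on, e.g. urllib.request.urlretrieve(url, name)
```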

How to Get all the image links & download using python

This is my code
from bs4 import BeautifulSoup
import urllib.request
import re
print("Enter the link \n")
link = input()
url = urllib.request.urlopen(link)
content = url.read()
soup = BeautifulSoup(content)
links = [a['href'] for a in soup.find_all('a',href=re.compile('http.*\.jpg'))]
print (len(links))
#print (links)
print("\n".join(links))
When I give the input
http://keralapals.com/emmanuel-malayalam-movie-stills
I get the output
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-0.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-1.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-2.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-3.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-4.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-5.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-6.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-7.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-8.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-9.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-10.jpg
But when I give the input
http://www.raagalahari.com/actress/13192/regina-cassandra-at-big-green-ganesha-2014.aspx
or
http://www.ragalahari.com/actress/13192/regina-cassandra-at-big-green-ganesha-2014.aspx
it produces no output :(
So I need to get the links to the original pictures. This page contains only thumbnails; clicking a thumbnail leads to the original image link.
I need to get those image links and download the images :(
Any help is really welcome :)
Thank You
Muneeb K
The problem is that in the second case the actual image URLs ending with .jpg are inside the src attribute of the img tags:
<a href="/actress/13192/regina-cassandra-at-big-green-ganesha-2014/image61.aspx">
<img src="http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha61t.jpg" alt="Regina Cassandra" title="Regina Cassandra at BIG Green Ganesha 2014">
</a>
As one option, you can support this type of link too:
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.jpg'))]
# the function also receives None for img tags without a src, hence the "x and" guard
imgs = [img['src'] for img in soup.find_all('img', src=lambda x: x and x.endswith('.jpg'))]
links += imgs
print (len(links))
print("\n".join(links))
For this url it prints:
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha61t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha105t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha106t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha107t.jpg
...
Note that instead of a regular expression I'm passing a function that checks whether the src attribute ends with .jpg.
Hope it helps and you've learned something new today.
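The function passed as src= can be any callable that takes the attribute value, so it can also be a named function. Below is a sketch of a slightly more tolerant check, assuming you also want to accept .jpeg, upper-case extensions and URLs with a query string (none of which appear in the question). The name is_jpg_url is hypothetical. BeautifulSoup passes None for tags that lack the attribute, so that case is handled first:

```python
from urllib.parse import urlparse


def is_jpg_url(value):
    """Return True if value is a URL whose path ends in .jpg or .jpeg."""
    if not value:  # None (attribute missing) or an empty string
        return False
    # Look only at the path so a query string like ?size=large is ignored.
    path = urlparse(value).path.lower()
    return path.endswith((".jpg", ".jpeg"))
```

It could then be used as soup.find_all('img', src=is_jpg_url).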
