This is my code
from bs4 import BeautifulSoup
import urllib.request
import re
print("Enter the link \n")
link = input()
url = urllib.request.urlopen(link)
content = url.read()
soup = BeautifulSoup(content, 'html.parser')
# Collect the href of every <a> tag that points at an absolute .jpg URL
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.jpg'))]
print(len(links))
#print (links)
print("\n".join(links))
When I give the input
http://keralapals.com/emmanuel-malayalam-movie-stills
I get the output
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-0.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-1.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-2.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-3.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-4.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-5.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-6.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-7.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-8.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-9.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-10.jpg
But when I give the input
http://www.raagalahari.com/actress/13192/regina-cassandra-at-big-green-ganesha-2014.aspx
or
http://www.ragalahari.com/actress/13192/regina-cassandra-at-big-green-ganesha-2014.aspx
it produces no output.
So I need the links to the original pictures. This page contains only thumbnails; clicking a thumbnail leads to the original image.
I need to collect those image links and download the images.
Any help is welcome.
Thank you,
Muneeb K
The problem is that in the second case the actual image URLs ending with .jpg are inside the src attribute of the img tags:
<a href="/actress/13192/regina-cassandra-at-big-green-ganesha-2014/image61.aspx">
<img src="http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha61t.jpg" alt="Regina Cassandra" title="Regina Cassandra at BIG Green Ganesha 2014">
</a>
As one option, you can support this type of link too:
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.jpg'))]
# "x and ..." skips img tags that have no src attribute (x is None for those)
imgs = [img['src'] for img in soup.find_all('img', src=lambda x: x and x.endswith('.jpg'))]
links += imgs
print(len(links))
print("\n".join(links))
For this url it prints:
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha61t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha105t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha106t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha107t.jpg
...
Note that instead of a regular expression I'm passing a function where I check that the src attribute ends with .jpg.
Hope it helps and you've learned something new today.
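To make the function-filter idea easy to try without hitting the network, here is a minimal sketch on a small inline document. The HTML and the example.com URLs are made up for illustration; only the filtering technique comes from the answer above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a downloaded page.
html = """
<a href="http://example.com/full/photo1.jpg">full size</a>
<a href="/gallery/image2.aspx"><img src="http://cdn.example.com/thumbs/photo2t.jpg"></a>
<img alt="spacer without a src attribute">
<img src="http://cdn.example.com/logo.png">
"""

soup = BeautifulSoup(html, "html.parser")

# Anchors whose href is an absolute URL ending in .jpg.
links = [a["href"] for a in soup.find_all("a", href=re.compile(r"^http.*\.jpg$"))]

# Images whose src ends in .jpg. The "s and ..." guard skips tags with no src,
# for which the filter function may be called with None.
links += [
    img["src"]
    for img in soup.find_all("img", src=lambda s: s and s.endswith(".jpg"))
]

print("\n".join(links))
```

Note that this collects the thumbnail URLs exactly as they appear in the page; it does not follow the thumbnail links to the full-size image pages.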
Related
The code below is an example.
On the first img, you can see the class is class_name and the src attribute contains a link. On the rest of the img tags the classes are different, and there is no src attribute, only data-src.
So when I try to get the links, I get either only the first one (with get('src')) or only the rest (if I change get('src') to get('data-src')).
Is there any way to get the links only as text?
import requests
from bs4 import BeautifulSoup
url = 'website.com'
soup = BeautifulSoup.get(url)
links = {
'<img class="class_name" src="https://website1.png"/>',
'<img class="class_name late" data-src="https://website2.png"/>',
'<img class="class_name late" data-src="https://website3.png"/>',
}
for link in links:
    link.find('img', class_='class_name').get('src')
    print(link)
Thanks
I need the output like this:
https://website1.png
https://website2.png
https://website3.png
Simply select all of the images, iterate over the ResultSet, and check which attribute is available. Then extract its value and print it, or append it to a list (or add it to a set if you want to avoid duplicates).
Example
from bs4 import BeautifulSoup
html = '''
<img class="class_name" src="https://website1.png"/>
<img class="class_name late" data-src="https://website2.png"/>
<img class="class_name late" data-src="https://website3.png"/>
'''
soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('img.class_name'):
    # Prefer src; fall back to data-src when src is absent
    if link.get('src'):
        print(link.get('src'))
    else:
        print(link.get('data-src'))
Output
https://website1.png
https://website2.png
https://website3.png
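The "append it to a list and avoid duplicates" variant mentioned above could look like the sketch below. The duplicated fourth img tag is added here only to show the de-duplication; the rest of the HTML is the sample from the question.

```python
from bs4 import BeautifulSoup

html = """
<img class="class_name" src="https://website1.png"/>
<img class="class_name late" data-src="https://website2.png"/>
<img class="class_name late" data-src="https://website3.png"/>
<img class="class_name late" data-src="https://website3.png"/>
"""

soup = BeautifulSoup(html, "html.parser")

urls = []
for img in soup.select("img.class_name"):
    # Take src if present and non-empty, otherwise fall back to data-src.
    url = img.get("src") or img.get("data-src")
    # Keep first-seen order while skipping repeats and tags with neither attribute.
    if url and url not in urls:
        urls.append(url)

print("\n".join(urls))
```

A list keeps the page order; a set would also remove duplicates but would not preserve that order.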
I am trying to test my website using Selenium, and I want to check whether every image has a non-empty alt attribute. How can I check this?
<img src="/media/images/biren.png" alt="N. Biren" class ="img-fluid" >
I'm not well versed with selenium, but this can be done easily using requests with bs4 (simple web-scraping).
Please find an example code below:
import requests, bs4
url = 'HTML URL HERE!'
# get the url html txt
page = requests.get(url).text
# parse the html txt
soup = bs4.BeautifulSoup(page, 'html.parser')
# get all img tags
for image in soup.find_all('img'):
    try:
        # print alt text if exists
        print(image['alt'])
    except KeyError:
        # print the complete img tag if not
        print(image)
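Since the question asks whether alt is filled, not just present, a variation that also flags empty or whitespace-only values might be useful. This is a sketch on inline HTML; the second and third img tags are invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<img src="/media/images/biren.png" alt="N. Biren" class="img-fluid">
<img src="/media/images/empty.png" alt="">
<img src="/media/images/missing.png">
"""

soup = BeautifulSoup(html, "html.parser")

# An image fails the check when alt is absent, empty, or only whitespace.
missing_alt = [
    img.get("src")
    for img in soup.find_all("img")
    if not (img.get("alt") or "").strip()
]

print(missing_alt)
```

In a test you could then assert that the list is empty, for example `assert not missing_alt`.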
I'm working on a project and I'm trying to extract the pictures' URLs from a website. I'm new to this, so please bear with me. Based on the HTML code, the class of the pictures that I want is "fotorama__img". However, when I execute my code, it doesn't seem to work. Does anyone know why that's the case? Also, how come the src attribute doesn't contain the whole URL, just a part of it? Example: the link to the image is https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_front.jpg but the src attribute of the img tag is "/files_SYS/images/System/sysThumb/SYS-120U-TNR_main.png".
Here is my code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.supermicro.com/en/products/system/Ultra/1U/SYS-120U-TNR")
soup = BeautifulSoup(page.content,'lxml')
images = soup.find_all("img", {"class": "fotorama__img"})
for image in images:
    print(image.get("src"))
And here is the picture of the HTML code for the page
Thank you for your help!
The class is added dynamically via JavaScript, so BeautifulSoup, which only sees the HTML the server returns, doesn't find it. To extract the images from this site, you can do:
import requests
from bs4 import BeautifulSoup
page = requests.get(
    "https://www.supermicro.com/en/products/system/Ultra/1U/SYS-120U-TNR"
)
soup = BeautifulSoup(page.content, "lxml")
images = [
    "https://www.supermicro.com" + a["href"]
    for a in soup.select(".fotorama > a")
]
print(*images, sep="\n")
Prints:
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_main.png
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_angle.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_top.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_front.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_rear.jpg
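As for why the src holds only part of the URL: a value that starts with / is a relative reference, which the browser resolves against the address of the page. Instead of concatenating strings by hand, the standard library's urllib.parse.urljoin does the same resolution. A minimal sketch, using the page URL and src value quoted in the question:

```python
from urllib.parse import urljoin

page_url = "https://www.supermicro.com/en/products/system/Ultra/1U/SYS-120U-TNR"
src = "/files_SYS/images/System/sysThumb/SYS-120U-TNR_main.png"

# A path starting with "/" replaces the whole path of the page URL.
absolute = urljoin(page_url, src)
print(absolute)

# A value that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, "https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_main.png")
print(already_absolute)
```

This also handles pages that mix relative and absolute links, where plain concatenation would produce a broken URL for the absolute ones.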
I am trying to make a web crawler that will give me all the links to images in the given URL, but many of the images that I found, while looking in the page source and searching in the page source with CTRL+F, were not printed in the output.
My code is:
import requests
from bs4 import BeautifulSoup
import urllib
import os
print ("Which website would you like to crawl?")
website_url = raw_input("--> ")
i = 0
while i < 1:
    source_code = requests.get(website_url) # The source code will have the page source (<html>.......</html>)
    plain_text = source_code.text # Gets only the text from the source code
    soup = BeautifulSoup(plain_text, "html5lib")
    for link in soup.findAll('img'): # A loop which looks for all the images in the website
        src = link.get('src') # I want to get the image URL and it's located under 'src' in HTML
        if 'http://' not in src and 'https://' not in src:
            if src[0] != '/':
                src = '/' + src
            src = website_url + src
        print src
    i += 1
How should I make my code print every image that is in an <img> in the HTML page source?
For example: the website has this HTML code:
<img src="http://shippuden.co.il/wp-content/uploads/newkadosh21.jpg" *something* >
But the script didn't print its src.
The script is printing the src in <img .... src="...">
How should I improve my code to find all the images?
Taking a look at the main page of the domain you posted in the example, I see that the image you refer to is not in the src attribute, but in the data-lazy-src attribute.
So you should read both attributes, like this:
src = link.get('src')
lazy_load_src = link.get('data-lazy-src')
Actually, when running the example script you showed, the img src for the image newkadosh21 is printed, but it is a base64 value like:
src=""
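Putting the two attributes together, a sketch of the selection logic could look like this. The helper name image_url, the inline HTML, the example.com URLs, and the short base64 placeholder are all assumptions made for the example; only the src / data-lazy-src pairing comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a lazy-loaded image next to a normal one.
html = """
<img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=" data-lazy-src="http://example.com/uploads/lazy.jpg">
<img src="http://example.com/uploads/plain.jpg">
<img alt="no source at all">
"""


def image_url(img):
    """Return the first usable URL of an img tag, or None if it has none."""
    for attr in ("data-lazy-src", "src"):
        value = img.get(attr)
        # Skip missing attributes and inline data: placeholders.
        if value and not value.startswith("data:"):
            return value
    return None


soup = BeautifulSoup(html, "html.parser")
urls = [u for u in map(image_url, soup.find_all("img")) if u]
print("\n".join(urls))
```

Other lazy-loading scripts use different attribute names (data-src appears elsewhere on this page), so the tuple of attribute names likely needs adjusting per site.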
My code only returns an empty string, and I have no idea why.
import urllib2
def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read() #Gives HTML to parse
    start = page.find('<a img=')
    end = page.find('>', start)
    img = page[start:end]
    return img
It would only return the first image it finds, so it's not a very good image scraper; that said, my primary goal right now is just to be able to find an image. I'm unable to.
Consider using BeautifulSoup to parse your HTML:
from BeautifulSoup import BeautifulSoup
import urllib
url = 'http://www.google.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
for img in soup.findAll('img'):
    print img['src']
You should use a library for this and there are several out there, but to answer your question by changing the code you showed us...
Your problem is that you are trying to find images, but images don't use the <a ...> tag. They use the <img ...> tag. Here is an example:
<img src="smiley.gif" alt="Smiley face" height="42" width="42">
What you should do is change your start = page.find('<a img=') line to start = page.find('<img ') like so:
def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read() #Gives HTML to parse
    start = page.find('<img ')
    end = page.find('>', start)
    img = page[start:end+1]
    return img
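The string-search version still returns only the first tag, and it returns the whole tag instead of the image address. If you want to stay within the standard library, a sketch in Python 3 using html.parser.HTMLParser collects the src of every img tag. The class name ImgSrcCollector and the sample page string are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Hypothetical page fragment instead of a downloaded page.
page = (
    '<p><a href="/home">Home</a>'
    '<img src="smiley.gif" alt="Smiley face" height="42" width="42">'
    '<img src="logo.png"/></p>'
)

collector = ImgSrcCollector()
collector.feed(page)
print(collector.sources)
```

To use it on a real page, you would feed it the decoded HTML text you downloaded instead of the sample string.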
Article on screen scraping with ruby:
http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
It's not about scraping images, but it's a good article and may help.
Extracting the image information this way is not a good idea. There are several better options, depending on your knowledge and your motivation to learn something new:
http://scrapy.org/ is a very good framework for extracting data from web pages. As it looks like you're a beginner, it might be a bit overkill.
Learn regular expressions to extract the information: http://docs.python.org/library/re.html and Learning Regular Expressions
Use http://www.crummy.com/software/BeautifulSoup/ to parse data from the result of page.read().
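For the regular-expression option, a rough sketch in Python 3 is shown below. The pattern and the sample page string are assumptions for illustration; regular expressions break easily on real-world HTML (unquoted attributes, line breaks, comments), which is why a parser is usually the safer choice.

```python
import re

# Hypothetical page fragment; real pages are rarely this regular.
page = """
<a href="/home">Home</a>
<img src="smiley.gif" alt="Smiley face">
<IMG class="x" SRC='logo.png'>
<img data-src="ignored.png">
"""

# Match an <img ...> tag and capture the quoted value of its src attribute.
# Requiring whitespace before "src" avoids matching attributes like data-src.
IMG_SRC = re.compile(r'<img\b[^>]*?\ssrc=["\']([^"\']+)["\']', re.IGNORECASE)

sources = IMG_SRC.findall(page)
print(sources)
```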
Some instructions that might be of help:
Use Google Chrome. Set the mouse over the image and right click. Select "Inspect element". That will open a section where you'll be able to see the html near the image.
Use Beautiful Soup to parse the html:
import urllib2
from BeautifulSoup import BeautifulSoup
request = urllib2.Request(url)
response = urllib2.urlopen(request)
html = response.read()
soup = BeautifulSoup(html)
imgs = soup.findAll("img")
items = []
for img in imgs:
    print img['src'] #print the image location
    items.append(img['src']) #store the locations for downloading later