My code only returns an empty string, and I have no idea why.
import urllib2
def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read()  # Gives HTML to parse
    start = page.find('<a img=')
    end = page.find('>', start)
    img = page[start:end]
    return img
It would only return the first image it finds, so it's not a very good image scraper; that said, my primary goal right now is just to be able to find an image. I'm unable to.
Consider using BeautifulSoup to parse your HTML:
from BeautifulSoup import BeautifulSoup
import urllib
url = 'http://www.google.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
for img in soup.findAll('img'):
    print img['src']
You should use a library for this and there are several out there, but to answer your question by changing the code you showed us...
Your problem is that you are trying to find images, but images don't use the <a ...> tag. They use the <img ...> tag. Here is an example:
<img src="smiley.gif" alt="Smiley face" height="42" width="42">
What you should do is change your start = page.find('<a img=') line to start = page.find('<img ') like so:
def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read()  # Gives HTML to parse
    start = page.find('<img ')
    end = page.find('>', start)
    img = page[start:end+1]
    return img
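If you later want every image rather than only the first, the same string-searching idea can be looped (a rough sketch; a real HTML parser is still the better tool):

```python
def get_images(page):
    # Collect every <img ...> tag by repeatedly searching from the last hit
    imgs = []
    start = page.find('<img ')
    while start != -1:
        end = page.find('>', start)
        imgs.append(page[start:end+1])
        start = page.find('<img ', end)
    return imgs

print(get_images('<p>x</p><img src="a.gif"><img src="b.gif">'))
# ['<img src="a.gif">', '<img src="b.gif">']
```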
An article on screen scraping with Ruby:
http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
It's not about scraping images, but it's a good article and may help.
Extracting the image information this way is not a good idea. There are several better options, depending on your knowledge and your motivation to learn something new:
http://scrapy.org/ is a very good framework for extracting data from web pages. As it looks like you're a beginner, it might be a bit of overkill.
Learn regular expressions to extract the information: http://docs.python.org/library/re.html and Learning Regular Expressions
Use http://www.crummy.com/software/BeautifulSoup/ to parse data from the result of page.read().
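As a quick illustration of the regular-expression route, here is a minimal sketch; the HTML snippet is made up for demonstration, and a real parser handles edge cases (attribute order, quoting) that this pattern does not:

```python
import re

# A made-up HTML snippet for demonstration
html = '<p>Hello</p><img src="smiley.gif" alt="Smiley"><img src="logo.png">'

# Capture the src attribute of every <img> tag
srcs = re.findall(r'<img[^>]+src="([^"]+)"', html)
print(srcs)  # ['smiley.gif', 'logo.png']
```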
Some instructions that might be of help:
Use Google Chrome. Hover the mouse over the image and right-click. Select "Inspect element". That will open a section where you'll be able to see the HTML near the image.
Use Beautiful Soup to parse the html:
from BeautifulSoup import BeautifulSoup
import urllib2

request = urllib2.Request(url)
response = urllib2.urlopen(request)
html = response.read()
soup = BeautifulSoup(html)
imgs = soup.findAll("img")
items = []
for img in imgs:
    print img['src']  # print the image location
    items.append(img['src'])  # store the locations for downloading later
Is there a way to get all the flags from https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags using python code?
I tried with pd.read_html and did not succeed. I tried scraping but it got so messy and I couldn't do it.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
# Scrape the webpage
soup = BeautifulSoup(page.content, 'html.parser')
flags = soup.find_all('a', attrs={'class': "image"})
Would be nice if I can download them to a specific folder too!
Thanks in advance!
Just as an alternative to yours and the well-described approach of MattieTK, you could also use CSS selectors to select your elements more specifically:
soup.select('img[src*="/Flag_of"]')
Iterate the ResultSet, pick the src and use a function to download the images:
for e in soup.select('img[src*="/Flag_of"]'):
    download_file('https:' + e.get('src'))
Example
import requests
from bs4 import BeautifulSoup

def download_file(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        file_name = url.split('/')[-1]
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    else:
        print('Image Couldn\'t be retrieved', url)

page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
soup = BeautifulSoup(page.content, 'html.parser')

for e in soup.select('img[src*="/Flag_of"]'):
    download_file('https:' + e.get('src'))
In your example, flags is a list of anchor tags, each containing an img tag.
What you want is a way to get each individual src attribute from the image tag.
You can achieve this by looping over the results of your soup.find_all like so. Each flag is separate, which allows you to get the contents of the flag (the image tag) and then the value of the src attribute.
for flag in soup.find_all('a', attrs={'class': "image"}):
    src = flag.contents[0]['src']
You can then work on downloading each of these to a file inside the loop.
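For the download step itself, here is one possible sketch using the standard library; the helper names and the 'flags' folder are illustrative choices, and it assumes Wikipedia's protocol-relative src values (starting with //):

```python
import os
from urllib.request import urlretrieve

def build_url(src):
    # Wikipedia serves protocol-relative srcs like //upload.wikimedia.org/...
    return 'https:' + src if src.startswith('//') else src

def download_image(src, folder='flags'):
    url = build_url(src)
    os.makedirs(folder, exist_ok=True)
    file_name = os.path.join(folder, url.split('/')[-1])
    urlretrieve(url, file_name)  # the network request happens here
    return file_name

print(build_url('//upload.wikimedia.org/Flag_of_France.svg'))
# https://upload.wikimedia.org/Flag_of_France.svg
```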
I am trying to test my website using Selenium, and I want to check whether all images have the alt attribute filled in or not. How can I check this?
<img src="/media/images/biren.png" alt="N. Biren" class ="img-fluid" >
I'm not well versed with selenium, but this can be done easily using requests with bs4 (simple web-scraping).
Please find an example code below:
import requests, bs4
url = 'HTML URL HERE!'
# get the url html txt
page = requests.get(url).text
# parse the html txt
soup = bs4.BeautifulSoup(page, 'html.parser')
# get all img tags
for image in soup.find_all('img'):
    try:
        # print alt text if it exists
        print(image['alt'])
    except KeyError:
        # print the complete img tag if not
        print(image)
Hi, I want to get the text (the number 18) from the em tag as shown in the picture above.
When I ran my code, it did not work and gave me only an empty list. Can anyone help me? Thank you~
here is my code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable JavaScript you'll see that the like count is loaded dynamically, so you have to use a service that renders the website, and then you can parse the content.
You can use an API: https://www.scraperapi.com/
Or run your own for example: https://github.com/scrapinghub/splash
EDIT:
First of all, I missed that you were using urlopen incorrectly; the correct way is described here: https://docs.python.org/3/howto/urllib2.html (assuming you are using Python 3, which seems to be the case judging by the print call).
Furthermore, looking at the issue again, it is a bit more complicated: the source code of the page actually loads an iframe, and that iframe contains the actual content. Hit Ctrl+U to see the source code of the original URL, since the site seems to block the browser context menu.
So in order to achieve your crawling objective you have to first grab the initial page and then grab the page you are interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"

with urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')

# iframe grabbed, construct real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']

# do your crawling
with urlopen(real_url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by analyzing the original URL and the URL in the iframe; at first glance it looked like the iframe URL could be constructed from the original URL.
You'll still need a rendered version of the iframe URL to grab your desired value.
I don't know what this site is about, but it seems they do not want to be crawled, so maybe you should respect that.
I am trying to automate the process of downloading imgur files, and for this purpose I am using BeautifulSoup to get the link. However, to be honest, I am pretty lost as to why this doesn't work, because according to my research it should:
soup = BeautifulSoup("http://imgur.com/ha0WYYQ")
imageUrl = soup.select('.image a')[0]['href']
The code above just returns an empty list, and therefore an error. I tried to modify it, but to no avail. Any and all input is appreciated.
<div class="post-image">
<a href="//i.imgur.com/ha0WYYQ.jpg" class="zoom">
<img src="//i.imgur.com/ha0WYYQ.jpg" alt="Frank in his bb8 costume" itemprop="contentURL">
</a>
</div>
This is the image tag; the class name "post-image" is a single word and cannot be split.
imageUrl = soup.select('.post-image a')[0]['href']
A shortcut for selecting a single tag:
imageUrl = soup.select_one('.post-image a')['href']
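To see the difference between the two on a small, made-up snippet mirroring the imgur markup above:

```python
from bs4 import BeautifulSoup

# A made-up snippet mirroring the imgur markup above
html = '<div class="post-image"><a href="//i.imgur.com/ha0WYYQ.jpg" class="zoom">x</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# select returns a list of matches; select_one returns the first match or None
print(soup.select('.post-image a')[0]['href'])   # //i.imgur.com/ha0WYYQ.jpg
print(soup.select_one('.post-image a')['href'])  # //i.imgur.com/ha0WYYQ.jpg
```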
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
There are a few things wrong with your approach:
BeautifulSoup does not expect a URL, so you will need to use a library to fetch the HTML stream first; and
Your selector seems invalid; based on what I can see, it should be .post-image a.
r = urllib.urlopen('http://imgur.com/ha0WYYQ').read()
soup = BeautifulSoup(r,'lxml')
soup.select('.post-image a')[0]['href']
Or more elegant:
with urllib.urlopen('http://imgur.com/ha0WYYQ') as f:
    r = f.read()

soup = BeautifulSoup(r, 'lxml')
result = soup.select('.post-image a')[0]['href']
This is my code
from bs4 import BeautifulSoup
import urllib.request
import re
print("Enter the link \n")
link = input()
url = urllib.request.urlopen(link)
content = url.read()
soup = BeautifulSoup(content, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.jpg'))]
print (len(links))
#print (links)
print("\n".join(links))
When I give the input as
http://keralapals.com/emmanuel-malayalam-movie-stills
I get the output
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-0.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-1.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-2.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-3.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-4.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-5.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-6.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-7.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-8.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-9.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-10.jpg
But, when I give the input
http://www.raagalahari.com/actress/13192/regina-cassandra-at-big-green-ganesha-2014.aspx
or
http://www.ragalahari.com/actress/13192/regina-cassandra-at-big-green-ganesha-2014.aspx
It produces no output :(
So, I need to get the links to the original pics. This page just contains thumbnails; when we click those thumbnails, we get the original image links.
I need to get those image links and download them :(
Any help is really welcome .. :)
Thank You
Muneeb K
The problem is that in the second case the actual image URLs ending with .jpg are inside the src attribute of the img tags:
<a href="/actress/13192/regina-cassandra-at-big-green-ganesha-2014/image61.aspx">
<img src="http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha61t.jpg" alt="Regina Cassandra" title="Regina Cassandra at BIG Green Ganesha 2014">
</a>
As one option, you can support this type of link too:
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.jpg'))]
imgs = [img['src'] for img in soup.find_all('img', src=lambda x: x and x.endswith('.jpg'))]
links += imgs
print (len(links))
print("\n".join(links))
For this url it prints:
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha61t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha105t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha106t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha107t.jpg
...
Note that instead of a regular expression I'm passing a function where I check that the src attribute ends with .jpg.
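The function-as-filter idea works on any attribute; here is a small self-contained demonstration (the HTML is invented):

```python
from bs4 import BeautifulSoup

# Invented document to demonstrate passing a function as an attribute filter
html = '<img src="photo1.jpg"><img src="banner.png"><img src="photo2.jpg"><img>'
soup = BeautifulSoup(html, 'html.parser')

# The function receives the attribute value (None when the attribute is absent)
# and should return True for tags to keep
jpgs = [img['src'] for img in soup.find_all('img', src=lambda x: x and x.endswith('.jpg'))]
print(jpgs)  # ['photo1.jpg', 'photo2.jpg']
```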
Hope it helps and you've learned something new today.