Capturing an image with Python and BeautifulSoup

Using the script below, I am trying to capture an image, save it to disk, and then store the local path in the DB.
I have written some simple code to capture the images from a webpage:
import urllib2
from os.path import basename
from urlparse import urlsplit
from bs4 import BeautifulSoup

url = "http://www.someweblink.com/path_to_the_target_webpage"
urlContent = urllib2.urlopen(url).read()
soup = BeautifulSoup(urlContent)
imgTags = soup.findAll('img')
for imgTag in imgTags:
    imgUrl = imgTag['src']
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName, 'wb')
        output.write(imgData)
        output.close()
    except:
        pass
The page markup for the image:
<div class="single-post-thumb"> <img width="620" height="330" src="http://ccccc.com/wp-content/uploads/2016/05/weerewr.jpg"/> </div>
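If you only want that one thumbnail rather than every img on the page, you can target the wrapping div with a CSS selector. A minimal sketch, assuming bs4 is installed and the markup is structured as shown above (the HTML string here just reproduces that snippet for illustration):

```python
from bs4 import BeautifulSoup

# sample markup mirroring the structure quoted above (hypothetical page content)
html = '''
<div class="single-post-thumb">
  <img width="620" height="330" src="http://ccccc.com/wp-content/uploads/2016/05/weerewr.jpg"/>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# select only the thumbnail image instead of iterating over every <img> tag
img = soup.select_one('div.single-post-thumb img')
print(img['src'])
```

select_one returns the first match or None, so it is worth checking the result before indexing into it on a real page.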

If you just want to download the image and you already have its URL, you can try this:
import urllib
img_url = "Image url goes here"
urllib.urlretrieve(img_url, 'test.jpg')
It will save the image under the name test.jpg in the current working directory.
Note: pass the full URL of the image; the "src" attribute of the img tag sometimes contains a relative URL.
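A minimal sketch of turning a relative src into a full URL with urljoin before downloading (the page URL and src value here are made-up examples):

```python
from urllib.parse import urljoin

page_url = "http://www.someweblink.com/path_to_the_target_webpage"
src = "/wp-content/uploads/2016/05/weerewr.jpg"  # relative src taken from an img tag

# urljoin resolves relative paths against the page url
# and leaves absolute urls untouched
img_url = urljoin(page_url, src)
print(img_url)
```

The resolved URL can then be passed to urlretrieve as in the snippet above.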

Related

Downloading pdf from URL with urllib (getting weird html output instead)

I am trying to download PDFs from several PDF URLs.
An example: https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf
This URL opens the PDF directly in my browser.
However, when I use this code to download it, it returns an HTML file instead:
link = "https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf"
urllib.request.urlretrieve(link, f"/content/drive/MyDrive/Research/pdfs/1.pdf")
The resulting "pdf" file contains HTML code rather than the PDF content.
How do I solve this issue? Appreciate any help, thanks!
You can use BeautifulSoup or lxml to find the <iframe> and get its src, and then use that to download the file:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup as BS
url = 'https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf'
response = urllib.request.urlopen(url)
soup = BS(response.read(), 'html.parser')
iframe = soup.find('iframe')
url = iframe['src']
filename = urllib.parse.unquote(url)
filename = filename.rsplit('/', 1)[-1]
urllib.request.urlretrieve(url, filename)
Alternatively, you can check a few files to see if they all use the same https://d2x0djib3vzbzj.cloudfront.net/ prefix and simply substitute it into the URL:
import urllib.request
import urllib.parse
url = 'https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf'
url = url.replace('https://www.fasb.org/page/showpdf?path=',
'https://d2x0djib3vzbzj.cloudfront.net/')
filename = urllib.parse.unquote(url)
filename = filename.rsplit('/', 1)[-1]
urllib.request.urlretrieve(url, filename)

BeautifulSoup: No connection adapters were found error

My goal is to search google for a specified string and save the image from the found url. I have been following online tutorials but keep getting the same InvalidSchema error and do not know why.
from PIL import Image
from bs4 import BeautifulSoup
import requests
text= "animal+crossing"
html_page = requests.get("https://www.google.com/search?q="+text)
soup = BeautifulSoup(html_page.text, 'html.parser')
image = soup.find('img')
img_url = image['src']
img = Image.open(requests.get(img_url, stream = True).raw)
img.save('image.jpg')
The image url that you're grabbing in this step: img_url = image['src'] is not actually a valid url. Here's the value I see for img_url when I run your code:
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
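That value is a data: URI (an inline base64 placeholder that Google serves before lazy-loading the real thumbnails), not an HTTP URL, which is why requests raises the error. As a rough sketch, you can either skip such entries or decode the payload directly, using the placeholder value from above:

```python
import base64

img_url = "data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="

if img_url.startswith("data:"):
    # decode the inline base64 payload instead of fetching it over HTTP
    header, encoded = img_url.split(",", 1)
    img_bytes = base64.b64decode(encoded)
    print(img_bytes[:6])  # the GIF magic bytes
else:
    pass  # a normal http(s) url, safe to fetch with requests.get(img_url)
```

Note that the decoded placeholder is just a 1x1 GIF; getting the real image URLs from Google search results generally requires parsing the page differently or using a browser-driven tool.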

Find and download image on multiple pages

If I request http://www.chictopia.com/photo/show/3
I get a proper image file.
However, if I set a range to crawl images across multiple web pages using a for loop with
f'http://www.chictopia.com/photo/show/+{x}'
I can't get the image files; it seems 0-byte files are downloaded.
Why do I get 0-byte image files, and could anyone explain how to parse images from multiple pages?
Thank you
import re
import requests
from bs4 import BeautifulSoup

for x in range(3, 6):
    response = requests.get(f'http://www.chictopia.com/photo/show/+{x}')
    print(response)
    soup = BeautifulSoup(response.text, 'html.parser')
    img_tags = soup.find_all('img')
    urls = [img['src'] for img in img_tags]
    for url in urls:
        filename = re.search(r'/([\w_-]+[400]+[.](jpg))$', url)
        if not filename:
            print("fail".format(url))
            continue
        with open(filename.group(1), 'wb') as f:
            if 'http' not in url:
                url = '{}{}'.format(response, url)
            response = requests.get(url)
            f.write(response.content)
Try this:
I modified the regex pattern and changed the request to use the proper image URL.
This code will now save every image whose link contains _400.jpg, under the following naming scheme.
import re
import requests
from bs4 import BeautifulSoup
import shutil

for x in range(3, 6):
    # note: no stray '+' in the page url
    response = requests.get(f'http://www.chictopia.com/photo/show/{x}')
    # print(response.status_code)
    soup = BeautifulSoup(response.text, 'html.parser')
    img_tags = soup.find_all('img')
    urls = [img['src'] for img in img_tags]
    for url in urls:
        filename = re.findall(r'(.+_400\.jpg)', url)
        if len(filename) != 0:
            image = filename[0]
            image_name = f"image_{image.split('/')[-1]}"
            response = requests.get(image, stream=True)
            with open(image_name, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
            print(f'Saved : {image_name}')
For example:
http://images2.chictopia.com/photos/mikajones/2162299642/2162299642_400.jpg -> saved as image_2162299642_400.jpg
Now, what was wrong with your code:
You used the wrong regex, and the matched value was not the complete URL of the image; that is why you got 0-byte files (you were never actually requesting the image URL).
All fixed.
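To illustrate the regex point, here is a small comparison on the sample image URL from above. In the original pattern, `[400]+` is a character class matching runs of the digits 4 and 0 (not the literal string "400"), and its capture group only covers the file basename, whereas the corrected pattern captures the full URL:

```python
import re

url = "http://images2.chictopia.com/photos/mikajones/2162299642/2162299642_400.jpg"

# original pattern: the capture group starts after the last '/', so at best
# it yields only the basename, never a fetchable url
bad = re.search(r'/([\w_-]+[400]+[.](jpg))$', url)
print(bad.group(1))  # just the file name, not the full url

# corrected pattern: greedy match captures the entire image url
good = re.findall(r'(.+_400\.jpg)', url)
print(good[0])
```

Requesting `bad.group(1)` as if it were a URL is exactly what produced the empty downloads.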

My code for web scraping media from a game with Python is not working

I have been trying to download pictures and audio from this web-based game, but it has not worked for me. Is there any way I can get help here, even if it is a tutorial?
import re
import requests
from bs4 import BeautifulSoup

def extract_images(site):
    """ Extract images from the url given"""
    response = requests.get(site)
    soup = BeautifulSoup(response.text, 'lxml.parser')
    image_tags = soup.find_all('image')
    # extract the urls
    urls = [image['src'] for img in img_tags]
    for url in urls:
        pattern = r'/([\w_-]+[.](jpg|gif|png))$'  # pattern to extract image files
        filename = re.search(pattern, url)
        if not filename:
            print("No filename {}", url)
            continue
        with open(filename.group(1), 'wb') as f:
            if 'http' not in url:  # relative request?
                url = "{}{}".format(site, url)
                print(url)
            response = requests.get(url)
            f.write(response.content)

if __name__ == "__main__":
    site = "https://heartofvegasslots.productmadness.com/"
    extract_images(site)
Great start, and you almost had it working.
There are a couple of lines that you should change to get it working.
First: this line won't work in its current form, as the list comprehension uses the wrong variable names:
`urls=[image['src'] for img in img_tags]`
Also, if you call image['src'] and 'src' doesn't exist, it will fail. Try the get method instead:
`urls=[img.get('src') for img in image_tags]`
Second, change this line:
image_tags=soup.find_all('image')
to
image_tags=soup.find_all('img')
I have changed it to 'img' instead of 'image' because that is the name of the HTML tag for images.
I couldn't get it working for the website you provided, but that page doesn't appear to have any img tags.
Here is my code, along with a website I have tried it on successfully.
import re
import requests
import lxml
from bs4 import BeautifulSoup

def extract_images(site):
    """ Extract images from the url given"""
    response = requests.get(site)
    soup = BeautifulSoup(response.text, 'lxml')
    image_tags = soup.find_all('img')
    print(image_tags)
    # extract the urls
    urls = [img.get('src') for img in image_tags]
    for url in urls:
        if url:
            pattern = r'/([\w_-]+[.](jpg|gif|png))$'  # pattern to extract image files
            filename = re.search(pattern, url)
            if not filename:
                print("No filename {}", url)
                continue
            with open(filename.group(1), 'wb') as f:
                print(f'writing (unknown)')
                if 'http' not in url:  # relative request?
                    url = "{}{}".format(site, url)
                    print(url)
                response = requests.get(url)
                f.write(response.content)

if __name__ == "__main__":
    site = "https://antennatestlab.com/"
    extract_images(site)

How to get image from webpage

I am editing a Python script which gets images from a webpage (which needs a private login, so there is no point in me posting a link to it). It uses the BeautifulSoup library, and the original script is here.
What I would like to do is to customize this script to get a single image, the HTML tag of which has the id attribute id="fimage". It has no class. Here is the code:
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import urllib.error
from urllib.request import urlopen

# use this image scraper from the location that
# you want to save scraped images to

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)

def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element tags
    images = [img for img in soup.find(id="fimage")]
    print(images)
    print(str(len(images)) + " images found.")
    # print 'Downloading images to current working directory.'
    # compile our unicode list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        filename = each.split('/')[-1]
        urlretrieve(each, filename)
    return image_links

get_images('http://myurl');

# a standard call looks like this
# get_images('http://www.wookmark.com')
For some reason, this doesn't seem to work. When run on the command line, it produces the output:
[]
0 images found.
UPDATE:
Okay, so I have changed the code, and the script now seems to find the image I'm trying to download, but it throws another error when run and cannot download it.
Here is the updated code:
from bs4 import BeautifulSoup
from urllib import request
import urllib.parse
import urllib.error
from urllib.request import urlopen

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)

def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element tags
    image = soup.find(id="logo", src=True)
    if image is None:
        print('No images found.')
        return
    image_link = image['src']
    filename = image_link.split('/')[-1]
    request.urlretrieve(filename)
    return image_link

try:
    get_images('https://pypi.python.org/pypi/ClientForm/0.2.10');
except ValueError as e:
    print("File could not be retrieved.", e)
else:
    print("It worked!")

# a standard call looks like this
# get_images('http://www.wookmark.com')
When run on the command line the output is:
File could not be retrieved. unknown url type: 'python-logo.png'
soup.find(id="fimage") returns one result, not a list. You are trying to loop over that one element, which means it'll try and list the child nodes, and there are none.
Simply adjust your code to take into account you only have one result; remove all the looping:
image = soup.find(id="fimage", src=True)
if image is None:
    print('No matching image found')
    return
image_link = image['src']
filename = image_link.split('/')[-1]
urlretrieve(image_link, filename)
I refined the search a little; by adding src=True you only match a tag if it has a src attribute.
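As for the "unknown url type" error in the update: the src value there is relative (python-logo.png), so it has to be resolved against the page URL before it can be retrieved. A rough sketch with urljoin, reusing the page URL and the src value from the question:

```python
from urllib.parse import urljoin

page_url = 'https://pypi.python.org/pypi/ClientForm/0.2.10'
src = 'python-logo.png'  # the relative src value from the error message

# resolve the relative src against the page url before downloading;
# the result is an absolute url that urlretrieve can handle
image_url = urljoin(page_url, src)
print(image_url)
```

The absolute URL can then be passed to urlretrieve along with the local filename.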
