How to test if a webpage is an image - python

Sorry that the title isn't very clear. Basically, I have a list of URLs and want to download only the ones that are pictures. Is there any way to check whether a URL points to an image, so that I can skip the ones that aren't?
Thanks in advance

You can use the requests module. Make a HEAD request and check the content type. A HEAD request does not download the response body.
import requests
response = requests.head(url)
print(response.headers.get('content-type'))
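To turn the header value into a yes/no answer, something like the following might do. This is only a sketch: the helper name is_image_content_type is made up for illustration, and it trusts whatever the server reports.

```python
def is_image_content_type(content_type):
    """Return True if a Content-Type header value names an image type."""
    if not content_type:
        # Header missing: treat the URL as "not known to be an image".
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")
```

With requests you would pass it response.headers.get('content-type'). Note that requests.head() does not follow redirects by default, so allow_redirects=True may be needed for links that redirect to the actual file.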

There is no reliable way. But you could find a solution that might be "good enough" in your case.
You could look at the file extension if it is present in the url e.g., .png, .jpg could indicate an image:
>>> import os
>>> name = url2filename('http://example.com/a.png?q=1')
>>> os.path.splitext(name)[1]
'.png'
>>> import mimetypes
>>> mimetypes.guess_type(name)[0]
'image/png'
where url2filename() function is defined here.
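The linked definition is not reproduced here. A minimal stand-in, assuming the function only needs to return the last path segment without the query string, could look like this (a sketch, not the linked implementation):

```python
import posixpath
from urllib.parse import unquote, urlsplit


def url2filename(url):
    """Return the last path segment of url, ignoring query and fragment.

    Hypothetical stand-in for the linked helper.
    """
    path = urlsplit(url).path          # '/a.png' for 'http://example.com/a.png?q=1'
    return posixpath.basename(unquote(path))
```

For 'http://example.com/a.png?q=1' this gives 'a.png', which is what the os.path.splitext() and mimetypes.guess_type() calls above expect.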
You could inspect Content-Type http header:
>>> import urllib.request
>>> r = urllib.request.urlopen(url) # make HTTP GET request, read headers
>>> r.headers.get_content_type()
'image/png'
>>> r.headers.get_content_maintype()
'image'
>>> r.headers.get_content_subtype()
'png'
You could check the very beginning of the http body for magic numbers indicating image files e.g., jpeg may start with b'\xff\xd8\xff\xe0' or:
>>> prefix = r.read(8)
>>> prefix # .png image
b'\x89PNG\r\n\x1a\n'
As @pafcu suggested in the answer to the related question, you could use the imghdr.what() function (note that imghdr is deprecated since Python 3.11 and was removed in Python 3.13):
>>> import imghdr
>>> imghdr.what(None, b'\x89PNG\r\n\x1a\n')
'png'
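If you would rather not depend on imghdr, the magic-number check can be done by hand. Below is a rough sketch covering a few common formats; the function name sniff_image_type and the signature table are my own choices, and the table is far from complete.

```python
# (signature, type name) pairs for a handful of common image formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
]


def sniff_image_type(prefix):
    """Guess an image type from the first bytes of a body, or return None."""
    for magic, name in SIGNATURES:
        if prefix.startswith(magic):
            return name
    # WebP is a RIFF container with 'WEBP' at byte offset 8.
    if prefix[:4] == b"RIFF" and prefix[8:12] == b"WEBP":
        return "webp"
    return None
```

Passing it the result of r.read(12) is enough for every format listed, so only a few bytes of the body need to be read.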

You can use mimetypes https://docs.python.org/3.0/library/mimetypes.html
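import urllib.request
from mimetypes import guess_extension
url = "http://example.com/image.png"
source = urllib.request.urlopen(url)
# get_content_type() strips parameters such as "; charset=..."
extension = guess_extension(source.info().get_content_type())
print(extension)
This prints ".png" if the server reports image/png (guess_extension returns the extension with its leading dot).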

Related

How do you find the filetype of an image in a url with nonobvious filetype in Python

Certain CDNs like googleusercontent don't (obviously) encode the filenames of images in their URLs, so you can't get the file type by simple string manipulation, as other answers here have suggested. Knowing this, how can I tell that
https://lh3.googleusercontent.com/pw/AM-JKLURvu-Ro2N3c1vm1PTM3a7Ae5nG3LNWynuKNEeFNBMwH_uWLQJe0q0HmaOzKC0k0gRba10SbonLaheGcNpxROnCenf1YJnzDC3jL-N9fTtZ7u0q5Z-3iURXtrt4GlyeEI3t4KWxprFDqFWRO29sJc8=w440-h248-no
is a gif whilst
https://lh3.googleusercontent.com/pw/AM-JKLXk2WxafqHOi0ZrETUh2vUNkiLyYW1jRmAQsHBmYyVP7Le-KBCSVASCgO2C6_3QbW3LcLYOV_8OefPafyz2i4g8nqpw8xZnIhzDdemd5dFPS5A7dVAGQWx9DIy5aYOGuh06hTrmfhF9mZmITjjTwuc=w1200-h600-no
is a jpg
Building on the responses to this question, you could try:
import requests
from PIL import Image # pillow package
from io import BytesIO
url = "your link"
image = Image.open( BytesIO( requests.get( url ).content))
file_type = image.format
This calls for downloading the entire file, though. If you're looking to do this in bulk, you might want to explore the option in the comment above that mentions "magic bytes"...
Edit:
You can also try to get the image type from the headers of the response to your url:
headers = requests.get(url).headers
file_type = headers.get('Content-Type', "nope/nope").split("/")[1]
# Will print 'nope' if 'Content-Type' header isn't found
print(file_type)
# Will print 'gif' or 'jpeg' for your listed urls
Edit 2:
If you only care about the file type of the link and not the file itself, you could use the head method instead of the get method of the requests module. It's faster because the response body isn't downloaded:
headers = requests.head(url).headers
file_type = headers.get('Content-Type', "nope/nope").split("/")[1]

How to handle complex file names simply and write two file name rules into code?

I have a problem with making filename rules in a Python image scraper.
Roughly, there are two types of image URLs on the site.
First, src="https://cdn2.ettoday.net/images/5694/5694939.jpg"
This one I can split like this:
file_name = image_url.split('/')[-1]
Then, I can get the filename as I wanted.
5694939.jpg
Second, this one seems complicated.
src="https://scontent-tpe1-1.xx.fbcdn.net/v/t1.6435-9/160617071_1533135443546936_5774762828455542817_n.jpg?_nc_cat=107&ccb=1-3&_nc_sid=8bfeb9&_nc_ohc=_qmfrLffEHEAX9MION0&_nc_ht=scontent-tpe1-1.xx&oh=1a66da48cab3dfefa7847d14a88e1099&oe=60F0936C"
Let's say I only want part of it and the ideal result would be like
5774762828455542817_n.jpg
How to split this complicated URL and how to make two or more rules for different image URLs?
Try splitting first on "/", then on "?", and then on "_":
>>> url = 'https://scontent-tpe1-1.xx.fbcdn.net/v/t1.6435-9/160617071_1533135443546936_5774762828455542817_n.jpg?_nc_cat=107&ccb=1-3&_nc_sid=8bfeb9&_nc_ohc=_qmfrLffEHEAX9MION0&_nc_ht=scontent-tpe1-1.xx&oh=1a66da48cab3dfefa7847d14a88e1099&oe=60F0936C'
>>> filename = url.split("/")[-1]
>>> filename
'160617071_1533135443546936_5774762828455542817_n.jpg?_nc_cat=107&ccb=1-3&_nc_sid=8bfeb9&_nc_ohc=_qmfrLffEHEAX9MION0&_nc_ht=scontent-tpe1-1.xx&oh=1a66da48cab3dfefa7847d14a88e1099&oe=60F0936C'
>>> filename = filename.split("?")[0]
>>> filename
'160617071_1533135443546936_5774762828455542817_n.jpg'
>>> filename = filename.split("_")
>>> filename = filename[0] if len(filename)==1 else "_".join(filename[-2:])
>>> filename
'5774762828455542817_n.jpg'
Or, as a two-liner:
filename = url.split("/")[-1].split("?")[0].split("_")
filename = filename[0] if len(filename)==1 else "_".join(filename[-2:])
This gives '5774762828455542817_n.jpg' as the output.
You could use urllib, which will serve your purpose.
import os
from urllib.parse import urlparse
image_url = "https://scontent-tpe1-1.xx.fbcdn.net/v/t1.6435-9/160617071_1533135443546936_5774762828455542817_n.jpg?_nc_cat=107&ccb=1-3&_nc_sid=8bfeb9&_nc_ohc=_qmfrLffEHEAX9MION0&_nc_ht=scontent-tpe1-1.xx&oh=1a66da48cab3dfefa7847d14a88e1099&oe=60F0936C"
data = urlparse(image_url)
print(os.path.basename(data.path)) # Output: 160617071_1533135443546936_5774762828455542817_n.jpg
You can use the standard library urllib for this task. The library provides the function urlparse to parse the URL. But you'd still have to split the path to only get the file name.
import urllib.parse
path = urllib.parse.urlparse(url).path
file_name = path.split('/')[-1]
You can also use the external library yarl, which provides an extensive URL class with a name attribute that lets you access the file name directly.
import yarl
file_name = yarl.URL(url).name
To only get 5774762828455542817_n.jpg for this particular URL you can split it and concatenate the two last items again.
>>> '_'.join(file_name.split('_')[-2:])
'5774762828455542817_n.jpg'
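As for having two or more rules, one option is to pick the rule by hostname. The sketch below assumes that every fbcdn.net URL should keep only the last two underscore-separated parts and that everything else keeps the plain basename; the names RULES, last_two_parts_rule and filename_for are invented for this example.

```python
import posixpath
from urllib.parse import urlsplit


def last_two_parts_rule(name):
    """Keep only the last two '_'-separated parts of a file name."""
    parts = name.split("_")
    return name if len(parts) == 1 else "_".join(parts[-2:])


# Host suffix -> rule. This mapping is an assumption; extend it per site.
RULES = [
    ("fbcdn.net", last_two_parts_rule),
]


def filename_for(url):
    """Apply the first matching host rule, else return the plain basename."""
    parts = urlsplit(url)
    name = posixpath.basename(parts.path)   # query string is already excluded
    host = parts.hostname or ""
    for suffix, rule in RULES:
        if host == suffix or host.endswith("." + suffix):
            return rule(name)
    return name
```

Adding a third kind of URL then means adding one function and one entry to RULES, rather than another branch of string splitting.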

How to download an image straight to a bytesIO variable?

I have an existing URL of an image.
I want to download the image straight to a variable (no need to actually save it to disk; maybe get it from a response?).
The end result should be "download an image into a BytesIO() variable".
What is the correct way to do so?
You can use requests:
import requests
from io import BytesIO
response = requests.get(url)
image_data = BytesIO(response.content)
Note this works in Python 3.X
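The resulting object behaves like an open binary file, which is what most image libraries expect. A small offline illustration, where a made-up byte string stands in for response.content:

```python
from io import BytesIO

# Stand-in for response.content: a PNG signature plus padding (not a real image).
payload = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
image_data = BytesIO(payload)

header = image_data.read(8)   # reading advances the position, like a file
image_data.seek(0)            # rewind before handing the buffer to other code
```

Remember the seek(0) if you peek at the data yourself before passing the buffer on.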
You could also just duck-type the underlying urllib3 response object, which is for many practical purposes the same interface as a BytesIO anyway.
Example using the PNG of your identicon:
>>> url = "https://www.gravatar.com/avatar/33f6d36c91913f4b6776525a09d131d0?s=32&d=identicon&r=PG&f=1"
>>> resp = requests.get(url, stream=True)
>>> resp.raw
<urllib3.response.HTTPResponse at 0x7fffe88927b8>
>>> resp.raw.read()
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00 \x00\x00\x00 \x08\x06\x00\x00\x00szz\xf4\x00\x00\x00\tpHYs\x00\x00\x0e\xc4\x00\x00\x0e\xc4\x01\x95+\x0e\x1b\x00\x00\x00\xf6IDATX\x85\xedW1\x12\xc20\x0c\x93\xb9\x0em\xc3\xeb\x98)3?b\x87\x9d\xcf\xd1\xa4[\xcd\x06\xd89bz\xe50C\xb4\xe5\xda\xaa\xba\xc8Qlbf\xc6\x0b\xd2.\xa1\x84\xfe\xda\x17\x9f\xa7!\x01\xf1\xfd\xf3\xee\xdc\x81\xb6\xf4Xo\x8al?#\x15\xd0h\xcf\xdbS\x0b\nO\x8f^\xfd\x02\x80\xe98\x81\xa3(\x1b\x81\xfe"k\x84G\xf9\xeet\x98\xa4\x00M#\x81\xb2\x9f\n\xc2\xc8\xc5"\xcb\xf8\n\\\xc0\x1fX\xe0. \xb7\xc0\xd82\xed\xf1b\x04\x08\x0b\xddw\xa0\n }\x17\xe8s\xbe\xd6\xf34\xc8\x9c\xd1|Y\x11.=\xe7&\x0c.w\x0b\xaa\x80*\xc0]\x00\xc5\xbd\xbc\xdcWg\xbd\x01\x9d3\xcdW\xcf\xfc\x07\xd09\xe3n\x81\xbb\x80<\x8aG.\xf6\x04V\xdfo\xcd\r\xfa[\xf7\x1d\xa8\x02h\xbe\xcd\xb2\x1fP};\x82\\Z9\x91\xcd\r\xcas=w4V\x13\xba4\'\xac~B\xcf\x1d\xee\x16\xb8\x0b\xb8\x03\x91\x99Z?\x1eYA8\x00\x00\x00\x00IEND\xaeB`\x82'

urllib: Get name of file from direct download link

Python 3. I probably need to use urllib to do this.
I need to know how to send a request to a direct download link, and get the name of the file it attempts to save.
(As an example, a KSP mod from CurseForge: https://kerbal.curseforge.com/projects/mechjeb/files/2355387/download)
Of course, the file ID (2355387) will be changed. It could be from any project, but always on CurseForge. (If that makes a difference on the way it's downloaded.)
That example link results in the file:
How can I return that file name in Python?
Edit: I should note that I want to avoid saving the file, reading the name, then deleting it if possible. That seems like the worst way to do this.
Using urllib.request, when you request a response from a url, the response contains a reference to the url you are downloading.
>>> from urllib.request import urlopen
>>> url = 'https://kerbal.curseforge.com/projects/mechjeb/files/2355387/download'
>>> response = urlopen(url)
>>> response.url
'https://addons-origin.cursecdn.com/files/2355/387/MechJeb2-2.6.0.0.zip'
You can use os.path.basename to get the filename:
>>> from os.path import basename
>>> basename(response.url)
'MechJeb2-2.6.0.0.zip'
from urllib import request
url = 'file download link'
filename = request.urlopen(request.Request(url)).info().get_filename()
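The two approaches can be combined: prefer the name from the Content-Disposition header and fall back to the final URL when the server doesn't send one. A sketch, where pick_filename is a made-up helper and headers is assumed to be an email.message.Message-like object such as the one returned by response.info():

```python
from email.message import Message
from os.path import basename
from urllib.parse import unquote, urlsplit


def pick_filename(headers, final_url):
    """Return the Content-Disposition filename, else the URL's last path segment."""
    name = headers.get_filename()   # None when the header carries no filename
    if name:
        return name
    return basename(unquote(urlsplit(final_url).path))


# Offline demonstration with a hand-built header object.
demo_headers = Message()
demo_headers["Content-Disposition"] = 'attachment; filename="MechJeb2-2.6.0.0.zip"'
demo_name = pick_filename(demo_headers, "https://example.com/download")
```

With a real response you would call pick_filename(response.info(), response.url).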

Python: saving image from web to disk

Can I save images to disk using python? An example of an image would be:
Easiest is to use urllib.urlretrieve.
Python 2:
import urllib
urllib.urlretrieve('http://chart.apis.google.com/...', 'outfile.png')
Python 3:
import urllib.request
urllib.request.urlretrieve('http://chart.apis.google.com/...', 'outfile.png')
If your goal is to download a png to disk, you can do so with urllib:
import urllib
urladdy = "http://chart.apis.google.com/chart?chxl=1:|0|10|100|1%2C000|10%2C000|100%2C000|1%2C000%2C000|2:||Excretion+in+Nanograms+per+gram+creatinine+milliliter+(logarithmic+scale)|&chxp=1,0|2,0&chxr=0,0,12.1|1,0,3&chxs=0,676767,13.5,0,lt,676767|1,676767,13.5,0,l,676767&chxtc=0,-1000&chxt=y,x,x&chbh=a,1,0&chs=640x465&cht=bvs&chco=A2C180&chds=0,12.1&chd=t:0,0,0,0,0,0,0,0,0,1,0,0,3,2,4,6,6,9,3,6,5,11,9,10,6,2,2,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0&chdl=n=87&chtt=William+MD+-+Buprenorphine+Graph"
filename = r"c:\tmp\toto\file.png"
urllib.urlretrieve(urladdy, filename)
In python 3, you will need to use urllib.request.urlretrieve instead of urllib.urlretrieve.
The Google chart API produces PNG files. Just retrieve them with urllib.urlopen(url).read() or something along these lines and save them to a file the usual way.
Full example:
>>> import urllib
>>> url = 'http://chart.apis.google.com/chart?chxl=1:|0|10|100|1%2C000|10%2C000|100%2C000|1%2C000%2C000|2:||Excretion+in+Nanograms+per+gram+creatinine+milliliter+(logarithmic+scale)|&chxp=1,0|2,0&chxr=0,0,12.1|1,0,3&chxs=0,676767,13.5,0,lt,676767|1,676767,13.5,0,l,676767&chxtc=0,-1000&chxt=y,x,x&chbh=a,1,0&chs=640x465&cht=bvs&chco=A2C180&chds=0,12.1&chd=t:0,0,0,0,0,0,0,0,0,1,0,0,3,2,4,6,6,9,3,6,5,11,9,10,6,2,2,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0&chdl=n=87&chtt=William+MD+-+Buprenorphine+Graph'
>>> image = urllib.urlopen(url).read()
>>> outfile = open('chart01.png','wb')
>>> outfile.write(image)
>>> outfile.close()
As noted in other examples, urllib.urlretrieve(url, outfilename) is even more straightforward, but playing with urllib and urllib2 will surely be instructive for you.
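For Python 3, the same open/read/write idea might look like the sketch below. The names save_stream and download are invented here; the copy is done in chunks so a large image never has to sit in memory all at once.

```python
import shutil
import urllib.request


def save_stream(source, out_path):
    """Copy a binary file-like object to out_path in chunks."""
    with open(out_path, "wb") as outfile:
        shutil.copyfileobj(source, outfile)


def download(url, out_path):
    """Fetch url and write the response body to out_path."""
    with urllib.request.urlopen(url) as response:
        save_stream(response, out_path)
```

The with blocks close both the response and the output file even if the transfer fails partway.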
