I'm playing around in Python, trying to download some images from imgur. I've been using urllib and urllib.urlretrieve, but you need to specify the extension when saving the file. This isn't a problem for most posts, since the link usually contains the extension (for example .jpg), but I'm not sure what to do when the extension isn't there. Is there any way to determine the image format of the file before downloading it? The question is mostly imgur-specific, but I wouldn't mind a solution that works for most image-hosting sites.
Thanks in advance
You can use imghdr.what(filename[, h]) in Python 2.7 and Python 3 to determine the image type.
Read here for more info, if you're using Python 2.7.
Read here for more info, if you're using Python 3.
Assuming the picture has no file extension, there's no way to determine which type it is before you download it. Every image format sets its initial bytes to a particular value. To inspect these 'magic' initial bytes, check out https://github.com/ahupp/python-magic, which matches the initial bytes against known image formats.
The code below downloads a picture from imgur and determines which file type it is.
import os
import shutil
import magic
import requests
path = os.path.expanduser('~/Desktop/picture')  # open() does not expand '~' by itself
r = requests.get('http://i.imgur.com/yed5Zfk.gif', stream=True)  # Download picture
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
    print(magic.from_file(path))  # Determine type
    # Prints: 'GIF image data, version 89a, 360 x 270'
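If installing libmagic is not an option, the same idea can be sketched by hand: compare the first bytes of the download against a few well-known signatures. This is only a minimal sketch; the function name sniff_image_format and the short signature table are illustrative assumptions, not part of python-magic or requests, and the table covers only a handful of formats.

```python
from typing import Optional

# Leading "magic" bytes of a few common image formats (deliberately incomplete).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
]


def sniff_image_format(header: bytes) -> Optional[str]:
    """Return an extension guess from the first bytes of a file, or None."""
    for magic_bytes, ext in SIGNATURES:
        if header.startswith(magic_bytes):
            return ext
    # WebP is a RIFF container: "RIFF" + 4 size bytes + "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None
```

With a streamed request, the first chunk of the response body (16 bytes is enough for the signatures above) can be passed to this function before the rest is written to disk, so the file can be saved with the right extension. The Content-Type response header may also give a hint, although it depends on the server being configured correctly.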
I have a task that requires me to get data from an image upload (jpg or png), resize it based on the requirement, and then transform it into pdf and then store in s3.
The file comes in as BytesIO.
I have Pillow available so I can resize the image with it
Now the file type is class 'PIL.Image.Image' and I don't know how to proceed.
I found img2pdf library on https://gitlab.mister-muffin.de/josch/img2pdf, but I don't know how to use it when I have PIL format (use tobytes()?)
The s3 upload also looks like a file-like object, so I don't want to save it into a temp file before loading it again. Do I even need img2pdf in this case then?
How do I achieve this goal?
EDIT: I tried using tobytes() and uploading to s3 directly. The upload was successful. However, when I download the file to see its content, it shows an empty page. It seems the image data is not being written into the pdf file.
EDIT 2: I actually went to s3 and checked the stored file. When I download it and try to open it, it says it cannot be opened.
EDIT 3: I don't really have working code as I'm still experimenting with what could work, but here's the gist:
data = request.FILES['file'].file # where the data is
im = Image.open(data)
(width, height) = (im.width // 2, im.height // 2) # example action I wanna take with Pillow
im_resized = im.resize((width, height))
data = im_resized.tobytes() # raw pixel data, not an encoded image file
# potential step for using img2pdf here but I don't know how
# img2pdf.convert(data) # this fails because "ImageOpenError: cannot read input image (not jpeg2000). PIL: error reading image: cannot identify image file <_io.BytesIO..."
# img2pdf.convert(im_resized) # this also fails because "TypeError: Neither implements read() nor is str or bytes"
upload_to_s3(data) # some function that utilizes boto3 to upload to s3
The problem is that you are passing a PIL Image.Image object (or its raw pixel data) instead of an encoded image such as JPEG or PNG.
Try this:
import io
import img2pdf
bytes_io = io.BytesIO()
image.save(bytes_io, 'PNG')  # encode the PIL image as PNG in memory
with open(output_pdf, "wb") as f:
    f.write(img2pdf.convert(bytes_io.getvalue()))
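Since the question also asks whether img2pdf is needed at all and wants to avoid temp files, here is a rough alternative that stays in memory and uses Pillow's own PDF writer instead of img2pdf. The helper name image_to_pdf_bytes and the scale parameter are made up for illustration, and the result would still have to be handed to whatever S3 upload function you use (for example wrapped in io.BytesIO).

```python
import io

from PIL import Image


def image_to_pdf_bytes(data: bytes, scale: float = 0.5) -> bytes:
    """Resize an encoded image (e.g. JPEG/PNG bytes) and return it as PDF bytes."""
    with Image.open(io.BytesIO(data)) as im:
        new_size = (max(1, int(im.width * scale)), max(1, int(im.height * scale)))
        # Older Pillow versions cannot write an alpha channel to PDF,
        # so convert to RGB before saving.
        resized = im.resize(new_size).convert("RGB")
    out = io.BytesIO()
    resized.save(out, format="PDF")
    return out.getvalue()
```

Unlike img2pdf, this route re-encodes the pixel data, so img2pdf may still be preferable when the original JPEG should be embedded without quality loss.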
I have a PDF file of around 20-25 pages. The aim of this tool is to split the PDF file into pages (using PyPDF2), save every page in a directory (using PyPDF2), convert the pages into images (using ImageMagick) and then perform OCR on them with tesseract (using PIL and PyOCR) to extract data. The tool will eventually get a tkinter GUI so that users can repeat the same operation many times by clicking a button. Throughout my heavy testing, I have noticed that if the whole process is repeated around 6-7 times, the tool/Python script crashes and Windows shows it as not responding. I have done some debugging, but unfortunately no error is thrown. Memory and CPU usage look fine, so there are no issues there either. I was able to narrow down the problem by observing that, before reaching the tesseract part, PyPDF2 and ImageMagick fail when they are run together. I was able to replicate the problem by simplifying it to the following Python code:
from wand.image import Image as Img
from PIL import Image as PIL
import pyocr
import pyocr.builders
import io, sys, os
from PyPDF2 import PdfFileWriter, PdfFileReader
def splitPDF (pdfPath):
    #Read the PDF file that needs to be parsed.
    pdfNumPages =0
    with open(pdfPath, "rb") as pdfFile:
        inputpdf = PdfFileReader(pdfFile)
        #Iterate on every page of the PDF.
        for i in range(inputpdf.numPages):
            #Create the PDF Writer Object
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            with open("tempPdf%s.pdf" %i, "wb") as outputStream:
                output.write(outputStream)
        #Get the number of pages that have been split.
        pdfNumPages = inputpdf.numPages
    return pdfNumPages
pdfPath = "Test.pdf"
for i in range(1,20):
    print ("Run %s\n--------" %i)
    #Split the PDF into Pages & Get PDF number of pages.
    pdfNumPages = splitPDF (pdfPath)
    print(pdfNumPages)
    for i in range(pdfNumPages):
        #Convert the split pdf page to image to run tesseract on it.
        with Img(filename="tempPdf%s.pdf" %i, resolution=300) as pdfImg:
            print("Processing Page %s" %i)
I have used the with statement to handle the opening and closing of files correctly, so there should be no memory leaks there. I have tried running the splitting part and the image conversion part separately, and they work fine when run alone. However, when the two are combined, the script fails after around 5-6 iterations. I have used try/except blocks, but no error is captured. I am also using the latest version of all the libraries. Any help or guidance is appreciated.
Thank you.
For future reference, the problem was due to the 32-bit version of ImageMagick as mentioned in one of the comments (thanks to emcconville). Uninstalling Python and ImageMagick 32-bit versions and installing both 64-bit versions fixed the problem. Hope this helps.
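Because the fix came down to matching 64-bit builds, it can help to confirm which build of Python is actually running. The following is a small sketch using only the standard library; it checks the interpreter only, not ImageMagick, whose build has to be checked separately.

```python
import platform
import struct
import sys

# Size of a C pointer in this interpreter: 4 bytes on 32-bit, 8 bytes on 64-bit.
pointer_bits = struct.calcsize("P") * 8
# Equivalent check based on the largest native integer index.
is_64bit = sys.maxsize > 2**32

print(f"Python {platform.python_version()} is a {pointer_bits}-bit build")
```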
I am using Python 3 with Pyramid and ReportLab to generate dynamic PDFs.
I am having an issue writing images into a PDF. I am using ReportLab in a web app to generate a PDF with images, but my images are not stored locally; they are on a remote server. I am downloading them into a local temp directory (they are saved, I have checked). When I go to add the images to the PDF, the space is allocated but the image does not show up.
Here is my relevant code (simplified):
# creates pdf in memory
doc = SimpleDocTemplate(pdfName, pagesize=A4)
elements = []
for item in model['items']:
    # image goes here:
    if item['IMAGENAME']:
        response = getImageFromRemoteServer(item['IMAGENAME'])
        dir_filename = directory + item['IMAGENAME']
        if response.status_code == 200:
            with open(dir_filename, 'wb') as f:
                for chunk in response.iter_content():
                    f.write(chunk)
            # append to the list that is passed to doc.build()
            elements.append(Image(dir_filename, width=2*inch, height=2*inch))
# create and save the pdf
doc.build(elements,canvasmaker=NumberedCanvas)
I have followed the user guide here https://www.reportlab.com/docs/reportlab-userguide.pdf and have tried the above way, plus embedded images (as the user guide says in the paragraph section) and putting the image in the table.
I also looked here: and it did not help me.
My question is really: what is the right way to download an image and put it in a PDF?
EDIT: fixed code indentation
EDIT 2:
Answered: I was finally able to get the images into the PDF. I am not sure what triggered it to work. The only thing I know I changed is that I am now using urllib to do the request, and before I was not. Here is my working code (simplified for the question only; it is more abstracted and encapsulated in my actual code):
doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
    # image goes here:
    if question['IMAGEFILE']:
        filename = question['IMAGEFILE']
        dir_filename = directory + filename
        url = get_url(settings, filename)
        response = urllib.request.urlopen(url)
        raw_data = response.read()
        f = open(dir_filename, 'wb')
        f.write(raw_data)
        f.close()
        response.close()
        myImage = Image(dir_filename)
        myImage.drawHeight = 2* inch
        myImage.drawWidth = 2* inch
        myImage.hAlign = "LEFT"
        elements.append(myImage)
# create and save the pdf
doc.build(elements)
Make your code independent from where the files come from. Separate file/resource retrieval from document generation. Ensure that your toolset is working with local files. Encapsulate the code to load files in a loader class or function. The encapsulation is what matters. Noticed this again this week while looking at thumbor loader classes.
If that works, you know reportlab, PIL and your application basically work.
Then make your code work with remote files using URI like http://path/to/remote/files.
Afterwards you can switch from using your fileloader or your httploader depending on environment or use case.
Another option to go would be to make your code work with local files using URI like file://path/to/file
This way the only thing that changes when switching from local to remote is the URL. You need a Python library that supports this. The requests library is well suited for downloading things, but it does not handle the file:// URL scheme out of the box; urllib.request.urlopen handles both http:// and file:// URLs.
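One way to sketch the loader idea, assuming urllib.request from the standard library is acceptable: a single function that accepts any URI and returns an in-memory buffer, so the PDF code never needs to know where the bytes came from. The names load_resource and local_uri are hypothetical helpers introduced here for illustration.

```python
import io
import urllib.request
from pathlib import Path


def load_resource(uri: str) -> io.BytesIO:
    """Fetch a resource by URI and return it as an in-memory, seekable buffer."""
    with urllib.request.urlopen(uri) as response:
        return io.BytesIO(response.read())


def local_uri(path: str) -> str:
    """Build a file:// URI for a local path."""
    return Path(path).resolve().as_uri()
```

During development you could call load_resource(local_uri("some/local/image.png")), and later pass an http:// URL to the same function without touching the document-generation code.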
Most probably the lazy parameter was the reason your first code sample did not render the images. Triggering reportlab PDF rendering outside of the context managers for the temporary files could have led to this behaviour.
reportlab.platypus.flowables.py (using version 3.1.8)
class Image(Flowable):
    """an image (digital picture). Formats supported by PIL/Java 1.4 (the Python/Java Imaging Library
    are supported. At the present time images as flowables are always centered horozontally
    in the frame. We allow for two kinds of lazyness to allow for many images in a document
    which could lead to file handle starvation.
    lazy=1 don't open image until required.
    lazy=2 open image when required then shut it.
    """
    _fixedWidth = 1
    _fixedHeight = 1
    def __init__(self, filename, width=None, height=None, kind='direct', mask="auto", lazy=1):
        """If size to draw at not specified, get it from the image."""
        self.hAlign = 'CENTER'
        self._mask = mask
        fp = hasattr(filename,'read')
        if fp:
            self._file = filename
            self.filename = repr(filename)
        ...
The last three lines of the code example tell you that you can pass an object that has a read method. This is exactly what a call to urllib.request.urlopen(url) returns. Using that memory buffer, you create an instance of Image. There is no need for write access to the filesystem and no need to delete the files after PDF rendering. We can apply this new knowledge to improve code readability: since your use case includes retrieving remote resources, using memory buffers that support the Python file API could be a much cleaner approach to assembling your PDF files.
from contextlib import closing
import urllib.request
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import inch
from reportlab.platypus import Image, SimpleDocTemplate
doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
    # download image and create Image from file-like object
    if question['IMAGEFILE']:
        filename = question['IMAGEFILE']
        image_url = get_url(settings, filename)
        with closing(urllib.request.urlopen(image_url)) as image_file:
            myImage = Image(image_file, width=2*inch, height=2*inch)
            myImage.hAlign = "LEFT"
            elements.append(myImage)
# create and save the pdf
doc.build(elements)
I want to do an image form submission, and I want to validate server side, where Python is running, that the submitted file really is an image. Is there a simple way to do this in pure Python?
A simple and naive way to do it would be with libmagic (for example the one at https://github.com/ahupp/python-magic). A better way, but it's not native Python and is a very extensive library, would be to use PIL http://www.pythonware.com/products/pil/.
Use PIL:
import sys
from PIL import Image
for infile in sys.argv[1:]:
    try:
        im = Image.open(infile)
        print(infile, im.format, "%dx%d" % im.size, im.mode)
    except IOError:
        # not an image file that PIL can identify
        pass
From the docs:
The Python Imaging Library supports a wide variety of image file
formats. To read files from disk, use the open function in the Image
module. You don't have to know the file format to open a file. The
library automatically determines the format based on the contents of
the file.
I am currently using PIL.
from PIL import Image
try:
    im=Image.open(filename)
    # do stuff
except IOError:
    # filename not an image file
    pass
However, while this sufficiently covers most cases, some image files, such as xcf, svg and psd, are not detected. Psd files throw an OverflowError exception.
Is there some way I could include them as well?
I have just found the builtin imghdr module. From python documentation:
The imghdr module determines the type
of image contained in a file or byte
stream.
This is how it works:
>>> import imghdr
>>> imghdr.what('/tmp/bass')
'gif'
Using a module is much better than reimplementing similar functionality
UPDATE: imghdr is deprecated as of Python 3.11 and was removed in Python 3.13.
In addition to what Brian is suggesting you could use PIL's verify method to check if the file is broken.
im.verify()
Attempts to determine if the file is
broken, without actually decoding the
image data. If this method finds any
problems, it raises suitable
exceptions. This method only works on
a newly opened image; if the image has
already been loaded, the result is
undefined. Also, if you need to load
the image after using this method, you
must reopen the image file.
In addition to the PIL image check, you can also add a file name extension check like this:
filename.lower().endswith(('.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif'))
Note that this only checks whether the file name has a valid image extension; it does not open the file to see if it is a valid image. That is why you also need PIL or one of the libraries suggested in the other answers.
A lot of times the first couple chars will be a magic number for various file formats. You could check for this in addition to your exception checking above.
One option is to use the filetype package.
Installation
python -m pip install filetype
Advantages
Fast: Does its work by loading only the first few bytes of your image (check on the magic number)
Supports different mime type: Images, Videos, Fonts, Audio, Archives.
Example
filetype >= 1.0.7
import filetype
filename = "/path/to/file.jpg"
if filetype.is_image(filename):
    print(f"{filename} is a valid image...")
elif filetype.is_video(filename):
    print(f"{filename} is a valid video...")
filetype <= 1.0.6
import filetype
filename = "/path/to/file.jpg"
if filetype.image(filename):
    print(f"{filename} is a valid image...")
elif filetype.video(filename):
    print(f"{filename} is a valid video...")
Additional information on the official repo: https://github.com/h2non/filetype.py
Update
I also implemented the following solution in my Python script here on GitHub.
I also verified that damaged files (jpg) are frequently not 'broken' images, i.e. a damaged picture file sometimes remains a legitimate picture file: the original image is lost or altered, but you can still load it with no errors. File truncation, however, always causes errors.
End Update
You can use Python Pillow(PIL) module, with most image formats, to check if a file is a valid and intact image file.
If you also aim to detect broken images, @Nadia Alramli correctly suggests the im.verify() method, but this does not detect all possible image defects; e.g., im.verify does not detect truncated images (which most viewers load with a greyed area).
Pillow is able to detect these types of defects too, but you have to apply an image manipulation or an image decode/recode in order to trigger the check. Finally, I suggest using this code:
from PIL import Image
try:
    im = Image.open(filename)
    im.verify() # I also perform verify; I don't know if it sees other types of defects
    im.close() # reload is necessary in my case
    im = Image.open(filename)
    im.transpose(Image.FLIP_LEFT_RIGHT)
    im.close()
except Exception:
    # manage exceptions here
    pass
In case of image defects this code will raise an exception.
Please consider that im.verify is about 100 times faster than performing the image manipulation (and I think that flip is one of the cheaper transformations).
With this code you are going to verify a set of images at about 10 MBytes/sec with standard Pillow or 40 MBytes/sec with Pillow-SIMD module (modern 2.5Ghz x86_64 CPU).
For other formats such as xcf, you can use Wand, the ImageMagick wrapper (see the Wand documentation for usage and installation). The code is as follows:
import wand.image
im = wand.image.Image(filename=filename)
im.flip()
im.close()
However, in my experiments Wand does not detect truncated images; I think it loads the missing parts as a greyed area without any warning.
I read that ImageMagick has an external command, identify, that could do the job, but I have not found a way to invoke it programmatically and I have not tested this route.
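For what it's worth, one possible way to call identify from Python is through the standard subprocess module. This is an untested-in-anger sketch: it assumes the identify executable is on PATH (ImageMagick 7 installs may need ["magick", "identify", path] instead), the function name identify_image is invented here, and whether identify actually flags truncated files would still need to be checked.

```python
import subprocess
from typing import Optional


def identify_image(path: str) -> Optional[str]:
    """Return ImageMagick's one-line description of a file, or None on failure."""
    try:
        result = subprocess.run(
            ["identify", path], capture_output=True, text=True, timeout=30
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # ImageMagick is not installed, or it took too long on this file.
        return None
    # A non-zero exit code means identify could not read the file as an image.
    return result.stdout.strip() if result.returncode == 0 else None
```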
I suggest always performing a preliminary check that the file size is not zero (or very small), which is very cheap:
import os
statfile = os.stat(filename)
filesize = statfile.st_size
if filesize == 0:
    # manage here the 'faulty image' case
    pass
On Linux, you could use python-magic which uses libmagic to identify file formats.
AFAIK, libmagic looks into the file and tries to tell you more about it than just the format, like bitmap dimensions, format version etc.. So you might see this as a superficial test for "validity".
For other definitions of "valid" you might have to write your own tests.
You could use the Python bindings to libmagic, python-magic and then check the mime types. This won't tell you if the files are corrupted or intact but it should be able to determine what type of image it is.
Adapting from Fabiano and Tiago's answer.
from PIL import Image
def check_img(filename):
    try:
        im = Image.open(filename)
        im.verify()
        im.close()
        im = Image.open(filename)
        im.transpose(Image.FLIP_LEFT_RIGHT)
        im.close()
        return True
    except:
        print(filename,'corrupted')
        return False
if not check_img('/dir/image'):
    print('do something')
The file extension can also be used to check for image files, as follows.
import os
for f in os.listdir(folderPath):
    # match the end of the name, not any substring
    if f.lower().endswith((".jpg", ".bmp")):
        filePath = os.path.join(folderPath, f)

formats = (".jpg", ".png", ".jpeg")
for (root, dirs, files) in os.walk(path):
    for file in files:
        if file.endswith(formats):
            print(root)
            print("Valid", file)
        else:
            print(root)
            print("InValid", file)
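A slightly tidier variant of the extension check, sketched with pathlib. The helper name has_image_extension and the exact set of suffixes are illustrative choices; like the snippets above, it only looks at the file name and says nothing about the file's contents.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tiff"}


def has_image_extension(name: str) -> bool:
    """True if the file name ends with a known image extension (case-insensitive)."""
    return Path(name).suffix.lower() in IMAGE_SUFFIXES
```

Because only the final suffix is compared, a name such as photo.jpg.txt is rejected, which a substring test like ".jpg" in f would accept.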