How to solve MemoryError using Python 3.7 pdf2image library?

I'm running a simple PDF to image conversion using the Python pdf2image library. I can understand that some maximum memory threshold is being crossed by this library to arrive at this error. But the PDF is only about 6.6 MB, so why would converting it take up gigabytes of memory?
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pdf2image import convert_from_path
>>> pages = convert_from_path(r'C:\Users\aakashba598\Documents\pwc-annual-report-2017-2018.pdf', 200)
Exception in thread Thread-3:
Traceback (most recent call last):
File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 917, in _bootstrap_inner
self.run()
File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\subprocess.py", line 1215, in _readerthread
buffer.append(fh.read())
MemoryError
Also, what is the possible solution to this?
Update: When I reduced the dpi parameter in the convert_from_path function, it works like a charm. But the pictures produced are low quality (for obvious reasons). Is there a way to fix this, like batch-by-batch creation of images, clearing memory every time? If there is a way, how do I go about it?

Convert the PDF in blocks of 10 pages each time (1-10, 11-20, and so on):
from pdf2image import pdfinfo_from_path, convert_from_path

info = pdfinfo_from_path(pdf_file, userpw=None, poppler_path=None)
maxPages = info["Pages"]
for page in range(1, maxPages + 1, 10):
    # save or process each batch here so the images don't all accumulate in memory
    images = convert_from_path(pdf_file, dpi=200, first_page=page,
                               last_page=min(page + 10 - 1, maxPages))

I am a bit late to this, but the problem is indeed related to the 136 pages going into memory. You can do three things.
Specify a format for the converted images.
By default, pdf2image uses PPM as its image format. It is faster, but it also takes a lot more memory (over 30 MB per image!). What you can do to fix this is use a more memory-friendly format like JPEG or PNG:
convert_from_path(r'C:\path\to\your\pdf', fmt='jpeg')
That will probably solve the problem, but mostly just because of the compression; at some point (say for a 500+ page PDF) the problem will reappear.
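As a rough back-of-envelope check of the asker's numbers (assuming A4 pages, which the question does not state): at 200 DPI an A4 page is about 1654 x 2339 pixels, which at 3 bytes per RGB pixel is roughly 11-12 MB of raw data per page, and PPM stores it uncompressed. 136 such pages held in memory at once approach 1.5 GB, which the 32-bit Python process shown in the traceback cannot comfortably address, regardless of the 6.6 MB size of the compressed PDF.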
Use an output directory
This is the one I would recommend because it allows you to process any PDF. The example on the README page explains it well:
import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(r'C:\path\to\your\pdf', output_folder=path)
This writes the image to your computer storage temporarily so you don't have to delete it manually. Make sure to do any processing you need to do before exiting the with context though!
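For instance, a minimal sketch of doing the work inside the context (the per-page save here is just an illustrative stand-in for whatever processing you actually need):

import tempfile
from pdf2image import convert_from_path

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(r'C:\path\to\your\pdf', output_folder=path)
    for page_number, page_image in enumerate(images_from_path, start=1):
        # do the real work here, before the directory (and its files) is deleted
        page_image.save(f'page-{page_number}.jpg')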
Process the PDF file in chunks
pdf2image allows you to define the first and last page that you want to process. That means that in your case, with a PDF of 136 pages, you could do:

for i in range(0, 136 // 10 + 1):
    convert_from_path(r'C:\path\to\your\pdf',
                      first_page=i * 10 + 1,
                      last_page=min((i + 1) * 10, 136))

The accepted answer has a small issue.
maxPages = pdf2image._page_count(pdf_file)
can no longer be used, as _page_count is deprecated. I found a working solution for the same:
from PyPDF2 import PdfFileReader
import pdf2image

inputpdf = PdfFileReader(open(pdf, "rb"))
maxPages = inputpdf.numPages
for page in range(1, maxPages + 1, 100):
    pil_images = pdf2image.convert_from_path(pdf, dpi=200, first_page=page,
                                             last_page=min(page + 100 - 1, maxPages),
                                             fmt='jpg', thread_count=1, userpw=None,
                                             use_cropbox=False, strict=False)
This way, however large the file, the code processes 100 pages at a time and RAM usage stays minimal.
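To actually keep RAM flat, each batch also has to leave memory before the next one is converted. A minimal sketch (the output filenames are illustrative):

import pdf2image
from PyPDF2 import PdfFileReader

pdf = "document.pdf"  # illustrative path
with open(pdf, "rb") as fh:
    maxPages = PdfFileReader(fh).numPages
for page in range(1, maxPages + 1, 100):
    pil_images = pdf2image.convert_from_path(pdf, dpi=200, first_page=page,
                                             last_page=min(page + 100 - 1, maxPages), fmt='jpg')
    for i, image in enumerate(pil_images):
        image.save(f"page-{page + i}.jpg")  # write each page to disk
    del pil_images  # drop the batch so it can be garbage-collected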

A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder). I guess https://github.com/Belval/pdf2image will help you understand.
Solution: Break the PDF into small parts and convert each into an image. The images can then be merged.
from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open("document.pdf", "rb"))
for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open("document-page%s.pdf" % i, "wb") as outputStream:
        output.write(outputStream)
See: split a multi-page pdf file into multiple pdf files with python?
import numpy as np
import PIL.Image

list_im = ['Test1.jpg', 'Test2.jpg', 'Test3.jpg']
imgs = [PIL.Image.open(i) for i in list_im]
# pick the image which is the smallest, and resize the others to match it (can be arbitrary image shape here)
min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1]
# use a list (not a generator) here; newer numpy rejects generators in hstack/vstack
imgs_comb = np.hstack([np.asarray(i.resize(min_shape)) for i in imgs])
# save that beautiful picture
imgs_comb = PIL.Image.fromarray(imgs_comb)
imgs_comb.save('Trifecta.jpg')
# for a vertical stacking it is simple: use vstack
imgs_comb = np.vstack([np.asarray(i.resize(min_shape)) for i in imgs])
imgs_comb = PIL.Image.fromarray(imgs_comb)
imgs_comb.save('Trifecta_vertical.jpg')
Refer: Combine several images horizontally with Python

Eventually, combining these techniques, I ended up with the following code, given the goal of converting a PDF into a PPTX while avoiding memory overflow and keeping good speed in mind:
import os, sys, tempfile, pprint
from PIL import Image
from pdf2image import pdfinfo_from_path, convert_from_path
from pptx import Presentation
from pptx.util import Inches
from io import BytesIO

pdf_file = sys.argv[1]
print("Converting file: " + pdf_file)

# Prep presentation
prs = Presentation()
blank_slide_layout = prs.slide_layouts[6]

# Create working folder
base_name = pdf_file.split(".pdf")[0]

# Convert PDF to list of images
print("Starting conversion...")
print()
path: str = "C:/ppttemp"  # temp dir (use cron to delete files older than 1h hourly)
slideimgs = []
info = pdfinfo_from_path(pdf_file, userpw=None, poppler_path='C:/Program Files/poppler-0.90.1/bin/')
maxPages = info["Pages"]
for page in range(1, maxPages + 1, 5):
    slideimgs.extend(convert_from_path(pdf_file, dpi=250, output_folder=path,
                                       first_page=page, last_page=min(page + 5 - 1, maxPages),
                                       fmt='jpeg', thread_count=4,
                                       poppler_path='C:/Program Files/poppler-0.90.1/bin/',
                                       use_pdftocairo=True))
print("...complete.")
print()

# Loop over slides
for i, slideimg in enumerate(slideimgs):
    if i % 5 == 0:
        print("Saving slide: " + str(i))
    imagefile = BytesIO()
    slideimg.save(imagefile, format='jpeg')
    imagedata = imagefile.getvalue()
    imagefile.seek(0)
    width, height = slideimg.size

    # Set slide dimensions (9525 EMUs per pixel at 96 DPI)
    prs.slide_height = height * 9525
    prs.slide_width = width * 9525

    # Add slide
    slide = prs.slides.add_slide(blank_slide_layout)
    pic = slide.shapes.add_picture(imagefile, 0, 0, width=width * 9525, height=height * 9525)

# Save Powerpoint
print("Saving file: " + base_name + ".pptx")
prs.save(base_name + '.pptx')
print("Conversion complete. :)")
print()


Remove background from a directory of JPEG images

I wrote code to remove the background of 8,000 images, but the whole thing takes approximately 8 hours to produce results.
How can I improve its running time, as I have to work on a large dataset in the future?
Or do I have to write entirely new code? If so, please suggest some sample code.
from rembg import remove
import cv2
import glob

for img in glob.glob('../images/*.jpg'):
    a = img.split('../images/')
    a1 = a[1].split('.jpg')
    try:
        cv_img = cv2.imread(img)
        output = remove(cv_img)
    except:
        continue
    cv2.imwrite('../output image/' + str(a1[0]) + '.png', output)
One simple approach would be to divide the work across multiple threads. See ThreadPoolExecutor for more.
You can play around with max_workers= to see what gets the best results; if you leave it unset, recent Python versions default to min(32, os.cpu_count() + 4).
This sample code is ready to run. It assumes the image files are in the same directory as your main.py and that the output_image directory exists.
import cv2
import rembg
import sys
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

out_dir = Path("output_image")
in_dir = Path(".")

def is_image(absolute_path: Path):
    # is_file() must be called; adjust the extension to match your inputs
    return absolute_path.is_file() and str(absolute_path).endswith('.png')

input_filenames = [p for p in filter(is_image, Path(in_dir).iterdir())]

def process_image(img_path):
    try:
        image = cv2.imread(str(img_path))
        if image is None or not image.data:
            raise cv2.error("read failed")
        output = rembg.remove(image)
        out_path = out_dir / img_path.with_suffix(".png").name
        cv2.imwrite(str(out_path), output)
        return out_path
    except Exception as e:
        print(f"{img_path}: {e}", file=sys.stderr)

executor = ThreadPoolExecutor(max_workers=4)
for result in executor.map(process_image, input_filenames):
    print(f"Processing image: {result}")
Check out the U^2-Net repository. Like u2net_test.py, writing your own remove function and using dataloaders can speed up the process. If alpha matting is not necessary, skip it; otherwise you can add the alpha-matting code from rembg.
import os
import glob
import torch
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision import transforms
# SalObjDataset, RescaleT, ToTensorLab, normPRED, save_output and net
# come from the U^2-Net repository (see u2net_test.py)

def main():
    # --------- 1. get image path and name ---------
    model_name = 'u2net'  # u2netp
    image_dir = os.path.join(os.getcwd(), 'test_data', 'test_images')
    prediction_dir = os.path.join(os.getcwd(), 'test_data', model_name + '_results' + os.sep)
    model_dir = os.path.join(os.getcwd(), 'saved_models', model_name, model_name + '.pth')
    img_name_list = glob.glob(image_dir + os.sep + '*')
    print(img_name_list)

    # 1. dataloader
    test_salobj_dataset = SalObjDataset(img_name_list=img_name_list,
                                        lbl_name_list=[],
                                        transform=transforms.Compose([RescaleT(320),
                                                                      ToTensorLab(flag=0)]))
    test_salobj_dataloader = DataLoader(test_salobj_dataset,
                                        batch_size=1,
                                        shuffle=False,
                                        num_workers=1)

    for i_test, data_test in enumerate(test_salobj_dataloader):
        print("inferencing:", img_name_list[i_test].split(os.sep)[-1])
        inputs_test = data_test['image']
        inputs_test = inputs_test.type(torch.FloatTensor)
        if torch.cuda.is_available():
            inputs_test = Variable(inputs_test.cuda())
        else:
            inputs_test = Variable(inputs_test)
        d1, d2, d3, d4, d5, d6, d7 = net(inputs_test)

        # normalization
        pred = d1[:, 0, :, :]
        pred = normPRED(pred)

        # save results to test_results folder
        if not os.path.exists(prediction_dir):
            os.makedirs(prediction_dir, exist_ok=True)
        save_output(img_name_list[i_test], pred, prediction_dir)

        del d1, d2, d3, d4, d5, d6, d7
Try parallelization with multiprocessing, as Mark Setchell mentioned in his comment. I rewrote your code according to Method 8 from here. Multiprocessing should speed up your execution time. I did not test the code, so try it and see if it works.
import glob
from multiprocessing import Pool

import cv2
from rembg import remove

def remove_background(filename):
    a = filename.split("../images/")
    a1 = a[1].split(".jpg")
    try:
        cv_img = cv2.imread(filename)
        output = remove(cv_img)
    except Exception:
        return  # `continue` is only valid inside a loop; just skip this file
    cv2.imwrite("../output image/" + str(a1[0]) + ".png", output)

files = glob.glob("../images/*.jpg")
pool = Pool(8)
results = pool.map(remove_background, files)
Ah, you used the example from https://github.com/danielgatis/rembg#usage-as-a-library as a template for your code. Maybe try the other example, with a PIL image instead of OpenCV. PIL is usually slower, but who knows. Try it with maybe 10 images and compare execution times.
Here is your code using PIL instead of OpenCV. Not tested.
import glob

from PIL import Image
from rembg import remove

for img in glob.glob("../images/*.jpg"):
    a = img.split("../images/")
    a1 = a[1].split(".jpg")
    try:
        pil_img = Image.open(img)
        output = remove(pil_img)
    except:
        continue
    output.save("../output image/" + str(a1[0]) + ".png")
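To make the comparison concrete, a small timing harness could look like this (a sketch; remove_bg_opencv and remove_bg_pil are hypothetical wrappers around the two loops above, each taking a list of filenames):

import glob
import time

def timed(fn, files):
    # run one variant over the sample files and report wall-clock time
    start = time.perf_counter()
    fn(files)
    print(f"{fn.__name__}: {time.perf_counter() - start:.1f} s")

sample = glob.glob("../images/*.jpg")[:10]  # the suggested 10-image sample
# hypothetical wrappers around the OpenCV and PIL loops above:
# timed(remove_bg_opencv, sample)
# timed(remove_bg_pil, sample)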

How to set an ImageDocument to be not dirty in dm-script

How do I set an ImageDocument not to be dirty anymore in python dm-script without saving?
I have the python code posted below which can be represented by the following dm-script code.
String file_path = GetApplicationDirectory(0, 1).PathConcatenate("test-image.dm4");
Image img := realimage("test", 4, 64, 64);
ImageDocument doc = img.ImageGetOrCreateImageDocument();
doc.ImageDocumentSaveToFile("Gatan Format", file_path);
doc.ImageDocumentShowAtRect(100, 100, 164, 164);
The Python code below creates and shows an ImageDocument. The image is already saved. But even when saving it directly in DigitalMicrograph with its own module, it does not recognize that it is saved. I can link the file manually (by executing dm-script code from Python), but I cannot tell the program that the images are not modified.
There is a function ImageDocumentIsDirty(). But this function only returns whether the image is modified or not. I cannot set it.
My program creates a new workspace and loads more than 100 images. When closing DigitalMicrograph, it asks for every single one of the 100 images whether it should be saved. I cannot leave the user clicking No 100 times, especially since the files are already saved.
So, how do I tell dm-script that the image is saved already?
try:
    import DigitalMicrograph as DM
    import numpy as np
    import execdmscript
    import os

    name = "Test image"
    file_path = os.path.join(os.getcwd(), "test-image.dm4")

    # create image
    image_data = np.random.random((64, 64))
    image = DM.CreateImage(image_data)
    image.SetName(name)

    # create, save and show image document
    image_doc = image.GetOrCreateImageDocument()
    image_doc.SetName(name)
    image_doc.SaveToFile("Gatan Format", file_path)
    print("Saving image to", file_path)
    image_doc.ShowAtRect(100, 100, 164, 164)

    # link the image to the file
    dmscript = "\n".join((
        "for(number i = CountImageDocuments() - 1; i >= 0; i--){",
        "ImageDocument img_doc = GetImageDocument(i);",
        "if(img_doc.ImageDocumentGetName() == name){",
        "img_doc.ImageDocumentSetCurrentFile(path);",
        "break;",
        "}",
        "}"
    ))
    svars = {
        "name": image_doc.GetName(),
        "path": file_path
    }
    with execdmscript.exec_dmscript(dmscript, setvars=svars):
        pass
except Exception as e:
    print("{}: ".format(e.__class__.__name__), e)
    import traceback
    traceback.print_exc()
The command you're looking for is
void ImageDocumentClean( ImageDocument imgDoc )
as in
image img := realimage("test",4,100,100)
img.ShowImage()
imageDocument doc = img.ImageGetOrCreateImageDocument()
Result("\n Dirty? " + doc.ImageDocumentIsDirty())
doc.ImageDocumentClean()
Result("\n Dirty? " + doc.ImageDocumentIsDirty())
Also: the reason it becomes dirty in the first place is that window positions are stored as part of the document. (Other things, like tags, could also apply.)
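On the Python side of the question, one way to apply this (a sketch, untested; it reuses the execdmscript loop and the image_doc/file_path variables from the question) is to clean each document right after linking its file:

import execdmscript

dmscript = "\n".join((
    "for(number i = CountImageDocuments() - 1; i >= 0; i--){",
    "ImageDocument img_doc = GetImageDocument(i);",
    "if(img_doc.ImageDocumentGetName() == name){",
    "img_doc.ImageDocumentSetCurrentFile(path);",
    "img_doc.ImageDocumentClean();",  # clear the dirty flag, as above
    "break;",
    "}",
    "}"
))
svars = {"name": image_doc.GetName(), "path": file_path}
with execdmscript.exec_dmscript(dmscript, setvars=svars):
    pass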

Cannot get tiff image resolution

I'm trying to read 16 bit .tif microscope images from
https://data.broadinstitute.org/bbbc/BBBC006/
and analyze them using
https://github.com/sakoho81/pyimagequalityranking/tree/master/pyimq
however I got an error in the part of the code that loads the tif image.
It uses the PIL TiffImagePlugin:
https://pillow.readthedocs.io/en/3.0.0/_modules/PIL/TiffImagePlugin.html
and when it tries to get the resolution tag, it gives me a KeyError.
Any ideas why? Advice? Fixes?
Thanks!
import os
import numpy
import scipy.ndimage.interpolation as itp
import argparse
from PIL import Image
from PIL.TiffImagePlugin import X_RESOLUTION, Y_RESOLUTION
from matplotlib import pyplot as plt
from math import log10, ceil, floor

def get_image_from_imagej_tiff(cls, path):
    """
    A class method for opening a ImageJ tiff file. Using this method
    will enable the use of correct pixel size during analysis.
    :param path: Path to an image
    :return: An object of the MyImage class
    """
    assert os.path.isfile(path)
    assert path.endswith(('.tif', '.tiff'))
    print(path)  # my own little debug thingamajig
    image = Image.open(path)
    xresolution = image.tag.tags[X_RESOLUTION][0][0]  # line that errors out
    yresolution = image.tag.tags[Y_RESOLUTION][0][0]
    #data = utils.rescale_to_min_max(numpy.array(image), 0, 255)
    if data.shape[0] == 1:
        data = data[0]
    return cls(images=data, spacing=[1.0/xresolution, 1.0/yresolution])
terminal input:
pyimq.main --mode=directory --mode=analyze --mode=plot --working-directory=/home/myufa/predxion/BBBC/a_1_s1 --normalize-power --result=fstd --imagej
output:
Mode option is ['directory', 'analyze', 'plot']
/home/myufa/predxion/BBBC/a_1_s1/z0_a_1_s1_w1.tif
Traceback (most recent call last):
File "/home/myufa/.local/bin/pyimq.main", line 11, in <module>
load_entry_point('PyImageQualityRanking==0.1', 'console_scripts', 'pyimq.main')()
File "/home/myufa/anaconda3/lib/python3.7/site-packages/PyImageQualityRanking-0.1-py3.7.egg/pyimq/bin/main.py", line 148, in main
File "/home/myufa/anaconda3/lib/python3.7/site-packages/PyImageQualityRanking-0.1-py3.7.egg/pyimq/myimage.py", line 81, in get_image_from_imagej_tiff
KeyError: 282
Edit: Here's what I got when I tried some suggestions/indexed the tag, which makes even less sense
I guess the TIFF in question isn't following the normal image conventions. The [XY]Resolution tags, numbers 282 and 283, are mandatory or required in a whole bunch of specifications, but nonetheless may not be present in the output of all applications. I have some TIFFs (DNG format) that wouldn't load with PIL (Pillow) at all; that prompted me to write a script to dump the primary tag structure:
# TIFF structure program
import struct
import PIL.TiffTags

class DE:
    def __init__(self, tiff):
        self.tiff = tiff
        (self.tag, self.type, self.count, self.valueoroffset) = struct.unpack(
            tiff.byteorder + b'HHI4s', self.tiff.file.read(12))
        # TODO: support reading the value

    def getstring(self):
        offset = struct.unpack(self.tiff.byteorder + b'I', self.valueoroffset)[0]
        self.tiff.file.seek(offset)
        return self.tiff.file.read(self.count)

class IFD:
    def __init__(self, tiff):
        self.tiff = tiff
        self.offset = tiff.file.tell()
        (self.len,) = struct.unpack(self.tiff.byteorder + b'H', self.tiff.file.read(2))

    def __len__(self):
        return self.len

    def __getitem__(self, index):
        if index >= self.len or index < 0:
            raise IndexError()
        self.tiff.file.seek(self.offset + 2 + 12 * index)
        return DE(self.tiff)

    def nextoffset(self):
        self.tiff.file.seek(self.offset + 2 + 12 * self.len)
        (offset,) = struct.unpack(self.tiff.byteorder + b'I', self.tiff.file.read(4))
        return (offset if offset != 0 else None)

class TIFF:
    def __init__(self, file):
        self.file = file
        header = self.file.read(8)
        self.byteorder = {b'II': b'<', b'MM': b'>'}[header[:2]]
        (magic, self.ifdoffset) = struct.unpack(self.byteorder + b'HI', header[2:])
        assert magic == 42

    def __iter__(self):
        offset = self.ifdoffset
        while offset:
            self.file.seek(offset)
            ifd = IFD(self)
            yield ifd
            offset = ifd.nextoffset()

def main():
    tifffile = open('c:/users/yann/pictures/img.tiff', 'rb')
    tiff = TIFF(tifffile)
    for ifd in tiff:
        print(f'IFD at {ifd.offset}, {ifd.len} entries')
        for entry in ifd:
            print(f'  tag={entry.tag} {PIL.TiffTags.lookup(entry.tag).name}')

if __name__ == '__main__':
    main()
A quicker way, since you at least have the image object, might be:
import pprint, PIL.TiffTags
pprint.pprint(list(map(PIL.TiffTags.lookup, img.tag)))
One of these might give you a clue what the actual contents of the TIFF are. Since PIL could load it, it probably has pixel counts but not physical resolution.
Figured out a quick fix: writing
image.tag[X_RESOLUTION]
before
xresolution = image.tag.tags[X_RESOLUTION][0][0]
made the info available in the tag.tags dictionary, for some reason. Can anyone chime in and explain why this might be? Would love to learn/make sure I didn't mess it up.
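Independently of that workaround, it may be safer to treat the resolution tags as optional, since (as the answer above notes) they are not guaranteed to be present. A sketch, assuming the usual ((numerator, denominator),) layout of the tag value; the (1, 1) fallback, i.e. a resolution of 1, is an arbitrary choice:

from PIL import Image
from PIL.TiffImagePlugin import X_RESOLUTION, Y_RESOLUTION

image = Image.open(path)  # path as in the function above
# .get() avoids the KeyError when a tag is absent
xresolution = image.tag.tags.get(X_RESOLUTION, ((1, 1),))[0][0]
yresolution = image.tag.tags.get(Y_RESOLUTION, ((1, 1),))[0][0]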

Is there a way to speed up PDF page merging (basically watermarking one with the other), when the base page is used repeatedly?

Clarification: I don't want to add pages to a PDF file. I want to add content to a very big PDF page. The page changes sometimes and the content is different every time.
I'm using pypdf2 and reportlab to make small additions to big PDF pages (~10MB). This takes 30 seconds and more, and the majority of that time is spent parsing the original.
Usually the page also needs to be turned using mergeRotatedTranslatedPage.
My idea was to generate the content array of the original once and then copy it every time I want to add something. So I modified PageObject._merge to do just that. It worked... kind of. I'm now down to 18 sec.
Is there a better way to speed up this process? 18 sec for one page is still pretty slow.
If you want to use 100% of all your processor's cores, you can do it with multiprocessing, as follows:
We count the number of pages in the PDF and the number of cores your processor has, in order to calculate how many pages each core has to work on.
Each core is sent its range of pages, and at the end all the partial PDFs are joined.
# -*- coding: utf-8 -*-
from io import BytesIO
from PyPDF2 import PdfFileWriter, PdfFileReader, PdfFileMerger
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
from reportlab.lib.colors import Color
from webcolors import hex_to_rgb
import base64
import multiprocessing
import math

class SkyMark:
    mainPdf = ''
    mask = ''

    # entry point: call this from your REST API and pass the request
    def begin(self, request):
        stringPdfBase64 = str(request.json['pdfBase4'])
        blob = self.workMultiprocess(stringPdfBase64, 'MyWaterMark text')
        pdfbase64 = blob.decode('utf-8')
        return pdfbase64

    def workMultiprocess(self, stringPdfBase64, message='SkyMark'):
        try:
            # get a PdfFileReader object holding your watermark message
            self.mask = self.getMaskWaterMark(message)
            # convert the main base64 PDF to a PdfFileReader object
            sanitizedPdf = stringPdfBase64.rsplit(',', 1)[-1]
            data = base64.b64decode(sanitizedPdf)
            stream = BytesIO(data)
            self.mainPdf = PdfFileReader(stream, strict=False)
            numPaginas = self.mainPdf.getNumPages()
            # count the cores of your processor
            coresProcessor = int(multiprocessing.cpu_count()) or 22
            manager = multiprocessing.Manager()
            return_dict = manager.dict()
            jobs = []
            # calculate how many pages each core gets
            byPage = int(math.ceil(numPaginas / coresProcessor))
            pagesFrom = 0
            pagesTo = 0
            # send a task to every core
            for i in range(coresProcessor):
                pagesFrom = pagesTo
                pagesTo = pagesFrom + byPage
                if pagesTo > numPaginas:
                    pagesTo = numPaginas
                p = multiprocessing.Process(target=self.doByPage, args=(pagesFrom, pagesTo, i, return_dict))
                jobs.append(p)
                p.start()
                if pagesTo >= numPaginas:
                    break
            for proc in jobs:
                proc.join()
            # order the single PDFs before merging
            randData = return_dict.values()
            ascArray = sorted(randData, key=lambda k: k['procnum'])
            singlePdfsArray = []
            for pdfs in ascArray:
                singlePdfsArray.append(pdfs['dataB64'])
            # merge task
            return self.mergePdfsArray(singlePdfsArray)
        except Exception as e:
            print(f'Error {e}')

    # we max out the processor's cores
    def doByPage(self, fromPage, toPage, procnum, return_dict):
        output = PdfFileWriter()
        waterMark = self.mask.getPage(0)
        for i in range(fromPage, toPage):
            # print(f'WaterMark page: {i}, Core: {procnum}')
            page = self.mainPdf.getPage(i)
            page.mergePage(waterMark)
            page.compressContentStreams()
            output.addPage(page)
        letter_data = BytesIO()
        output.write(letter_data)
        letter_data.seek(0)
        dataB64 = base64.b64encode(letter_data.read())
        return_dict[procnum] = {'procnum': procnum, 'dataB64': dataB64}

    # single-page PDF with your watermark
    def getMaskWaterMark(self, texto):
        font_name = 'Helvetica'
        font_size = 22
        color = '#000000'
        opacity = 0.08
        x = 1
        y = 840
        bgTexto = ''
        for i in range(1, 6):
            bgTexto += ' ' + texto
        cantidadRenglones = 100
        mask_stream = BytesIO()
        watermark_canvas = canvas.Canvas(mask_stream, pagesize=A4)
        watermark_canvas.setFont(font_name, font_size)
        r, g, b = hex_to_rgb(color)
        c = Color(r, g, b, alpha=opacity)
        watermark_canvas.setFillColor(c)
        for i in range(1, cantidadRenglones):
            watermark_canvas.drawString(x, y - (i * 25), bgTexto)
        watermark_canvas.save()
        mask_stream.seek(0)
        mask = PdfFileReader(mask_stream, strict=False)
        return mask

    # merge all partial PDFs into a single PDF
    def mergePdfsArray(self, arrayPdfsBase64):
        merge = PdfFileMerger()
        for f in arrayPdfsBase64:
            nada = base64.b64decode(f)
            stre = BytesIO(nada)
            src = PdfFileReader(stre, strict=False)
            merge.append(src)
        letter_data = BytesIO()
        merge.write(letter_data)
        letter_data.seek(0)
        data = base64.b64encode(letter_data.read())
        return data
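For reference, a minimal way to drive the class outside a REST API might look like this (my sketch, untested; the file names and watermark text are illustrative, and the __main__ guard matters on platforms that spawn rather than fork processes):

import base64

if __name__ == "__main__":  # guard needed because of multiprocessing
    with open("big.pdf", "rb") as f:  # illustrative input file
        pdf_b64 = base64.b64encode(f.read()).decode("ascii")
    marker = SkyMark()
    out_b64 = marker.workMultiprocess(pdf_b64, "CONFIDENTIAL")
    with open("big-watermarked.pdf", "wb") as f:
        f.write(base64.b64decode(out_b64))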

Django PIL : IOError Cannot identify image file

I'm learning Python and Django.
An image is provided by the user using forms.ImageField(). Then I have to process it in order to create two different sized images.
When I submit the form, Django returns the following error:
IOError at /add_event/
cannot identify image file
I call the resize function:
def create_event(owner_id, name, image):
    image_thumb = image_resizer(image, name, '_t', 'events', 180, 120)
    image_medium = image_resizer(image, name, '_m', 'events', 300, 200)
I get an error when image_resizer is called for the second time:
def image_resizer(image, name, size, app_name, length, height):
    im = Image.open(image)
    if im.mode != "RGB":
        im = im.convert("RGB")
    im = create_thumb(im, length, height)
    posit = str(MEDIA_ROOT) + '/' + app_name + '/'
    image_2 = im
    image_name = name + size + '.jpg'
    imageurl = posit + image_name
    image_2.save(imageurl, 'JPEG', quality=80)
    url_image = '/' + app_name + '/' + image_name
    return url_image
Versions:
Django 1.3.1
Python 2.7.1
PIL 1.1.7
I'm trying to find the problem, but I don't know what to do. Thank you in advance!
EDIT
I solved it by rewriting the function; now it creates the different-sized images in a batch:
I call the resize function:
url_array = image_resizer.resize_batch(image, image_name, [[180,120,'_t'], [300,200,'_m']], '/events/')
so:
image_thumb = url_array[0]
image_medium = url_array[1]
and the resize function:
def resize_batch(image, name, size_array, position):
    im = Image.open(image)
    if im.mode != "RGB":
        im = im.convert("RGB")
    url_array = []
    for size in size_array:
        new_im = create_thumb(im, size[0], size[1])
        posit = str(MEDIA_ROOT) + position
        image_name = name + size[2] + '.jpg'
        imageurl = posit + image_name
        new_im.save(imageurl, 'JPEG', quality=90)
        new_url_array = position + image_name
        url_array.append(new_url_array)
    return url_array
Thanks to all!
As ilvar asks in the comments, what kind of object is image? I'm going to assume for the purposes of this answer that it's the file property of a Django ImageField that comes from a file uploaded by a remote user.
After a file upload, the object you get in the ImageField.file property is a TemporaryUploadedFile object that might represent a file on disk or in memory, depending on how large the upload was. This object behaves much like a normal Python file object, so after you have read it once (to make the first thumbnail), you have reached the end of the file, so that when you try to read it again (to make the second thumbnail), there's nothing there, hence the IOError. To make a second thumbnail, you need to seek back to the beginning of the file. So you could add the line
image.seek(0)
to the start of your image_resizer function.
But this is unnecessary! You have this problem because you are asking the Python Imaging Library to re-read the image for each new thumbnail you want to create. This is a waste of time: better to read the image just once and then create all the thumbnails you want.
I'm guessing that image is a TemporaryUploadedFile ... check this with type(image).
import cStringIO
from django.core.files.uploadedfile import TemporaryUploadedFile

if isinstance(image, TemporaryUploadedFile):
    temp_file = open(image.temporary_file_path(), 'rb+')
    content = cStringIO.StringIO(temp_file.read())
    image = Image.open(content)
    temp_file.close()
I'm not 100% sure of the code above ... comes from 2 classes I've got for image manipulation ... but give it a try.
If it is an InMemoryUploadedFile, your code should work!
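If you'd rather not branch on the upload class at all, another sketch (my own variant, untested on Django 1.3) is to rewind and buffer the upload once; this works for both temporary-file and in-memory uploads:

from io import BytesIO
from PIL import Image

def open_uploaded_image(image):
    image.seek(0)  # rewind in case the file was already read
    return Image.open(BytesIO(image.read()))  # independent, re-readable buffer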
