I am using python 3, with pyramid and reportlabs to generate dynamic pdfs.
I am having a issue writing images in to a pdf. I am using Reportlab in a web to generate a pdf with images, by my images are not stored locally, they are on a remote server. I am downloading them locally into a temp directory ( they are saving, I have checked) When i go to add the images to the pdf, they space is allocating but image is not showing up.
Here is my relevant code (simplified):
# creates pdf in memory
doc = SimpleDocTemplate(pdfName, pagesize=A4)
elements = []
for item in model['items']:
# image goes here:
if item['IMAGENAME']:
response = getImageFromRemoteServer(item['IMAGENAME'])
dir_filename = directory + item['IMAGENAME']
if response.status_code == 200:
with open(dir_filename, 'wb') as f:
for chunk in response.iter_content():
f.write(chunk)
questions.append(Image(dir_filename, width=2*inch, height=2*inch))
# create and save the pdf
doc.build(elements,canvasmaker=NumberedCanvas)
I have followed the user guide here https://www.reportlab.com/docs/reportlab-userguide.pdf and have tried the above way, plus embedded images (as the user guide says in the paragraph section) and putting the image in the table.
I also looked here: and it did not help me.
My question is really, what is the right what to download an image and put in a pdf?
EDIT: fixed code indentation
EDIT 2:
Answered, I was finally about to get the images in the PDF. I am not sure what was the trigger to get it to work. The only thing that know I change was now I am using urllib to do the request and before i was not. Here is the my working code (simplified for the question only, this is more abstracted and encapsulated in my code.):
doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
# image goes here:
if question['IMAGEFILE']:
filename = question['IMAGEFILE']
dir_filename = directory + filename
url = get_url(settings, filename)
response = urllib.request.urlopen(url)
raw_data = response.read()
f = open(dir_filename, 'wb')
f.write(raw_data)
f.close()
response.close()
myImage = Image(dir_filename)
myImage.drawHeight = 2* inch
myImage.drawWidth = 2* inch
myImage.hAlign = "LEFT"
elements.append(myImage)
# create and save the pdf
doc.build(elements)
Make your code independent from where the files come from. Separate file/resource retrieval from document generation. Ensure that your toolset is working with local files. Encapsulate the code to load files in a loader class or function. The encapsulation is what matters. Noticed this again this week while looking at thumbor loader classes.
If that works, you know reportlab, PIL and your application basically work.
Then make your code work with remote files using URI like http://path/to/remote/files.
Afterwards you can switch from using your fileloader or your httploader depending on environment or use case.
Another option to go would be to make your code work with local files using URI like file://path/to/file
This way the only thing that changes when switching from local to remote is the URL. Probably you need a python library supporting this. requests library is well suited for downloading things, most probably it supports URL scheme file:// as well.
Most probably the lazy parameter was responsible that your first code sample did not render the images. Triggering reportlab PDF rendering outside of the context managers for temporary files could have lead to this behaviour.
reportlab.platypus.flowables.py (using version 3.1.8)
class Image(Flowable):
"""an image (digital picture). Formats supported by PIL/Java 1.4 (the Python/Java Imaging Library
are supported. At the present time images as flowables are always centered horozontally
in the frame. We allow for two kinds of lazyness to allow for many images in a document
which could lead to file handle starvation.
lazy=1 don't open image until required.
lazy=2 open image when required then shut it.
"""
_fixedWidth = 1
_fixedHeight = 1
def __init__(self, filename, width=None, height=None, kind='direct', mask="auto", lazy=1):
"""If size to draw at not specified, get it from the image."""
self.hAlign = 'CENTER'
self._mask = mask
fp = hasattr(filename,'read')
if fp:
self._file = filename
self.filename = repr(filename)
...
The last three lines of the code example tell you that you can pass an object that has a read method. This is exactly what a call to urllib.request.urlopen(url) returns. Using that memory buffer you create an instance of Image. No need to have write access to filesystem, no need to delete these files after PDF rendering. Applying our new knowledge to improve code readability. Since your use-case includes retrieval of remote resources using memory buffers that support python file API could be a much cleaner approach to assemble your PDF files.
from contextlib import closing
import urllib.request
doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
# download image and create Image from file-like object
if question['IMAGEFILE']:
filename = question['IMAGEFILE']
image_url = get_url(settings, filename)
with closing(urllib.request.urlopen(image_url)) as image_file:
myImage = Image(image_file, width=2*inch, height=2*inch)
myImage.hAlign = "LEFT"
elements.append(myImage)
# create and save the pdf
doc.build(elements)
References
Coding with context managers
Related
I am Python beginner, and I want to read an .E01 image file with pytsk3.
pytsk3.FS_Info(image) creates this error:
OSError: FS_Info_Con: (tsk3.cpp:214) Unable to open the image as a filesystem at offset: 0x00000000 with error: Possible encryption detected (High entropy (8.00))
I used FTK imager to create the image, therefore it cannot be encrypted. I was able to open it with autopsy, so the image itself is fine (not corrupt).
import pytsk3
image_path = "C:/Users/Karin/Desktop/images/USB2.E01"
image = pytsk3.Img_Info(image_path) --> works correctly
fs = pytsk3.FS_Info(image) --> **does not work**
root = fs.open_dir(path="/")
for entry in root:
print(entry.info.name.name)
USB2.E01
E01 file is "forensically" created file. It's not a bit-to-bit copy of the source drive. It has additional information such as metadata and hashes of data chunks inside of it. Also this data could be compressed or not.
So, usually the basic programming method of reading simple files would not work.
I would recommend you to read more about the .E01 structure and check this library https://github.com/libyal/libewf
I have several powerpoints that I need to shuffle through programmatically and extract images from. The images then need to be converted into OpenCV format for later processing/analysis. I have done this successfully for images in the pptx, using:
for slide in presentation:
for shape in slide.shapes
if 'Picture' in shape.name:
pic_list.append(shape)
for extraction, and:
img = cv2.imdecode(np.frombuffer(page[i].image.blob, np.uint8), cv2.IMREAD_COLOR)
for python-pptx Picture to OpenCV conversion. However, I am having a lot of trouble extracting and manipulating the backgrounds in a similar fashion.
slide.background
is sufficient to extract a "_Background" object, but I have not found a good way to convert it into a OpenCV object similar to Pictures. Does anyone know how to do this? I am using python-pptx for extraction, but am not adverse to other packages if it's not possible with that package.
After a fair bit of work I discovered how to do this -- i.e., you don't. As far as I can tell, there is no way to directly extract the backgrounds with either python-pptx or Aspose. Powerpoint -- which, as it turns out, is an archive that can be unzipped with 7zip -- keeps its backgrounds disassembled in the ppt/media (pics), ppt/slideLayouts and ppt/slideMasters (text, formatting), and they are only put together by the Powerpoint renderer. This means that to extract the backgrounds as displayed, you basically need to run Powerpoint and take pics of the slides after removing text/pictures/etc. from the foreground.
I did not need to do this, as I just needed to extract text from the backgrounds. This can be done by checking slideLayouts and slideMasters XMLs using BeautifulSoup, at the <a:t> tag. The code to do this is pretty simple:
import zipfile
with zipfile.ZipFile(pptx_path, 'r') as zip_ref:
zip_ref.extractall(extraction_directory)
This will extract the .pptx into its component files.
from glob import glob
layouts = glob(os.path.join(extr_dir, 'ppt\slideLayouts\*.xml'))
masters = glob(os.path.join(extr_dir, 'ppt\slideMasters\*.xml'))
files = layouts + masters
This gets you the paths for slide layouts/masters.
from bs4 import BeautifulSoup
text_list = []
for file in files:
with open(file) as f:
data = f.read()
bs_data = BeautifulSoup(data, "xml")
bs_a_t = bs_data.find_all('a:t')
for a_t in bs_a_t:
text_list.append(str(a_t.contents[0]))
This will get you the actual text from the XMLs.
Hopefully this will be useful to someone else in the future.
Context:
I have PDF files I'm working with.
I'm using an ocr to extract the text from these documents and to be able to do that I have to convert my pdf files to images.
I currently use the convert_from_path function of the pdf2image module but it is very time inefficient (9minutes for a 9page pdf).
Problem:
I am looking for a way to accelerate this process or another way to convert my PDF files to images.
Additional info:
I am aware that there is a thread_count parameter in the function but after several tries it doesn't seem to make any difference.
This is the whole function I am using:
def pdftoimg(fic,output_folder):
# Store all the pages of the PDF in a variable
pages = convert_from_path(fic, dpi=500,output_folder=output_folder,thread_count=9, poppler_path=r'C:\Users\Vincent\Documents\PDF\poppler-21.02.0\Library\bin')
image_counter = 0
# Iterate through all the pages stored above
for page in pages:
filename = "page_"+str(image_counter)+".jpg"
page.save(output_folder+filename, 'JPEG')
image_counter = image_counter + 1
for i in os.listdir(output_folder):
if i.endswith('.ppm'):
os.remove(output_folder+i)
Link to the convert_from_path reference.
I found an answer to that problem using another module called fitz which is a python binding to MuPDF.
First of all install PyMuPDF:
The documentation can be found here but for windows users it's rather simple:
pip install PyMuPDF
Then import the fitz module:
import fitz
print(fitz.__doc__)
>>>PyMuPDF 1.18.13: Python bindings for the MuPDF 1.18.0 library.
>>>Version date: 2021-05-05 06:32:22.
>>>Built for Python 3.7 on win32 (64-bit).
Open your file and save every page as images:
The get_pixmap() method accepts different parameters that allows you to control the image (variation,resolution,color...) so I suggest that you red the documentation here.
def convert_pdf_to_image(fic):
#open your file
doc = fitz.open(fic)
#iterate through the pages of the document and create a RGB image of the page
for page in doc:
pix = page.get_pixmap()
pix.save("page-%i.png" % page.number)
Hope this helps anyone else.
I'm playing around in python trying to download some images from imgur. I've been using the urrlib and urllib.retrieve but you need to specify the extension when saving the file. This isn't a problem for most posts since the link has for example .jpg in it, but I'm not sure what to do when the extension isn't there. My question is if there is any way to determine the image format of the file before downloading it. The question is mostly imgur specific, but I wouldn't mind a solution for most image-hosting sites.
Thanks in advance
You can use imghdr.what(filename[, h]) in Python 2.7 and Python 3 to determine the image type.
Read here for more info, if you're using Python 2.7.
Read here for more info, if you're using Python 3.
Assuming the picture has no file extension, there's no way to determine which type it is before you download it. All image formats sets their initial bytes to a particular value. To inspect these 'magic' initial bytes check out https://github.com/ahupp/python-magic - it matches the initial bytes against known image formats.
The code below downloads a picture from imgur and determines which file type it is.
import magic
import requests
import shutil
r = requests.get('http://i.imgur.com/yed5Zfk.gif', stream=True) ##Download picture
if r.status_code == 200:
with open('~/Desktop/picture', 'wb') as f:
r.raw.decode_content = True
shutil.copyfileobj(r.raw, f)
print magic.from_file('~/Desktop/picture') ##Determine type
## Prints: 'GIF image data, version 89a, 360 x 270'
So the state I'm in released a bunch of data in PDF form, but to make matters worse, most (all?) of the PDFs appear to be letters typed in Office, printed/fax, and then scanned (our government at its best eh?). At first I thought I was crazy, but then I started seeing numerous pdfs that are 'tilted', like someone didn't get them on the scanner properly. So, I figured the next best thing to getting the actual text out of them, would be to turn each page into an image.
Obviously this needs to be automated, and I'd prefer to stick with Python if possible. If Ruby or Perl have some form of implementation that's just too awesome to pass up, I can go that route. I've tried pyPDF for text extraction, that obviously didn't do me much good. I've tried swftools, but the images I'm getting from that are just shy of completely unusable. It just seems like the fonts get ruined in the conversion. I also don't even really care about the image format on the way out, just as long as they're relatively lightweight, and readable.
If the PDFs are truly scanned images, then you shouldn't convert the PDF to an image, you should extract the image from the PDF. Most likely, all of the data in the PDF is essentially one giant image, wrapped in PDF verbosity to make it readable in Acrobat.
You should try the simple expedient of simply finding the image in the PDF, and copying the bytes out: Extracting JPGs from PDFs. The code there is dead simple, and there are probably dozens of reasons it won't work on your PDF files. But if it does, you'll have a quick and painless way to get the image data out of the PDF files.
You could call e.g. pdftoppm from the command-line (or using Python's subprocess module) and then convert the resulting PPM files to the desired format using e.g. ImageMagick (again, using subprocess or some bindings if they exist).
Ghostscript is ideal for converting PDF files to images. It is reliable and has many configurable options. Its also available under the GPL license or commercial license. You can call it from the command line or use its native API. For more information:
Ghostscript Main Website
Ghostscript docs on Command line usage
Another stackoverflow thread that provides some examples of invoking Ghostscript's command line interface from Python
Ghostscript API Documentation
Here's an alternative approach to turning a .pdf file into images: Use an image printer. I've successfully used the function below to "print" pdf's to jpeg images with ImagePrinter Pro. However, there are MANY image printers out there. Pick the one you like. Some of the code may need to be altered slightly based on the image printer you pick and the standard file saving format that image printer uses.
import win32api
import os
def pdf_to_jpg(pdfPath, pages):
# print pdf using jpg printer
# 'pages' is the number of pages in the pdf
filepath = pdfPath.rsplit('/', 1)[0]
filename = pdfPath.rsplit('/', 1)[1]
#print pdf to jpg using jpg printer
tempprinter = "ImagePrinter Pro"
printer = '"%s"' % tempprinter
win32api.ShellExecute(0, "printto", filename, printer, ".", 0)
# Add time delay to ensure pdf finishes printing to file first
fileFound = False
if pages > 1:
jpgName = filename.split('.')[0] + '_' + str(pages - 1) + '.jpg'
else:
jpgName = filename.split('.')[0] + '.jpg'
jpgPath = filepath + '/' + jpgName
waitTime = 30
for i in range(waitTime):
if os.path.isfile(jpgPath):
fileFound = True
break
else:
time.sleep(1)
# print Error if the file was never found
if not fileFound:
print "ERROR: " + jpgName + " wasn't found after " + str(waitTime)\
+ " seconds"
return jpgPath
The resulting jpgPath variable tells you the path location of the last jpeg page of the pdf printed. If you need to get another page, you can easily add some logic to modify the path to get prior pages
in pdf_to_jpg(pdfPath)
6 # 'pages' is the number of pages in the pdf
7 filepath = pdfPath.rsplit('/', 1)[0]
----> 8 filename = pdfPath.rsplit('/', 1)[1]
9
10 #print pdf to jpg using jpg printer
IndexError: list index out of range
With Wand there are now excellent imagemagick bindings for Python that make this a very easy task.
Here is the code necessary for converting a single PDF file into a sequence of PNG images:
from wand.image import Image
input_path = "name_of_file.pdf"
output_name = "name_of_outfile_{index}.png"
source = Image(filename=upload.original.path, resolution=300, width=2200)
images = source.sequence
for i in range(len(images)):
Image(images[0]).save(filename=output_name.format(i))