Crop a pdf page to content - python

Using Python, is it possible to crop a pdf page to the content as shown in the image below where the task is achieved in Inkscape? The bounding area for the content should be found automatically.
Using PyPDF2 I can crop the page, but it requires the coordinates to be manually found, which is tedious for a large number of files. In Inkscape, the coordinates are automatically found.
The code I'm using is shown below and an example input file is available here.
# Python 3.7.0
import PyPDF2 # version 1.26.0
with open('document-1.pdf','rb') as fin:
    pdf = PyPDF2.PdfFileReader(fin)
    page = pdf.getPage(0)
    # Coordinates found by inspection.
    # Can these coordinates be found automatically?
    page.cropBox.lowerLeft = (88,322)
    page.cropBox.upperRight = (508,602)
    output = PyPDF2.PdfFileWriter()
    output.addPage(page)
    with open('cropped-1.pdf','wb') as fo:
        output.write(fo)

I was able to do this with the pip-installable CLI pdfCropMargins: https://pypi.org/project/pdfCropMargins/
Since I originally answered, a Python interface has been added: https://github.com/abarker/pdfCropMargins#python-interface (h/t @Paul)
My original answer, which calls it from the command line, is below.
At the time I didn't believe there was a great way to call it directly from a script, so I used os.system.
$ python -m pip install pdfCropMargins --user
$ pdf-crop-margins document.pdf -o output.pdf -p 0
import os
os.system('pdf-crop-margins document.pdf -o output.pdf -p 0')
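The os.system call above can also be written with subprocess.run, which takes the arguments as a list (so file names with spaces need no quoting) and can raise an exception when the tool exits with an error. This is only a sketch: build_crop_command and crop_pdf are hypothetical helper names, and the only flags used are the -o and -p flags already shown in this answer.

```python
import subprocess


def build_crop_command(src, dst, percent_retain=0):
    """Return the pdf-crop-margins command line as a list of arguments."""
    # Mirrors: pdf-crop-margins document.pdf -o output.pdf -p 0
    return ["pdf-crop-margins", str(src), "-o", str(dst), "-p", str(percent_retain)]


def crop_pdf(src, dst, percent_retain=0):
    """Run pdf-crop-margins; check=True raises CalledProcessError on failure."""
    subprocess.run(build_crop_command(src, dst, percent_retain), check=True)
```

Calling crop_pdf("document.pdf", "output.pdf") should then behave like the command line above, and the same function can be called in a loop over many files.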

Related

Python exifread - Parsing difficulties

I'm trying to get the coordinates of a number of photos, i.e. I'm trying to get the exif data using a python script. The goal is to georeference all the photos and display their locations on a map. I am encountering problems with exif, however. I'm on Windows (64bit) and installed the corresponding (Strawberry) Perl software and then the Exiftool module (version 12.30) using Anaconda (Navigator), but to no avail. It gives me the following error: ModuleNotFoundError: No module named 'exif'. If I use the command pip install exif it tells me that the requirements are already met. What am I missing here? I'll gladly provide more information if required.
I also tried an alternative: the module exifread imports without problems but does not seem to have all the functionality I need. I can read the coordinates, but I can't extract them in a usable form: it gives me an IfdTag object when I would like an array of the degrees, minutes and seconds that I can then process further.
There is a utility function exifread.utils.get_gps_coords() that provides a convenient way to access the coordinates as a tuple in the format (latitude, longitude). Note that a negative latitude means South and a negative longitude means West.
example
import exifread
path = 'image.jpg'
with open(path, 'rb') as f:
    tags = exifread.process_file(f, details=False)
coord = exifread.utils.get_gps_coords(tags)
print(coord)
For sake of completeness, there are also other modules to work with exif:
Pillow - there is functionality to work with exif
piexif
Also, as mentioned in the comments - you can use ExifTool (Perl software), via subprocess
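If the raw degrees, minutes and seconds are needed, as the question asks, they can be converted by hand. The sketch below assumes that the three components are available as a sequence (with exifread, typically tags['GPS GPSLatitude'].values together with tags['GPS GPSLatitudeRef'].values) and that each component is either a plain number or a ratio object exposing num and den. The helper names _to_float and dms_to_decimal are made up for illustration.

```python
def _to_float(value):
    """Convert an EXIF rational (or a plain number) to a float."""
    # Assumption: ratio objects expose .num and .den; anything else
    # (int, float, fractions.Fraction) goes through float().
    if hasattr(value, "num") and hasattr(value, "den"):
        return value.num / value.den
    return float(value)


def dms_to_decimal(dms, ref):
    """Convert [degrees, minutes, seconds] plus a hemisphere letter to decimal degrees."""
    degrees, minutes, seconds = (_to_float(v) for v in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -decimal if str(ref).strip().upper() in ("S", "W") else decimal
```

For example, dms_to_decimal([10, 30, 0], "S") gives -10.5.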

Time efficient way to convert PDF to image

Context:
I have PDF files I'm working with.
I'm using OCR to extract the text from these documents, and to do that I have to convert my PDF files to images.
I currently use the convert_from_path function of the pdf2image module, but it is very slow (9 minutes for a 9-page PDF).
Problem:
I am looking for a way to accelerate this process or another way to convert my PDF files to images.
Additional info:
I am aware that there is a thread_count parameter in the function but after several tries it doesn't seem to make any difference.
This is the whole function I am using:
import os
from pdf2image import convert_from_path
def pdftoimg(fic, output_folder):
    # Store all the pages of the PDF in a variable
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder, thread_count=9, poppler_path=r'C:\Users\Vincent\Documents\PDF\poppler-21.02.0\Library\bin')
    image_counter = 0
    # Iterate through all the pages stored above
    for page in pages:
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')
        image_counter = image_counter + 1
    # Remove the intermediate .ppm files written by convert_from_path
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))
Link to the convert_from_path reference.
I found an answer to that problem using another module called fitz, which is the import name of PyMuPDF, a Python binding to MuPDF.
First of all install PyMuPDF:
The documentation can be found here but for windows users it's rather simple:
pip install PyMuPDF
Then import the fitz module:
import fitz
print(fitz.__doc__)
>>>PyMuPDF 1.18.13: Python bindings for the MuPDF 1.18.0 library.
>>>Version date: 2021-05-05 06:32:22.
>>>Built for Python 3.7 on win32 (64-bit).
Open your file and save every page as images:
The get_pixmap() method accepts different parameters that allow you to control the image (variation, resolution, color...), so I suggest that you read the documentation here.
def convert_pdf_to_image(fic):
    # open your file
    doc = fitz.open(fic)
    # iterate through the pages of the document and create an RGB image of each page
    for page in doc:
        pix = page.get_pixmap()
        pix.save("page-%i.png" % page.number)
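Since the images are meant for OCR, the resolution matters: a PDF page is measured in points (72 per inch), so get_pixmap() without arguments renders at roughly 72 dpi. One way to get a sharper image is to pass a scaling matrix. The sketch below assumes that fitz.Matrix(zoom, zoom) and the matrix argument of get_pixmap() behave as in the PyMuPDF version shown above; zoom_for_dpi and convert_pdf_to_images are hypothetical helper names and the 300 dpi default is only an example value.

```python
def zoom_for_dpi(dpi):
    """PDF user space has 72 units per inch, so the zoom factor is dpi / 72."""
    return dpi / 72


def convert_pdf_to_images(fic, dpi=300):
    """Render every page of the PDF to page-<n>.png at roughly `dpi`."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi can be used without it

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # scale x and y by the same factor
    doc = fitz.open(fic)
    for page in doc:
        pix = page.get_pixmap(matrix=matrix)
        pix.save("page-%i.png" % page.number)
    doc.close()
```

Higher zoom factors produce larger images and take longer to render, so it is worth trying a few values against the OCR results.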
Hope this helps anyone else.

Extract text from multiple images to CSV using OCR

I want to extract text from thousands of images and put it into a CSV file. Can anyone tell me how to do that? I have the images saved on my desktop.
sure.
install the pytesseract module using this command:
pip install pytesseract
install the tesseract engine executable from one of these urls:
tesseract cmd 32 bit
or
tesseract cmd 64 bit
create a python script called images_to_csv.py and paste this code:
import pytesseract
from PIL import Image # pip install Pillow
# set tesseract cmd to be the path to your tesseract engine executable
# (where you installed tesseract from above urls)
# IMPORTANT: this should end with '...\tesseract.exe'
pytesseract.pytesseract.tesseract_cmd = r"<path_to_your_tesseract_cmd>"
# and start doing it
# your saved images on desktop
list_with_many_images = [
    "path1",
    "path2",
    # ...
    "pathN",
]
# create a function that returns the text
def image_to_str(path):
    """ return a string from image """
    return pytesseract.image_to_string(Image.open(path))
# now pure action + csv part
with open("images_content.csv", "w+", encoding="utf-8") as file:
    file.write("ImagePath, ImageText\n")
    for image_path in list_with_many_images:
        text = image_to_str(image_path)
        line = f"{image_path}, {text}\n"
        file.write(line)
this is all for beginning.
if you want to use module csv go ahead.
enjoy.
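As for the csv module mentioned above: OCR text usually contains commas and line breaks, which would break rows that are written by hand. A minimal sketch of a csv-based variant follows. The function names ocr_image and images_to_csv are made up for illustration, and the ocr parameter only exists so that the CSV part can be tried without Tesseract installed.

```python
import csv


def ocr_image(path):
    """Return the text Tesseract finds in the image at `path`."""
    # Imported lazily so the CSV logic below can run without Tesseract.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


def images_to_csv(image_paths, csv_path, ocr=ocr_image):
    """Write one (ImagePath, ImageText) row per image to csv_path."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["ImagePath", "ImageText"])
        for path in image_paths:
            # csv.writer quotes commas, quotes and newlines in the text.
            writer.writerow([str(path), ocr(path)])
```

With thousands of files, the list of paths can be built instead of typed, for example with pathlib.Path(folder).glob("*.png") for whichever folder and extension apply.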

ImageMagick & PyPDF2 Crashing Python When used Together

I have a PDF file consisting of around 20-25 pages. The aim of this tool is to split the PDF file into pages (using PyPDF2), save every PDF page in a directory (using PyPDF2), convert the PDF pages into images (using ImageMagick) and then perform some OCR on them using tesseract (using PIL and PyOCR) to extract data. The tool will eventually be a GUI through tkinter so the users can perform the same operation many times by clicking on a button.
Throughout my heavy testing, I have noticed that if the whole process is repeated around 6-7 times, the tool/python script crashes by showing not responding on Windows. I have performed some debugging, but unfortunately there is no error thrown. The memory and CPU are good so no issues there as well.
I was able to narrow down the problem by observing that, before reaching the tesseract part, PyPDF2 and ImageMagick are failing when they are run together. I was able to replicate the problem by simplifying it to the following Python code:
from wand.image import Image as Img
from PIL import Image as PIL
import pyocr
import pyocr.builders
import io, sys, os
from PyPDF2 import PdfFileWriter, PdfFileReader
def splitPDF (pdfPath):
    #Read the PDF file that needs to be parsed.
    pdfNumPages =0
    with open(pdfPath, "rb") as pdfFile:
        inputpdf = PdfFileReader(pdfFile)
        #Iterate on every page of the PDF.
        for i in range(inputpdf.numPages):
            #Create the PDF Writer Object
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            with open("tempPdf%s.pdf" %i, "wb") as outputStream:
                output.write(outputStream)
        #Get the number of pages that have been split.
        pdfNumPages = inputpdf.numPages
    return pdfNumPages
pdfPath = "Test.pdf"
for i in range(1,20):
    print ("Run %s\n--------" %i)
    #Split the PDF into Pages & Get PDF number of pages.
    pdfNumPages = splitPDF (pdfPath)
    print(pdfNumPages)
    for i in range(pdfNumPages):
        #Convert the split pdf page to image to run tesseract on it.
        with Img(filename="tempPdf%s.pdf" %i, resolution=300) as pdfImg:
            print("Processing Page %s" %i)
I have used the with statement to handle the opening and closing of files correctly, so there should be no memory leaks there. I have tried running the splitting part and the image conversion part separately, and they work fine when run alone. However, when the two are combined, the script fails after around 5-6 iterations. I have used try/except blocks but no error is captured. I am also using the latest version of all the libraries. Any help or guidance is appreciated.
Thank you.
For future reference, the problem was due to the 32-bit version of ImageMagick as mentioned in one of the comments (thanks to emcconville). Uninstalling Python and ImageMagick 32-bit versions and installing both 64-bit versions fixed the problem. Hope this helps.
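To check which build of Python is in use before reinstalling, the interpreter can report its own pointer size. This is a small sketch using only the standard library; python_bitness is a made-up helper name, and it says nothing about the ImageMagick build, which has to be checked separately.

```python
import struct
import sys


def python_bitness():
    """Return 32 or 64 depending on the running interpreter's pointer size."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


print("Python %s is a %d-bit build" % (sys.version.split()[0], python_bitness()))
```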

Image format on imgur

I'm playing around in python trying to download some images from imgur. I've been using urllib and urllib.urlretrieve, but you need to specify the extension when saving the file. This isn't a problem for most posts since the link has for example .jpg in it, but I'm not sure what to do when the extension isn't there. My question is whether there is any way to determine the image format of the file before downloading it. The question is mostly imgur specific, but I wouldn't mind a solution for most image-hosting sites.
Thanks in advance
You can use imghdr.what(filename[, h]) in Python 2.7 and Python 3 to determine the image type (note that imghdr was deprecated in Python 3.11 and removed in Python 3.13).
Read here for more info, if you're using Python 2.7.
Read here for more info, if you're using Python 3.
Assuming the picture has no file extension, there's no way to determine which type it is before you download it. All image formats set their initial bytes to a particular value. To inspect these 'magic' initial bytes, check out https://github.com/ahupp/python-magic - it matches the initial bytes against known image formats.
The code below downloads a picture from imgur and determines which file type it is.
import os
import magic
import requests
import shutil
# open() does not expand '~', so expand it explicitly
path = os.path.expanduser('~/Desktop/picture')
r = requests.get('http://i.imgur.com/yed5Zfk.gif', stream=True) ##Download picture
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
    print(magic.from_file(path)) ##Determine type
    ## Prints: 'GIF image data, version 89a, 360 x 270'
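To illustrate what matching the initial bytes means, here is a standard-library-only sketch that recognises a few common formats from the start of a file. It is not how python-magic is implemented, it covers far fewer formats, and sniff_image_type is a made-up name.

```python
# (signature bytes, format name) pairs for a few common image formats
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
]


def sniff_image_type(header):
    """Guess an image format from the first bytes of a file, or return None."""
    for magic_bytes, name in SIGNATURES:
        if header.startswith(magic_bytes):
            return name
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None
```

Only the first dozen bytes are needed, so with a streamed download the check can be made on the first chunk and the matching extension chosen before the file is saved.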
