Split a large PDF file into single-page PDFs with Python

I'm trying to split a large PDF file into single pages, for pages 5000 to 6000. The PDF file has 7000 pages with text and images and is 250 MB in size. The Python code I have written works for smaller PDF files.
I'm receiving the following errors:
The first error is RecursionError: maximum recursion depth exceeded.
After setting sys.setrecursionlimit(9999), I get the following error instead: Process finished with exit code -1073741571 (0xC00000FD). The PDF file is written to my output folder, but it is corrupt and 0 KB in size. Increasing the recursion limit further doesn't help either.
What could I do? Compress the PDF file and then split it?
This is my code:
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file = open(path, 'rb')
pdf_reader = PdfFileReader(pdf_file)
pageNumbers = pdf_reader.getNumPages()
output = PdfFileWriter()
# this is just to test if it works for 1 page
output.addPage(pdf_reader.getPage(5854))
with open("output_path" + "document-output.pdf", "wb") as f:
    output.write(f)
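A note on the second error: exit code 0xC00000FD is the Windows status code for a stack overflow, which fits a native stack running out once the Python recursion limit has been raised. A workaround that is often tried for deep recursion is to run the work in a thread with a larger stack. The following is only a sketch under that assumption; `run_with_big_stack` and its default sizes are hypothetical, and it has not been tried against this particular PDF.

```python
import sys
import threading


def run_with_big_stack(func, *args, stack_mb=256, recursion_limit=100_000):
    """Call func(*args) in a worker thread that gets a larger native stack."""
    outcome = {}

    def target():
        try:
            outcome["value"] = func(*args)
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    old_limit = sys.getrecursionlimit()
    # stack_size() only affects threads created after the call
    old_stack = threading.stack_size(stack_mb * 1024 * 1024)
    sys.setrecursionlimit(recursion_limit)
    try:
        worker = threading.Thread(target=target)
        worker.start()
        worker.join()
    finally:
        # restore the previous settings for the rest of the program
        sys.setrecursionlimit(old_limit)
        threading.stack_size(old_stack)

    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

The page extraction would then be wrapped in a function of its own (say, a hypothetical split_page(path, 5854)) and called as run_with_big_stack(split_page, path, 5854).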

Sharing what worked for me. I used the wand package to split this 7000-page PDF file:
from wand.image import Image
# Convert one page of the PDF into a JPG
with Image(filename="C:/Users/Name/Documents/PDFfile.pdf[5950]", resolution=300) as img:
    img.save(filename="C:/Users/Name/Documents/temp1.jpg")
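To cover the whole range from the question rather than one page, the same call could be put in a loop. This is a minimal sketch, assuming Wand, ImageMagick and Ghostscript are installed; `page_selector` and `export_pages` are hypothetical helper names, and the `[n]` selector is zero-based, so page 5000 as shown in a viewer is index 4999.

```python
from pathlib import Path


def page_selector(pdf_path, page_index):
    """Build ImageMagick's 'file.pdf[n]' selector for one zero-based page."""
    return f"{pdf_path}[{page_index}]"


def export_pages(pdf_path, out_dir, first, last, resolution=300):
    """Render pages first..last (zero-based, inclusive) to one JPG each."""
    # Imported here so the helper above can be used without Wand installed
    from wand.image import Image

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for index in range(first, last + 1):
        # Opening one page at a time keeps only that page in memory
        with Image(filename=page_selector(pdf_path, index), resolution=resolution) as img:
            img.save(filename=str(out_dir / f"page-{index}.jpg"))


# Example call (paths are placeholders):
# export_pages("C:/Users/Name/Documents/PDFfile.pdf", "C:/Users/Name/Documents/pages", 4999, 5999)
```

Note that this produces rasterized images, as in the answer above, not PDF pages with selectable text.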

Related

Python with pytesseract - How to get the same output for pytesseract.image_to_data in a searchable PDF?

I have this piece of code in Python that makes use of pytesseract (method pytesseract.image_to_data).
This gives me great text information and coordinates that are saved in a text file that is fed to a third-party software. It works perfectly for PDF files that have been scanned:
data = pytesseract.image_to_data(Image.open('file-001-page-001.png'))
The issue now is that I have a demand for output in the exact same structure for PDFs that already contain text. It's possible to keep the same code and continue as if the PDF had no text, extracting images and doing OCR, but it doesn't seem like the right solution...
Is it possible to achieve this with pytesseract?
Suggestions are welcome
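You can use this:
import pytesseract
from pdf2image import convert_from_path
# image_to_pdf_or_hocr expects an image (or an image path), not a PDF file object,
# so render the first page of the PDF to an image first
page = convert_from_path('file.pdf', dpi=300)[0]
# Run OCR on the page and get the result as hOCR (returned as bytes)
hocr = pytesseract.image_to_pdf_or_hocr(page, extension='hocr', lang='eng', config='--oem 3 --psm 6')
# Save the hOCR output; the file is opened in binary mode because the result is bytes
with open('output.hocr', 'wb') as f:
    f.write(hocr)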
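For PDFs that already contain text, another option is to read the word boxes from the text layer and reshape them into the column layout that pytesseract.image_to_data produces. The sketch below assumes PyMuPDF (imported as fitz), which is not mentioned in this thread. `words_to_rows` and `pdf_to_tsv` are hypothetical names, the conf column is a fixed placeholder of 100 because no OCR confidence exists, and par_num is always 1.

```python
HEADER = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"]


def words_to_rows(words, page_num, dpi=300):
    """Map (x0, y0, x1, y1, text, block, line, word) tuples to image_to_data-style rows."""
    scale = dpi / 72.0  # PDF points -> pixels at the assumed rendering DPI
    rows = []
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        rows.append([
            5, page_num, block_no + 1, 1, line_no + 1, word_no + 1,  # level 5 = word
            round(x0 * scale), round(y0 * scale),
            round((x1 - x0) * scale), round((y1 - y0) * scale),
            100,  # placeholder confidence: the text is read, not recognized
            text,
        ])
    return rows


def pdf_to_tsv(pdf_path, dpi=300):
    """Return a tab-separated table for every word in the PDF's text layer."""
    import fitz  # PyMuPDF; imported here so words_to_rows works without it

    lines = ["\t".join(HEADER)]
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            for row in words_to_rows(page.get_text("words"), page_num, dpi):
                lines.append("\t".join(str(value) for value in row))
    return "\n".join(lines)
```

The dpi argument should match the resolution used when rasterizing scanned files, so that coordinates from both paths are on the same scale.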

Using Python to scrape text from PDFs that encode their text to images

My code is below. I've tried it on other PDFs and it was able to extract the text accurately.
import PyPDF2

pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
Specifically, when I run the above code on this PDF, there is no output. The provider of the PDF tries to sell the data in the PDF, so it makes sense that they don't want it to be easily scraped. I'm just wondering what the best workaround is, because I don't have 100k lying around.
If it helps it looks like the PDF was produced with pdfsharp.net. When I upload my PDF in Google Colab and assign it to a variable, a portion of the result of printing that variable is below.
{'test.pdf': b'%PDF-1.4\n%\xd3\xf4\xcc\xe1\n1 0
obj\n<<\n/CreationDate(D:20190310110705-04\'00\')\n/Title(Efficiency Summary
Player Name)\n/Creator(PDFsharp 1.32.2608-w \\(www.pdfsharp.net\\))\n/Producer(PDFsharp 1.32.2608-w \\(www.pdfsharp.net\\))\n>>\nendobj\n2 0 obj\n<<\n/Type/Catalog\n/Pages 3 0 R\n>>\nendobj\n3 0 obj\n<<\n/Type/Pages\n/Count 1\n/Kids[4 0 R]\n>>\nendobj\n4 0 obj\n<<\n/Type/Page\n/MediaBox[0 0 612 792]\n/Parent 3 0 R\n/Contents 5 0 R\n/Resources\n<<\n/ProcSet [/PDF/Text/ImageB/ImageC/ImageI]\n/XObject\n<<\n/I0 8 0 R\n>>\n>>\n/Group\n<<\n/CS/DeviceRGB\n/S/Transparency\n/I false\n/K false\n>>\n>>\nendobj\n5 0 obj\n<<\n/Length 62\n/Filter/FlateDecode\n>>\nstream\nx\x9c+\xe42T0\x00B]\x10eni\xa4\x90\x9c\x0bd\x1b\x18(\x84Tq\x15r\x15*\x98\x9a\x1aA\xe4\xcd\xcd\xcc\x14\x8c\x8d\x14\xcc\xcd\xcd#J\xf4=\r\x14\\\xf2\x15\x02\xb9#\x10\x00\xd8\xf3\r\xe0\nendstream\nendobj\n6 0 obj\n<<\n/Type/XObject\n/Subtype/Image\n/Length 159\n/Filter/FlateDecode\n/Width 900\n/Height 1250\n/BitsPerComponent 1\n/ImageMask true\n>>\nstream\nx\x9c\xed\xc11\x01\x00\x00\x00\xc2 \xfb\xa76\xc6\x1e`\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00#\xe8\'\xe0\x00\x01\nendstream\nendobj\n7 0 obj\n<<\n/Type/XObject\n/Subtype/Image\n/Length 6413\n/Filter/FlateDecode\n/Width 900\n/Height 1250\n/BitsPerComponent 8\n/ColorSpace/DeviceGray\n>>\nstream\nx\x9c\xed\xdd\x81z\xa2\xbc\x16\x05\xd0\xf7\x7f\xe9\xe4\xde\xbf\x85\xe4\x9c$X\xdb\xb1\x15t\xado\xa6U\x0c!\x02\xdb#\xb4R\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
This code might be useful to you; I used it for a previous project where I scraped data from a PDF. I'm not sure if you've tried using pytesseract. You can modify the for page in pages loop to extract specific pages. The code turns the PDF into images, runs OCR on them, and writes the text it finds to a text file.
from pdf2image import convert_from_path
from PIL import Image
import pytesseract
import os

def OCR(pdf):
    pdfName = pdf.split('.pdf')[0]
    # Render every page of the PDF to an image at 500 DPI
    pages = convert_from_path(pdf, 500)
    image_counter = 1
    for page in pages:
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(pdfName + filename, 'JPEG')
        image_counter = image_counter + 1
    filelimit = image_counter - 1
    text = ''
    # OCR each saved page image, then delete the temporary file
    for i in range(1, filelimit + 1):
        filename = pdfName + "page_" + str(i) + ".jpg"
        text += str(pytesseract.image_to_string(Image.open(filename)))
        os.remove(filename)
    # Clean up once, after all pages are read, so spaces are not added repeatedly
    text = text.replace('-\n', '')
    text = text.replace('\n', ' \n')
    with open(pdfName + ".txt", "wb") as f:
        f.write(text.encode('utf-8', 'replace'))
    return text
You're just seeing the raw bytes of a PDF file there. The fact that they've put the "Info dict" at the top of the file, which is why you see strings like /Creator, isn't guaranteed; it's just because it's a "linearised" file.
Doing something like what Daniel suggested is the way to go, but his implementation might introduce additional artifacts. Tesseract is OCR software that attempts to turn rasterized text back into characters. It might be better to work directly with the images in the PDF file rather than rasterizing the whole page to an image. Also, encoding to JPEG seems awkward; using a lossless format like PNG is probably going to do slightly better.
Generally I'd still recommend using something like pytesseract, but feeding it something else, e.g. the images taken directly from the PDF.
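As a rough illustration of the point about JPEG, the pages can be handed to pytesseract as in-memory images, so no lossy intermediate file is written at all. This is a sketch, assuming pdf2image (with Poppler) and pytesseract are installed; `clean_text` and `ocr_pdf` are hypothetical names, and it still rasterizes whole pages rather than pulling the embedded images out of the PDF.

```python
def clean_text(text):
    """Join words that were hyphenated across a line break."""
    return text.replace("-\n", "")


def ocr_pdf(pdf_path, dpi=300, first_page=None, last_page=None):
    """OCR the given page range (1-based, inclusive) and return the text."""
    # Imported here so clean_text can be used without the OCR stack installed
    from pdf2image import convert_from_path
    import pytesseract

    # convert_from_path returns PIL images held in memory, with no JPEG step
    pages = convert_from_path(pdf_path, dpi=dpi,
                              first_page=first_page, last_page=last_page)
    return clean_text("\n".join(pytesseract.image_to_string(page) for page in pages))
```

Calling ocr_pdf('test.pdf', first_page=1, last_page=1) would process only the first page.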

ImageMagick & PyPDF2 Crashing Python When used Together

I have a PDF file consisting of around 20-25 pages. The aim of this tool is to split the PDF file into pages (using PyPDF2), save every PDF page in a directory (using PyPDF2), convert the PDF pages into images (using ImageMagick), and then perform OCR on them with tesseract (using PIL and PyOCR) to extract data. The tool will eventually be a GUI built with tkinter, so users can perform the same operation many times by clicking a button.
Throughout my heavy testing, I have noticed that if the whole process is repeated around 6-7 times, the tool/Python script crashes, showing "Not Responding" on Windows. I have done some debugging, but unfortunately no error is thrown. Memory and CPU usage look fine, so there are no issues there either. I was able to narrow the problem down by observing that, before even reaching the tesseract part, PyPDF2 and ImageMagick fail when they are run together. I was able to replicate the problem by simplifying it to the following Python code:
from wand.image import Image as Img
from PIL import Image as PIL
import pyocr
import pyocr.builders
import io, sys, os
from PyPDF2 import PdfFileWriter, PdfFileReader

def splitPDF(pdfPath):
    # Read the PDF file that needs to be parsed.
    pdfNumPages = 0
    with open(pdfPath, "rb") as pdfFile:
        inputpdf = PdfFileReader(pdfFile)
        # Iterate on every page of the PDF.
        for i in range(inputpdf.numPages):
            # Create the PDF Writer Object
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            with open("tempPdf%s.pdf" % i, "wb") as outputStream:
                output.write(outputStream)
        # Get the number of pages that have been split.
        pdfNumPages = inputpdf.numPages
    return pdfNumPages

pdfPath = "Test.pdf"
for i in range(1, 20):
    print("Run %s\n--------" % i)
    # Split the PDF into Pages & Get PDF number of pages.
    pdfNumPages = splitPDF(pdfPath)
    print(pdfNumPages)
    for i in range(pdfNumPages):
        # Convert the split pdf page to image to run tesseract on it.
        with Img(filename="tempPdf%s.pdf" % i, resolution=300) as pdfImg:
            print("Processing Page %s" % i)
I have used the with statement to handle the opening and closing of files correctly, so there should be no memory leaks there. I have tried running the splitting part and the image conversion part separately, and they work fine when run alone. However, when the two are combined, the script fails after around 5-6 iterations. I have used try/except blocks, but no error is caught. I am also using the latest version of all the libraries. Any help or guidance is appreciated.
Thank you.
For future reference, the problem was due to the 32-bit version of ImageMagick, as mentioned in one of the comments (thanks to emcconville). Uninstalling the 32-bit versions of Python and ImageMagick and installing the 64-bit versions of both fixed the problem. Hope this helps.
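Since the fix came down to 32-bit versus 64-bit installs, a quick check of the interpreter's bitness can save some guessing. The snippet below uses only the standard library; `python_bits` and `describe_runtime` are hypothetical helper names. A library loaded into the Python process has to match the interpreter's bitness, so a 32-bit result here implies a 32-bit ImageMagick is in use as well.

```python
import platform
import struct
import sys


def python_bits():
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    return struct.calcsize("P") * 8


def describe_runtime():
    """Summarize the interpreter version, bitness and platform in one line."""
    return f"Python {platform.python_version()} ({python_bits()}-bit) on {sys.platform}"


print(describe_runtime())
```

Wand also exposes the version string of the ImageMagick library it loaded as wand.version.MAGICK_VERSION, which may be worth printing alongside this.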

using pytesseract to generate a PDF from image

I am using the following code to generate a PDF from image.
pdf = pytesseract.image_to_pdf_or_hocr(test_image, lang='dan', config='', nice=0, extension='pdf')
The type of the pdf variable is shown as bytes.
How do I publish or get the generated PDF?
I have found the answer. Posting it here to close the thread.
f = open("demofile.pdf", "w+b")
f.write(bytearray(pdf))
f.close()
demofile.pdf is the resulting PDF, which gets published in the workspace.
From the pytesseract page on PyPI:
# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf)  # pdf type is bytes by default

Problems with Python Wand and paths to JXR imagery when attempting to convert JXR imagery to JPG format?

I need to be able to convert JPEG-XR images to JPG format, and have gotten this working through ImageMagick itself. However, I need to be able to do this from a python application, and have been looking at using Wand.
Wand does not seem to properly use paths to JXR imagery.
with open(os.path.join(args.save_location, img_name[0], result[0]+".jxr"), "wb") as output_file:
    output_file.write(result[1])
with Image(filename=os.path.join(args.save_location, img_name[0], result[0]+".jxr")) as original:
    with original.convert('jpeg') as converted:
        print(converted.format)
        pass
The first part of this - creating output_file and writing result[1] (blob of JXR imagery from a SQLite database) - works fine. However, when I attempt to then open that newly-saved file as an image using Python and Wand, I get an error that ultimately suggests Wand is not looking in the correct location for the image:
Extracting panorama 00000
FAILED: -102=pWS->Read(pWS, szSig, sizeof(szSig))
JXRGlueJxr.c:1806
FAILED: -102=ReadContainer(pID)
JXRGlueJxr.c:1846
FAILED: -102=pDecoder->Initialize(pDecoder, pStream)
JXRGlue.c:426
FAILED: -102=pCodecFactory->CreateDecoderFromFile(args.szInputFile, &pDecoder)
e:\coding\python\sqlite panoramic image extraction tool\jxrlib\jxrencoderdecoder\jxrdecapp.c:477
JPEG XR Decoder Utility
Copyright 2013 Microsoft Corporation - All Rights Reserved
... [it outputs its help page in case of errors; snipped]
The system cannot find the file specified.
Traceback (most recent call last):
File "E:\Coding\Python\SQLite Panoramic Image Extraction Tool\SQLitePanoramicImageExtractor\trunk\PanoramicImageExtractor.py", line 88, in <module>
with Image(filename=os.path.join(args.save_location, img_name[0], result[0]+".jxr")) as original:
File "C:\Python34\lib\site-packages\wand\image.py", line 1991, in __init__
self.read(filename=filename, resolution=resolution)
File "C:\Python34\lib\site-packages\wand\image.py", line 2048, in read
self.raise_exception()
File "C:\Python34\lib\site-packages\wand\resource.py", line 222, in raise_exception
raise e
wand.exceptions.BlobError: unable to open image `C:/Users/RPALIW~1/AppData/Local/Temp/magick-14988CnJoJDwMRL4t': No such file or directory # error/blob.c/OpenBlob/2674
As you can see at the very end, it seems to have attempted to run off to open a temporary file 'C:/Users/RPALIW~1/AppData/Local/Temp/magick-14988CnJoJDwMRL4'. The filename used at this point should be exactly the same as the filename used to save the imagery as a file just a few lines above, but Wand has substituted something else? This looks similar to the last issue I had with this in ImageMagick, which was fixed over the weekend (detailed here: http://www.imagemagick.org/discourse-server/viewtopic.php?f=1&t=27027&p=119702#p119702).
Has anyone successfully gotten Wand to open JXR imagery as an Image in Python, and convert to another format? Am I doing something wrong here, or is the fault with ImageMagick or Wand?
Something very similar is happening to me. I'm getting an error:
wand.exceptions.BlobError: unable to open image `/var/tmp/magick-454874W--g1RQEK3H.ppm': No such file or directory # error/blob.c/OpenBlob/2701
The path given is not the path of the image I am trying to open.
From the docs:
A binary large object could not be allocated, read, or written.
And I am trying to open a large file (18 MB, .cr). Could the file size be the problem?
For me:
from wand.image import Image as WImage
# Open the file in binary mode and pass the file object to Wand
with open(file_name, 'rb') as f:
    with WImage(file=f) as img:
        print('Opened large image')
Or:
# Read the bytes first and pass them to Wand as a blob
with open(file_name, 'rb') as f:
    image_binary = f.read()
    with WImage(blob=image_binary) as img:
        print('Opened large image')
Did the trick.
~Victor
