I am using the following code to generate a PDF from image.
PDF=pytesseract.image_to_pdf_or_hocr(test_image,lang='dan',config='',nice=0,extension='pdf')
and the type of PDF variable is being shown as BYTES.
HOw Do i publish or get the PDF generated?
I have found the answer. Just to close the thread, posting the same.
f = open("demofile.pdf", "w+b")
f.write(bytearray(pdf))
f.close()
demofile.pdf happens to be resultant pdf which gets published in the workspace.
From Pytesseract-PYPI:
Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
f.write(pdf) # pdf type is bytes by default
Related
I have this piece of code in Python that makes use of pytesseract (method pytesseract.image_to_data).
This gives me great text information and coordinates that are saved in a text file that is fed to a third party software. It works perfectly for PDF files that have been scanned
data = pytesseract.image_to_data(Image.open('file-001-page-001.png')))
The issue now is that I have a demand for output in the exact same structure for PDFs that already contain text. It's possible to keep the same code and continue as if the PDF had no text, extracting images and doing OCR, but it doesn't seem like the right solution...
Is it possible to achieve this with pytesseract?
Suggestions are welcome
You can use this:
import pytesseract
from PIL import Image
# Open the PDF file
with open('file.pdf', 'rb') as f:
# Extract text from the PDF file and save it to a variable
text = pytesseract.image_to_pdf_or_hocr(f, extension='hocr', lang='eng', config='--oem 3 --psm 6')
# Save the extracted text to a file in the desired format
with open('output.hocr', 'w')as f:
f.write(text)
I want to extract text from the PDF files but the layout of text in the PDF should be maintained, like the images below. Images show results from the [github.com/JonathanLink/PDFLayoutTextStripper].
I tried the below code but it doesn't maintain the Layout. I want get results exactly the same way as shown in the images by using any of the Python libraries like PyPDF2, PDFPlumber, PDFminer etc. I tried all these libraries but didn't get the desired results. I need help in extracting the text from the PDF file exactly as is shown in the images.
from pdfminer.high_level import extract_text`
text = extract_text('test.pdf')
print(text)
You can preserve layout/indentation using PDFtotext package.
import pdftotext
with open("target_file.pdf", "rb") as f:
pdf = pdftotext.PDF(f)
# All pages
for text in pdf:
print(text)
I'm trying to split a large PDF file per page from page 5000 to 6000. The PDF files has 7000 pages with text and images and is 250MB big. The python code I have written is working for smaller PDF files.
I'm receiving the following errors:
First error is RecursionError: maximum recursion depth exceeded.
After setting sys.setrecursionlimit(9999) I'm getting the following error Process finished with exit code -1073741571 (0xC00000FD). The PDF file has been written to my output folder but is corrupt and 0kb big. Increasing the recursion limit doesn't help either.
What could I do? Compress the PDF file and then split?
This is my code:
pdf_file = open(path,'rb')
pdf_reader = PdfFileReader(pdf_file)
pageNumbers = pdf_reader.getNumPages()
output = PdfFileWriter()
#this is just to test if it works for 1 page
output.addPage(pdf_reader.getPage(5854))
with open("output_path" + "document-output.pdf", "wb") as f:
output.write(f)
Sharing what worked for me. I have used the package wand in order to split this PDF file of 7000 pages. wand package
from wand.image import Image
# Converting #page into JPG
with Image(filename="C:/Users/Name/Documents/PDFfile.pdf[5950]", resolution= 300) as img:
img.save(filename="C:/Users/Name/Documents/temp1.jpg")
I'm trying to convert a PDF's first page to an image. However, the PDF is coming straight from the database in a base64 format. I then convert it to a blob. I want to know if it's possible to convert the first page of the PDF to an image within my Python code.
I'm familiar with being able to use filename in the Image object:
Image(filename="test.pdf[0]") as img:
The issue I'm facing is there is not an actual filename, just a blob. This is what I have so far, any suggestions would be appreciated.
x = object['file']
fileBlob = base64.b64decode('x')
with Image(**what do I put here for pdf blob?**) as img:
more code
It works for me
all_pages = Image(blob=blob_pdf) # PDF will have several pages.
single_image = all_pages.sequence[0] # Just work on first page
with Image(single_image) as i:
...
Documentation says something about blobs.
So it should be:
with Image(blob=fileBlob):
#etc etc
I didn't test that but I think this is what you are after.
I'm trying to convert a pdf to the same size as my pdf which is an A4 page.
convert my_pdf.pdf -density 300x300 -page A4 my_png.png
The resulting png file, however, is 595px × 842px which should be the resolution at 72 dpi.
I was thinking of using PIL to write some text on some of the pdf fields and convert it back to PDF. But currently the image is coming out wrong.
Edit: I was approaching the problem from the wrong angle. The correct approach didn't include imagemagick at all.
After searching around some I finally found the solution:
It turns out that this was the correct approach after all.
Yet, i feel that it wasn't verbose enough.
It appears that the poster probably took it from here (same variable names etc).
The idea: create new blank PDF with Reportlab which only contains a text string.
Then merge/add it as a watermark using pyPdf.
from pyPdf import PdfFileWriter, PdfFileReader
import StringIO
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
packet = StringIO.StringIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.drawString(100,100, "Hello world")
can.save()
#move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(file("mypdf.pdf", "rb"))
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(0)
page.mergePage(new_pdf.getPage(0))
output.addPage(page)
# finally, write "output" to a real file
outputStream = file("/home/joe/newpdf.pdf", "wb")
output.write(outputStream)
outputStream.close()
Hope this helps somebody else.
I just tried the solution above, but I had quite some troubles to get it running in Python3. So, I would like to share my modifications. The adapted code looks as follows:
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
packet = io.BytesIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.drawString(100, 100, "Hello world")
can.save()
# move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(open("mypdf.pdf", "rb"))
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(0)
page2 = new_pdf.getPage(0)
page.mergePage(page2)
output.addPage(page)
# finally, write "output" to a real file
outputStream = open("newpdf.pdf", "wb")
output.write(outputStream)
outputStream.close()
Now the page.mergePage throws an error. Turns out to be a porting error in pypdf2. Please refer to this question for the solution: Porting to Python3: PyPDF2 mergePage() gives TypeError
You should look at Add text to Existing PDF using Python and also Python as PDF Editing and Processing Framework. These will point you in the right direction.
If you do what you've proposed in the question, when you export back to .pdf, it will really just be an image file embedded in a .pdf, it won't be text.
pdfrw will let you take existing PDFs and place them as form XObjects (similar to images) on a reportlab canvas. There are some examples for this in the pdfrw examples/rl1 subdirectory on github. Disclaimer -- I am the pdfrw author.