I need to transcribe a multi-page image.tif to text using pytesseract.
I have the following code:
> from PIL import Image
> import pytesseract
> pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
> print(pytesseract.image_to_string(Image.open('CAMARA.tif'), lang="spa"))
The problem is that it only extracts the first page. How can I extract all of them?
I was able to fix the same problem by calling the convert() method, as below:
image = Image.open(imagePath).convert("RGBA")
text = pytesseract.image_to_string(image)
print(text)
I guess the issue is that you have referenced only one image, "CAMARA.tif". First you have to convert all the PDF pages into images (see the linked post for how to do that).
Then use pytesseract in a loop over the images, one by one, to extract the text from each image.
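For a multi-page TIFF like the one in the question, the pages are frames inside a single file, so the loop can probably run over the frames directly instead of over separate image files. The sketch below assumes Pillow exposes the frame count as n_frames on the opened image (it does for multi-frame TIFFs) and that the "spa" language data is installed; iter_frames and ocr_all_pages are hypothetical helper names, not part of pytesseract.

```python
def iter_frames(img):
    """Yield an independent copy of every frame (page) in a multi-frame image."""
    # Single-frame images may not expose n_frames, so fall back to 1.
    for index in range(getattr(img, "n_frames", 1)):
        img.seek(index)   # move to the requested page
        yield img.copy()  # copy so the frame survives the next seek()


def ocr_all_pages(path, lang="spa"):
    """OCR every page of a multi-page TIFF and return one string per page."""
    # Imported lazily so iter_frames() can be used without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return [pytesseract.image_to_string(frame, lang=lang)
                for frame in iter_frames(img)]
```

Calling ocr_all_pages('CAMARA.tif') should then return a list with one entry per page, which can be joined or written out page by page.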
I just stumbled over the same problem. What you could do is call tesseract directly:
# test.py
import subprocess
in_filename = 'file_0.tiff'
out_filename = 'out'
lang = 'spa'
subprocess.call(['tesseract', in_filename, '-l', lang, out_filename])
This processes all pages:
$ python test.py
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Page 2
Page 3
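If the recognized text is needed back in Python, a slightly extended version of the same idea might look like the sketch below. It assumes tesseract is on the PATH and writes its result to the output base name plus ".txt" (its default behaviour); build_tesseract_cmd and ocr_with_cli are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(in_filename, out_base, lang):
    """Build the argument list for: tesseract <input> <outputbase> -l <lang>."""
    return ["tesseract", str(in_filename), str(out_base), "-l", lang]


def ocr_with_cli(in_filename, out_base="out", lang="spa"):
    """Run tesseract on every page of the input file and return the text."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_cmd(in_filename, out_base, lang), check=True)
    # tesseract appends ".txt" to the output base name.
    return Path(f"{out_base}.txt").read_text(encoding="utf-8")
```

Using subprocess.run with check=True rather than subprocess.call means a failed OCR run raises an exception instead of silently leaving a missing or stale output file.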
With my OCR scanner, I scan PDF files of 12 to 15 pages at a time. However, I only need the information from the first 5 pages. Is there a way to add a function to the code below so that only the first 5 pages of each PDF are run through the OCR scanner?
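import os
import time
import pytesseract
from pdf2image import convert_from_path
pytesseract.pytesseract.tesseract_cmd = "Path to pytesseract"
path = r"Path to PDF"
os.chdir(path)
# Assumption: FoundFiles (not defined in the original snippet) lists the PDFs in path
FoundFiles = [f for f in os.listdir(path) if f.lower().endswith(".pdf")]
for n in FoundFiles:
    filePath = os.path.join(path, n)
    ResultPageNumber = []
    st = time.time()
    # Assumption: doc (not defined in the original snippet) holds the page images
    doc = convert_from_path(filePath)
    for page_number, page_data in enumerate(doc):
        txt = pytesseract.image_to_string(page_data, lang='eng')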
After this part only if-else queries follow to search for certain words in the PDF to return the found information as CSV later on.
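One way to limit the OCR to the first 5 pages is to ask pdf2image to render only those pages, using the first_page and last_page arguments of convert_from_path. The following is a minimal sketch, not a drop-in replacement: ocr_first_pages is a hypothetical helper name, and its convert and ocr parameters are hypothetical hooks that default to pdf2image and pytesseract.

```python
def ocr_first_pages(pdf_path, max_pages=5, lang="eng", convert=None, ocr=None):
    """OCR only the first `max_pages` pages of a PDF; returns a list of strings."""
    # Default to the real libraries; both are imported lazily.
    if convert is None:
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr
    # first_page and last_page are 1-based and inclusive in pdf2image,
    # so pages after max_pages are never rendered at all.
    pages = convert(pdf_path, first_page=1, last_page=max_pages)
    return [ocr(page, lang=lang) for page in pages]
```

Rendering fewer pages also saves time, since converting a page to an image is usually a large part of the cost. Alternatively, the existing loop could simply stop once page_number reaches 5, at the price of still rendering every page.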
Using Python, I would like to:
1. extract text from a PDF into a txt file (done)
2. color all numbers and specific strings of the txt file like this example (https://tex.stackexchange.com/questions/521383/how-to-highlight-numbers-only-outside-a-string-in-lstlisting) (not done)
3. translate all text to English using Google Translate (not done)
4. extract images from the PDF file into PNGs, or into a new PDF file containing all of the images (not done)
To perform 1, I used the following code, which is working:
pip install PyPDF2
# Note: these names belong to the PyPDF2 API before version 3.0
from PyPDF2 import PdfFileReader, PdfFileWriter
file_path = 'AR_Finland_2021.pdf'
pdf = PdfFileReader(file_path)
# The with block closes the file, so no explicit f.close() is needed
with open('AR_Finland_2021.txt', 'w') as f:
    for page_num in range(pdf.numPages):
        # print('Page: {0}'.format(page_num))
        pageObj = pdf.getPage(page_num)
        try:
            txt = pageObj.extractText()
            print(''.center(100, '-'))
        except Exception:
            # Skip pages whose text cannot be extracted
            pass
        else:
            f.write('Page {0}\n'.format(page_num+1))
            f.write(''.center(100, '-'))
            f.write(txt)
To perform 4 (extract images), I tried the following code but always get an error.
pip install PyMuPDF Pillow
pip install PyMuPDF
pip install python-gettext
import fitz
import io
from PIL import Image
# file path you want to extract images from
file = "AR_Finland_2021.pdf"
# open the file
pdf_file = fitz.open(file)
# iterate over PDF pages
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    image_list = page.getImageList()
    # printing number of images found in this page
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on page", page_index)
    for image_index, img in enumerate(page.getImageList(), start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk
        image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))
Error:
----> 5 image_list = page.getImageList()
AttributeError: 'Page' object has no attribute 'getImageList'
Would someone know how to perform 4 (extract images) and 2 (color numbers and certain strings from the txt file extracted from the PDF)?
You can do:
import fitz
doc = fitz.open("AR_Finland_2021.pdf")
for page in doc:
    for img_tuple in page.get_images():
        img_dict = doc.extract_image(img_tuple[0])
        img_bytes = img_dict['image']
        # Do whatever you want with it
See Page.get_images() and Document.extract_image()
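The AttributeError in the question appears to come from PyMuPDF having renamed its camelCase methods (getImageList, extractImage) to snake_case (get_images, extract_image). If a script has to run against both older and newer installs, a small shim along these lines might help; list_page_images is a hypothetical helper name, and this is only a sketch of the idea.

```python
def list_page_images(page):
    """Return the image tuples of a PyMuPDF page under either method name."""
    # Prefer the current snake_case name, fall back to the old camelCase one.
    getter = getattr(page, "get_images", None) or getattr(page, "getImageList")
    return getter()
```

The first element of each returned tuple is the xref that Document.extract_image() expects.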
To write these images into a new PDF:
new_doc = fitz.open()  # open() without a path creates a new, empty PDF
page = new_doc.new_page()
img_location = fitz.Rect(100, 100, 200, 200)
page.insert_image(img_location, stream=img_bytes)
new_doc.save("/path/to/new/pdf")
See Rect for different ways to construct the rectangle, but you probably want to size it from the image's width and height, which are img_tuple[2] and img_tuple[3] from earlier. Again, look at get_page_images to see the data available to you there.
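For item 2 of the question (coloring numbers and specific strings), which the code above does not cover: a plain .txt file cannot store colors, so one option is to wrap the matches in ANSI escape codes and print the text to a terminal. This is a minimal sketch under that assumption; highlight, the keyword list and the color choices are all hypothetical, and the number pattern is deliberately simple.

```python
import re

ANSI_RED = "\033[31m"
ANSI_YELLOW = "\033[33m"
ANSI_RESET = "\033[0m"


def highlight(text, keywords=()):
    """Wrap numbers in red and the given keywords in yellow ANSI codes."""
    # Longest keywords first so that overlapping keywords match greedily.
    parts = [re.escape(k) for k in sorted(keywords, key=len, reverse=True)]
    # Digits, optionally with . or , separators (e.g. 2021, 3.5, 1,000).
    parts.append(r"\d+(?:[.,]\d+)*")
    pattern = re.compile("|".join(parts))

    def colorize(match):
        token = match.group(0)
        color = ANSI_YELLOW if token in keywords else ANSI_RED
        return f"{color}{token}{ANSI_RESET}"

    # A single pass avoids re-matching the digits inside the escape codes.
    return pattern.sub(colorize, text)
```

For example, print(highlight(open('AR_Finland_2021.txt').read(), keywords=("Finland",))) would show the extracted text with the numbers and the word "Finland" colored, provided the terminal supports ANSI colors.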
I am trying to read some text from a PDF file using the code below. However, when I try to get the text (ptext), all that is returned is an empty string variable of size 1.
Why is no text being returned? I have tried other pages and another PDF book with the same result: I can't seem to read any text.
import PyPDF2
file = open(r'C:/Users/pdfs/test_file.pdf', 'rb')
fileReader = PyPDF2.PdfFileReader(file)
pageObj = fileReader.getPage(445)
ptext = pageObj.extractText()
I had the same issue and thought something was wrong with my code. After a lot of research and debugging, it seems that the PyPDF2, PyPDF3 and PyPDF4 packages can't handle large files: a 20-page PDF ran without problems, but a 50+ page PDF made PyPDF crash.
My only suggestion is to use a different package altogether. pdftotext is a good option; install it with pip install pdftotext.
I faced a similar issue while reading my PDF files. I hope the solution below helps.
The reason I faced this issue: the PDF I was selecting was actually a scanned image. I created my resume using a third-party site, which returned a PDF. When parsing this type of file, I was not able to extract text directly.
Below is the tested working code:
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os
def readPdfFile(filePath):
    pages = convert_from_path(filePath, 500)
    # Part 1: Converting PDF to images
    filenames = []
    for image_counter, page in enumerate(pages, start=1):
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(filename, 'JPEG')
        filenames.append(filename)
    # Part 2: Recognizing text from the images using OCR
    text = ""
    for filename in filenames:
        # Append each page; plain assignment would keep only the last page
        text += pytesseract.image_to_string(Image.open(filename))
    # Rejoin words that were hyphenated across line breaks
    text = text.replace('-\n', '')
    # Part 3: Remove the temp files
    for filename in filenames:
        os.remove(filename)
    return text
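Because OCR is slow, it may be worth checking first whether the PDF has a usable text layer at all and falling back to OCR only when it does not. The sketch below assumes the per-page strings come from a normal text extractor (for example the extractText() calls shown earlier); needs_ocr and min_chars are hypothetical names.

```python
def needs_ocr(page_texts, min_chars=1):
    """Return True when no page yields at least `min_chars` non-whitespace characters."""
    # "".join(text.split()) strips all whitespace, including newlines.
    return all(len("".join(text.split())) < min_chars for text in page_texts)
```

A scanned PDF typically produces only empty or whitespace strings, so needs_ocr would return True and readPdfFile could be called; for a PDF with real text the direct extraction result can be used as is.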
I am trying to convert a PDF to an image so that I can process it further with Tesseract. It works when I convert it from the command line:
magick convert a.pdf b.png
But doesn't work when I try to do the same using Python:
from wand.image import Image
with Image(filename='a.pdf') as img:
    img.save(filename='sample.png')
The error I get is:
unable to read image data D:/Users/UserName/AppData/Local/Temp/magick-4908Cq41DDA5FxlX1 # error/pnm.c/ReadPNMImage/1346
I have also installed Ghostscript, but the error is still there.
EDIT:
I took the code provided in the reply below and modified it to read all the pages. The original issue is still there and the code below uses pdf2image:
from pdf2image import convert_from_path
import os
pdf_dir = "D:/Users/UserName/Desktop/scraping"
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        # Join with pdf_dir, because os.listdir() returns bare file names
        pages = convert_from_path(os.path.join(pdf_dir, pdf_file), 300)
        pdf_name = pdf_file[:-4]
        for page_number, page in enumerate(pages):
            page.save("%s-page%d.jpg" % (pdf_name, page_number), "JPEG")
Instead of using wand.image, you can use pdf2image. Install it like this:
pip install pdf2image
Here is code that loops through every page in the PDF and saves each one as a JPEG:
import os
import tempfile
from pdf2image import convert_from_path
filename = 'target.pdf'
save_dir = 'dir'
os.makedirs(save_dir, exist_ok=True)
base_filename = os.path.splitext(os.path.basename(filename))[0]
with tempfile.TemporaryDirectory() as path:
    # No first_page/last_page arguments, so every page is converted
    images_from_path = convert_from_path(filename, output_folder=path)
    for page_number, page in enumerate(images_from_path, start=1):
        # Number each file so that pages do not overwrite one another
        page.save(os.path.join(save_dir, '%s-%d.jpg' % (base_filename, page_number)), 'JPEG')
I am new to Python. I am attempting to create a Python OCR program and am following an online tutorial for it. Here is the recommended code I am using:
from PIL import Image
from pytesser import *
image_file = 'menu.tif'
im = Image.open(image_file)
text = image_to_string(im)
text = image_file_to_string(image_file)
text = image_file_to_string(image_file, graceful_errors=True)
print "=====output=======\n"
print text
The tutorial link is found here. However, I get this error when running the code:
from pytesser import *
ImportError: No module named 'pytesser'
I have followed the instructions, installing the OCR engine here and the PyTesser library here: code(dot)google(dot)com/archive/p/pytesser/downloads (sorry, with <10 rep I can't post more than 2 links).
This (see gyazo below) is a screenshot of my installation files so far, where "pytesser_v0.0.1" is my pytesser folder, "tesseract-master" was found on GitHub (probably not relevant), "tessinstall" is the folder where I installed tesseract, and pyimgr.py is the file I am attempting to run.
gyazo(dot)com/333f8a3333e87895558f26875a8a8487
I was also previously getting an error regarding the PIL import of Image. If I should not be using PIL, is there any other way I can import Image without PIL? Maybe Pillow?
My Python version is 3.5.2 and I am using Windows 10.
My first hunch is that your library is installed in a place that Python does not know about.
import sys
print(sys.path)
If you execute those lines in Python, it will show you where Python looks for modules. Is the pytesser lib there?
As a side note:
pip3 search tesseract will show you some other Tesseract Python packages, so you can use the Python package manager.
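Rather than scanning sys.path by eye, the standard library can report directly whether a module name is resolvable. This is a small sketch using importlib.util.find_spec; describe_module is a hypothetical helper name.

```python
import importlib.util


def describe_module(name):
    """Return the file a module would be loaded from, or None if Python cannot find it."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin
```

If describe_module("pytesser") returns None, the pytesser folder is not in any directory on sys.path, which would explain the ImportError in the question.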
Change the code to this:
"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.1, 3/10/07"""
import PIL.Image
import subprocess
import util
import errors
tesseract_exe_name = 'tesseract' # Name of executable to be called at command line
scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp" # Leave out the .txt extension
cleanup_scratch_flag = True # Temporary files cleaned up after OCR operation
def call_tesseract(input_filename, output_filename):
    """Calls external tesseract.exe on input file (restrictions on types),
    outputting output_filename+'txt'"""
    args = [tesseract_exe_name, input_filename, output_filename]
    proc = subprocess.Popen(args)
    retcode = proc.wait()
    if retcode != 0:
        errors.check_for_errors()
def image_to_string(im, cleanup = cleanup_scratch_flag):
    """Converts im to file, applies tesseract, and fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        util.image_to_scratch(im, scratch_image_name)
        call_tesseract(scratch_image_name, scratch_text_name_root)
        text = util.retrieve_text(scratch_text_name_root)
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text
def image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True):
    """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
    converts to compatible format and then applies tesseract. Fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        try:
            call_tesseract(filename, scratch_text_name_root)
            text = util.retrieve_text(scratch_text_name_root)
        except errors.Tesser_General_Exception:
            if graceful_errors:
                im = PIL.Image.open(filename)
                text = image_to_string(im, cleanup)
            else:
                raise
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text
if __name__=='__main__':
    # print() calls and "except ... as" are the Python 3 forms of the original statements
    im = PIL.Image.open('phototest.tif')
    text = image_to_string(im)
    print(text)
    try:
        text = image_file_to_string('fnord.tif', graceful_errors=False)
    except errors.Tesser_General_Exception as value:
        print("fnord.tif is incompatible filetype. Try graceful_errors=True")
        print(value)
    text = image_file_to_string('fnord.tif', graceful_errors=True)
    print("fnord.tif contents:", text)
    text = image_file_to_string('fonts_test.png', graceful_errors=True)
    print(text)