I need to convert a large number of jpg/png files to docx files and then to pdf. My only concern is getting the text in each image into a pdf file; if I need to edit any text manually, I can do that in Word and save it to the corresponding pdf file.
I tried using an API, but it failed because the extracted text did not match the image correctly.
My image files contain only text and nothing else.
I already have docx to pdf conversion code in Python.
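from docx2pdf import convert
# Avoid naming a variable `input`, which shadows the Python built-in
input_file = 'INPUT_FILE_NAME.docx'
output_file = 'OUTPUT_FILE_NAME.pdf'
# Write the PDF next to the source file, using the same base name
convert(input_file)
# Write the PDF to an explicit output path
convert(input_file, output_file)
# Convert every .docx file in the "Output" folder
convert("Output")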
Please suggest how to convert a png/jpg file to docx. Thanks.
EDIT --------------
I've got the following code running and uploaded it to my GitHub repo.
from PIL import Image
from pytesseract import pytesseract
# Define path to tesseract.exe
path_to_tesseract = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Define path to image
path_to_image = 'texttoimage.png'
# Point tesseract_cmd to tesseract.exe
pytesseract.tesseract_cmd = path_to_tesseract
# Open image with PIL
img = Image.open(path_to_image)
# Extract text from image
text = pytesseract.image_to_string(img)
print(text)
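To get from the extracted text to a docx file, one possible approach is to split the OCR output into paragraphs and write them with the python-docx package. This is only a sketch under assumptions: python-docx is not mentioned in the original post, and the helper names `ocr_text_to_paragraphs` and `save_paragraphs_as_docx` are made up for illustration. Layout, fonts and tables from the image are not preserved; only the text is carried over.

```python
def ocr_text_to_paragraphs(text):
    """Split OCR output into paragraphs.

    Blank lines separate paragraphs; line breaks inside a paragraph are
    treated as soft wraps and joined with a space.
    """
    paragraphs = []
    for block in text.replace("\r\n", "\n").split("\n\n"):
        # str.strip() also drops the form feed Tesseract may append
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return paragraphs


def save_paragraphs_as_docx(paragraphs, docx_path):
    """Write one Word paragraph per entry (assumes python-docx is installed)."""
    # Imported here so the splitting helper works without python-docx
    from docx import Document

    document = Document()
    for paragraph in paragraphs:
        document.add_paragraph(paragraph)
    document.save(docx_path)
```

With the `text` variable from the code above, calling `save_paragraphs_as_docx(ocr_text_to_paragraphs(text), 'texttoimage.docx')` would produce a docx (the file name is just an example) that can then be passed to `convert` from docx2pdf.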
Related
I have this piece of Python code that uses pytesseract (the pytesseract.image_to_data method).
It gives me good text information and coordinates, which are saved to a text file that is fed to third-party software. It works perfectly for PDF files that have been scanned.
data = pytesseract.image_to_data(Image.open('file-001-page-001.png'))
The issue now is that I need output in exactly the same structure for PDFs that already contain text. I could keep the same code and treat the PDF as if it had no text, extracting images and running OCR, but that doesn't seem like the right solution...
Is it possible to achieve this with pytesseract?
Suggestions are welcome
You can use this:
import pytesseract
from pdf2image import convert_from_path
# pytesseract works on images, not PDF file objects, so render the pages first
pages = convert_from_path('file.pdf')
# Run OCR on the first page; the hOCR result is returned as bytes
hocr = pytesseract.image_to_pdf_or_hocr(pages[0], extension='hocr', lang='eng', config='--oem 3 --psm 6')
# Save the hOCR output in binary mode because it is bytes, not str
with open('output.hocr', 'wb') as f:
    f.write(hocr)
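For PDFs that already contain a text layer, the words and their bounding boxes can usually be read directly with a PDF text extraction library instead of OCR, and then written out in the same tab-separated layout that `image_to_data` produces. The sketch below shows only that last formatting step. Everything about the input is an assumption: `words_to_tsv` is a hypothetical helper, the word tuples are assumed to come from whatever extractor you choose, coordinates are assumed to be in PDF points with the origin at the top-left, and block, paragraph and line numbers are simply fixed at 1 with a placeholder confidence of 100.

```python
# Column order used by pytesseract.image_to_data in its default string output
TSV_COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num",
               "word_num", "left", "top", "width", "height", "conf", "text"]

WORD_LEVEL = 5  # Tesseract's level number for individual words


def words_to_tsv(words, page_num=1, dpi=300):
    """Format word boxes as image_to_data-style TSV.

    words: iterable of (text, x0, top, x1, bottom) in PDF points (1/72 inch),
    measured from the top-left corner of the page (an assumption).
    dpi: resolution the pixel coordinates should correspond to.
    """
    rows = ["\t".join(TSV_COLUMNS)]
    for word_num, (text, x0, top, x1, bottom) in enumerate(words, start=1):
        # Convert points to pixels at the requested resolution
        left_px = round(x0 * dpi / 72)
        top_px = round(top * dpi / 72)
        width_px = round(x1 * dpi / 72) - left_px
        height_px = round(bottom * dpi / 72) - top_px
        # Simplification: one block, paragraph and line; conf fixed at 100
        values = [WORD_LEVEL, page_num, 1, 1, 1, word_num,
                  left_px, top_px, width_px, height_px, 100, text]
        rows.append("\t".join(str(value) for value in values))
    return "\n".join(rows) + "\n"
```

Using the same `dpi` as the one used when rasterising scanned PDFs keeps the coordinates of both kinds of file comparable for the downstream software.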
I tried this as a workaround:
from pytesseract import pytesseract
from PIL import Image
img = Image.open('img.jpg')
text = pytesseract.image_to_string(img, config='')
# Displaying the extracted text
print(text[:-1])
But this code does not extract all the text.
Here is the output.
I have a collection of PDFs of different sizes, each containing a scan of an A4 page. I would like to convert each one to an image with a fixed output resolution.
My code to convert to jpg (without resizing):
from pdf2image import convert_from_path
filename_in = 'myfile.pdf'
filename_out = 'myfile.jpg'
jpeg = convert_from_path(filename_in)
jpeg[0].save(filename_out, 'JPEG')
If the pdf I am trying to convert has any colour in it, the above does not work and the outgoing image is completely white (with non-zero dimensions). Is this a known problem and does a solution exist?
I am using Python 3.7.3.
I am unable to share the pdf files as they contain private information.
You can try extracting the images and correcting their resolution instead of converting the PDFs.
Try pdfreader. Here is sample code that extracts all images (both inline and XObject) from a document.
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
images = []
try:
    # Render page after page until the viewer runs past the last one
    while True:
        viewer.render()
        images.extend(viewer.canvas.inline_images)
        images.extend(viewer.canvas.images.values())
        viewer.next()
except PageDoesNotExist:
    pass
Then you can convert the images to PIL/Pillow objects and save them (or do whatever you need):
for i, img in enumerate(images):
    img.to_Pillow().save("{}.png".format(i))
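If the goal is a fixed output size for every A4 scan, another option is to compute the pixel dimensions of an A4 page at the target resolution and pass them to pdf2image. The following is a minimal sketch, assuming portrait pages; `a4_pixel_size` and `pdf_first_page_to_jpeg` are hypothetical helper names, and it does not address the white-image problem with colour PDFs described in the question.

```python
A4_MM = (210.0, 297.0)  # portrait A4 width and height in millimetres
MM_PER_INCH = 25.4


def a4_pixel_size(dpi):
    """Return (width, height) in pixels of a portrait A4 page at `dpi`."""
    width_mm, height_mm = A4_MM
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def pdf_first_page_to_jpeg(pdf_path, jpeg_path, dpi=300):
    """Render the first PDF page at a fixed A4 pixel size and save it as JPEG."""
    # Imported here so a4_pixel_size works without pdf2image installed
    from pdf2image import convert_from_path

    # `size` forces every rendered page to the same pixel dimensions,
    # regardless of the page size stored in the PDF
    pages = convert_from_path(pdf_path, dpi=dpi, size=a4_pixel_size(dpi))
    pages[0].save(jpeg_path, "JPEG")
```

For example, `a4_pixel_size(300)` gives `(2480, 3508)`, so all output images end up with identical dimensions even when the source PDFs differ in size.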
How can I extract images/logos from a Word document using Python and store them in a folder? The following code converts docx to html, but it doesn't extract the images from the html. Any pointers or suggestions would be a great help.
import mammoth
profile_path = <file path>
# profile_html (the output HTML path) must also be defined before this point
with open(profile_path, 'rb') as f, open(profile_html, 'wb') as b:
    # mammoth expects an open binary file object, not a path string
    document = mammoth.convert_to_html(f)
    b.write(document.value.encode('utf8'))
You can use the docx2txt library. It reads your .docx document and exports the images to a directory you specify (the directory must already exist).
!pip install docx2txt
import docx2txt
text = docx2txt.process("/path/your_word_doc.docx", '/home/example/img/')
After execution, the images will be in /home/example/img/ and the variable text will hold the document text. The images are named image1.png ... imageN.png in order of appearance.
Note: The Word document must be in .docx format.
Extract all the images in a docx file using python
1. Using docx2txt
import docx2txt
# extract text only
text = docx2txt.process(r"filepath_of_docx")
# extract text and write images to the temporary image directory
text = docx2txt.process(r"filepath_of_docx", r"Temporary_Image_Directory")
2. Using aspose
import aspose.words as aw
# load the Word document
doc = aw.Document(r"filepath")
# retrieve all shapes
shapes = doc.get_child_nodes(aw.NodeType.SHAPE, True)
imageIndex = 0
# loop through shapes
for shape in shapes:
    shape = shape.as_shape()
    if shape.has_image:
        # set image file's name
        imageFileName = f"Image.ExportImages.{imageIndex}_{aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)}"
        # save image
        shape.image_data.save(imageFileName)
        imageIndex += 1
Native without any lib
To extract the source images from the docx (which is a variation on a zip file) without distortion or conversion, shell out to the OS and run:
tar -m -xf DocxWithImages.docx word/media
You will find the source images (JPEG, PNG, WMF or others) extracted into a word/media folder. These are the unaltered embedded source files, without scaling or cropping.
You may be surprised to find that the visible area is larger than any cropped version used in the docx itself. Be aware that Word does not always crop images as expected (a source of embarrassing redaction failures).
See Alderven's answer at Extract all the images in a docx file using python.
The zipfile works for more image formats than the docx2txt. For example, EMF images are not extracted by docx2txt but can be extracted by zipfile.
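A minimal sketch of the zipfile approach, using only the standard library: a .docx is a zip archive, and the embedded images live under `word/media/`. The function name `extract_docx_media` and its arguments are illustrative, not taken from the linked answer.

```python
import os
import zipfile


def extract_docx_media(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx into out_dir.

    Returns the list of written file paths. Files are copied byte for byte,
    so formats such as EMF or WMF are kept as they are.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside word/media/
            if name.startswith("word/media/") and not name.endswith("/"):
                target = os.path.join(out_dir, os.path.basename(name))
                with archive.open(name) as src, open(target, "wb") as dst:
                    dst.write(src.read())
                saved.append(target)
    return saved
```

Because the files are copied without decoding, this works regardless of whether an image library can read the format.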
I tried to use Tesseract in Python to OCR some PDFs. The workflow is to first convert each PDF to a series of images using wand, then send them to Tesseract, based on this example. I applied this to 5 PDFs, but the OCR failed completely on one of them. Converting that PDF to TIFF works fine, so I guess something needs to be tuned in the OCR step? Or are there other tools I should use for this situation? I tried xpdfbin-win-3.04, which worked on this PDF but did not work as well as Tesseract on the other PDFs...
Screenshot of failed PDF
Output text
Code
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()[2]
pth_str = "C:/Users/TH/Desktop/OCR_test/"
fname_list = ["999437-Asb_1-34.pdf"]
for each_file in fname_list:
    print(each_file)
    req_image = []
    final_text = []
    # convert to tiff
    image_pdf = Image(filename=pth_str+each_file, resolution=600)
    image_tif = image_pdf.convert('tiff')
    for img in image_tif.sequence:
        img_page = Image(image=img)
        req_image.append(img_page.make_blob('tiff'))
    # begin OCR
    for img in req_image:
        txt = tool.image_to_string(
            PI.open(io.BytesIO(img)),
            lang=lang,
            builder=pyocr.builders.TextBuilder()
        )
        final_text.append(txt.encode('ascii','ignore'))