I want to extract text from thousands of images and put it into a CSV file. Can anyone tell me how to do that? I have the images saved on my desktop.
Sure.
Install the pytesseract module using this command:
pip install pytesseract
Install the Tesseract engine executable from one of these URLs:
tesseract cmd 32 bit
or
tesseract cmd 64 bit
Create a Python script called images_to_csv.py and paste this code:
import pytesseract
from PIL import Image # pip install Pillow
# set tesseract cmd to the be the path to your tesseract engine executable
# (where you installed tesseract from above urls)
# IMPORTANT: this should end with '...\tesseract.exe'
pytesseract.pytesseract.tesseract_cmd = r"<path_to_your_tesseract_cmd>"
# and start doing it
# your saved images on desktop
list_with_many_images = [
"path1",
"path2"
# ...
"pathN"
]
# create a function that returns the text
def image_to_str(path):
""" return a string from image """
return pytesseract.image_to_string(Image.open(path))
# now pure action + csv part
with open("images_content.csv", "w+", encoding="utf-8") as file:
file.write("ImagePath, ImageText")
for image_path in list_with_many_images:
text = image_to_str(image_path)
line = f"{image_path}, {text}\n"
file.write(line)
That's all you need to get started.
If you want to use the csv module instead, go ahead; a minimal sketch follows below.
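The sketch below also collects the image paths from your desktop automatically with glob; the desktop location and the .png extension are assumptions, so adjust them to your setup (tesseract_cmd must still be set as shown above).
import csv
import glob
import os
import pytesseract
from PIL import Image

desktop = os.path.join(os.path.expanduser("~"), "Desktop")  # assumed location
image_paths = glob.glob(os.path.join(desktop, "*.png"))     # assumed extension

with open("images_content.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["ImagePath", "ImageText"])
    for path in image_paths:
        # csv.writer quotes commas and newlines inside the OCR text for us
        writer.writerow([path, pytesseract.image_to_string(Image.open(path))])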
Enjoy.
Scenario
I'm trying to convert a colored example.svg file into a grayscale example_gs.svg in Python 3.6 with Anaconda 4.8.3 on a Windows 10 device.
Attempts
First I tried to apply a regex to convert 'rgb(xxx,yyy,zzz)' to black, but that created a black rectangle, losing the image in the process. Next I installed Inkscape and ran a grayscale command, which appeared to work but did not modify example.svg. The third attempt with Pillow's Image did not load the .svg.
MWE
# conda install -c conda-forge inkscape
# https://www.commandlinefu.com/commands/view/2009/convert-a-svg-file-to-grayscale
# inkscape -f file.svg --verb=org.inkscape.color.grayscale --verb=FileSave --verb=FileClose
import re
import os
import fileinput
from PIL import Image
import cv2
# Doesn't work, creates a black square. Perhaps set a threshold to not convert rgb white/bright colors
def convert_svg_to_grayscale(filepath):
# Read in the file
with open(filepath, 'r') as file :
filedata = file.read()
# Replace the target string
filedata = re.sub(r'rgb\(.*\)', 'black', filedata)
# Write the file out again
with open(filepath, 'w') as file:
file.write(filedata)
# opens inkscape, converts to grayscale but does not actually export to the output file again
def convert_svg_to_grayscale_inkscape(filepath):
command = f'inkscape -f {filepath} --verb=org.inkscape.color.grayscale --verb=FileSave --verb=FileClose'
os.system(f'cmd /k {command}')
# Pillow Image is not able to import .svg files
def grayscale(filepath):
image = Image.open(filepath)
cv2.imwrite(f'{filepath}', image.convert('L'))
# calls each conversion attempt above on example.svg
def main():
filepath = 'example.svg'
convert_svg_to_grayscale_inkscape(filepath)
convert_svg_to_grayscale(filepath)
grayscale(filepath)
if __name__ == '__main__':
main()
Question
How could I change a colored .svg file into a grayscale image in Python 3.6 on a Windows device?
You have chosen the right tools for converting to grayscale. Your last attempt is close, but you need to import cairosvg, which provides the svg2png function. Then load the PNG with Pillow, convert it to an np.array, and you can easily load it with OpenCV and convert it to grayscale as you did. Finally, you can use svglib and reportlab to export the image as SVG.
Use this snippet as an example:
https://stackoverflow.com/a/62345450/13605264
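For reference, a minimal sketch of the pipeline described above, assuming the file names example.svg and example_gs.png (the final re-export to SVG is what the linked snippet covers):
import cairosvg  # pip install cairosvg
import cv2       # pip install opencv-python
import numpy as np
from PIL import Image

# Rasterize the SVG to a PNG with cairosvg.
cairosvg.svg2png(url="example.svg", write_to="example.png")

# Load the PNG with Pillow and turn it into a NumPy array.
rgb = np.array(Image.open("example.png").convert("RGB"))

# Convert to grayscale with OpenCV and save the result.
gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
cv2.imwrite("example_gs.png", gray)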
How can I extract images/logos from a Word document using Python and store them in a folder? The following code converts the docx to HTML, but it doesn't extract the images from the HTML. Any pointer/suggestion will be of great help.
profile_path = <file path>
result = mammoth.convert_to_html(profile_path)
f = open(profile_path, 'rb')
b = open(profile_html, 'wb')
document = mammoth.convert_to_html(f)
b.write(document.value.encode('utf8'))
f.close()
b.close()
You can use the docx2txt library; it will read your .docx document and export the images to a directory you specify (the directory must already exist).
!pip install docx2txt
import docx2txt
text = docx2txt.process("/path/your_word_doc.docx", '/home/example/img/')
After execution you will have the images in /home/example/img/ and the variable text will contain the document text. The images are named image1.png ... imageN.png in order of appearance.
Note: the Word document must be in .docx format.
Extract all the images in a docx file using python
1. Using docx2txt
import docx2txt
#extract text
text = docx2txt.process(r"filepath_of_docx")
#extract text and write images in Temporary Image directory
text = docx2txt.process(r"filepath_of_docx",r"Temporary_Image_Directory")
2. Using Aspose.Words
import aspose.words as aw
# load the Word document
doc = aw.Document(r"filepath")
# retrieve all shapes
shapes = doc.get_child_nodes(aw.NodeType.SHAPE, True)
imageIndex = 0
# loop through shapes
for shape in shapes:
    shape = shape.as_shape()
    if shape.has_image:
        # set the image file's name
        imageFileName = f"Image.ExportImages.{imageIndex}_{aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)}"
        # save the image
        shape.image_data.save(imageFileName)
        imageIndex += 1
Native, without any library
To extract the source images from the docx (which is a variation on a zip file) without distortion or conversion, shell out to the OS and run:
tar -m -xf DocxWithImages.docx word/media
You will find the source images (JPEG, PNG, WMF, or others) in the word/media folder, extracted into a folder of that name. These are the unadulterated source embedments, without scaling or cropping.
You may be surprised that the visible area can be larger than any cropped version used in the docx itself, so be aware that Word does not always crop images as expected (a source of embarrassing redaction failures).
See Alderven's answer at Extract all the images in a docx file using python.
The zipfile approach works for more image formats than docx2txt. For example, EMF images are not extracted by docx2txt but can be extracted with zipfile.
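For completeness, a minimal sketch of that zipfile approach (the file and directory names are placeholders): a .docx is a zip archive, and the embedded images live under word/media/.
import os
import zipfile

docx_path = "your_word_doc.docx"  # placeholder input file
out_dir = "extracted_images"      # placeholder output directory
os.makedirs(out_dir, exist_ok=True)

with zipfile.ZipFile(docx_path) as archive:
    for name in archive.namelist():
        # Embedded images (PNG, JPEG, EMF, WMF, ...) all sit under word/media/.
        if name.startswith("word/media/"):
            with open(os.path.join(out_dir, os.path.basename(name)), "wb") as f:
                f.write(archive.read(name))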
Using Python, is it possible to crop a PDF page to its content, as shown in the image below where the task is achieved in Inkscape? The bounding area for the content should be found automatically.
Using PyPDF2 I can crop the page, but it requires the coordinates to be manually found, which is tedious for a large number of files. In Inkscape, the coordinates are automatically found.
The code I'm using is shown below and an example input file is available here.
# Python 3.7.0
import PyPDF2 # version 1.26.0
with open('document-1.pdf','rb') as fin:
pdf = PyPDF2.PdfFileReader(fin)
page = pdf.getPage(0)
# Coordinates found by inspection.
# Can these coordinates be found automatically?
page.cropBox.lowerLeft=(88,322)
page.cropBox.upperRight = (508,602)
output = PyPDF2.PdfFileWriter()
output.addPage(page)
with open('cropped-1.pdf','wb') as fo:
output.write(fo)
I was able to do this with the pip-installable CLI https://pypi.org/project/pdfCropMargins/
Since I originally answered, a Python interface has been added: https://github.com/abarker/pdfCropMargins#python-interface (h/t #Paul)
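A minimal sketch using that Python interface, assuming the crop() entry point documented in the linked README and the same flags as the command-line call below:
from pdfCropMargins import crop

# crop() takes the same arguments as the CLI; file names are placeholders.
crop(["-p", "0", "-o", "output.pdf", "document.pdf"])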
My original answer calling it from the commandline is below.
Unfortunately, I don't believe there's a great way to call it directly from a script, so for now I'm using os.system.
$ python -m pip install pdfCropMargins --user
$ pdf-crop-margins document.pdf -o output.pdf -p 0
import os
os.system('pdf-crop-margins document.pdf -o output.pdf -p 0')
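As an alternative to os.system, here is a sketch of the same call with subprocess.run (a swap-in, not what I originally used), which raises an error if the command fails:
import subprocess

# Same command and flags as above; check=True raises CalledProcessError on failure.
subprocess.run(
    ["pdf-crop-margins", "document.pdf", "-o", "output.pdf", "-p", "0"],
    check=True,
)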
I am trying to detect Bangla characters in an image using Python, so I decided to use pytesseract. For this purpose I have used the code below:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
im = Image.open("input.png") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('temp2.png'),lang="ben")
print(text)
The problem is that if I give it an image of English characters, it detects them. But when I set lang="ben" and try to detect Bengali characters in an image, my code runs endlessly, seemingly forever.
P.S.: I have downloaded the Bengali language trained data to the tessdata folder, and I am trying to run it in PyCharm.
Can anyone help me solve this problem?
sample of input.png
I added the Bangla (India) language to Windows and downloaded ben.traineddata to TESSDATA_PREFIX, which equals C:\Program Files\Tesseract 4.0.0\tessdata on my PC. Then I ran
> tesseract -l ben bangla.jpg bangla_out
in the command prompt and got the result below in 2 seconds. The result looks fine, even though I don't understand the language.
Have you tried running tesseract in the command prompt to verify that it works with -l ben?
EDIT:
I used Spyder (similar to PyCharm), which comes with Anaconda, to test it, and modified your code to call Tesseract as below.
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
Test Code in Spyder:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
import os
im = Image.open("bangla.jpg") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save("bangla_pp.jpg")
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
text = pytesseract.image_to_string(Image.open("bangla_pp.jpg"),lang="ben")
print(text)
It works and produced the result below on the processed image. Apparently, the OCR result for the processed image is not as good as that for the original one.
Result from the processed bangla_pp.jpg:
প্রত্যাবর্তনকারীরা
তাঁদের দেশে গিয়ে
-~~-<~~~~--
প্রত্যাবর্তন-পরবর্তী
আর্থিক সহায়তা
= পাবেন তার
Result from the original image, fed directly to Tesseract.
Code:
from PIL import Image
import pytesseract as tess
print(tess.image_to_string(Image.open('bangla.jpg'), lang='ben'))
Output:
প্রত্যাবর্তনকারীরা
তাঁদের দেশে গিয়ে
প্রত্যাবর্তন-পরবর্তী
আর্থিক সহায়তা
পাবেন তার
I installed some fonts in Windows from here:
https://www.omicronlab.com/bangla-fonts.html
After that, it worked perfectly fine for me in PyCharm.
I tried to use Tesseract in Python to OCR some PDFs. The workflow is to convert a PDF to a series of images first using wand, then send them to Tesseract, based on this example. I applied this to 5 PDFs but found that it completely failed to convert one of them. Converting the PDF to TIFF works fine, so I guess maybe something needs to be tuned in the OCR process? Or is there any other tool I should use to deal with this situation? I tried xpdfbin-win-3.04, which worked on this PDF but did not work as well as Tesseract on the other PDFs...
Screenshot of failed PDF
Output text
Code
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()[2]
pth_str = "C:/Users/TH/Desktop/OCR_test/"
fname_list = ["999437-Asb_1-34.pdf"]
for each_file in fname_list:
print(each_file)
req_image = []
final_text = []
# convert to tiff
image_pdf = Image(filename=pth_str+each_file, resolution=600)
image_tif = image_pdf.convert('tiff')
for img in image_tif.sequence:
img_page = Image(image=img)
req_image.append(img_page.make_blob('tiff'))
# begin OCR
for img in req_image:
txt = tool.image_to_string(
PI.open(io.BytesIO(img)),
lang=lang,
builder=pyocr.builders.TextBuilder()
)
final_text.append(txt.encode('ascii','ignore'))