tesseract image to only digits/numbers - python

Yes, I know there are a lot of questions on this subject, but I can't find a solution.
I use Python 3.8 and Tesseract v5.0.
This website serves a captcha image:
https://medeczane.sgk.gov.tr/eczane/SayiUretenImageYeniServlet
(it changes on every load)
I downloaded one image to test with and saved it as example.jpg.
I tried changing the config, for example
config="digits" and config="tessedit_char_whitelist=0123456789",
and then tried other things I found on the web. The image example.jpg shows 448301.
I want to read the digits from the image and get them as an integer. How can I do that?
The code below gives me numbers = 4 ON\n\x0c
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
numbers=pytesseract.image_to_string(Image.open('example.jpg'),config="digits")
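One possible way to get an integer is sketched below, under a few assumptions: the whitelist is passed with the -c prefix so that Tesseract reads it as a variable assignment, --psm 7 treats the captcha as a single text line, and any non-digit characters left in the output are dropped before conversion. The helper names to_int and read_number are hypothetical, and accuracy on this particular captcha is not guaranteed.

```python
import re

# Restrict Tesseract to digits and treat the image as one text line.
DIGITS_CONFIG = "--psm 7 -c tessedit_char_whitelist=0123456789"


def to_int(ocr_text):
    """Return the digits found in ocr_text as an int, or None if there are none."""
    digits = re.sub(r"\D", "", ocr_text)
    return int(digits) if digits else None


def read_number(path, config=DIGITS_CONFIG):
    """OCR the image at `path` and convert the result to an int (hypothetical helper)."""
    # Imported here so that to_int() can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    return to_int(pytesseract.image_to_string(Image.open(path), config=config))
```

Calling read_number('example.jpg') would then return an int or None. Note that int() drops leading zeros, so if the captcha can start with 0 it may be safer to keep the digit string instead.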

Related

Python Tesseract not recognising number in my image

I've got this picture (preprocessed image) from which I want to extract the numeric values of each line. I'm using pytesseract, but it doesn't show any results for this image.
I've tried several config options from other questions, like "--psm 13 --oem 3" or whitelisting numbers, but nothing yields results.
As a result I usually get just one or two characters, or about five dots and dashes, but nothing even remotely resembling the size of my input.
I hope someone can help me. Cheers in advance for your time.
pytesseract version: 0.3.8
tesseract version: 5.0.0-alpha.20210506
Consider using --psm 4; it is more appropriate for your image. I also recommend rethinking the image pre-processing. Tesseract is not perfect, and it needs a good input image to work well.
import cv2 as cv
import pytesseract as tsr
img = cv.imread('41DAx.jpg')
img = cv.cvtColor(img, cv.COLOR_BGR2RGB)
config = '--psm 4 -c tessedit_char_whitelist=0123456789,'
text = tsr.image_to_string(img, config=config)
print(text)
The code above did not detect every digit in the image, but it got most of them. With a bit more image pre-processing, you may reach your objective.
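Since the best page segmentation mode depends on the layout of the image, one way to choose is to run the same image through several modes and compare the output. This is only a sketch: compare_psm_modes and its ocr parameter are hypothetical, the list of modes is an arbitrary starting point, and the whitelist is the one used in the code above.

```python
def compare_psm_modes(image, modes=(4, 6, 7, 13), whitelist="0123456789,", ocr=None):
    """Run OCR once per page segmentation mode and return {mode: cleaned text}."""
    if ocr is None:
        # Imported lazily so the helper can be exercised with a stand-in `ocr`.
        import pytesseract
        ocr = pytesseract.image_to_string

    results = {}
    for mode in modes:
        config = f"--psm {mode} -c tessedit_char_whitelist={whitelist}"
        # strip() removes the trailing newline and form feed Tesseract appends
        results[mode] = ocr(image, config=config).strip()
    return results
```

Printing the returned dictionary for the image makes it easy to see which mode recovers the most lines.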

Tesseract - digit recognition with many errors

I want to be able to recognize digits from images, so I have been playing around with tesseract and Python. I looked into how to prepare the image and tried running tesseract on it, and I must say I am pretty disappointed by how badly my digits are recognized. I have tried to prepare my images with OpenCV and thought I did a pretty good job (see examples below), but tesseract makes a lot of errors when trying to identify my images. Am I expecting too much here? When I look at these example images, I think that tesseract should easily be able to identify these digits without any problems. I am wondering if the accuracy is not there yet or if my configuration is not optimal. Any help or direction would be greatly appreciated.
Things I tried to improve the digit recognition (nothing seemed to improve the results significantly):
Limit characters: config = "--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789"
Upscale the images
Add a white border around the image to give the characters more space, as I have read that this improves recognition
Threshold the image so it has only black and white pixels
Examples:
Image 1:
Tesseract recognized: 72
Image 2:
Tesseract recognized: 0
EDIT:
Image 3:
https://ibb.co/1qVtRYL
Tesseract recognized: 1723
I'm not sure what's going wrong for you. I downloaded those images and tesseract interprets them just fine for me. What version of tesseract are you using (I'm using 5.0)?
781429
209441
import pytesseract
import cv2
from PIL import Image
# set the path to the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\ichu\AppData\Local\Programs\Tesseract-OCR\tesseract.exe'
# load images
first = cv2.imread("first_text.png")
second = cv2.imread("second_text.png")
images = [first, second]
# convert to Pillow images (OpenCV loads BGR, Pillow expects RGB)
pimgs = []
for img in images:
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    pimgs.append(Image.fromarray(rgb))
# run OCR on each image
for img in pimgs:
    text = pytesseract.image_to_string(img, config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    print(text.strip())  # drops the trailing newline and form feed

Empty string with Tesseract

I'm trying to read different cropped images from a big file. I manage to read most of them, but some return an empty string when I try to read them with tesseract.
The code is just this line:
pytesseract.image_to_string(cv2.imread("img.png"), lang="eng")
Is there anything I can try in order to read these kinds of images?
Thanks in advance
Edit:
Thresholding the image before passing it to pytesseract increases the accuracy.
import cv2
import numpy as np
import pytesseract
from PIL import Image
# Load the image as grayscale
img = Image.open('num.png').convert('L')
# Binarize: pixels above 125 become white (255), the rest black (0)
ret, img = cv2.threshold(np.array(img), 125, 255, cv2.THRESH_BINARY)
# Older versions of pytesseract need a Pillow image
# Convert back if needed
img = Image.fromarray(img.astype(np.uint8))
print(pytesseract.image_to_string(img))
This printed out
5.78 / C02
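The fixed threshold of 125 used above may not suit every image. Otsu's method chooses a threshold from the image histogram instead; in OpenCV it is available by adding cv2.THRESH_OTSU to the threshold type. The pure-Python sketch below shows what that selection computes, assuming 8-bit grayscale values. The names otsu_threshold and binarize are hypothetical helpers, not part of OpenCV or pytesseract.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold (0-255) for an iterable of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the gray values of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already on the dark side
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # between-class variance, up to a constant factor
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map gray values above the threshold to 255 and the rest to 0."""
    return [255 if p > threshold else 0 for p in pixels]
```

With OpenCV the equivalent call would be cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU), where the threshold argument 0 is ignored and the chosen value is returned as the first result.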
Edit:
Doing just thresholding on the second image returns 11.1. Another step that can help is to set the page segmentation mode to "Treat the image as a single text line." with the config --psm 7. Doing this on the second image returns 11.1 "202 ', with the quotation marks coming from the partial text at the top. To ignore those, you can also set what characters to search for with a whitelist by the config -c tessedit_char_whitelist=0123456789.%. Everything together:
pytesseract.image_to_string(img, config='--psm 7 -c tessedit_char_whitelist=0123456789.%')
This returns 11.1 202. Clearly pytesseract is having a hard time with that percent symbol, and I'm not sure how to improve that with image processing or config changes.
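When the output is empty or contains doubtful characters, it can help to look at per-word confidences rather than only the final string. pytesseract.image_to_data with output_type=pytesseract.Output.DICT returns parallel lists under the keys "text" and "conf". The sketch below filters that dictionary; confident_words is a hypothetical helper and the cutoff of 60 is an arbitrary assumption to tune per image.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, confidence) pairs whose confidence is at least min_conf.

    `data` is the dictionary returned by
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT).
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some versions report confidences as strings
        # skip empty entries (layout rows) and low-confidence words
        if text.strip() and conf >= min_conf:
            words.append((text, conf))
    return words
```

A low confidence on a token such as 202 suggests that the region needs different pre-processing or cropping, rather than a different whitelist.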

Detect text in image and trace on them

I need some help with image processing. I'm working on a script that detects letters in an image and traces them. For example, if there is a letter A in the image, the code has to detect it and trace it 3-4 times (side by side, not over the same line, in different colors) at a distance based on the width of the text. So far I am able to detect the words, font, and size using the tesserocr module, but I'm unable to do the tracing.
import tesserocr
from PIL import Image
with tesserocr.PyTessBaseAPI() as api:
    image = Image.open("1d.png")
    api.SetImage(image)
    api.Recognize()
    iterator = api.GetIterator()
    # font attributes of the word the iterator currently points at
    print(iterator.WordFontAttributes())
Thanks in advance #peace

Detecting Bangla character using pytesseract

I am trying to detect Bangla characters in an image using Python, so I decided to use pytesseract. For this purpose I have used the code below:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
im = Image.open("input.png") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('temp2.png'),lang="ben")
print(text)
The problem is that if I give it an image of English characters, it detects them. But when I set lang="ben" and run it on an image of Bengali characters, my code runs endlessly.
P.S.: I have downloaded the Bengali trained data to the tessdata folder, and I am trying to run it in PyCharm.
Can anyone help me solve this problem?
sample of input.png
I added the Bangla (India) language to Windows and downloaded ben.traineddata to TESSDATA_PREFIX, which is C:\Program Files\Tesseract 4.0.0\tessdata on my PC. Then I ran
> tesseract -l ben bangla.jpg bangla_out
in the command prompt and got the result below in 2 seconds. The result looks fine, even though I don't understand the language.
Have you tried running tesseract in the command prompt to verify that it works with -l ben?
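The same check can be scripted by asking Tesseract which languages it can see. The sketch below runs tesseract --list-langs and looks for ben in the result. It assumes the first non-empty line of the output is a header and that the remaining lines are language codes; parse_list_langs and installed_languages are hypothetical helper names.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first non-empty line is a header such as "List of available languages (3):".
    return lines[1:]


def installed_languages(tesseract_cmd="tesseract"):
    """Run `tesseract --list-langs` and return the language codes it reports."""
    proc = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Some builds print the list on stderr rather than stdout.
    return parse_list_langs(proc.stdout or proc.stderr)
```

If "ben" is missing from installed_languages(), the traineddata file is probably not in the folder Tesseract is reading. pytesseract.image_to_string also accepts a timeout argument in seconds, which raises RuntimeError instead of letting a call hang.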
EDIT:
I used Spyder, an IDE similar to PyCharm that comes with Anaconda, to test it. I modified your code to call Tesseract as below.
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
Test Code in Spyder:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
import os
im = Image.open("bangla.jpg") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save("bangla_pp.jpg")
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
text = pytesseract.image_to_string(Image.open("bangla_pp.jpg"),lang="ben")
print(text)
It works and produced the result below on the processed image. Apparently, the OCR result for the processed image is not as good as for the original one.
Result from the processed bangla_pp.jpg:
প্রত্যাবর্তনকারীরা
তাঁদের দেশে গিয়ে
-~~-<~~~~--
প্রত্যাবর্তন-পরবর্তী
আর্থিক সহায়তা
= পাবেন তার
Result from the original image, fed directly to Tesseract.
Code:
from PIL import Image
import pytesseract as tess
print(tess.image_to_string(Image.open('bangla.jpg'), lang='ben'))
Output:
প্রত্যাবর্তনকারীরা
তাঁদের দেশে গিয়ে
প্রত্যাবর্তন-পরবর্তী
আর্থিক সহায়তা
পাবেন তার
I installed some fonts in Windows from here:
https://www.omicronlab.com/bangla-fonts.html
After that, it worked perfectly fine for me in Pycharm.
