I want my program to read /, _, and \ from an image but sometimes it reads / as I and /_\ as A. I am using the pytesseract library to do this.
Is there a way to specifically read characters like /_ and \?
You can use pytesseract.image_to_string to read text from an image. Depending on your image, you may want to perform preprocessing before throwing it into Pytesseract; this can be a combination of thresholding, blurring, or smoothing techniques using morphological operations. Using this example image, we use the --psm 6 config flag since we want to treat the image as a single uniform block of text; other configuration flags can also be useful depending on the layout. Here's the code, with the result printed to the console:
import cv2
import pytesseract

# Point pytesseract at the Tesseract executable (Windows path)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image in grayscale and OCR it as a single uniform block of text
image = cv2.imread('1.png', 0)
data = pytesseract.image_to_string(image, lang='eng', config='--psm 6')
print('Result:', data)
I am using Tesseract OCR to try to convert a preprocessed license plate image into text, but I have not had much success with some images that look perfectly fine. The Tesseract setup can be seen in the function definition. I am running this on Google Colab. The input image is the ZG NIVEA 1 below. I am not sure if I am doing something wrong or if there is a better way to do this; the result I get from this particular image is A.
!sudo apt install -q tesseract-ocr
!pip install -q pytesseract
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
import cv2
import re
def pytesseract_image_to_string(img, oem=3, psm=7) -> str:
    '''
    oem - OCR Engine Mode
    0 = Original Tesseract only.
    1 = Neural nets LSTM only.
    2 = Tesseract + LSTM.
    3 = Default, based on what is available.

    psm - Page Segmentation Mode
    0 = Orientation and script detection (OSD) only.
    1 = Automatic page segmentation with OSD.
    2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
    3 = Fully automatic page segmentation, but no OSD. (Default)
    4 = Assume a single column of text of variable sizes.
    5 = Assume a single uniform block of vertically aligned text.
    6 = Assume a single uniform block of text.
    7 = Treat the image as a single text line.
    8 = Treat the image as a single word.
    9 = Treat the image as a single word in a circle.
    10 = Treat the image as a single character.
    11 = Sparse text. Find as much text as possible in no particular order.
    12 = Sparse text with OSD.
    13 = Raw line. Treat the image as a single text line,
         bypassing hacks that are Tesseract-specific.
    '''
    tess_string = pytesseract.image_to_string(img, config=f'--oem {oem} --psm {psm}')
    regex_result = re.findall(r'[A-Z0-9]', tess_string)  # keep only uppercase alphanumeric characters
    return ''.join(regex_result)
image = cv2.imread('nivea.png')
print(pytesseract_image_to_string(image))
Edit: The approach in the accepted answer works for the ZGNIVEA1 image, but not for others. Is there a general "font size" that Tesseract OCR works with best, or is there a rule of thumb?
By applying Gaussian blur before OCR, I ended up with the correct output. Also, you may not need the regex: adding -c tessedit_char_whitelist=ABC.. to your config string filters the output for you.
The code that produces correct output for me:
import cv2
import pytesseract

image = cv2.imread("images/tesseract.png")

# Only allow uppercase letters in the output
config = '--oem 3 --psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ'

# Upscale, then apply a Gaussian blur to smooth the edges before OCR
image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
image = cv2.GaussianBlur(image, (5, 5), 0)

string = pytesseract.image_to_string(image, config=config)
print(string)
Output:
Answer 2:
Sorry for the late reply. I tested the same code on your second image and it gave me the correct output. Are you sure you removed the config part? It won't allow numbers through, since my whitelist contains only letters.
The most accurate solution here is training your own Tesseract model on the license plate font (FE-Schrift) instead of Tesseract's default eng.traineddata model. It will definitely increase the accuracy, since the model would only contain your use case's characters as output classes. As for your latter question: Tesseract does some preprocessing of its own before the recognition step (thresholding, morphological closing, etc.), which is why it is so sensitive to letter size. In a smaller image, contours are closer to each other, so closing will not separate them.
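As a rule of thumb, Tesseract is often reported to work best when capital letters are roughly 30 px tall after preprocessing, though the sweet spot varies by font and engine version. A minimal sketch, assuming the text line fills most of the crop vertically:

import cv2

def resize_for_ocr(img, target_height=40):
    # target_height=40 is an assumption for a single-line crop where the
    # glyphs span most of the image height; tune it for your plates
    scale = target_height / img.shape[0]
    return cv2.resize(img, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_CUBIC)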
To train Tesseract on a custom font, you can follow the official docs.
To read more about Tesseract's theoretical side, you can check these papers:
1 (relatively old)
2 (newer)
I'm trying to read the digits from this image:
Using pytesseract with these settings:
custom_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(img, config=custom_config)
This is the output:
((E ST7 [71aT6T2 ] THETOGOG5 15 [8)
Whitelisting only digits, as well as changing your psm, provides much better results. You also need to remove carriage returns and whitespace. Below is code that does that.
import pytesseract
import re
from PIL import Image

# Open the image
im = Image.open("numbers.png")

# Configuration that whitelists only digit characters
custom_config = r'--oem 3 --psm 11 -c tessedit_char_whitelist=0123456789'

# Find the numbers in the image
numbers_string = pytesseract.image_to_string(im, config=custom_config)

# Strip letters and newlines from the lower-cased result
numbers = re.sub(r'[a-z\n]', '', numbers_string.lower())

# Print the output
print(numbers)
The result of the code on your image is: '31477423353'
Unfortunately, a few digits are still missing. I experimented a bit: I downloaded your image and erased the grid.
After removing the grid and executing the code again, pytesseract produces a perfect result: '314774628300558'
So you might try to think about how you can remove the grid programmatically. There are alternatives to pytesseract, but regardless you will get better output with the text isolated in the image.
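If you want to try removing the grid programmatically, one common OpenCV approach is to detect long horizontal and vertical lines with morphological opening and subtract them. A sketch, assuming dark digits and thin grid lines on a light background (the kernel length of 40 is an arbitrary choice you would tune):

import cv2

img = cv2.imread('numbers.png', 0)  # filename from the snippet above
# Invert-threshold so digits and grid lines are white on black
thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Long, thin kernels respond to grid lines but not to digit strokes
horiz = cv2.morphologyEx(thresh, cv2.MORPH_OPEN,
                         cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
vert = cv2.morphologyEx(thresh, cv2.MORPH_OPEN,
                        cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40)))

# Erase the detected lines, then flip back to black-on-white for OCR.
# Digit pixels that overlap the grid get erased too and may need repair,
# e.g. a light dilation or inpainting pass.
cleaned = cv2.bitwise_not(cv2.subtract(thresh, cv2.add(horiz, vert)))
cv2.imwrite('numbers_no_grid.png', cleaned)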
pytesseract.image_to_string() can usually scan the text properly, but it also returns a crap ton of gibberish characters. I'm guessing this is because the images I have contain text on top of a picture, and the picture underneath the text makes Pytesseract think it is text too.
When Pytesseract returns a string, how can I make it so that it doesn't include any text unless it's certain that the text is right?
Like, is there a way for Pytesseract to also return some sort of number telling me how certain it is that the text was scanned accurately?
I know I kinda sound dumb but somebody pls help
You can set a character whitelist with the config argument to get rid of gibberish characters, and you can also try different psm options to get better results.
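For the "certainty number" part of the question: pytesseract can return per-word confidence scores through image_to_data, which lets you drop anything below a threshold. A minimal sketch (the cutoff of 60 is an arbitrary choice; 'image.png' is a placeholder):

import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread('image.png')  # placeholder filename
data = pytesseract.image_to_data(img, output_type=Output.DICT)

# 'conf' holds a per-word confidence from 0-100; -1 marks non-text boxes
confident_words = [
    word for word, conf in zip(data['text'], data['conf'])
    if word.strip() and float(conf) > 60  # arbitrary cutoff
]
print(' '.join(confident_words))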
Beyond that, it is unfortunately not that easy; I think the only way is applying some image preprocessing, and this is my best attempt.
First, I applied some blurring to smooth the image:
import cv2

# 'image.png' is a placeholder; load your own image here
img = cv2.imread('image.png')
blurred = cv2.blur(img, (5, 5))
Then, to remove everything except the text, I converted the image to grayscale and applied thresholding to keep only white, which is the text color (I used inverse thresholding to make the text black, the optimal input for Tesseract OCR):
gray_blurred = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
ret, th1 = cv2.threshold(gray_blurred, 239, 255, cv2.THRESH_BINARY_INV)
Then I applied OCR and removed the whitespace characters:
txt = pytesseract.image_to_string(th1,lang='eng', config='--psm 12')
txt = txt.replace("\n", " ").replace("\x0c", "")
print(txt)
>>>"WINNING'OLYMPIC GOLD MEDAL IT'S MADE OUT OF RECYCLED ELECTRONICS "
Related topics:
Pytesser set character whitelist
Pytesseract OCR multiple config options
You can also try preprocessing your image to make pytesseract work more accurately, and if you want to recognize meaningful words, you can apply a spell check after OCR:
https://pypi.org/project/pyspellchecker/
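A quick sketch of that spell-check step with pyspellchecker (assuming English text; note the library works with lower-cased words):

from spellchecker import SpellChecker

spell = SpellChecker()  # English dictionary by default

ocr_text = "RECYCLED ELECTRON1CS"  # placeholder OCR output
words = ocr_text.lower().split()
for w in spell.unknown(words):  # words not found in the dictionary
    # correction() may return None when nothing plausible is found
    print(w, '->', spell.correction(w))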
I want to be able to recognize digits from images, so I have been playing around with tesseract and Python. I looked into how to prepare the image and tried running tesseract on it, and I must say I am pretty disappointed by how badly my digits are recognized. I prepared my images with OpenCV and thought I did a pretty good job (see examples below), but tesseract makes a lot of errors when trying to identify them. Am I expecting too much here? When I look at these example images, I think tesseract should easily be able to identify the digits without any problems. I am wondering if the accuracy is not there yet or if my configuration is not optimal. Any help or direction would be greatly appreciated.
Things I tried to improve the digit recognition (nothing seemed to improve the results significantly; a combined sketch of these steps follows the list):
limit characters: config = "--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789"
Upscale images
add a white border around the image to give the letters more space, as I have read that this improves the recognition process
Threshold image to only have black and white pixels
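For reference, a minimal sketch combining the steps above (the filename, scale factor, and border size are placeholder choices):

import cv2
import pytesseract

img = cv2.imread('digits.png', 0)  # placeholder filename
# Upscale, pad with a white border, then threshold to pure black/white
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
img = cv2.copyMakeBorder(img, 10, 10, 10, 10,
                         cv2.BORDER_CONSTANT, value=255)
img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

config = "--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789"
print(pytesseract.image_to_string(img, config=config))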
Examples:
Image 1:
Tesseract recognized: 72
Image 2:
Tesseract recognized: 0
EDIT:
Image 3:
https://ibb.co/1qVtRYL
Tesseract recognized: 1723
I'm not sure what's going wrong for you. I downloaded those images and tesseract interprets them just fine for me. What version of tesseract are you using (I'm using 5.0)?
781429
209441
import pytesseract
import cv2
import numpy as np
from PIL import Image

# set path
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\ichu\AppData\Local\Programs\Tesseract-OCR\tesseract.exe'

# load images
first = cv2.imread("first_text.png")
second = cv2.imread("second_text.png")
images = [first, second]

# convert to pillow
pimgs = []
for img in images:
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    pimgs.append(Image.fromarray(rgb))

# do OCR on each image, allowing digits only
for img in pimgs:
    text = pytesseract.image_to_string(img, config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    print(text[:-2])  # drop trailing newline + end char
I'm trying to read different cropped images from a big file, and I manage to read most of them, but some of them return an empty string when I try to read them with tesseract.
The code is just this line:
pytesseract.image_to_string(cv2.imread("img.png"), lang="eng")
Is there anything I can try to be able to read these kinds of images?
Thanks in advance
Edit:
Thresholding the image before passing it to pytesseract increases the accuracy.
import cv2
import numpy as np
from PIL import Image
import pytesseract

# Grayscale image
img = Image.open('num.png').convert('L')
ret, img = cv2.threshold(np.array(img), 125, 255, cv2.THRESH_BINARY)

# Older versions of pytesseract need a pillow image
# Convert back if needed
img = Image.fromarray(img.astype(np.uint8))

print(pytesseract.image_to_string(img))
This printed out
5.78 / C02
Edit:
Doing just the thresholding on the second image returns 11.1. Another step that can help is setting the page segmentation mode to "Treat the image as a single text line" with the config --psm 7. Doing this on the second image returns 11.1 "202 ', with the quotation marks coming from the partial text at the top. To ignore those, you can also restrict which characters to search for with a whitelist via the config -c tessedit_char_whitelist=0123456789.%. Everything together:
pytesseract.image_to_string(img, config='--psm 7 -c tessedit_char_whitelist=0123456789.%')
This returns 11.1 202. Clearly pytesseract is having a hard time with that percent symbol, and I'm not sure how to improve that with image processing or config changes.
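One more thing that sometimes helps with small stray symbols like % is upscaling before thresholding, in the spirit of the resizing used earlier in this thread. A sketch, untested on this particular image ('num2.png' is a placeholder for the second crop):

import cv2
import pytesseract

img = cv2.imread('num2.png', 0)  # placeholder filename
# Triple the size so small glyphs like % have more pixels to work with
img = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
img = cv2.threshold(img, 125, 255, cv2.THRESH_BINARY)[1]

print(pytesseract.image_to_string(
    img, config='--psm 7 -c tessedit_char_whitelist=0123456789.%'))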