How to increase likeliness of image recognition with pytesseract - python

I'm trying to convert a list of images to text. The images are fairly small but VERY readable (15x160, with only grey text on a white background), yet I can't seem to get pytesseract to read them properly. I tried to increase the size with .resize() but it didn't seem to do much at all. Here's some of my code. Is there anything I can add to increase my chances? Like I said, I'm VERY surprised that pytesseract is failing me here; the text is small but super readable compared to some of the things I've seen it catch.
for dImg in range(0, len(imgList)):
    url = imgList[dImg]
    local = "img" + str(dImg) + ".jpg"
    urllib.request.urlretrieve(url, local)
    imgOpen = Image.open(local)
    imgOpen.resize((500,500))
    imgToString = pytesseract.image_to_string(imgOpen)
    newEmail.append(imgToString)

Setting the Page Segmentation Mode (psm) can probably help.
To list all the available modes, run tesseract --help-psm in your terminal.
Then pick the psm that matches your need. Let's say you want to treat the image as a single text line; in that case your imgToString becomes:
imgToString = pytesseract.image_to_string(imgOpen, config = '--psm 7')
Hope this will help you.
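One more thing worth checking in the question's code, independent of the psm: in Pillow, Image.resize returns a new image and leaves the original untouched, so the result has to be assigned. Resizing a wide, short strip to 500x500 would also distort the glyphs. Below is a minimal sketch that scales by a fixed factor and then applies --psm 7. The helper names, the factor of 4 and the 160x15 (width x height) orientation are assumptions for illustration, not part of the original post.

```python
def upscaled_size(size, factor):
    """Return (width, height) multiplied by factor, preserving aspect ratio."""
    width, height = size
    return (width * factor, height * factor)


def ocr_single_line(path, factor=4):
    """Upscale the image at `path` and OCR it as a single text line."""
    # Imported here so upscaled_size() stays usable without the OCR stack.
    from PIL import Image
    import pytesseract

    img = Image.open(path).convert("L")  # grayscale
    # resize() returns a new image; the result must be kept.
    img = img.resize(upscaled_size(img.size, factor), Image.LANCZOS)
    return pytesseract.image_to_string(img, config="--psm 7").strip()


print(upscaled_size((160, 15), 4))  # (640, 60)
```

In the question's loop this would replace the Image.open / resize / image_to_string lines with a single call such as newEmail.append(ocr_single_line(local)).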

You can perform several pre-processing steps in your code.
1) Convert the image to grayscale: import Image from PIL and call your_img.convert('L'). There are several other modes you can check.
2) A more advanced method: use a CNN. There are several pre-trained CNNs you can use. Here you can find somewhat more detailed information: https://www.cs.princeton.edu/courses/archive/fall00/cs426/lectures/sampling/sampling.pdf

Related

(OCR) Tesseract not recognizing simple digits

I am using PyTesseract to extract information from multiple images which contain vertically separated prices (one price per line), horizontally aligned to the right like the following image:
Tesseract is not able to extract reliable text from such an image, so some image processing has to happen first:
Image scaling to 4x;
Binarization;
"Bolding";
Gaussian blur.
Which results in the following image:
Pytesseract is then able to extract the information successfully (using --psm 6), resulting in a string containing:
96,000,000
94,009,999
90,000,000
85,000,000
78,000,000
70,000,000
66,000,000
However, when Pytesseract is presented with some edge cases like an image with a single digit, recognition fails. Example:
Pre-processed:
post-processed:
This results in an empty string being extracted, which is strange because the number 8 was read successfully before. What other suggestions should I follow? I've spent endless hours trying to optimize the algorithm for such edge cases without success.
I tried the exact same scenario with EasyOCR. Note that EasyOCR does not use the Tesseract engine; it relies on its own deep-learning text detection and recognition models. I resized the image to a custom size of (600, 600), fed it to EasyOCR, and it worked.
import easyocr
import cv2
image = cv2.imread('7.png')
image = cv2.resize(image,(600,600))
cv2.imwrite('image.png',image)
reader = easyocr.Reader(['en'])
result = reader.readtext('image.png')
texts = [detection[1] for detection in result if detection[2] > 0.5]
print(texts)
The output for the first image is:
['96,000,000', '94,009,999', '90,000,000', '85,000,000', '78,000,000', '70,000,000', '66,000,000']
The output for the second image is:
['8']
Maybe this alternative solution will work for your case. You can install EasyOCR with pip install easyocr. Happy coding :)
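If you want to stay with Tesseract, one thing to try for the single-digit case is a different page segmentation mode: --psm 10 treats the image as a single character. Whether that is enough for this particular image is untested here. Below is a small sketch that retries with several modes until one returns text. The function name and the injected ocr callable are hypothetical, chosen so the retry logic can be exercised without Tesseract installed.

```python
def first_non_empty(image, ocr, configs=("--psm 6", "--psm 7", "--psm 10")):
    """Call ocr(image, config=...) with each config until text comes back.

    Returns (text, config) for the first non-empty result, or ("", None).
    """
    for config in configs:
        text = ocr(image, config=config).strip()
        if text:
            return text, config
    return "", None


# Stand-in for pytesseract.image_to_string, used only to show the control flow:
# it pretends that only single-character mode recognizes the digit.
def fake_ocr(image, config=""):
    return "8\n" if config == "--psm 10" else ""


print(first_non_empty("digit.png", fake_ocr))  # ('8', '--psm 10')
```

With the real library you would pass pytesseract.image_to_string as ocr and the preprocessed image instead of the placeholder file name.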

Using pytesseract to get numbers from an image

I'm trying to take an image that's just a number 1-10 in a bubble font and use pytesseract to get the number.
Picture in question:
Here is an article that makes this process seem straightforward:
https://towardsdatascience.com/read-text-from-image-with-one-line-of-python-code-c22ede074cac
lives = pyautogui.locateOnScreen('pics/lives.png', confidence = 0.9)
ss = pyautogui.screenshot(region=(lives[0]+lives[2],lives[1],lives[2]-6,lives[3]))
ss.save('pics/tempLives.png')
img = cv2.imread('pics/tempLives.png')
cv2.imwrite('pics/testPic.png',img)
test = pytesseract.image_to_string(img)
print(test)
I know 'img' is the same as the image provided because I've used ss.save and cv2.imwrite to check it.
I suppose my question is why it works so well in the article yet I cannot manage to get anything to print? I suppose the bubble font makes it trickier, but in the article those blue parentheses were read easily, so that makes me think this font wouldn't be too hard. Thanks for any help!
There are many cases where PyTesseract fails to recognize the text, and in some of them we have to give it a hint.
For the specific image you have posted, it helps to add the config="--psm 6" argument.
According to Tesseract documentation regarding PSM:
6 Assume a single uniform block of text.
Here is a code sample that manages to identify the text from the posted image:
import cv2
import pytesseract
#pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # May be required when using Windows
img = cv2.imread('pics/testPic.png') # Reading the input image (the PNG image from the posted question).
text = pytesseract.image_to_string(img, config=" --psm 6") # Execute PyTesseract OCR with "PSM 6" configuration (Assume a single uniform block of text)
print(text) # Prints the text (prints 10).
Note:
OCR does not always work, and there are many techniques for improving its accuracy.
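One way to see what Tesseract is doing when nothing useful is printed is to look at per-word confidences. pytesseract.image_to_data with output_type=pytesseract.Output.DICT returns parallel lists under keys such as "text" and "conf". Below is a sketch of filtering that result. The helper name, the threshold of 60 and the sample dictionary are made up for illustration.

```python
def confident_words(data, min_conf=60.0):
    """Keep (word, confidence) pairs at or above min_conf.

    `data` is expected to look like the dict returned by
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT).
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # conf may arrive as int, float or str
        if text.strip() and conf >= min_conf:
            words.append((text.strip(), conf))
    return words


# Hand-written sample in the same shape (not real Tesseract output).
sample = {"text": ["", "10", "l0"], "conf": ["-1", 91.5, 23]}
print(confident_words(sample))  # [('10', 91.5)]
```

If the list comes back empty for every psm you try, the problem is more likely in the input image (size, contrast, font) than in the configuration.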

Python Tesseract not recognising number in my image

I've got this picture (preprocessed image) from which I want to extract the numeric values on each line. I'm using pytesseract, but it doesn't return any useful results for this image.
I've tried several config options from other questions, like "--psm 13 --oem 3" or whitelisting digits, but nothing yields results.
I usually get just one or two characters, or about 5 dots/dashes, but nothing even remotely resembling the size of my input.
I hope someone can help me. Cheers in advance for your time.
pytesseract version: 0.3.8
tesseract version: 5.0.0-alpha.20210506
Consider using --psm 4; it is more appropriate for your image. I also recommend rethinking the image pre-processing. Tesseract is not perfect, and it needs a good input image to work well.
import cv2 as cv
import pytesseract as tsr
img = cv.imread('41DAx.jpg')
img = cv.cvtColor(img, cv.COLOR_BGR2RGB)
config = '--psm 4 -c tessedit_char_whitelist=0123456789,'
text = tsr.image_to_string(img, config=config)
print(text)
The above code was not able to detect all the digits in the image, but it found most of them. With a bit more image pre-processing you may be able to reach your objective.
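When experimenting with several modes and whitelists, it can help to build the config string in one place instead of editing it by hand. This is a minimal sketch; build_config is a hypothetical helper, not part of pytesseract, and it only assembles the same --psm and tessedit_char_whitelist options used above.

```python
def build_config(psm, whitelist=None):
    """Build a Tesseract config string from a PSM and an optional whitelist."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = [f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


print(build_config(4, whitelist="0123456789,"))
# --psm 4 -c tessedit_char_whitelist=0123456789,
```

The returned string can be passed directly as the config argument of image_to_string, for example tsr.image_to_string(img, config=build_config(6, "0123456789,")).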

Why can't tesseract extract text that has a black background?

I have attached a very simple text image that I want to extract text from. It is white text on a black background. To the naked eye it seems absolutely legible, but apparently to tesseract it is rubbish. I have tried changing the oem and psm parameters but nothing seems to work. Please note that this works for other images but not for this one.
Please try running it on your machine and see if it works. Otherwise I might have to change my OCR engine altogether.
Note: It was working earlier, until I tried to add black pixels around the image to help the extraction process. Also, I don't think tesseract was trained only on black text on a white background; it should be able to handle this too. And if that were the case, why does it work for other text images that have the same format as this one?
Edit: Miraculously I tried running the script again and this time it was able to extract Chand properly but failed in the below mentioned case. Also please look at the parameters I have used. I have read the documentation and I feel this would be the right choice. I have added the image for your reference. It is not about just this image. Why is tesseract failing for such simple use cases?
To get the desired result, you need to know the following:
Page-segmentation-modes
Suggested Image processing methods
The input images are written in bold, so we need to thin the bold strokes and then treat the output as a single uniform block of text.
To thin the strokes we can use erosion.
OCR results on the eroded images:
CHAND
BAKLIWAL
Code:
# Load the libraries
import cv2
import pytesseract
# Initialize the list of image file names
img_lst = ["lKpdZ.png", "ZbDao.png"]
# For each image name in the list
for name in img_lst:
    # Load the image
    img = cv2.imread(name)
    # Convert to gray-scale
    gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Erode the image (thins the white strokes)
    erd = cv2.erode(gry, None, iterations=2)
    # OCR, treating the image as a single uniform block of text
    txt = pytesseract.image_to_string(erd, config="--psm 6")
    print(txt)

Pytesseract - output is extremely inaccurate (MAC)

I installed pytesseract via pip and its results are terrible.
From what I found when searching, I think I need to give it more data,
but I can't find where to put the tessdata (traineddata) files,
since there is no directory like Program Files\Tesseract-OCR on a Mac.
There is no problem with images' resolution, font or size.
Image whose result is 'ecient Sh Abu'
Because large and clear test images work fine, I think it is a problem about lack of data.
But any other possible solution is welcomed as long as it can read text with Python.
Please help me..
Quoting the question: "I installed pytesseract via pip and its result is terrible."
Sometimes you need to apply preprocessing to the input image to get accurate results.
Quoting the question: "Because large and clear test images work fine, I think it is a problem about lack of data. But any other possible solution is welcomed as long as it can read text with Python."
You could say lack of data is a problem. I think you'll find morphological transformations useful.
For instance, if we apply a close operation, the result will be:
The image looks similar to the original posted image. However, there are slight changes in the output image (e.g. the word Grammar looks slightly different from the original image).
Now if we read the output image:
English
Grammar Practice
ter K-SAT (1-10)
Code:
import cv2
from pytesseract import image_to_string
img = cv2.imread("6Celp.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
opn = cv2.morphologyEx(gry, cv2.MORPH_OPEN, None)
txt = image_to_string(opn)
txt = txt.split("\n")
for i in txt:
    i = i.strip()
    if i != '' and len(i) > 3:
        print(i)
