Tesseract can not recognize captcha text - python

I am trying to recognize the text in a captcha and it is not possible for me. I am using python3, openCv and tesseract.
The simplified code is:
import cv2
from pytesseract import *
img_path = "path"
img = cv2.imread(img_path)
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
pytesseract.image_to_string(img)
I think I should remove the color lines first, then leave the text alone, and maybe change the brightness and contrast. What filter could apply?
These are some images to recognize.

For recognising captcha text using pytesseract-ocr, you can do the following..
Prepare custom train_set to training your tesseract instance to recognise a specific font [Optional]
The captcha images need some pre-processing(such as * Apply Black & White Filter > Scale(up) > Blur > Morphological Transformation + Adaptive threshold*)to enhance the text part and reduce the noises/lines.
For removing lines: In the sample images only the text can be seen in black color and there is no black line, so you can simply convert the each non-black pixel to white by using PIL or OpenCV, you can even utilize some specific algo like Hough Line Transform to detect and remove lines.
You can learn about all these filters and algos from the official documentation and tutorial on OpenCV website.

Related

Can't extract text from an image with python OCR pytesseract

I'm trying to extract texts from some images. It worked for hundreds of other images but in some cases it doesn't find any texts. In order to optimize the images for extraction phase, all images are converted to black and white. All of their backgrounds are white and others are black such as icons, texts etc.
For example it worked for below image and succesfully found 'Sleep Timer' text in the image. I'm not sure if it's relevant but size of the below image with 'Sleep Timer' text is 320 × 351
But for the below image it doesn't find any text at all. Image size for this one is 161 × 320.
Since I couldn't find the reason, I tried to resize the image but it didn't work.
Here is my code:
from pytesseract import Output
import pytesseract
import cv2
image = cv2.imread('imagePath')
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pytesseract.image_to_data(rgb, output_type=Output.DICT)
for i in range(0, len(results["text"])):
text = results["text"][i]
conf = int(results["conf"][i])
print("Confidence: {}".format(conf))
print("Text: {}".format(text))
print("")
It is working for me I tested:
import pytesseract
print(pytesseract.image_to_string('../images/grmgrm.jfif'))
results = pytesseract.image_to_data('../images/grmgrm.jfif', output_type=pytesseract.Output.DICT)
print(results)
Are you getting an error? Show us the error you are getting.

Why can't tesseract extract text that has a black background?

I have attached a very simple text image that I want text from. It is white with a black background. To the naked eye it seems absolutely legible but apparently to tesseract it is a rubbish. I have tried changing the oem and psm parameters but nothing seems to work. Please note that this works for other images but for this one.
Please try running it on your machine and see if it works. Or else I might have to change my ocr engine altogether.
Note: It was working earlier until I tried to add black pixels around the image to help the extraction process. Also I don't think that tesseract was trained on black text on a white background. It should be able to do this too. Also if this was true why does it work for other text images that have the same format as this one
Edit: Miraculously I tried running the script again and this time it was able to extract Chand properly but failed in the below mentioned case. Also please look at the parameters I have used. I have read the documentation and I feel this would be the right choice. I have added the image for your reference. It is not about just this image. Why is tesseract failing for such simple use cases?
To find the desired result, you need to know the followings:
Page-segmentation-modes
Suggested Image processing methods
The input images are boldly written, we need to shrink the bold font and then assume the output as a single uniform block of text.
To shrink the images we could use erosion
Result will be:
Erode
Result
CHAND
BAKLIWAL
Code:
# Load the library
import cv2
import pytesseract
# Initialize the list
img_lst = ["lKpdZ.png", "ZbDao.png"]
# For each image name in the list
for name in img_lst:
# Load the image
img = cv2.imread(name)
# Convert to gry-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Erode the image
erd = cv2.erode(gry, None, iterations=2)
# OCR with assuming the image as a single uniform block of text
txt = pytesseract.image_to_string(erd, config="--psm 6")
print(txt)

Is it possible to extract text from specific portion of image using pytesseract

I have bounding box(coordinate of rectangle) in an image and want to extract text within that coordinates. How can I use pytesseract to extract text within that coordinates?
I tried copying the image portion to other numpyarray using opencv like
cropped_image = image[y1:y2][x1:x2]
and tried pytesseract.image_to_string(). But the accuracy was very poor.
But when I tried original image to pytesseract.image_to_string() it extracted every thing perfectly..
Is there any function to extract specific portion of image using pytesseract?
This image has different sections of information consider I have rectangle coordinates enclosing 'Online food delivering system' how to extract that data in pytessaract?
Please help
Thanks in advance
Versions I am using:
Tesseract 4.0.0
pytesseract 0.3.0
OpenCv 3.4.3
There's no built in function to extract a specific portion of an image using Pytesseract but we can use OpenCV to extract the ROI bounding box then throw this ROI into Pytesseract. We convert the image to grayscale then threshold to obtain a binary image. Assuming you have the desired ROI coordinates, we use Numpy slicing to extract the desired ROI
From here we throw it into Pytesseract to get our result
ONLINE FOOD DELIVERY SYSTEM
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image = cv2.imread('1.jpg', 0)
thresh = 255 - cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
x,y,w,h = 37, 625, 309, 28
ROI = thresh[y:y+h,x:x+w]
data = pytesseract.image_to_string(ROI, lang='eng',config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.imshow('ROI', ROI)
cv2.waitKey()

Why does pytesseract fail to recognise digits from image with darker background?

I've this python code which I use to convert a text written in a picture to a string, it does work for certain images which have large characters, but not for the one I'm trying right now which contains only digits.
This is the picture:
This is my code:
import pytesseract
from PIL import Image
img = Image.open('img.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
result = pytesseract.image_to_string(img)
print (result)
Why is it failing at recognising this specific image and how can I solve this problem?
I have two suggestions.
First, and this is by far the most important, in OCR preprocessing images is key to obtaining good results. In your case I suggest binarization. Your images look extremely good so you shouldn't have any problem but if you do, then maybe you should try to binarize your images:
import cv2
from PIL import Image
img = cv2.imread('gradient.png')
# If your image is not already grayscale :
# img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
threshold = 180 # to be determined
_, img_binarized = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
pil_img = Image.fromarray(img_binarized)
And then try the ocr again with the binarized image.
Check if your image is in grayscale and uncomment if needed.
This is simple thresholding. Adaptive thresholding also exists but it is noisy and does not bring anything in your case.
Binarized images will be much easier for Tesseract to handle. This is already done internally (https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) but sometimes things can be messed up and very often it's useful to do your own preprocessing.
You can check if the threshold value is right by looking at the images :
import matplotlib.pyplot as plt
plt.imshow(img, cmap='gray')
plt.imshow(img_binarized, cmap='gray')
Second, if what I said above still doesn't work, I know this doesn't answer "why doesn't pytesseract work here" but I suggest you try out tesserocr. It is a maintained python wrapper for Tesseract.
You could try:
import tesserocr
text_from_ocr = tesserocr.image_to_text(pil_img)
Here is the doc for tesserocr from pypi : https://pypi.org/project/tesserocr/
And for opencv : https://pypi.org/project/opencv-python/
As a side-note, black and white is treated symetrically in Tesseract so having white digits on a black background is not a problem.

Python, text detection OCR

I am trying to extract data from a scanned form. The form has a standard format similar to the one shown in the image below:
I have tried using pytesseract (tesseract OCR) to detect the image's text and it has done a decent job at finding the text and converting the image to text.
However it essentially just gives me all the detected text without keeping the format of the data.
I would like to be able to do something like the below:
Find a particular piece of text and then find the associated data below or beside it. Similar to this question using opencv Detect text region in image using Opencv
Is there a way that I can essentially do the following:
Either find all text boxes on the form, perform OCR on each box and see which one is the closest match to the "witnesess:" text, then find the sections immediately below it and perform separate OCR on those.
Or if the form is standard and I know the approximate location of the "witness" text section can I specify its general location in opencv and then just extract the below text and perform OCR on it.
EDIT: I have tried the below code to try to detect specific regions of the text. However it is not specifically identifying the text just all regions.
import cv2
img = cv2.imread('t2.jpg')
mser = cv2.MSER_create()
img = cv2.resize(img, (img.shape[1]*2, img.shape[0]*2))
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
vis = img.copy()
regions = mser.detectRegions(gray)
hulls = [cv2.convexHull(p.reshape(-1, 1, 2)) for p in regions[0]]
cv2.polylines(vis, hulls, 1, (0,255,0))
cv2.imshow('img', vis)
Here is the result:
I think you have the answer already in your own post.
I did recently something similar and this is how I did it:
//id_image was loaded with cv2.imread
temp_image = id_image[start_y:end_y,start_x:end_x]
img = Image.fromarray(temp_image)
text = pytesseract.image_to_string(img, config="-psm 7")
So basically, if your format is predefined, you just need to know the location of the fields that you want the text of (which you already know), crop it, and then apply the ocr (tesseract) extraction.
In this case you need import pytesseract, PIL, cv2, numpy.

Categories

Resources