Tesseract unable to read math expression - python

I have this image of a simple math expression that Tesseract fails to read:
I tested a screenshot of the same expression typed on an Android phone, and it was read fairly well, so I suspected a font problem.
I considered:
Preprocessing the image by inverting it or removing the red areas
Training Tesseract with images (StackOverflow question with no answers)
Using WhatFontIs.com to find a similar font, then training Tesseract on that font file with TrainYourTesseract

But as I was typing the question, I looked around some more.
This answer prompted me to sanity-check the image with the VietOCR software, which outputs 8-3. Close enough!
I then experimented with the software and found that I could pass --psm 7 (Page Segmentation Mode 7: treat the image as a single text line) to my script, which works well for my math expressions:
pytesseract.image_to_string(img, config='--psm 7')
List of PSMs
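A minimal sketch of how the config string could be assembled, for anyone who wants to go one step further and restrict the character set. The helper names `build_config` and `read_expression` are made up for this example, and the whitelist is an assumption that was not part of the original fix: `tessedit_char_whitelist` is a Tesseract config variable, but whether it is honored depends on the Tesseract version and engine, so treat it as optional.

```python
def build_config(psm, whitelist=None):
    """Build a string for pytesseract's `config` argument (hypothetical helper)."""
    parts = ["--psm", str(psm)]
    if whitelist:
        # Assumption: restricting characters helps for digits-and-operators input.
        parts += ["-c", "tessedit_char_whitelist=" + whitelist]
    return " ".join(parts)


def read_expression(img, psm=7, whitelist="0123456789+-"):
    """OCR a one-line expression; requires pytesseract and Tesseract installed."""
    # Imported lazily so build_config can be used on its own.
    import pytesseract
    return pytesseract.image_to_string(img, config=build_config(psm, whitelist)).strip()


print(build_config(7))
print(build_config(7, "0123456789+-"))
```

Calling `read_expression(img)` with no whitelist (`whitelist=None`) is equivalent to the `--psm 7` call above.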

Related

How to read digital digits from an image (LCD screen) using python

I need to extract the digits from a digital weighing scale. OCR libraries and Pytesseract are not working for this problem. Are there any better libraries or solutions?
Below is the input image from which I'm trying to extract the digits.
Input image
Perhaps this article is what you're looking for: https://pyimagesearch.com/2017/02/13/recognizing-digits-with-opencv-and-python/
The article demonstrates the use of OpenCV to read LCD-digits off of tiny displays.
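For seven-segment displays, a common approach is to decide which of the seven segments are lit and look the pattern up in a table, instead of running a general OCR engine. The sketch below covers only that lookup step. The segment order, the names `SEGMENTS_TO_DIGIT`, `decode_digit` and `decode_display`, and the assumption that the on/off flags were already measured (for example by thresholding each segment region with OpenCV) are all choices made for this example, not code taken from the article.

```python
# Segment order assumed in this sketch:
# (top, top-left, top-right, middle, bottom-left, bottom-right, bottom)
SEGMENTS_TO_DIGIT = {
    (1, 1, 1, 0, 1, 1, 1): 0,
    (0, 0, 1, 0, 0, 1, 0): 1,
    (1, 0, 1, 1, 1, 0, 1): 2,
    (1, 0, 1, 1, 0, 1, 1): 3,
    (0, 1, 1, 1, 0, 1, 0): 4,
    (1, 1, 0, 1, 0, 1, 1): 5,
    (1, 1, 0, 1, 1, 1, 1): 6,
    (1, 0, 1, 0, 0, 1, 0): 7,
    (1, 1, 1, 1, 1, 1, 1): 8,
    (1, 1, 1, 1, 0, 1, 1): 9,
}


def decode_digit(segments):
    """Return the digit for seven on/off flags, or None for an unknown pattern."""
    key = tuple(int(bool(s)) for s in segments)
    return SEGMENTS_TO_DIGIT.get(key)


def decode_display(digits):
    """Decode a sequence of segment tuples; unknown patterns become '?'."""
    out = []
    for segments in digits:
        digit = decode_digit(segments)
        out.append("?" if digit is None else str(digit))
    return "".join(out)


reading = decode_display([(0, 0, 1, 0, 0, 1, 0), (1, 1, 1, 1, 1, 1, 1)])
print(reading)
```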

Tesseract unable to read simple number

I have this image and I need tesseract to read the value.
import cv2
import pytesseract
im = cv2.imread("num.png")
print(pytesseract.image_to_string(im))
It does not print anything. Am I doing something wrong? It is pretty clear that it is a 7.
Even after scaling the image up 5x with bicubic interpolation (cv2.INTER_CUBIC), it still does not work. This is the image now:
As described here:
By default Tesseract expects a page of text when it segments an image. If you’re just seeking to OCR a small region, try a different segmentation mode, using the --psm argument.
In this case, --psm from 6 to 10 should work fine. Example:
pytesseract.image_to_string(im, config='--psm 6')
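If it is not obvious which mode suits the image, one option is to try the suggested modes in turn and keep the first non-empty result. This is only a sketch: `ocr_with_psms` is a hypothetical helper, and the OCR function is passed in as a parameter so that the loop does not depend on Tesseract being installed.

```python
def ocr_with_psms(image, ocr, psms=(6, 7, 8, 9, 10)):
    """Return (psm, text) for the first mode that yields text, else (None, "")."""
    for psm in psms:
        text = ocr(image, config="--psm %d" % psm).strip()
        if text:
            return psm, text
    return None, ""


# With pytesseract installed, the call would look like:
#   import pytesseract
#   psm, text = ocr_with_psms(im, pytesseract.image_to_string)
```

A non-empty result is not necessarily a correct one, so it is worth checking the returned text against what you expect (a single digit, in this case).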
The code is correct. I think the image of the 7 is not clear enough for pytesseract, so you need to preprocess it. This link might help.

OCR Tesseract - Get Image Font Attributes

I have been using Pytesseract to extract text from images. I am currently working on restoring an image document. Aside from extracting the text, I also want to identify each word's font, its font size, whether it is capitalized, italicized or bold, and so forth. Is this currently possible with Tesseract? I have read the Pytesseract documentation but found nothing about it. If it is not possible, how can I make it happen? Are there any open-source font recognition APIs? Thanks.
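A partial sketch, under stated assumptions: `pytesseract.image_to_data` returns per-word bounding boxes, so the box height can serve as a rough proxy for font size, and capitalization can be read off the recognized text. Font name, bold and italic are not covered here. The helper name `word_attributes` is made up for this example, and the box height depends on which letters the word contains (ascenders and descenders), so it is an approximation only.

```python
def word_attributes(data, min_conf=0):
    """Summarize box height and capitalization per recognized word.

    `data` is expected to look like the dict returned by
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT).
    """
    words = []
    for text, conf, height in zip(data["text"], data["conf"], data["height"]):
        text = text.strip()
        # Non-word rows have empty text and a confidence of -1.
        if not text or float(conf) < min_conf:
            continue
        words.append({
            "text": text,
            "height_px": int(height),      # rough proxy for font size
            "all_caps": text.isupper(),
            "capitalized": text[0].isupper(),
        })
    return words


# With pytesseract installed:
#   import pytesseract
#   data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
#   for word in word_attributes(data):
#       print(word)
```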

Reading CAPTCHA using tesseract is giving wrong readings

from urllib.request import urlopen, urlretrieve  # Python 3 location of these helpers
from PIL import Image, ImageOps
from bs4 import BeautifulSoup
import requests
import subprocess

def cleanImage(imagePath):
    # Threshold each pixel value to 0 or 255, then add a white border.
    image = Image.open(imagePath)
    image = image.point(lambda x: 0 if x < 143 else 255)
    borederImage = ImageOps.expand(image, border=20, fill="white")
    borederImage.save(imagePath)

html = urlopen("http://www.pythonscraping.com/humans-only")
soup = BeautifulSoup(html, 'html.parser')
# Collect the CAPTCHA image location and the hidden form fields.
imageLocation = soup.find('img', {'title': 'Image CAPTCHA'})['src']
formBuildID = soup.find('input', {'name': 'form_build_id'})['value']
captchaSID = soup.find('input', {'name': 'captcha_sid'})['value']
captchaToken = soup.find('input', {'name': 'captcha_token'})['value']
captchaURL = "http://pythonscraping.com" + imageLocation
urlretrieve(captchaURL, "captcha.jpg")
cleanImage("captcha.jpg")
# Run the tesseract CLI; it writes its result to captcha.txt.
p = subprocess.Popen(['tesseract', 'captcha.jpg', "captcha"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()
with open('captcha.txt', 'r') as f:
    captchaResponce = f.read().replace(" ", "").replace("\n", "")
print("captcha responce attempt " + captchaResponce + "\n\n")
try:
    print(captchaResponce)
    print(len(captchaResponce))
    print(type(captchaResponce))
except Exception:
    print("No way")
Hello,
This is my code for a test site: it downloads the CAPTCHA image (each time you open the site you get a different CAPTCHA), then reads it using Tesseract from Python.
I first tried downloading the image and reading it directly with Tesseract, but it did not give the correct CAPTCHA reading, so I added the cleanImage function to help. It still does not read it correctly.
After searching online, my problem seems to be that Tesseract is not "trained" to process these images correctly.
Any help is much appreciated.
Note: this code is from a web-scraping book, and the purpose of this example is to read the CAPTCHA and submit the form. This is in no way an attack or an offensive tool meant to overload or harm the site.
I used Tesseract to solve CAPTCHAs with Node.js. To get it working you need to do some image processing first (depending on the CAPTCHA you are trying to solve).
For this type of CAPTCHA, for example, I did the following:
Remove "white noise"
Remove gray lines
Remove gray dots
Fill gaps
Change to grayscale image
NOW do OCR with tesseract
You can check out the code, how it's done, and more documentation here: https://github.com/cracker0dks/CaptchaSolver
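To make the steps above more concrete, here is a rough Python sketch of the same kind of cleanup. It is not the code from the linked repository. It works on a plain list of rows of grayscale values (0 to 255) so that it has no dependencies; the threshold of 100 and all function names are assumptions made for this example.

```python
def binarize(gray, threshold=100):
    """Keep only dark pixels as ink (1); light gray lines and dots become 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def count_neighbors(img, y, x):
    """Count ink pixels among the up-to-eight neighbors of (y, x)."""
    h, w = len(img), len(img[0])
    return sum(
        img[ny][nx]
        for ny in range(max(0, y - 1), min(h, y + 2))
        for nx in range(max(0, x - 1), min(w, x + 2))
        if (ny, nx) != (y, x)
    )


def remove_specks(img):
    """Clear ink pixels that have no ink neighbor (isolated noise)."""
    return [
        [1 if img[y][x] and count_neighbors(img, y, x) > 0 else 0
         for x in range(len(img[0]))]
        for y in range(len(img))
    ]


def fill_gaps(img):
    """Fill a background pixel that sits between two ink pixels (left/right or up/down)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                continue
            horiz = 0 < x < w - 1 and img[y][x - 1] and img[y][x + 1]
            vert = 0 < y < h - 1 and img[y - 1][x] and img[y + 1][x]
            if horiz or vert:
                out[y][x] = 1
    return out


def clean(gray, threshold=100):
    """Threshold, drop isolated specks, then close one-pixel gaps."""
    return fill_gaps(remove_specks(binarize(gray, threshold)))
```

The cleaned image would then be saved and passed to Tesseract. In practice the same operations are usually done with an image library (thresholding plus morphological opening and closing) instead of nested lists.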
Tesseract was trained to do more conventional OCR, and CAPTCHAs are very challenging for it as is, because the characters are not aligned, may be rotated, may overlap, and differ in size and font. You should try invoking tesseract with a different page segmentation mode (--psm option). Here is a list of all possible values:
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Try different modes (such as 1, 7, 11, 12 or 13; of these, only 1 and 12 include OSD). This may improve your recognition rate. But in order to really improve it, you will have to write a program that finds the separate letters (segments the image) and sends them to tesseract one by one (using --psm 10). opencv is a great library for image manipulation. This post may be a good start.
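A minimal sketch of the segmentation idea, assuming the image has already been binarized into rows of 0/1 values (1 = ink). It splits the image at empty columns, which is the simplest possible approach and will not separate characters that touch or overlap, as CAPTCHA characters often do. The names `split_columns` and `crop_columns` are made up for this example.

```python
def split_columns(binary):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(binary[0])
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a character begins
        elif not ink and start is not None:
            spans.append((start, x))       # a character ends at an empty column
            start = None
    if start is not None:
        spans.append((start, width))       # character runs to the right edge
    return spans


def crop_columns(binary, span):
    """Cut the columns of one span out of every row."""
    start, end = span
    return [row[start:end] for row in binary]
```

Each crop would then be converted back to an image and recognized on its own with `config='--psm 10'`.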
Regarding concerns about the legitimacy of CAPTCHA recognition: that is an ethical question and lies beyond the scope of SO. Pythonscraping is a classic testing site, and I see no problem whatsoever with helping to answer the question. The concern is like worrying that teaching self-defense may be used to attack people.
Anyway, a plain image CAPTCHA is a very weak human-confirmation challenge and no site should be using it nowadays, while reCAPTCHA is much stronger, friendlier and free.

detecting font of text in image

I want to detect the font of text in an image so that I can do better OCR on it. Searching for a solution, I found this post. Although it may seem to be the same as my question, it does not exactly address my problem.
Background
For OCR I am using Tesseract, which uses trained data for recognizing text. Training Tesseract with lots of fonts reduces the accuracy, which is natural and understandable. One solution is to build multiple sets of trained data, one per group of a few similar fonts, and then automatically use the appropriate data for each image. For this to work, we need to be able to detect the font in the image.
Number 3 in this answer uses OCR to isolate images of characters along with the characters recognized for them, then renders the same character in each candidate font and compares the renderings with the isolated image. In my case the user would provide a bounding box and the character associated with it. But because I want to OCR Arabic script (which is cursive, so a character's shape may vary depending on the adjacent characters) and because the bounding box may not actually be the minimal bounding box, I am not sure how to do the comparison.
I believe Hausdorff distance is not applicable here. Am I right?
Shape context may be a good fit (?), and there is a ShapeContextDistanceExtractor class in OpenCV, but I am not sure how to use it from opencv-python.
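For reference, a small sketch of the Hausdorff distance between two point sets, written in plain Python. It only illustrates how the measure behaves, and does not settle whether it is applicable: the distance grows directly with any offset between the two shapes and with a single stray point, which is relevant when the bounding box is not minimal. The sample points are invented for the example.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(
        min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in b)
        for p in a
    )


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


glyph = [(0, 0), (1, 0), (2, 0)]
shifted = [(x + 3, y) for x, y in glyph]   # same shape, offset by a loose box
with_outlier = glyph + [(10, 0)]           # same shape plus one stray point

print(hausdorff(glyph, glyph))
print(hausdorff(glyph, shifted))
print(hausdorff(glyph, with_outlier))
```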
thank you
sorry for bad English
