I am writing a program to extract information from government IDs, and I use contours to extract individual characters from the image (passing the image as-is into Tesseract produces junk output). I have tried this approach with other printed documents and it yields much better results than passing the whole image into Tesseract. But for some reason, with the government ID images, it either gives "Error in pixGenHalftoneMask: pix too small" (even though the images are at least 100x100) or blank output. I am attaching both images used.
I used psm 10 for the image with an "M" and psm 8 for the other image, and it still bore no fruit. I have a feeling the characters are just too thick? I tried erosion as well, but the end result was the same. Any pointers, please?
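For reference, a minimal sketch of the contour-based, per-character approach described above (assuming OpenCV 4.x and pytesseract; the filename, threshold choice and size filter are placeholders, not the actual code used):
import cv2
import pytesseract

img = cv2.imread('id_crop.png', cv2.IMREAD_GRAYSCALE)    # hypothetical input
_, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# OpenCV 4.x returns (contours, hierarchy); 3.x returns an extra image first
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in sorted(contours, key=lambda c: cv2.boundingRect(c)[0]):
    x, y, w, h = cv2.boundingRect(c)
    if w < 10 or h < 10:                                 # skip specks; tune for your images
        continue
    char = img[y:y + h, x:x + w]
    char = cv2.copyMakeBorder(char, 10, 10, 10, 10,
                              cv2.BORDER_CONSTANT, value=255)
    # --psm 10: treat each crop as a single character
    print(pytesseract.image_to_string(char, config='--psm 10'))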
I'm new to Tesseract and wanted to know if there were any ways to clean up photos for a simple OCR program to get better results. Thanks in advance for any help!
The code I am using:
import pytesseract as tess
from PIL import Image

# loads tesseract: point pytesseract at the tesseract executable (path left blank, as in the original)
tess.pytesseract.tesseract_cmd = r''
# filepath of the input image (left blank, as in the original)
file_path = r''
image = Image.open(file_path)
# processes image
text = tess.image_to_string(image, config='')
print(text)
I've used pytesseract in the past and with the following four modifications, I could read almost anything as long as the text font wasn't too small to begin with. pytesseract seems to struggle with small writing, even after resizing.
- Convert to Black & White -
Converting the images to black and white would frequently improve the recognition of the program. I used OpenCV to do so and got the code from the end of this article.
- Crop -
If all your photos are in a similar format, i.e. the text you need is always in the same spot, I'd recommend cropping your pictures. If possible, pass only the exact part of the photo you want analyzed to pytesseract; the less the program has to analyze, the better. In my case, I was taking screenshots and would specify the exact region where to take one.
- Resize -
Another thing you can do is play with the scaling of the original photo. Sometimes, after resizing to almost double its initial size, pytesseract could read the text a lot more easily. Generally, bigger text is better, but there is a limit: after resizing, the photo can become too pixelated to be recognizable.
- Config -
I've noticed that pytesseract can recognize text a lot easier than numbers. If possible, break the photo down into sections and whenever you have a trouble spot with numbers you can use:
pytesseract.image_to_string(image, config='digits')
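Putting those four steps together, a rough sketch with OpenCV and pytesseract (the filename, crop box, scale factor, and threshold are placeholder assumptions to be tuned for your own images):
import cv2
import pytesseract

img = cv2.imread('photo.png')                       # hypothetical input image
x, y, w, h = 100, 50, 400, 120                      # crop box: placeholder values
roi = img[y:y + h, x:x + w]                         # crop to the region with the text
gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)        # convert to black & white
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
big = cv2.resize(bw, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)  # roughly double the size
# use the digits config on regions that contain only numbers
print(pytesseract.image_to_string(big, config='digits'))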
I am working on a document watermarking and recovery project. My code works fine for embedding the watermark and then recovering it, and it works great with digitally generated images that haven't been scaled or modified in any way. But when I use scanned or otherwise modified images, the recovered portion is only good enough to be read when not zoomed in. The problem is that the original images have fringes of colour around the text. Those chunks of colourful pixels cause a bad average to be stored, and when that average is used for recovery, the recovered text, signatures, and other artefacts come out with bad colours. Below are some images showing the problem: on the left is the recovered portion, while the right is the untampered portion. The untampered portion shows how the original image already had bad colours all around the text.
How do I remove that bad colouring or blocks?
Any ideas about removing them from the recovered part or the original image itself?
I am working on data that contains text printed on a cable attached to a board; my objective is to extract the text from the cable. I have the cropped image (attached), which I then de-skew and pass to Tesseract (de-skewed image also attached). I get the output 'QW -CIRE2-9', which is completely irrelevant. Could someone please help me resolve this? I have tried other preprocessing steps such as dilation and thresholding, but the output is still complete gibberish.
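For the de-skew step, one common approach (a sketch of a generic technique, not necessarily what was used here; the filename is a placeholder, and the minAreaRect angle convention varies across OpenCV versions) is to estimate the text angle from the thresholded pixels and rotate:
import cv2
import numpy as np

img = cv2.imread('cable_crop.png', cv2.IMREAD_GRAYSCALE)   # hypothetical input
_, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# estimate the rotation angle of the text pixels
coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
    angle = -(90 + angle)
else:
    angle = -angle
h, w = img.shape
M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
deskewed = cv2.warpAffine(img, M, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)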
from urllib.request import urlopen, urlretrieve
from PIL import Image, ImageOps
from bs4 import BeautifulSoup
import subprocess

def cleanImage(imagePath):
    # binarize: anything darker than 143 becomes black, the rest white
    image = Image.open(imagePath)
    image = image.point(lambda x: 0 if x < 143 else 255)
    # add a white border so characters are not flush with the edge
    borderImage = ImageOps.expand(image, border=20, fill="white")
    borderImage.save(imagePath)

# fetch the form page and pull out the CAPTCHA image and hidden form fields
html = urlopen("http://www.pythonscraping.com/humans-only")
soup = BeautifulSoup(html, 'html.parser')
imageLocation = soup.find('img', {'title': 'Image CAPTCHA'})['src']
formBuildID = soup.find('input', {'name': 'form_build_id'})['value']
captchaSID = soup.find('input', {'name': 'captcha_sid'})['value']
captchaToken = soup.find('input', {'name': 'captcha_token'})['value']
captchaURL = "http://pythonscraping.com" + imageLocation

# download and clean the CAPTCHA image, then run tesseract on it
urlretrieve(captchaURL, "captcha.jpg")
cleanImage("captcha.jpg")
p = subprocess.Popen(['tesseract', 'captcha.jpg', "captcha"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()

# read tesseract's output and strip whitespace
f = open('captcha.txt', 'r')
captchaResponse = f.read().replace(" ", "").replace("\n", "")
print("captcha response attempt " + captchaResponse + "\n\n")
try:
    print(captchaResponse)
    print(len(captchaResponse))
    print(type(captchaResponse))
except:
    print("No way")
Hello
This is my code for a testing site: it downloads the CAPTCHA image (each time you open the site you get a different CAPTCHA) and then reads it with Tesseract in Python.
I tried downloading the image and reading it directly with Tesseract, but it didn't get the correct CAPTCHA reading, so I added the cleanImage function to help; it still doesn't read it correctly.
After searching online, my problem seems to be that Tesseract isn't "trained" to process these images correctly.
Any help is much appreciated.
**This code is from a web-scraping book; the purpose of the example is to read the CAPTCHA and submit the form. It is in no way an attack or an offensive tool to overload or harm the site.
I used Tesseract to solve CAPTCHAs with Node.js. To get it running you need to do some image processing first (depending on the CAPTCHA you are trying to solve).
Taking this type of CAPTCHA as an example, I did the following:
Remove "white noise"
Remove gray lines
Remove gray dots
Fill gaps
Change to grayscale image
NOW do OCR with tesseract
You can check out the code, how it's done, and more documentation here: https://github.com/cracker0dks/CaptchaSolver
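The same kind of cleanup can be sketched in Python with OpenCV (the linked project is in Node.js; the threshold and kernel sizes below are guesses to be tuned per CAPTCHA):
import cv2
import pytesseract

img = cv2.imread('captcha.png', cv2.IMREAD_GRAYSCALE)   # hypothetical input
# keep only the dark strokes: gray lines, gray dots and white noise fall away
_, bw = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY_INV)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
bw = cv2.morphologyEx(bw, cv2.MORPH_OPEN, kernel)        # remove isolated specks
bw = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, kernel)       # fill small gaps in strokes
cleaned = cv2.bitwise_not(bw)                            # back to black text on white
print(pytesseract.image_to_string(cleaned))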
Tesseract was trained for more conventional OCR, and CAPTCHAs are very challenging for it as-is, because the characters are not aligned, may be rotated or overlapping, and differ in size and font. You should try invoking Tesseract with different page segmentation modes (the --psm option). Here is a list of all possible values:
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
Try the automatic and sparse/line modes (e.g. 1, 7, 11, 12, 13). This should improve your recognition rate. But to really improve, you will have to write a program that finds the separate letters (segments the image) and sends them to Tesseract one by one (using --psm 10). OpenCV is a great library for image manipulation. This post may be a good start.
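For instance, a quick sketch of trying several modes with pytesseract (note that Tesseract 3.x expects -psm rather than --psm):
import pytesseract
from PIL import Image

img = Image.open('captcha.jpg')
# try a few segmentation modes and compare what each one returns
for psm in (7, 10, 11, 13):
    text = pytesseract.image_to_string(img, config='--psm %d' % psm)
    print(psm, repr(text))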
Regarding concerns about the legitimacy of CAPTCHA recognition: that is an ethical question and lies beyond the scope of SO. Pythonscraping is a classic testing site, and I see no problem whatsoever in helping to solve this question. The concern is the same as with teaching self-defense, which could likewise be used to attack people.
Anyway, this kind of CAPTCHA is a very weak human-verification challenge and no site should be using it nowadays, while reCAPTCHA is much stronger, friendlier, and free.
I'm trying to develop a system which can convert a seven-segment display on an old analog pressure output system to text, such that the data can be processed by LabVIEW. I've been working on image processing to get Tesseract (using v3.02) to recognize the numbers correctly, but have been hitting some roadblocks and don't quite know how to proceed. This is what I've got so far:
The image needs to be between 50 and 100 pixels tall for Tesseract to read it correctly; I've found the best results at a height of 50.
The image needs to be cropped so that there is only one line of text.
The image should be in black and white.
The image should be relatively level from left to right.
I've been using the seven-segment training data 'letsgodigital'. This is the code for the image manipulation I've been doing so far:
import cv2
import imutils

# video, angle and resizing_height are defined earlier in the program
ret, i = video.read()                     # grab a frame from the video source
h, width, channels = i.shape              # get dimensions
g = cv2.cvtColor(i, cv2.COLOR_BGR2GRAY)   # convert to grayscale
histeq = cv2.equalizeHist(g)              # spread pixel values across the entire spectrum
_, t = cv2.threshold(histeq, 150, 225, cv2.THRESH_BINARY)  # threshold histeq
cropped = t[int(0.4 * h):int(0.6 * h), int(0.1 * width):int(0.9 * width)]
rotated = imutils.rotate_bound(cropped, angle)
resized = imutils.resize(rotated, height=resizing_height)
Some numbers work better than others - for example, '1' seems to have a lot of trouble. The numbers occurring after the '+' or '-' often don't show up, and the '+' often shows up as a '-'. I've played around with the threshold values a bit, too.
The last three parts are because my video sample I've been drawing from was slightly askew. I could try taking some better data to work with, and I could also try making my own training data over the standard 'letsgodigital' lang. I feel like I'm not doing the image processing in the best way though, and would appreciate some guidance.
I plan to use some degree of edge detection to autocrop to the display, but for now I've just been trying to keep it simple and manually get the results I want. I've uploaded sample images with various degrees of image processing applied at http://imgur.com/a/vnqgP. It's difficult because sometimes I get the exact right answer from tesseract, and other times get nothing. The camera or light levels haven't really changed though, which makes me think it's a problem with my training data. Any suggestions or direction on where I should go would be much appreciated!! Thank you
For reading seven-segment digits, normal OCR programs like Tesseract don't usually work too well because of the gaps between the individual segments. You should try ssocr, which was made specifically for reading seven-segment digits. However, your preprocessing will need to be better, as ssocr expects the input to be a single row of seven-segment digits.
References - https://www.unix-ag.uni-kl.de/~auerswal/ssocr/
Usage example - http://www.instructables.com/id/Raspberry-Pi-Reading-7-Segment-Displays/
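As a rough illustration of calling it from Python (a sketch only: ssocr must be installed and on PATH, the input is assumed to already be a cropped single row of digits, and the -d -1 flag, intended to auto-detect the digit count, may differ between ssocr versions):
import subprocess

# run ssocr on a cropped image containing a single row of seven-segment digits
result = subprocess.run(['ssocr', '-d', '-1', 'display.png'],
                        capture_output=True, text=True)
print(result.stdout.strip())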