from urllib.request import urlopen, urlretrieve
from PIL import Image, ImageOps
from bs4 import BeautifulSoup
import subprocess

def cleanImage(imagePath):
    # Threshold to pure black/white and add a white border so Tesseract
    # gets clean, padded input.
    image = Image.open(imagePath)
    image = image.point(lambda x: 0 if x < 143 else 255)
    borderImage = ImageOps.expand(image, border=20, fill="white")
    borderImage.save(imagePath)

html = urlopen("http://www.pythonscraping.com/humans-only")
soup = BeautifulSoup(html, "html.parser")

# The CAPTCHA image location plus the hidden fields needed to submit the form.
imageLocation = soup.find("img", {"title": "Image CAPTCHA"})["src"]
formBuildID = soup.find("input", {"name": "form_build_id"})["value"]
captchaSID = soup.find("input", {"name": "captcha_sid"})["value"]
captchaToken = soup.find("input", {"name": "captcha_token"})["value"]

captchaURL = "http://pythonscraping.com" + imageLocation
urlretrieve(captchaURL, "captcha.jpg")
cleanImage("captcha.jpg")

# Tesseract writes its output to captcha.txt.
p = subprocess.Popen(["tesseract", "captcha.jpg", "captcha"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()

with open("captcha.txt", "r") as f:
    captchaResponse = f.read().replace(" ", "").replace("\n", "")

print("CAPTCHA response attempt: " + captchaResponse + "\n\n")
print(captchaResponse)
print(len(captchaResponse))
print(type(captchaResponse))
Hello,
This is my code for a testing site: it downloads the CAPTCHA image (each time you open the site you get a different CAPTCHA), then reads it using Tesseract in Python.
I tried downloading the image and reading it directly with Tesseract, but it didn't produce the correct CAPTCHA text, so I added the cleanImage function to help; it still doesn't read it correctly.
After searching online, my problem seems to be that Tesseract is not "trained" to process these images correctly.
Any help is much appreciated.
Note: this code is from a web-scraping book, and the example's purpose is to read the CAPTCHA and submit the form. This is in no way an attack or an offensive tool to overload or harm the site.
I used Tesseract to solve CAPTCHAs with Node.js. To get it working you need to do some image processing first (depending on the CAPTCHA you are trying to solve).
If you take this type of CAPTCHA, for example, I did the following (a rough sketch of the pipeline follows the list):
Remove "white noise"
Remove gray lines
Remove gray dots
Fill gaps
Change to grayscale image
NOW do OCR with Tesseract
You can check out the code, how it's done, and more documentation here: https://github.com/cracker0dks/CaptchaSolver
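A minimal Pillow sketch of that cleanup pipeline, assuming dark text on a light background; the threshold, filter size, and file names are illustrative and need tuning per CAPTCHA:

from PIL import Image, ImageFilter

def clean_captcha(path):
    img = Image.open(path).convert("L")               # change to grayscale
    img = img.point(lambda p: 0 if p < 120 else 255)  # mid-gray lines/dots go white
    img = img.filter(ImageFilter.MinFilter(3))        # grow dark strokes to fill gaps
    return img

clean_captcha("captcha.png").save("captcha_clean.png")  # placeholder file names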
Tesseract was trained to do more conventional OCR, and CAPTCHAs are very challenging for it as is, because the characters are not aligned, may be rotated or overlap, and differ in size and font. You should try invoking Tesseract with different page segmentation modes (the --psm option). Here is a list of all possible values:
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Try the modes with OSD (like 1, 7, 11, 12, 13). This will improve your recognition rate. But in order to really improve, you will have to write a program that finds separate letters (segments the image) and sends them to Tesseract one by one (using --psm 10). OpenCV is a great library for image manipulation. This post may be a good start.
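For instance, a quick way to compare modes from Python; "captcha.jpg" is a placeholder and the tesseract binary is assumed to be on PATH:

import subprocess

# "stdout" tells tesseract to print its result instead of writing a file.
for psm in (1, 7, 11, 12, 13):
    result = subprocess.run(
        ["tesseract", "captcha.jpg", "stdout", "--psm", str(psm)],
        capture_output=True, text=True)
    print(psm, repr(result.stdout.strip()))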
Regarding concerns about the legitimacy of CAPTCHA recognition: it is an ethical problem and lies beyond the scope of SO. Pythonscraping is a classic testing site, and I see no problem whatsoever in helping to solve the question. That concern is like saying that teaching self-defense may be used to attack people.
Anyway, CAPTCHA is a very weak human-confirmation challenge and no site should be using it nowadays, while reCAPTCHA is much stronger, friendlier, and free.
Related
I'm working on an application that extracts information from invoices that the user photographs with his phone (using Flask and pytesseract).
Everything works on the extraction and classification side for my needs, using the image_to_data method of pytesseract.
But the problem is on the pre-processing side.
I refine the image with greyscale filters, binarization, dilation, etc.
But sometimes the user will take a picture at an angle, like this:
[invoice image]
And then Tesseract returns characters that don't make sense, or sometimes it just returns nothing.
At the moment I "scan" the image during pre-processing (I'm largely inspired by this tutorial: https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes/), but it's not efficient at all.
Does anyone know a way to make it easier for Tesseract to work on this type of image?
If not, should I focus on improving this pre-processing "scan" step?
Thank you for your help!
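For reference, a minimal sketch of the four-point perspective transform at the heart of that tutorial, assuming the four document corners have already been found (the helper name and the corner-ordering convention are illustrative):

import cv2
import numpy as np

def four_point_transform(image, corners):
    # corners: 4x2 float32 array ordered top-left, top-right,
    # bottom-right, bottom-left.
    (tl, tr, br, bl) = corners
    width = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
    height = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]],
                   dtype="float32")
    # Map the tilted document onto an upright rectangle.
    M = cv2.getPerspectiveTransform(corners, dst)
    return cv2.warpPerspective(image, M, (width, height))

Finding the four corners robustly (via edge detection and contour approximation) is the hard part the tutorial covers; once the page is deskewed this way, Tesseract's accuracy usually improves considerably.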
I am working on an automation program using TensorFlow. I need some data to bypass text-based CAPTCHAs, so I am trying to gather images from sites. How can I take "clean" screenshots with the help of OpenCV? By "clean" I mean images without white blanks.
Note: I know we can take a screenshot of a desired web element using Selenium (see: https://www.lambdatest.com/blog/screenshots-with-selenium-webdriver/), but on this site there are two text-based CAPTCHAs, so the screenshot also includes white blanks, which I don't want. I also tried capturing the images manually, but because my hands are not steady enough those images include white blanks too.
When I tried to get the web element using Selenium, I was not satisfied with the result because it has white blanks, which I don't want in my dataset.
Normally the images look like that. All I want is to get two separate images without a white blank, to use as training data. Could you please help me?
You could use Playwright and take an element screenshot with omitBackground enabled: https://playwright.dev/#version=v1.0.2&path=docs%2Fapi.md&q=elementhandlescreenshotoptions
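A minimal sketch with Playwright's Python API; the URL and selector are placeholders, and omit_background only removes the default white background where the element itself is transparent:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/captcha-page")    # placeholder URL
    element = page.query_selector("img.captcha")     # placeholder selector
    element.screenshot(path="captcha.png", omit_background=True)
    browser.close()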
I am writing a program to extract information from government IDs, and I use contours to extract characters from the image (since passing it into Tesseract as is results in junk output). I have tried this approach with other printed documents, and it yields much better results than passing the whole image into Tesseract. But for some reason, with the government ID images, it either gives "Error in pixGenHalftoneMask: pix too small" (though the images are at least 100x100) or blank output. I am attaching both images used.
I used --psm 10 for the image with an "M" and --psm 8 for the other image, yet it bore no fruit. I have a feeling the characters are just too thick? I tried erosion as well, but the end result was the same. Any pointers, please?
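One common workaround for that Leptonica "pix too small" error, offered here only as a hedged sketch, is to upscale the character crop and pad it with a white border before OCR (the path, scale factor, and padding values are illustrative):

import cv2
import pytesseract

crop = cv2.imread("char.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
# Enlarge the crop so Leptonica has enough pixels to binarize.
crop = cv2.resize(crop, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)
# Pad with a white border; Tesseract tends to do poorly on edge-tight crops.
crop = cv2.copyMakeBorder(crop, 20, 20, 20, 20,
                          cv2.BORDER_CONSTANT, value=255)
print(pytesseract.image_to_string(crop, config="--psm 10"))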
I have an image like the following:
and I want to extract the text from it, which should be ws35. I've tried the pytesseract library using the method:
pytesseract.image_to_string(Image.open(path))
but it returns nothing... Am I doing something wrong? How can I get the text back using OCR? Do I need to apply some filter to it?
You can try the following approach:
Binarize the image with a method of your choice (thresholding at 127 seems to be sufficient in this case).
Use a minimum filter to connect the loose dots into characters. A filter with r=4 seems to work quite well here:
If necessary, the result can be further improved by applying a median blur (r=4):
Because I personally do not use Tesseract I am not able to try this picture, but online OCR tools seem to be able to identify the sequence correctly (especially if you use the blurred version).
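A small Pillow sketch of those steps, assuming a placeholder file name; in Pillow a radius of r=4 corresponds to a filter size of 2*4+1 = 9:

from PIL import Image, ImageFilter

img = Image.open("captcha.png").convert("L")        # placeholder file name
img = img.point(lambda p: 255 if p > 127 else 0)    # binarize at 127
img = img.filter(ImageFilter.MinFilter(9))          # minimum filter, r=4 -> size 9
img = img.filter(ImageFilter.MedianFilter(9))       # optional median blur, r=4
img.save("captcha_clean.png")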
Similar to @SilverMonkey's suggestion: Gaussian blur followed by Otsu thresholding.
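A minimal OpenCV sketch of that combination; the kernel size and file names are illustrative:

import cv2

img = cv2.imread("captcha.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
blurred = cv2.GaussianBlur(img, (5, 5), 0)              # smooth out the noise
# Otsu picks the threshold automatically from the image histogram.
_, binary = cv2.threshold(blurred, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("captcha_otsu.png", binary)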
The problem is that this picture is low quality and very noisy!
Even professional, enterprise-grade programs struggle with this.
You have most likely seen a CAPTCHA before; the reason they exist is that your answer is sent back to a database along with the image and then used to train computers to read images like these.
The short answer is: pytesseract can't read the text inside this image, and most likely no module or professional program can read it either.
You may need to apply some image processing/enhancement to it. Look at this post, read the suggestions, and try to apply them.
As an exercise, I'm attempting to break the following CAPTCHA:
It doesn't seem like it would be too difficult to break, as the edges seem fairly solid and the noise should be relatively easy to remove. The problem is that I have very little experience with image manipulation. Currently I'm using Python with the Pillow library to manipulate the CAPTCHA image, after which it is passed into Tesseract for OCR.
In the following code I attempt to bring out the edges by sharpening the image and then convert the image to black and white:
from PIL import Image, ImageFilter

try:
    img = Image.open("Captcha.jpg")
except IOError:
    print("Can't load captcha.")
    exit()

# Bring out the edges by sharpening, then convert to grayscale.
out = img.filter(ImageFilter.SHARPEN)
out = out.convert("L")

# Threshold to pure black and white.
out = out.point(lambda x: 0 if x < 136 else 255, "1")

# Upscale so Tesseract has more pixels to work with.
width, height = out.size
out = out.resize((width * 5, height * 5), Image.NEAREST)

out.save("captcha_modified.png")
At this point I see the following:
However, Tesseract is still unable to read the characters. As an experiment, I used good ol' MS Paint to manually modify the image to a point where it could be read by Tesseract:
So if I can get the image to that point, I think Tesseract will do a fairly good job of detecting the characters. My current thinking is that I need to enhance the edges and reduce the noise in the image. Also, I imagine it would be easier for Tesseract to detect the letters if they were filled in rather than outlined, but I have no idea how I'd do this.
Any suggestions on how to go about this? Is there a better way to process the images?
I am short on time, so this answer may not be incredibly useful, but it goes over my own two algorithms exactly. There isn't much code, just a few method recommendations. It is a good idea to use code rather than MS Paint. With code it's actually really easy to break a CAPTCHA and achieve above a 50% success rate. Behavioral recognition may be a better security mechanism, or may work as an additional one.
A. The edge detection method you use:
Edge detection really isn't necessary. In this case, just use the getpixel((x, y)) function and fill in the area between the bounding lines, turning the fill on at crossings 1, 3, 5, etc. and off after crossings 2, 4, 6, etc. (a rough sketch of this scanline fill follows). Luckily, you chose an easy CAPTCHA, so edge detection is a decent solution without decluttering, rotating, and re-alignment.
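A hedged sketch of that scanline fill, assuming a clean black-and-white image with closed outlines (the file names are placeholders):

from PIL import Image

img = Image.open("captcha_modified.png").convert("1")   # placeholder file
px = img.load()
width, height = img.size

for y in range(height):
    inside = False
    x = 0
    while x < width:
        if px[x, y] == 0:                   # hit an outline pixel
            while x < width and px[x, y] == 0:
                x += 1                      # skip the whole edge run
            inside = not inside             # toggle at each crossing
        else:
            if inside:
                px[x, y] = 0                # fill between odd/even crossings
            x += 1

img.save("captcha_filled.png")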
B. The manipulation method:
Another method I use relies on OpenCV and Pillow as well. I am really busy, but I am posting a blog article on this later at druid5.wordpress.com/, which will contain code examples of this method. Since it isn't illegal to get through CAPTCHAs, or at least so I am told, I use the method I will post to collect data all the time. Mostly: contrast and detail adjustments from Pillow, some basic clutter removal with statistics, re-alignment with a basic DFS, and rotation (doable with OpenCV, or easily with a kernel). Tesseract is a good open-source choice, but it isn't too hard to create an OCR with OpenCV either.
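As a sketch of the Pillow contrast-and-detail step mentioned above (the enhancement factors and file names are illustrative assumptions):

from PIL import Image, ImageEnhance

img = Image.open("captcha.png")                  # placeholder file name
img = ImageEnhance.Contrast(img).enhance(2.0)    # boost contrast (factor assumed)
img = ImageEnhance.Sharpness(img).enhance(2.0)   # bring out detail
img.save("captcha_enhanced.png")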
This exercise is a decent introduction to OpenCV, PIL (Pillow), image manipulation with math, and some other things that help with everything from robotics to AI.
Using flow control to find the failed conditions and try different routes may be necessary, but the aim should always be a generic solution.