How to get a clean screenshot? - python

I am working on an automation program using TensorFlow, and I need some data to bypass text-based CAPTCHAs, so I am trying to gather data (images, actually) from sites. How can I take "clean" screenshots with the help of OpenCV? By "clean" I mean images without white blanks.
Note: I know that we can take a screenshot of a desired web element using Selenium (refer to: https://www.lambdatest.com/blog/screenshots-with-selenium-webdriver/), but on this site there are two text-based CAPTCHAs, so the screenshot also includes white blanks, which I don't want. I also tried capturing the images manually, but because my hands are not precise enough, those images include white blanks too.
When I tried to capture the web element using Selenium, I was not satisfied with the result because it contains white blanks, which I don't want in my dataset.
Normally the images look like that. All I want is to get two separate images, without a white blank, to use as training data. Could you please help me?

You could use Playwright and take an element screenshot with omitBackground enabled: https://playwright.dev/#version=v1.0.2&path=docs%2Fapi.md&q=elementhandlescreenshotoptions
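In Python, the equivalent option is omit_background on the locator screenshot call. A minimal sketch (the URL and the selector #captcha-image are placeholders for your target page):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/captcha-page")  # placeholder URL
    # screenshot just the element, dropping the default white background
    page.locator("#captcha-image").screenshot(
        path="captcha.png", omit_background=True)
    browser.close()

Note that omit_background only produces transparency for PNG output and only where the page itself has no opaque background; if the white blank is painted by the page's own CSS, crop the image instead.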

Related

How to delete a text layer using fitz?

This is a very straightforward issue. I added an invisible text layer using page.insert_text().
After saving the modified PDF, I can use page.get_text() to retrieve the created text layer.
I would like to be able to eliminate that layer, but couldn't find a function to do it.
The solution I've come up with is rendering the pages as images and creating a new PDF, but that seems very inefficient.
I would like to solve this without using a library other than fitz, and it feels like there should be a solution within fitz, considering that page.get_text() can access the exact information I'm trying to eliminate.
If you are certain of the whereabouts of your text on the page (and I understood that you are), simply use PDF redactions:
page.add_redact_annot(rect1) # remove text inside this rectangle
page.add_redact_annot(rect2)
...
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
# the above removes everything intersecting any of the rects,
# but leaves images untouched
Obviously you can remove all text on the page by taking page.rect as the redaction rectangle.
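A complete minimal example of that flow (file names are placeholders):

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
page = doc[0]
# mark the area whose text should be deleted; page.rect wipes all text
page.add_redact_annot(page.rect)
# physically remove the marked text, leaving images untouched
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
doc.save("output.pdf")

If the invisible layer was inserted at known positions, pass those rectangles to add_redact_annot instead of page.rect to preserve the rest of the text.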

What could be the solution for automatic document image unwarping caused by 3D warping?

I want to make some kind of Python script that can do that.
In my case I just want very simple unwarping, with these constraints:
Always a similar background
The page always placed at a similar position
Always the same type of warped image
I tried the following methods, but they didn't work out.
I tried many scanning apps, but none of them can unwarp a 3D warp; for example, Microsoft Office Lens.
I tried page_dewarp.py, but it does not work with pages that have spaces between texts or segments of text; for that kind of image it mostly just reverses the curve from left to right (or vice versa), and it is also unable to detect the actual text area, for example.
I found deep-learning-for-document-dewarping, which tries to solve this problem with pix2pixHD, but I am not sure it is going to work; the project has no trained models and currently does not solve the problem. Should I train a model with just the following training data, as described for pix2pixHD: train_A (warped input images) and train_B (unwarped output images)? I can generate the training data by creating warped and unwarped image pairs in Blender: render an unwarped page and a warped version of it, as if someone were photographing the pages, only virtually. That way I can generate many images from some scanned book pages. (A sketch of the pix2pixHD folder layout is below.)
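If you do try pix2pixHD, its dataset convention is just two parallel folders matched by file name; a minimal sanity-check sketch (the paths are assumptions for your project layout):

import os

warped_dir = "datasets/dewarp/train_A"   # warped renders from Blender
flat_dir = "datasets/dewarp/train_B"     # unwarped ground-truth renders

# every warped input needs a ground-truth image with the same file name
warped = set(os.listdir(warped_dir))
flat = set(os.listdir(flat_dir))
for name in sorted(warped - flat):
    print("unpaired input:", name)
for name in sorted(flat - warped):
    print("unpaired target:", name)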

Reading CAPTCHA using tesseract is giving wrong readings

from urllib.request import urlopen, urlretrieve
from PIL import Image, ImageOps
from bs4 import BeautifulSoup
import subprocess

def cleanImage(imagePath):
    # threshold to pure black/white, then pad with a white border
    image = Image.open(imagePath)
    image = image.point(lambda x: 0 if x < 143 else 255)
    borderImage = ImageOps.expand(image, border=20, fill="white")
    borderImage.save(imagePath)

html = urlopen("http://www.pythonscraping.com/humans-only")
soup = BeautifulSoup(html, "html.parser")

# grab the CAPTCHA image URL and the hidden form fields
imageLocation = soup.find("img", {"title": "Image CAPTCHA"})["src"]
formBuildID = soup.find("input", {"name": "form_build_id"})["value"]
captchaSID = soup.find("input", {"name": "captcha_sid"})["value"]
captchaToken = soup.find("input", {"name": "captcha_token"})["value"]

captchaURL = "http://pythonscraping.com" + imageLocation
urlretrieve(captchaURL, "captcha.jpg")
cleanImage("captcha.jpg")

# tesseract writes its recognition result to captcha.txt
p = subprocess.Popen(["tesseract", "captcha.jpg", "captcha"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()

with open("captcha.txt") as f:
    captchaResponse = f.read().replace(" ", "").replace("\n", "")
print("captcha response attempt " + captchaResponse + "\n")
print(captchaResponse)
print(len(captchaResponse))
print(type(captchaResponse))
Hello. This is my code for a testing site: it downloads the CAPTCHA image (each time you open the site you get a different CAPTCHA), then reads it using Tesseract in Python.
I tried downloading the image and reading it directly with Tesseract, but it didn't get the correct CAPTCHA reading, so I added the cleanImage function to help; it still didn't read it correctly.
After searching online, my problem seems to be that Tesseract is not "trained" to process these images correctly.
Any help is much appreciated.
**This code is from a web-scraping book; the example's purpose is to read the CAPTCHA and submit the form. It is in no way an attack or an offensive tool to overload or harm the site.
I used Tesseract to solve CAPTCHAs with Node.js. To get it running you need to do some image processing first (depending on the CAPTCHA you are trying to solve).
For this type of CAPTCHA, for example, I did the following:
Remove "white noise"
Remove gray lines
Remove gray dots
Fill gaps
Change to grayscale image
NOW do OCR with tesseract
You can check out the code, how it's done, and more documentation here: https://github.com/cracker0dks/CaptchaSolver
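In Python, that preprocessing pipeline might look roughly like this (a sketch using OpenCV and pytesseract; the threshold and kernel size are assumptions to tune per CAPTCHA):

import cv2
import pytesseract

img = cv2.imread("captcha.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)        # change to grayscale
# invert-threshold so the text becomes white foreground on black
_, bw = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
bw = cv2.morphologyEx(bw, cv2.MORPH_OPEN, kernel)   # remove noise dots
bw = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, kernel)  # fill gaps in strokes
bw = cv2.bitwise_not(bw)                            # dark text on white for OCR
print(pytesseract.image_to_string(bw))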
Tesseract was trained to do more conventional OCR, and CAPTCHAs are very challenging for it as-is, because characters are not aligned, may be rotated, may overlap, and differ in size and font. You should try invoking Tesseract with different page segmentation modes (the --psm option). Here is a list of all possible values:
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
Try several of these modes (like 1, 7, 11, 12, 13). This will improve your recognition rate. But to really improve, you will have to write a program that finds the separate letters (segments the image) and sends them to Tesseract one by one (using --psm 10); OpenCV is a great library for that kind of image manipulation, and the sketch below shows the idea. This post may be a good start.
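A rough sketch of that letter-by-letter approach with OpenCV contours (the size-filter values are assumptions to tune for your images):

import cv2
import pytesseract

gray = cv2.imread("captcha.jpg", cv2.IMREAD_GRAYSCALE)
# Otsu threshold, inverted so letters are white foreground for contour search
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# sort candidate letters left to right, then OCR each as a single character
boxes = sorted(cv2.boundingRect(c) for c in contours)
result = ""
for x, y, w, h in boxes:
    if w < 5 or h < 10:   # skip small specks
        continue
    letter = gray[y:y + h, x:x + w]
    result += pytesseract.image_to_string(letter, config="--psm 10").strip()
print(result)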
Regarding concerns about the legitimacy of CAPTCHA recognition: it is an ethical problem that lies beyond the scope of SO. Pythonscraping is a classic testing site, and I see no problem whatsoever in helping to solve this question; the concern is the same as saying that teaching self-defense may enable people to attack others.
Anyway, CAPTCHA is a very weak human-confirmation challenge and no site should be using it nowadays, while reCAPTCHA is much stronger, friendlier, and free.

Extract text from an image with light text on a white background

I have an image like the following:
and I want to extract the text from it, which should be ws35. I've tried the pytesseract library using the method:
pytesseract.image_to_string(Image.open(path))
but it returns nothing... Am I doing something wrong? How can I get the text out using OCR? Do I need to apply some filter to the image first?
You can try the following approach:
Binarize the image with a method of your choice (thresholding at 127 seems to be sufficient in this case).
Use a minimum filter to connect the loose dots into characters; a filter with r=4 seems to work quite well.
If necessary, the result can be further improved by applying a median blur (r=4).
Because I personally do not use Tesseract, I am not able to try this picture, but online OCR tools seem to identify the sequence correctly (especially if you use the blurred version). A Pillow sketch of the whole chain is below.
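With Pillow, that chain might look like this (a sketch; filter size 9 is a guess at the answer's r=4, since Pillow filter sizes are odd kernel widths):

from PIL import Image, ImageFilter

img = Image.open("captcha.png").convert("L")     # grayscale
# binarize at 127 as suggested above
bw = img.point(lambda p: 255 if p > 127 else 0)
# the minimum filter grows dark regions, connecting the loose dots
connected = bw.filter(ImageFilter.MinFilter(9))
# optional median blur to smooth the ragged edges
smoothed = connected.filter(ImageFilter.MedianFilter(9))
smoothed.save("captcha_clean.png")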
Similar to #SilverMonkey's suggestion: Gaussian blur followed by Otsu thresholding.
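A minimal OpenCV version of that suggestion (file names are placeholders):

import cv2

gray = cv2.imread("captcha.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)   # merge the dots into strokes
# Otsu picks the binarization threshold automatically
_, bw = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("captcha_otsu.png", bw)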
The problem is that this picture is low quality and very noisy! Even professional, enterprise-grade programs struggle with it.
You have most likely seen a CAPTCHA before; the reason they exist is that your answer and the image are sent back to a database and used to train computers to read images like these.
The short answer is: pytesseract can't read the text inside this image, and most likely no module or professional program can read it either.
You may need to apply some image processing/enhancement to it first. Look at this post, read the suggestions, and try to apply them.

Removing borders (lines) from image

I have this image (some information was deleted from it on purpose).
What I need is some way to remove the borders (lines) around the text.
I am doing OCR on these images, and the lines really get in the way of text recognition.
Everything also has to work automatically: OCR and all the other scripts run on the server side when someone uploads a document.
You could try using a Hough transform to detect all straight lines in the image; then all you need to do is mask them. For example:
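A sketch with OpenCV's probabilistic Hough transform (the Canny and Hough parameters are assumptions to tune for your scans):

import cv2
import numpy as np

img = cv2.imread("document.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)
# detect long straight segments; minLineLength keeps text strokes out
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=100, maxLineGap=10)
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        # paint each detected border line white to mask it
        cv2.line(img, (x1, y1), (x2, y2), (255, 255, 255), 3)
cv2.imwrite("document_no_lines.png", img)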
You can use Leptonica to remove lines.
http://www.leptonica.com/line-removal.html
https://github.com/DanBloomberg/leptonica/blob/master/prog/lineremoval_reg.c
