I'm new to the OCR world, and I have documents with numbers to analyse with Python, OpenCV and pytesseract.
The files I received are PDFs, and the numbers are not text, so I converted them to JPG with this:
from pdf2image import convert_from_path

first_page = convert_from_path(path__to_pdf, dpi=600, first_page=1, last_page=1)
first_page[0].save(TEMP_FOLDER + 'temp.jpg', 'JPEG')
Then, the images look like this:
I still have some noise around the digits.
I tried to select the "black color only" with this:
import cv2
import numpy as np

img_hsv = cv2.cvtColor(img_raw, cv2.COLOR_BGR2HSV)            # HSV copy for colour masking
img_changing = cv2.cvtColor(img_raw, cv2.COLOR_RGB2GRAY)      # greyscale copy
low_color = np.array([0, 0, 0])
high_color = np.array([180, 255, 30])                         # only very dark pixels count as "black"
blackColorMask = cv2.inRange(img_hsv, low_color, high_color)  # mask of near-black pixels
img_inversion = cv2.bitwise_not(img_changing)
img_black_filtered = cv2.bitwise_and(img_inversion, img_inversion, mask=blackColorMask)
img_final_inversion = cv2.bitwise_not(img_black_filtered)     # back to dark digits on a white background
So, with this code, my image looks like this:
Even with cv2.blur, I don't reach 75% of images fully analysed.
For at least 25% of the images, pytesseract misses one or more digits.
Is that normal? Do you have ideas of what I can do to maximize the success rate?
Thanks
Whenever you see that Tesseract is missing a character or digit entirely, think about page segmentation modes. If a character is read but recognized incorrectly, that is a recognition issue instead.
OCR engines first split up the text in the input image, a step called page segmentation, and then try to recognize each piece. Tesseract supports 14 page segmentation modes (0-13), as follows:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
For your case, the best solution would be to treat your image as a single block to avoid missing any digits, and then restrict the output to digits only for a better result. Your code should look like this:
text = pytesseract.image_to_string(image, lang='eng',
                                   config='--psm 6 -c tessedit_char_whitelist=0123456789')
print(text)
Output:
1821293045013
Your attempt to process a field entry was thwarted by "artifacts"; see the upper pair for my best result with your coloured source.
The normal advice is to use greyscale, but in this case that makes matters worse, as there is background chatter.
You were right to attempt thresholding, as that will produce clearer results; however, Tesseract is prone to odd line and white-space insertion when the characters are not words.
I suggested you double-check that there was no vector data in the file, and it appears you uncovered an entry (an annotation?) that matched the data field.
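If you want a quick programmatic check for embedded text before falling back to OCR, something along these lines should work (a sketch, assuming the pypdf package and a hypothetical file path):

from pypdf import PdfReader

reader = PdfReader('document.pdf')                   # hypothetical path to one of your PDFs
embedded_text = reader.pages[0].extract_text() or ''
if embedded_text.strip():
    print('Embedded text found, no OCR needed:', embedded_text)

If that prints your numbers, you can skip the image pipeline entirely for those files.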
I am trying to get text from a video game using PIL and pytesseract. Here is an example of what I am trying to recognize:
I used a basic function to get a binary image and another to invert it; here is the function:
@staticmethod
def getBinaryImage(image, thresh):
    fn = lambda x: 255 if x > thresh else 0        # binarize against the threshold
    im = image.convert('L').point(fn, mode='1')    # greyscale, then threshold to 1-bit
    return im
With these filters I manage to get this:
The problem is that Tesseract can't recognize this. I tried different thresholds for the binary image, but it doesn't help.
My question is: are there any other basic filters I can apply to my image to improve its quality and make Tesseract recognize it?
EDIT:
Here is a new version of my image, which was resized. But tesseract still can't recognize it.
I have worked with Tesseract, and the best way to improve OCR is to train it on your typography.
Another way you can approach this is by using parameters:
--psm N
Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
11 = Sparse text. Find as much text as possible in no particular order.
12 = Sparse text with OSD.
13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Then use --psm 10
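With pytesseract, passing that mode looks roughly like this (a sketch; im is assumed to be your preprocessed PIL image):

import pytesseract

text = pytesseract.image_to_string(im, config='--psm 10')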
Another option is:
--oem N
Specify OCR Engine mode. The options for N are:
0 = Original Tesseract only.
1 = Neural nets LSTM only.
2 = Tesseract + LSTM.
3 = Default, based on what is available.
These --oem options are available in Tesseract 4 and 5; in my experience I could not use them because of errors.
I wanted to use --oem 0, so I installed Tesseract 3.2, which is the last version of the original (--oem 0) engine, and I got better predictions.
That is my experience; try it.
I hope it works for you.
I'm trying to preprocess frames of a game in real-time for a ML project.
I want to extract numbers from the frame, so I chose Pytesseract, since it looked quite good with text.
However, no matter how clear I make the text, it won't read it correctly.
My code looks like this:
section = process_screen(screen_image)[1]
pixels = rgb_to_bw(section) #Makes the image grayscale
pixels[pixels < 200] = 0 #Makes all non-white pixels black
tess.image_to_string(pixels)
=> 'ye ml)'
At best it outputs "ye ml)" when I don't specify I want digits, and when I do, it outputs nothing at all.
The non-processed game image looks like so:
The "pixels" image looks like so :
Thanks to Alex Alex, I inverted the image, and got this
And got "2710", which is better, but still not perfect.
You must invert the image before recognition.
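Tesseract works best with dark text on a light background, so a minimal sketch of that inversion (assuming pixels is the thresholded grayscale NumPy array from the question) would be:

inverted = 255 - pixels              # white digits on black become dark digits on white
text = tess.image_to_string(inverted)

Combining this with a digits-only whitelist, as in the earlier answer, may improve things further.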
I have recently started studying steganography and I've come across a problem that I just don't seem to understand. Basically, the image is a png which contains a hidden flag in it.
When you extract the bit planes from the image, you can see that there's an image in the blue and green planes that also shows up in the red one. To reveal the flag in clear text, you have to remove those images from the red plane by XORing the LSBs or something; I am not totally sure.
This is what the image in the red plane looks like if you don't remove the others.
My question is how do I go about doing this kind of thing? This is the image in question.
Actually the hidden image is in the lowest 3 bit planes. Doing a full bit decomposition makes that clear.
Start by loading the image to a numpy array, which will have dimensions MxNx3.
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
img = Image.open('stego.png')
data = np.array(img)
All you have to do now is XOR each colour plane with another and then keep the 3 least significant bits (lsb).
extracted = (data[...,0] ^ data[...,1] ^ data[...,2]) & 0x07
plt.imshow(extracted)
plt.show()
In case it wasn't obvious, the & 0x07 part is an AND operation with the binary number 00000111, just written in hexadecimal for conciseness.
If you don't keep all 3 lsb, then you'll either be missing some letters in the solution, or everything will be there but some edges won't be as smooth. The first of these is critically important.
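For reference, a full bit decomposition is just a matter of looking at each plane on its own; continuing from the arrays above, a quick sketch for bit b of the red channel:

b = 0                               # 0 = least significant bit
plane = (data[..., 0] >> b) & 1     # bit b of the red channel, values 0 or 1
plt.imshow(plane, cmap='gray')
plt.show()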
My question is not too far off from the "Image Alignment (ECC) in OpenCV ( C++ / Python )" article.
I also found the following article about facial alignment to be very interesting, but WAY more complex than my problem.
Wow! I can really go down the rabbit-hole.
My question is WAY more simple.
I have a scanned document that I have treated as a "template". In this template I have manually mapped the pixel regions that I require info from as:
area = (x1,y1,x2,y2)
such that x1<x2, y1<y2.
Now, these regions are, as is likely obvious, a bit too specific to my "template".
All other files that I want to extract data from are mostly shifted by some unknown amount such that their true area for my desired data is:
area = (x1 + ε1, y1 + ε2, x2 + ε1, y2 + ε2)
Where ε1, ε2 are unknown in advance.
But the documents are otherwise HIGHLY similar outside of this shift.
I want to discover, ideally through OpenCV, what translation is required (for the time being ignoring a full Euclidean transform) to "align" these images so as to discover my ε, shift my area, and parse my data directly.
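For what it's worth, a minimal sketch of estimating a pure translation with OpenCV's phase correlation (file names are hypothetical, both scans are assumed to have the same dimensions, and the sign convention of the returned shift is worth verifying against your own images):

import cv2
import numpy as np

template = cv2.imread('template.png', cv2.IMREAD_GRAYSCALE).astype(np.float32)
scanned = cv2.imread('scanned.png', cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Estimate the (dx, dy) translation between the two pages.
(dx, dy), response = cv2.phaseCorrelate(template, scanned)

# Shift the manually mapped region by the estimated offsets before cropping.
x1, y1, x2, y2 = area
shifted_area = (int(x1 + dx), int(y1 + dy), int(x2 + dx), int(y2 + dy))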
I have thought about using Tesseract to mine the text from the document and then parse from there, but there are check boxes, either filled or empty, that contain meaningful information for my problem.
The code I currently have for cropping the image is:
from PIL import Image
img = Image.open(img_path)
area = area_lookup['key']
cropped_img = img.crop(area)
cropped_img.show()
My two sample files are attached.
My two images are:
We can assume my first image is my "template".
As you can see, the two images are very "similar" but one is moved slightly (human error). There may be cases where the rotation is more extreme, or the image is shifted more.
I would like transform image 2 to be as aligned to image 1 as possible, and then parse data from it.
Any help would be sincerely appreciated.
Thank you very much
I have a number of images from Chinese genealogies, and I would like to be able to programmatically categorize them. Generally speaking, one type of image has primarily line-by-line text, while the other type may be in a grid or chart format.
Example photos
'Desired' type: http://www.flickr.com/photos/63588871#N05/8138563082/
'Other' type: http://www.flickr.com/photos/63588871#N05/8138561342/in/photostream/
Question: Is there a (relatively) simple way to do this? I have experience with Python, but little knowledge of image processing. Direction to other resources is appreciated as well.
Thanks!
Assuming that at least some of the grid lines are exactly or almost exactly vertical, a fairly simple approach might work.
I used PIL to find all the columns in the image where more than half of the pixels were darker than some threshold value.
Code
from PIL import Image, ImageDraw

withlines = Image.open('withgrid.jpg')
nolines = Image.open('nogrid.jpg')

def findlines(image):
    w, h = image.size
    s = w * h
    im = image.convert('L').point(lambda i: 255 * (i < 60))  # greyscale, then threshold
    d = im.getdata()                                          # faster than per-pixel operations
    linecolumns = []
    for col in range(w):
        black = sum(d[x] for x in range(col, s, w)) // 255
        if black > 450:
            linecolumns += [col]
    # return an image showing the detected lines
    im2 = image.convert('RGB')
    draw = ImageDraw.Draw(im2)
    for col in linecolumns:
        draw.line((col, 0, col, h - 1), fill='#f00', width=1)
    return im2
findlines(withlines).show()
findlines(nolines).show()
Results
showing detected vertical lines in red for illustration
As you can see, four of the grid lines are detected, and, with some processing to ignore the left and right sides and the center of the book, there should be no false positives on the desired type.
This means that you could use the above code to detect black columns and discard those that are near the edges or the center; if any black columns remain, classify the page as the "other", undesired class of pictures.
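A rough sketch of that classification step, reusing the same darkness threshold (the margin widths and the dark-pixel cutoff are guesses that would need tuning on your scans):

def classify(image, margin_frac=0.1, centre_frac=0.1):
    w, h = image.size
    s = w * h
    im = image.convert('L').point(lambda i: 255 * (i < 60))   # same threshold as above
    d = im.getdata()
    for col in range(w):
        near_edge = col < w * margin_frac or col > w * (1 - margin_frac)
        near_centre = abs(col - w // 2) < w * centre_frac
        if near_edge or near_centre:
            continue                                           # ignore page edges and the centre fold
        black = sum(d[x] for x in range(col, s, w)) // 255
        if black > h // 2:                                     # more than half of the column is dark
            return 'other'                                     # grid/chart page
    return 'desired'                                           # line-by-line text page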
AFAIK, there is no easy way to solve this. You will need a decent amount of image processing and some basic machine learning to classify these kinds of images (and even then it probably won't be 100% successful).
Another note:
While this can be solved using machine learning techniques alone, I would advise you to start by looking into some image processing techniques and trying to convert your image to a form that shows a clear difference between the two types. For this, you could start by reading about the FFT. After that, have a look at some digital image processing techniques. When you feel comfortable that you have a decent understanding of these, you can read up on pattern recognition.
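To make the FFT idea a bit more concrete, here is a rough sketch (the score cutoff separating the two classes would have to be found empirically on a few labelled pages). Pages with strong vertical rules tend to concentrate spectral energy along the horizontal-frequency axis:

import numpy as np
from PIL import Image

def vertical_line_score(path):
    # Fraction of non-DC spectral energy lying on the horizontal-frequency axis.
    img = np.asarray(Image.open(path).convert('L'), dtype=float)
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    cy, cx = spectrum.shape[0] // 2, spectrum.shape[1] // 2
    spectrum[cy, cx] = 0.0                       # drop the DC component
    return spectrum[cy, :].sum() / spectrum.sum()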
This is only one suggested approach, though; there are more ways to achieve this.