Align text for OCR

I am creating a database from historical records which I have as photographed pages from books (100K+ pages). I wrote some Python code to do some image processing before I OCR each page. Since the data in these books does not come in well-formatted tables, I need to segment each page into rows and columns and then OCR each piece separately.
One of the critical steps is to align the text in the image.
For example, this is a typical page that needs to be aligned:
A solution I found is to smudge the text horizontally (I'm using skimage.ndimage.morphology.binary_dilation) and find the rotation that maximizes the sum of white pixels along the horizontal dimension.
This works fine, but it takes about 8 seconds per page, which, given the volume of pages I am working with, is far too slow.
Do you know of a better, faster way to align the text?
Update:
I use scikit-image for image processing functions, and scipy to maximize the count of white pixels along the horizontal axis.
Here is a link to an html view of the Jupyter notebook I used to work on this. The code uses some functions from a module I've written for this project so it cannot be run on its own.
Link to notebook (dropbox): https://db.tt/Mls9Tk8s
Update 2:
Here is a link to the original raw image (dropbox): https://db.tt/1t9kAt0z

Preface: I haven't done much image processing with Python. I can give you an image processing suggestion, but you'll have to implement it in Python yourself. All you need is an FFT and a polar transformation (I think OpenCV has a built-in function for that), so that should be straightforward.
You have only posted one sample image, so I don't know if this works as well for other images, but for this image, a Fourier transform can be very useful: Simply pad the image to a nice power of two (e.g. 2048x2048) and you get a Fourier spectrum like this:
I've posted an intuitive explanation of the Fourier transform here, but in short: your image can be represented as a series of sine/cosine waves, and most of those "waves" are parallel or perpendicular to the document orientation. That's why you see a strong frequency response at roughly 0°, 90°, 180° and 270°. To measure the exact angle, you could take a polar transform of the Fourier spectrum:
and simply take the columnwise mean:
The peak position in that diagram is at 90.835°, and if I rotate the image by -90.835° modulo 90° (that is, by about -0.835°), the orientation looks decent:
Like I said, I don't have more test images, but it works for rotated versions of your image. At the very least it should narrow down the search space for a more expensive search method.
Note 1: The FFT is fast, but it obviously takes more time for larger images. And sadly the best way to get a better angle resolution is to use a larger input image (i.e. with more white padding around the source image.)
Note 2: the FFT actually returns an image where the "DC" (the center in the spectrum image above) is at the origin 0/0. But the rotation property is clearer if you shift it to the center, and it makes the polar transform easier, so I just showed the shifted version.
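For readers who want something to start from, the steps above (pad, FFT, shift, polar sampling, mean along each direction) might be sketched in Python roughly like this. It is only a sketch under assumptions: the function name `estimate_skew_fft` and its parameters `max_angle` and `step` are made up for illustration, it uses NumPy and SciPy instead of OpenCV, and it samples only a narrow band of directions around the vertical spectrum axis instead of building the full polar image. The sign of the returned angle depends on the rotation routine you use afterwards, so check it on a page with known skew.

```python
import numpy as np
from scipy import ndimage

def estimate_skew_fft(gray, max_angle=10.0, step=0.1):
    """Return the direction (degrees from the vertical spectrum axis)
    that carries the most spectral energy. Hypothetical helper."""
    img = gray.astype(float)
    img -= img.mean()                                   # suppress the DC term
    size = int(2 ** np.ceil(np.log2(max(img.shape))))   # pad to a power of two
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img, s=(size, size))))

    centre = size // 2
    radii = np.arange(0.05 * size, 0.45 * size)         # skip DC and the corners
    angles = np.arange(-max_angle, max_angle + step, step)
    theta = np.deg2rad(90.0 + angles)[:, None]

    # Polar sampling: one row per candidate angle, one column per radius.
    rows = centre - radii[None, :] * np.sin(theta)
    cols = centre + radii[None, :] * np.cos(theta)
    polar = ndimage.map_coordinates(spectrum, [rows, cols], order=1)

    profile = polar.mean(axis=1)                        # mean along each ray
    return float(angles[np.argmax(profile)])
```

The angular resolution is limited by `step` and by the spectrum size, which matches Note 1: more padding gives a finer estimate at the cost of a larger FFT.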

This is not a full solution but there is more than a comment's worth of thoughts.
You have a margin on the left and right and top and bottom of your image. If you remove that, and even cut into the text in the process, you will still have enough information to align the image. So, if you chop, say 15%, off the top, bottom, left and right, you will have reduced your image area by 50% already - which will speed things up down the line.
Now take your remaining central area, and divide that into, say 10 strips all of the same height but the full width of the page. Now calculate the mean brightness of those strips and take the 1-4 darkest as they contain the most (black) lettering. Now work on each of those in parallel, or just the darkest. You are now processing just the most interesting 5-20% of the page.
Here is the command to do that in ImageMagick - it's just my weapon of choice and you can do it just as well in Python.
convert scan.jpg -crop 300x433+64+92 -crop x10# -format "%[fx:mean]\n" info:
0.899779
0.894842
0.967889
0.919405
0.912941
0.89933
0.883133 <--- choose 4th last because it is darkest
0.889992
0.88894
0.888865
If I make separate images out of those 10 stripes, I get this
convert scan.jpg -crop 300x433+64+92 -crop x10# m-.jpg
and effectively, I do the alignment on the fourth last image rather than the whole image.
Maybe unscientific, but quite effective and pretty easy to try out.
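If you would rather stay in Python, the same crop-and-pick-the-darkest-strip idea might look something like the sketch below. It assumes a greyscale page held in a NumPy array; the name `darkest_strips` and its parameters are invented here for illustration and are not part of any library.

```python
import numpy as np

def darkest_strips(gray, margin=0.15, n_strips=10, keep=1):
    """Crop the margins, cut the rest into horizontal strips and
    return the indices of the darkest ones plus all strip means."""
    h, w = gray.shape
    top, left = round(h * margin), round(w * margin)
    core = gray[top:h - top, left:w - left]          # drop 15% on every side
    strips = np.array_split(core, n_strips, axis=0)  # full-width strips
    means = np.array([s.mean() for s in strips])
    order = np.argsort(means)[:keep]                 # lowest mean = most ink
    return order.tolist(), means
```

You would then run the expensive alignment search only on `np.array_split(core, n_strips)[i]` for the returned indices `i`.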
Another thought, once you have your procedure/script sorted out for straightening a single image, do not forget you can often get massive speedup by using GNU Parallel to harass all your CPU's lovely, expensive cores simultaneously. Here I specify 8 processes to run in parallel...
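#!/bin/bash
# Print one command per page; GNU Parallel reads them from stdin
# and runs up to 8 of them at a time.
for ((i=0;i<100000;i++)); do
  echo ProcessPage $i
done | parallel --eta -j 8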

"align the text in the image" I suppose means to deskew the image so that text lines have the same baseline.
I thoroughly enjoyed reading scientific answers to this quite overengineered task. Answers are great, but is it really necessary to spend so much time (very precious resource) to implement this? There is an abundance of tools available for this function without needing to write a single line of code (unless OP is a CS student and wants to practice the science, but obviously OP is doing this out of necessity to get all images processed). These methods took me back to my college years, but today I would use different tools to process this batch quickly and efficiently, which I do daily. I work for a high-volume document conversion and data extraction service bureau and OCR consulting company.
Here is the result of a basic open and deskew step in ABBYY FineReader commercial desktop OCR package. Deskewing was more than sufficient for further OCR processing.
And I did not need to recreate and program my own browser just to post this answer.

Related

Eliminate the background (the common points) of 3 images - OpenCV

Forgive me, but I'm new to OpenCV.
I would like to remove the common background from 3 images, each showing a landscape and a man.
I tried some subtraction code but couldn't solve the problem.
I would like to output each image with only the man and without the landscape.
Are there algorithms in OpenCV that do this? (Without any manual operation, so no markers or similar.)
I tried this Python code: CV - Extract differences between two images
but it does not work, because in my case I don't have an image with only the background (without the man).
I think a good solution would be to compare all the images and save those "points" that are the same in at least one of the other images.
In this way I can extract a background (call it "Result.jpg") and finally analyze each image and cut out those portions that are also present in "Result.jpg".
Do you think this is a good idea? Do you have any simpler ideas?

Without semantic segmentation, you can't do that.
Because all you can compute is where two images differ, and this does not give you the silhouette of the person, but an overlapping of two silhouettes. You'll never know the exact outline.

Remove differences between two video frames

I'm trying to remove the differences between two frames and keep the non-changing graphics. I would probably repeat the same process with more frames to get more accurate results. My idea is to simplify the frames by removing things that aren't needed, to make the rest of the processing easier.
The frames come from the same video, so there is no need to deal with different sizes, orientations, etc. If the same graphic appears in another frame but with a different orientation or scale, I would like to remove it as well. For example:
Image 1
Image 2
Result (more or less, I suppose that will be uglier but containing a similar information)
One of the problems with this idea is that the source video, even though it consists of computer-generated graphics, is compressed, so it's not that easy to tell whether a change in the tonality of a pixel is actually a change or not.
Ideally I'm not working at the pixel level, and given the differences in saturation introduced by the compression, that is probably not possible anyway. I'm looking for unchanged "objects" in the image. I want to extract the information layer shown on top of what's happening behind it.
During the last couple of days I have tried to achieve this in a Python script using OpenCV with all kinds of combinations of absdiff, subtract, threshold, equalizeHist and Canny, but so far I haven't found the right implementation and would appreciate any guidance. How would you achieve it?

"Ideally I'm not working at the pixel level, and given the differences in saturation introduced by the compression, that is probably not possible anyway. I'm looking for unchanged "objects" in the image. I want to extract the information layer shown on top of what's happening behind it."
This will be extremely hard. You would need to employ proper CV, and if you're not an expert in that field, you'll have a really hard time.
How about this: forgetting about tooling and libs, you have two images, i.e. two equally sized sequences of RGB pixels, image A and image B, plus the output image R. Allocate output image R with the same size as A or B.
Run a single loop over every pixel, reading pixel a from A and pixel b from B. Each is a 3-element (RGB) vector. Find the distance between the two vectors, e.g. the magnitude of the vector (b-a); if this is less than some tolerance, write either a or b at the same offset in result image R. If not, write some default (background) color to R.
You can most likely do this with some HW accelerated way using OpenCV or some other library, but that's up to you to find a tool that does what you want.
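As one possible way to do that without an explicit loop, here is a minimal NumPy sketch of the per-pixel comparison described above. The function name `keep_unchanged`, the tolerance of 30 and the black fill color are arbitrary choices for illustration, and both frames are assumed to be H x W x 3 arrays of the same shape.

```python
import numpy as np

def keep_unchanged(a, b, tol=30.0, fill=(0, 0, 0)):
    """Copy pixels of `a` whose RGB distance to `b` is below `tol`;
    everything else becomes the `fill` color."""
    diff = a.astype(np.float32) - b.astype(np.float32)
    dist = np.linalg.norm(diff, axis=2)   # magnitude of (b - a) per pixel
    result = np.empty_like(a)
    result[:] = fill                      # default (background) color
    mask = dist < tol
    result[mask] = a[mask]
    return result
```

Because of compression noise, the tolerance will need tuning, and this still works purely at the pixel level, so it does not find "objects".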

how to locate and extract coordinates/data/sub-components of charts/map image data?

I'm working on creating a tile server from some raster nautical charts (maps) i've paid for access, and i'm trying to post-process the raw image data that these charts are distributed as, prior to geo-referencing them and slicing them up into tiles
i've got a two sets of tasks and would greatly appreciate any help or even sample code on how to get these done in an automated way. i'm no stranger to python/jupyter notebooks but have zero experience with this type of data-science to do image analysis/processing using things like opencv/machine learning (or if there's a better toolkit library that i'm not even yet aware of).
i have some sample images (originals are PNG but too big to upload so i encoded them in high-quality JPEGs to follow along/provide sample data).. here's what i'm trying to get done:
validation of all image data.. the first chart (as well as the last four) demonstrates what properly formatted chart images should look like (i manually added a few colored rectangles to the first, to highlight different parts of the image in the bonus section below)
some images will either have missing tile data, as in the 2nd sample image, these are ALWAYS chunks of 256x256 image data, so should be straightforward to identify black boxes of this exact size..
some images will have corrupt/misplaced tiles as in the 3rd image (notice in the center/upper half of the image is a large colorful semi-circle/arcs, it is slightly duplicated beneath, and if you look along horizontally you can see the image data is shifted, so these tiles have been corrupted somehow)
extraction of information, ultimately once all image data is verified to be valid (the above steps are ensured), there are a few bits of data i really need pulled out of the image, the most important of which is
the 4 coordinates (upper left, upper right, lower left, lower right) of the internal chart frame, in the first image they are highlighted in a small pink box at each corner (the other images don't have them highlighted but they are located in a similar way) - NOTE, because these are geographic coordinates and involve projections, they are NOT always 100% horizontal/vertical of each other.
the critical bit is that SOME images contain more than one "chartlet", i really need to obtain the above 4 coordinates for EACH chartlet (some charts have no chartlets, some two to several of them, and they are not always simple rectangular shapes), i may be able to generate for input the number of chartlets if that helps..
if possible, what would also help is extracting each chartlet as a separate image (each of these have a single capital letter, A, B, C in a circle that would be good if it appeared in the filename)
as a bonus, if there was a way to also extract the sections sampled in the first sample image (in the lower left corner), this would probably involve recognizing where/if in the image this appears (would probably only appear once per file but not certain) and then extracting based on its coordinates?
mainly the most important is inside a green box and represents a pair of tables (the left table is an example and i believe would always be the same, and the right has a variable amount of columns)
also the table in the orange box would be good to also get the text from as it's related
as would the small overview map in the blue box, can be left as an image
i have been looking at tutorials on opencv and image recognition processes but the content so far has been highly elementary not to mention an overwhelming endless list of algorithms for different operations (which again i don't know which i'd even need), so i'm not sure how it relates to what i'm trying to do.. really i don't even know where to begin to structure the steps needed for undertaking all these tasks or how each should be broken down further to ease the processing.
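for what it's worth, the first validation step (missing 256x256 tiles) can be sketched with plain NumPy. this is only a sketch under assumptions: it assumes the missing tiles are aligned to a 256-pixel grid starting at the top-left corner and are pure black, which the samples suggest but which is not confirmed; `find_black_tiles` and its parameters are hypothetical names.

```python
import numpy as np

def find_black_tiles(img, tile=256, max_value=0):
    """Return (row, col) offsets of grid-aligned tiles whose pixels
    are all <= max_value, i.e. candidate missing tiles."""
    h, w = img.shape[:2]
    missing = []
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            if img[r:r + tile, c:c + tile].max() <= max_value:
                missing.append((r, c))
    return missing
```

if the originals went through JPEG encoding, "black" may not be exactly zero, so `max_value` would need to be raised a little.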

Extracting text from scanned engineering drawings

I'm trying to extract text from a scanned technical drawing. For confidentiality reasons, I cannot post the actual drawing, but it looks similar to this, only a lot busier, with more text within shapes. The problem is quite complex due to letters touching both each other and their surrounding borders / symbols.
I found an interesting paper that does exactly this called "Detection of Text Regions From Digital Engineering Drawings" by Zhaoyang Lu. It's behind a paywall so you might not be able to access it, but essentially it tries to erase everything that's not text from the image through mainly two steps:
1) Erases linear components, including long and short isolated lines
2) Erases non-text strokes in terms of analysis of connected components of strokes
What kind of OpenCV functions would help in performing these operations? I would rather not write something from the ground up to do these, but I suspect I might have to.
I've tried using a template-based approach to try to isolate the text, but since the text location isn't completely normalized between drawings (even in the same project), it fails in detecting text past the first scanned figure.

I am working on a similar problem. Technical drawings are an issue because OCR software mostly tries to find text baselines and the drawing artifacts (lines etc) get in the way of that approach. In the drawing you specified there are not many characters touching each other. So I suggest to break the image into contiguous (black) pixels and then scan those individually. The height of the contiguous areas should give you also an indication if the contiguous area is text, or a piece of the drawing. To break the image into contiguous pixels, use a flood fill algorithm, and for the scanning Tesseract does a good job.
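A rough sketch of that idea, using connected-component labelling from SciPy in place of a hand-written flood fill (the two give the same regions): the function name `text_like_components` and the height limits are assumptions for illustration and would need tuning to the font size in your scans. Each returned box could then be cropped and passed to Tesseract.

```python
import numpy as np
from scipy import ndimage

def text_like_components(binary, min_h=8, max_h=40):
    """Label contiguous foreground pixels and keep the bounding boxes
    whose height is plausible for a character.
    Boxes are (top, left, bottom, right), bottom/right exclusive."""
    labels, _ = ndimage.label(binary)
    boxes = []
    for sl in ndimage.find_objects(labels):
        height = sl[0].stop - sl[0].start
        if min_h <= height <= max_h:
            boxes.append((sl[0].start, sl[1].start, sl[0].stop, sl[1].stop))
    return boxes
```

Long horizontal lines pass a pure height test only if they are thick, so in practice a width or aspect-ratio check may be worth adding as well.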

Obviously I've never attempted this specific task; however, if the image really looks like the one you showed, I would start by removing all vertical and horizontal lines. This could be done pretty easily: set a width threshold, and for every pixel with intensity larger than some value N, look at the threshold number of pixels perpendicular to the hypothetical line orientation. If it looks like a line, erase it.
More elegant and perhaps better would be to do a Hough transform for lines and circles and remove those elements that way.
Also you could maybe try some FFT based filtering, but I'm not so sure about that.
I've never used OpenCV but i would guess it can do the things i mentioned.
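To make the line-removal idea concrete, here is a small sketch that uses morphological opening with long, thin structuring elements. Note that this is a substitute technique, not the pixel scan described above: opening with a 1 x N element keeps only horizontal runs of at least N pixels, which are then erased. It uses SciPy instead of OpenCV, and `remove_long_lines` and `min_len` are illustrative names.

```python
import numpy as np
from scipy import ndimage

def remove_long_lines(binary, min_len=15):
    """Erase horizontal and vertical runs of at least `min_len` pixels
    from a binary image (True = ink)."""
    binary = binary.astype(bool)
    horiz = ndimage.binary_opening(binary, structure=np.ones((1, min_len), bool))
    vert = ndimage.binary_opening(binary, structure=np.ones((min_len, 1), bool))
    return binary & ~(horiz | vert)
```

Characters that touch a line lose the pixels they share with it, so `min_len` should be comfortably larger than the widest character stroke.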

Robust detection of grid pattern in an image

I have written a program in Python which automatically reads score sheets like this one
At the moment I am using the following basic strategy:
Deskew the image using ImageMagick
Read into Python using PIL, converting the image to B&W
Calculate the sums of pixels in the rows and the columns
Find peaks in these sums
Check the intersections implied by these peaks for fill.
The result of running the program is shown in this image:
You can see the peak plots below and to the right of the image shown in the top left. The lines in the top left image are the positions of the columns and the red dots show the identified scores. The histogram bottom right shows the fill levels of each circle, and the classification line.
The problem with this method is that it requires careful tuning, and is sensitive to differences in scanning settings. Is there a more robust way of recognising the grid, which will require less a-priori information (at the moment I am using knowledge about how many dots there are) and is more robust to people drawing other shapes on the sheets? I believe it may be possible using a 2D Fourier Transform, but I'm not sure how.
I am using the EPD, so I have quite a few libraries at my disposal.
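For reference, the projection-and-peaks part of the strategy above can be sketched as follows. This is not my actual program, just a minimal illustration assuming a deskewed binary image where marks are True; `grid_lines`, `min_distance` and `rel_height` are names chosen for this example, and `scipy.signal.find_peaks` stands in for whatever peak finder you prefer.

```python
import numpy as np
from scipy.signal import find_peaks

def grid_lines(binary, min_distance=5, rel_height=0.5):
    """Sum the pixels along rows and columns and return the positions
    of the peaks in both projections."""
    row_sums = binary.sum(axis=1)
    col_sums = binary.sum(axis=0)
    rows, _ = find_peaks(row_sums, height=rel_height * row_sums.max(),
                         distance=min_distance)
    cols, _ = find_peaks(col_sums, height=rel_height * col_sums.max(),
                         distance=min_distance)
    return rows, cols
```

The two thresholds are exactly the kind of tuning parameters that make this approach sensitive to scanning settings.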

First of all, I find your initial method quite sound and I would have probably tried the same way (I especially appreciate the row/column projection followed by histogramming, which is an underrated method that is usually quite efficient in real applications).
However, since you want to go for a more robust processing pipeline, here is a proposal that can probably be fully automated (also removing at the same time the deskewing via ImageMagick):
Feature extraction: extract the circles via a generalized Hough transform. As suggested in other answers, you can use OpenCV's Python wrapper for that. The detector may miss some circles but this is not important.
Apply a robust alignment detector using the circle centers. You can use Desolneux's parameter-less detector described here. Don't be afraid of the math; the procedure is quite simple to implement (and you can find example implementations online).
Get rid of diagonal lines by a selection on the orientation.
Find the intersections of the lines to get the dots. You can use these coordinates for deskewing by assuming ideal fixed positions for these intersections.
This pipeline may be a bit CPU-intensive (especially step 2, which performs some kind of greedy search), but it should be quite robust and automatic.

The correct way to do this is to use connected component analysis on the image, to segment it into "objects". Then you can use higher-level algorithms (e.g. a Hough transform on the components' centroids) to detect the grid and also determine for each cell whether it's on/off, by looking at the number of active pixels it contains.
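As a starting point, the segmentation step could look like the sketch below, which uses `scipy.ndimage` to label the components and returns each component's centroid and active-pixel count. The function name `component_summary` is made up for this example, and the grid detection on the centroids is left out.

```python
import numpy as np
from scipy import ndimage

def component_summary(binary):
    """Label connected components; return their centroids (row, col)
    and the number of active pixels in each."""
    labels, n = ndimage.label(binary)
    index = np.arange(1, n + 1)
    centroids = np.array(
        ndimage.center_of_mass(binary.astype(float), labels, index))
    areas = np.bincount(labels.ravel(), minlength=n + 1)[1:]
    return centroids, areas
```

A filled circle and an empty one have similar centroids but clearly different pixel counts, so thresholding `areas` gives a simple on/off classification.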
