I think I am looking for something simpler than detecting document boundaries in a photo. I am only trying to flag photos that are mostly of documents, rather than normal scene photos. Is this an easier problem to solve?
Are the documents mostly white? If so, you could analyse the images for white content above a certain percentage. Generally text documents only have about 10% printed content on them in total.
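If it helps, here is a minimal sketch of that white-percentage check in Python with OpenCV; the function name and both thresholds are assumptions that would need tuning on your own photos:

```python
import cv2
import numpy as np

def looks_like_document(path, white_thresh=200, min_white_fraction=0.5):
    """Rough heuristic: a photo that is mostly a (white) document should be
    dominated by bright pixels. Both thresholds are guesses to tune."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    white_fraction = np.mean(gray >= white_thresh)  # fraction of bright pixels
    return white_fraction >= min_white_fraction

# e.g. flag = looks_like_document("photo_0001.jpg")
```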
I am working on a document watermarking and recovery project. My code works fine for embedding the watermark and then recovering it, and it works great with digitally generated images that haven't been scaled or modified in any way. But when I use scanned or otherwise modified images, the recovered portion is good enough to be read when not zoomed in, but not great. The problem is that the original images have fringes of colour around the text. Those chunks of colourful pixels cause a bad average to be stored, and when that average is used for recovery, the recovered text, signatures and other artefacts end up with bad colours. Below are some images illustrating the problem. On the left is the recovered portion, while the right shows an untampered portion. The untampered portion shows how the original image had bad colours all around the text.
How do I remove that bad colouring or blocks?
Any ideas about removing them from the recovered part or the original image itself?
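One common way to suppress coloured fringes around scanned text (not the original poster's code, just a sketch of the idea) is to smooth only the chroma channels and leave luminance, and therefore the text edges, untouched. A minimal OpenCV sketch, with the filename and kernel size as placeholders:

```python
import cv2

def suppress_colour_fringes(bgr, ksize=7):
    # Work in Lab so colour can be smoothed without blurring text edges.
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    # Median-filter only the chroma channels; luminance keeps the detail.
    a = cv2.medianBlur(a, ksize)
    b = cv2.medianBlur(b, ksize)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)

cleaned = suppress_colour_fringes(cv2.imread("scanned_page.png"))
```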
Hi, I tried to use pdfimages to extract ID photos from my PDF resume files. However, for some files it also returns icons, table lines and border images, which are totally irrelevant.
Is there any way I can limit it to extracting only the person's photo? I am wondering whether we could define certain size constraints on the output.
You need a way of differentiating images found in the PDF in order to extract the ones of interest.
I believe you have the following options to consider:
1) Image characteristics such as width, height, bits per component, colour space
2) Metadata about the image (e.g. an XMP tag of interest)
3) Facial recognition of the person in the photo, or form recognition of the structure of the ID itself
4) Extracting all of the images and then using some image-processing code to analyse them and identify the ones of interest
I think 2) may be the most reliable method, if the author of the PDF included such information with the photo IDs. 3) may be difficult to implement and to get consistently reliable results from. 1) will only work if those characteristics are a reliable means of identifying such photo IDs across your PDF documents.
Then you could key off of that information using your extraction tool (if it lets you do that). Otherwise you would need to write your own extraction tool using a PDF library.
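As an illustration of 1) and 4), here is a rough sketch using PyMuPDF (a Python PDF library) that extracts only images passing a size and aspect-ratio filter; the filename, minimum side length and aspect-ratio limit are assumptions you would tune for your resumes:

```python
import fitz  # PyMuPDF

MIN_SIDE = 150      # assumed: ID photos are at least ~150 px on each side
MAX_ASPECT = 2.0    # assumed: roughly square/portrait, not thin border strips

doc = fitz.open("resume.pdf")
for page in doc:
    for img in page.get_images(full=True):
        xref = img[0]
        info = doc.extract_image(xref)
        w, h = info["width"], info["height"]
        # Skip icons, rules and border strips by size and aspect ratio.
        if min(w, h) < MIN_SIDE or max(w, h) / min(w, h) > MAX_ASPECT:
            continue
        with open(f"candidate_{xref}.{info['ext']}", "wb") as out:
            out.write(info["image"])
```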
I want to detect the font of the text in an image so that I can do better OCR on it. Searching for a solution, I found this post. Although it may seem to be the same as my question, it does not exactly address my problem.
Background
For OCR I am using Tesseract, which uses trained data for recognizing text. Training Tesseract with lots of fonts reduces the accuracy, which is natural and understandable. One solution is to build multiple sets of trained data, one per group of similar fonts, and then automatically use the appropriate data for each image. For this to work, we need to be able to detect the font in the image.
Number 3 in this answer uses OCR to isolate images of characters along with their recognized character, then generates the same character's image with each font and compares them with the isolated image. In my case the user would provide a bounding box and the character associated with it. But because I want to OCR Arabic script (which is cursive, and character shapes may vary depending on which other characters are adjacent) and because the bounding box may not actually be the minimal bounding box, I am not sure how I can do the comparison.
I believe the Hausdorff distance is not applicable here. Am I right?
Shape context may be a good fit (?), and there is a ShapeContextDistanceExtractor class in OpenCV, but I am not sure how to use it from opencv-python.
Thank you, and sorry for my bad English.
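Regarding the ShapeContextDistanceExtractor mentioned above, here is a minimal sketch of calling it from opencv-python (assuming OpenCV 4.x; the file names and sample count are placeholders). It compares two binarized glyph images by sampling contour points from each:

```python
import cv2
import numpy as np

def contour_points(gray, n_samples=300):
    # Binarize the glyph and collect points from all of its contours.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    pts = np.vstack([c.reshape(-1, 2) for c in contours])
    # Sample a fixed number of points so both shapes are comparable.
    idx = np.linspace(0, len(pts) - 1, n_samples).astype(int)
    return pts[idx].reshape(-1, 1, 2).astype(np.float32)

scan = cv2.imread("glyph_from_scan.png", cv2.IMREAD_GRAYSCALE)
rendered = cv2.imread("glyph_rendered_with_candidate_font.png", cv2.IMREAD_GRAYSCALE)

extractor = cv2.createShapeContextDistanceExtractor()
distance = extractor.computeDistance(contour_points(scan), contour_points(rendered))
print(distance)  # lower distance = more similar shapes
```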
I'm trying to extract text from a scanned technical drawing. For confidentiality reasons I cannot post the actual drawing, but it looks similar to this, only a lot busier, with more text inside the shapes. The problem is quite complex because letters touch both each other and their surrounding borders/symbols.
I found an interesting paper that does exactly this, called "Detection of Text Regions From Digital Engineering Drawings" by Zhaoyang Lu. It's behind a paywall, so you might not be able to access it, but essentially it tries to erase everything that's not text from the image in two main steps:
1) Erase linear components, including long and short isolated lines
2) Erase non-text strokes through analysis of the connected components of strokes
What kind of OpenCV functions would help in performing these operations? I would rather not write something from the ground up to do these, but I suspect I might have to.
I've tried using a template-based approach to isolate the text, but since the text location isn't completely normalized between drawings (even within the same project), it fails to detect text past the first scanned figure.
I am working on a similar problem. Technical drawings are an issue because OCR software mostly tries to find text baselines, and the drawing artifacts (lines etc.) get in the way of that approach. In the drawing you showed there are not many characters touching each other, so I suggest breaking the image into contiguous (black) pixel regions and then scanning those individually. The height of a contiguous region should also give you an indication of whether it is text or a piece of the drawing. To break the image into contiguous regions, use a flood fill algorithm; for the scanning, Tesseract does a good job.
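A rough sketch of that idea using OpenCV's connected-components analysis plus pytesseract; the height range and file name are assumptions to tune for your drawings:

```python
import cv2
import pytesseract

img = cv2.imread("drawing.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Label contiguous black regions (white in the inverted binary image).
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)

MIN_H, MAX_H = 8, 40  # assumed plausible text height range in pixels
for i in range(1, n):  # label 0 is the background
    x, y, w, h, area = stats[i]
    if MIN_H <= h <= MAX_H:  # character-sized component, likely text
        roi = img[y:y + h, x:x + w]
        text = pytesseract.image_to_string(roi, config="--psm 10")  # single char
        print(x, y, text.strip())
```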
Obviously I've never attempted this specific task, but if the image really looks like the one you showed, I would start by removing all vertical and horizontal lines. This could be done fairly easily: set a width threshold, and for every pixel with intensity above some value N, look at up to that threshold number of pixels perpendicular to the hypothetical line orientation; if it looks like a line, erase it.
More elegant, and perhaps better, would be to do a Hough transform for lines and circles and remove those elements that way.
You could also maybe try some FFT-based filtering, but I'm not so sure about that.
I've never used OpenCV, but I would guess it can do the things I mentioned.
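For the Hough transform idea, a minimal OpenCV sketch might look like the following; the Canny and Hough parameters are guesses and would need tuning so that short text strokes are not erased along with the drawing lines:

```python
import cv2
import numpy as np

img = cv2.imread("drawing.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)

# Detect long straight segments; minLineLength keeps short text strokes.
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                        minLineLength=60, maxLineGap=5)

cleaned = img.copy()
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        # Paint detected lines white so (mostly) text strokes remain.
        cv2.line(cleaned, (x1, y1), (x2, y2), 255, thickness=3)

cv2.imwrite("no_lines.png", cleaned)
```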
I have a bunch of scanned images of documents with the same layout (strict forms filled out with variable data) that I need to process with OCR. I can more or less cope with the OCR process itself (converting text images to text), but I still have to deal with the annoying fact that the scanned images are distorted by different degrees of rotation, different scaling, or both.
Because my method reads pieces of information from cells that are defined as bounding boxes in pixel coordinates, I must convert all pictures to a "standard" version in which every corresponding cell is in the same pixel position; otherwise my reader misreads. My question is: how could I "normalize" the distorted images?
I use Python.
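One possible way to do that normalization, sketched here as an assumption rather than a known-good recipe, is to register every scan against a single reference scan of the blank form using ORB features and a homography, so every cell ends up at the same pixel position; the file names are placeholders:

```python
import cv2
import numpy as np

ref = cv2.imread("reference_form.png", cv2.IMREAD_GRAYSCALE)
scan = cv2.imread("scanned_form.png", cv2.IMREAD_GRAYSCALE)

# Match keypoints between the scan and the reference form.
orb = cv2.ORB_create(2000)
k1, d1 = orb.detectAndCompute(ref, None)
k2, d2 = orb.detectAndCompute(scan, None)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(d2, d1), key=lambda m: m.distance)[:200]

src = np.float32([k2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([k1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# The homography absorbs rotation, scaling and small perspective differences.
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
normalized = cv2.warpPerspective(scan, H, (ref.shape[1], ref.shape[0]))
cv2.imwrite("normalized_form.png", normalized)
```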
Today, in high-volume form-scanning jobs, we use commercial software with adaptive template matching, which does deskew and selective binarization to prepare the images, but then adapts the field boxes per image instead of placing boxes at fixed XY locations.
The deskewing process generally increases the image size. This is visible in this random image found in an online search:
https://github.com/tesseract-ocr/tesseract/wiki/skew-linedetection.png
Notice how the title of the document was near the top border, and in the deskewed image it is shifted down. In this oversimplified example an XY-based box would not catch it.
I use commercial software for deskewing and image pre-processing. It is quite inexpensive but good. Unfortunately, I believe it will only take you part of the way if the data-capture method relies on XY-coordinate field matching. I sense your frustration with dealing with this; that is exactly why appropriate tools were created for handling it.
I run a service bureau for this kind of form processing. If you are interested, I can share more details about our processing methods privately.