How to extract only ID photo from CV with pdfimages - python

Hi, I tried to use pdfimages to extract the ID photos from my PDF resume files. However, for some files it also returns icons, table lines, and border images, which are totally irrelevant.
Is there any way I can limit it to extracting only the person's photo? I am wondering whether I can define certain size constraints on the output.

You need a way of differentiating images found in the PDF in order to extract the ones of interest.
I believe you have the option of considering:
1) Image characteristics such as width, height, bits per component, and color space
2) Metadata information about the image (e.g. an XMP tag of interest)
3) Facial recognition of the person in the photo, or form recognition of the structure of the ID itself
4) Extracting all of the images and then using some image-processing code to analyze them and identify the ones of interest
I think 2) may be the most reliable method, if the author of the PDF included such information with the photo IDs. 3) may be difficult to implement and to get consistently reliable results from. 1) will only work if those characteristics are a reliable means of identifying photo IDs across your PDF documents.
You could then key off that information in your extraction tool (if it lets you do that). Otherwise you would need to write your own extraction tool using a PDF library; a sketch of option 1) follows below.
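For illustration, a minimal sketch of option 1) using PyMuPDF rather than pdfimages, since it exposes each image's dimensions before extraction. The size thresholds are assumptions you would tune to your own documents:

    import fitz  # PyMuPDF

    # Assumed size window for an ID photo; tune for your documents.
    MIN_W, MIN_H = 100, 120
    MAX_W, MAX_H = 800, 1000

    doc = fitz.open("resume.pdf")
    count = 0
    for page in doc:
        # Each tuple describes one embedded image: (xref, smask, width, height, ...)
        for img in page.get_images(full=True):
            xref, width, height = img[0], img[2], img[3]
            if MIN_W <= width <= MAX_W and MIN_H <= height <= MAX_H:
                info = doc.extract_image(xref)  # raw bytes plus file extension
                with open(f"photo_{count}.{info['ext']}", "wb") as f:
                    f.write(info["image"])
                count += 1

If size alone is not discriminating enough, running a face detector (for example OpenCV's Haar cascades) over the surviving images would be one way to implement option 3).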

Related

OCR / ICR for handwriting and logos

I have to do the following tasks in Python and have no idea where to begin:
1. OCR of handwritten dates
2. Page/document orientation detection for pre-treatment
3. Stamp and logo detection and classification
a. Orientation variation included
b. Quality degradation to be considered
c. Overlying primary content
Could anybody help me? Thanks in advance!
You can use ocrmypdf to extract text from a PDF. It runs OCR on each page and returns a PDF that looks the same as the original but with a text layer on it. For detection of logos, you need to implement a computer-vision-based model. If you need more details, please specify your requirements in detail.
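A minimal sketch of the ocrmypdf step via its Python API (file names are placeholders); the rotate_pages and deskew options address the orientation point:

    import ocrmypdf

    ocrmypdf.ocr(
        "scanned_form.pdf",     # input (placeholder name)
        "searchable_form.pdf",  # output with an added text layer
        rotate_pages=True,      # auto-detect and fix page orientation
        deskew=True,            # straighten slightly rotated scans
    )

Note that the underlying OCR engine (Tesseract) is built for printed text; handwritten dates will likely need a dedicated handwriting model.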

Adding text search to content based image retrieval (convnet)

I've implemented a CBIR app using the standard ConvNet approach:
1. Use transfer learning to extract features from the data set of images
2. Cluster the extracted features via kNN
3. Given a search image, extract its features
4. Return the top 10 images that are closest to the image in hand in the kNN network
I am getting good results, but I want to improve them further by adding text search as well. For instance, when my query image is the steering wheel of a car, the close results will be any circular objects that resemble a steering wheel, such as a bike wheel. What would be the best way to input text, say "car part", to produce only steering wheels similar to the search image?
I am unable to find a good way to combine a ConvNet with a text-search model to construct an improved kNN network.
My other idea is to use ElasticSearch for the text search, something ElasticSearch is good at. For instance, I would run the CBIR search described previously, look up the descriptions of the returned results, and then use ElasticSearch on that subset of hits to produce the final results. Maybe tag images with classes and let the user de/select groups of images of interest.
I don't want to do the text search before the image search, as some of the images are poorly described, so a text search would miss them.
Any thoughts or ideas will be appreciated!
I have not found the original paper, but maybe you will find this interesting: https://www.slideshare.net/xavigiro/multimodal-deep-learning-d4l4-deep-learning-for-speech-and-language-upc-2017
It is about finding a vector space in which both images and text live (a multimodal embedding). This way you can find text similar to an image, images referring to a text, or use a text/image pair to find similar images.
I think this idea is an interesting point to start from.
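As a concrete illustration of that idea (my own sketch, not from the slides), models such as clip-ViT-B-32 from the sentence-transformers package embed images and text into the same space, so the query image and the query text can be fused before the kNN lookup:

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("clip-ViT-B-32")

    # Embed the query image and the query text into the shared space.
    img_emb = model.encode(Image.open("steering_wheel.jpg"))
    txt_emb = model.encode("car part")
    query = (img_emb + txt_emb) / 2  # naive fusion; the weighting is a tuning knob

    # corpus_paths would be your indexed image data set.
    corpus_paths = ["img1.jpg", "img2.jpg"]
    corpus_embs = model.encode([Image.open(p) for p in corpus_paths])
    hits = util.semantic_search(query, corpus_embs, top_k=10)[0]
    print(hits)  # each hit has a corpus_id and a similarity score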

how to detect if photo is mostly a document?

I think I am looking for something simpler than detecting document boundaries in a photo. I am only trying to flag photos that are mostly of documents, rather than normal scene photos. Is this an easier problem to solve?
Are the documents mostly white? If so, you could analyse the images for white content above a certain percentage. Generally text documents only have about 10% printed content on them in total.
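A minimal sketch of that heuristic with OpenCV; the brightness level and the percentage cutoff are assumptions to tune on your own photos:

    import cv2

    def looks_like_document(path, white_level=200, min_white_fraction=0.6):
        # Fraction of near-white pixels in the grayscale image.
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        white_fraction = (gray >= white_level).mean()
        return white_fraction >= min_white_fraction

    print(looks_like_document("photo.jpg"))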

Count Images in a pdf document through python

Is there a way to count the number of images (JPEG, PNG, JPG) in a PDF document through Python?
Using pdfimages from poppler-utils
You might want to take a look at pdfimages from the poppler-utils package.
I tested with a sample PDF brochure. On running the following command, the images present in the PDF are extracted:
pdfimages /home/tata/Desktop/4555c-5055cBrochure.pdf image
So, you can use Python's subprocess module to execute this command and then extract all the images.
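For counting specifically, the -list flag is handy because it prints one row per embedded image without writing any files; a sketch:

    import subprocess

    out = subprocess.run(
        ["pdfimages", "-list", "input.pdf"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The first two lines are the column header and a separator;
    # every remaining line describes one embedded image.
    count = max(len(out.splitlines()) - 2, 0)
    print(count, "images found")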
Note: There are some drawbacks to this method. By default it generates images in PPM format, not JPEG (the -all flag keeps images in their native formats). Also, some additional images might be extracted which are not really meaningful images in the PDF.
Using pdfminer
If you want to do this using pdfminer, take a look at this blog post -
Extracting Text & Images from PDF Files
Pdfminer allows you to traverse the layout of a particular PDF page. The pdfminer docs include a diagram showing the layout objects and the tree structure pdfminer generates.
Thus, extracting LTFigure objects can help you extract/count images in the PDF document.
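A sketch of that idea with pdfminer.six, counting LTImage objects and recursing into LTFigure containers:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTFigure, LTImage

    def count_images(path):
        count = 0

        def walk(obj):
            nonlocal count
            if isinstance(obj, LTImage):
                count += 1
            elif isinstance(obj, LTFigure):  # a container; recurse into children
                for child in obj:
                    walk(child)

        for page in extract_pages(path):
            for element in page:
                walk(element)
        return count

    print(count_images("input.pdf"))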
Note: both of these methods might not be accurate; their accuracy depends heavily on the type of PDF document you are dealing with.
I don't think this can be done directly, although I have done something similar using the following approach (sketched below):
1. Use Ghostscript to convert the PDF into page images.
2. On each page, use computer vision (OpenCV) to extract the areas of interest (in your case, images).
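A rough sketch of those two steps; the contour-based heuristic for spotting picture regions is an assumption, not a complete detector:

    import subprocess
    import cv2

    # 1) Render each page of the PDF to a PNG at 150 dpi with Ghostscript.
    subprocess.run(
        ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m", "-r150",
         "-sOutputFile=page_%03d.png", "input.pdf"],
        check=True,
    )

    # 2) Look for large rectangular regions on one rendered page.
    img = cv2.imread("page_001.png")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    regions = [cv2.boundingRect(c) for c in contours
               if cv2.contourArea(c) > 10_000]  # assumed minimum area
    print(len(regions), "candidate image regions")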

"Normalizing" (de-skewing, re-scaling) images as preprocessing for OCR in Python

I have a bunch of scanned images of documents with the same layout (strict forms filled out with variable data) that I need to process with OCR. I can more or less cope with the OCR process itself (converting text images to text), but I still have to deal with the annoying fact that the scanned images are distorted by different degrees of rotation, by different scaling, or by both.
Because my method reads pieces of information from cells defined as pixel bounding boxes, I must convert all pictures to a "standard" version in which every corresponding cell is in the same pixel position; otherwise my reader misreads. My question is: how can I "normalize" the distorted images?
I use Python.
Today, in high-volume form-scanning jobs, we use commercial software with adaptive template matching, which does deskewing and selective binarization to prepare the images, but it then adapts the field boxes per image rather than placing boxes at fixed XY locations.
The deskewing process generally increases the image size. This is visible in this random image from an online search:
https://github.com/tesseract-ocr/tesseract/wiki/skew-linedetection.png
Notice how the title of the document was near the top border, and in the deskewed image it is shifted down. In this oversimplified example, an XY-based box would not catch it.
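The commercial tools' internals are not public, but for reference, a common deskewing sketch in Python uses OpenCV's minimum-area rectangle around the ink pixels to estimate the global rotation angle:

    import cv2
    import numpy as np

    def deskew(image):
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # Make ink pixels white, then threshold them out.
        thresh = cv2.threshold(cv2.bitwise_not(gray), 0, 255,
                               cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
        coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        # minAreaRect's angle convention changed around OpenCV 4.5;
        # fold it into a small correction and verify the sign on a sample.
        if angle > 45:
            angle -= 90
        h, w = image.shape[:2]
        M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
        # Rotation can shift content around the canvas, which is exactly
        # the box-shifting problem described above.
        return cv2.warpAffine(image, M, (w, h),
                              flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)

    deskewed = deskew(cv2.imread("scan.png"))

This only estimates a single global angle; it does not fix scaling differences, for which you would additionally need to match some reference landmarks on the form.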
I use commercial software for deskewing and image pre-processing. It is quite inexpensive but good. Unfortunately, I believe it will take you only part of the way if the data-capture method relies on XY-coordinate field matching. I sense your frustration with dealing with this; appropriate tools have already been created for handling it.
I run a service bureau for such form processing. If you are interested, I can share more privately about how we process them.
