I want to extract table information from OCR Data - python

I want to extract table information from OCR data. I have the raw image and its OCR text.
I tried pytesseract but couldn't figure out an actual implementation.
Here is an image: https://drive.google.com/open?id=1CGJwbmf5snoXvwlQAsRAxIRRixbT_Q8l
I also tried this: https://github.com/WZBSocialScienceCenter/pdftabextract
but that method didn't work for me at all.
I want a tabular structure of this table from the OCR data for further processing.

pdftabextract is not an OCR tool. It requires scanned pages that already have OCR
information, i.e. a "sandwich PDF" that contains both the scanned
images and the recognized text. You need software like Tesseract or
ABBYY FineReader for the OCR step.
Please try Tesseract; it is relatively easy to work with.
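If you go the Tesseract route, pytesseract's image_to_data gives you word-level bounding boxes that you can group into rows yourself. A minimal sketch, assuming your table is in a file called table.png and that a fixed row_tolerance is good enough for your scan:

# A minimal sketch: run Tesseract via pytesseract, get word boxes,
# and group words into rows by their vertical position.
# "table.png" is a placeholder path; tune row_tolerance for your image.
from PIL import Image
import pytesseract

data = pytesseract.image_to_data(Image.open("table.png"),
                                 output_type=pytesseract.Output.DICT)

rows = {}           # maps an approximate y position to the words on that row
row_tolerance = 10  # pixels; words within this vertical distance share a row

for text, left, top, conf in zip(data["text"], data["left"],
                                 data["top"], data["conf"]):
    if not text.strip() or float(conf) < 0:   # skip empty boxes / non-word levels
        continue
    # reuse an existing row whose y position is close to this word's top edge
    key = next((y for y in rows if abs(y - top) <= row_tolerance), top)
    rows.setdefault(key, []).append((left, text))

# print rows top-to-bottom, words left-to-right, as a rough tab-separated table
for y in sorted(rows):
    print("\t".join(word for _, word in sorted(rows[y])))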

Related

EasyOCR - Table extraction

I use easyocr to extract tables from photos or scanned PDFs, but I have a problem turning the extracted data into a proper table.
I tried to make a searchable PDF from the extracted coordinates, but when I convert it to CSV the rows are not aligned.
I would appreciate it if someone could guide me on this.
As far as I know, easyocr currently does not support table recognition. The best table recognition I know of is PaddleOCR's PP-Structure model. This is what I use now, and the results are very good. You can try it.
link: https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppstructure/README.md
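A minimal sketch of how PP-Structure is typically called, assuming paddlepaddle and paddleocr are installed; the file name is a placeholder and the exact arguments/result keys may differ between versions:

# A sketch using PaddleOCR's PP-Structure for table recognition.
# Assumes `pip install paddlepaddle paddleocr`; "table.jpg" is a placeholder.
import cv2
from paddleocr import PPStructure

table_engine = PPStructure(show_log=False)   # layout + table analysis pipeline
img = cv2.imread("table.jpg")
result = table_engine(img)

for region in result:
    # table regions carry the reconstructed table as HTML under res['html']
    # (key names may vary slightly between paddleocr versions)
    if region["type"].lower() == "table":
        print(region["res"]["html"])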

How to detect the unknown text language in an image?

I have N scanned images. The images contain different languages like Chinese, Arabic and Japanese. I tried to run OCR on the files using OCRmyPDF and Tesseract; both require the language of the input file, but I don't know the language of each file. So my question is: how do I identify the language(s) used in an image?
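One possible starting point (just a sketch, not a full answer): Tesseract's orientation-and-script detection (OSD) can guess the script of a page via pytesseract.image_to_osd, which narrows down which language packs to try. The file name and the script-to-language mapping below are assumptions:

# A sketch: use Tesseract's OSD to guess the script of a scanned page,
# then OCR with a matching language pack. "page.png" is a placeholder;
# requires the Tesseract OSD data and the relevant language packs.
from PIL import Image
import pytesseract

osd = pytesseract.image_to_osd(Image.open("page.png"))
print(osd)   # includes a line like "Script: Han" or "Script: Arabic"

script = next(line.split(":")[1].strip()
              for line in osd.splitlines() if line.startswith("Script:"))

# rough mapping from detected script to a Tesseract language code (assumption)
script_to_lang = {"Han": "chi_sim", "Arabic": "ara", "Japanese": "jpn"}
lang = script_to_lang.get(script, "eng")
text = pytesseract.image_to_string(Image.open("page.png"), lang=lang)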

OCR / ICR for handwriting and logos

I have to do the following tasks in Python and I have no idea where
to begin:
OCR of handwritten dates
Page/document orientation detection for pre-treatment
Stamp and logo detection and classification
a. Orientation variation included
b. Quality degradation to be considered
c. Overlying primary content
Could anybody help me?
Thanks in advance!
You can use ocrmypdf to extract text from a PDF. It runs OCR on every page and returns a PDF that looks like the original but with a text layer on it. For detection of logos you will need to implement a computer-vision-based model. If you need more details, please specify your requirements in more detail.
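For the text-layer part, a minimal sketch of ocrmypdf's Python API (the file names are placeholders):

# A sketch: add a text layer to a scanned PDF with ocrmypdf.
# rotate_pages/deskew also help with the orientation-detection requirement.
# "scan.pdf" / "scan_ocr.pdf" are placeholder paths.
import ocrmypdf

ocrmypdf.ocr("scan.pdf", "scan_ocr.pdf",
             rotate_pages=True,   # try to fix page orientation
             deskew=True)         # straighten slightly rotated scans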

How to extract only ID photo from CV with pdfimages

Hi, I tried to use pdfimages to extract the ID photos from my PDF resume files. However, for some files it also returns icons, table lines and border images which are totally irrelevant.
Is there any way I can limit it to extracting only the person's photo? I am thinking we could define certain size constraints on the output?
You need a way of differentiating the images found in the PDF in order to extract the ones of interest.
I believe you have the option of considering:
1) Image characteristics such as width, height, bits per component, color space
2) Metadata information about the image (e.g. an XMP tag of interest)
3) Facial recognition of the person in the photo, or form recognition of the structure of the ID itself
4) Extracting all of the images and then using some image-processing code to analyze them and identify the ones of interest
I think 2) may be the most reliable method, if the author of the PDF included such information with the photo IDs. 3) may be difficult to implement and to get a consistently reliable result from. 1) will only work if those characteristics are a reliable means of identifying such photo IDs in your PDF documents.
Then you could key off that information using your extraction tool (if it lets you do that). Otherwise you would need to write your own extraction tool using a PDF library.
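As a rough sketch of option 1) combined with 4), the snippet below uses PyMuPDF (fitz) to pull out only images whose dimensions and aspect ratio plausibly match a portrait photo; the thresholds and file names are assumptions you would have to tune:

# A sketch: extract images from a PDF with PyMuPDF and keep only those whose
# size/aspect ratio plausibly matches an ID photo. Thresholds are guesses.
import fitz  # PyMuPDF

doc = fitz.open("resume.pdf")       # placeholder path
for page in doc:
    for img in page.get_images(full=True):
        xref = img[0]
        info = doc.extract_image(xref)
        w, h = info["width"], info["height"]
        # skip thin lines/icons; keep roughly portrait-shaped, decent-size images
        if w >= 100 and h >= 100 and 0.6 <= w / h <= 1.0:
            with open(f"photo_{xref}.{info['ext']}", "wb") as f:
                f.write(info["image"])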

Count Images in a pdf document through python

Is there a way to count the number of images (JPEG, PNG, JPG) in a PDF document through Python?
Using pdfimages from poppler-utils
You might want to take a look at pdfimages from the poppler-utils package.
I have taken the sample pdf from - Sample PDF
On running the following command, images present in the pdf are extracted -
pdfimages /home/tata/Desktop/4555c-5055cBrochure.pdf image
Some of the images extracted from this brochure are shown in the screenshots "Extracted Image 1" and "Extracted Image 2".
So, you can use Python's subprocess module to execute this command, and then extract all the images.
Note: There are some drawbacks to this method. It generates images in PPM format, not JPG. Also, some additional images might be extracted that are not actually standalone images in the PDF.
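As a small sketch of that subprocess idea, you can also call pdfimages -list, which only lists the embedded images instead of writing them out, so counting becomes counting rows ("file.pdf" is a placeholder):

# A sketch: count embedded images in a PDF by parsing `pdfimages -list` output.
# Requires poppler-utils to be installed; "file.pdf" is a placeholder path.
import subprocess

out = subprocess.run(["pdfimages", "-list", "file.pdf"],
                     capture_output=True, text=True, check=True).stdout
rows = out.splitlines()[2:]          # first two lines are the column headers
print(f"{len(rows)} images found")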
Using pdfminer
If you want to do this using pdfminer, take a look at this blog post -
Extracting Text & Images from PDF Files
Pdfminer allows you to traverse the layout of a particular PDF page. (The "Layout Objects and Tree Structure" figure from the pdfminer docs shows how the layout objects are organised into a tree.)
Thus, extracting LTFigure objects can help you extract / count images in the pdf document.
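A minimal sketch of that idea with pdfminer.six, walking the layout tree and counting image objects ("file.pdf" is a placeholder):

# A sketch with pdfminer.six: walk the layout tree and count LTImage objects
# (they usually sit inside LTFigure containers). "file.pdf" is a placeholder.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure, LTImage

def count_images(objs):
    total = 0
    for obj in objs:
        if isinstance(obj, LTImage):
            total += 1
        elif isinstance(obj, LTFigure):     # figures can nest, so recurse
            total += count_images(obj)
        # other layout objects (text boxes, lines, ...) are ignored
    return total

print(sum(count_images(page) for page in extract_pages("file.pdf")))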
Note: Please note that both of these methods might not be accurate, and their accuracy is highly dependent on the type of pdf document you are dealing with.
I don't think this can be done directly, although I have done something similar using the following approach (a rough sketch follows the list):
Use Ghostscript to convert the PDF into page images.
On each page, use computer vision (OpenCV) to extract the areas of interest (in your case, images).
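A rough sketch of that two-step approach, assuming Ghostscript and OpenCV are installed; the rendering resolution, thresholds and file names are all placeholders to tune:

# A rough sketch: render pages with Ghostscript, then look for large
# rectangular regions with OpenCV. Thresholds are guesses to tune.
import glob
import subprocess
import cv2

# 1) render every PDF page to a PNG (placeholder file names, 150 dpi)
subprocess.run(["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r150",
                "-sOutputFile=page_%03d.png", "file.pdf"], check=True)

# 2) on each page image, count big contours as candidate embedded images
count = 0
for path in sorted(glob.glob("page_*.png")):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, mask = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    count += sum(1 for c in contours if cv2.contourArea(c) > 10000)

print(f"~{count} candidate image regions")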
