I would like to be able to convert a word, for example "FOO", into an image file of that word, and to be able to control how much of the image the word takes up. I have tried googling this extensively, but all the results are about how to convert a Word document into an image, and what I need to do doesn't involve a document. I also need to automate it for thousands of words, so it needs to be in Python or R, which are the two languages I know.
I have two text files and am writing a Python program to do the following:
1) Compare the pairs of sentences in the 1st file, see whether they are in the same or different blocks, and compare that to the 2nd text file.
2) Calculate the percentage of correct classifications.
3) Count:
the % of sentence pairs correctly classified as in the same block, and the % of sentence pairs correctly classified as in different blocks.
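Since the file layout isn't specified, here is only a minimal sketch of the pairwise counting itself. It assumes each file has already been reduced to a list of block labels, one per sentence, in the same sentence order, and that the 2nd file is the reference. The function name `pairwise_agreement` and that representation are assumptions, not something given in the question.

```python
from itertools import combinations


def pairwise_agreement(predicted, reference):
    """Compare same-block / different-block decisions for every sentence pair.

    Both arguments are lists of block labels, one label per sentence, in the
    same sentence order (an assumed representation of the two files).
    """
    if len(predicted) != len(reference):
        raise ValueError("both files must label the same sentences")

    same_total = same_correct = diff_total = diff_correct = 0
    for i, j in combinations(range(len(reference)), 2):
        ref_same = reference[i] == reference[j]
        pred_same = predicted[i] == predicted[j]
        if ref_same:
            same_total += 1
            same_correct += pred_same
        else:
            diff_total += 1
            diff_correct += not pred_same

    def pct(part, whole):
        # Avoid dividing by zero when a category has no pairs at all.
        return 100.0 * part / whole if whole else 0.0

    return {
        "overall": pct(same_correct + diff_correct, same_total + diff_total),
        "same_block": pct(same_correct, same_total),
        "different_block": pct(diff_correct, diff_total),
    }


# Hypothetical labels: 4 sentences, 1st file vs. 2nd (reference) file.
result = pairwise_agreement(predicted=[0, 0, 1, 1], reference=[0, 0, 0, 1])
```

The labels only need to be comparable within one file, so block numbers from the two files do not have to match each other.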
Please provide more information about the structure of the files. Also, a reminder: Stack Overflow helps you solve your problem once you provide additional information; we are not going to code the solution for you.
Provide additional information, and your code if possible, so that we can help you out.
Have a great day.
I have a lot of PDF, DOC[X], TIFF and other files (scans from a shared folder). Each file is converted into a pack of text files: one text file per page.
Each pack of files can contain multiple documents (for example, three contracts). The documents are not necessarily contracts.
While processing a pack, I don't know what kinds of documents it contains, and one pack may contain multiple document kinds (contracts, invoices, etc.).
I'm looking for possible approaches to splitting each pack into its individual documents programmatically.
I tried searching for something like this, without any success.
UPD: I tried creating a binary classifier with scikit-learn and am now looking for another solution.
Since these are "scans", this sounds at its core more like something that could be approached with computer vision; however, that is currently far above my level of programming.
Projects like SimpleCV may be a good starting point:
http://www.simplecv.org/
Or possibly you could get away with running OCR on the "scans" and working from the contents. pytesseract seems popular for this type of task:
https://pypi.org/project/pytesseract/
However, that still leaves open how you would tell your program that a given part of the image means there are 3 separate contracts. Is there anything about these files in particular that makes this clear, e.g. "1 of 3" on the pages, a logo, or something else? That will be the main factor determining how complex a problem you are trying to solve.
The best solution was to create a binary classifier (SGDClassifier) and train it on two classes, first-page and not-first-page. Each item in the dataset was trimmed to 100 tokens (words).
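A rough sketch of how the first-page predictions can then be used to split a pack into documents. The classifier itself is not shown: `is_first_page` is a hypothetical callable standing in for the trained SGDClassifier, and both function names are made up for illustration.

```python
def trim_tokens(text, max_tokens=100):
    """Keep only the first max_tokens whitespace-separated tokens of a page."""
    return " ".join(text.split()[:max_tokens])


def split_into_documents(pages, is_first_page):
    """Group consecutive pages into documents.

    pages: list of page texts in scan order.
    is_first_page: callable that returns True when a (trimmed) page starts a
    new document. It stands in for the trained classifier (hypothetical).
    """
    documents = []
    for page in pages:
        starts_new = is_first_page(trim_tokens(page))
        # The very first page always opens a document, whatever the prediction.
        if starts_new or not documents:
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


# Toy stand-in rule instead of a real model, just to show the call shape.
pack = ["CONTRACT No. 1 between A and B", "clause 2 continues",
        "INVOICE 7", "total due"]
docs = split_into_documents(pack, lambda t: t.startswith(("CONTRACT", "INVOICE")))
```

With a real model, `is_first_page` would vectorize the trimmed text and return the classifier's prediction for it.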
I have been endlessly searching for a tool that can extract text from a PDF while maintaining structure. That is, given a text like this:
Title
Subtitle1
Body1
Subtitle2
Body2
OR
Title
Subtitle1. Body1
Subtitle2. Body2
I want a tool that can output a list of titles, subtitles and bodies. Or, if anybody knows how to do this, that would also be useful :)
This would be easier if these 3 categories always had the same format, but sometimes the subtitles can be bold, italic, underlined, or a random combination of the 3. Same for the titles. The problem with simple parsing from HTML/PDF/Docx is that these texts follow no standard, so quite often we encounter sentences split across several tags (in the case of HTML), which makes them really hard to parse. As you can see, the subtitles are not always above a given paragraph, and are sometimes in bullet points. So many possible combinations of formatting...
So far I have found similar questions here, one using Tesseract and one using OpenCV, yet neither of them quite answers my question.
I know that there are some machine learning tools to extract "Table of Contents" sections from scientific papers, but that also does not cut it.
Does anyone know of a package/library for this, or whether such a thing has been implemented yet? Or does anyone know an approach to solving this problem, preferably in Python?
Thank you!
Edit:
The documents I am referring to are 10-Ks from companies, such as this one: https://www.sec.gov/Archives/edgar/data/789019/000119312516662209/d187868d10k.htm#tx187868_10
Say I want to extract Item 7 in a programmatic and structured way, as I mentioned above. Not all of these documents are standardized enough for HTML parsing. (The PDF document is just this HTML saved as a PDF.)
There are tools that can accomplish what you are asking for, up to a certain extent. By "certain extent", I mean that the font properties of headings and titles will be retained after the OCR conversion.
Take a look at Adobe's Document Cloud platform. It is still in the launch stage and will be launching in early 2020. However, developers can have early access by signing up for the early access program. All the information is available in the following link:
https://www.adobe.com/devnet-docs/dcsdk/servicessdk/index.html
I have personally tried out the service and the output seems promising. All headings and titles are recognised as they appear in the input document. The microservice that offers this exact feature is the "ExportPDF" service, which converts a scanned PDF document to a Microsoft Word document.
Sample code is available at: https://www.adobe.com/devnet-docs/dcsdk/servicessdk/howtos.html#export-a-pdf
There is a lot of coding to do here, but let me give you a description of what I would do in Python. This is based on there being some structure in terms of font size and style:
Use the Tesseract OCR software (open source, free) through Pytesseract, with OEM 1 and PSM 11
Preprocess your PDF to an image and apply other relevant preprocessing
Get the output as a dataframe and combine individual words into lines of words by word_num
Compute the thickness of every line of text (by the use of the image and tesseract output)
Convert image to grayscale and invert the image colors
Perform Zhang-Suen thinning on the selected area of text on the image (opencv contribution: cv2.ximgproc.thinning)
Sum where there are white pixels in the thinned image, i.e. where values are equal to 255 (white pixels are letters)
Sum where there are white pixels in the inverted image
Finally compute the thickness: (sum_inverted_pixels - sum_skeleton_pixels) / sum_skeleton_pixels (sometimes this raises a zero division error, so check whether the sum of the skeleton is 0 and return 0 in that case)
Normalize the thickness by minimum and maximum values
Get headers by applying a threshold for when a line of text is bold, e.g. 0.6 or 0.7
To distinguish between a title and a subtitle, you have to rely on either enumerated titles and subtitles or the size of the title and subtitle.
Calculate the font size of every word by converting height in pixels to height in points
The median font size becomes the local font size for every line of text
Finally, you can categorize lines as titles or subtitles, and everything else can be treated as text.
Note that there are ways to detect tables, footers, etc. which I will not dive deeper into. Look for research papers like the ones below.
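The arithmetic in the steps above can be sketched in plain Python. This assumes the pixel sums and word heights were already measured from the image and the Tesseract output. The function names, the 300 DPI default and the 0.6 default threshold are illustrative assumptions, and treating a word's box height as its font size is only a rough approximation.

```python
from statistics import median


def stroke_thickness(sum_inverted_pixels, sum_skeleton_pixels):
    """(inverted - skeleton) / skeleton, or 0 when the skeleton is empty."""
    if sum_skeleton_pixels == 0:
        return 0.0
    return (sum_inverted_pixels - sum_skeleton_pixels) / sum_skeleton_pixels


def min_max_normalize(values):
    """Rescale values to the 0..1 range using their minimum and maximum."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0] * len(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def bold_flags(thicknesses, threshold=0.6):
    """Mark a line as bold when its normalized thickness reaches the threshold."""
    return [score >= threshold for score in min_max_normalize(thicknesses)]


def pixels_to_points(height_px, dpi=300):
    """Convert a height in pixels to points (1 point = 1/72 inch)."""
    return height_px * 72.0 / dpi


def line_font_size(word_heights_px, dpi=300):
    """Median word height of a line, in points, used as the line's font size."""
    return median(pixels_to_points(h, dpi) for h in word_heights_px)
```

The DPI must match the resolution at which the PDF page was rendered to an image, otherwise the point sizes are off by a constant factor.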
Relevant research papers:
An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles. DOI: 10.1007/978-3-642-40501-3_15.
Image-based logical document structure recognition. DOI: 10.1007/978-3-642-40501-3_15.
I did some research and experiments on this topic, so let me try giving a few of the hints I got from the job, which is still far from perfect.
I haven't found any reliable library to do it, although, given the time and possibly the competence (I am still relatively inexperienced in reading others' code), I would have liked to check out some of the work out there, one project in particular (parsr).
I did reach some decent results in headers/title recognition by applying filters to Tesseract's hOCR output. It requires extensive work, i.e.
1. OCR the PDF
2. Properly parse the resulting hOCR, so that you can access its paragraphs, lines and words
3. Scan each line's height, by splitting their bounding boxes
4. Scan each word's width and height, again splitting bounding boxes, and keep track of them
5. Heights are needed to intercept false positives, because line heights are sometimes inflated
6. Find the most frequent line height, so that you have a baseline for the general base font
7. Start by identifying the lines whose height is greater than the baseline found in #6
8. Eliminate false positives by checking whether the max height of the line's words matches the line's own height; otherwise use the max word height of each line to compare against the #6 baseline.
9. Now you have a few candidates, and you want to check that
a. The candidate line does not belong to a paragraph whose other lines do not respect the same height, unless it's the first line (sometimes Tesseract joins the heading with the paragraph).
b. The line does not end with "." or "," and possibly other markers that rule out a title/heading
The list runs quite a bit longer. E.g. you might also want to apply some other criteria,
like comparing the widths of the same words: if in a line you find more than a certain number of words (I use >= 50%) that are larger than average, compared to the same word elsewhere in the document, you almost certainly have a good candidate header or title. (Titles and headers typically contain words that also appear elsewhere in the document, often multiple times.)
Another criterion is checking for all-caps lines, and a reinforcement can be single-liners (lines that belong to a paragraph with just one line).
Sorry I can't post any code (*), but hopefully you got the gist.
It's not exactly an easy feat and requires a lot of work if you don't use ML. Not sure how much ML would make it faster either, because there's a ton of PDFs out there, and probably the big guys (Adobe, Google, Abbyy, etc) trained their models for quite a while.
(*) My code is in JS, and it's deeply intertwined with a large converting application, which so far I can't release as open source. I am reasonably sure you can do the job in Python, although JS DOM manipulation might be somewhat of an advantage there.
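For readers who want a starting point in Python, here is a minimal sketch of only the first filters described above: the most frequent line height as the baseline, taller lines as candidates, and a trailing "." or "," ruling a line out. It is not the JS code mentioned in this answer. It assumes hOCR lines are `span` elements with class `ocr_line` and a `bbox x0 y0 x1 y1` entry in their `title` attribute. The class `HocrLines`, the function `heading_candidates` and the 1.2 `tolerance` factor are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLines(HTMLParser):
    """Collect the height and text of every ocr_line span in an hOCR string."""

    def __init__(self):
        super().__init__()
        self.lines = []       # finished lines: {"height": int, "text": str}
        self._current = None  # line currently being read, if any
        self._depth = 0       # span nesting depth inside the current line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1  # a nested span, e.g. a word inside the line
            return
        attrs = dict(attrs)
        if "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                x0, y0, x1, y1 = map(int, match.groups())
                self._current = {"height": y1 - y0, "words": []}
                self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:  # the line's own closing tag
                words = self._current.pop("words")
                self._current["text"] = " ".join(words)
                self.lines.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self._current["words"].append(data.strip())


def heading_candidates(hocr, tolerance=1.2):
    """Return texts of lines clearly taller than the most frequent line height."""
    parser = HocrLines()
    parser.feed(hocr)
    if not parser.lines:
        return []
    heights = Counter(line["height"] for line in parser.lines)
    baseline = heights.most_common(1)[0][0]
    return [
        line["text"]
        for line in parser.lines
        if line["height"] > baseline * tolerance
        and not line["text"].endswith((".", ","))
    ]
```

The word-level checks (max word height per line, same-word width comparison, paragraph membership) would be layered on top of this in the same way.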
I am quite new to OCR and to Tesseract.
So far I have a working script that extracts fairly good text from images.
My question: is it possible to train Tesseract to return only words/characters present in some kind of dictionary file?
For example, I have a .txt file with a big list of person names, and I want to teach Tesseract that "SONIA" is not "50NlA" and "YANNICK" is not "VANNlD", etc.
If it has a list of all possible names, will it be able to give better accuracy? If the original image is a text with a lot of person names and other information about those people, but I only want to retrieve the names from the OCR and ignore the "noisy information", what can I do? Sorry if this is a stupid question.
I have read this https://groups.google.com/forum/#!topic/tesseract-ocr/r5qkHxQOT98 and the manual http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html and created the eng.user-words and the bazaar files. What should the next step be? It still gives me the same output.
Thanks so much for your time and patience.
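One option that sidesteps retraining is to post-correct the OCR output against the name list. Note this is a different technique from restricting Tesseract's own dictionary. A minimal sketch using the standard library's `difflib`, assuming the names in the list are all upper case; the confusion table and the function name `correct_name` are illustrative assumptions.

```python
from difflib import get_close_matches

# Common OCR confusions between digits/symbols and letters. This table is an
# assumed, incomplete example; extend it from your own error cases.
CONFUSIONS = str.maketrans({"5": "S", "0": "O", "1": "I", "l": "I", "|": "I"})


def correct_name(token, names, cutoff=0.6):
    """Map an OCR token to the closest known name, or None if nothing is close.

    names: list of known names in upper case (assumed format of the .txt list).
    cutoff: minimum difflib similarity (0..1) required to accept a match.
    """
    cleaned = token.translate(CONFUSIONS).upper()
    matches = get_close_matches(cleaned, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


names = ["SONIA", "YANNICK", "MARIA"]
fixed = [correct_name(t, names) for t in ["50NlA", "VANNlD", "INVOICE"]]
# Tokens that map to None can be dropped as "noisy information".
```

Raising `cutoff` makes the filter stricter, which discards more noise but also more badly misread names.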