Finding the title of a PDF with Python - python

I have a PDF file, and I would like to extract its title into a string. By title I don't mean the title in the metadata, but the actual title written in the document. For example, from here I'd like to get "Official SAT® Practice Test 2014-15"
Is there any way to accomplish this?

I would take a look at PDFMiner. Essentially you can load your PDF programatically. Then you will need to do some type of analysis to figure out how to extract the title. Perhaps you try using the first until new line break, or some type of algorithmic approach. I recommend using a large set of PDFs where you know the title, and run your program against them to test to see if you successfully detect the title. Then you can use that code to process the PDFs where you don't know the title. This technique is commonly referred to as using a training set.

Related

Best way to get text of a pdf file?

How can I read pdf in python? I know one way of converting it to text, but I want to read the content directly from pdf.
Can anyone explain which module in python is best for pdf extraction?
I tried to use PyPDF2 package but it gives me inconsistent results. Also, i would like a lot to have a way to get the tables, the images, and remove the headers and the footers at least consistently, it doesn't need to happens 100% of the times. Thanks for your answers, i just need to find the right library. Thanks!
From another post that asked pretty much the same:
The answer depends if the question is general or specific to a single form. Your approach is reasonable for the general case, but there will be variability. If you have a pdf form that is a single form or report that has been created with different data at each iteration consider converting the form from pdf to postscript then see if you can parse the postscript.
Two utilities do this: pdf2ps and pdftops Try each. This approach may benefit if you know some postscript. With some luck the needed fields may be simple text strings. Worth a try.

How to convert pdf to xml /json using python code

Can any one help me on how to convert pdf file to xml file using python code? My pdf contains:
Unstructured data
It has images
Mathematical equations
Chemical Equations
Table Data
Logo's tag's etc.
I tried using PDFMiner, but my pdf data was not converted into .xml/json file format. Are there any libraries other than PDFMiner? PyPDF2, Tabula-py, PDFQuery, comelot, PyMuPDF, pdf to dox, pandas- these other libraries/utilities all not suitable for my requirement.
Please advise me on any other options. Thank you.
The first thing I would recommend you trying is GROBID (see here for the full documentation). You can play with an online demo here to see if fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this from the Allen Institute (it is based on GROBID and has a handy function for converting TEI.XML to JSON).
The other package which--obviously--does a good job is the Adobe PDF Extract API (see here). It's of course a paid service but when you register for an account you get 1.000 document transactions for free. It's easy to implement in Python, well documented, and a good way for experimenting and getting a feel for the difficulties of reliable data extraction from PDF.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.

How to create a dynamic form with python using translated text as input?

I have an original text that I want to translate. I normally do it manually but I know I could save a lot of time translating automatically the most frequent words and expressions.
I will find out how to translate simple words, the problem is not here. I have read some books on python and I think using string manipulations can be done.
But I am lost about how to create the output file.
The output file will contain:
short empty forms ready to be filled wherever there is text that has not been translated
the translated words wherever they were in the original file
In the output file I will fill manually the empty forms, after pressing Tab the cursor should jump to the next exmpty form
I am lost here, I know how to do forms on html but the language I am used to is Python.
I would like to know what modules from Python I could use. I need some guidance on this.
Can you recommend me a book or a tool that explains how to do something similar to this?
This is what I want to do, assuming I have managed to create a simple database to translate colors from Spanish to English.
The first step contains the original file.
The second step contains the automatic translation.
In the third step I complete the manual translation.
After finishing everything is grouped into a normal txt file ready to be used.
I think it is quite clear. I don't expect people to tell me the code to do this, I just need to know what tools could be used to achieve my goal.
Thanks for editing.
To create an interface that works with a web browser, Flask for Python is a good method for creating webforms. There are tutorials available.
One method for storing data would be an SQLite file. That may be more than you need, so I'd recommend starting with a CSV file. Libraries exist in Python for both CSVs and SQLite.

Extract Geometry Elements from PDF by OCG (by Layer)

So I've spent the good majority of a month on this issue. I'm looking for a way to extract geometry elements (polylines, text, arcs, etc.) from a vectorized PDF organised by the file's OCGs (Optional Content Groups), which are basically PDF layers. Using PDFminer I was able to extract geometry (LTCurves, LTTextBoxes, LTLines, etc.); using PyPDF2, I was able to view how many OCGs were in the PDF, though I was not able to access geometry associated with that OCG. There were a few hacky scripts I've seen and tried online that may have been able to solve this problem, but to no avail. I even resorted to opening the raw PDF data in a text editor and half hazardly removing parts of it to see if I could come up with some custom parsing technique to do this, but again to no avail. Adobe's PDF manual is minimal at best, so that was no help when I was attempting to create a parser. Does anyone know a solution to this.
At this point, I'm open to a solution in any language, using any OS (though I would prefer a solution using Python 3 on Windows or Linux), as long as it is open source / free.
Can anyone here help end this rabbit hole of darkness? Much appreciated!
A PDF document consists of two "types" of data. There is an object oriented "structure" to the document to divide it into pages, and carry meta data (like, for instance, there is this list of Optional Content Groups), and there is a stream oriented list of marking operators that actually "draw" content onto the page.
The fact that there are OCG's, and their names, and a bit about them is stored on object oriented content, and can be extracted by parsing the object content fairly easily. But the membership of the OCG's is NOT stored in the object structure. It can only be found by parsing the Content Stream. A group of marking operators is a member of a particular OCG group when it is preceeded by the content operator /OC /optionacontentgroupname BDC and followed by the operator EMC.
Parsing a content stream is a less than trivial task. There are many tools out there that will do this for you. I would not, myself, attempt to build such a parser from scratch. There is little value in re-writing the wheel.
The complete syntax of PDF is available from many sources. Search the web for "PDF Specification 1.7", or "ISO32000-1:2008". It is a daunting document to read, but it does supply all of the information needed to create both and object and a content parser
If your PDF is organized in OGC layers, then you can use gdal_translate command of GDAL.
Use the following command to check all available OGC layers in your PDF file:
gdalinfo "sample.pdf" -mdd LAYERS
Then, use the following to command to extract the partiular layer:
gdal_translate "sample.pdf" -of PNG sample.png --config GDAL_PDF_LAYERS "your_specific_layer_name"
More details are mentioned here.
Hey #pythonic_programmer, I am able to use this python library pdflayers to disable the default view (visible/not visible) of the layer into the new pdf file.
https://pypi.org/project/pdflayers/
Pretty much it means disable the default state of the layer
in the pdf file: https://helpx.adobe.com/acrobat/using/pdf-layers.html
Any layer not visible meaning that layer will not render to the pdf document when you process (by default).

Automatically create title from a given text

I am trying to write a program that will give an apt title when an article is give ( usually an abstract). Is there any standard algorithm available?
If you want to do it by hand, you'd have to start with something like word frequency counting, then analyzing phrases that appear a lot or words that appear around each other. I have only briefly touched this topic in Java, but there seems to be a good book for Python that deals with text analysis:
Text Processing in Python
OpenFTS, an open full text search engine has a Python interface, called [PyFTS].3
Check it out. Maybe that's what you want.

Categories

Resources