I am working on a piece of software that analyses PDF files and generates HTML based on them. There are a number of things out there that already do this so I know it is possible, I have to write my own for business reasons.
I have managed to get all the text information, positions, fonts out of the PDF but I am struggling to read out the colour of the text. I am currently using PDFMiner to analyse the PDF but am beginning to think I will need to write my own PDFReader, even so, I can't figure out where in the document the Colour information for text is even kept! I have even read the PDF spec but cannot find the information I need.
I have scoured google, with no joy.
Thanks in advance!
The colour for text and other filled graphics is set using one of the g, rg or k operators in the content stream object in the PDF file, as described in section 4.5.7 Color Operators in the PDF reference manual.
The example G.3 Simple Graphics Example in the reference manual shows these operators being used to stroke and fill some shapes (but not text).
http://www.adobe.com/devnet/pdf/pdf_reference.html
When parsing a PDF file yourself you start by reading the trailer
at the end of the file which contains the file offset of the
cross reference table. This table contains the file offset of
each object in the PDF file. The objects are in a tree structure with references
to other objects. One of the objects will be
the content stream. This is described in sections 3.4 File Structure
and 3.6 Document Structure in the PDF reference manual.
It is possible to parse the PDF file yourself but this is
quite a lot of work. The content
stream may be compressed, contain references to other objects,
contain comments, etc. and you must handle all of these cases.
The PDFMiner software is already reading the content stream. Perhaps it
would be easier to extend PDFMiner to report the colour
of the text too?
Related
I have an SBI bank statement PDF which is tampered/forged. Here is the link for the PDF.
This PDF is edited using online editor www.ilovepdf.com. The edited part is the first entry under the 'Credit' column. Original entry was '2,412.00' and I have modified it to '12.00'.
Is there any programmatic way either using Python or any other opensource technology to identify the edited/modified location/area of the PDF (i.e. BBOX(Bounding Box) around 12.00 credit entry in this PDF)?
2 things I already know:
Metadata (Info or XMP metadata) is not useful. Modify date of the metadata doesn't confirm if the PDF is compressed or indeed edited, it will change the modify date in both these cases. Also it doesn't give the location of the edit been done.
PyMuPDF SPANS JSON object is also not useful as the edited entry doesn't come at the end of the SPANS JSON, instead it's in the proper order of the text inside the PDF. Here is the SPAN JSON file generated from PyMuPDF.
Kindly let me know if anyone has any opensource solution to resolve this problem.
iLovePDF completely changes the whole text in the document. You can even see this, just open the original and the manipulated PDFs in two Acrobat Reader tabs and switch back and forth between them, you'll see nearly all letters move a bit.
Internally iLovePDF also rewrote the PDF completely according to its own preferences, and the edit fits in perfectly.
Thus, no, you cannot recognize the manipulated text based on this document alone because it technically is a completely different, a completely new one.
I will explain my dilemma first: I have several thousand powerpoint files (.ppt) that I need to extract the text. The problem is the text is is disorganized in the file and when read as a complete page it makes no sense for what I need (it would read in the example: line 1, line 3, line 2, line 4, line 5).
I was using tika to read the files initially. I then thought if I converted to pdf using glob and win32com.client that I would have some better luck but it's basically the same result. The picture here is an example of what the text is like.
So now my idea now is if I can section the pdf or ppt by pixel location (and save to separate temp files if needed, opened, and read that way) I can keep things in order and get what I need. Although the text moves around within each box, the black outline boxes are always roughly in the same location.
I cannot find anything to split an individual pdf page though, only multiple pages into a single page. Does anyone have an idea how to go about doing this?
I need to read the text in box one together (line 1 and line 2) and load into a dictionary or some other container, and the same for the second box. For reference there is only one slide in the powerpoint.
Allow me to provide the answer as a general guideline:
Both .ppt and .pptx files are glorified .zip files.
Use 7-zip or WinZip to open the .pptx and understand the structure.
Convert them into a .pptx file.
Each slide should now have a .xml file full of tags you can parse.
For example you will find tags for each text box with tags for that box's text nested inside.
Also: python-pptx
Mass convert by tweaking this VBA code: Link for VBA
Or using PowerShell: Link for [PowerShell]
I am currently developing a proprietary PDF parser that can read multiple types of documents with various types of data. Before starting, I was thinking about if reading PowerPoint slides was possible. My employer uses presentation guidelines that requires imagery and background designs - is it possible to build a parser that can read the data from these PowerPoint PDFs without the slide decor getting in the way?
So the workflow would basically be this:
At the end of a project, the project report is delivered in the form of a presentation.
The presentation would be converted to PDF.
The PDF would be submitted to my application.
The application would read the slides and create a data-focused report for quick review.
The goal of the application is to cut down on the amount of reading that needs to be done by a significant amount as some of these presentation reports can be many pages long with not enough time in the day.
Parsing PDFs into structured data is always tricky, as the format is geared towards precise printing, rather than ease of editing or data extraction.
Basically, a PDF contains information like "there's a label with such text at such (x,y) position on a certain page", or things like that.
Basically, you will very likely need some heuristics in order to turn that into structured data.
It will basically be a form of scraping.
Search on your favorite search engine for PDF scraping, or something like that, and it would be a good start.
Also, you may want to look at those similar posts:
PDF Data and Table Scraping to Excel
How to extract table as text from the PDF using Python?
A PowerPoint PDF isn't a type of PDF.
There isn't going to be anything natively in the PDF that identifies elements on the page as being 'slide' graphics the originated from a PowerPoint file for example.
You could try building an algorithm that makes decision about content to drop from the created PDF but that would be tricky and seems like the wrong approach to me.
A better approach would be to "Export" the PPT to text first, e.g. in Microsoft PowerPoint Export it to a RTF file so you get all of the text out and use that directly or then convert that to PDF.
I'm writing an import/export tool for importing docx, pdf, and odt files; in which a book has been written.
We already have a tool for the .epub format, and we'd like to extend the functionality beyond that, so users of the site can have more flexibility.
So far I've looked at PDFMiner and also found out that docx is just based on the openxml format, so the word/document.xml is essentially the file containing the whole thing, and I can parse it with lxml.
The question I have is: I'm hoping to parse the contents of these files, and from that content, extract things like chapter names, images (if any), and chapter text, so that I can fit the content into a data model of:
Book --> o2m --> Chapter --> o2m --> Image
Clearly, PDFMiner has a .get_outlines() function that will return the TOC for me. But it can't link any of the returned tuples (chapter numbers and titles) to the actual pages for that chapter.
Even more problematic is that with docx/odt; those are just paragraphs -- <\w:sdt> -- elements, with attrs and child elements.
I'm looking for idea(s) to extrapolate some sense of structure from these filetypes, and if need be, I can apply those ideas (2 or 3) as suggested formats for our users who wish to import a book via one of those file formats.
Textract is the best tool that i have encountered so far for parsing different file formats.
It can parse most of the file formats.
You can find the project on Github
Here is the official documentation
(Python 3 answer)
When I was looking for a tool to read .docx files, I was able to find one here: http://etienned.github.io/posts/extract-text-from-word-docx-simply/
What it does is simply get the text from a .docx file and return it as a string; separate paragraphs are still clearly separate, as there are the new lines between, but all other formatting is lost. I think this may include the loss of end- and foot-notes, but if you want the body of a text, it works great.
I have tested it on both Windows 10 and on OS X, and it has worked successfully on both. Here is what it imports:
import zipfile
try:
from xml.etree.cElementTree import XML
print("cElementTree")
except ImportError:
from xml.etree.ElementTree import XML
print("ElementTree")
EDIT:
If, in the body of the function, you replace
'word/document.xml'
with
'word/footnotes.xml'
or
'word/endnotes.xml'
you can get the footnotes and endnotes, respectively.
The markers for where they were in the text are lost, however.
I'm using pyPdf to merge several PDF files into one. This works great, but I would also need to add a table of contents/outlines/bookmarks to the PDF file that is generated.
pyPdf seems to have only read support for outlines. Reportlab would allow me to create them, but the opensource version does not support loading PDF files, so that doesn't work to add outlines to an existing file.
Is there any way I can add outlines to an existing PDF using Python, or any library that would allow that?
https://github.com/yutayamamoto/pdfoutline
I made a python library just for adding an outline to an existing PDF file.
It looks like pypdf can do the job. See the add_outline_item method in the documentation.
We had a similar problem in WeasyPrint: cairo produces the PDF files but does not support bookmarks/outlines or hyperlinks. In the end we bit the bullet, read the PDF spec, and did it ourselves.
WeasyPrint’s pdf.py has a simple PDF parser and writer that can add/override PDF "objects" to an existing documents. It uses the PDF "update" mechanism and only append at the end of the file.
This module was made for internal use only but I’m open to refactoring it to make it easier to use in other projects.
However the parser takes a few shortcuts and can not parse all valid PDF files. It may need to be adapted if PyPDF’s output is not as nice as cairo’s. From the module’s docstring:
Rather than trying to parse any valid PDF, we make some assumptions
that hold for cairo in order to simplify the code:
All newlines are '\n', not '\r' or '\r\n'
Except for number 0 (which is always free) there is no "free" object.
Most white space separators are made of a single 0x20 space.
Indirect dictionary objects do not contain '>>' at the start of a line except to mark the end of the object, followed by 'endobj'. (In
other words, '>>' markers for sub-dictionaries are indented.)
The Page Tree is flat: all kids of the root page node are page objects, not page tree nodes.