Uncompress and save zlib data in PDF with python - python

We get PDF files delivered to us daily and we need to get the images out of them. For example, I want to get the image back out of this PDF file with Python. Most PDF files we get are multipage, and we want to export each embedded image to a separate file. Most have JPEG files in them, but this one does not.
Object 5 is embedded as a zlib compressed stream. I am pretty sure it is zlib compressed because it is marked as FlateDecode and the start of the stream is \x78\x9c which is typical for zlib. You can see (part of) the hex dump here
The question is: how do I inflate (decompress) it and save the resulting file?
Thank you for sharing your wisdom.

I searched everywhere and tried many things but couldn't get it to work. I managed to decompress the data like this:
import zlib
with open("MDL1703140088.pdf", "rb") as f:
    pdf = f.read()
# The offsets are specific to this file: the zlib stream of object 5 starts at byte 640.
image = zlib.decompress(pdf[640:69307])
640 is the position of the zlib header (b'x\x9c') and 69307 is the position of the stream terminator defined by the PDF spec; b'\nendstream\n' is there. Details are in the spec, and some helpful Q&A can be found here. Omitting the end position is allowed in this case, because decompress() seems to ignore the non-compressed data that follows. You can verify this with:
decomp = zlib.decompressobj()
image = decomp.decompress(pdf[640:])
print(decomp.unused_data)  # starts with b'\nendstream\n'
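The offsets above were found by hand. As a rough sketch of how they could be located automatically, the snippet below scans for the `stream` keyword and tries to inflate whatever follows. The function name `inflate_streams` is hypothetical, and the scan is an assumption-laden shortcut: it ignores the `/Length` and `/Filter` entries of each object and simply skips anything zlib cannot decode.

```python
import re
import zlib

# "stream" followed by a line break, but not the tail end of "endstream".
STREAM_START = re.compile(rb"(?<!end)stream\r?\n")

def inflate_streams(pdf: bytes):
    """Yield (offset, data) for every stream that zlib can fully inflate."""
    for match in STREAM_START.finditer(pdf):
        decomp = zlib.decompressobj()
        try:
            data = decomp.decompress(pdf[match.end():])
        except zlib.error:
            continue  # not Flate-compressed (for example an embedded JPEG)
        if decomp.eof:  # the zlib stream ended cleanly before "endstream"
            yield match.end(), data
```

Each yielded offset plays the role of 640 in the example above, so no end position has to be supplied.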
So far so good. But when I write image to a PNG file, no image viewer can read it. The decompressed data actually looks mostly empty in places. I tried prepending a PNG header, but no luck. Hey, it's too much...
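One likely explanation is that a FlateDecode image stream holds bare pixel samples, not a PNG file, so prepending a header alone cannot work: a PNG also needs an IHDR chunk, a filter byte per scanline, recompressed IDAT data and an IEND chunk. The sketch below wraps raw samples in that container using only the standard library. It assumes 8 bits per component, DeviceGray or DeviceRGB data and no predictor, and that the width and height were read from the image object's /Width and /Height entries; the function names are hypothetical.

```python
import struct
import zlib

def _png_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

def raw_samples_to_png(samples: bytes, width: int, height: int, channels: int = 3) -> bytes:
    """Wrap 8-bit raw samples (1 = grayscale, 3 = RGB) in a PNG container."""
    color_type = {1: 0, 3: 2}[channels]  # PNG colour types for grayscale and RGB
    stride = width * channels
    if len(samples) < stride * height:
        raise ValueError("not enough sample data for the given dimensions")
    # Every PNG scanline starts with a filter-type byte; 0 means "no filter".
    rows = b"".join(
        b"\x00" + samples[y * stride:(y + 1) * stride] for y in range(height)
    )
    ihdr = struct.pack(">IIBBBBB", width, height, 8, color_type, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", zlib.compress(rows))
        + _png_chunk(b"IEND", b"")
    )
```

If the image dictionary declares other settings (a predictor in /DecodeParms, CMYK or indexed colour, 1-bit samples), this sketch does not apply as written.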
As I said earlier (strangely my comment was removed by someone), you'd better use one of the existing tools. If Acrobat is not an option, what about pdftopng (part of Xpdf)? Running pdftopng MDL1703140088.pdf . gave me a valid PNG file without any trouble. Command-line tools can of course be executed from Python, as you may know.
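A minimal sketch of calling pdftopng from Python follows. It assumes the Xpdf tools are installed and on the PATH, and that output files are named after the given root with a page-number suffix; the helper names are hypothetical. Note that pdftopng renders whole pages, so the result is a picture of each page and not the embedded image itself.

```python
import subprocess
from pathlib import Path

def pdftopng_command(pdf_path: str, output_root: str) -> list[str]:
    """Build the argument list: pdftopng PDF-file PNG-root."""
    return ["pdftopng", pdf_path, output_root]

def render_pages(pdf_path: str, output_dir: str) -> list[Path]:
    """Render every page of pdf_path to PNG files inside output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    root = out / Path(pdf_path).stem
    # check=True raises CalledProcessError if pdftopng exits with an error.
    subprocess.run(pdftopng_command(pdf_path, str(root)), check=True)
    # Assumption: one file per page, named <root>-<page number>.png.
    return sorted(out.glob(f"{root.name}-*.png"))
```

For the daily batch described in the question, `render_pages` could be called once per delivered file.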

Related

Determine the file size of PNG from stream

I came across a file in an obscure database format and would like to recover some information from it. After opening the file in Python, reading it as bytes, and using re.search with the pattern b"\x89\x50\x4e\x47" (the PNG file header), I found a matching offset in the file. On further examination, this is likely the starting position of an actual PNG file (the first 16 bytes in hex are 89504e470d0a1a0a0000000d49484452). However, with no information about the size of this PNG file, how should I determine it programmatically (using the information from the header)? I would appreciate it if an existing PNG debugging tool could be used.
I already tried writing the rest of this database file, starting from this offset, to a PNG file, but that doesn't work: my image viewer reports that the file is corrupted.
PNG files end with an IEND chunk that you can use to determine where to stop.
Seek forward until you reach the IEND chunk type, then include the 4 bytes after it, which hold the chunk's CRC, and extract all the bytes up to that point.
That should give you the full PNG file.
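As a sketch of that idea, the function below walks the chunk list instead of searching for the literal bytes IEND, which is a small variation on the answer: every PNG chunk is a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC, so the end of the file can be computed chunk by chunk. The name `extract_png` is hypothetical, and CRCs are not verified here.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def extract_png(blob: bytes, start: int = 0) -> bytes:
    """Return the first complete PNG file found in blob at or after start."""
    begin = blob.find(PNG_SIGNATURE, start)
    if begin == -1:
        raise ValueError("no PNG signature found")
    pos = begin + len(PNG_SIGNATURE)
    while pos + 8 <= len(blob):
        length, chunk_type = struct.unpack(">I4s", blob[pos:pos + 8])
        pos += 8 + length + 4  # header, data, CRC
        if pos > len(blob):
            break  # chunk claims more data than is available
        if chunk_type == b"IEND":
            return blob[begin:pos]
    raise ValueError("truncated PNG: no complete IEND chunk")
```

The size of the PNG is then simply `len(extract_png(blob))`.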

pdf in python which consist data from .xlsx file and png image

I want to create a PDF using Python 3.x.
The PDF should contain some text data that is stored in an .xlsx file, i.e. the program should read data from the .xlsx file and write it into the .pdf file.
Along with that, the PDF should contain a passport-size PNG image.
I have come up with two basic ideas:
The first is to write a program that creates a text file containing all the data required for the PDF along with the PNG image, and then converts that file into a PDF.
The second is to write a program that creates the PDF file directly, writes the data from the .xlsx file into it, and inserts the image as well.
I don't know whether these ideas are workable or how to implement them. After doing some research on GFG, Stack Overflow and elsewhere, I got thoroughly confused and ended up asking here.
I have tried modules such as PIL, FPDF and reportlab, and I can successfully create a PDF file with either text or images, but I am unable to combine both in the same file.
I am also unsure which idea I should implement.
What I need are answers to a few questions:
Are the ideas I mentioned above (especially the second one) practically possible?
Can I write a program that imports data from a file as well as a PNG image into the same PDF? Which modules and functions would be used, and how?
Please provide code with comments, or explain what each function used does.
I hope to get the desired result soon. Meanwhile I will try to solve it myself.

How to open .fif file format?

I want to open a .fif file of around 800 MB. I googled and found that this kind of file can be opened with Photoshop. Is there a way to extract the images and store them in some other standard format using Python or C++?
This is probably an EEG or MEG data file. The full specification is here, and it can be read in with the MNE package in Python.
import mne
raw = mne.io.read_raw_fif('filename.fif')
FIF stands for Fractal Image Format and seems to be output of the Genuine Fractals Plugin for Adobe's Photoshop. Unfortunately, there is no format specification available and the plugin claims to use patented algorithms so you won't be able to read these files from within your own software.
There however are other tools which can do fractal compression. Here's some information about one example. While this won't allow you to open FIF files from the Genuine Fractals Plugin, it would allow you to compress the original file, if still available.
XnView seems to handle FIF files, but it's windows-only. There is a MP or Multiplatform version, but it seems less complete and didn't work when I tried to view a FIF file.
Update: XnView MP, which does work on Linux and OSX claims to support FIF, but I couldn't get it to work.
Update2: There's also an open source project, Fiasco, that can work with fractal images, but I'm not sure it's compatible with the proprietary FIF format.

Get EXIF data without downloading whole image - Python

Is it possible to get the EXIF information of a remote image while downloading only the EXIF data?
From what I can understand about EXIF bytes in image files, the EXIF data is in the first few bytes of an image.
So the question is how to download only the first few bytes of a remote file, with Python? (Edit: Relying on HTTP Range Header is not good enough, as not all remote hosts support it, in which case full download will occur.)
Can I cancel the download after x bytes of progress, for example?
You can tell the web server to send you only part of a file by setting the HTTP Range header. See this answer for an example using urllib to partially download a file. So you could download a chunk of, say, 1000 bytes, check whether the EXIF data is contained in the chunk, and download more if you can't find the EXIF APP1 header or the EXIF data is incomplete.
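A rough sketch of that loop's two building blocks, for JPEG files only: `fetch_prefix` asks for the first bytes with a Range header and also caps how much it reads, so a server that ignores the header is not read to the end, and `find_exif` looks for a complete EXIF APP1 segment in the bytes received so far. Both function names are hypothetical, and the segment walk assumes a well-formed marker sequence without padding bytes.

```python
import struct
import urllib.request

def fetch_prefix(url: str, size: int) -> bytes:
    """Download at most the first `size` bytes of url."""
    request = urllib.request.Request(url, headers={"Range": f"bytes=0-{size - 1}"})
    with urllib.request.urlopen(request) as response:
        # read(size) limits what we consume even if the server sends the whole file;
        # leaving the with-block closes the connection.
        return response.read(size)

def find_exif(data: bytes):
    """Return the EXIF APP1 payload, or None if it is absent or not yet complete."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("unexpected byte where a marker should be")
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: image data follows, no EXIF ahead
            return None
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length  # the length field counts itself but not the marker
        if marker == 0xE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
            return data[pos + 4:end] if end <= len(data) else None
        pos = end
    return None
```

A caller could start with a small `size`, call `find_exif`, and retry with a larger prefix while the result is None and the file has not been exhausted.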
This depends on the image format heavily. For example, if you have a TIFF file, there is no knowing a priori where the EXIF data, if any, is within the file. It could be right after the header and before the first IFD, but this is unlikely. It could be way after the image data. Chances are it's somewhere in the middle.
If you want the EXIF information, extract that on the server (cache, maybe) and ship that down packaged up nicely instead of demanding client code do that.

Reading Font Colour Information From a PDF

I am working on a piece of software that analyses PDF files and generates HTML based on them. There are a number of tools out there that already do this, so I know it is possible, but I have to write my own for business reasons.
I have managed to get all the text information, positions and fonts out of the PDF, but I am struggling to read the colour of the text. I am currently using PDFMiner to analyse the PDF but am beginning to think I will need to write my own PDF reader. Even so, I can't figure out where in the document the colour information for text is kept! I have even read the PDF spec but cannot find the information I need.
I have scoured google, with no joy.
Thanks in advance!
The colour for text and other filled graphics is set using one of the g, rg or k operators in the content stream object in the PDF file, as described in section 4.5.7 Color Operators in the PDF reference manual.
The example G.3 Simple Graphics Example in the reference manual shows these operators being used to stroke and fill some shapes (but not text).
http://www.adobe.com/devnet/pdf/pdf_reference.html
When parsing a PDF file yourself you start by reading the trailer
at the end of the file which contains the file offset of the
cross reference table. This table contains the file offset of
each object in the PDF file. The objects are in a tree structure with references
to other objects. One of the objects will be
the content stream. This is described in sections 3.4 File Structure
and 3.6 Document Structure in the PDF reference manual.
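As a small sketch of that first step, the function below reads the offset of the cross reference table from the end of the file, where it follows the startxref keyword just before %%EOF. The function name `find_xref_offset` and the 1024-byte tail size are assumptions; files that use cross-reference streams or incremental updates need more handling than this.

```python
import re

def find_xref_offset(pdf: bytes) -> int:
    """Return the byte offset of the last cross reference section."""
    tail = pdf[-1024:]  # assumption: startxref sits within the last kilobyte
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("startxref not found near the end of the file")
    return int(matches[-1])
```

The bytes at the returned offset normally begin with the keyword xref, followed by the table of object offsets.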
It is possible to parse the PDF file yourself but this is
quite a lot of work. The content
stream may be compressed, contain references to other objects,
contain comments, etc. and you must handle all of these cases.
The PDFMiner software is already reading the content stream. Perhaps it
would be easier to extend PDFMiner to report the colour
of the text too?
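To illustrate what the g, rg and k operators look like in practice, here is a deliberately naive sketch that scans an already decompressed content stream for them. The function name `fill_colours` is hypothetical, and the whitespace tokenizer is an assumption that breaks on string literals containing operator-like text; a real parser, such as the one inside PDFMiner, tokenizes properly.

```python
# Number of operands each fill-colour operator takes: gray, RGB, CMYK.
FILL_OPERATORS = {"g": 1, "rg": 3, "k": 4}

def fill_colours(content: bytes):
    """Yield (operator, operands) for each g, rg or k found in a content stream."""
    operands = []
    for token in content.split():
        text = token.decode("latin-1")
        try:
            operands.append(float(text))  # numbers are operands for a later operator
        except ValueError:
            count = FILL_OPERATORS.get(text)
            if count is not None and len(operands) >= count:
                yield text, tuple(operands[-count:])
            operands.clear()  # any non-number ends the current operand run
```

For example, the stream `BT /F1 12 Tf 1 0 0 rg (Red) Tj ET` sets the fill colour to pure red before the text is shown, and the colour stays in effect until another colour operator changes it.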
