I came across a file with an obscure database format and would like to recover some information from it. After using python to open and read the file as bytes, and use re.search with pattern of b"\x89\x50\x4e\x47" (PNG file header), I found an offset in the file that matches this. Upon further examination, it's likely that this is the starting position of an actual PNG file (the first 16 bytes in hex are 89504e470d0a1a0a0000000d49484452). However, with no information regarding the size of this PNG file, how should I determine it programmatically (using the information from the header)? It would be appreciated if some existing PNG debugging tool can be used.
I already tried output the rest of this database file starting from this offset and save it as a PNG, but it doesn't work as my image viewer report the file is corrupted.
PNG images have footers that you can utilize to determine when to stop.
You can seek up until IEND, then move right 4 bytes to capture the CRC data, and extract all the bytes up to that point.
After which, you should be able to get the full PNG file.
Related
I wanted to create a pdf using Python 3x.
The pdf should have some text data which is stored in a .xlsx file i.e.., it should read data from .xlsx file and write into the .pdf file.
Along with that, the pdf should have a png image of passport size.
I have come up with two basic ideas which are:-
First one is by writing a program which create a text file in which all required data from the pdf will be written along with the png image. After that the program will convert it into a pdf file.
Second one is by writing a program which will create the pdf file and write the data from .xlsx file as well as insert the image too into the pdf file.
I don't know whether these ideas can be used or not and how it can be used but after going through some researches on GFG, Stack overflow..., I have got totally confused and ended up asking this problem on this platform.
I have tried some modules like PIL, FPDF, reportlab,.. and am successfully able to create a pdf file with either texts or images but unable to combine both in the same text file.
Also I am confused in deciding which idea I should implement.
What I need from you guys is the answer of few of my questions which are:-
Are the ideas I mentioned above(second one specially) practically possible?
Can I make a program which imports data from file as well as png image into the same pdf. What modules and functions will be used there and how.
Please provide the code with comments or defining/elaborating the work of function used.
I hope I will get the desired result soon. Meanwhile I will try to solve it out by myself.
We get PDF files delivered to us daily and we need to get the images out. For example, what I want to do is to get the image back out of this PDF file I have, with python. Most pdf files we get are multipage and we want to export each embedded image to separate files. Most have jpeg files in them, but his one does not.
Object 5 is embedded as a zlib compressed stream. I am pretty sure it is zlib compressed because it is marked as FlateDecode and the start of the stream is \x78\x9c which is typical for zlib. You can see (part of) the hex dump here
The question is, how do I 'deflate' it and save the resulting file.
Thank you for sharing your wisdom.
I searched everywhere and tried many things but couldn't get to work. I managed to decompress the data like this:
import zlib
with open("MDL1703140088.pdf", "rb") as f:
pdf = f.read()
image = zlib.decompress(pdf[640:69307])
640 is zlib header(b'x\x9c') position and 69307 is the position of something like footer of pdf spec. b'\nendstream\n' is there. Detail is in the spec and some helpful Q&A can be found here. But omitting the end position is allowed in this case because decompress() seems to ignore following non-compressed data. You can validate this by:
decomp = zlib.decompressobj()
image = decomp.decompress(pdf[640:])
print(decomp.unused_data) # starts from b'\nendstream\n
So far so good. But when I write image to a PNG file, it cannot be read by any image viewer. Actually decompressed data looks so quite empty here and there. I attached some PNG header, but no luck. Hey, it's too much...
As I said earlier (strangely my comment was removed by someone), you'd better use some other existing tools. If Acrobat is not your option, what about pdftopng (part of Xpdf)? pdftopng MDL1703140088.pdf . gave me a valid PNG file flawlessly. Obviously command-line tools can be executed in Python, as you may know.
How do I get the play length of an ogg file without downloading the whole file? I know this is possible because both the HTML5 tag and VLC can show the entire play length immediately after loading the URL, without downloading the entire file.
Is there a header or something I can read. Maybe even the bitrate, which I can divide by the file size to get an approximate play length?
Unfortunately there does not appear to be a way to achieve this.
Mozilla's Configuring servers for Ogg media is very instructive. Basically:
Gecko uses the X-Content-Duration header - sent by the web server if it has it. This explains the HTML5 audio streaming example you raised. If missing, then
Gecko estimates the length based on the sample-rate (in the header) and the size of the file from the Content-length HTTP header
The sample rate is stored in the Identification Header - the first header packet. See the specification go to section "4.2 Header decode and decode setup"
This is possible. The way to do it is to use HTTP range requests to fetch the end of the file, find the last Ogg page, and extract the timestamp from it. This is assuming that the file consists of contiguous streams (i.e. no chaining) which all have the same length, and that the stream time starts at 0 (otherwise, you should decode the beginning of the stream and subtract that timestamp from the final timestamp). Decoding the timestamp from the Ogg Page granulepos field is codec-specific (e.g. for Vorbis it is expressed as a number of samples).
Alternatively, if your Ogg file has Ogg Skeleton metadata, you can read that directly to determine the duration of the file.
This is just not possible without download the data itself. You could specify the related information as part of the S3 metadata of the related key. So could write to introspect the metadata before actually downloading the data.
Is is possible to get the EXIF information of an image remotely and with only downloading the EXIF data?
From what I can understand about EXIF bytes in image files, the EXIF data is in the first few bytes of an image.
So the question is how to download only the first few bytes of a remote file, with Python? (Edit: Relying on HTTP Range Header is not good enough, as not all remote hosts support it, in which case full download will occur.)
Can I cancel the download after x bytes of progress, for example?
You can tell the web server to only send you parts of a file by setting the HTTP range header. See This answer for an example using urllib to partially download a file. So you could download a chunk of e.g. 1000 bytes, check if the exif data is contained in the chunk, and download more if you can't find the exif app1 header or the exif data is incomplete.
This depends on the image format heavily. For example, if you have a TIFF file, there is no knowing a priori where the EXIF data, if any, is within the file. It could be right after the header and before the first IFD, but this is unlikely. It could be way after the image data. Chances are it's somewhere in the middle.
If you want the EXIF information, extract that on the server (cache, maybe) and ship that down packaged up nicely instead of demanding client code do that.
I am working on a piece of software that analyses PDF files and generates HTML based on them. There are a number of things out there that already do this so I know it is possible, I have to write my own for business reasons.
I have managed to get all the text information, positions, fonts out of the PDF but I am struggling to read out the colour of the text. I am currently using PDFMiner to analyse the PDF but am beginning to think I will need to write my own PDFReader, even so, I can't figure out where in the document the Colour information for text is even kept! I have even read the PDF spec but cannot find the information I need.
I have scoured google, with no joy.
Thanks in advance!
The colour for text and other filled graphics is set using one of the g, rg or k operators in the content stream object in the PDF file, as described in section 4.5.7 Color Operators in the PDF reference manual.
The example G.3 Simple Graphics Example in the reference manual shows these operators being used to stroke and fill some shapes (but not text).
http://www.adobe.com/devnet/pdf/pdf_reference.html
When parsing a PDF file yourself you start by reading the trailer
at the end of the file which contains the file offset of the
cross reference table. This table contains the file offset of
each object in the PDF file. The objects are in a tree structure with references
to other objects. One of the objects will be
the content stream. This is described in sections 3.4 File Structure
and 3.6 Document Structure in the PDF reference manual.
It is possible to parse the PDF file yourself but this is
quite a lot of work. The content
stream may be compressed, contain references to other objects,
contain comments, etc. and you must handle all of these cases.
The PDFMiner software is already reading the content stream. Perhaps it
would be easier to extend PDFMiner to report the colour
of the text too?