Extracting bitmap from a file

Given a somewhat complex file of unknown specification that, among other things, contains an uncompressed bitmap file (.BMP), how would you extract it in Python?
Scan for the "BM" tag and see if the following bytes "resemble" a BMP header?

I'd use the Python Imaging Library (PIL) and let it have a go at the data. If it can parse it, then it's a valid image. If it throws an exception, then it isn't.
You need to search for the beginning of the image; if you're lucky, the image reader will ignore garbage after the image data. If it doesn't, use a binary search to locate the end of the image.

Yes, about the only thing you can do is search through the file for the 'BM' marker, pull out the following data into a BITMAPFILEHEADER and corresponding BITMAPINFO, and see if the values in it look valid (i.e. that the dimensions are sensible, colour depth is reasonable, etc).
Once you have found something that looks reasonable, pull that data out and pass it to the library mentioned in another answer.
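The scan-and-validate approach described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name find_bmp_candidates and the max_dim limit are hypothetical, and it assumes a little-endian BMP with a BITMAPINFOHEADER-style header (40 bytes or one of the larger variants) whose file size field is trustworthy.

```python
import struct

def find_bmp_candidates(blob, max_dim=20000):
    """Return (offset, data) pairs for byte runs that look like embedded BMP files.

    max_dim is an arbitrary sanity limit on width/height (an assumption).
    """
    candidates = []
    pos = blob.find(b"BM")
    while pos != -1:
        # Need 14 bytes of BITMAPFILEHEADER plus the first 16 bytes of the info header
        if pos + 30 <= len(blob):
            file_size, _res1, _res2, data_offset = struct.unpack_from("<IHHI", blob, pos + 2)
            header_size, width, height, planes, bpp = struct.unpack_from("<IiiHH", blob, pos + 14)
            plausible = (
                header_size in (40, 52, 56, 108, 124)   # known info header sizes
                and planes == 1
                and bpp in (1, 4, 8, 16, 24, 32)
                and 0 < width <= max_dim
                and 0 < abs(height) <= max_dim          # negative height means top-down rows
                and 14 + header_size <= data_offset <= file_size
                and pos + file_size <= len(blob)
            )
            if plausible:
                candidates.append((pos, blob[pos:pos + file_size]))
        pos = blob.find(b"BM", pos + 1)
    return candidates
```

Each candidate can then be handed to an image library for a final check. If the file size field turns out to be unreliable in your container format, the end of the image has to be found another way, as the other answer suggests.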

Related

How to save jpeg data that is identical to the original jpeg file using OpenCV

I'm using OpenCV and Python. I have loaded a JPEG image into a numpy array. Now I want to save it back into JPEG format, but since the image was not modified, I don't want to compress it again. Is it possible to create a JPEG from the numpy array that is identical to the JPEG it was loaded from?
I know this workflow (decode-encode without doing anything) sounds a bit stupid, but keeping the original jpeg data is not an option. I'm interested if it is possible to recreate the original jpeg just using the data at hand.
The question is different from "Reading a .JPG Image and Saving it without file size change", as I don't modify anything in the picture. I really want to restore the original JPEG file based on the data at hand. I assume one could bypass the compression steps (the compression artifacts are already in the data) and just write the file in JPEG format. The question is whether this is possible with OpenCV.

Clarified answer, following comment below:
What you say makes no sense at all. You say that you have the raw, unmodified RGB data. No, you don't. You have the uncompressed data that has been reconstructed from the compressed JPEG file.
The JPEG standards specify how to decompress an image / video. There is nothing in the standard about how to actually do the compression, so your original image data could have been compressed in any one of a zillion different ways. You have no way of knowing the exact steps that produced the file your data was decoded from, so you cannot reverse them.
Imagine this:
"I have a number, 44. Please tell me how I can get the original numbers that this came from."
This is, essentially, what you are asking.
The only way you can do what you want (other than just copying the original file) is to read the encoded file into an array before decoding it with OpenCV. Then, if you want to save it, just write the raw array to a file, something like this:
    import cv2
    import numpy as np

    fi = 'C:\\Path\\to\\Image.jpg'
    fo = 'C:\\Path\\to\\Copy_Image.jpg'
    with open(fi, 'rb') as myfile:
        # Keep the encoded JPEG bytes exactly as they are on disk
        im_array = np.frombuffer(myfile.read(), dtype=np.uint8)
    # Do stuff here
    image = cv2.imdecode(im_array, cv2.IMREAD_COLOR)  # decoded pixels
    # Do more stuff here
    with open(fo, 'wb') as myfile:
        # Write the untouched encoded bytes, not the decoded image
        myfile.write(im_array.tobytes())
Of course, it means you will have the data stored twice, effectively, in memory, but this seems to me to be your only option.
Sometimes, no matter how hard you want to do something, you have to accept that it just cannot be done.

Using PDFTron in Python, remove all image elements from a PDF with given size characteristics

I'm trying to remove a large number of very small images from a series of PDF documents using the awesome-looking PDFTron library for Python. Basically, I want to create a new PDF by going over each element in an existing PDF file and copying the ones that meet certain size criteria to the new PDF in the same position.
Can someone guide me to PDFTron documentation specifically for Python to help me accomplish this? Or provide a sample script that checks for image size? I think I can do the rest (emphasis on think). The documentation available on the PDFTron website is not specifically for Python, hard to look up what I need...

You can see from the ElementEdit sample how to remove all images from a document:
http://www.pdftron.com/pdfnet/samplecode.html#ElementEdit
"Or provide a sample script that checks for image size?"
Could you clarify what you mean by "image size"? If you mean the image's dimensions as displayed in the PDF page, you can check that using Element.GetBBox. If you mean the dimensions of the original image, you could check that using Element.GetImageWidth and Element.GetImageHeight (see http://www.pdftron.com/pdfnet/samplecode.html#ImageExtract). Also, Image.GetImageDataSize gives you the size of the image data in bytes.

Best way of Getting Swf File Dimensions with Python

Edit:
I'm considering hexagonit, and swfTools. Does anyone have any other solutions, or insight?
Edit:
New Question - how to solve this error:
I tried using hexagonit.swfheader however I receive the error:
f = 'out/'+w+"/"+s+"/"+"theSwf/"+s
data = hexagonit.swfheader.parse(f)
File "/Library/Python/2.7/site-packages/hexagonit/swfheader/__init__.py", line 26, in parse
signature = ''.join(struct.unpack('<3c', input.read(3)))
struct.error: unpack requires a string argument of length 3
After tracing through this I found that the error occurs here:
    def parse(input):
        """Parses the header information from an SWF file."""
        need_close = False
        if hasattr(input, 'read'):
            input.seek(0)
        else:
            input = open(input, 'rb')  # (*)
            need_close = True
(*) marks where the error occurs.
I read: Get dimensions from a flash file (.swf) in pure-Python
I have also tried using PIL and Pillow; however, I am under the impression that they work with images, not SWF files. I decompiled the SWF files I'm looking at, but I also have the SWF files themselves.
I would like to know what size the file displays as (dimensions).
My first thought was to try using image size comparison.
My issue with this is that some images that are used as assets in the swf are actually larger than the swf itself, otherwise I would use PIL to simply get the dimensions of the largest image asset (ex the background).
My other issue is that I would need to compare SVG and PNG files equally, and to my knowledge PIL and Pillow do not handle SVG files.
My second idea was to search the actionscript code for the dimensions.
Some files have something like 300x300 in their ActionScript, which denotes the size. Unfortunately, most of the files I am working with do not, which means this is largely unhelpful.
My 3rd idea was to ignore the decompiled swf data and rather focus on the swf itself.
I could in theory either try to find the dimensions in the bytecode (or use a library that does this, which I still need to find, as PIL and Pillow do not appear to work), or run the ad, screenshot it, try to find where the ad starts and stops, and calculate the pixels based on that. My problem with screenshotting it is that the image may blend into the background, making it hard if not impossible to get the correct dimensions. More importantly, many SWFs cannot be played due to security restrictions if they are not played from the right URL, etc.
Thus I'm left with a dilemma. I think the best way to go about this would be to use the swf file itself.

Take a look at the official SWF file format spec. The dimension info you are looking for should be right near the beginning of the file. Take a look at the section "The SWF header":
"The FrameSize field defines the width and height of the on-screen display. This field is stored as a RECT structure, meaning that its size may vary according to the number of bits needed to encode the coordinates. The FrameSize RECT always has Xmin and Ymin value of 0; the Xmax and Ymax members define the width and height (see Using bit values)."
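Reading the FrameSize RECT by hand could look roughly like this. It is a sketch under stated assumptions, not the hexagonit implementation: the function name swf_dimensions is hypothetical, the code targets Python 3, and it handles only uncompressed (FWS) and zlib-compressed (CWS) files, not LZMA-compressed (ZWS) ones. The RECT stores a 5-bit field count followed by Xmin, Xmax, Ymin and Ymax as signed bit fields, measured in twips (1/20 of a pixel).

```python
import zlib

def swf_dimensions(data):
    """Return (width, height) in pixels from the raw bytes of an SWF file."""
    signature = data[:3]
    if signature == b"FWS":
        body = data[8:]                   # skip signature, version, file length
    elif signature == b"CWS":
        body = zlib.decompress(data[8:])  # everything after the first 8 bytes is zlib data
    else:
        raise ValueError("not an FWS/CWS file: %r" % signature)

    nbits = body[0] >> 3                  # top 5 bits: width of each RECT field
    total_bits = 5 + 4 * nbits
    nbytes = (total_bits + 7) // 8
    value = int.from_bytes(body[:nbytes], "big")
    value >>= nbytes * 8 - total_bits     # drop the padding bits at the end

    mask = (1 << nbits) - 1
    fields = []
    for i in range(4):                    # order: Xmin, Xmax, Ymin, Ymax
        raw = (value >> ((3 - i) * nbits)) & mask
        if nbits and raw & (1 << (nbits - 1)):
            raw -= 1 << nbits             # sign-extend negative values
        fields.append(raw)
    xmin, xmax, ymin, ymax = fields
    return (xmax - xmin) / 20, (ymax - ymin) / 20   # twips to pixels
```

Typical use would be to read the file in binary mode and pass the bytes in, for example swf_dimensions(open(path, 'rb').read()), where path is whatever file you are inspecting.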

how to create BMP image from binary data in Python?

Suppose I already read binary data from a binary file, how can I create a BMP image from that binary data?

You can find a definition of the bitmap file format on Wikipedia among other places. Use the struct module to create the necessary headers. Because the format is uncompressed it is very easy to write out. The color information must come in BGR order, bottom line to top line, and each line must be padded with zeros to a multiple of 4 bytes.
Or if you'd rather do it the easy way, PIL knows how to read and write BMP.
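As a rough illustration of the struct approach, here is a minimal sketch that writes a 24-bit uncompressed BMP. The function name make_bmp and the input layout (rows from top to bottom, each a list of (r, g, b) tuples) are assumptions for the example; your binary data may need to be reshaped into that form first.

```python
import struct

def make_bmp(width, height, rgb_rows):
    """Build a 24-bit BMP file as bytes from rows of (r, g, b) tuples, top row first."""
    row_size = (width * 3 + 3) & ~3               # each row padded to a multiple of 4 bytes
    padding = b"\x00" * (row_size - width * 3)
    pixel_data = bytearray()
    for row in reversed(rgb_rows):                # BMP stores the bottom line first
        for r, g, b in row:
            pixel_data += bytes((b, g, r))        # colour order is BGR
        pixel_data += padding

    data_offset = 14 + 40                         # file header + BITMAPINFOHEADER
    file_header = struct.pack("<2sIHHI", b"BM", data_offset + len(pixel_data), 0, 0, data_offset)
    info_header = struct.pack(
        "<IiiHHIIiiII",
        40, width, height,                        # header size, dimensions
        1, 24,                                    # planes, bits per pixel
        0, len(pixel_data),                       # no compression, image size
        2835, 2835,                               # about 72 DPI, in pixels per metre
        0, 0,                                     # no palette
    )
    return file_header + info_header + bytes(pixel_data)
```

The result can be written straight to disk with open('out.bmp', 'wb').write(make_bmp(w, h, rows)), where out.bmp is just a placeholder name.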

what causes "insufficient data for image" in a pdf

I have a program in Python (using pyPDF) that merges a bunch of different PDF documents. Sometimes, the resulting pdf is fine, except for some blank pages in the middle. When I view these documents with Acrobat Reader, I get an error message saying "insufficient data for image". When I view the documents with FoxIT Reader, I get some blank pages and a munged image.
The only odd thing about the PDF that creates the blank pages is that it seems to be PDF Version 1.4, and PyPdf seems to create files with PDF Version 1.3.
1) Does the version thing sound like the root cause of my problem?
2) Is there a way to get PyPdf to handle this correctly?

This might be related to Windows rather than to the .pdf file itself.
http://support.microsoft.com/kb/2506795
Good luck!

I had this problem, and was able to figure it out by looking at the original pdf side by side with the PyPDF one in a hex editor.
The problem seems to be that PyPDF actually leaves off a byte - it looks like probably the first byte in each image stream is missing. When I added the bytes to the PyPDF file, the pdf opened up fine without the error.

I suspect that the image XObject stream is malformed. Without access to a PDF with the problem, all most folks can do is guess.
For example, if the pdf info says the image is 10 pixels wide, 10 pixels high, and 8 bits per pixel, then the stream should uncompress to 100 bytes. If it uncompressed to less than that, I'd expect an error like the one you're seeing.
This is probably a bug in pypdf regarding whatever image format you happen to be using.
IIRC, there is no scan-line padding in PDF and no concern for word boundaries, though the last bits of each row are padded out to a byte if need be. Confusion there could easily lead to too many bytes, which isn't the problem here.
It could also be a bad color space. If you've got an indexed color image (gif), and they translate it half way to an RGB image, but use the original indexed color bytes, you'd get a stream that might expect n*3 bits per pixel, but only have n bits per pixel.
It's possible that this is an older bug that's been fixed in pypdf. Are you using the current version?
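The byte-count reasoning above can be checked with a few lines of arithmetic. This is only a sketch: the function name expected_image_stream_length is hypothetical, and it assumes an unmasked sampled image whose rows each start on a byte boundary.

```python
def expected_image_stream_length(width, height, bits_per_component, components=1):
    """Number of bytes a decoded PDF image stream should contain."""
    bits_per_row = width * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8   # each row is padded up to a whole byte
    return bytes_per_row * height

# The example from the answer: 10 x 10 pixels at 8 bits per pixel
print(expected_image_stream_length(10, 10, 8))      # 100
# The same image treated as RGB needs three times as much data
print(expected_image_stream_length(10, 10, 8, 3))   # 300
```

If the decoded stream in the merged file is shorter than this figure for the Width, Height, BitsPerComponent and colour space declared in the image dictionary, a viewer has good reason to report insufficient data.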
