Get EXIF data without downloading whole image - Python

Is it possible to get the EXIF information of a remote image while downloading only the EXIF data?
From what I can understand about EXIF bytes in image files, the EXIF data is in the first few bytes of an image.
So the question is how to download only the first few bytes of a remote file, with Python? (Edit: Relying on HTTP Range Header is not good enough, as not all remote hosts support it, in which case full download will occur.)
Can I cancel the download after x bytes of progress, for example?

You can tell the web server to send you only part of a file by setting the HTTP Range header. See this answer for an example that uses urllib to partially download a file. You could download a chunk of, e.g., 1000 bytes, check whether the EXIF data is contained in the chunk, and download more if you can't find the EXIF APP1 header or the EXIF data is incomplete.
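A minimal sketch of that idea for JPEG files, using only the standard library. The names fetch_prefix and find_exif_segment are made up for illustration. The sketch assumes the metadata segments precede the image data, which is typical for JPEG but not guaranteed. Because it reads at most max_bytes and then closes the connection, it should also stop early when a host ignores the Range header, although some extra bytes may already be in transit by then.

```python
import struct
import urllib.request


def fetch_prefix(url: str, max_bytes: int = 65536) -> bytes:
    """Download at most max_bytes from the start of url (hypothetical helper)."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{max_bytes - 1}"})
    with urllib.request.urlopen(req) as resp:
        # read(n) returns at most n bytes; leaving the with block closes the
        # connection, so the rest of the body is not fetched even if the
        # server ignored the Range header and answered with the full file.
        return resp.read(max_bytes)


def find_exif_segment(data: bytes):
    """Return the Exif payload of a JPEG APP1 segment, or None if it is
    absent or not completely contained in data."""
    if data[:2] != b"\xff\xd8":  # SOI marker
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None  # lost sync with the segment structure
        marker = data[pos + 1]
        if marker == 0xDA:  # SOS: compressed image data starts here
            return None
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack_from(">H", data, pos + 2)
        end = pos + 2 + length
        if marker == 0xE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
            if end > len(data):
                return None  # segment is truncated, fetch more bytes
            return data[pos + 10:end]
        pos = end
    return None
```

If find_exif_segment returns None for the first chunk, one option is to call fetch_prefix again with a larger max_bytes and retry.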

This depends on the image format heavily. For example, if you have a TIFF file, there is no knowing a priori where the EXIF data, if any, is within the file. It could be right after the header and before the first IFD, but this is unlikely. It could be way after the image data. Chances are it's somewhere in the middle.
If you want the EXIF information, extract that on the server (cache, maybe) and ship that down packaged up nicely instead of demanding client code do that.

Related

Determine the file size of PNG from stream

I came across a file with an obscure database format and would like to recover some information from it. After using Python to open and read the file as bytes and running re.search with the pattern b"\x89\x50\x4e\x47" (the PNG file header), I found an offset in the file that matches it. Upon further examination, it's likely that this is the starting position of an actual PNG file (the first 16 bytes in hex are 89504e470d0a1a0a0000000d49484452). However, with no information regarding the size of this PNG file, how should I determine it programmatically (using the information from the header)? It would be appreciated if some existing PNG debugging tool could be used.
I already tried writing out the rest of this database file starting from this offset and saving it as a PNG, but it doesn't work: my image viewer reports that the file is corrupted.

PNG images have a footer, the IEND chunk, that you can use to determine where to stop.
You can seek to the end of the IEND chunk type, then read 4 more bytes to capture the CRC, and extract all the bytes up to that point.
After that, you should have the full PNG file.
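A sketch of one way to do this. Instead of searching for the bytes IEND directly (which could also occur by chance inside compressed image data), it walks the PNG chunk layout: each chunk is a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC. The function name extract_png is made up for illustration, and CRCs are not verified here.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_png(blob: bytes, start: int) -> bytes:
    """Return the PNG that begins at blob[start], up to and including
    the CRC of its IEND chunk."""
    if blob[start:start + 8] != PNG_SIGNATURE:
        raise ValueError("no PNG signature at this offset")
    pos = start + 8
    while pos + 12 <= len(blob):
        (length,) = struct.unpack_from(">I", blob, pos)
        chunk_type = blob[pos + 4:pos + 8]
        # Skip length (4) + type (4) + data (length) + CRC (4).
        pos += 12 + length
        if chunk_type == b"IEND":
            if pos > len(blob):
                break  # IEND chunk itself is truncated
            return blob[start:pos]
    raise ValueError("IEND chunk not found; data may be truncated")
```

The file size is then len(extract_png(data, offset)), where offset is the position that re.search reported.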

Get Image size from URL

I have a list of URIs of images from essentially a Wordpress site.
I want to be able to have a script to get their file sizes (mb, kb, GB) from just using the URIs.
I don't have access to this server-wise and need to add the sizes to a Google sheet. This seems like the fastest way to do it as there are over 5k images and attachments.
However when I do this in Python
>>> import requests
>>> response = requests.get("https://xxx.xxxxx.com/wp-content/uploads/2017/05/resources.png")
>>> len(response.content)
3232
I get 3232 bytes but when I check in Chrome Dev Tools, it's 3.4KB
What is being added? Or is the image actually 3.4KB and my script is only checking content-length?
Also, I don't want to check using the Content-Length header as some of the images may be large and chunked so I want to be sure I'm getting the actual file size of the image.
What is a good way to go about this? I feel like there should be some minimal code or script I could run.

The value you are seeing (3.4KB) includes the network overhead such as response headers.
As a side note, I am not sure which version of Chrome you are using, but the transfer size (including response headers) and the resource size (i.e. the file size) are displayed separately for me.
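If you would rather count the body bytes yourself than trust Content-Length, a rough sketch with the standard library is shown here. The helper names are made up for illustration. It counts only the response body, so headers are excluded, and it assumes the server sends the file without content encoding, which is the usual case when no Accept-Encoding header is sent.

```python
import urllib.request


def iter_body(url: str, chunk_size: int = 65536):
    """Yield the response body in chunks without holding it all in memory."""
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            yield chunk


def total_size(chunks) -> int:
    """Sum the lengths of an iterable of byte chunks."""
    return sum(len(chunk) for chunk in chunks)


def human_size(n: int) -> str:
    """Format a byte count using 1024-based units."""
    size = float(n)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1024
```

For a list of URIs, human_size(total_size(iter_body(uri))) gives one value per image that can be pasted into a sheet.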

Uncompress and save zlib data in PDF with python

We get PDF files delivered to us daily and we need to get the images out. For example, what I want to do is to get the image back out of this PDF file I have, with Python. Most PDF files we get are multipage and we want to export each embedded image to a separate file. Most have JPEG files in them, but this one does not.
Object 5 is embedded as a zlib compressed stream. I am pretty sure it is zlib compressed because it is marked as FlateDecode and the start of the stream is \x78\x9c which is typical for zlib. You can see (part of) the hex dump here
The question is, how do I 'deflate' it and save the resulting file.
Thank you for sharing your wisdom.

I searched everywhere and tried many things but couldn't get it to work. I managed to decompress the data like this:
import zlib
with open("MDL1703140088.pdf", "rb") as f:
    pdf = f.read()
image = zlib.decompress(pdf[640:69307])
640 is zlib header(b'x\x9c') position and 69307 is the position of something like footer of pdf spec. b'\nendstream\n' is there. Detail is in the spec and some helpful Q&A can be found here. But omitting the end position is allowed in this case because decompress() seems to ignore following non-compressed data. You can validate this by:
decomp = zlib.decompressobj()
image = decomp.decompress(pdf[640:])
print(decomp.unused_data) # starts with b'\nendstream\n'
So far so good. But when I write image to a PNG file, it cannot be read by any image viewer. In fact, the decompressed data looks mostly empty in places. I attached a PNG header, but no luck. Hey, it's too much...
As I said earlier (strangely my comment was removed by someone), you'd better use some other existing tools. If Acrobat is not your option, what about pdftopng (part of Xpdf)? pdftopng MDL1703140088.pdf . gave me a valid PNG file flawlessly. Obviously command-line tools can be executed in Python, as you may know.
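If you still want to stay in Python, the hard-coded offsets above can be avoided by scanning for stream keywords. This is only a sketch: iter_flate_streams is a made-up name, it ignores the object dictionaries entirely, and it simply tries zlib on anything that looks like a zlib header. As far as I know, a FlateDecode image stream holds raw pixel samples rather than a PNG file, which would explain why viewers reject it; the width, height and color space live in the object's dictionary.

```python
import re
import zlib


def iter_flate_streams(pdf: bytes):
    """Yield (offset, decompressed_bytes) for each zlib stream in pdf."""
    # Match the "stream" keyword but not the tail of "endstream".
    for match in re.finditer(rb"(?<!end)stream\r?\n", pdf):
        start = match.end()
        if pdf[start:start + 1] != b"\x78":  # usual first byte of a zlib header
            continue
        decomp = zlib.decompressobj()
        try:
            data = decomp.decompress(pdf[start:])
        except zlib.error:
            continue  # looked like zlib but was not
        if decomp.eof:  # a complete zlib stream was consumed
            yield start, data
```

Each yielded blob can then be written to its own file or handed to an imaging library together with the dimensions from the dictionary.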

Upload image with an in-memory stream to input using Pillow + WebDriver?

I'm getting an image from a URL with Pillow, and creating a stream (BytesIO/StringIO).
r = requests.get("http://i.imgur.com/SH9lKxu.jpg")
stream = Image.open(BytesIO(r.content))
I want to upload this image using an <input type="file" /> with selenium WebDriver. I can do something like this to upload a file:
self.driver.find_element_by_xpath("//input[@type='file']").send_keys("PATH_TO_IMAGE")
I would like to know if it's possible to upload that image from a stream without having to mess with files / file paths... I'm trying to avoid filesystem read/write and do it in-memory, or at most with temporary files. I'm also wondering if that stream could be encoded to Base64 and then uploaded by passing the string to the send_keys function you can see above :$
PS: Hope you like the image :P

You seem to be asking multiple questions here.
First, how do you convert a JPEG without downloading it to a file? You're already doing that, so I don't know what you're asking here.
Next, "And do it in-memory or as much with temporary files." I don't know what this means, but you can do it with temporary files with the tempfile library in the stdlib, and you can do it in-memory too; both are easy.
Next, you want to know how to do a streaming upload with requests. The easy way to do that, as explained in Streaming Uploads, is to "simply provide a file-like object for your body". This can be a tempfile, but it can just as easily be a BytesIO. Since you're already using one in your question, I assume you know how to do this.
(As a side note, I'm not sure why you're using BytesIO(r.content) when requests already gives you a way to use a response object as a file-like object, and even to do it by streaming on demand instead of by waiting until the full content is available, but that isn't relevant here.)
If you want to upload it with selenium instead of requests… well then you do need a temporary file. The whole point of selenium is that it's scripting a web browser. You can't just type a bunch of bytes at your web browser in an upload form, you have to select a file on your filesystem. So selenium needs to fake you selecting a file on your filesystem. This is a perfect job for tempfile.NamedTemporaryFile.
Finally, "I'm also Wondering If that stream could be encoded to Base64".
Sure it can. Since you're just converting the image in-memory, you can just encode it with, e.g., base64.b64encode. Or, if you prefer, you can wrap your BytesIO in a codecs wrapper to base-64 it on the fly. But I'm not sure why you want to do that here.
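The tempfile and base64 suggestions could look roughly like this. Both helper names are made up for illustration. The file is created with delete=False so that it still exists when the browser reads it, which means the caller has to remove it afterwards.

```python
import base64
import tempfile


def bytes_to_temp_file(data: bytes, suffix: str = ".jpg") -> str:
    """Write data to a named temporary file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
    return tmp.name


def to_base64_text(data: bytes) -> str:
    """Return data as an ASCII Base64 string."""
    return base64.b64encode(data).decode("ascii")


# Hypothetical usage with the names from the question (not run here):
#   path = bytes_to_temp_file(r.content)
#   element.send_keys(path)   # element is the <input type="file"> node
#   os.remove(path)           # clean up once the upload has finished
```

Passing the Base64 string itself to send_keys would not upload anything, since a file input expects a path on the local filesystem.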

Getting the length of a ogg track from s3 without downloading the whole file

How do I get the play length of an ogg file without downloading the whole file? I know this is possible because both the HTML5 tag and VLC can show the entire play length immediately after loading the URL, without downloading the entire file.
Is there a header or something I can read? Maybe even the bitrate, which I can divide the file size by to get an approximate play length?

Unfortunately there does not appear to be a way to achieve this.
Mozilla's Configuring servers for Ogg media is very instructive. Basically:
Gecko uses the X-Content-Duration header, sent by the web server if it has it. This explains the HTML5 audio streaming example you raised. If it is missing, then
Gecko estimates the length based on the sample rate (in the header) and the size of the file from the Content-Length HTTP header.
The sample rate is stored in the Identification Header, the first header packet. See section "4.2 Header decode and decode setup" of the specification.
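For a Vorbis stream, the identification header can be read from the first few hundred bytes of the file. The sketch below assumes Vorbis specifically (other Ogg codecs use different headers), and the function names are made up for illustration. Besides the sample rate, the header carries a nominal bitrate field, which is what the "file size divided by bitrate" idea from the question would need; encoders may leave it at 0, in which case no estimate is possible this way.

```python
import struct


def parse_vorbis_identification(head: bytes):
    """Return (sample_rate, nominal_bitrate) from the start of an Ogg Vorbis file."""
    idx = head.find(b"\x01vorbis")  # packet type 1 followed by the codec magic
    if idx == -1:
        raise ValueError("no Vorbis identification header found")
    # Little-endian fields: version, channels, sample rate,
    # then maximum, nominal and minimum bitrate.
    _version, _channels, sample_rate, _br_max, br_nominal, _br_min = struct.unpack_from(
        "<IBIiii", head, idx + 7
    )
    return sample_rate, br_nominal


def estimate_duration_seconds(file_size_bytes: int, nominal_bitrate: int) -> float:
    """Rough duration estimate: total bits divided by bits per second."""
    if nominal_bitrate <= 0:
        raise ValueError("nominal bitrate is not set in this file")
    return file_size_bytes * 8 / nominal_bitrate
```

The file size can come from the Content-Length header of a HEAD request, so only the start of the file has to be downloaded. The result is an approximation, not an exact length.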

This is possible. The way to do it is to use HTTP range requests to fetch the end of the file, find the last Ogg page, and extract the timestamp from it. This is assuming that the file consists of contiguous streams (i.e. no chaining) which all have the same length, and that the stream time starts at 0 (otherwise, you should decode the beginning of the stream and subtract that timestamp from the final timestamp). Decoding the timestamp from the Ogg Page granulepos field is codec-specific (e.g. for Vorbis it is expressed as a number of samples).
Alternatively, if your Ogg file has Ogg Skeleton metadata, you can read that directly to determine the duration of the file.
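A sketch of the range-request approach, under the same assumptions (no chaining, stream time starting at 0) and additionally assuming Vorbis, where the granule position is a sample count. The function names are made up for illustration. The page layout used is "OggS", a version byte, a header-type byte, then the 8-byte little-endian granule position at offset 6.

```python
import struct
import urllib.request


def fetch_tail(url: str, num_bytes: int = 65536) -> bytes:
    """Fetch the last num_bytes of a remote file with a suffix Range request.
    Assumes the server honours Range; otherwise the whole file is returned."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=-{num_bytes}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def last_granule_position(tail: bytes) -> int:
    """Return the granule position of the last Ogg page in tail that has one."""
    idx = tail.rfind(b"OggS")
    while idx != -1:
        if idx + 14 <= len(tail):
            (granule,) = struct.unpack_from("<q", tail, idx + 6)
            if granule != -1:  # -1 means no packet finishes on this page
                return granule
        idx = tail.rfind(b"OggS", 0, idx)  # try the previous page
    raise ValueError("no Ogg page with a granule position found")
```

For Vorbis, the duration in seconds is then last_granule_position(tail) divided by the sample rate from the identification header.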

This is just not possible without downloading the data itself. You could store the relevant information as part of the S3 metadata of the key, so that you can inspect the metadata before actually downloading the data.
