Get Image size from URL - python

I have a list of URIs of images from what is essentially a WordPress site.
I want a script that gets their file sizes (KB, MB, GB) using just the URIs.
I don't have server access, and I need to add the sizes to a Google Sheet. This seems like the fastest way to do it, as there are over 5k images and attachments.
However, when I do this in Python:
>>> import requests
>>> response = requests.get("https://xxx.xxxxx.com/wp-content/uploads/2017/05/resources.png")
>>> len(response.content)
3232
I get 3232 bytes, but when I check in Chrome DevTools it says 3.4 KB.
What is being added? Or is the image actually 3.4 KB and my script is only checking content-length?
Also, I don't want to rely on the Content-Length header, as some of the images may be large and chunked; I want to be sure I'm getting the actual file size of the image.
What is a good way to go about this? I feel like there should be some minimal code or script I could run.

The value you are seeing (3.4KB) includes the network overhead such as response headers.
As a side note, I'm not sure which version of Chrome you're using, but the transfer size (including response headers) and the resource size (i.e. the file size) are displayed separately for me in the Network panel.
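To measure the actual size of each image without trusting Content-Length, you can stream each response and sum the chunk lengths; that gives you the decoded body size even for chunked responses. A minimal sketch, assuming plain HTTP(S) URLs (the list here is just a placeholder for your 5k+ URIs):

import requests

urls = [
    "https://xxx.xxxxx.com/wp-content/uploads/2017/05/resources.png",
    # ...the rest of your URIs
]

session = requests.Session()  # reuse connections across the 5k+ requests
for url in urls:
    size = 0
    # stream=True avoids holding the whole file in memory; summing the
    # chunk lengths gives the decoded body size regardless of chunked
    # transfer encoding or a missing Content-Length header
    with session.get(url, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=65536):
            size += len(chunk)
    print(url, size, "bytes =", round(size / 1024.0, 1), "KB")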

Related

Python Pillow doesn't work with some images

I have 30 000 images to check for size, format and some other things.
I've checked all of them except 200 images. These 200 images give an error in Pillow
from PIL import Image
import requests
url = 'https://img.yakaboo.ua/media/wysiwyg/ePidtrymka_desktop.svg'
image = Image.open(requests.get(url, stream=True).raw)
This gives an error:
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fbfbf59c810>
Here are some other images, that give the same error:
https://www.yakaboo.ua/ua/skin/frontend/bootstrap/yakaboo/images/logo/y-logo.png
https://img.yakaboo.ua/media/wysiwyg/ePidtrymka_desktop.svg
https://img.yakaboo.ua/media/wysiwyg/ePidtrymka_desktop_futer.svg
https://www.yakaboo.ua/ua/skin/frontend/bootstrap/yakaboo/images/icons/googleplay.png
https://www.yakaboo.ua/ua/skin/frontend/bootstrap/yakaboo/images/icons/appstore.png
If I download these images, everything works fine. But I need to check them without downloading. Is there any solution?
You're not checking for any errors you might get from requests responses, so chances are you might be trying to identify e.g. an error page.
Pillow doesn't support SVG files (and they don't necessarily have an intrinsic size anyway). You'll need something else to identify them.
You're explicitly asking requests to give you the raw stream, not something that may have been e.g. decompressed if there's a transport encoding. For that y-logo.png, the server responds with a response that has Content-Encoding: gzip, so no wonder you're having a hard time. You might want to just not use stream=True and .raw, but instead read the response into memory, wrap it with io.BytesIO(resp.content) and pass that to Pillow. If that's not an option, you could also write a file-like wrapper around a requests response, but it's likely not worth the effort.
To save a bunch of time (by reusing connections), use a Requests session.
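Putting those points together, a rough sketch of the check might look like this (the URL list is a placeholder, and the SVG branch is just a stub for whatever tool you pick for those files):

import io
import requests
from PIL import Image, UnidentifiedImageError

urls = [
    "https://www.yakaboo.ua/ua/skin/frontend/bootstrap/yakaboo/images/logo/y-logo.png",
    "https://img.yakaboo.ua/media/wysiwyg/ePidtrymka_desktop.svg",
]

session = requests.Session()  # reuse connections across the 30 000 requests
for url in urls:
    resp = session.get(url)
    resp.raise_for_status()  # fail fast on 404s and other error pages
    if url.lower().endswith(".svg"):
        # Pillow can't read SVG; handle these separately (e.g. with an SVG parser)
        continue
    try:
        # resp.content is already decompressed, unlike resp.raw
        image = Image.open(io.BytesIO(resp.content))
        print(url, image.format, image.size)
    except UnidentifiedImageError:
        print(url, "is not a recognised image")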

python: streaming httpRequest / loading website partially

I would like to know if there is a way in Python to "stream" HTTP requests in order to avoid loading the whole page.
What I'm currently doing to get the HTML data of a given URL is this:
req = urllib2.Request(url)
response = urllib2.urlopen(req)
return response.read()
This way I'm always loading the whole website, but since I only need a small part of it, I'm using more bandwidth than I need to. If I could stop loading the website after I found a specific value / expression, or even better specify where to start / end loading the website, e.g. starting at character #3000 and loading until #5000, I'd save a lot of bandwidth.
thanks in advance
tschery
This Stack Overflow answer shows how to do partial HTTP loading in Python. You can also use response.read(N) (N being the number of bytes to read), but there is no guarantee that the exact amount you specify is downloaded.
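A minimal sketch of both approaches with urllib2, matching the code in the question (the URL and the byte positions are placeholders, and the Range header only helps if the server supports it):

import urllib2

url = "http://example.com/some/page.html"  # placeholder URL

# Option 1: ask the server for a byte range; servers that support it
# answer with 206 Partial Content and send only those bytes
req = urllib2.Request(url)
req.add_header("Range", "bytes=3000-4999")
partial = urllib2.urlopen(req).read()

# Option 2: read only the first N bytes of the response and stop;
# somewhat more than N bytes may already be in flight, but the rest
# of the page is never downloaded
response = urllib2.urlopen(urllib2.Request(url))
first_part = response.read(5000)
response.close()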

Get EXIF data without downloading whole image - Python

Is it possible to get the EXIF information of an image remotely, while only downloading the EXIF data?
From what I can understand about EXIF bytes in image files, the EXIF data is in the first few bytes of an image.
So the question is how to download only the first few bytes of a remote file, with Python? (Edit: Relying on HTTP Range Header is not good enough, as not all remote hosts support it, in which case full download will occur.)
Can I cancel the download after x bytes of progress, for example?
You can tell the web server to only send you parts of a file by setting the HTTP Range header. See this answer for an example using urllib to partially download a file. So you could download a chunk of e.g. 1000 bytes, check whether the EXIF data is contained in the chunk, and download more if you can't find the EXIF APP1 header or the EXIF data is incomplete.
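A rough sketch of that idea using requests and Pillow (the URL is a placeholder, and the 64 KB cutoff is an assumption that covers most JPEGs, where the EXIF APP1 segment sits near the start of the file; if parsing fails you'd fetch a larger range):

import io
import requests
from PIL import Image

url = "https://example.com/photo.jpg"  # placeholder URL

# Ask for the first 64 KB only; servers that ignore Range simply send
# the whole file (status 200 instead of 206)
resp = requests.get(url, headers={"Range": "bytes=0-65535"})
resp.raise_for_status()

# Image.open() only parses the header, and for JPEGs the EXIF APP1
# segment normally follows right after the SOI marker
image = Image.open(io.BytesIO(resp.content))
for tag, value in image.getexif().items():
    print(tag, value)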
This depends on the image format heavily. For example, if you have a TIFF file, there is no knowing a priori where the EXIF data, if any, is within the file. It could be right after the header and before the first IFD, but this is unlikely. It could be way after the image data. Chances are it's somewhere in the middle.
If you want the EXIF information, extract that on the server (cache, maybe) and ship that down packaged up nicely instead of demanding client code do that.

Getting image sizes like Facebook link scraper

I'm implementing my own link scraper to copy Facebook's technique as closely as possible (unless someone has a ready made lib for me...).
According to the many answers on SO, Facebook's process for determining the image to associate with a shared link involves searching for several recognized meta tags and then, if those are not found, stepping through the images on the page and returning a list of appropriately sized ones (at least 50px by 50px, have a maximum aspect ratio of 3:1, and in PNG, JPEG or GIF format according to this answer)
My question is, how does Facebook get the size information of the images? Is it loading all the images for each shared link and inspecting them? Is there a more efficient way to do this? (My backend is Python.)
(Side note: Would it make sense to use a client-side instead of server-side approach?)
Is there a more efficient way to do this?
Most common “web” graphic formats – JPEG, GIF, PNG – contain info about the width & height in the header (or at least in the first block, for PNG).
So if the remote web server accepts range requests, it'd be possible to request only the first X bytes of an image resource, instead of the whole thing, to get the desired information.
(This is what Facebook's scraper does for HTML pages, too. It's quite common to see in the debugger that the request was answered with HTTP status code 206 Partial Content, meaning Facebook said they're only interested in the first X (K)bytes, for the meta elements in head, and the web server was able to give them only that.)
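A sketch of what that could look like in Python with requests and Pillow (the URL and the 32 KB range are assumptions; you'd still want a fallback to a full download for servers that ignore Range or images whose headers are unusually large):

import io
import requests
from PIL import Image

url = "https://example.com/some-image.jpg"  # placeholder URL

# Fetch only the first 32 KB; width and height live in the header for
# JPEG, GIF and PNG, so this is normally plenty
resp = requests.get(url, headers={"Range": "bytes=0-32767"})

try:
    width, height = Image.open(io.BytesIO(resp.content)).size
    print(resp.status_code, width, height)  # 206 means the server honoured the range
except Exception:
    # header was truncated or the server returned an error page;
    # fall back to downloading the full resource here
    pass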

How to upload huge files from Nokia 95 to webserver?

I'm trying to upload a huge file from my Nokia N95 mobile to my web server using PyS60 Python code. However, the code crashes because I'm loading the whole file into memory and posting it to an HTTP URL. Any idea how to upload huge files (> 120 MB) to a web server using PyS60?
Following is the code I use to send the HTTP request.
f = open(soundpath + audio_filename, 'rb')  # binary mode, so the audio data isn't mangled
fields = [('timestamp', str(audio_start_time)), ('test_id', str(test_id)), ('tester_name', tester_name), ('sensor_position', str(sensor_position)), ('sensor', 'audio') ]
files = [('data', audio_filename, f.read())]  # f.read() loads the entire file into memory
post_multipart(MOBILE_CONTEXT_HOST, MOBILE_CONTEXT_SERVER_PORT, '/MobileContext/AudioServlet', fields, files)
f.close()
Where does this post_multipart() function come from?
If it is from here, then it should be easy to adapt the code so that it takes a file object as an argument rather than the full content of the file, so that post_multipart reads small chunks of data while posting instead of loading the whole file into memory before posting.
This is definitely possible.
You can't. It's pretty much physically impossible. You'll need to split the file into small chunks and upload it bit by bit, which is very difficult to do quickly and efficiently on that sort of platform.
Jamie
You'll need to write client code that splits your source file into small chunks and rebuilds the pieces server-side.
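As a rough illustration of that chunked approach (the /upload-chunk endpoint and its headers are hypothetical, and the server would need matching code to reassemble the pieces; PyS60 ships an old Python 2, so this uses httplib):

import httplib
import os

CHUNK_SIZE = 512 * 1024  # 512 KB per request keeps memory use low
HOST = "example.com"     # placeholder host
PATH = "/upload-chunk"   # hypothetical endpoint that reassembles the chunks

def upload_in_chunks(filename):
    size = os.path.getsize(filename)
    f = open(filename, "rb")
    try:
        offset = 0
        while offset < size:
            chunk = f.read(CHUNK_SIZE)
            conn = httplib.HTTPConnection(HOST)
            # the custom headers tell the (hypothetical) server where this
            # piece belongs so it can rebuild the original file
            conn.request("POST", PATH, chunk, {
                "Content-Type": "application/octet-stream",
                "X-File-Name": os.path.basename(filename),
                "X-Chunk-Offset": str(offset),
                "X-Total-Size": str(size),
            })
            conn.getresponse().read()
            conn.close()
            offset += len(chunk)
    finally:
        f.close()

upload_in_chunks(soundpath + audio_filename)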
