Python Pillow doesn't work with some images - python

I have 30,000 images to check for size, format, and a few other properties.
I've checked all of them except 200. These 200 images raise an error in Pillow:
from PIL import Image
import requests
url = 'https://img.yakaboo.ua/media/wysiwyg/ePidtrymka_desktop.svg'
image = Image.open(requests.get(url, stream=True).raw)
This gives an error:
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fbfbf59c810>
Here are some other images, that give the same error:
https://www.yakaboo.ua/ua/skin/frontend/bootstrap/yakaboo/images/logo/y-logo.png
https://img.yakaboo.ua/media/wysiwyg/ePidtrymka_desktop.svg
https://img.yakaboo.ua/media/wysiwyg/ePidtrymka_desktop_futer.svg
https://www.yakaboo.ua/ua/skin/frontend/bootstrap/yakaboo/images/icons/googleplay.png
https://www.yakaboo.ua/ua/skin/frontend/bootstrap/yakaboo/images/icons/appstore.png
If I download these images first, everything works fine. But I need to check them without downloading them to disk. Is there any solution?

You're not checking for any errors you might get from requests responses, so chances are you might be trying to identify e.g. an error page.
Pillow doesn't support SVG files (and they don't necessarily have an intrinsic size anyway). You'll need something else to identify them.
You're explicitly asking requests to give you the raw stream, not something that may have been e.g. decompressed if there's a transport encoding. For that y-logo.png, the server responds with a response that has Content-Encoding: gzip, so no wonder you're having a hard time. You might want to just not use stream=True and .raw, but instead read the response into memory, wrap it with io.BytesIO(resp.content) and pass that to Pillow. If that's not an option, you could also write a file-like wrapper around a requests response, but it's likely not worth the effort.
To save a bunch of time (by reusing connections), use a Requests session.
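Putting those pieces together, a sketch along these lines might work (requests and Pillow are third-party; the helper names are illustrative, not from the answer):

```python
import io

import requests
from PIL import Image

session = requests.Session()  # reuses connections across the 30,000 checks

def identify_image_bytes(data):
    """Identify an image from an in-memory bytes object with Pillow."""
    with Image.open(io.BytesIO(data)) as im:
        return im.format, im.size

def check_image(url):
    resp = session.get(url, timeout=10)   # no stream=True / .raw
    resp.raise_for_status()               # fail loudly on 4xx/5xx error pages
    # resp.content is already decompressed (gzip transport encoding handled)
    return identify_image_bytes(resp.content)
```

SVG URLs would still need to be filtered out before calling `check_image`, since Pillow cannot read them.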

Related

How can I download free videos from youtube or any other site with Python, without an external video downloader?

Problem
Hi! Most answers refer to using pytube when using Python. But the problem is that pytube doesn't work for many videos on youtube now. It's outdated, and I always get errors. I also want to be able to get other free videos from other sites that are not on youtube.
And I know there are free sites and paid programs that let you put in a url, and it'll download it for you. But I want to understand the process of what's happening.
The following code works for easy things. Obviously it can be improved, but let's keep it super simple...
import requests
good_url = 'https://www.w3schools.com/tags/movie.mp4'
bad_url = 'https://r2---sn-vgqsknes.googlevideo.com/videoplayback?expire=1585432044&ei=jHF_XoXwBI7PDv_xtsgN&ip=12.345.678.99&id=743bcee1c959e9cd&itag=244&aitags=133%2C134%2C135%2C136%2C137%2C160%2C242%2C243%2C244%2C247%2C248%2C278%2C298%2C299%2C302%2C303&source=youtube&requiressl=yes&mh=1w&mm=31%2C26&mn=sn-vgqsknes%2Csn-ab5szn7z&ms=au%2Conr&mv=m&mvi=4&pl=23&pcm2=yes&initcwndbps=3728750&vprv=1&mime=video%2Fwebm&gir=yes&clen=22135843&dur=283.520&lmt=1584701992110857&mt=1585410393&fvip=5&keepalive=yes&beids=9466588&c=WEB&txp=5511222&sparams=expire%2Cei%2Cip%2Cid%2Caitags%2Csource%2Crequiressl%2Cpcm2%2Cvprv%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=ADKhkGMwRgIhAI3WtBFTf4kklX4xl859U8yzqavSzu-2OEn8tvHPoqAWAiEAlSDPhPdb5y4xPxPoXJFCNKr-h2c4jxKU8sAaaxxa7ok%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=ABSNjpQwRQIhAJkFK4xhfLraysF13jSZpHCoklyhJrwLjNSCQ1v7IzeXAiBLpVpYf72Gp-dlvwTM2tYzMcVl4Axzm2ARd7fN1gPW-g%3D%3D&alr=yes&cpn=EvFJNwgO-zNQOWkz&cver=2.20200327.05.01&ir=1&rr=12&fexp=9466588&range=15036316-15227116&rn=14&rbuf=0'
r = requests.get(good_url, stream=True)
with open('my_video.mp4', 'wb') as file:
    file.write(r.content)
This works. But when I want a youtube video (and I obviously can't use a regular youtube url because the document request is different from the video request)...
Steps taken
I'll check the network tab in the dev tools, and it's all filled with a bunch of xhr requests. The headers for them always have the very long url for the request, accept-ranges: bytes, and content-type: video/webm, or something similar for mp4, etc.
Then I'll copy the url for that xhr, change the file extension to the correct one, and run the program.
Result
Sometimes that downloads a small chunk of the video with no sound (a few seconds long); other times it downloads a bigger chunk, but with no image. I want the entire video with sound.
Question
Can someone please help me understand how to do this, and explain what's happening, whether it's on YouTube or another site?
Why does good_url work, but not bad_url? I figured it might be a timeout thing, so I grabbed that xhr and immediately tested it from Python, but still no luck.
A related question (don't worry about answering this one, unless required)...
Sometimes youtube has Blob urls in the html too, example: <video src='blob:https://www.youtube.com/f4484c06-48ed-4531-a6ee-6a3ae0291d26'>...
I've read various answers about what blobs are, and I'm not understanding it. It looks to me like a blob url is doing an xhr to change a url in the DOM, as if it were the equivalent of an internal redirect on a webserver for a private file served based on view/object-level permissions. Is that what's happening? Because I don't see the point, especially when these videos are free. The way I've done something similar, such as with lazy loading, is to have a data-src attribute with the correct value, and an onload event handler that swaps the data-src value into src.
You can try it with this video:
https://www.youtube.com/watch?v=GAvr5_EtOnI
From the bad_url, remove the &range=3478828-4655264 parameter (the range parameter restricts the response to a byte slice of the video, which is why you only get a chunk).
Then try this code:
import requests
good_url = 'https://www.w3schools.com/tags/movie.mp4'
bad_url = 'https://r1---sn-gvcp5mp5u5-q5js.googlevideo.com/videoplayback?expire=1631119417&ei=2ZM4Yc33G8S8owPa1YiYDw&ip=42.0.7.242&id=o-AKG-sNstgjok92lJp_o4pF_iJ2MWD4skzEvFcTLl8LX8&itag=396&aitags=133,134,135,136,137,160,242,243,244,247,248,278,394,395,396,397,398,399&source=youtube&requiressl=yes&mh=gB&mm=31,29&mn=sn-gvcp5mp5u5-q5js,sn-i3belney&ms=au,rdu&mv=m&mvi=1&pl=24&initcwndbps=82500&vprv=1&mime=video/mp4&ns=mh_mFY1G7qq0apTltxepCQ8G&gir=yes&clen=7874090&dur=213.680&lmt=1600716258020131&mt=1631097418&fvip=1&keepalive=yes&fexp=24001373,24007246&beids=9466586&c=WEB&txp=5531432&n=Q3AfqZKoEoXUzw&sparams=expire,ei,ip,id,aitags,source,requiressl,vprv,mime,ns,gir,clen,dur,lmt&lsparams=mh,mm,mn,ms,mv,mvi,pl,initcwndbps&lsig=AG3C_xAwRQIgYZMQz5Tc2kucxFsorprl-3e4mCxJ3lpX1pbX-HnFloACIQD-CuHGtUeWstPodprweaA4sUp8ZikyxySZp1m3zlItKg==&alr=yes&sig=AOq0QJ8wRQIhAJZg4q9vLal64LO6KAyWkpY1T8OTlJRd9wNXrgDpNOuQAiB77lqm4Ka9uz2CAgrPWMSu6ApTf5Zqaoy5emABYqCB_g==&cpn=5E-Sqvee9UG2ZaNQ&cver=2.20210907.00.00&rn=19&rbuf=90320'
r = requests.get(good_url, stream=True)
with open('my_video.mp4', 'wb') as file:
    file.write(r.content)
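If you need to strip that range parameter programmatically rather than by hand, a small stdlib helper (not part of the answer above; the function name is illustrative) could look like:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_range_param(url):
    """Return the URL with any 'range' query parameter removed."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "range"]
    return urlunsplit(parts._replace(query=urlencode(query)))
```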

Get Image size from URL

I have a list of URIs of images from what is essentially a Wordpress site.
I want a script that gets their file sizes (KB, MB, GB) from just the URIs.
I don't have server access, and I need to add the sizes to a Google Sheet. A script seems like the fastest way to do it, as there are over 5k images and attachments.
However when I do this in Python
>>> import requests
>>> response = requests.get("https://xxx.xxxxx.com/wp-content/uploads/2017/05/resources.png")
>>> len(response.content)
3232
I get 3232 bytes, but when I check in Chrome Dev Tools, it says 3.4 KB.
What is being added? Or is the image actually 3.4 KB and my script is only checking content-length?
Also, I don't want to rely on the Content-Length header, since some of the images may be large and sent chunked; I want to be sure I'm getting the actual file size of the image.
What is a good way to go about this? I feel like there should be some minimal code or script I could run.
The value you are seeing (3.4KB) includes the network overhead such as response headers.
As a side note, I'm not sure which version of Chrome you are using, but for me the transfer size (including response headers) and the resource size (i.e. the file size) are displayed separately in the Network panel.
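A minimal sketch of the script the question asks for, assuming requests is available (the helper names are illustrative): `len(resp.content)` measures the decoded body, i.e. the file itself, not the transfer size Dev Tools reports.

```python
import requests

def content_size_bytes(url, session=None):
    """Actual decoded file size in bytes, regardless of chunking/gzip."""
    sess = session or requests
    resp = sess.get(url, timeout=10)
    resp.raise_for_status()
    return len(resp.content)

def human_size(n):
    """Format a byte count for the spreadsheet (B/KB/MB/GB, base 1024)."""
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024 or unit == "GB":
            return f"{n} B" if unit == "B" else f"{n:.1f} {unit}"
        n /= 1024
```

With 5k images, it's worth passing one `requests.Session()` in as `session` so connections are reused.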

Can't make python urllib urlretrieve work. Resulting image is corrupted

I've successfully written code that goes through several urls, finds a specific image in each of them, and saves its address. Now I want to download the image.
I'm using this.
import urllib

def update(name, set, url):
    urllib.urlretrieve(url, "c:/path/" + set + "/" + url)
It is currently working, but the images this code obtains can't be opened. I get a message saying that either I don't have the proper update, or that the Windows viewer can't open the file because it doesn't support the format.
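There is no accepted answer above; as a hedged sketch (Python 3, names illustrative), one common cause of "corrupted" downloads is saving an HTML error page, or using the full URL, slashes and all, as the output file name. Deriving the name from the URL path and checking the Content-Type guards against both:

```python
import os
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlopen

def filename_from_url(url):
    """'https://host/a/b/pic.jpg?x=1' -> 'pic.jpg'."""
    return posixpath.basename(urlsplit(url).path)

def download_image(url, dest_dir):
    with urlopen(url) as resp:
        ctype = resp.headers.get("Content-Type", "")
        if not ctype.startswith("image/"):
            # e.g. an HTML error page that would save as a broken "image"
            raise ValueError("not an image: " + ctype)
        data = resp.read()
    path = os.path.join(dest_dir, filename_from_url(url))
    with open(path, "wb") as f:
        f.write(data)
    return path
```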

Upload image with an in-memory stream to input using Pillow + WebDriver?

I'm getting an image from a URL with Pillow, and creating a stream (BytesIO/StringIO).
r = requests.get("http://i.imgur.com/SH9lKxu.jpg")
stream = Image.open(BytesIO(r.content))
Since I want to upload this image using an <input type="file" /> with selenium WebDriver, I can do something like this to upload a file:
self.driver.find_element_by_xpath("//input[@type='file']").send_keys("PATH_TO_IMAGE")
I would like to know if it's possible to upload that image from a stream, without having to mess with files / file paths. I'm trying to avoid filesystem reads/writes, and do it in memory, or at most with temporary files. I'm also wondering if that stream could be encoded to Base64, and then uploaded by passing the string to the send_keys function you can see above :$
PS: Hope you like the image :P
You seem to be asking multiple questions here.
First, how do you open a JPEG without downloading it to a file? You're already doing that, so I don't know what you're asking here.
Next, "And do it in-memory or as much with temporary files." I don't know what this means, but you can do it with temporary files with the tempfile library in the stdlib, and you can do it in-memory too; both are easy.
Next, you want to know how to do a streaming upload with requests. The easy way to do that, as explained in Streaming Uploads, is to "simply provide a file-like object for your body". This can be a tempfile, but it can just as easily be a BytesIO. Since you're already using one in your question, I assume you know how to do this.
(As a side note, I'm not sure why you're using BytesIO(r.content) when requests already gives you a way to use a response object as a file-like object, and even to do it by streaming on demand instead of by waiting until the full content is available, but that isn't relevant here.)
If you want to upload it with selenium instead of requests… well then you do need a temporary file. The whole point of selenium is that it's scripting a web browser. You can't just type a bunch of bytes at your web browser in an upload form, you have to select a file on your filesystem. So selenium needs to fake you selecting a file on your filesystem. This is a perfect job for tempfile.NamedTemporaryFile.
Finally, "I'm also Wondering If that stream could be encoded to Base64".
Sure it can. Since you're just converting the image in-memory, you can just encode it with, e.g., base64.b64encode. Or, if you prefer, you can wrap your BytesIO in a codecs wrapper to base-64 it on the fly. But I'm not sure why you want to do that here.
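A minimal sketch of the NamedTemporaryFile route described above (the selenium lines are commented out since they need a live browser; the driver and locator are from the question, the helper name is illustrative):

```python
import tempfile

def stream_to_temp_file(data, suffix=".jpg"):
    """Write in-memory image bytes to a named temp file; return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    tmp.write(data)
    tmp.close()  # close so the browser can open the file by name
    return tmp.name

# path = stream_to_temp_file(r.content)
# self.driver.find_element_by_xpath("//input[@type='file']").send_keys(path)
```

`delete=False` is needed because the browser reads the file by path after Python has closed it; remember to `os.remove(path)` afterwards.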

Get EXIF data without downloading whole image - Python

Is it possible to get the EXIF information of an image remotely, downloading only the EXIF data?
From what I can understand about EXIF bytes in image files, the EXIF data is in the first few bytes of an image.
So the question is how to download only the first few bytes of a remote file, with Python? (Edit: Relying on HTTP Range Header is not good enough, as not all remote hosts support it, in which case full download will occur.)
Can I cancel the download after x bytes of progress, for example?
You can tell the web server to send you only part of a file by setting the HTTP Range header. See this answer for an example using urllib to partially download a file. You could download a chunk of e.g. 1000 bytes, check whether the EXIF data is contained in that chunk, and download more if you can't find the EXIF APP1 header or the EXIF data is incomplete.
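A sketch of the "check the chunk" step, assuming a JPEG (the byte prefix would come from a ranged request; the function name is illustrative). It walks the JPEG segment markers looking for a complete EXIF APP1 segment, returning None if more bytes are needed or the input isn't a JPEG:

```python
def find_exif_app1(data):
    """Return the EXIF APP1 payload from a JPEG byte prefix, or None."""
    if data[:2] != b"\xff\xd8":          # no SOI marker: not a JPEG
        return None
    i = 2
    while i + 4 <= len(data):
        if data[i] != 0xFF:
            return None                  # lost marker sync; invalid data
        marker = data[i + 1]
        # segment length is big-endian and includes its own two bytes
        length = int.from_bytes(data[i + 2:i + 4], "big")
        if marker == 0xE1 and data[i + 4:i + 10] == b"Exif\x00\x00":
            segment = data[i + 4:i + 2 + length]
            # incomplete if the chunk ended mid-segment
            return segment if len(segment) == length - 2 else None
        if marker == 0xDA:               # start of scan: no EXIF before pixels
            return None
        i += 2 + length
    return None                          # ran out of bytes: fetch more
```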
This depends on the image format heavily. For example, if you have a TIFF file, there is no knowing a priori where the EXIF data, if any, is within the file. It could be right after the header and before the first IFD, but this is unlikely. It could be way after the image data. Chances are it's somewhere in the middle.
If you want the EXIF information, extract that on the server (cache, maybe) and ship that down packaged up nicely instead of demanding client code do that.
