Can't download image with python urllib - python

I am trying to download an image with python and urllib.
This is my first attempt:
import urllib
url = "https://xxxxxxxxxxxxxxxxxxxxxxxxxx.jpg"
urllib.urlretrieve(url, "myimage.jpg")
The result is an empty (0 Byte) file called "myimage.jpg"
The image is accessible from a browser at the same link, so I tried changing the user agent, using this script I found:
from urllib import FancyURLopener
url = "https://xxxxxxxxxxxxxxxxxxxxxxxxxx.jpg"
class MyOpener(FancyURLopener, object):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
myopener = MyOpener()
myopener.retrieve(url, 'myimage.jpg')
The result is again an empty (0 Byte) file called "myimage.jpg".
Additional notes:
The robots.txt file is not accessible from a browser: it returns a 403 "access denied" error.
The URL contains the word 'ssl'.
What can I do?
EDIT: The image is hotlinked from another web domain. I noticed that the image is accessible from the browser only after I have first opened it from that specific domain. If I clear the cookies, the image becomes inaccessible.

It works if the URL exists:
import urllib
url = "https://www.lhorn.de/images/6cfYoU3.png"
png = urllib.urlretrieve(url, "nodejs-1995.png")
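Given the EDIT above (the image is only served after the referring page has set its cookies), a plain urlretrieve will keep returning 0 bytes. Here is a hedged sketch (Python 2) that visits a referring page first to collect its cookies, then requests the image with a Referer header; the referring-page URL is a placeholder:
import cookielib
import urllib2

referer = "https://example.com/page-that-embeds-the-image.html"  # placeholder
url = "https://xxxxxxxxxxxxxxxxxxxxxxxxxx.jpg"

# An opener that keeps cookies between requests and sends browser-like headers.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/5.0'),
                     ('Referer', referer)]

opener.open(referer)            # first request: the site sets its cookies here
data = opener.open(url).read()  # second request: fetch the image itself

with open("myimage.jpg", "wb") as f:
    f.write(data)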

Related

Why can some image URLs only be displayed inside an HTML <img src="urllink"> element, but not opened directly in the browser?

Recently, I have been trying to download some images from a website.
I find the displayed image element inside the HTML.
Then I open the image URL in a new tab, but it returns a 403 Forbidden page.
If I copy the string and insert it into another page's HTML, the image displays successfully.
I want to ask the reason for this, and what I can do to download the image.
(I am trying to download it with Python's requests.get().)
Thank you.
This web server checks the Referer header when you request the image. To successfully download the image, the Referer must be the page the image is on. It doesn't care about the User-Agent. I assume the image showed up when you put it in another page because your browser cached the image, and did not actually request it from the server again.
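To see the Referer requirement in isolation, here is a minimal hedged sketch with plain requests, assuming you already know the image URL (finding that URL is covered next):
import requests

page_url = "https://tw.manhuagui.com/comic/35275/481200.html"
image_url = "https://i.hamreus.com/..."  # placeholder: the real URL is assembled by JavaScript

# The Referer header must point at the page that embeds the image.
response = requests.get(image_url, headers={"Referer": page_url})
response.raise_for_status()
with open("image.webp", "wb") as f:
    f.write(response.content)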
By using your browser's network monitor tool, you can see how your browser got the image's URL. In this case, the URL wasn't a part of the original html document. Your browser executed some JavaScript that unpacked the URL and inserted an img element into the div element with id="mangaBox". Because of this, you can't use vanilla requests, as it doesn't execute JavaScript. I used Requests-HTML.
The code below downloads the image from the link you gave in your comment, and saves it to disk:
import os
import urllib.parse

from requests_html import HTMLSession

session = HTMLSession()
session.headers.update({
    "User-Agent": r"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0",
    "Referer": r"https://tw.manhuagui.com/comic/35275/481200.html",
})

url = r"https://tw.manhuagui.com/comic/35275/481200.html"
response = session.get(url)
print(response, len(response.content))

# Render the page's JavaScript so the img element gets inserted into the DOM.
response.html.render()

img = response.html.find("img#mangaFile", first=True)
print("img element:", img)

url = img.attrs["src"]
print("image url:", url)

response = session.get(url)
print(response, len(response.content))

filename = os.path.basename(urllib.parse.urlsplit(url).path)
print("filename:", filename)

with open(filename, "wb") as f:
    f.write(response.content)
Output:
<Response [200]> 6715
img element: <Element 'img' alt='在地下城寻找邂逅难道有错吗? 第00话' id='mangaFile' src='https://i.hamreus.com/ps3/z/zdxcxzxhndyc_sddc/第00话/P0018.jpg.webp?cid=481200&md5=aAAP75PBy9DIa0bb8Hlwfw' class=('mangaFile',) data-tag='mangaFile' style='display: block; transform: rotate(0deg); transform-origin: 50% 50% 0px;' imgw='907'>
image url: https://i.hamreus.com/ps3/z/zdxcxzxhndyc_sddc/第00话/P0018.jpg.webp?cid=481200&md5=aAAP75PBy9DIa0bb8Hlwfw
<Response [200]> 186386
filename: P0018.jpg.webp
For what it's worth, a whole heap of image URLs, in addition to the main image of the current page, are packed in the last script element of the original html document.
<script type="text/javascript">window["\x65\x76\x61\x6c"](function(p,a,c,k,e,d)...
Some websites block requests that lack a User-Agent header; try this:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
requests.get(url, headers=headers)
Reference to Python requests. 403 Forbidden

urlretrieve not working for this site

I'm trying to download an image, but it doesn't seem to work. Is it being blocked by DDoS protection?
Here is the code:
urllib.request.urlretrieve("http://archive.is/Xx9t3/scr.png", "test.png")
Basically, I want to download that image as "test.png". I'm using Python 3, hence the urllib.request before urlretrieve.
import urllib.request
I have that as well.
Any way I can download the image? thanks!
For reasons I cannot even imagine, the server requires a well-known user agent. So you must pretend to be, for example, Firefox, and it will agree to send the image:
import urllib.request

# first build a request object with a browser-like User-Agent
req = urllib.request.Request(
    "http://archive.is/Xx9t3/scr.png",
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:43.0) Gecko/20100101 Firefox/43.0'})

# then use it
resp = urllib.request.urlopen(req)
with open("test.png", "wb") as fd:
    fd.write(resp.read())
Rather stupid, but when a server admin goes mad, just be as stupid as he is...
I'd advise you to use requests; basically, the way you are trying to get the image is forbidden. Check this:
import requests
import shutil
r = requests.get('http://archive.is/Xx9t3/scr.png', stream=True)
if r.status_code == 200:
    with open("test.png", 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
This snippet was adapted from here
The magic behind this is how the resource is retrieved; with requests, that part is the stream=True argument. Some servers are more restrictive about the methods used to pull resources like media.
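If you prefer not to touch the raw file object, an equivalent hedged sketch uses iter_content to stream the body in chunks:
import requests

r = requests.get('http://archive.is/Xx9t3/scr.png', stream=True)
r.raise_for_status()
with open("test.png", "wb") as f:
    # Stream the body in 8 KB chunks instead of loading it all into memory.
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)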

download zipfile with urllib2 fails

I am trying to download a file with urllib. I am using a direct link to this rar (if I use Chrome on this link, it immediately starts downloading the rar file), but when I run the following code:
file_name = url.split('/')[-1]
u = urllib.urlretrieve(url, file_name)
... all I get back is a 22 KB rar file, which is obviously wrong. What is going on here? I'm on OS X Mavericks with Python 2.7.5, and here is the URL.
(Disclaimer: this is a free download, as seen on the band's website.)
Got it. The headers were lacking a lot of information. I resorted to using Requests, and with each GET request I would add the following content to the header:
'Connection': 'keep-alive'
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'
'Cookie': 'JSESSIONID=36DAD704C8E6A4EF4B13BCAA56217961; ziplocale=en; zippop=2;'
However, I noticed that not all of this is necessary (the Cookie alone is enough), but it did the trick: I was able to download the entire file. If using urllib2, I am sure that doing the same (sending requests with the appropriate header content) would also work. Thank you all for the good tips and for pointing me in the right direction. I used Fiddler to see what my Requests GET header was missing in comparison to Chrome's GET header. If you have a similar issue, I suggest you check it out.
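In code, the approach described above boils down to something like this hedged sketch with requests (the header values are the ones listed; the cookie is session-specific and must come from your own browser):
import requests

url = "http://www29.zippyshare.com/d/12069311/2695/Del%20Paxton-Worst.%20Summer.%20Ever%20EP%20%282013%29.rar"
headers = {
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
    # Session-specific placeholder; copy the current value from your browser.
    'Cookie': 'JSESSIONID=36DAD704C8E6A4EF4B13BCAA56217961; ziplocale=en; zippop=2;',
}

r = requests.get(url, headers=headers)
with open(url.split('/')[-1], 'wb') as f:
    f.write(r.content)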
I tried this with Python using the following code, which replaces urllib with urllib2:
url = "http://www29.zippyshare.com/d/12069311/2695/Del%20Paxton-Worst.%20Summer.%20Ever%20EP%20%282013%29.rar"
import urllib2
file_name = url.split('/')[-1]
response = urllib2.urlopen(url)
data = response.read()
with open(file_name, 'wb') as bin_writer:
    bin_writer.write(data)
and I get the same 22k file. Trying it with wget on that URL yields the same file; however, I was able to begin the download of the full file (around 35 MB, as I recall) by pasting the URL into the Chrome navigation bar. Perhaps they are serving different files based upon the headers you send in your request? The User-Agent header sent by Python or wget looks different to their server (i.e. not like a browser) than the one your browser sends when you click the link.
I did not open the .rar archives to inspect the two files.
This thread discusses setting headers with urllib2, and the Python documentation explains how to read the response status codes from your urllib2 request, which could be helpful as well.

Facebook stream API error works in Browser but not Server-side

If I enter this URL in a browser it returns to me the valid XML data that I am interested in scraping.
http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=36343869811&filter=2&max_time=0&try_scroll_load=false&_log_clicktype=Filter%20Stories%20or%20Pagination&ajax_log=0
However, if I do it from the server side, it doesn't work as it previously did. Now it just returns this error, which seems to be a default error message:
{u'silentError': 0, u'errorDescription': u"Something went wrong. We're working on getting it fixed as soon as we can.", u'errorSummary': u'Oops', u'errorIsWarning': False, u'error': 1357010, u'payload': None}
Here is the code in question; I've tried multiple user agents, to no avail:
import urllib2
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; he; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3'
uaheader = { 'User-Agent' : user_agent }
wallurl='http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=36343869811&filter=2&max_time=0&try_scroll_load=false&_log_clicktype=Filter%20Stories%20or%20Pagination&ajax_log=0'
req = urllib2.Request(wallurl, headers=uaheader)
resp = urllib2.urlopen(req)
pageData=convertTextToUnicode(resp.read())
print pageData #and get that error
What would be the difference between the server calls and my own browser aside from User Agents and IP addresses?
I tried the above URL in both Chrome and Firefox. It works in Chrome but fails in Firefox. In Chrome I am signed into Facebook, while in Firefox I am not.
This could be the reason for the discrepancy. You will need to provide authentication in the urllib2-based script you posted.
There is an existing question on authentication with urllib2.
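One hedged way to provide that authentication, without scripting a full login flow, is to forward the session cookies from a browser where you are already signed in. A sketch (Python 2; the Cookie value is a placeholder you must copy from your browser's developer tools):
import urllib2

wallurl = 'http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=36343869811&filter=2&max_time=0&try_scroll_load=false&_log_clicktype=Filter%20Stories%20or%20Pagination&ajax_log=0'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; he; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3',
    # Placeholder: paste the Cookie header from a logged-in facebook.com request.
    'Cookie': '<your facebook session cookies>',
}

req = urllib2.Request(wallurl, headers=headers)
resp = urllib2.urlopen(req)
print resp.read()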

How can I change a user agent string programmatically?

I would like to write a program that changes my user agent string.
How can I do this in Python?
I assume you mean a user-agent string in an HTTP request? This is just an HTTP header that gets sent along with your request.
Using Python's urllib2:
import urllib2

url = 'http://foo.com/'

# add a header to define a custom User-Agent
# (passing data to Request would turn it into a POST, so set headers only)
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req).read()
In urllib, it's done like this:
import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "MyStrangeUserAgent"

urllib._urlopener = AppURLopener()
and then just use urllib.urlopen normally. In urllib2, build the request with req = urllib2.Request(...), passing headers=somedict to set all the headers you want (including the user agent) in the new request object req, then call urllib2.urlopen(req).
Other ways of sending HTTP requests have other ways of specifying headers, of course.
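For example, with the third-party requests library (a minimal hedged sketch; the URL is a stand-in), the user agent is just another entry in the headers dict:
import requests

# The User-Agent entry overrides requests' default user agent string.
response = requests.get('http://example.com/',
                        headers={'User-Agent': 'MyStrangeUserAgent'})
print(response.status_code)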
Using Python, you can use urllib to download web pages and override the version attribute to change the user agent.
There is a very good example on http://wolfprojects.altervista.org/changeua.php
Here is an example copied from that page:
>>> from urllib import FancyURLopener
>>> class MyOpener(FancyURLopener):
...     version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
...
>>> myopener = MyOpener()
>>> page = myopener.open('http://www.google.com/search?q=python')
>>> page.read()
[…]Results <b>1</b> - <b>10</b> of about <b>81,800,000</b> for <b>python</b>[…]
urllib2 is nice because it's built in, but I tend to use mechanize when I have the choice. It extends a lot of urllib2's functionality (though much of it has been added to Python in recent years). Anyhow, if that's what you're using, here's an example from their docs on how to change the user-agent string:
import mechanize
cookies = mechanize.CookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
                     ("From", "responsible.person@example.com")]
Best of luck.
As mentioned in the answers above, the User-Agent field in the HTTP request header can be changed using built-in Python modules such as urllib2. At the same time, it is also important to analyze what exactly the web server sees. A recent post on user agent detection gives sample code and output describing what the web server sees when a programmatic request is sent.
If you want to change the user agent string you send when opening web pages, google around for a Firefox plugin. ;) For example, I found this one. Or you could write a proxy server in Python, which changes all your requests independently of the browser.
My point is, changing the string is going to be the easy part; your first question should be: where do I need to change it? If you already know (in the browser? a proxy server? on the router between you and the web servers you're hitting?), we can probably be more helpful. Or, if you're just doing this inside a script, go with any of the urllib answers. ;)
Updated for Python 3.2 (py3k):
import urllib.request
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
url = 'http://www.google.com'
# passing data here would turn the request into a POST, so set headers only
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request).read()
