Can't download image with Python

I'm trying to download images with Python, but this one picture won't download and I don't know why. When I run the script it just stops and nothing happens: no image, no error message.
Here's the code; please tell me the reason and a solution.
import urllib.request

num = 404

def down(URL):
    fullname = str(num) + ".jpg"
    urllib.request.urlretrieve(URL, fullname)

im = "https://www.thesun.co.uk/wp-content/uploads/2020/09/67d4aff1-ddd0-4036-a111-3c87ddc0387e.jpg"
down(im)

This code should work for you; try changing the URL that you use and see the result:
import requests

pic_url = "https://www.thesun.co.uk/wp-content/uploads/2020/09/67d4aff1-ddd0-4036-a111-3c87ddc0387e.jpg"
cookies = dict(BCPermissionLevel='PERSONAL')

with open('aa.jpg', 'wb') as handle:
    response = requests.get(pic_url, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies, stream=True)
    if not response.ok:
        print(response)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)

What @MoetazBrayek says in their comment (but not their answer) is correct: the website you're querying is blocking the request.
It's common for sites to block requests based on user-agent or referer: if you try to curl https://www.thesun.co.uk/wp-content/uploads/2020/09/67d4aff1-ddd0-4036-a111-3c87ddc0387e.jpg you will get an HTTP error (403 Access Denied):
❯ curl -I https://www.thesun.co.uk/wp-content/uploads/2020/09/67d4aff1-ddd0-4036-a111-3c87ddc0387e.jpg
HTTP/2 403
Apparently The Sun wants a browser's user-agent, and specifically the string "mozilla" is enough to get through:
❯ curl -I -A mozilla https://www.thesun.co.uk/wp-content/uploads/2020/09/67d4aff1-ddd0-4036-a111-3c87ddc0387e.jpg
HTTP/2 200
You will have to either switch to the requests package or replace your url string with a proper urllib.request.Request object so you can customise more pieces of the request. And apparently urlretrieve does not support Request objects so you will also have to use urlopen:
import shutil
import urllib.request

req = urllib.request.Request(URL, headers={'User-Agent': 'mozilla'})
res = urllib.request.urlopen(req)
assert res.status == 200
with open(filename, 'wb') as out:
    shutil.copyfileobj(res, out)
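Folded back into the question's down() helper, a minimal sketch (assuming Python 3, with the missing dot added to the ".jpg" filename) could look like this:
import shutil
import urllib.request

num = 404

def down(url):
    # A browser-like User-Agent is enough to avoid the 403 from the CDN
    req = urllib.request.Request(url, headers={'User-Agent': 'mozilla'})
    with urllib.request.urlopen(req) as res:
        assert res.status == 200
        with open(str(num) + ".jpg", 'wb') as out:
            shutil.copyfileobj(res, out)

down("https://www.thesun.co.uk/wp-content/uploads/2020/09/67d4aff1-ddd0-4036-a111-3c87ddc0387e.jpg")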

Related

401 response with Python requests

I'm using the "requests" library to get an image from a URL ('http://any_login:any_password@10.10.9.2/ISAPI/Streaming/channels/101/picture?snapShotImageType=JPEG'), but the response is a 401 error. That URL is from my RTSP camera.
I tried 'HTTPBasicAuth', 'HTTPDigestAuth' and 'HTTPProxyAuth', but none of them work.
import requests
from requests.auth import HTTPBasicAuth

url = "http://any_login:any_password@10.10.9.2/ISAPI/Streaming/channels/101/picture?snapShotImageType=JPEG"
response = requests.get(url, auth=HTTPBasicAuth("any_login", "any_password"))
if response.status_code == 200:
    with open("sample.jpg", 'wb') as f:
        f.write(response.content)
I expected an image file from the RTSP stream as output, but I got error code 401.
Given your username I suspect your password may contain non-ASCII characters. I had a similar issue with a password containing diacritics.
This worked:
curl -u user:pwd --basic https://example.org
This (and variations) threw 401 Unauthorized:
import requests
requests.get("https://example.org", auth=requests.auth.HTTPBasicAuth("user","pwd"))
Changing the password to ASCII only characters solved the issue.
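If changing the password is not an option, one workaround sometimes worth trying is to hand requests the credentials as UTF-8 bytes, since HTTPBasicAuth encodes str values as Latin-1 by default. This is only a sketch with made-up credentials, and it assumes the server expects UTF-8 in the Basic auth header:
import requests
from requests.auth import HTTPBasicAuth

# Hypothetical credentials: encode the password yourself so requests
# does not attempt its default Latin-1 encoding of str values
auth = HTTPBasicAuth("user", "pässwörd".encode("utf-8"))
response = requests.get("https://example.org", auth=auth)
print(response.status_code)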

How to get HTML content of 404 error page using python?

I am using Python to get HTML data from multiple pages at a URL. I found that urllib throws an exception when a URL does not exist. How do I retrieve the HTML of the custom 404 error page (the page that says something like "Page is not found")?
Current code:
from urllib.request import Request, urlopen

try:
    req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
    client = urlopen(req)
    # downloading html data
    page_html = client.read()
    # closing connection
    client.close()
except:
    print("The following URL was not found. Program terminated.\n" + URL)
    break  # this snippet runs inside a loop over URLs
Have you tried the requests library?
Just install the library with pip
pip install requests
And use it like this:
import requests
response = requests.get('https://stackoverflow.com/nonexistent_path')
print(response.status_code) # 404
print(response.text) # Prints the raw HTML response
To preserve the comment that also answers the question (and because it's what I was looking for), here is the way to do this without going outside urllib, from a comment by t.m.adam (Nov 4, 2018):
See HTTPError. It has a .read() method which returns the response content.
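A minimal sketch of that approach, reusing the nonexistent Stack Overflow path from the answer above:
from urllib.request import Request, urlopen
from urllib.error import HTTPError

req = Request('https://stackoverflow.com/nonexistent_path',
              headers={'User-Agent': 'Mozilla/5.0'})
try:
    page_html = urlopen(req).read()
except HTTPError as e:
    page_html = e.read()  # HTTPError is file-like, so this is the custom 404 page's HTML
    print(e.code)         # 404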

Python request resulting in blank response

I'm relatively new to Python, so I would like some help. I've created a script which simply uses the requests library and basic auth to connect to an API and returns the XML or JSON result.
# Imports
import requests
from requests.auth import HTTPBasicAuth
# Set variables
url = "api"
apiuser = 'test'
apipass = 'testpass'
# CALL API
r = requests.get(url, auth=HTTPBasicAuth(apiuser, apipass))
# Print Statuscode
print(r.status_code)
# Print XML
xmlString = str(r.text)
print(xmlString)
But it returns a blank string.
If I use a browser to call the API and enter the credentials, I get the following response:
<Response>
<status>SUCCESS</status>
<callId>99999903219032190321</callId>
<result xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="Dummy">
<authorFullName>jack jones</authorFullName>
<authorOrderNumber>1</authorOrderNumber>
</result>
</Response>
Can anyone tell me where I'm going wrong?
What API are you connecting to?
Try adding a user-agent to the header:
r = requests.get(url, auth=HTTPBasicAuth(apiuser, apipass), headers={'User-Agent':'test'})
Although this is not an exact answer for the OP, it may solve the issue for someone having a blank response from python-requests.
I was getting a blank response because of the wrong content type. I was expecting HTML rather than JSON or a login success. The correct Content-Type for me was application/x-www-form-urlencoded.
Essentially I had to do the following to make my script work.
import requests

data = 'arcDate=2021/01/05'
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
}

r = requests.post('https://www.deccanherald.com/getarchive', data=data, headers=headers)
print(r.status_code)
print(r.text)
Learn more about this in application/x-www-form-urlencoded or multipart/form-data?
Run this and see what responses you get.
import requests

url = "https://google.com"
r = requests.get(url)
print(r.status_code)
print(r.text)  # raw body of the response
# r.json() would parse the body as JSON, but only for endpoints that actually return JSON
When you start having to pass things in your GET, PUT, DELETE, or POST requests, you add them to the request:
url = "https://google.com"
headers = {'api key': 'blah92382377432432'}
r = requests.get(url, headers=headers)
Then you should see the same type of responses. Long story short, print(r.text) to see the response; once you see the format of the response you get, you can work with it however you want.
I get an empty response only when the authentication failed or was denied; the HTTP status is still ≤ 400.
However, in the headers you can find:
'X-Seraph-LoginReason': 'AUTHENTICATED_FAILED'
or
'X-Seraph-LoginReason': 'AUTHENTICATED_DENIED'
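A quick way to check for those headers (a sketch; the URL and credentials are placeholders, and the header name is the one quoted above):
import requests

r = requests.get("https://example.org/rest/api", auth=("user", "password"))
print(r.status_code)                          # can still look successful
print(r.headers.get("X-Seraph-LoginReason"))  # e.g. AUTHENTICATED_FAILED or AUTHENTICATED_DENIED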
If the response is completely empty, not even a status code, I could suggest waiting some time before printing; maybe the server is taking a while to return the response to you.
import time
time.sleep(5)
Not the nicest thing, but it's worth trying
How can I make a time delay in Python?
I assume there are no errors during execution.
EDIT: never mind, you mentioned that you got a status code; I thought you were literally getting nothing.
As an aside, if you are using Python 3 you have to use print() as a function; it replaced the print statement.

How to POST a local file using urllib2 in Python?

I am a complete Python noob and am trying to run cURL equivalents using urllib2. What I want is a Python script that, when run, will do the exact same thing as the following cURL command in Terminal:
curl -k -F docfile=@myLocalFile.csv http://myWebsite.com/extension/extension/extension
I found the following template on a tutorial page:
import urllib
import urllib2
url = "https://uploadWebsiteHere.com"
data = "{From: 'sender#email.com', To: 'recipient#email.com', Subject: 'Postmark test', HtmlBody: 'Hello dear Postmark user.'}"
headers = { "Accept" : "application/json",
"Conthent-Type": "application/json",
"X-Postmark-Server-Token": "abcdef-1234-46cc-b2ab-38e3a208ab2b"}
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
but I am completely lost on the 'data' and 'headers' vars. The urllib2 documentation (https://docs.python.org/2/library/urllib2.html) defines the 'data' input as "a string specifying additional data to send to the server" and the 'headers' input as "a dictionary". I am totally out of my depth in trying to follow this documentation and do not see why a dictionary is necessary when I could accomplish this same task in terminal by only specifying the file and URL. Thoughts, please?
The data you are posting doesn't appear to be valid JSON. Assuming the server is expecting valid JSON, you should change that.
Your curl invocation does not pass any optional headers, so you shouldn't need to provide much in the request. If you want to verify the exact headers you could add -vi to the curl invocation and directly match them in the Python code. Alternatively, this works for me:
import urllib2

url = "http://localhost:8888/"
data = '{"From": "sender@email.com", "To": "recipient@email.com", "Subject": "Postmark test", "HtmlBody": "Hello dear Postmark user."}'
headers = {
    "Content-Type": "application/json"
}
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
It probably is in your best interest to switch over to using requests, but for something this simple the standard library urllib2 can be made to work.
What I want is a Python script that, when run, will do the exact same thing as the following cURL command in Terminal:
$ curl -k -F docfile=@myLocalFile.csv https://myWebsite.com/extension...
curl -F sends the file using the multipart/form-data content type. You can reproduce it easily using the requests library:
import requests  # $ pip install requests

with open('myLocalFile.csv', 'rb') as input_file:
    r = requests.post('https://myWebsite.com/extension/...',
                      files={'docfile': input_file}, verify=False)
verify=False is to emulate curl -k.
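If you also need to set the filename or content type that curl would otherwise pick for you, the files mapping accepts a (filename, fileobj, content_type) tuple. This is a sketch reusing the question's field name and placeholder URL:
import requests

with open('myLocalFile.csv', 'rb') as input_file:
    r = requests.post(
        'https://myWebsite.com/extension/...',
        files={'docfile': ('myLocalFile.csv', input_file, 'text/csv')},
        verify=False,  # same as curl -k: skip TLS certificate verification
    )
print(r.status_code)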

In Python why does urllib.urlopen make Google give an http status "302 Moved"?

Using Python 2.6.6 on CentOS 6.4
import urllib
#url = 'http://www.google.com.hk' #ok
#url = 'http://clients1.google.com.hk' #ok
#url = 'http://clients1.google.com.hk/complete/search' #ok (blank)
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' #fails
print url
page = urllib.urlopen(url).read()
print page
Using the first 3 URLs, the code works. But with the 4th URL, Python gives the following 302:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
here.
</BODY></HTML>
The URL in my code is the same as the URL it tells me to use:
My URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Its URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Google says the URL moved, but the URLs are the same. Any ideas why?
Update: The URLs all work fine in a browser. But in Python command line the 4th URL is giving a 302.
urllib is ignoring the cookies and sending the new request without cookies, so it causes a redirect loop at that URL. To handle this you can use urllib2 (which is more up-to-date) and add a cookie handler:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open('http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc')
print response.read()
It most likely has to do with the headers and perhaps cookies. I did a quick test on the command-line using curl. It also gives me the 302 moved. The Location header it provides is different, as is the one in the document. If I follow the body URL I get a 204 response (weird). If I follow the Location header I end up getting a circular response like you indicate.
Perhaps important is the Set-Cookie header. It may be redirecting until it gets an appropriate cookie set. It may also be scanning the User-Agent and doing something based on that. Those are the big aspects that differentiate a browser from a tool like requests or urllib. The browser creates sessions, stores cookies, and sends different headers.
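One way to approximate what a browser does, sketched with requests (not a guaranteed fix for this particular Google endpoint), is to use a Session so cookies set by the redirect are re-sent, plus a browser-like User-Agent:
import requests

session = requests.Session()  # stores cookies from Set-Cookie and re-sends them
session.headers.update({'User-Agent': 'Mozilla/5.0'})

url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc'
r = session.get(url)  # redirects are followed automatically
print(r.status_code)
print(r.text)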
I don't know why urllib fails (I get the same response), however requests lib works perfectly:
import requests
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' # fails
print (requests.get(url).text)
If you use your favorite web debugger (Fiddler for me) and open up that URL in your browser, you'll see that you also get that initial 302 response. Your browser is just smart enough to redirect you automatically. So your code is returning the correct response. If you want your code to redirect to the new URL automatically, then you have to make your code smart enough to do so.
