Connect with page (Error 403) - python

I can't connect to the page. Here is my code and the error which I get:
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
import urllib

someurl = "https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET"
req = Request(someurl)
try:
    response = urllib.request.urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("Everything is fine")
Error code: 403

Some websites require a browser-like "User-Agent" header, others require specific cookies. In this case, I found out by trial and error that both are required. What you need to do is:
Send an initial request with a browser-like user-agent. This will fail with 403, but you will also obtain a valid cookie in the response.
Send a second request with the same user-agent and the cookie that you got before.
In code:
import urllib.request
from urllib.error import URLError

# This handler will store and send cookies for us.
handler = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(handler)

# Browser-like user agent to make the website happy.
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET'
request = urllib.request.Request(url, headers=headers)

for i in range(2):
    try:
        response = opener.open(request)
    except URLError as exc:
        print(exc)
    else:
        print(response)
# Output:
# HTTP Error 403: Forbidden (expected, first request always fails)
# <http.client.HTTPResponse object at 0x...> (correct 200 response)
Or, if you prefer, using requests:
import requests

session = requests.Session()
jar = requests.cookies.RequestsCookieJar()
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET'

for i in range(2):
    response = session.get(url, cookies=jar, headers=headers)
    print(response)
# Output:
# <Response [403]>
# <Response [200]>
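Note that requests.Session already persists cookies from earlier responses on its own, so the explicit jar above is likely redundant. A minimal sketch of the same two-request idea, assuming the site's behaviour is unchanged:

import requests

# The Session stores cookies set by the first response and sends
# them automatically on the second request.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET'

first = session.get(url)    # expected: 403, but the cookie is set
second = session.get(url)   # expected: 200
print(first.status_code, second.status_code)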

You can use http.client. First, you need to open a connection with the server, and then make a GET request. Like this:
import http.client

conn = http.client.HTTPSConnection("www.genecards.org")
try:
    conn.request("GET", "/cgi-bin/carddisp.pl?gene=MET")
    response = conn.getresponse()
    body = response.read().decode("UTF-8")
except (http.client.HTTPException, OSError) as e:
    print('We failed to reach the server.')
    print('Reason: ', e)
else:
    # Note: http.client does not raise on HTTP error statuses such as 403;
    # check response.status yourself.
    print("Everything is fine:", response.status, response.reason)

Related

Python urllib (with proxy setting) returns wrong code, any advice please?

Using urllib I am checking a list of URLs. My machine sits behind a Squid web proxy, but somehow I can't get the proxy setting right in the requests: I get 404 instead of 200 when calling the function in a for loop or via a map function.
However, single requests work fine!
from multiprocessing import Pool
import urllib.error
import urllib.request

proxy_host = "192.168.1.1:3128"
urls = ['https://www.youtube.com/watch?v=XqZsoesa55w',
        'https://www.youtube.com/watch?v=GR2o6k8aPlI',
        'https://stackoverflow.com/']
single request example (works fine):
req = urllib.request.Request(
    url=urls[0],
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0'
    })
req.set_proxy(proxy_host, 'http')
conn = urllib.request.urlopen(req)
conn.getcode()  # --> returns 200
This returns the correct HTTP code for a single URL check.
batch request example (returns wrong http status code):
Function:
def check_url(url):
    req = urllib.request.Request(
        url=url,
        data=None,
        headers={
            'User-Agent': 'Mozilla/5.0'
        })
    req.set_proxy(proxy_host, 'http')
    try:
        conn = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        return [str(e), url]
    except urllib.error.URLError as e:
        return [str(e), url]
    except ValueError as e:
        return [str(e), url]
    else:
        if conn:
            return conn.getcode()
        else:
            return 'Unknown Status!'

for url in urls:
    print(check_url(url))
# returns:
# 404
# 404
# 404

p = Pool(processes=20)
print(p.map(check_url, urls))
# returns:
# [404, 404, 404]
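Not a confirmed fix, but one thing worth checking: set_proxy(proxy_host, 'http') rewrites the request type to plain http even for the https URLs above. Installing the proxy once with a ProxyHandler-based opener routes both schemes through Squid; a minimal sketch:

import urllib.request

proxy_host = "192.168.1.1:3128"

# Send both http and https traffic through the Squid proxy for every
# request made via this opener.
proxy = urllib.request.ProxyHandler({
    'http': 'http://' + proxy_host,
    'https': 'http://' + proxy_host,
})
opener = urllib.request.build_opener(proxy)
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

print(opener.open('https://stackoverflow.com/').getcode())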

How to check HTTP errors for more than two URLs?

Question: I have 3 URLs - testurl1, testurl2 and testurl3. I'd like to try testurl1 first; if I get a 404 error then try testurl2; if that gets a 404 error then try testurl3. How do I achieve this? So far I've tried the code below, but it works only for two URLs; how do I add support for a third URL?
from urllib2 import Request, urlopen
from urllib2 import URLError, HTTPError

def checkfiles():
    req = Request('http://testurl1')
    try:
        response = urlopen(req)
        url1 = 'http://testurl1'
    except (HTTPError, URLError):
        url1 = 'http://testurl2'
    print url1
    finalURL = 'wget ' + url1 + '/testfile.tgz'
    print finalURL

checkfiles()
Another job for a plain old for loop:
for url in (testurl1, testurl2, testurl3):
    req = Request(url)
    try:
        response = urlopen(req)
    except HTTPError as err:
        if err.code == 404:
            continue
        raise
    else:
        # do what you want with the successful response here (or outside the loop)
        break
else:
    # They ALL errored out with HTTPError code 404. Handle this?
    raise err
Note that the else clause on the for loop runs only when the loop finishes without hitting break, i.e. when every URL failed with a 404.
Hmmm maybe something like this?
from urllib2 import Request, urlopen
from urllib2 import URLError, HTTPError

def checkfiles():
    try:
        response = urlopen(Request('http://testurl1'))
        url1 = 'http://testurl1'
    except (HTTPError, URLError):
        try:
            response = urlopen(Request('http://testurl2'))
            url1 = 'http://testurl2'
        except (HTTPError, URLError):
            url1 = 'http://testurl3'
    print url1
    finalURL = 'wget ' + url1 + '/testfile.tgz'
    print finalURL

checkfiles()

Get variable outside exception Python

I am calling an API with urllib. When something is not as expected, the API throws an error at the user (e.g. HTTP Error 415: Unsupported Media Type). But alongside that, the API returns JSON with more information. I would like to pass that JSON to the exception and parse it there, so I can give the user information about the error.
Is that possible? And if so, how is it done?
Extra info:
Error: HTTPError
--EDIT--
On request, here is some code (I want to read resp in the exception):
def _sendpost(url, data=None, filetype=None):
    try:
        global _auth
        req = urllib.request.Request(url, data)
        req.add_header('User-Agent', _useragent)
        req.add_header('Authorization', 'Bearer ' + _auth['access_token'])
        if filetype is not None:
            req.add_header('Content-Type', filetype)
        resp = urllib.request.urlopen(req, data)
        data = json.loads(resp.read().decode('utf-8'), object_pairs_hook=OrderedDict)
    except urllib.error.HTTPError as e:
        print(e)
    return data
--EDIT 2--
I do not want to use extra libraries/modules, as I do not control the target machines.
Code
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://api.gutefrage.net')
except urllib.error.HTTPError as e:
    error_message = e.read()
    print(error_message)
Output
b'{"error":{"message":"X-Api-Key header is missing or invalid","type":"API_REQUEST_FORBIDDEN"}}'
Not asked, but with the json module you can convert it to a dict:
import json
json.loads(error_message.decode("utf-8"))
Which gives you the dict out of the byte string.
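For example, with the error body shown above:

info = json.loads(error_message.decode("utf-8"))
print(info["error"]["message"])  # X-Api-Key header is missing or invalid
print(info["error"]["type"])     # API_REQUEST_FORBIDDEN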
If you're stuck with using urllib, then you can use the error to read the text of the response, and load that into JSON.
from urllib import request, error
import json

try:
    req = request.Request(url, data)
    req.add_header('User-Agent', _useragent)
    req.add_header('Authorization', 'Bearer ' + _auth['access_token'])
    if filetype is not None:
        req.add_header('Content-Type', filetype)
    resp = request.urlopen(req, data)
    data = json.loads(resp.read().decode('utf-8'), object_pairs_hook=OrderedDict)
except error.HTTPError as e:
    json_response = json.loads(e.read().decode('utf-8'))
If you're not stuck with urllib, I would highly recommend using the requests module instead. With it, you can have something like this:
response = requests.get("http://www.example.com/api/action")
if response.status_code == 415:
    response_json = response.json()
requests doesn't throw an exception when it encounters a non-2xx response code; instead, it returns the response anyway, with the status code set on it.
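If you do want an exception on error statuses, requests offers raise_for_status(); a minimal sketch:

import requests

response = requests.get("http://www.example.com/api/action")
try:
    # raise_for_status() raises requests.HTTPError for 4xx/5xx codes.
    response.raise_for_status()
except requests.HTTPError:
    # The response object is still available, body included.
    print(response.status_code, response.json())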
You can also add headers and parameters to these requests:
headers = {
    'User-Agent': _useragent,
    'Authorization': 'Bearer ' + _auth['access_token']
}
response = requests.get("http://www.example.com/api/action", headers=headers)

Python 3 urllib.request.urlopen

How can I avoid exceptions from urllib.request.urlopen if response.status_code is not 200? At the moment it raises URLError or HTTPError based on the request status.
Is there any other way to make a request with the Python 3 standard library?
How can I get the response headers if status_code != 200?
Use try/except, as in the code below:
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("http://www.111cn.net/")
try:
    response = urlopen(req)
except HTTPError as e:
    # do something
    print('Error code: ', e.code)
except URLError as e:
    # do something
    print('Reason: ', e.reason)
else:
    # do something
    print('good!')
The docs state that the exception type, HTTPError, can also be treated as an HTTPResponse. Thus, you can get the response body from an error response as follows:
import urllib.request
import urllib.error

def open_url(request):
    try:
        return urllib.request.urlopen(request)
    except urllib.error.HTTPError as e:
        # "e" can be treated as an http.client.HTTPResponse object
        return e
and then use as follows:
result = open_url('http://www.stackoverflow.com/404-file-not-found')
print(result.status) # prints 404
print(result.read()) # prints page contents
print(result.headers.items()) # lists headers
I found a solution in the py3 docs:
>>> import http.client
>>> conn = http.client.HTTPConnection("www.python.org")
>>> # Example of an invalid request
>>> conn.request("GET", "/parrot.spam")
>>> r2 = conn.getresponse()
>>> print(r2.status, r2.reason)
404 Not Found
>>> data2 = r2.read()
>>> conn.close()
https://docs.python.org/3/library/http.client.html#examples

Getting error headers with urllib2

I need to send a PUT request to a web service and get some data out of the error headers, which is the expected result of the request. The code goes like this:
import urllib2

Request = urllib2.Request(destination_url, headers=headers)
Request.get_method = lambda: 'PUT'
try:
    Response = urllib2.urlopen(Request)
except urllib2.HTTPError, e:
    print 'Error code: ', e.code
    print e.read()
I get error 308, but the response is empty and I'm not getting any data out of the HTTPError. Is there a way to get the HTTP headers while getting an HTTP error?
e has undocumented headers and hdrs properties that contain the HTTP headers sent by the server.
By the way, 308 (Permanent Redirect) is a relatively recent addition to HTTP (RFC 7538), and urllib2's redirect handler does not follow it automatically.
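A minimal sketch of reading those headers, reusing the request setup from the question:

import urllib2

req = urllib2.Request(destination_url, headers=headers)
req.get_method = lambda: 'PUT'
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:
    # e.headers is a message-like mapping of the response headers
    # sent along with the error status.
    for name, value in e.headers.items():
        print name, ':', value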
