I am trying to use urllib2 through a proxy; however, after trying just about every variation of passing my verification details using urllib2, I either get a request that hangs forever and returns nothing or I get 407 Errors. I can connect to the web fine using my browser which connects to a prox-pac and redirects accordingly; however, I can't seem to do anything via the command line curl, wget, urllib2 etc. even if I use the proxies that the prox-pac redirects to. I tried setting my proxy to all of the proxies from the pac-file using urllib2, none of which work.
My current script looks like this:
import urllib2 as url
proxy = url.ProxyHandler({'http': 'username:password#my.proxy:8080'})
auth = url.HTTPBasicAuthHandler()
opener = url.build_opener(proxy, auth, url.HTTPHandler)
url.install_opener(opener)
url.urlopen("http://www.google.com/")
which throws HTTP Error 407: Proxy Authentication Required and I also tried:
import urllib2 as url
handlePass = url.HTTPPasswordMgrWithDefaultRealm()
handlePass.add_password(None, "http://my.proxy:8080", "username", "password")
auth_handler = url.HTTPBasicAuthHandler(handlePass)
opener = url.build_opener(auth_handler)
url.install_opener(opener)
url.urlopen("http://www.google.com")
which hangs like curl or wget timing out.
What do I need to do to diagnose the problem? How is it possible that I can connect via my browser but not from the command line on the same computer using what would appear to be the same proxy and credentials?
Might it be something to do with the router? if so, how can it distinguish between browser HTTP requests and command line HTTP requests?
Frustrations like this are what drove me to use Requests. If you're doing significant amounts of work with urllib2, you really ought to check it out. For example, to do what you wish to do using Requests, you could write:
import requests
from requests.auth import HTTPProxyAuth
proxy = {'http': 'http://my.proxy:8080'}
auth = HTTPProxyAuth('username', 'password')
r = requests.get('http://wwww.google.com/', proxies=proxy, auth=auth)
print r.text
Or you could wrap it in a Session object and every request will automatically use the proxy information (plus it will store & handle cookies automatically!):
s = requests.Session(proxies=proxy, auth=auth)
r = s.get('http://www.google.com/')
print r.text
Related
I would like to access a ressource with a particular url. Let's say I have only access to a PC (without admin rights) from which I cannot use the requests module due to different reasons.
Normally, I would address an API und perform HTTP GET and HTTP POST requests with:
import requests
url = r"https://httpbin.org/json"
r = requests.get(url)
If I would like to provide header and authorisation details, I would add
headers = {"Content-Type": "application/json"}
auth = ("username", "password")
r = requests.post(url, auth=auth, headers=headers)
as well as the payload in the data exchange format of the API (either JSON or XML).
Unfortunately, I cannot use the requests module on the aforementioned system. However, I can use the selenium module with the Internet Explorer webdriver (no Firefox and no Chrome).
I tried to access the url of the API with
from selenium import webdriver
driver = webdriver.Ie()
driver.get(url)
This does open an authentication popup, which I cannot access with the selenium "switch_to" functions. Ideally, I would like to perform a HTTP POST via selenium and provide authentication as well as header information. Would that be possible?
I am trying to test a proxy connection by using urllib2.ProxyHandler. However, there probably some situation that I am going to request a HTTPS website (eg: https://www.whatismyip.com/)
Urllib2.urlopen() will throw ERROR if request a HTTPS site. So I tried to use a helper function to rewrite the URLOPEN method.
Here is the helper function:
def urlopen(url, timeout):
if hasattr(ssl, 'SSLContext'):
SslContext = ssl.create_default_context()
SslContext.check_hostname = False
SslContext.verify_mode = ssl.CERT_NONE
return urllib2.urlopen(url, timeout=timeout, context=SslContext)
else:
return urllib2.urlopen(url, timeout=timeout)
This helper function based on answer
Then I use:
urllib2.install_opener(
urllib2.build_opener(
urllib2.ProxyHandler({'http': '127.0.0.1:8080'})
)
)
to setup http proxy for urllib.opener.
Ideally, it should working when i request a website by using urlopen('http://whatismyip.com', 30) and it should pass all traffic through http proxy.
However, the urlopen() will fall into if hasattr(ssl, 'SSLContext') all the time even if it is a HTTP site. In addition, HTTPS site is not using HTTP proxy either. This cause the HTTP proxy become invalid and all traffic going through unproxied network
I also tried this answer to change HTTP into HTTPS urllib2.ProxyHandler({'https': '127.0.0.1:8080'}) but it still not working.
My proxy is working. If i am using urllib2.urlopen() instead of the rewrite version urlopen(), it works for HTTP site.
But, I do need consider the suitation if the urlopen gonna need to be used on a HTTPS ONLY site.
How to do that?
Thanks
UPDATE1: I cannot get this work with Python 2.7.11 and some of server working properly with Python 2.7.5. I assue it is python version issue.
Urllib2 will not go through HTTPS Proxy so all HTTPS web address will failed to use proxy.
The problem is when you pass context argument to urllib2.urlopen() then urllib2 creates opener itself instead of using the global one, which is the one that gets set when you call urllib2.install_opener(). As a result your instance of ProxyHandler which you meant to be used is not being used.
The solution is not to install opener but to use the opener directly. When building your opener, you have to pass both an instance of your ProxyHandler class (to set proxies for http and https protocols) and an instance of HTTPSHandler class (to set https context).
I created https://bugs.python.org/issue29379 for this issue.
I personally would suggest the use of something such as python-requests as it will alleviate a lot of the issues with setting up the proxy using urllib2 directly. When using requests with a proxy you will have to do: (From their documentation)
import requests
proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
And disabling SSL Certificate verification is as simple as passing verify=False the requests.get command above. However, this should be used sparingly and the actual issue with the SSL Cert verification should be resolve.
One more solution is to pass context into HTTPSHandler and pass this handler into build_opener together with ProxyHandler:
proxies = {'https': 'http://localhost:8080'}
proxy = urllib2.ProxyHandler(proxies)
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
handler = urllib2.HTTPSHandler(context=context)
opener = urllib2.build_opener(proxy, handler)
urllib2.install_opener(opener)
Now you can view all your HTTPS requests/responses in your proxy.
Using urlopen also for url queries seems obvious. What I tried is:
import urllib2
query='http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
f = urllib2.urlopen(query)
s = f.read()
f.close()
However, for this specific url query it fails with HTTP error 403 forbidden
When entering this query in my browser, it works.
Also when using http://www.httpquery.com/ to submit the query, it works.
Do you have suggestions how to use Python right to grab the correct response?
Looks like it requires cookies... (which you can do with urllib2), but an easier way if you're doing this, is to use requests
import requests
session = requests.session()
r = session.get('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627')
This is generally a much easier and less-stressful method of retrieving URLs in Python.
requests will automatically store and re-use cookies for you. Creating a session is slightly overkill here, but is useful for when you need to submit data to login pages etc..., or re-use cookies across a site... etc...
using urllib2 is something like
import urllib2, cookielib
cookies = cookielib.CookieJar()
opener = urllib2.build_opener( urllib2.HTTPCookieProcessor(cookies) )
data = opener.open('url').read()
It appears that the urllib2 default user agent is banned by the host. You can simply supply your own user agent string:
import urllib2
url = 'http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
request = urllib2.Request(url, headers={"User-Agent" : "MyUserAgent"})
contents = urllib2.urlopen(request).read()
print contents
Using urllib2 and trying to get an https page, it keeps failing with
Invalid url, unable to resolve
The url is
https://www.domainsbyproxy.com/default.aspx
but I have this happening on multiple https sites.
I am using python 2.7, and below is the code I am using to setup the connection
opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())
opener.add_handler(urllib2.HTTPDefaultErrorHandler())
opener.addheaders = [('Accept-encoding', 'gzip')]
fetch_timeout = 12
response = opener.open(url, None, fetch_timeout)
The reason I am setting handlers manually is because I don't want redirects handled (which works fine). The above works fine for http requests, however https - fails.
Any clues?
You should be using HTTPSHandler instead of HTTPHandler
If you don't mind external libraries, consider the excellent requests module. It takes care of these quirks with urllib.
Your code, using requests is:
import requests
r = requests.get(url, headers={'Accept-encoding': 'gzip'}, timeout=12)
I'm writing a script in Python 3.1.2 that logs into a site and then begins to make requests. I can log in without any great difficulty, but after doing that the requests return an error stating I haven't logged in. My code looks like this:
import urllib.request
from http import cookiejar
from urllib.parse import urlencode
jar = cookiejar.CookieJar()
credentials = {'accountName': 'username', 'password': 'unenc_pw'}
credenc = urlencode(credentials)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
urllib.request.install_opener(opener)
req = opener.open('http://www.wowarmory.com/?app=armory?login&cr=true', credenc)
test = opener.open('http://www.wowarmory.com/auctionhouse/search.json')
print(req.read())
print(test.read())
The response to the first request is the page I expect to get when logging in.
The response to the second is:
b'{"error":{"code":10005,"error":true,"message":"You must log in."},"command":{"sort":"RARITY","reverse":false,"pageSize":20,"end":20,"start":0,"minLvl":0,"maxLvl":0,"id":0,"qual":0,"classId":-1,"filterId":"-1"}}'
Is there something I'm missing to use any cookie information I have from successful authentication for future requests?
I had this issue once. I can't get the cookie the cookie management working automatically. Frustrated me for days, I ended up handling the cookie manually. That is getting the content of 'Set-Cookie' from the response header, saving it somewhere safe. Subsequently, any request made to that server, I will set the 'Cookie' into the request header with the value I got earlier.