I'm coding a little snippet to fetch data from a web page, and I'm currently behind a HTTP/HTTPS proxy. The requests are created like this:
headers = {'Proxy-Connection': 'Keep-Alive',
'Connection':None,
'User-Agent':'curl/1.2.3',
}
r = requests.get("https://www.google.es", headers=headers, proxies=proxyDict)
At first, neither HTTP nor HTTPS worked, and the proxy returned 403 after the request. It was also weird that I could do HTTP/HTTPS requests with curl, fetching packages with apt-get or browsing the web. Having a look at Wireshark I noticed some differences between the curl request and the Requests one. After setting User-Agent to a fake curl version, the proxy instantly lets me do HTTP requests, so I supposed the proxy filter requests by User-Agent.
So, now I know why my code fails, and I can do HTTP requests, but the code keep on failing with HTTPS. I set the headers the same way as with HTTP, but after looking at Wireshark, no headers are sent in the CONNECT message, so the proxy sees no User-Agent and returns an ACCESS DENIED response.
I think that if only I could send the headers with the CONNECT message, I could do HTTPS requests easily, but I'm breaking my head around how to tell Requests that I want to send that headers.
Ok, so I found a way after looking at http.client. It's a bit lower level than using Requests but at least it work.
def HTTPSProxyRequest(method, host, url, proxy, header=None, proxy_headers=None, port=443):
https = http.client.HTTPSConnection(proxy[0], proxy[1])
https.set_tunnel(host, port, headers=proxy_headers)
https.connect()
https.request(method, url, headers=header)
response = https.getresponse()
return response.read(), response.status
# calling the function
HTTPSProxyRequest('GET','google.com', '/index.html', ('myproxy.com',8080))
Related
I am using the python requests library for a GET request.
The request goes via a proxy and I want to set a proxy header in the request (X-AUTH)
I could not find a way to set proxy header in requests. I want something similar to --proxy-header in curl. Setting it in normal headers does not seem to work.
import requests
proxies = {
"http": "http://myproxy:8000",
"https": "http://myproxy:8000",
}
r = requests.get("https://abc.xyz.com/some/endpoint", proxies=proxies, headers = {"X-AUTH": "mysecret"})
print(r.text)
I have changed the endpoints for privacy.
Curl call works when --proxy-header is used.
I'm using the following code to interact with a website:
session = requests.session()
get = session.get(LANDING_URL, headers=HEADERS)
post = session.post(LANDING_URL, headers=HEADERS, data=PARAMS)
I'm using the session object to preserve cookies between call, but the post request following the get request doesn't seem to use the session cookie. The below output is from pdb:
(Pdb) get.cookies
<RequestsCookieJar[Cookie(version=0, name='ASP.NET_SessionId', value=...)]>
(Pdb) post.cookies
<RequestsCookieJar[]>
(Pdb) session.cookies
<RequestsCookieJar[Cookie(version=0, name='ASP.NET_SessionId', value=...)]>
Does this mean that the post request isn't using the session cookie? If so, why not?
Page may use JavaScript to add cookies and requests can't run JavaScript.
Using get.cookies, post.cookies you display only cookies send from server, not send to server.
session should keep all cookies from previous requests and send them in POST request.
You can use httpbin.org. If you send GET request to httpbin.org/get or POST to httpbin.org/post then it sends you back (as JSON) all your headers, data, cookies, etc. There are other useful functions on httpbin.org
You can also install local proxy server like Charles or Man-In-The-Middle-Py and send request through proxy. You will see body and headers in proxy. You can use proxy with web browser and with your script to compare your requests with request from browser.
You can also check post.request.body, post.request.headers. I never used it but it should have body and headers sent to server.
I am trying to test a proxy connection by using urllib2.ProxyHandler. However, there probably some situation that I am going to request a HTTPS website (eg: https://www.whatismyip.com/)
Urllib2.urlopen() will throw ERROR if request a HTTPS site. So I tried to use a helper function to rewrite the URLOPEN method.
Here is the helper function:
def urlopen(url, timeout):
if hasattr(ssl, 'SSLContext'):
SslContext = ssl.create_default_context()
SslContext.check_hostname = False
SslContext.verify_mode = ssl.CERT_NONE
return urllib2.urlopen(url, timeout=timeout, context=SslContext)
else:
return urllib2.urlopen(url, timeout=timeout)
This helper function based on answer
Then I use:
urllib2.install_opener(
urllib2.build_opener(
urllib2.ProxyHandler({'http': '127.0.0.1:8080'})
)
)
to setup http proxy for urllib.opener.
Ideally, it should working when i request a website by using urlopen('http://whatismyip.com', 30) and it should pass all traffic through http proxy.
However, the urlopen() will fall into if hasattr(ssl, 'SSLContext') all the time even if it is a HTTP site. In addition, HTTPS site is not using HTTP proxy either. This cause the HTTP proxy become invalid and all traffic going through unproxied network
I also tried this answer to change HTTP into HTTPS urllib2.ProxyHandler({'https': '127.0.0.1:8080'}) but it still not working.
My proxy is working. If i am using urllib2.urlopen() instead of the rewrite version urlopen(), it works for HTTP site.
But, I do need consider the suitation if the urlopen gonna need to be used on a HTTPS ONLY site.
How to do that?
Thanks
UPDATE1: I cannot get this work with Python 2.7.11 and some of server working properly with Python 2.7.5. I assue it is python version issue.
Urllib2 will not go through HTTPS Proxy so all HTTPS web address will failed to use proxy.
The problem is when you pass context argument to urllib2.urlopen() then urllib2 creates opener itself instead of using the global one, which is the one that gets set when you call urllib2.install_opener(). As a result your instance of ProxyHandler which you meant to be used is not being used.
The solution is not to install opener but to use the opener directly. When building your opener, you have to pass both an instance of your ProxyHandler class (to set proxies for http and https protocols) and an instance of HTTPSHandler class (to set https context).
I created https://bugs.python.org/issue29379 for this issue.
I personally would suggest the use of something such as python-requests as it will alleviate a lot of the issues with setting up the proxy using urllib2 directly. When using requests with a proxy you will have to do: (From their documentation)
import requests
proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
And disabling SSL Certificate verification is as simple as passing verify=False the requests.get command above. However, this should be used sparingly and the actual issue with the SSL Cert verification should be resolve.
One more solution is to pass context into HTTPSHandler and pass this handler into build_opener together with ProxyHandler:
proxies = {'https': 'http://localhost:8080'}
proxy = urllib2.ProxyHandler(proxies)
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
handler = urllib2.HTTPSHandler(context=context)
opener = urllib2.build_opener(proxy, handler)
urllib2.install_opener(opener)
Now you can view all your HTTPS requests/responses in your proxy.
Trying to send a simple get request via a proxy. I have the 'Proxy-Authorization' and 'Authorization' headers, don't think I needed the 'Authorization' header, but added it anyway.
import requests
URL = 'https://www.google.com'
sess = requests.Session()
user = 'someuser'
password = 'somepass'
token = base64.encodestring('%s:%s'%(user,password)).strip()
sess.headers.update({'Proxy-Authorization':'Basic %s'%token})
sess.headers['Authorization'] = 'Basic %s'%token
resp = sess.get(URL)
I get the following error:
requests.packages.urllib3.exceptions.ProxyError: Cannot connect to proxy. Socket error: Tunnel connection failed: 407 Proxy Authentication Required.
However when I change the URL to simple http://www.google.com, it works fine.
Do proxies use Basic, Digest, or some other sort of authentication for https? Is it proxy server specific? How do I discover that info? I need to achieve this using the requests library.
UPDATE
Its seems that with HTTP requests we have to pass in a Proxy-Authorization header, but with HTTPS requests, we need to format the proxy URL with the username and password
#HTTP
import requests, base64
URL = 'http://www.google.com'
user = <username>
password = <password>
proxy = {'http': 'http://<IP>:<PORT>}
token = base64.encodestring('%s:%s' %(user, password)).strip()
myheader = {'Proxy-Authorization': 'Basic %s' %token}
r = requests.get(URL, proxies = proxies, headers = myheader)
print r.status_code # 200
#HTTPS
import requests
URL = 'https://www.google.com'
user = <username>
password = <password>
proxy = {'http': 'http://<user>:<password>#<IP>:<PORT>}
r = requests.get(URL, proxies = proxy)
print r.status_code # 200
When sending an HTTP request, if I leave out the header and pass in a proxy formatted with user/pass, I get a 407 response.
When sending an HTTPS request, if I pass in the header and leave the proxy unformatted I get a ProxyError mentioned earlier.
I am using requests 2.0.0, and a Squid proxy-caching web server. Why doesn't the header option work for HTTPS? Why does the formatted proxy not work for HTTP?
The answer is that the HTTP case is bugged. The expected behaviour in that case is the same as the HTTPS case: that is, you provide your authentication credentials in the proxy URL.
The reason the header option doesn't work for HTTPS is that HTTPS via proxies is totally different to HTTP via proxies. When you route a HTTP request via a proxy, you essentially just send a standard HTTP request to the proxy with a path that indicates a totally different host, like this:
GET http://www.google.com/ HTTP/1.1
Host: www.google.com
The proxy then basically forwards this on.
For HTTPS that can't possibly work, because you need to negotiate an SSL connection with the remote server. Rather than doing anything like the HTTP case, you use the CONNECT verb. The proxy server connects to the remote end on behalf of the client, and from them on just proxies the TCP data. (More information here.)
When you attach a Proxy-Authorization header to the HTTPS request, we don't put it on the CONNECT message, we put it on the tunnelled HTTPS message. This means the proxy never sees it, so refuses your connection. We special-case the authentication information in the proxy URL to make sure it attaches the header correctly to the CONNECT message.
Requests and urllib3 are currently in discussion about the right place for this bug fix to go. The GitHub issue is currently here. I expect that the fix will be in the next Requests release.
I am trying to use urllib2 through a proxy; however, after trying just about every variation of passing my verification details using urllib2, I either get a request that hangs forever and returns nothing or I get 407 Errors. I can connect to the web fine using my browser which connects to a prox-pac and redirects accordingly; however, I can't seem to do anything via the command line curl, wget, urllib2 etc. even if I use the proxies that the prox-pac redirects to. I tried setting my proxy to all of the proxies from the pac-file using urllib2, none of which work.
My current script looks like this:
import urllib2 as url
proxy = url.ProxyHandler({'http': 'username:password#my.proxy:8080'})
auth = url.HTTPBasicAuthHandler()
opener = url.build_opener(proxy, auth, url.HTTPHandler)
url.install_opener(opener)
url.urlopen("http://www.google.com/")
which throws HTTP Error 407: Proxy Authentication Required and I also tried:
import urllib2 as url
handlePass = url.HTTPPasswordMgrWithDefaultRealm()
handlePass.add_password(None, "http://my.proxy:8080", "username", "password")
auth_handler = url.HTTPBasicAuthHandler(handlePass)
opener = url.build_opener(auth_handler)
url.install_opener(opener)
url.urlopen("http://www.google.com")
which hangs like curl or wget timing out.
What do I need to do to diagnose the problem? How is it possible that I can connect via my browser but not from the command line on the same computer using what would appear to be the same proxy and credentials?
Might it be something to do with the router? if so, how can it distinguish between browser HTTP requests and command line HTTP requests?
Frustrations like this are what drove me to use Requests. If you're doing significant amounts of work with urllib2, you really ought to check it out. For example, to do what you wish to do using Requests, you could write:
import requests
from requests.auth import HTTPProxyAuth
proxy = {'http': 'http://my.proxy:8080'}
auth = HTTPProxyAuth('username', 'password')
r = requests.get('http://wwww.google.com/', proxies=proxy, auth=auth)
print r.text
Or you could wrap it in a Session object and every request will automatically use the proxy information (plus it will store & handle cookies automatically!):
s = requests.Session(proxies=proxy, auth=auth)
r = s.get('http://www.google.com/')
print r.text