I'm learning to use proxies with requests, but I've run into a big issue: requests doesn't seem to care whether a provided proxy is valid or not. This makes it almost impossible to tell if anything is actually working, and I'm honestly at a loss for what to do. The documentation on proxies provided by requests is very minimal.
My code grabs a User-Agent string and a proxy from a list like so:
proxy = {"https": "https://%s:%s#%s" % (USERNAME, PASSWORD, random.choice(PROXY_LIST))}
headers = {"User-Agent": random.choice(USER_AGENT_LIST)}
return partial(requests.get, proxies=proxy, headers=headers)
An example of a PROXY_LIST entry: 185.46.87.199:8080
The issue is that I can change the username, change the password, etc., and requests doesn't seem to notice or care. A large portion of the requests being sent aren't going through a proxy at all. Is there any way to test proxies, and to see whether a request is actually going through a provided proxy? Any tools for debugging and/or fixing this would be immensely appreciated.
After larsks's suggestion, I changed the logging level to DEBUG and got the following output:
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): mobile.twitter.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /motivesbylorenr HTTP/1.1" 404 1318
This output is unchanged whether the auth is correct or incorrect, and there is no mention of the proxy in the debug information. Again, requests are going through my local IP.
Requests logs debugging information at the DEBUG priority, so if you enable debug logging via the logging module you can see a variety of diagnostics. For example:
>>> import logging
>>> logging.basicConfig(level='DEBUG')
With that in place, I can run:
>>> import requests
>>> s = requests.Session()
>>> s.headers={'user-agent': 'my-test-script'}
>>> s.proxies={'http': 'http://localhost:8123',
... 'https': 'http://localhost:8123'}
>>> s.get('http://mentos.com')
And see:
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): localhost
DEBUG:requests.packages.urllib3.connectionpool:"GET http://mentos.com/ HTTP/1.1" 301 0
DEBUG:requests.packages.urllib3.connectionpool:"GET http://us.mentos.com HTTP/1.1" 200 32160
<Response [200]>
That clearly shows the connection to the proxy.
This is hopefully enough to get you started. I'm using a Session here, but your solution using partial would behave similarly.
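For example, a sketch built from your snippet (assuming the same localhost proxy) produces the same proxied log lines:
from functools import partial
import logging
import requests

logging.basicConfig(level='DEBUG')
proxy = {'http': 'http://localhost:8123', 'https': 'http://localhost:8123'}
headers = {'User-Agent': 'my-test-script'}
get = partial(requests.get, proxies=proxy, headers=headers)
get('http://mentos.com')  # logs "Starting new HTTP connection (1): localhost" as above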
Compare the above output to the log message when requests is not using a proxy:
>>> requests.get('http://mentos.com')
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): mentos.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 301 0
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): us.mentos.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 10566
<Response [200]>
Here, we see the initial connection opened to the remote site, rather than the proxy, and the GET requests do not include the hostname.
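If you'd rather not rely on reading logs, another quick check (a sketch, assuming httpbin.org is reachable from your network) is to compare the origin IP reported with and without the proxy:
import requests

proxies = {'http': 'http://localhost:8123', 'https': 'http://localhost:8123'}
direct = requests.get('https://httpbin.org/ip').json()['origin']
proxied = requests.get('https://httpbin.org/ip', proxies=proxies).json()['origin']
print(direct, proxied)  # if these match, your traffic is not going through the proxy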
Update
The above, with HTTPS URLs:
>>> response = s.get('https://google.com')
>>> response
<Response [200]>
Note that I am setting both the http and https keys in the proxies dictionary.
Related
I am trying to hit the Atlassian Confluence REST API using Python requests.
I've successfully called a GET API, but when I call PUT to update a Confluence page, it returns 200 but doesn't update the page.
I used the Chrome YARC extension to verify that the API was working properly (which it was). After a while trying to debug it, I fell back to urllib3, which worked just fine.
I'd really like to use requests, but I can't for the life of me figure this one out after hours and hours of debugging, Googling, etc.
I'm running Mac/Python3:
$ uname -a
Darwin mylaptop.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
$ python3 --version
Python 3.6.1
Here's my code that shows all three ways I'm trying this (two requests and one urllib3):
def update(self, spaceKey, pageTitle, newContent, contentType='storage'):
    if contentType not in ('storage', 'wiki', 'plain'):
        raise ValueError("Invalid contentType={}".format(contentType))

    # Get current page info
    self._refreshPage(spaceKey, pageTitle)  # I retrieve it before I update it.
    orig_version = self.version

    # Content already same as requested content. Do nothing
    if self.wiki == newContent:
        return

    data_dict = {
        'type': 'page',
        'version': {'number': self.version + 1},
        'body': {
            contentType: {
                'representation': contentType,
                'value': str(newContent)
            }
        }
    }
    data_json = json.dumps(data_dict).encode('utf-8')

    put = 'urllib3'  # for now until I figure out why requests.put() doesn't work
    enable_http_logging()
    if put == 'requests':
        r = self._cs.api.content(self.id).PUT(json=data_dict)
        r.raise_for_status()
    elif put == 'urllib3':
        urllib3.disable_warnings()  # I know, you can quit your whining now!!!
        headers = {'Content-Type': 'application/json;charset=utf-8'}
        auth_header = urllib3.util.make_headers(basic_auth=":".join(self._cs.session.auth))
        headers = {**headers, **auth_header}
        http = urllib3.PoolManager()
        r = http.request('PUT', str(self._cs.api.content(self.id)), body=data_json, headers=headers)
    else:
        raise ValueError("Huh? Unknown put type: {}".format(put))
    enable_http_logging(False)

    # Verify page was updated
    self._refreshPage(spaceKey, pageTitle)  # Check for changes
    if self.version != orig_version + 1:
        raise RuntimeError("Page not updated. Still at version {}".format(self.version))
    if self.wiki != newContent:
        raise RuntimeError("Page version updated, but not content.")
Any help would be great.
Update 1: Adding request dump
-----------START-----------
PUT http://confluence.myco.com/rest/api/content/101904815
User-Agent: python-requests/2.18.4
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 141
Content-Type: application/json
Authorization: Basic <auth-token-here>==
b'{"type": "page", "version": {"number": 17}, "body": {"storage": {"representation": "storage", "value": "new body here version version 17"}}}'
requests never went back to PUT (Bug???)
What you're observing is requests behaving consistently with web browsers: responding to an HTTP 302 redirect with a GET request.
From Wikipedia:
The user agent (e.g. a web browser) is invited by a response with this code to make a second, otherwise identical, request to the new URL specified in the location field.
(...)
Many web browsers implemented this code in a manner that violated this standard, changing the request type of the new request to GET, regardless of the type employed in the original request (e.g. POST)
(...)
As a consequence, the update of RFC 2616 changes the definition to allow user agents to rewrite POST to GET.
So this behaviour is consistent with RFC 2616. I don't think we can say which of the two libraries behaves "more correctly".
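You can observe the redirect yourself by telling requests not to follow it; a minimal sketch, reusing the URL and data_dict from the question:
import requests

r = requests.put('http://confluence.myco.com/rest/api/content/101904815',
                 json=data_dict, allow_redirects=False)
print(r.status_code)           # 302, rather than a silently-followed redirect
print(r.headers['Location'])   # the https:// URL being redirected to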
Looks like a difference in how the requests and urllib3 modules deal with switching from http to https (see @Kos's answer above). Here's what I found when I checked the debug logs.
I got to thinking after @JonClements suggested I send him the response dump. After doing some research I found the magic needed to enable debugging for requests and urllib3 (See here).
Looking at the diffs from both, I noticed that they were being redirected from http to https for my company's Confluence site:
urllib3:
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): confluence.myco.com
DEBUG:urllib3.connectionpool:http://confluence.myco.com:80 "PUT /rest/api/content/101906196 HTTP/1.1" 302 237
DEBUG:urllib3.util.retry:Incremented Retry for (url='http://confluence.myco.com/rest/api/content/101906196'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
INFO:urllib3.poolmanager:Redirecting
http://confluence.myco.com/rest/api/content/101906196 ->
https://confluence.myco.com/rest/api/content/101906196
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): confluence.myco.com
DEBUG:urllib3.connectionpool:https://confluence.myco.com:443 "PUT /rest/api/content/101906196 HTTP/1.1" 200 None
while requests tried my PUT and then, after the redirect, switched to GET:
DEBUG:urllib3.connectionpool:http://confluence.myco.com:80 "PUT /rest/api/content/101906196 HTTP/1.1" 302 237
DEBUG:urllib3.connectionpool:https://confluence.myco.com:443 "GET /rest/api/content/101906196 HTTP/1.1" 200 None
requests never went back to PUT
I changed my initial url from http: to https: and everything worked fine.
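For future debugging, response.history exposes this kind of silent redirect without digging through logs; a sketch, with url and data_dict as in the question:
import requests

r = requests.put(url, json=data_dict)
for hop in r.history:       # one entry per redirect that requests followed
    print(hop.status_code, hop.request.method, hop.url)
print(r.request.method)     # 'GET' here exposes the PUT-to-GET rewrite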
I am trying to get a response from an internal URL which I can access from my laptop using a web browser.
s = requests.Session()
r = s.get(url_1, auth=auth, verify=False)
print r.text
The reply I get is: 401 - Unauthorized.
It's obviously going to be difficult to debug an HTTP 401 Unauthorized, as we don't have access to the internal URL. Your code looks correct to me, so I'm assuming this is a real 401 Unauthorized, which means the request is carrying incorrect authentication credentials. My advice would be to review the Python Requests docs on authentication, and to consider that your request may be going through a proxy, in which case the Requests docs on proxy configuration might help.
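For example, the tuple form auth=('user', 'pass') means basic auth in requests; if the server actually expects digest, you have to say so explicitly. A minimal sketch, with placeholder credentials and url_1 as in the question:
import requests
from requests.auth import HTTPDigestAuth

s = requests.Session()
# auth=('user', 'pass') is shorthand for basic auth; use HTTPDigestAuth
# if the server expects digest instead (placeholder credentials here).
r = s.get(url_1, auth=HTTPDigestAuth('user', 'pass'), verify=False)
print(r.status_code)
print(r.headers.get('WWW-Authenticate'))  # on a 401, this names the scheme the server wants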
I'm coding a little snippet to fetch data from a web page, and I'm currently behind a HTTP/HTTPS proxy. The requests are created like this:
headers = {'Proxy-Connection': 'Keep-Alive',
'Connection':None,
'User-Agent':'curl/1.2.3',
}
r = requests.get("https://www.google.es", headers=headers, proxies=proxyDict)
At first, neither HTTP nor HTTPS worked, and the proxy returned 403 after the request. It was also weird that I could do HTTP/HTTPS requests with curl, fetch packages with apt-get, and browse the web. Having a look at Wireshark, I noticed some differences between the curl request and the Requests one. After setting User-Agent to a fake curl version, the proxy instantly let me do HTTP requests, so I supposed the proxy filters requests by User-Agent.
So, now I know why my code fails and I can do HTTP requests, but the code keeps failing with HTTPS. I set the headers the same way as with HTTP, but after looking at Wireshark, no headers are sent in the CONNECT message, so the proxy sees no User-Agent and returns an ACCESS DENIED response.
I think that if only I could send the headers with the CONNECT message, I could do HTTPS requests easily, but I'm breaking my head over how to tell Requests to send those headers.
OK, so I found a way after looking at http.client. It's a bit lower level than using Requests, but at least it works.
import http.client

def HTTPSProxyRequest(method, host, url, proxy, header=None, proxy_headers=None, port=443):
    https = http.client.HTTPSConnection(proxy[0], proxy[1])
    https.set_tunnel(host, port, headers=proxy_headers)  # proxy_headers go out with the CONNECT
    https.connect()
    https.request(method, url, headers=header or {})  # http.client chokes on headers=None
    response = https.getresponse()
    return response.read(), response.status

# calling the function
HTTPSProxyRequest('GET', 'google.com', '/index.html', ('myproxy.com', 8080))
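Since the original problem was getting a User-Agent into the CONNECT message, that is what the proxy_headers argument is for:
body, status = HTTPSProxyRequest('GET', 'google.com', '/index.html',
                                 ('myproxy.com', 8080),
                                 proxy_headers={'User-Agent': 'curl/1.2.3'})
print(status)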
Trying to send a simple GET request via a proxy. I have the 'Proxy-Authorization' and 'Authorization' headers; I don't think I needed the 'Authorization' header, but I added it anyway.
import requests
URL = 'https://www.google.com'
sess = requests.Session()
user = 'someuser'
password = 'somepass'
token = base64.encodestring('%s:%s'%(user,password)).strip()
sess.headers.update({'Proxy-Authorization':'Basic %s'%token})
sess.headers['Authorization'] = 'Basic %s'%token
resp = sess.get(URL)
I get the following error:
requests.packages.urllib3.exceptions.ProxyError: Cannot connect to proxy. Socket error: Tunnel connection failed: 407 Proxy Authentication Required.
However, when I change the URL to plain http://www.google.com, it works fine.
Do proxies use Basic, Digest, or some other sort of authentication for https? Is it proxy server specific? How do I discover that info? I need to achieve this using the requests library.
UPDATE
It seems that with HTTP requests we have to pass in a Proxy-Authorization header, but with HTTPS requests we need to embed the username and password in the proxy URL:
#HTTP
import requests, base64
URL = 'http://www.google.com'
user = <username>
password = <password>
proxy = {'http': 'http://<IP>:<PORT>'}
token = base64.encodestring('%s:%s' %(user, password)).strip()
myheader = {'Proxy-Authorization': 'Basic %s' %token}
r = requests.get(URL, proxies=proxy, headers=myheader)
print r.status_code # 200
#HTTPS
import requests
URL = 'https://www.google.com'
user = <username>
password = <password>
proxy = {'https': 'http://<user>:<password>@<IP>:<PORT>'}
r = requests.get(URL, proxies = proxy)
print r.status_code # 200
When sending an HTTP request, if I leave out the header and pass in a proxy formatted with user/pass, I get a 407 response.
When sending an HTTPS request, if I pass in the header and leave the proxy unformatted I get a ProxyError mentioned earlier.
I am using requests 2.0.0, and a Squid proxy-caching web server. Why doesn't the header option work for HTTPS? Why does the formatted proxy not work for HTTP?
The answer is that the HTTP case is bugged. The expected behaviour in that case is the same as the HTTPS case: that is, you provide your authentication credentials in the proxy URL.
The reason the header option doesn't work for HTTPS is that HTTPS via proxies is totally different to HTTP via proxies. When you route a HTTP request via a proxy, you essentially just send a standard HTTP request to the proxy with a path that indicates a totally different host, like this:
GET http://www.google.com/ HTTP/1.1
Host: www.google.com
The proxy then basically forwards this on.
For HTTPS that can't possibly work, because you need to negotiate an SSL connection with the remote server. Rather than doing anything like the HTTP case, you use the CONNECT verb. The proxy server connects to the remote end on behalf of the client, and from then on just proxies the TCP data. (More information here.)
When you attach a Proxy-Authorization header to the HTTPS request, we don't put it on the CONNECT message, we put it on the tunnelled HTTPS message. This means the proxy never sees it, so refuses your connection. We special-case the authentication information in the proxy URL to make sure it attaches the header correctly to the CONNECT message.
Requests and urllib3 are currently in discussion about the right place for this bug fix to go. The GitHub issue is currently here. I expect that the fix will be in the next Requests release.
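In the meantime, the reliable pattern for HTTPS is the one from the question's update: embed the credentials in the proxy URL so they are attached to the CONNECT message. A minimal sketch with placeholder values:
import requests

proxies = {'https': 'http://user:password@10.1.2.3:3128'}  # placeholder credentials and address
r = requests.get('https://www.google.com', proxies=proxies)
print(r.status_code)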
I have a Django app that's serving up a RESTful API using tasty-pie.
I'm using Django's development runserver to test.
When I access it via a browser it works fine, and using Curl also works fine:
curl "http://localhost:8000/api/v1/host/?name__regex=&format=json"
On the console with runserver, I see:
[02/Oct/2012 17:24:20] "GET /api/v1/host/?name__regex=&format=json HTTP/1.1" 200 2845
However, when I try to use the Python requests module (http://docs.python-requests.org/en/latest/), I get a 404 as the output:
>>> r = requests.get('http://localhost:8000/api/v1/host/?name__regex=&format=json')
>>> r
<Response [404]>
Also, on the Django runserver console, I see:
[02/Oct/2012 17:25:01] "GET http://localhost:8000/api/v1/host/?name__regex=&format=json HTTP/1.1" 404 161072
For some reason, when I use requests, it prints out the whole request URL, including localhost - but not when I use the browser, or curl.
I'm assuming this is something to do with the encoding, user-agent or request type it's sending?
I'm not very familiar with Requests, but I think your encoding idea might be sound; that is, Requests might be processing the URL somehow. Perhaps instead of passing everything in the URL directly, try doing what the Requests docs suggest and pass the query string via params:
request_params = {'name__regex': '', 'format': 'json'}
r = requests.get('http://localhost:8000/api/v1/host/', params=request_params)
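Either way, you can confirm exactly what URL requests ended up sending by inspecting the prepared request attached to the response:
print(r.request.url)   # the exact URL requests sent, query string included
print(r.status_code)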