Not receiving headers Scrapy ProxyMesh - python

I am quite new to Scrapy / ProxyMesh.
My request to the ProxyMesh server seems to be working, since I can see my bandwidth consumption on the ProxyMesh website and the meta.proxy value is correct in my logs.
However, when I log the response headers in Scrapy, I do not receive the X-Proxymesh-IP header that I am supposed to get.
Here is my code. What am I doing wrong?
This is my middleware
import logging

class Proxymesh(object):
    def __init__(self):
        logging.debug('Initialized Proxymesh middleware')
        self.proxy_ip = 'http://host:port'

    def process_request(self, request, spider):
        logging.debug('Processing request through proxy IP: ' + self.proxy_ip)
        request.meta['proxy'] = self.proxy_ip
These are my settings in my spider
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "projectName.middlewares.proxymesh.Proxymesh": 1,
    }
}
This is what the response headers look like
['Set-Cookie']:['__cfduid=d88d4e4cb7... HttpOnly']
['Vary']:['User-Agent,Accept-Encoding']
['Server']:['cloudflare-nginx']
['Date']:['Thu, 19 Oct 2017 10...38:10 GMT']
['Cf-Ray']:['3b031b30cbef1565-CDG']
['Content-Type']:['text/html; charset=UTF-8']
Thank you for your help

I don't know if this is relevant anymore, but I'm going to post it here. There is an issue with ProxyMesh and Scrapy (and Python requests as well).
When connecting through a proxy, a CONNECT request is first sent to the proxy service in order to create a tunnel that will forward the actual request.
If that request is successful, ProxyMesh adds the X-Proxymesh-IP header to the CONNECT request's confirmation response. This header is completely missed by Scrapy, because Scrapy only looks at the response headers of the actual request.
This only happens with HTTPS requests, because the content of the actual request is encrypted inside the tunnel.
References:
https://docs.proxymesh.com/article/74-proxy-server-headers-over-https
https://bugs.python.org/issue24964
https://github.com/requests/requests/issues/3061
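For illustration, the same behaviour shows up outside Scrapy. The sketch below uses a placeholder ProxyMesh URL (substitute your own host, port and credentials): over plain HTTP the header arrives on the normal response, while over HTTPS it only ever appears on the CONNECT reply, which requests does not expose either.
import requests

# Placeholder proxy URL - replace with your real ProxyMesh endpoint and credentials
proxies = {'http': 'http://USER:PASS@us.proxymesh.com:31280',
           'https': 'http://USER:PASS@us.proxymesh.com:31280'}

r_http = requests.get('http://httpbin.org/get', proxies=proxies)
print(r_http.headers.get('X-Proxymesh-IP'))    # typically present

r_https = requests.get('https://httpbin.org/get', proxies=proxies)
print(r_https.headers.get('X-Proxymesh-IP'))   # None - the header was on the CONNECT reply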

Maybe you also need to enable Scrapy's built-in proxy middleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 1,
}
(On recent Scrapy versions the path is 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware'.)
Also, in your callback function, are you sure you are printing response.headers?
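A quick way to check is to dump everything the downloader received in the callback; this sketch assumes a standard parse callback:
def parse(self, response):
    # Log every header Scrapy received, then the one in question
    for name, values in response.headers.items():
        self.logger.info('%s: %s' % (name, values))
    self.logger.info('X-Proxymesh-IP: %s' % response.headers.get('X-Proxymesh-IP'))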

Related

POST request not sending session cookie

I'm using the following code to interact with a website:
session = requests.session()
get = session.get(LANDING_URL, headers=HEADERS)
post = session.post(LANDING_URL, headers=HEADERS, data=PARAMS)
I'm using the session object to preserve cookies between calls, but the POST request following the GET request doesn't seem to use the session cookie. The output below is from pdb:
(Pdb) get.cookies
<RequestsCookieJar[Cookie(version=0, name='ASP.NET_SessionId', value=...)]>
(Pdb) post.cookies
<RequestsCookieJar[]>
(Pdb) session.cookies
<RequestsCookieJar[Cookie(version=0, name='ASP.NET_SessionId', value=...)]>
Does this mean that the post request isn't using the session cookie? If so, why not?
The page may use JavaScript to add cookies, and requests can't run JavaScript.
With get.cookies and post.cookies you only see the cookies sent back by the server in that particular response, not the cookies sent to the server.
The session should keep all cookies from previous requests and send them with the POST request.
You can use httpbin.org to check this. If you send a GET request to httpbin.org/get or a POST to httpbin.org/post, it sends back (as JSON) all your headers, data, cookies, etc. There are other useful endpoints on httpbin.org as well.
You can also install a local proxy server such as Charles or mitmproxy and send the request through it; you will see the body and headers in the proxy. You can use the proxy with a web browser and with your script to compare your requests with the ones the browser makes.
You can also check post.request.body and post.request.headers. I have never used them, but they should contain the body and headers that were sent to the server.
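A small sketch combining both suggestions: httpbin.org sets a cookie on the session, then we inspect what the POST actually carried (the cookie name is arbitrary).
import requests

session = requests.Session()
# httpbin sets this cookie on the session, standing in for a login response
session.get('https://httpbin.org/cookies/set?sessionid=abc123')

resp = session.post('https://httpbin.org/post')
print(resp.request.headers.get('Cookie'))    # header actually sent with the POST
print(resp.json()['headers'].get('Cookie'))  # what the server saw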

Python - How to handle an HTTPS request with (Urllib2 + SSL) through an HTTP proxy

I am trying to test a proxy connection using urllib2.ProxyHandler. However, there will probably be situations where I need to request an HTTPS website (e.g. https://www.whatismyip.com/).
urllib2.urlopen() throws an error when requesting an HTTPS site, so I tried to use a helper function that wraps the urlopen call.
Here is the helper function:
def urlopen(url, timeout):
    if hasattr(ssl, 'SSLContext'):
        SslContext = ssl.create_default_context()
        SslContext.check_hostname = False
        SslContext.verify_mode = ssl.CERT_NONE
        return urllib2.urlopen(url, timeout=timeout, context=SslContext)
    else:
        return urllib2.urlopen(url, timeout=timeout)
This helper function is based on another answer.
Then I use:
urllib2.install_opener(
    urllib2.build_opener(
        urllib2.ProxyHandler({'http': '127.0.0.1:8080'})
    )
)
to set up the HTTP proxy for the opener.
Ideally, it should work when I request a website using urlopen('http://whatismyip.com', 30), and it should pass all traffic through the HTTP proxy.
However, urlopen() falls into the if hasattr(ssl, 'SSLContext') branch every time, even for an HTTP site. In addition, HTTPS sites do not use the HTTP proxy either. This effectively disables the proxy, and all traffic goes through the unproxied network.
I also tried another answer's suggestion to change 'http' into 'https', i.e. urllib2.ProxyHandler({'https': '127.0.0.1:8080'}), but it is still not working.
My proxy is working: if I use urllib2.urlopen() instead of the rewritten urlopen(), it works for HTTP sites.
But I do need to consider the situation where urlopen() has to be used on an HTTPS-only site.
How to do that?
Thanks
UPDATE 1: I cannot get this to work with Python 2.7.11, while some servers work properly with Python 2.7.5. I assume it is a Python version issue.
urllib2 will not go through the HTTPS proxy, so all HTTPS addresses fail to use the proxy.
The problem is that when you pass the context argument to urllib2.urlopen(), urllib2 creates an opener itself instead of using the global one, which is the one that gets set when you call urllib2.install_opener(). As a result, the ProxyHandler instance you meant to use is never used.
The solution is not to install the opener but to use it directly. When building your opener, you have to pass both an instance of your ProxyHandler class (to set proxies for the http and https protocols) and an instance of HTTPSHandler class (to set the https context).
I created https://bugs.python.org/issue29379 for this issue.
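A minimal sketch of that approach, assuming the proxy runs locally on port 8080:
import ssl
import urllib2

# Unverified context, as in the question's helper function
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

opener = urllib2.build_opener(
    urllib2.ProxyHandler({'http': '127.0.0.1:8080', 'https': '127.0.0.1:8080'}),
    urllib2.HTTPSHandler(context=ctx),
)

# Call the opener directly instead of urllib2.urlopen()
response = opener.open('https://www.whatismyip.com/', timeout=30)
print(response.getcode())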
I personally would suggest using something such as python-requests, as it alleviates a lot of the issues with setting up the proxy using urllib2 directly. When using requests with a proxy you will have to do the following (from their documentation):
import requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
And disabling SSL certificate verification is as simple as passing verify=False to the requests.get command above. However, this should be used sparingly, and the actual issue with the SSL certificate verification should be resolved.
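For example, reusing the proxies dict above:
requests.get('https://example.org', proxies=proxies, verify=False)  # skips certificate verification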
One more solution is to pass context into HTTPSHandler and pass this handler into build_opener together with ProxyHandler:
import ssl
import urllib2

proxies = {'https': 'http://localhost:8080'}
proxy = urllib2.ProxyHandler(proxies)
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
handler = urllib2.HTTPSHandler(context=context)
opener = urllib2.build_opener(proxy, handler)
urllib2.install_opener(opener)
Now you can view all your HTTPS requests/responses in your proxy.
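After that, an ordinary call goes through the proxy, for example:
print(urllib2.urlopen('https://www.whatismyip.com/', timeout=30).getcode())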

HTTPS requests are sent without headers with Python's Requests

I'm coding a little snippet to fetch data from a web page, and I'm currently behind a HTTP/HTTPS proxy. The requests are created like this:
headers = {'Proxy-Connection': 'Keep-Alive',
           'Connection': None,
           'User-Agent': 'curl/1.2.3',
           }
r = requests.get("https://www.google.es", headers=headers, proxies=proxyDict)
At first, neither HTTP nor HTTPS worked, and the proxy returned 403 after the request. It was also weird that I could do HTTP/HTTPS requests with curl, fetch packages with apt-get, and browse the web. Having a look at Wireshark, I noticed some differences between the curl request and the Requests one. After setting User-Agent to a fake curl version, the proxy instantly let me do HTTP requests, so I suppose the proxy filters requests by User-Agent.
So now I know why my code fails, and I can do HTTP requests, but the code keeps failing with HTTPS. I set the headers the same way as with HTTP, but after looking at Wireshark, no headers are sent in the CONNECT message, so the proxy sees no User-Agent and returns an ACCESS DENIED response.
I think that if only I could send the headers with the CONNECT message, I could do HTTPS requests easily, but I'm breaking my head over how to tell Requests that I want to send those headers.
OK, so I found a way after looking at http.client. It's a bit lower level than using Requests, but at least it works.
import http.client

def HTTPSProxyRequest(method, host, url, proxy, header=None, proxy_headers=None, port=443):
    # Connect to the proxy and open a CONNECT tunnel to the target host;
    # proxy_headers are sent on the CONNECT message itself
    https = http.client.HTTPSConnection(proxy[0], proxy[1])
    https.set_tunnel(host, port, headers=proxy_headers)
    https.connect()
    # The actual request goes through the tunnel with its own headers
    https.request(method, url, headers=header or {})
    response = https.getresponse()
    return response.read(), response.status
# calling the function
HTTPSProxyRequest('GET','google.com', '/index.html', ('myproxy.com',8080))
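For the original problem (getting the proxy to see a User-Agent), the CONNECT headers can be supplied through proxy_headers, for example:
body, status = HTTPSProxyRequest('GET', 'www.google.es', '/',
                                 ('myproxy.com', 8080),
                                 proxy_headers={'User-Agent': 'curl/1.2.3'})
print(status)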

requests library https get via proxy leads to error

I'm trying to send a simple GET request via a proxy. I have the 'Proxy-Authorization' and 'Authorization' headers; I don't think I needed the 'Authorization' header, but I added it anyway.
import requests, base64
URL = 'https://www.google.com'
sess = requests.Session()
user = 'someuser'
password = 'somepass'
token = base64.encodestring('%s:%s'%(user,password)).strip()
sess.headers.update({'Proxy-Authorization':'Basic %s'%token})
sess.headers['Authorization'] = 'Basic %s'%token
resp = sess.get(URL)
I get the following error:
requests.packages.urllib3.exceptions.ProxyError: Cannot connect to proxy. Socket error: Tunnel connection failed: 407 Proxy Authentication Required.
However when I change the URL to simple http://www.google.com, it works fine.
Do proxies use Basic, Digest, or some other sort of authentication for https? Is it proxy server specific? How do I discover that info? I need to achieve this using the requests library.
UPDATE
It seems that with HTTP requests we have to pass in a Proxy-Authorization header, but with HTTPS requests we need to put the username and password into the proxy URL itself:
#HTTP
import requests, base64
URL = 'http://www.google.com'
user = <username>
password = <password>
proxies = {'http': 'http://<IP>:<PORT>'}
token = base64.encodestring('%s:%s' %(user, password)).strip()
myheader = {'Proxy-Authorization': 'Basic %s' %token}
r = requests.get(URL, proxies = proxies, headers = myheader)
print r.status_code # 200
#HTTPS
import requests
URL = 'https://www.google.com'
user = <username>
password = <password>
proxy = {'https': 'http://<user>:<password>@<IP>:<PORT>'}
r = requests.get(URL, proxies = proxy)
print r.status_code # 200
When sending an HTTP request, if I leave out the header and pass in a proxy formatted with user/pass, I get a 407 response.
When sending an HTTPS request, if I pass in the header and leave the proxy unformatted I get a ProxyError mentioned earlier.
I am using requests 2.0.0, and a Squid proxy-caching web server. Why doesn't the header option work for HTTPS? Why does the formatted proxy not work for HTTP?
The answer is that the HTTP case is bugged. The expected behaviour in that case is the same as the HTTPS case: that is, you provide your authentication credentials in the proxy URL.
The reason the header option doesn't work for HTTPS is that HTTPS via proxies is totally different to HTTP via proxies. When you route a HTTP request via a proxy, you essentially just send a standard HTTP request to the proxy with a path that indicates a totally different host, like this:
GET http://www.google.com/ HTTP/1.1
Host: www.google.com
The proxy then basically forwards this on.
For HTTPS that can't possibly work, because you need to negotiate an SSL connection with the remote server. Rather than doing anything like the HTTP case, you use the CONNECT verb. The proxy server connects to the remote end on behalf of the client, and from then on just proxies the TCP data. (More information here.)
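For comparison, the tunnelled case starts with something like:
CONNECT www.google.com:443 HTTP/1.1
Host: www.google.com:443
and once the proxy answers 200 Connection established, everything that follows is opaque, encrypted data.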
When you attach a Proxy-Authorization header to the HTTPS request, we don't put it on the CONNECT message; we put it on the tunnelled HTTPS message. This means the proxy never sees it, so it refuses your connection. We special-case the authentication information in the proxy URL to make sure it attaches the header correctly to the CONNECT message.
Requests and urllib3 are currently in discussion about the right place for this bug fix to go. The GitHub issue is currently here. I expect that the fix will be in the next Requests release.
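In practice, then, the reliable approach is to embed the credentials in the proxy URL for both schemes (host, port and credentials below are placeholders):
import requests

proxies = {
    'http':  'http://user:password@proxy.example.com:3128',
    'https': 'http://user:password@proxy.example.com:3128',
}
r = requests.get('https://www.google.com', proxies=proxies)
print(r.status_code)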

What is the proper way to handle site requests with cookie handlers in Python3?

I'm writing a script in Python 3.1.2 that logs into a site and then begins to make requests. I can log in without any great difficulty, but after doing that the requests return an error stating I haven't logged in. My code looks like this:
import urllib.request
from http import cookiejar
from urllib.parse import urlencode
jar = cookiejar.CookieJar()
credentials = {'accountName': 'username', 'password': 'unenc_pw'}
credenc = urlencode(credentials)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
urllib.request.install_opener(opener)
req = opener.open('http://www.wowarmory.com/?app=armory?login&cr=true', credenc)
test = opener.open('http://www.wowarmory.com/auctionhouse/search.json')
print(req.read())
print(test.read())
The response to the first request is the page I expect to get when logging in.
The response to the second is:
b'{"error":{"code":10005,"error":true,"message":"You must log in."},"command":{"sort":"RARITY","reverse":false,"pageSize":20,"end":20,"start":0,"minLvl":0,"maxLvl":0,"id":0,"qual":0,"classId":-1,"filterId":"-1"}}'
Is there something I'm missing to use any cookie information I have from successful authentication for future requests?
I had this issue once. I couldn't get the cookie management working automatically. It frustrated me for days, and I ended up handling the cookie manually: get the content of 'Set-Cookie' from the response header and save it somewhere safe. Then, for any subsequent request made to that server, set 'Cookie' in the request header with the value you got earlier.
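A rough sketch of that manual approach on top of the question's code (the URLs are the asker's; a real implementation would keep only the name=value pairs and handle multiple Set-Cookie headers):
import urllib.request
from urllib.parse import urlencode

credenc = urlencode({'accountName': 'username', 'password': 'unenc_pw'}).encode()

# Log in and grab the Set-Cookie value from the response
login = urllib.request.urlopen('http://www.wowarmory.com/?app=armory?login&cr=true', credenc)
cookie = login.getheader('Set-Cookie')

# Replay the cookie on the follow-up request
req = urllib.request.Request('http://www.wowarmory.com/auctionhouse/search.json',
                             headers={'Cookie': cookie})
print(urllib.request.urlopen(req).read())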
