I am new to Scrapy. I found how to use an HTTP proxy, but I want to use HTTP and HTTPS proxies together, because the links I crawl include both http and https URLs. How do I use an HTTP and an HTTPS proxy at the same time?
import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
        # or, for an HTTPS proxy: request.meta['proxy'] = "https://YOUR_PROXY_IP:PORT"
        # set up basic authentication for the proxy
        proxy_user_pass = "USERNAME:PASSWORD"
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
You could use the standard environment variables in combination with the HttpProxyMiddleware:
This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value for Request objects.
Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:
http_proxy
https_proxy
no_proxy
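For instance, exporting these variables before the crawl starts is enough for the middleware to pick them up; the proxy address below is a placeholder:

```python
import os

# Placeholder proxy address -- replace with your real proxy.
os.environ["http_proxy"] = "http://some_proxy_server:8080"
os.environ["https_proxy"] = "http://some_proxy_server:8080"
os.environ["no_proxy"] = "localhost,127.0.0.1"  # hosts that bypass the proxy

# Scrapy's HttpProxyMiddleware reads these at startup, the same way
# urllib does, and fills in request.meta['proxy'] accordingly.
```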
You can also set the meta key proxy per-request, to a value like http://some_proxy_server:port.
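Alternatively, you can extend the middleware from the question to pick a proxy endpoint based on the request URL's scheme. This is a sketch: the endpoints and credentials are placeholders, and the selection logic is my own, not something from the Scrapy docs.

```python
import base64

# Placeholder endpoints -- substitute your real proxies.
PROXIES = {
    "http": "http://http-proxy.example:8080",
    "https": "http://https-proxy.example:8443",
}

def choose_proxy(url):
    """Return the proxy endpoint matching the URL's scheme."""
    scheme = url.split("://", 1)[0]
    return PROXIES.get(scheme, PROXIES["http"])

def proxy_auth_header(user, password):
    """Build a basic-auth Proxy-Authorization header value."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    return "Basic " + token

# In a middleware's process_request you would then do:
#   request.meta["proxy"] = choose_proxy(request.url)
#   request.headers["Proxy-Authorization"] = proxy_auth_header(user, password)
```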
Hi, thanks for taking the time to read this. I have residential proxies that I want to use in Scrapy, but the Scrapy documentation shows proxies like this:
http://some_proxy_server:port or http://username:password@some_proxy_server:port
The proxies I have look like this:
['username:password_session-xxxxx:some_proxy_server:port',
 'username:password_session-xxxx1:some_proxy_server:port',
 'username:password_session-xxxx2:some_proxy_server:port']
So normally the proxy server and port change while the username and password stay the same, and there is a rotating list of server:port pairs. Here, though, some_proxy_server:port is the same for all the proxies and only the session differs, so when I start my scraper it only ever sends requests to that one endpoint. I have tried adding this in a middleware and used these libraries, but they all treat the proxies as identical:
scrapy-rotated-proxy==0.1.5
scrapy-rotating-proxies==0.6.2
So my question is: how do I use proxies where it is the session id that differs, rather than the host and port?
http://username:password_session-xxxxx@host:port
Update:
Here is the middleware I used:

import random
from w3lib.http import basic_auth_header

with open('files/proxies.txt', 'r') as f:
    proxies = f.read().split('\n')

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # pick once so the logged proxy matches the one actually used
        proxy = random.choice(proxies)
        print('_______', proxy)
        request.meta["proxy"] = proxy
        request.headers["Proxy-Authorization"] = basic_auth_header("username", "password")
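One way to handle the custom format (a sketch, assuming each entry is exactly four colon-separated fields -- user, password plus session id, host, port -- with no colons inside a field) is to rebuild each entry into a standard proxy URL, so the session-carrying credentials travel in the user-info part that Scrapy's proxy handling understands. The sample entry below is hypothetical:

```python
def to_proxy_url(raw):
    """Turn 'username:password_session-xxxxx:host:port' into
    'http://username:password_session-xxxxx@host:port'."""
    user, password, host, port = raw.split(":")
    return "http://%s:%s@%s:%s" % (user, password, host, port)

# Hypothetical entry in the same shape as the list above.
raw = "myuser:mypass_session-abc123:proxy.example.com:8000"
print(to_proxy_url(raw))
# In the middleware: request.meta["proxy"] = to_proxy_url(random.choice(proxies))
```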
Requests isn't using the proxies I pass to it. The site at the URL I'm using shows which IP the request came from, and it's always my IP, not the proxy IP. I'm getting my proxy IPs from sslproxies.org, which are supposed to be anonymous.
url = 'http://www.lagado.com/proxy-test'
proxies = {'http': 'x.x.x.x:xxxx'}
headers = {'User-Agent': 'Mozilla...etc'}
res = requests.get(url, proxies=proxies, headers=headers)
Are there certain headers that need to be used or something else that needs to be configured so that my ip is hidden from the server?
The docs state that proxy URLs must include the scheme, i.e. they must look like scheme://hostname. So you should add 'http://' or 'socks5://' to your proxy URL, depending on the protocol you're using.
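For instance, a small helper (my own, not part of requests) that prepends a scheme when the proxy string lacks one:

```python
def with_scheme(proxy, scheme="http"):
    """Prepend a scheme if the proxy URL lacks one; requests needs
    proxy URLs of the form scheme://host:port."""
    return proxy if "://" in proxy else "%s://%s" % (scheme, proxy)

# Placeholder address in the same shape as the question's.
proxies = {
    "http": with_scheme("x.x.x.x:8080"),   # -> 'http://x.x.x.x:8080'
    "https": with_scheme("x.x.x.x:8080"),  # HTTPS is tunnelled through the same http:// proxy
}
# Then: requests.get(url, proxies=proxies, headers=headers)
```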
I want to use proxy middleware in my Scrapy but not every request needs a proxy. I don't want to abuse the proxy usage and make the proxy prone to get banned.
Is there a way for me to disable proxy in some requests when the proxy middleware is turned on?
You can add the dont_proxy meta key and set it to True on the requests that should bypass the proxy:
yield scrapy.Request(
    url,
    meta={"dont_proxy": True},
    callback=self.parse,
)
It's in the docs.
You can set the meta key proxy per-request, to a value like http://some_proxy_server:port.
I am trying to test a proxy connection using urllib2.ProxyHandler. However, there will probably be situations where I need to request an HTTPS website (e.g. https://www.whatismyip.com/).
urllib2.urlopen() will throw an error when requesting an HTTPS site, so I tried to use a helper function to wrap the urlopen method.
Here is the helper function:
import ssl
import urllib2

def urlopen(url, timeout):
    if hasattr(ssl, 'SSLContext'):
        ssl_context = ssl.create_default_context()
        ssl_context.check_hostname = False
        ssl_context.verify_mode = ssl.CERT_NONE
        return urllib2.urlopen(url, timeout=timeout, context=ssl_context)
    else:
        return urllib2.urlopen(url, timeout=timeout)
This helper function is based on this answer.
Then I use:
urllib2.install_opener(
    urllib2.build_opener(
        urllib2.ProxyHandler({'http': '127.0.0.1:8080'})
    )
)

to set up an HTTP proxy for the opener.
Ideally, it should work when I request a website with urlopen('http://whatismyip.com', 30), and all traffic should pass through the HTTP proxy.
However, urlopen() falls into the if hasattr(ssl, 'SSLContext') branch every time, even for an HTTP site. In addition, HTTPS sites do not use the HTTP proxy either. This makes the HTTP proxy ineffective, and all traffic goes through the unproxied network.
I also tried this answer, changing HTTP to HTTPS with urllib2.ProxyHandler({'https': '127.0.0.1:8080'}), but it still doesn't work.
My proxy is working: if I use urllib2.urlopen() instead of the rewritten urlopen(), it works for HTTP sites.
But I do need to consider the situation where urlopen will be used on an HTTPS-only site.
How can I do that?
Thanks.
UPDATE 1: I cannot get this to work with Python 2.7.11, while some servers work properly with Python 2.7.5. I assume it is a Python version issue.
urllib2 will not go through the HTTPS proxy, so all HTTPS web addresses fail to use the proxy.
The problem is that when you pass the context argument to urllib2.urlopen(), urllib2 creates an opener itself instead of using the global one (the one that gets set when you call urllib2.install_opener()). As a result, the instance of ProxyHandler you meant to be used is not being used.
The solution is not to install opener but to use the opener directly. When building your opener, you have to pass both an instance of your ProxyHandler class (to set proxies for http and https protocols) and an instance of HTTPSHandler class (to set https context).
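A sketch of that approach, using the Python 3 module names (urllib.request; in Python 2 the same classes live in urllib2) and the local proxy address from the question:

```python
import ssl
import urllib.request as urlreq  # urllib2 in Python 2

# Relaxed context so certificate errors don't abort the test.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

proxy_handler = urlreq.ProxyHandler({
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
})
https_handler = urlreq.HTTPSHandler(context=ctx)

# Build the opener with both handlers and use it directly;
# don't pass context to urlopen(), and skip install_opener().
opener = urlreq.build_opener(proxy_handler, https_handler)
# response = opener.open("https://www.whatismyip.com/", timeout=30)
```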
I created https://bugs.python.org/issue29379 for this issue.
I personally would suggest using something such as python-requests, as it will alleviate a lot of the issues with setting up the proxy via urllib2 directly. When using requests with a proxy you will have to do the following (from their documentation):
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

requests.get('http://example.org', proxies=proxies)
And disabling SSL certificate verification is as simple as passing verify=False to the requests.get call above. However, this should be used sparingly, and the actual issue with the SSL certificate verification should be resolved.
One more solution is to pass context into HTTPSHandler and pass this handler into build_opener together with ProxyHandler:
proxies = {'https': 'http://localhost:8080'}
proxy = urllib2.ProxyHandler(proxies)
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
handler = urllib2.HTTPSHandler(context=context)
opener = urllib2.build_opener(proxy, handler)
urllib2.install_opener(opener)
Now you can view all your HTTPS requests/responses in your proxy.
Trying to send a simple GET request via a proxy. I have the 'Proxy-Authorization' and 'Authorization' headers; I don't think I needed the 'Authorization' header, but I added it anyway.
import base64
import requests

URL = 'https://www.google.com'
sess = requests.Session()
user = 'someuser'
password = 'somepass'
token = base64.b64encode('%s:%s' % (user, password)).strip()
sess.headers.update({'Proxy-Authorization': 'Basic %s' % token})
sess.headers['Authorization'] = 'Basic %s' % token
resp = sess.get(URL)
I get the following error:
requests.packages.urllib3.exceptions.ProxyError: Cannot connect to proxy. Socket error: Tunnel connection failed: 407 Proxy Authentication Required.
However, when I change the URL to plain http://www.google.com, it works fine.
Do proxies use Basic, Digest, or some other sort of authentication for https? Is it proxy server specific? How do I discover that info? I need to achieve this using the requests library.
UPDATE
It seems that with HTTP requests we have to pass in a Proxy-Authorization header, but with HTTPS requests we need to put the username and password in the proxy URL itself:
#HTTP
import requests, base64
URL = 'http://www.google.com'
user = <username>
password = <password>
proxies = {'http': 'http://<IP>:<PORT>'}
token = base64.encodestring('%s:%s' % (user, password)).strip()
myheader = {'Proxy-Authorization': 'Basic %s' % token}
r = requests.get(URL, proxies=proxies, headers=myheader)
print r.status_code  # 200
#HTTPS
import requests
URL = 'https://www.google.com'
user = <username>
password = <password>
proxies = {'https': 'http://<user>:<password>@<IP>:<PORT>'}
r = requests.get(URL, proxies=proxies)
print r.status_code  # 200
When sending an HTTP request, if I leave out the header and pass in a proxy URL formatted with user/pass, I get a 407 response.
When sending an HTTPS request, if I pass in the header and leave the proxy URL unformatted, I get the ProxyError mentioned earlier.
I am using requests 2.0.0 and a Squid proxy-caching web server. Why doesn't the header option work for HTTPS? Why doesn't the formatted proxy URL work for HTTP?
The answer is that the HTTP case is bugged. The expected behaviour in that case is the same as the HTTPS case: that is, you provide your authentication credentials in the proxy URL.
The reason the header option doesn't work for HTTPS is that HTTPS via proxies is totally different to HTTP via proxies. When you route an HTTP request via a proxy, you essentially just send a standard HTTP request to the proxy with a path that indicates a totally different host, like this:
GET http://www.google.com/ HTTP/1.1
Host: www.google.com
The proxy then basically forwards this on.
For HTTPS that can't possibly work, because you need to negotiate an SSL connection with the remote server. Rather than doing anything like the HTTP case, you use the CONNECT verb. The proxy server connects to the remote end on behalf of the client, and from then on just proxies the TCP data. (More information here.)
When you attach a Proxy-Authorization header to the HTTPS request, we don't put it on the CONNECT message; we put it on the tunnelled HTTPS message. This means the proxy never sees it, so it refuses your connection. We special-case the authentication information in the proxy URL to make sure the header is attached correctly to the CONNECT message.
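In practical terms, that means putting the credentials in the proxy URL itself. A small helper (my own, not part of requests) that also percent-encodes characters that would otherwise break the URL:

```python
from urllib.parse import quote

def proxy_url(user, password, host, port, scheme="http"):
    """Embed basic-auth credentials in the proxy URL; requests reads
    them from here and attaches Proxy-Authorization to the CONNECT."""
    return "%s://%s:%s@%s:%s" % (
        scheme, quote(user, safe=""), quote(password, safe=""), host, port)

proxies = {"https": proxy_url("someuser", "somepass", "10.10.1.10", 3128)}
print(proxies["https"])  # http://someuser:somepass@10.10.1.10:3128
# Then: requests.get('https://www.google.com', proxies=proxies)
```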
Requests and urllib3 are currently in discussion about the right place for this bug fix to go. The GitHub issue is currently here. I expect that the fix will be in the next Requests release.