I want to use a proxy middleware in my Scrapy project, but not every request needs a proxy. I don't want to overuse the proxy and risk getting it banned.
Is there a way to disable the proxy for some requests while the proxy middleware is enabled?
You can add a dont_proxy key to the request's meta and set it to True:
yield scrapy.Request(
    url,
    meta={"dont_proxy": True},
    callback=self.parse
)
It's in the docs.
You can set the meta key proxy per-request, to a value like http://some_proxy_server:port.
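Whether a dont_proxy flag has any effect depends on the proxy middleware you actually have enabled. As a minimal sketch, assuming you write the middleware yourself, the opt-out could look like the following. SelectiveProxyMiddleware, its proxy_url argument and the port 8080 are hypothetical names chosen for illustration, and the SimpleNamespace objects only stand in for real Scrapy requests so the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace


class SelectiveProxyMiddleware:
    """Hypothetical downloader middleware: proxy every request unless it opts out."""

    def __init__(self, proxy_url="http://some_proxy_server:8080"):
        # Assumed proxy address; replace with your own.
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Requests flagged with dont_proxy are left untouched.
        if request.meta.get("dont_proxy"):
            return None
        # Keep a proxy that was already set per-request, otherwise use the default.
        request.meta.setdefault("proxy", self.proxy_url)
        return None


# Stand-ins for scrapy.Request objects (only the .meta dict is needed here).
direct = SimpleNamespace(meta={"dont_proxy": True})
proxied = SimpleNamespace(meta={})

mw = SelectiveProxyMiddleware()
mw.process_request(direct, spider=None)
mw.process_request(proxied, spider=None)

print(direct.meta)
print(proxied.meta)
```

In a real project the class would be registered under DOWNLOADER_MIDDLEWARES, and the spider would yield requests with meta={"dont_proxy": True} exactly as in the answer above.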
I am quite new to Scrapy / ProxyMesh.
My requests to the ProxyMesh server seem to be working: I can see my bandwidth consumption on the ProxyMesh website, and meta.proxy is correct in my logs.
However, when I log the response headers in Scrapy, I do not receive the X-Proxymesh-IP header that I am supposed to receive.
Here is my code. What am I doing wrong?
This is my middleware
import logging

class Proxymesh(object):
    def __init__(self):
        logging.debug('Initialized Proxymesh middleware')
        self.proxy_ip = 'http://host:port'

    def process_request(self, request, spider):
        logging.debug('Processing request through proxy IP: ' + self.proxy_ip)
        request.meta['proxy'] = self.proxy_ip
These are my settings in my spider
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "projectName.middlewares.proxymesh.Proxymesh": 1,
    }
}
This is what the response headers look like
['Set-Cookie']:['__cfduid=d88d4e4cb7... HttpOnly']
['Vary']:['User-Agent,Accept-Encoding']
['Server']:['cloudflare-nginx']
['Date']:['Thu, 19 Oct 2017 10...38:10 GMT']
['Cf-Ray']:['3b031b30cbef1565-CDG']
['Content-Type']:['text/html; charset=UTF-8']
Thank you for your help
I don't know if this is still relevant, but I'm posting it here anyway. There's an issue with ProxyMesh when it is used with Scrapy or python-requests.
When connecting through a proxy, a CONNECT request is first sent to the proxy service to create a tunnel, which then forwards the actual request.
If that succeeds, ProxyMesh adds the X-Proxymesh-IP header to its response to the CONNECT request. Scrapy misses this header entirely, because it only looks at the response headers of the actual request.
This only happens with HTTPS requests, because the actual request and its response are encrypted inside the tunnel, so the proxy cannot add headers to them.
References:
https://docs.proxymesh.com/article/74-proxy-server-headers-over-https
https://bugs.python.org/issue24964
https://github.com/requests/requests/issues/3061
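To make the point about the CONNECT reply concrete, here is a small sketch of what a proxy's answer to CONNECT looks like on the wire and where the header sits. The sample bytes and the address 203.0.113.7 are made up for illustration, and parse_connect_response is a hypothetical helper, not something Scrapy or requests provides.

```python
def parse_connect_response(raw):
    """Split a proxy's reply to CONNECT into (status_code, headers).

    Header names are lower-cased because HTTP header names are case-insensitive.
    """
    head, _, _ = raw.partition(b"\r\n\r\n")
    status_line, _, header_block = head.partition(b"\r\n")
    # Status line looks like: HTTP/1.1 200 Connection established
    status_code = int(status_line.split(b" ", 2)[1])

    headers = {}
    for line in header_block.split(b"\r\n"):
        if not line:
            continue
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("ascii")] = value.strip().decode("latin-1")
    return status_code, headers


# Made-up example of the proxy's reply; everything after the blank line
# would be the encrypted TLS traffic of the tunnelled request.
SAMPLE_REPLY = (
    b"HTTP/1.1 200 Connection established\r\n"
    b"X-Proxymesh-IP: 203.0.113.7\r\n"
    b"\r\n"
)

status, headers = parse_connect_response(SAMPLE_REPLY)
print(status, headers.get("x-proxymesh-ip"))
```

The header lives in this first reply only. The response that Scrapy hands to your callback comes from the target site through the tunnel, which is why the header never shows up in response.headers.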
Maybe you need to do this too?
DOWNLOADER_MIDDLEWARES = {
    # scrapy.contrib.* was the old module path; current Scrapy uses this one
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
}
Also, in your callback function, are you sure you are printing response.headers?
I am trying to test a proxy connection using urllib2.ProxyHandler. However, in some situations I will need to request an HTTPS website (e.g. https://www.whatismyip.com/).
urllib2.urlopen() throws an error when I request an HTTPS site, so I wrote a helper function to replace urlopen.
Here is the helper function:
import ssl
import urllib2

def urlopen(url, timeout):
    if hasattr(ssl, 'SSLContext'):
        # Build a context with certificate and hostname checks disabled
        SslContext = ssl.create_default_context()
        SslContext.check_hostname = False
        SslContext.verify_mode = ssl.CERT_NONE
        return urllib2.urlopen(url, timeout=timeout, context=SslContext)
    else:
        return urllib2.urlopen(url, timeout=timeout)
This helper function is based on another answer.
Then I use:
urllib2.install_opener(
    urllib2.build_opener(
        urllib2.ProxyHandler({'http': '127.0.0.1:8080'})
    )
)
to set up an HTTP proxy for the urllib2 opener.
Ideally, this should work when I request a website with urlopen('http://whatismyip.com', 30), and all traffic should pass through the HTTP proxy.
However, urlopen() always takes the if hasattr(ssl, 'SSLContext') branch, even for an HTTP site. In addition, HTTPS sites do not use the HTTP proxy either. As a result the proxy is bypassed and all traffic goes out over the unproxied network.
I also tried this answer, changing the proxy scheme from HTTP to HTTPS with urllib2.ProxyHandler({'https': '127.0.0.1:8080'}), but it still does not work.
My proxy is working. If I use urllib2.urlopen() instead of my rewritten urlopen(), it works for HTTP sites.
But I do need to handle the situation where urlopen has to be used on an HTTPS-only site.
How do I do that?
Thanks
UPDATE1: I cannot get this to work with Python 2.7.11, while some servers running Python 2.7.5 work properly. I assume it is a Python version issue.
urllib2 will not go through the HTTPS proxy, so all HTTPS addresses fail to use the proxy.
The problem is that when you pass the context argument to urllib2.urlopen(), urllib2 creates an opener itself instead of using the global one, which is the one that gets set when you call urllib2.install_opener(). As a result, the ProxyHandler instance you meant to use is never used.
The solution is not to install the opener but to use it directly. When building your opener, you have to pass both an instance of ProxyHandler (to set the proxies for the http and https protocols) and an instance of HTTPSHandler (to set the HTTPS context).
I created https://bugs.python.org/issue29379 for this issue.
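The approach described above can be sketched roughly as follows. This version uses Python 3's urllib.request, the successor to urllib2, so the module name differs from the question's code. build_proxy_opener and its verify flag are hypothetical names, and 127.0.0.1:8080 is just the proxy address from the question.

```python
import ssl
import urllib.request


def build_proxy_opener(proxy_url, verify=True):
    """Return an opener that sends http and https traffic through proxy_url."""
    context = ssl.create_default_context()
    if not verify:
        # Same effect as the question's helper: skip certificate checks.
        # check_hostname must be disabled before verify_mode is relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE

    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    https_handler = urllib.request.HTTPSHandler(context=context)
    # Both handlers go into one opener, so the proxy and the SSL context
    # are applied together instead of one replacing the other.
    return urllib.request.build_opener(proxy_handler, https_handler)


opener = build_proxy_opener("http://127.0.0.1:8080", verify=False)

# Use the opener directly rather than urlopen(..., context=...):
# response = opener.open("https://www.example.com/", timeout=30)
```

The actual request is left commented out because it needs a proxy listening on that address. The key point is that opener.open() is called on the object you built, so no implicit opener is created behind your back.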
I personally would suggest the use of something such as python-requests as it will alleviate a lot of the issues with setting up the proxy using urllib2 directly. When using requests with a proxy you will have to do: (From their documentation)
import requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
Disabling SSL certificate verification is as simple as passing verify=False to the requests.get call above. However, this should be used sparingly, and the actual issue with the SSL certificate verification should be resolved.
One more solution is to pass context into HTTPSHandler and pass this handler into build_opener together with ProxyHandler:
import ssl
import urllib2

proxies = {'https': 'http://localhost:8080'}
proxy = urllib2.ProxyHandler(proxies)
# PROTOCOL_TLSv1 pins the connection to TLS 1.0
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
handler = urllib2.HTTPSHandler(context=context)
opener = urllib2.build_opener(proxy, handler)
urllib2.install_opener(opener)
Now you can view all your HTTPS requests/responses in your proxy.
I am new to Scrapy. I found out how to use an HTTP proxy, but I want to use HTTP and HTTPS proxies together, because the links I crawl include both http and https URLs. How do I use both an HTTP and an HTTPS proxy?
import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
        # like here: request.meta['proxy'] = "https://YOUR_PROXY_IP:PORT"
        proxy_user_pass = "USERNAME:PASSWORD"
        # set up basic authentication for the proxy
        # (b64encode, unlike encodestring, does not append a trailing newline)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8')).decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
You could use the standard environment variables in combination with HttpProxyMiddleware:
This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value for Request objects.
Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:
http_proxy
https_proxy
no_proxy
You can also set the meta key proxy per-request, to a value like http://some_proxy_server:port.
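As a rough illustration of the environment-variable lookup the quoted docs compare the middleware to, the snippet below sets the three variables and reads them back with the standard library. It uses Python 3's urllib.request, and the proxy address 127.0.0.1:8080 is an assumption for the example. For a real crawl you would normally export the variables in the shell before starting Scrapy rather than set them in code.

```python
import os
import urllib.request

# Assumed proxy address, for illustration only.
os.environ["http_proxy"] = "http://127.0.0.1:8080"
os.environ["https_proxy"] = "http://127.0.0.1:8080"
# Hosts listed here are reached directly, without the proxy.
os.environ["no_proxy"] = "localhost"

# getproxies() returns a mapping of URL scheme -> proxy URL.
proxies = urllib.request.getproxies()
print(proxies["http"])
print(proxies["https"])

# proxy_bypass() reports whether a host is exempted via no_proxy.
print(bool(urllib.request.proxy_bypass("localhost")))
```

Because the scheme is the lookup key, http:// and https:// links can be sent to different proxies simply by giving http_proxy and https_proxy different values.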
Does os.environ['http_proxy'] still work?
And how do I use a proxy per request?
HTTP proxy support was added to aiohttp in the recent 0.7.3 release.
It doesn't use os.environ['http_proxy'] and probably never will.
To specify a proxy for a request you can use code like this:
connector = aiohttp.ProxyConnector(proxies={'http': 'http://proxyaddr:8118'})
response = yield from aiohttp.request('get', 'http://python.org/', connector=connector)
HTTPS proxies are not supported yet, sorry.
Perhaps we will add the feature very soon: we need HTTPS proxies for our own business tasks.
I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
This is basically a simplified version of what I'm trying to do:
The way the website works:
When you visit the website you get a session cookie.
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
My script:
My spider has a start url of searchpage_url
The searchpage is requested by parse() and the search form response gets passed to search_generator()
search_generator() then yields lots of search requests using FormRequest and the search form response.
Each of those FormRequests, and its subsequent child requests, needs to have its own session, so it needs its own individual cookiejar and its own session cookie.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
I assume I have to disable multiple concurrent requests. Otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?
I'm confused, any clarification would be gratefully received!
EDIT:
Another option I've just thought of is managing the session cookie completely manually, and passing it from one request to the next.
I suppose that would mean disabling cookies, then grabbing the session cookie from the search response and passing it along to each subsequent request.
Is this what you should do in this situation?
Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
Just use something like this in your spider's start_requests method:
for i, url in enumerate(urls):
    # one cookiejar per start URL, keyed by its index
    yield scrapy.Request(url, meta={'cookiejar': i},
                         callback=self.parse_page)
And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
from scrapy.http.cookies import CookieJar
...

class Spider(BaseSpider):
    def parse(self, response):
        '''Parse category page, extract subcategories links.'''
        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href")
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            # Use dont_merge_cookies to force the site to generate a new PHPSESSID cookie.
            # This is needed because the site uses sessions to remember the search parameters.
            yield Request(subcategorySearchLink, callback = self.extractItemLinks,
                          meta = {'dont_merge_cookies': True})

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href"):
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback = self.extractItemLinks,
                              meta = {'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request) # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
def parse(self, response):
    # do something
    yield scrapy.Request(
        url="http://new-page-to-parse.com/page/4/",
        cookies={
            'h0': 'blah',
            'taeyeon': 'pretty'
        },
        callback=self.parse
    )
Scrapy has a downloader middleware, CookiesMiddleware, that implements cookie support. You just need to enable it. It mimics how the cookiejar in a browser works.
When a request goes through CookiesMiddleware, the middleware reads the cookies for that domain from the cookiejar and sets them in the Cookie header.
When a response returns, CookiesMiddleware reads the cookies the server sent in the Set-Cookie response header and saves/merges them into the middleware's cookiejar.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned?
Every spider has its own downloader middleware, so spiders have separate cookiejars.
Normally, all requests from one spider share one cookiejar, but CookiesMiddleware has options to customize this behavior:
Request.meta["dont_merge_cookies"] = True tells the middleware that this particular request should not read cookies from the cookiejar, and that Set-Cookie headers from its response should not be merged into the cookiejar. It's a request-level switch.
CookiesMiddleware supports multiple cookiejars. You control which cookiejar to use at the request level: Request.meta["cookiejar"] = custom_cookiejar_name.
Please see the docs and the related source code of CookiesMiddleware.
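The behavior described above can be pictured with a deliberately simplified toy model. This is not Scrapy's implementation: ToyCookiesMiddleware is a hypothetical class, cookies are plain name/value pairs, and domain, path and expiry handling are ignored entirely. It only illustrates how the cookiejar and dont_merge_cookies meta keys select or skip a jar.

```python
from collections import defaultdict


class ToyCookiesMiddleware:
    """Toy model of per-request cookiejar selection (not Scrapy's real code)."""

    def __init__(self):
        # One jar per value of meta["cookiejar"]; None is the shared default jar.
        self.jars = defaultdict(dict)

    def process_request(self, meta, headers):
        if meta.get("dont_merge_cookies"):
            return  # this request does not read from any jar
        jar = self.jars[meta.get("cookiejar")]
        if jar:
            headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in jar.items())

    def process_response(self, meta, set_cookies):
        if meta.get("dont_merge_cookies"):
            return  # Set-Cookie values from this response are not stored
        self.jars[meta.get("cookiejar")].update(set_cookies)


mw = ToyCookiesMiddleware()

# Two searches, each answered with its own session cookie.
mw.process_response({"cookiejar": 1}, {"session": "aaa"})
mw.process_response({"cookiejar": 2}, {"session": "bbb"})
# A response flagged dont_merge_cookies leaves jar 1 untouched.
mw.process_response({"cookiejar": 1, "dont_merge_cookies": True}, {"session": "zzz"})

headers_1, headers_2, headers_3 = {}, {}, {}
mw.process_request({"cookiejar": 1}, headers_1)
mw.process_request({"cookiejar": 2}, headers_2)
mw.process_request({"cookiejar": 1, "dont_merge_cookies": True}, headers_3)

print(headers_1, headers_2, headers_3)
```

In this model, follow-up requests keep their session only if they carry the same cookiejar key, which matches the advice to pass response.meta['cookiejar'] along to every subsequent request.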
I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (that would be received in the constructor), in order to reuse the cookies management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:
scrapy crawl myspider -a search_query=something
Or you can use Scrapyd for running all the spiders through the JSON API.
There are a couple of Scrapy extensions that provide a bit more functionality to work with sessions:
scrapy-sessions allows you to attach statically defined profiles (proxy and User-Agent) to your sessions, process cookies, and rotate profiles on demand
scrapy-dynamic-sessions is almost the same, but lets you randomly pick a proxy and User-Agent and handles retrying requests after any errors