I've written a script in Python using Scrapy to send a request to a webpage through a proxy, without changing anything in settings.py or DOWNLOADER_MIDDLEWARES. It is working great now. However, the one thing I can't manage is creating a list of proxies so that if one fails, another takes over. How can I tweak this portion, os.environ["http_proxy"] = "http://176.58.125.65:80", to use proxies from a list one by one, since it only supports a single proxy? Any help on this will be highly appreciated.
This is what I've tried so far (working one):
import scrapy, os
from scrapy.crawler import CrawlerProcess

class ProxyCheckerSpider(scrapy.Spider):
    name = 'lagado'
    start_urls = ['http://www.lagado.com/proxy-test']
    os.environ["http_proxy"] = "http://176.58.125.65:80"  # can't modify this portion to get a list of proxies

    def parse(self, response):
        stat = response.css(".main-panel p::text").extract()[1:3]
        yield {"Proxy-Status": stat}

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(ProxyCheckerSpider)
c.start()
I do not want to change anything in settings.py or create any custom middleware for this purpose. I wish to achieve the same thing (externally) as I did above with a single proxy. Thanks.
You can also set the meta key proxy per-request, to a value like http://some_proxy_server:port or http://username:password@some_proxy_server:port.
From the official docs: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
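For example, here is a minimal sketch (not from the original answer; the second proxy address is a placeholder) of rotating through a plain Python list by setting meta['proxy'] per request. Note that this only rotates proxies; it does not by itself retry a request whose proxy failed:

import itertools
import scrapy
from scrapy.crawler import CrawlerProcess

# placeholder proxy addresses -- replace with your own list
PROXIES = itertools.cycle([
    'http://176.58.125.65:80',
    'http://another.proxy.example:8080',
])

class ProxyCheckerSpider(scrapy.Spider):
    name = 'lagado'

    def start_requests(self):
        for url in ['http://www.lagado.com/proxy-test']:
            # each request takes the next proxy from the cycle
            yield scrapy.Request(url, meta={'proxy': next(PROXIES)},
                                 callback=self.parse)

    def parse(self, response):
        yield {"Proxy-Status": response.css(".main-panel p::text").extract()[1:3]}

c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
c.crawl(ProxyCheckerSpider)
c.start()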
So you need to write your own middleware that would do:
Catch failed responses
If response failed because of proxy:
replace request.meta['proxy'] value with new proxy ip
reschedule request
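A rough sketch of such a middleware, assuming a hypothetical PROXY_LIST setting; it reuses the built-in RetryMiddleware so the retry bookkeeping is handled for you:

import random
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class RotateProxyOnFailureMiddleware(RetryMiddleware):
    """Reuses RetryMiddleware's retry logic, but swaps in a new proxy first."""

    def __init__(self, settings):
        super().__init__(settings)
        self.proxies = settings.getlist('PROXY_LIST')  # hypothetical setting

    def process_response(self, request, response, spider):
        # treat retryable HTTP codes (e.g. 503 from a dead proxy) as proxy failures
        if response.status in self.retry_http_codes:
            request.meta['proxy'] = random.choice(self.proxies)
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        # connection errors usually mean the proxy itself is unreachable
        request.meta['proxy'] = random.choice(self.proxies)
        return self._retry(request, exception, spider)

You would still have to register it in DOWNLOADER_MIDDLEWARES (typically in place of the stock RetryMiddleware so the two don't compete), which is unavoidable if you want automatic failover.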
Alternatively, you can look into Scrapy extension packages that are already made to solve this: https://github.com/TeamHG-Memex/scrapy-rotating-proxies
Related
I am building a spider that uses Selenium as well as a proxy. The main goal is to make the spider as robust as possible at avoiding getting caught while web scraping. I know that Scrapy has the 'scrapy-rotating-proxies' package, but I'm having trouble verifying that Scrapy would check whether the chromedriver succeeded in requesting a webpage and, if it fails because it got caught, run the process of switching the proxy.
Second, I am somewhat unsure of how a proxy is handled by my computer. For example, if I set a proxy value, is this value consistent for anything that makes a request on my computer? I.e., will Scrapy and the webdriver have the same proxy values as long as one of them sets the value? In particular, if Scrapy has a proxy value, will any Selenium webdriver instantiated inside the class definition inherit that proxy?
I'm quite inexperienced with these tools and would really appreciate some help!
I've tried looking for a method to test and check the proxy value of Selenium as well as Scrapy to compare.
import requests
from itertools import cycle
from lxml.html import fromstring

# gets the proxies and sets the value of the scrapy proxy list in settings
def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    proxy_pool = cycle(proxies)

    url = 'https://httpbin.org/ip'
    new_proxy_list = []
    for i in range(1, 30):
        # Get a proxy from the pool
        proxy = next(proxy_pool)
        try:
            response = requests.get(url, proxies={"http": proxy, "https": proxy})
            # Grab and append proxy if valid
            new_proxy_list.append(proxy)
        except requests.exceptions.RequestException:
            # Most free proxies will often get connection errors. You would have to retry
            # the request with another proxy; we just skip retries here.
            print("Skipping. Connection error")
    # add to settings proxy list
    # ('settings' is assumed to be the project settings object, imported elsewhere)
    settings.ROTATING_PROXY_LIST = new_proxy_list
I have built a scraper and would like to download some images through a proxy in Scrapy. I don't know if it is really downloading through the proxy. The response headers don't show the IP. Furthermore, if I change the IP to a random one, it still downloads the image.
How can I ensure it is using a proxy to download the images?
Thanks
Pipelines.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        meta = {'proxy': 'http://23.323.44.22:11111/'}
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta=meta)
Settings.py
ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
If the download works with a random IP, the proxy is not used.
The Scrapy docs say:
"You can also set the meta key proxy per-request, to a value like http://some_proxy_server:port." Maybe the '/' at the end of your proxy URL confuses Scrapy?
To make sure that a proxy is used, I would use Wireshark and filter on the proxy IP. If you see traffic for its IP, it is likely that it is being used.
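Another quick sanity check, as a sketch (the proxy address is the placeholder from the question): fetch a page that echoes the client IP through the same meta key and see whose IP comes back:

import scrapy

class ProxyEchoSpider(scrapy.Spider):
    name = 'proxy_echo'

    def start_requests(self):
        # same meta key as in the pipeline, without the trailing slash
        yield scrapy.Request('https://httpbin.org/ip',
                             meta={'proxy': 'http://23.323.44.22:11111'},
                             callback=self.parse)

    def parse(self, response):
        # if the proxy is really used, this logs the proxy's IP, not yours
        self.logger.info('Origin IP seen by the server: %s', response.text)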
I believe using the "callback" method is asynchronous; please correct me if I'm wrong. I'm still new to Python, so please bear with me.
Anyway, I'm trying to make a method to check if a file exists and here is my code:
def file_exists(self, url):
    res = False
    response = Request(url, method='HEAD', dont_filter=True)
    if response.status == 200:
        res = True
    return res
I thought the Request() method would return a Response object, but it still returns a Request object; to capture the Response, I have to create a different method for the callback.
Is there a way to get the Response object within the code block where you call Request()?
If anyone is still interested in a possible solution – I managed it by doing a request with "requests" sort of "inside" a scrapy function like this:
import requests
request_object = requests.get(the_url_you_like_to_get)
response_object = scrapy.Selector(request_object)
item['attribute'] = response_object.xpath('//path/you/like/to/get/text()').extract_first()
and then proceed.
Request objects don't generate anything.
Scrapy uses an asynchronous downloader engine which takes these Request objects and generates Response objects.
If any method in your spider returns a Request object, it is automatically scheduled in the downloader, and the resulting Response object is passed to the specified callback (i.e. Request(url, callback=self.my_callback)).
Check out Scrapy's architecture overview for more.
Now, depending on when and where you are doing it, you can tell the downloader engine to schedule a request directly:
self.crawler.engine.schedule(Request(url, callback=self.my_callback), spider)
If you run this from a spider, spider here can most likely be self, and self.crawler is an attribute inherited from scrapy.Spider.
Alternatively you can always block asynchronous stack by using something like requests like:
import requests

def parse(self, response):
    item = {}
    image_url = response.xpath('//img/@src').extract_first()
    if image_url:
        # blocking HEAD request to check the content type before storing the URL
        image_head = requests.head(image_url)
        if 'image' in image_head.headers['Content-Type']:
            item['image'] = image_url
    yield item
It will slow your spider down but it's significantly easier to implement and manage.
Scrapy uses Request and Response objects for crawling web sites.
Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
Unless you are manually using a Downloader, it seems like the way you're using the framework is incorrect. I'd read a bit more about how you can create proper spiders here.
As for checking whether a file exists, your spider can store the relevant information in a database or another data structure when parsing the scraped data in its parse*() method, and you can later query it in your own code.
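To keep the file-exists check inside the framework, a minimal sketch (the URLs and field names are illustrative) would issue the HEAD request as a normal Scrapy request and read the status in a callback:

import scrapy

class FileCheckSpider(scrapy.Spider):
    name = 'file_check'
    start_urls = ['http://example.com/']  # illustrative

    def parse(self, response):
        file_url = 'http://example.com/some/file.pdf'  # illustrative
        yield scrapy.Request(file_url, method='HEAD', dont_filter=True,
                             meta={'handle_httpstatus_list': [404]},  # let 404s reach the callback
                             callback=self.check_file)

    def check_file(self, response):
        # the Response object only exists here, in the callback
        yield {'url': response.url, 'exists': response.status == 200}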
I'm working with http://robobrowser.readthedocs.org/en/latest/readme.html (a new Python library based on the Beautiful Soup and requests libraries) within Django. My Django app contains:
def index(request):
    p = str(request.POST.get('p', False))  # p='https://www.yahoo.com/'
    pr = "http://10.10.1.10:3128/"
    setProxy(pr)
    browser = RoboBrowser(history=True)
    postedmessage = browser.open(p)
    return HttpResponse(postedmessage)
I would like to add a proxy to my code but can't find a reference in the docs on how to do this. Is it possible to do this?
EDIT:
Following your recommendation, I've changed the code to:
pr="http://10.10.1.10:3128/"
setProxy(pr)
browser = RoboBrowser(history=True)
with:
def setProxy(pr):
    import os
    os.environ['HTTP_PROXY'] = pr
    return
I'm now getting:
Django Version: 1.6.4
Exception Type: LocationParseError
Exception Value:
Failed to parse: Failed to parse: 10.10.1.10:3128
Any ideas on what to do next? I can't find a reference to this error.
After some recent API cleanup in RoboBrowser, there are now two relatively straightforward ways to control proxies. First, you can configure proxies in your requests session, and then pass that session to your browser. This will apply your proxies to all requests made through the browser.
from requests import Session
from robobrowser import RoboBrowser
session = Session()
session.proxies = {'http': 'http://my.proxy.com/'}
browser = RoboBrowser(session=session)
Second, you can set proxies on a per-request basis. The open, follow_link, and submit_form methods of RoboBrowser now accept keyword arguments for requests.Session.send. For example:
browser.open('http://stackoverflow.com/', proxies={'http': 'http://your.proxy.com'})
Since RoboBrowser uses the requests library, you can try to set the proxies as mentioned in the requests docs by setting the environment variables HTTP_PROXY and HTTPS_PROXY.
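A minimal sketch of that environment-variable approach (the address is the placeholder from the question); requests, and therefore RoboBrowser, should pick these up automatically:

import os
from robobrowser import RoboBrowser

# requests reads these variables when no explicit proxies are passed
os.environ['HTTP_PROXY'] = 'http://10.10.1.10:3128'
os.environ['HTTPS_PROXY'] = 'http://10.10.1.10:3128'

browser = RoboBrowser(history=True)
browser.open('https://www.yahoo.com/')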
I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
This is basically a simplified version of what I'm trying to do:
The way the website works:
When you visit the website you get a session cookie.
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
My script:
My spider has a start url of searchpage_url
The searchpage is requested by parse() and the search form response gets passed to search_generator()
search_generator() then yields lots of search requests using FormRequest and the search form response.
Each of those FormRequests, and subsequent child requests, needs to have its own session, so it needs its own individual cookiejar and its own session cookie.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
I assume I have to disable multiple concurrent requests; otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?
I'm confused, any clarification would be greatly received!
EDIT:
Another option I've just thought of is managing the session cookie completely manually and passing it from one request to the next.
I suppose that would mean disabling cookies, and then grabbing the session cookie from the search response and passing it along to each subsequent request.
Is this what you should do in this situation?
Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
Just use something like this in your spider's start_requests method:
for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
                         callback=self.parse_page)
And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
from scrapy.http.cookies import CookieJar
...

class Spider(BaseSpider):
    def parse(self, response):
        '''Parse category page, extract subcategory links.'''
        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href")
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            yield Request(subcategorySearchLink, callback=self.extractItemLinks,
                          meta={'dont_merge_cookies': True})
            '''Use dont_merge_cookies to force the site to generate a new PHPSESSID cookie.
            This is needed because the site uses sessions to remember the search parameters.'''

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href"):
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback=self.extractItemLinks,
                              meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request)  # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
def parse(self, response):
    # do something
    yield scrapy.Request(
        url="http://new-page-to-parse.com/page/4/",
        cookies={
            'h0': 'blah',
            'taeyeon': 'pretty'
        },
        callback=self.parse
    )
Scrapy has a downloader middleware CookiesMiddleware implemented to support cookies. You just need to enable it. It mimics how the cookiejar in browser works.
When a request goes through CookiesMiddleware, it reads the cookies for this domain and sets them on the Cookie header.
When a response returns, CookiesMiddleware reads the cookies sent by the server in the Set-Cookie response header and saves/merges them into the cookiejar kept by the middleware.
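If you want to watch that Cookie/Set-Cookie traffic, a quick sketch using the standard settings is to turn on cookie debugging in settings.py:

# settings.py
COOKIES_ENABLED = True  # CookiesMiddleware is on by default; this just makes it explicit
COOKIES_DEBUG = True    # log every Cookie header sent and Set-Cookie header received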
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned?
Every spider has its own downloader middleware, so spiders have separate cookiejars.
Normally, all requests from one spider share one cookiejar. But CookiesMiddleware has options to customize this behavior:
Request.meta["dont_merge_cookies"] = True tells the middleware that this particular request should not read the Cookie header from the cookiejar, and that the Set-Cookie headers from its response should not be merged into the cookiejar. It's a request-level switch.
CookiesMiddleware supports multiple cookiejars. You have to control which cookiejar to use at the request level: Request.meta["cookiejar"] = custom_cookiejar_name.
Please see the docs and the related source code of CookiesMiddleware.
I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (received in the constructor), in order to reuse the cookie management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:
scrapy crawl myspider -a search_query=something
Or you can use Scrapyd for running all the spiders through the JSON API.
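A minimal sketch of such a spider (the URL and form field are illustrative), receiving the query through its constructor as described above:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, search_query=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # the value passed with -a search_query=... arrives here
        self.search_query = search_query

    def start_requests(self):
        yield scrapy.FormRequest('http://www.example.com/search',
                                 formdata={'q': self.search_query},
                                 callback=self.parse)

    def parse(self, response):
        # each spider process keeps its own cookiejar, so sessions never mix
        pass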
There are a couple of Scrapy extensions that provide a bit more functionality to work with sessions:
scrapy-sessions allows you to attach statically defined profiles (proxy and User-Agent) to your sessions, process cookies, and rotate profiles on demand
scrapy-dynamic-sessions is almost the same, but lets you randomly pick the proxy and User-Agent and handles request retries due to any errors