Scrapy - stop requests but process responses - python

I have a Scrapy project with a lot of spiders. There is a server-side solution that restarts an HMA VPN in order to change the interface IP (so that we get a different IP and don't get blocked).
There is a custom downloader middleware that sends a corresponding socket message for each request and response, so that the server-side solution can trigger the VPN restart. Obviously Scrapy must NOT yield any new requests while a VPN restart is about to happen - we control that with a lock file. Scrapy must, however, handle all not-yet-received responses before the VPN restart can actually take place.
Putting a sleep in the downloader middleware stops Scrapy completely. Is there a way to keep handling responses but hold off new requests (until the lock file is removed)?
This obviously only matters when more than one concurrent request is yielded.
The following middleware code is used:
import os
import time

class CustomMiddleware(object):

    def process_request(self, request, spider):
        # Block while the VPN-restart lock file is present.
        while os.path.exists(LOCK_FILE_PATH):
            time.sleep(10)
        # Send corresponding socket message ("OPEN")

    def process_response(self, request, response, spider):
        # Send corresponding socket message ("CLOSE")
        return response

It turned out the solution is very simple:
if os.path.exists(LOCK_FILE_PATH):
    return request
Returning the request this way sends it back through the middlewares over and over until it can finally be executed.
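For context, here is a minimal sketch of how that check might sit inside the middleware from the question (LOCK_FILE_PATH and the socket messaging are defined elsewhere, exactly as in the original setup):

import os

class CustomMiddleware(object):

    def process_request(self, request, spider):
        # While the VPN-restart lock exists, reschedule the request instead
        # of sending it; Scrapy keeps retrying it until the lock is gone,
        # and already-issued requests still flow through process_response.
        if os.path.exists(LOCK_FILE_PATH):
            return request
        # Send corresponding socket message ("OPEN")

    def process_response(self, request, response, spider):
        # Send corresponding socket message ("CLOSE")
        return response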

Related

How to write a mitmproxy addon that avoids any network request?

I have been trying mitmproxy over the last couple of days as a test tool and it works excellently. However, while I'm able to write add-ons that intercept requests (even changing their URL, as in my example below), I couldn't prevent the request from actually being dispatched over the network.
One way or another, the request is always performed over the network.
So how can I modify my add-on so that, given a request, it returns a fixed response and avoids any network request at all?
from mitmproxy import http

class Interceptor:
    def request(self, flow: http.HTTPFlow):
        if flow.request.method == "GET":
            flow.request.url = "http://google.com"

    def response(self, flow: http.HTTPFlow):
        flow.response = http.HTTPResponse.make(200, b"Rambo 5")
The request hook is executed when mitmproxy has received the request; the response hook is executed once it has fetched the response from the server. Long story short, anything you do in the response hook is too late.
Instead, you need to assign flow.response in the request hook.
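For example, a minimal sketch of an addon along those lines (the method check and body are taken from the question; depending on your mitmproxy version the response class may be http.Response rather than http.HTTPResponse):

from mitmproxy import http

class Interceptor:
    def request(self, flow: http.HTTPFlow):
        # Assigning flow.response here short-circuits mitmproxy:
        # the request is never forwarded to the network.
        if flow.request.method == "GET":
            flow.response = http.HTTPResponse.make(
                200,                             # status code
                b"Rambo 5",                      # body
                {"Content-Type": "text/plain"},  # headers
            )

addons = [Interceptor()]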

How to perform one final request in scrapy after all requests are done?

In the spider I'm building, I'm required to log in to the website before performing any requests (which is quite simple), and then I go through a loop to perform some thousands of requests.
However, on this particular website, if I do not log out, I get a 10-minute penalty before I can log in again. So I've tried to log out after the loop is done, with a lower priority, like this:
def parse_after_login(self, response):
    for item in [long_list]:
        yield scrapy.Request(..., callback=self.parse_result, priority=100)
    # After all requests have been made, perform logout:
    yield scrapy.Request('/logout/', callback=self.parse_logout, priority=0)
However, there is no guarantee that the logout request won't be ready before the other requests are done processing, so a premature logout will invalidate the other requests.
I have found no way of performing a new request with the spider_closed signal.
How can I perform a new request after all other requests are completed?
You can use the spider_idle signal, which fires once the spider has stopped processing everything.
So once you connect a method to the spider_idle signal with:
self.crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)
you can use that self.spider_idle method to run final tasks once the spider has finished processing everything else:
class MySpider(Spider):
    ...
    self.logged_out = False
    ...

    def spider_idle(self, spider):
        if not self.logged_out:
            self.logged_out = True
            req = Request('someurl', callback=self.parse_logout)
            self.crawler.engine.crawl(req, spider)
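For completeness, one common place to make that signal connection is a from_crawler override; here is a minimal sketch under that assumption ('someurl' and parse_logout are placeholders from the answer, and very recent Scrapy versions expect engine.crawl(request) without the spider argument):

from scrapy import Spider, Request, signals

class MySpider(Spider):
    name = 'myspider'
    logged_out = False

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        # Connect the idle handler once the crawler is available.
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def spider_idle(self, spider):
        # Fires when the scheduler and downloader have nothing left to do,
        # i.e. every other request has already been processed.
        if not self.logged_out:
            self.logged_out = True
            req = Request('someurl', callback=self.parse_logout)
            self.crawler.engine.crawl(req, spider)

    def parse_logout(self, response):
        self.logger.info('Logged out')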

Pass posted data to a teardown function

I receive posted data and immediately return an empty 200 OK response; after that I want to process the received data. I was considering doing it in a teardown function, but I couldn't find how to pass it the received data:
from flask import Flask, request

app = Flask(__name__)

@app.route('/f', methods=['POST'])
def f():
    data = request.stream.read()
    return ''

@app.teardown_request
def teardown_request(exception=None):
    # How do I use the posted data here?
    pass
Flask version is 0.10.1
I'm trying to implement a Paypal IPN listener
https://developer.paypal.com/webapps/developer/docs/classic/ipn/gs_IPN/#overview
Notice that the listener's HTTP 200 response happens before the listener's IPN message.
You are overcomplicating things; just send the request to PayPal from within your Flask request handler. PayPal IPN notifications only require an empty 200 response, and PayPal does not mandate that you send the 200 OK before you make the HTTP request back to their servers.
The overview page is indeed confusing, but the PHP code posted there doesn't close the request until the post back to PayPal's server has completed either.
If this were a hard requirement (which would make it a terrible design), you'd have to handle the request back to PayPal asynchronously. You could do this with a separate thread: for example, push the data you received from the IPN into a queue and have a separate thread poll that queue and talk to PayPal. Or you could use Celery to simplify the job (push out a task to be handled asynchronously). Either way, this would let you close the incoming request early.
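A rough sketch of the simpler in-handler approach, assuming the requests library is available; the verification endpoint shown is the classic sandbox URL and the cmd=_notify-validate postback follows PayPal's IPN verification protocol, but check PayPal's current documentation for the exact endpoint:

from flask import Flask, request
import requests

app = Flask(__name__)

# Classic sandbox endpoint; swap in the live URL for production.
PAYPAL_VERIFY_URL = 'https://www.sandbox.paypal.com/cgi-bin/webscr'

@app.route('/f', methods=['POST'])
def ipn_listener():
    data = request.form.to_dict()
    # Post the notification straight back to PayPal for verification,
    # before returning our own empty 200 response.
    verify = requests.post(PAYPAL_VERIFY_URL,
                           data=dict(data, cmd='_notify-validate'))
    if verify.text == 'VERIFIED':
        pass  # process the payment data here
    return ''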

Setting Scrapy proxy middleware to rotate on each request

This question necessarily comes in two forms, because I don't know the better route to a solution.
A site I'm crawling kicks me to a redirected "User Blocked" page often, but the frequency (by requests/time) seems random, and they appear to have a blacklist blocking many of the "open" proxies list I'm using through Proxymesh. So...
When Scrapy receives a "Redirect" to its request (e.g. DEBUG: Redirecting (302) to (GET http://.../you_got_blocked.aspx) from (GET http://.../page-544.htm)), does it continue to try to get to page-544.htm, or will it continue on to page-545.htm and forever lose out on page-544.htm? If it "forgets" (or counts it as visited), is there a way to tell it to keep retrying that page? (If it does that naturally, then yay, and good to know...)
What is the most efficient solution?
(a) What I'm currently doing: using a proxymesh rotating Proxy through the http_proxy environment variable, which appears to rotate proxies often enough to at least fairly regularly get through the target site's redirections. (Downsides: the open proxies are slow to ping, there are only so many of them, proxymesh will eventually start charging me per gig past 10 gigs, I only need them to rotate when redirected, I don't know how often or on what trigger they rotate, and the above: I don't know if the pages I'm being redirected from are being re-queued by Scrapy...) (If Proxymesh is rotating on each request, then I'm okay with paying reasonable costs.)
(b) Would it make sense (and be simple) to use middleware to reselect a new proxy on each redirection? What about on every single request? Would that make more sense through something else like TOR or Proxifier? If this is relatively straightforward, how would I set it up? I've read something like this in a few places, but most are outdated with broken links or deprecated Scrapy commands.
For reference, I do have middleware currently set up for Proxy Mesh (yes, I'm using the http_proxy environment variable, but I'm a fan of redundancy when it comes to not getting in trouble). So this is what I have for that currently, in case that matters:
import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://open.proxymesh.com:[port number]"
        proxy_user_pass = "username:password"
        encoded_user_pass = base64.encodestring(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
Yesterday I had a similar task involving proxies and protection against DDoS (I was parsing a site).
The idea is to use random.choice: every request has a chance of changing the IP.
Scrapy is combined with Tor and telnetlib here. You need to configure the Tor ControlPort password.
from scrapy import log
from settings import USER_AGENT_LIST
import random
import telnetlib
import time

# 15% chance of changing the IP on each request
class RetryChangeProxyMiddleware(object):
    def process_request(self, request, spider):
        if random.choice(xrange(1, 100)) <= 15:
            log.msg('Changing proxy')
            tn = telnetlib.Telnet('127.0.0.1', 9051)
            tn.read_until("Escape character is '^]'.", 2)
            tn.write('AUTHENTICATE "<PASSWORD HERE>"\r\n')
            tn.read_until("250 OK", 2)
            tn.write("signal NEWNYM\r\n")
            tn.read_until("250 OK", 2)
            tn.write("quit\r\n")
            tn.close()
            log.msg('>>>> Proxy changed. Sleep Time')
            time.sleep(10)

# 30% chance of changing the user agent on each request
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        if random.choice(xrange(1, 100)) <= 30:
            log.msg('Changing UserAgent')
            ua = random.choice(USER_AGENT_LIST)
            if ua:
                request.headers.setdefault('User-Agent', ua)
            log.msg('>>>> UserAgent changed')
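If instead you just want to pick a different proxy from a fixed list on every single request (part (b) of the question), a minimal downloader-middleware sketch could look like the following; PROXY_LIST and its entries are hypothetical placeholders for your own proxy credentials:

import base64
import random

# Hypothetical list of "user:pass@host:port" proxy entries.
PROXY_LIST = [
    'username:password@proxy1.example.com:31280',
    'username:password@proxy2.example.com:31280',
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Pick a fresh proxy for every outgoing request.
        creds, address = random.choice(PROXY_LIST).split('@')
        request.meta['proxy'] = 'http://%s' % address
        encoded = base64.b64encode(creds.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded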

the sample python twisted event driven web application increments request count by 2, why?

The sample code for a basic web server given by http://twistedmatrix.com/trac/ seems to increment the request counter by two for each request, rather than by 1.
The code:
from twisted.web import server, resource
from twisted.internet import reactor

class HelloResource(resource.Resource):
    isLeaf = True
    numberRequests = 0

    def render_GET(self, request):
        self.numberRequests += 1
        request.setHeader("content-type", "text/plain")
        return "I am request #" + str(self.numberRequests) + "\n"

reactor.listenTCP(8080, server.Site(HelloResource()))
reactor.run()
Looking at the code, it seems you should be able to connect to http://localhost:8080 and see:
I am request #1
Then refresh the page and see:
I am request #2
However, I see:
I am request #3
When I refresh again, I see:
I am request #5
So, judging from the counter, the server appears to call the function "render_GET" twice for each request. I am running this on Windows 7 using Python 2.7. Any idea what could be going on or is this expected behavior?
Update: the code is working perfectly; it's the browser that is being tricky. On each page refresh the browser sends a GET request for both "/" and "/favicon.ico", which accounts for the increment of 2, because render_GET really is called twice per page refresh.
Browsers can behave in surprising ways. If you try printing the full request, you might find it is requesting "/" and also "favicon.ico", for example.
The browser might be making a second request for the favicon.ico.
Have your server print the requested location whenever it gets a request; that will tell you whether this is what's happening.
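For example, a quick way to check is to print the requested path inside render_GET from the sample above; the extra hit for /favicon.ico then shows up in the console:

class HelloResource(resource.Resource):
    isLeaf = True
    numberRequests = 0

    def render_GET(self, request):
        # Printing the path reveals the second request for /favicon.ico.
        print(request.path)
        self.numberRequests += 1
        request.setHeader("content-type", "text/plain")
        return "I am request #" + str(self.numberRequests) + "\n"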
