Setting Scrapy proxy middleware to rotate on each request - python

This question necessarily comes in two forms, because I don't know the better route to a solution.
A site I'm crawling often kicks me to a redirected "User Blocked" page, but the frequency (by requests/time) seems random, and they appear to have a blacklist covering many of the proxies on the "open" proxy list I'm using through Proxymesh. So...
When Scrapy receives a "Redirect" to its request (e.g. DEBUG: Redirecting (302) to (GET http://.../you_got_blocked.aspx) from (GET http://.../page-544.htm)), does it continue to try to get to page-544.htm, or will it continue on to page-545.htm and forever lose out on page-544.htm? If it "forgets" (or counts it as visited), is there a way to tell it to keep retrying that page? (If it does that naturally, then yay, and good to know...)
What is the most efficient solution?
(a) What I'm currently doing: using a proxymesh rotating Proxy through the http_proxy environment variable, which appears to rotate proxies often enough to at least fairly regularly get through the target site's redirections. (Downsides: the open proxies are slow to ping, there are only so many of them, proxymesh will eventually start charging me per gig past 10 gigs, I only need them to rotate when redirected, I don't know how often or on what trigger they rotate, and the above: I don't know if the pages I'm being redirected from are being re-queued by Scrapy...) (If Proxymesh is rotating on each request, then I'm okay with paying reasonable costs.)
(b) Would it make sense (and be simple) to use middleware to reselect a new proxy on each redirection? What about on every single request? Would that make more sense through something else like TOR or Proxifier? If this is relatively straightforward, how would I set it up? I've read something like this in a few places, but most are outdated with broken links or deprecated Scrapy commands.
For reference, I do have middleware currently set up for Proxy Mesh (yes, I'm using the http_proxy environment variable, but I'm a fan of redundancy when it comes to not getting in trouble). So this is what I have for that currently, in case that matters:
import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route every request through the ProxyMesh endpoint
        request.meta['proxy'] = "http://open.proxymesh.com:[port number]"
        proxy_user_pass = "username:password"
        # b64encode avoids the trailing newline that the deprecated encodestring adds
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
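To make option (b) concrete: per-request rotation only needs the middleware to pick from a pool on every call to process_request. A minimal sketch, where the PROXY_LIST of endpoints and credentials is purely illustrative (it is not something ProxyMesh provides in this form):
import base64
import random

# Illustrative pool of proxy endpoints and credentials
PROXY_LIST = [
    ("http://proxy-a.example.com:31280", "username:password"),
    ("http://proxy-b.example.com:31280", "username:password"),
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Pick a different proxy for every outgoing request
        proxy_url, user_pass = random.choice(PROXY_LIST)
        request.meta['proxy'] = proxy_url
        encoded = base64.b64encode(user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded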

Yesterday I had a similar task involving proxies and protection against DDoS (I was parsing a site).
The idea is based on random.choice: every request has a chance of changing the IP.
Scrapy goes through Tor, and the middleware talks to Tor's control port with telnetlib, so you need to configure the ControlPort and its password.
from settings import USER_AGENT_LIST

import random
import telnetlib
import time

# 15% chance of changing the exit IP on each request
class RetryChangeProxyMiddleware(object):
    def process_request(self, request, spider):
        if random.randint(1, 100) <= 15:
            spider.logger.info('Changing proxy')
            # Ask the local Tor control port for a fresh circuit (NEWNYM)
            tn = telnetlib.Telnet('127.0.0.1', 9051)
            tn.read_until(b"Escape character is '^]'.", 2)
            tn.write(b'AUTHENTICATE "<PASSWORD HERE>"\r\n')
            tn.read_until(b"250 OK", 2)
            tn.write(b"signal NEWNYM\r\n")
            tn.read_until(b"250 OK", 2)
            tn.write(b"quit\r\n")
            tn.close()
            spider.logger.info('>>>> Proxy changed. Sleep time')
            # Give Tor a moment to build the new circuit
            time.sleep(10)

# 30% chance of changing the User-Agent on each request
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        if random.randint(1, 100) <= 30:
            spider.logger.info('Changing UserAgent')
            ua = random.choice(USER_AGENT_LIST)
            if ua:
                request.headers.setdefault('User-Agent', ua)
            spider.logger.info('>>>> UserAgent changed')
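To activate these, the middlewares have to be enabled in the project settings (module path and priority numbers below are illustrative):
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RetryChangeProxyMiddleware': 543,
    'myproject.middlewares.RandomUserAgentMiddleware': 544,
}
The requests also need to actually be routed through Tor; since Scrapy does not support SOCKS proxies out of the box, this is commonly done by pointing request.meta['proxy'] (or the http_proxy environment variable) at a local HTTP-to-Tor bridge such as Privoxy, typically on http://127.0.0.1:8118.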

Related

urllib.request.urlopen(url) does not return a website response or time out

I want to fetch the source of some websites for a project. When I try to get the response, the program just hangs waiting for it. No matter how long I wait, there is no timeout and no response. Here is my code:
link = "https://eu.mouser.com/"
linkResponse = urllib.request.urlopen(link)
readedResponse = linkResponse.readlines()
writer = open("html.txt", "w")
for line in readedResponse:
writer.write(str(line))
writer.write("\n")
writer.close()
When I try other websites, urlopen returns their response. But when I try "eu.mouser.com" or "uk.farnell.com", it never returns a response, and it never even raises a timeout. What is the problem here? Is there another way to get a website's source?
The urllib.request.urlopen docs claim that
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
without explaining how to find said default. I managed to provoke a timeout by passing 5 (seconds) directly as the timeout:
import urllib.request
url = "https://uk.farnell.com"
urllib.request.urlopen(url, timeout=5)
gives
socket.timeout: The read operation timed out
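For reference, the "global default timeout setting" the docs mention can be set process-wide via socket.setdefaulttimeout, which urlopen falls back to when no explicit timeout is passed; a small sketch:
import socket
import urllib.request
from urllib.error import URLError

# Set the global default timeout that urlopen falls back to
socket.setdefaulttimeout(5)

try:
    urllib.request.urlopen("https://uk.farnell.com")
except (socket.timeout, URLError) as exc:
    print("timed out:", exc)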
There are some sites that protect themselves from automated crawlers by implementing mechanisms that detect such bots. These can be very diverse and also change over time. If you really want to do everything you can to get the page crawled automatically, this usually means that you have to implement steps yourself to circumvent these protective barriers.
One example of this is the header information that is sent with every request. This can be changed before making the request, e.g. via requests' headers parameter, but there are probably more things to adjust here and there.
If you're interested in starting developing such a thing (leaving aside the question of whether this is allowed at all), you can take this as a starting point:
from collections import namedtuple
from contextlib import suppress

import requests
from requests import ReadTimeout

Link = namedtuple("Link", ["url", "filename"])

links = {
    Link("https://eu.mouser.com/", "mouser.com"),
    Link("https://example.com/", "example1.com"),
    Link("https://example.com/", "example2.com"),
}

for link in links:
    with suppress(ReadTimeout):
        response = requests.get(link.url, timeout=3)
        with open(f"html-{link.filename}.txt", "w", encoding="utf-8") as file:
            file.write(response.text)
Protected sites that lead to ReadTimeout errors are simply skipped here, with room to go further, e.g. by giving requests.get(link.url, timeout=3) a suitable headers parameter. But as I already mentioned, this is probably not the only customization that would be needed, and the legal aspects should also be clarified.
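Passing browser-like headers would look something like this; the header values are only an illustrative example, not a guaranteed way past any particular site's protection:
import requests

# Illustrative browser-like headers; real sites may check far more than this
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-GB,en;q=0.9",
}

response = requests.get("https://eu.mouser.com/", headers=headers, timeout=3)
print(response.status_code)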

Using proxy with selenium and scrapy implementation

I am building a spider that uses Selenium as well as a proxy. The main goal is to make the spider as resilient as possible against getting caught while web scraping. I know that Scrapy has the 'scrapy-rotating-proxies' module, but I'm having trouble verifying whether Scrapy would check whether the chromedriver succeeded in requesting a webpage and, if it fails because it got caught, run the process of switching the proxy.
Second, I am somewhat unsure of how a proxy is handled by my computer. For example, when I set a proxy value, is that value consistent for anything that makes a request on my computer? I.e. will Scrapy and the webdriver have the same proxy value as long as one of them sets it? In particular, if Scrapy has a proxy value, will any Selenium webdriver instantiated inside the class definition inherit that proxy?
I'm quite inexperienced with these tools and would really appreciate some help!
I've tried looking for a way to check and compare the proxy value that Selenium and Scrapy each end up using.
import requests
from itertools import cycle
from lxml.html import fromstring

# Gets free proxies and sets the value of the Scrapy proxy list in settings
def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    proxy_pool = cycle(proxies)
    url = 'https://httpbin.org/ip'
    new_proxy_list = []
    for i in range(1, 30):
        # Get a proxy from the pool
        proxy = next(proxy_pool)
        try:
            response = requests.get(url, proxies={"http": proxy, "https": proxy})
            # Grab and append proxy if valid
            new_proxy_list.append(proxy)
        except requests.exceptions.RequestException:
            # Most free proxies will often get connection errors. You would have to retry
            # the whole request with another proxy to make it work; we just skip retries
            # here since we are only downloading a single URL.
            print("Skipping. Connection error")
    # Add to settings proxy list ('settings' is the project settings object used elsewhere)
    settings.ROTATING_PROXY_LIST = new_proxy_list
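On the second question: a proxy set through Scrapy's request.meta['proxy'] (or by a rotating-proxies middleware) only applies to Scrapy's own downloader; it is not a machine-wide setting, and a Selenium-driven browser started from the spider does not inherit it. Chrome has to be given its proxy explicitly, roughly like this (the proxy address is illustrative):
from selenium import webdriver

proxy = "11.22.33.44:3128"  # illustrative host:port, e.g. one entry from new_proxy_list

options = webdriver.ChromeOptions()
# Chrome takes its proxy from its own command-line switch, not from Scrapy
options.add_argument(f"--proxy-server=http://{proxy}")
driver = webdriver.Chrome(options=options)

driver.get("https://httpbin.org/ip")  # should report the proxy's IP
print(driver.page_source)
driver.quit()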

Mitmproxy redirect not working in python (Raspberry Pi)

I've been working on a redirect script for mitmproxy running on a Raspberry Pi. I looked at the post here and it didn't work: the request was still going through to the original host URL. After some changes it would attempt to redirect to the new site, but the status codes didn't reflect that and it was still loading the resources of the original site. Making the path an empty string fixed this temporarily (it was no longer requesting the resources by appending them to the new URL). Also, since the original script did not work, I tried to make the changes mirror how a manual redirect works by adding "Location" to the response headers.
import mitmproxy
from mitmproxy.models import HTTPResponse
from netlib.http import Headers

def request(flow):
    if flow.request.pretty_host.endswith("sojourncollege.com"):
        mitmproxy.ctx.log(flow.request.path)
        method = flow.request.path.split('/')[3].split('?')[0]
        flow.request.host = "reddit.com"
        flow.request.port = 80
        flow.request.scheme = 'http'
        flow.request.path = ''
        if method == 'getjson':
            flow.request.path = flow.request.path.replace(method, "getxml")
        flow.request.headers["Host"] = "reddit.com"
        flow.response.status_code = 302
        flow.response.headers.append("Location")
        mitmproxy.ctx.log(flow.response.headers)
        flow.response.headers["Location"] = "reddit.com"
What is happening now is repeated 301 GET requests to http://reddit.com and a message of [no content]. If I load up the networking tab in Chrome to view the request, it is trying to reach "http://wwww.reddit.comhttp/1.1" and I have no idea why that is the case.
This is the approach that seems to have worked for others; I have no idea why it isn't working on the Pi.
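For comparison, on a recent mitmproxy (7+, where the script API is mitmproxy.http rather than the older mitmproxy.models/netlib imports above) the usual way to force a redirect is to replace the response in the request hook instead of rewriting the outgoing request; a rough sketch:
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    # Answer the request ourselves with a 302 instead of forwarding it
    if flow.request.pretty_host.endswith("sojourncollege.com"):
        flow.response = http.Response.make(
            302,
            b"",
            # Location must be an absolute URL, scheme included
            {"Location": "https://www.reddit.com/"},
        )
Run with something like: mitmproxy -s redirect.py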

Persist authenticated session between crawls for development in Scrapy

I'm using a Scrapy spider that authenticates with a login form upon launching. It then scrapes with this authenticated session.
During development I usually run the spider many times to test it out. Authenticating at the beginning of each run spams the login form of the website. The website will often force a password reset in response and I suspect it will ban the account if this continues.
Because the cookies last a number of hours, there's no good reason to log in this often during development. To get around the password reset problem, what would be the best way to re-use an authenticated session/cookies between runs while developing? Ideally the spider would only attempt to authenticate if the persisted session has expired.
Edit:
My structure is like:
def start_requests(self):
    yield scrapy.Request(self.base, callback=self.log_in)

def log_in(self, response):
    # response.headers includes 'Set-Cookie': 'JSESSIONID=xx; Path=/cas/; Secure; HttpOnly'
    yield scrapy.FormRequest.from_response(response,
                                           formdata={'username': 'xxx',
                                                     'password': ''},
                                           callback=self.logged_in)

def logged_in(self, response):
    # request.headers and all subsequent requests carry a 'Cookie': 'JSESSIONID=xxx' header
    # response.headers has no mention of cookies
    # request.cookies is empty
    pass
When I run the same page request in Chrome, under the 'Cookies' tab there are ~20 fields listed.
The documentation seems thin here. I've tried setting a 'Cookie': 'JSESSIONID=xxx' field on the headers dict of all outgoing requests, based on the values returned by a successful login, but this bounces me back to the login screen.
It turns out that, as an ad-hoc development solution, this is easier than I thought. Get the cookie string with cookieString = request.headers['Cookie'], save it, then on subsequent runs load it up and attach it to outgoing requests with:
request.headers.appendlist('Cookie', cookieString)
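Put together, one way to wire that into the spider might look like the following sketch; the cache file name and the parse_site callback are illustrative, not part of the original answer:
import os

import scrapy

COOKIE_CACHE = "session_cookie.txt"  # illustrative cache file

class AuthSpider(scrapy.Spider):
    name = "auth_example"
    base = "https://example.com/"

    def start_requests(self):
        if os.path.exists(COOKIE_CACHE):
            # Reuse the persisted session instead of hitting the login form again
            with open(COOKIE_CACHE) as f:
                cookie_string = f.read().strip()
            request = scrapy.Request(self.base, callback=self.parse_site)
            request.headers.appendlist('Cookie', cookie_string)
            yield request
        else:
            yield scrapy.Request(self.base, callback=self.log_in)

    def log_in(self, response):
        # As in the question: submit the login form, then land in logged_in
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'xxx', 'password': ''},
            callback=self.logged_in,
        )

    def logged_in(self, response):
        # Persist the cookie string that Scrapy attached to the successful request
        with open(COOKIE_CACHE, "w") as f:
            f.write(response.request.headers['Cookie'].decode())
        yield scrapy.Request(self.base, callback=self.parse_site)

    def parse_site(self, response):
        ...  # normal scraping continues here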

Scrapy - stop requests but process responses

I have a Scrapy project with a lot of spiders. There is a server-side solution that restarts HMA VPN in order to change the interface IP (so that we get a different IP and don't get blocked).
There is a custom downloader middleware that sends a corresponding socket message for each request and response so that the server-side solution can trigger a VPN restart. Obviously Scrapy must NOT yield any new requests when a VPN restart is about to happen; we control that with a lock file. Scrapy must, however, handle all not-yet-received responses before the VPN restart can actually happen.
Putting a sleep in the downloader middleware stops Scrapy completely. Is there a way to keep handling responses but hold off new requests (until the lock file is removed)?
This obviously only matters when more than one concurrent request is in flight.
The following middleware code is used:
import os
import time

# LOCK_FILE_PATH is defined elsewhere in the project

class CustomMiddleware(object):
    def process_request(self, request, spider):
        # Block while the VPN-restart lock file exists
        while os.path.exists(LOCK_FILE_PATH):
            time.sleep(10)
        # Send corresponding socket message ("OPEN")

    def process_response(self, request, response, spider):
        # Send corresponding socket message ("CLOSE")
        return response
It turned out the solution is very simple:
if os.path.exists(LOCK_FILE_PATH):
    return request
Returning the request sends it back through the middleware chain again and again until it can actually be executed.
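In context, the middleware from the question then looks roughly like this (a sketch of the fix, with the socket messages left as comments as in the original):
import os

# LOCK_FILE_PATH is defined elsewhere in the project

class CustomMiddleware(object):
    def process_request(self, request, spider):
        if os.path.exists(LOCK_FILE_PATH):
            # Returning the request reschedules it instead of downloading it now,
            # so no new downloads start while the VPN restart is pending, while
            # responses already in flight still reach process_response.
            return request
        # Send corresponding socket message ("OPEN")

    def process_response(self, request, response, spider):
        # Send corresponding socket message ("CLOSE")
        return response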
