scrapy: switch out failing proxies - python

I'm using this script to randomize proxies in scrapy. The problem is that once it's allocated a proxy to a request, it won't allocate another one because of this code:
def process_request(self, request, spider):
    # Don't overwrite with a random one (server-side state for IP)
    if 'proxy' in request.meta:
        return
That means that if there is a bad proxy which is not connecting to anything, then the request will fail. I'm intending to modify it like this:
if request.meta.get('retry_times', 0) < 5:
    return
thereby letting it allocate a new proxy if the current one fails 5 times. I'm assuming that if I set RETRY_TIMES to, say, 20 in settings.py, then the request won't fail until 4 different proxies have each made 5 attempts.
I'd like to know if that will cause any problems. As I understand it, the reason that the check is there in the first place is for stateful transactions, such as those relying on log-ins, or perhaps cookies. Is that correct?

I ran into the same problem.
I improved aivarsk/scrapy-proxies. My middleware inherits from the stock RetryMiddleware and tries to use one proxy for up to RETRY_TIMES attempts. If the proxy becomes unavailable, the middleware switches to another one.
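Here is a minimal sketch of that approach; it is not the exact middleware from aivarsk/scrapy-proxies, just an illustration of inheriting from the stock RetryMiddleware and dropping the proxy before each retry so a fresh one gets assigned. PROXY_LIST is a hypothetical setting holding proxy URLs.

import random

from scrapy.downloadermiddlewares.retry import RetryMiddleware


class RotatingProxyRetryMiddleware(RetryMiddleware):
    """Sketch only: retries like the stock RetryMiddleware, but discards the
    current proxy on failure so the next attempt gets a different one."""

    def __init__(self, settings):
        super(RotatingProxyRetryMiddleware, self).__init__(settings)
        # PROXY_LIST is a hypothetical setting, e.g. ['http://user:pass@host:port', ...]
        self.proxies = settings.getlist('PROXY_LIST')

    def process_request(self, request, spider):
        if 'proxy' not in request.meta:
            request.meta['proxy'] = random.choice(self.proxies)

    def process_response(self, request, response, spider):
        if response.status in self.retry_http_codes:
            # The current proxy looks bad: drop it so process_request picks another
            request.meta.pop('proxy', None)
            return self._retry(request, response.status, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        # Connection errors etc.: switch proxy and retry
        # (the stock dont_retry/exception-type checks are omitted for brevity)
        request.meta.pop('proxy', None)
        return self._retry(request, exception, spider)

You would enable it in DOWNLOADER_MIDDLEWARES in place of the built-in RetryMiddleware so the two don't double-count retries.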

Yes, I think the idea of that check is to see whether the user has already defined a proxy in the meta parameter, so it can be controlled from the spider.
Switching the proxy after every 5 failed attempts is OK, but you'll probably have to log in to the page again, since most sites notice when the address you are making requests from (the proxy) changes.
Rotating proxies is not as easy as just selecting one randomly, because you could still end up reusing the same proxy, and defining the rules for when a site has "banned" you is not as simple as only checking statuses. These are the services I know of for what you want: Crawlera and Proxymesh.
If you want rotating-proxy functionality directly in Scrapy, I recommend using Crawlera, as it is already fully integrated.

Related

Problem with detecting if link is invalid

Is there any way to detect if a link is invalid using webbot?
I need to tell the user that the link they provided was unreachable.
The only way to be completely sure that a URL sends you to a valid page is to fetch that page and check it works. You could try making a request other than GET (e.g. HEAD) to avoid the wasted bandwidth of downloading the page, but not all servers will respond to it: the only way to be absolutely sure is to GET and see what happens. Something like:
import requests
from requests.exceptions import RequestException


def check_url(url):
    try:
        r = requests.get(url, timeout=1)
        return r.status_code == 200
    except RequestException:  # covers connection errors as well as timeouts
        return False
Is this a good idea? It's only a GET request, and GET is supposed to be idempotent, so you shouldn't cause anybody any harm. On the other hand, what if a user sets up a script to add a new link every second pointing to the same website? Then you're effectively mounting a denial-of-service attack on that website. So when you allow users to cause your server to do things like this, you need to think about how to protect it. (In this case: you could keep a cache of valid links that expires every n seconds, and only do the lookup if the cache doesn't hold the link.)
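A minimal sketch of that cache, assuming the check_url function above and a hypothetical CACHE_TTL of a few minutes:

import time

CACHE_TTL = 300          # seconds; hypothetical value
_cache = {}              # url -> (result, timestamp)

def check_url_cached(url):
    now = time.time()
    hit = _cache.get(url)
    if hit is not None and now - hit[1] < CACHE_TTL:
        return hit[0]                # serve the cached verdict
    result = check_url(url)          # fall back to the real check
    _cache[url] = (result, now)
    return result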
Note that if you just want to check that the link points to a valid domain it's a bit easier: you can just do a DNS query. (The same point about caching and avoiding abuse probably applies.)
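For example, a rough domain-only check using the standard library (this only proves the name resolves, not that anything is listening):

import socket
from urllib.parse import urlparse

def domain_resolves(url):
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        socket.getaddrinfo(host, None)   # raises if the name doesn't resolve
        return True
    except socket.gaierror:
        return False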
Note that I used requests because it is easy, but you likely want to do this in the background, either with requests in a thread, or with one of the asyncio HTTP libraries and an asyncio event loop. Otherwise your code will block for at least timeout seconds.
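One simple way to keep the check off the main thread, as a sketch using the standard library's thread pool and the check_url function above:

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def check_url_in_background(url, callback):
    # Runs check_url on a worker thread and hands the result to `callback`
    future = executor.submit(check_url, url)
    future.add_done_callback(lambda f: callback(f.result()))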
(Another attack: this gets the whole page. What if a user links to a massive page? See this question for a discussion of protecting from oversize responses. For your use case you likely just want to get a few bytes. I've deliberately not complicated the example code here because I wanted to illustrate the principle.)
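If you do want to avoid downloading the whole page, something like this sketch reads only a small slice of the body (the exact limit is up to you):

import requests
from requests.exceptions import RequestException

def check_url_limited(url, max_bytes=2048):
    try:
        # stream=True defers the body download; read at most one small chunk
        with requests.get(url, timeout=1, stream=True) as r:
            next(r.iter_content(chunk_size=max_bytes), b'')
            return r.status_code == 200
    except RequestException:
        return False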
Note that this just checks that something is available at that URL. What if it's one of the many dead links that redirect to a domain-name vendor's website? You could enforce 'no redirects', but then some redirects are valid. (Likewise, you could try to detect redirects up to the main domain or to a blacklist of vendors' domains, but this will always be imperfect.) There is a tradeoff here to consider, which depends on your concrete use case, but it's worth being aware of.
You could try sending an HTTP request, opening the result, and checking it against a list of known error codes (404, etc.). You can easily implement this in Python, and it is efficient and quick. Be warned that sometimes (quite rarely) a website might detect your scraper and artificially return an error code to confuse you.

Unable to get complete source code of web page using Python [duplicate]

I would like to send a requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different ways, but still failed.
All other websites are fine.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to an http://httpbin.org endpoint, have it record the request, and then experiment.
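For example, httpbin echoes back the headers it received, which makes differences between clients easy to spot (a quick illustrative snippet):

import requests

print(requests.get('https://httpbin.org/headers').json())
print(requests.get('https://httpbin.org/headers',
                   headers={'User-Agent': 'Mozilla/5.0'}).json())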
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can properly serve multiple sites from one address. requests sets this one for you.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging in to the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser did); see the sketch below.
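A minimal sketch of such a login flow; the URL, form field names and credentials here are hypothetical placeholders:

import requests

session = requests.Session()
session.get('https://example.com/login')            # pick up any initial cookies
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret'})
# subsequent requests reuse the session's cookies automatically
r = session.get('https://example.com/protected-page')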
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python, so setting the header to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
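As an illustration only (assuming requests-html is installed; the selector here is just an example), the headless-browser route looks roughly like this:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
r.html.render()                                 # executes the page's JavaScript in headless Chromium
print(r.html.find('title', first=True).text)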
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
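A rough sketch of hitting that endpoint directly; the same user-agent filtering applies, and the endpoint may additionally require cookies or CSRF-style headers copied from the browser's request, so treat this as a starting point only:

import requests

ajax_url = ('https://rent.591.com.tw/home/search/rsList'
            '?is_new_list=1&type=1&kind=0&searchtype=1&region=1')
r = requests.get(ajax_url, headers={'User-Agent': 'Custom'})
print(r.status_code)
print(r.text[:200])     # inspect what the endpoint actually returned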
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, it is probably either trying to enforce terms of service that prohibit scraping, or it has an API it would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to the fact that the website address had recently changed and I had been given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)

Twisted Proxy with range splitting

I want to write a Twisted proxy that splits up very large GET requests into smaller fixed-size ranges and sends them on to another proxy (using the Range: bytes header). The other proxy doesn't allow large responses, and when the response is too large it returns a 502.
How can I implement a proxy in Twisted that, on a 502 error, tries splitting the request into smaller allowed chunks? The documentation is hard to follow. I know I need to extend ProxyRequest, but from there I'm a bit stuck.
It doesn't have to be a Twisted proxy, but Twisted seems easy to modify, and I managed to at least get it to forward the request unmodified to the proxy by just pointing the connectTCP call at my proxy (in ProxyRequest.parsed).
Extending ProxyRequest is probably not the easiest way to do this, actually; ProxyRequest pretty strongly assumes that one request = one response, whereas here you want to split up a single request into multiple requests.
Easier would be to simply write a Resource implementation that does what you want, which briefly would be:
in render_GET, construct a URL to make several outgoing requests using Agent
return NOT_DONE_YET
as each response comes in, call request.write on your original incoming request, and then issue a new request with the next Range header
finally when the last response comes in, call request.finish on your original request
You can simply construct a Site object with your Resource, and set isLeaf on your Resource to True so it doesn't have to implement any traversal logic and can just build the URL using request.prePathURL and request.postpath. (request.postpath is sadly undocumented; it's a list of the not-yet-traversed path segments in the request.)
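Here is a rough sketch of that shape, assuming the upstream proxy honours Range requests; the chunk size and upstream address are placeholders, and the 502-driven fallback from the question is left out in favour of always requesting fixed-size ranges:

from twisted.internet import reactor
from twisted.web.client import Agent, readBody
from twisted.web.http_headers import Headers
from twisted.web.resource import Resource
from twisted.web.server import NOT_DONE_YET, Site

CHUNK = 1024 * 1024                          # hypothetical range size
UPSTREAM = b'http://upstream.example:8080'   # hypothetical upstream proxy


class RangeSplittingProxy(Resource):
    isLeaf = True

    def __init__(self):
        Resource.__init__(self)
        self.agent = Agent(reactor)

    def render_GET(self, request):
        url = UPSTREAM + request.uri
        self._fetch(request, url, 0)
        return NOT_DONE_YET

    def _fetch(self, request, url, offset):
        range_value = ('bytes=%d-%d' % (offset, offset + CHUNK - 1)).encode()
        d = self.agent.request(b'GET', url, Headers({b'Range': [range_value]}))
        d.addCallback(self._on_response, request, url, offset)
        d.addErrback(self._on_error, request)

    def _on_response(self, response, request, url, offset):
        def write_chunk(body):
            request.write(body)
            # Heuristic: a full-sized 206 chunk suggests there is more to fetch.
            # A real implementation should parse the Content-Range header instead.
            if response.code == 206 and len(body) == CHUNK:
                self._fetch(request, url, offset + CHUNK)
            else:
                request.finish()
        return readBody(response).addCallback(write_chunk)

    def _on_error(self, failure, request):
        request.setResponseCode(502)
        request.finish()


reactor.listenTCP(8081, Site(RangeSplittingProxy()))
reactor.run()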

Pythonanywhere Web2Py redirect to HTTPS

I have created a Web2Py web project and would like users to access the pages over plain http:// instead of https://.
Each time I type http://domain.pythonanywhere.com it redirects me to https://domain.pythonanywhere.com.
It takes 0.5 sec. to do the SSL check and I would like to avoid that.
This was as default:
## if SSL/HTTPS is properly configured and you want all HTTP requests to
## be redirected to HTTPS, uncomment the line below:
# request.requires_https()
PythonAnywhere dev here: that looks like a bug on our side. We "pin" HTTPS for our own site, so that people always go to https://www.pythonanywhere.com/, but it looks like that might have leaked over to customer sites.
Just for clarity -- if someone goes to http://yourusername.pythonanywhere.com/ then we won't initially force it to go to the https site -- they'll get the http one. But if they then go to https://yourusername.pythonanywhere.com, then their browser will remember that they have visited the https domain, so all future requests will redirect there.
That's actually generally good practice (it works around a number of security problems) but we shouldn't be forcing it on people.
[UPDATE] the bug is now fixed, many thanks to boje for pointing us at it :-) One caveat -- if you've ever visited your site over HTTPS before we applied the fix, then you'll still be forced to HTTPS. You need to clear your browser history to see the new unpinned behaviour.
I had an issue getting http:// to redirect to https://, and I found a Google Groups post about it here. The following code may give you some ideas for your problem. In db.py add:
############ FORCED SSL #############
from gluon.settings import global_settings

if global_settings.cronjob:
    print 'Running as shell script.'
elif not request.is_https:
    redirect(URL(scheme='https', args=request.args, vars=request.vars))

session.secure()
#####################################

Does urllib2.urlopen() cache stuff?

They don't mention this in the Python documentation. Recently I was testing a website, simply refreshing it and using urllib2.urlopen() to extract certain content, and I noticed that sometimes when I update the site urllib2.urlopen() doesn't seem to get the newly added content. So I wonder, does it cache stuff somewhere?
So I wonder it does cache stuff somewhere, right?
It doesn't.
If you don't see new data, this could have many reasons. Most bigger web services use server-side caching for performance reasons, for example using caching proxies like Varnish and Squid or application-level caching.
If the problem is caused by server-side caching, there is usually no way to force the server to give you the latest data.
For caching proxies like Squid, things are different. Usually, Squid adds some additional headers to the HTTP response (see response.info().headers).
If you see a header field called X-Cache or X-Cache-Lookup, this means that you aren't connected to the remote server directly, but through a transparent proxy.
If you have something like X-Cache: HIT from proxy.domain.tld, this means that the response you got is cached. The opposite is X-Cache: MISS from proxy.domain.tld, which means that the response is fresh.
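To check quickly, you can just print those headers (a small Python 2 sketch; the URL is a placeholder):

import urllib2

response = urllib2.urlopen('http://example.com/')
# info() exposes the response headers; look for proxy cache markers
print response.info().getheader('X-Cache')
print response.info().getheader('X-Cache-Lookup')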
Very old question, but I had a similar problem which this solution did not resolve.
In my case I had to spoof the User-Agent like this:
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')
content = urllib2.build_opener().open(request)
Hope this helps anyone...
Your web server or an HTTP proxy may be caching content. You can try to disable caching by adding a Pragma: no-cache request header:
request = urllib2.Request(url)
request.add_header('Pragma', 'no-cache')
content = urllib2.build_opener().open(request)
If you make changes and test the behaviour from the browser and from urllib, it is easy to make a stupid mistake.
In the browser you are logged in, but with urllib.urlopen the site may keep redirecting your app to the same login page, so if you only look at the page size or the top of your common layout, you could think that your changes have had no effect.
I find it hard to believe that urllib2 does not do caching, because in my case, upon restart of the program the data is refreshed. If the program is not restarted, the data appears to be cached forever. Also retrieving the same data from Firefox never returns stale data.
