Twisted Proxy with range splitting - python

I want to write a twisted proxy that splits up very large GET request into smaller fixed size ranges and sends it on to another proxy (using the Range: bytes). The other proxy doesn't allow large responses and when the response is to large it returns a 502.
How can I implement a proxy in twisted that on a 502 error it tries splitting the request into smaller allowed chunks. The documentation is hard to follow. I know I need to extend ProxyRequest, but from there I'm a bit stuck.
It doesn't have to be a twisted proxy, but it seems to be easily modified and I managed to at least get it to forward the request unmodified to the proxy by just setting the connectTCP to my proxy (in ProxyRequest.parsed).

Extending ProxyRequest is probably not the easiest way to do this, actually; ProxyRequest pretty strongly assumes that one request = one response, whereas here you want to split up a single request into multiple requests.
Easier would be to simply write a Resource implementation that does what you want, which briefly would be:
in render_GET, construct a URL to make several outgoing requests using Agent
return NOT_DONE_YET
as each response comes in, call request.write on your original incoming requests, and then issue a new request with a Range header
finally when the last response comes in, call request.finish on your original request
You can simply construct a Site object with your Resource, and set isLeaf on your Resource to true so your Resource doesn't have to implement any traversal logic and can just build the URL using request.prePathURL and request.postpath. (request.postpath is sadly undocumented; it's a list of the not-yet-traversed path segments in the request).

Related

Unable to get complete source code of web page using Python [duplicate]

I would like to try send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I knew this is a common problem and tried different way but still failed.
but all of other website is ok.
any suggestion?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent, it looks like they are blacklisting Python, setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they probably are either trying to enforce terms of service that prohibit scraping, or because they have an API they rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to fact that the website address was recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)

scrapy: switch out failing proxies

I'm using this script to randomize proxies in scrapy. The problem is that once it's allocated a proxy to a request, it won't allocate another one because of this code:
def process_request(self, request, spider):
# Don't overwrite with a random one (server-side state for IP)
if 'proxy' in request.meta:
return
That means that if there is a bad proxy which is not connecting to anything, then the request will fail. I'm intending to modify it like this:
if request.meta.get('retry_times',0) < 5:
return
thereby letting it allocate a new proxy if the current one fails 5 times. I'm assuming that if I set RETRY_TIMES to, say 20, in settings.py, then the request won't fail until 4 different proxies have each made 5 attempts.
I'd like to know if that will cause any problems. As I understand it, the reason that the check is there in the first place is for stateful transactions, such as those relying on log-ins, or perhaps cookies. Is that correct?
I bumped with the same problem.
I improved the aivarsk/scrapy-proxies. My middleware inherited by basic RetryMiddleware and trying to use one proxy RETRY_TIMES. If proxy is unavailable, the middleware change it.
Yes, I think the idea of that script was to check if the user is already defining a proxy on the meta parameter, so it can control it from the spider.
Setting it to change proxy every 5 times is ok, but I think you'll have to re login to the page, as most pages know when you changed from where you are making the request (proxy).
The idea if rotating proxies is not as easy as just selecting one randomly, because you could still end up using the same proxy, and also defining the rules for when a site "banned" you is not as simple as only checking statuses. This are the services I know for that thing you want: Crawlera and Proxymesh.
If you want direct functionality on scrapy for rotating proxies, I recommend to use Crawlera as it is already fully integrated.

How to perform Tornado requests within a Tornado request

I am using the Tornado Web Server (version 4.1) with Python 2.7 to create a REST web application. One of my request handlers (web.RequestHandler) handles batch requests consisting of multiple HTTP requests combined into one HTTP request using the multipart/mixed content type. I currently have the batch request handler able to receive the POST request and parse the multipart/mixed content into individual requests that look like this:
GET /contacts/3 HTTP/1.1
Accept: application/json
My question is, what would be a good way of converting these inner batched requests into requests that Tornado can service from within my request handler? I would like to collect the responses within my batch request handler and, once these requests are all complete, return a single multipart/mixed response containing all the batched responses.
Using an HTTPClient to execute the batched requests feels like overkill. It seems like I should be able to build a request object and inject it into the web.Application for processing--I'm at a loss as to how to do this however. Thanks!
Tornado doesn't have any direct support for this. Going through an HTTP client is probably going to be the simplest solution. However, if you're really interested in avoiding that route, here's a sketch of a solution, which relies on the interfaces defined in the tornado.httputil module.
Define a class that implements the HTTPConnection interface by saving the arguments to write and write_headers into internal buffers.
The Application is an HTTPServerConnectionDelegate. Call its start_request method with an instance of your connection class as both arguments (the first argument doesn't really matter, but it should be unique and since we won't be reusing "connections" that object is fine).
start_request returns an HTTPMessageDelegate. Call its headers_received, data_received (for POST/PUT), and finish methods to make your request. Once you have called finish, the handler will run and make calls back into your connection object.

Can Django send multi-part responses for a single request?

I apologise if this is a daft question. I'm currently writing against a Django API (which I also maintain) and wish under certain circumstances to be able to generate multiple partial responses in the case where a single request yields a large number of objects, rather than sending the entire JSON structure as a single response.
Is there a technique to do this? It needs to follow a standard such that client systems using different request libraries would be able to make use of the functionality.
The issue is that the client system, at the point of asking, does not know the number of objects that will be present in the response.
If this is not possible, then I will have to chain requests on the client end - for example, getting the first 20 objects & if the response suggests there will be more, requesting the next 20 etc. This approach is an OK work-around, but any subsequent requests rely on the previous response. I'd rather ask once and have some kind of multi-part response.
As far as I know, No you can't send Multipart http response not yet atleast. Multipart response is only valid in http requests. Why? Because no browser as I know of completely supports this.
Firefox 3.5: Renders only the last part, others are ignored.
IE 8: Shows all the content as if it were text/plain, including the boundaries.
Chrome 3: Saves all the content in a single file, nothing is rendered.
Safari 4: Saves all the content in a single file, nothing is rendered.
Opera 10.10: Something weird. Starts rendering the first part as plain/text, and then clears everything. The loading progress bar hangs on 31%.
(Data credits Diego Jancic)

Cookies and HTTP with Python

I wish to "retrieve" the cookies sent by the client in my subclass of BaseHTTPRequestHandler.
Firstly I'm unsure of the exact sequence of sending of headers, in a typical HTTP request and response this is my understanding of the sequence of events:
Client sends request (method, path, HTTP version, host, and ALL headers).
The server responds with a response code, followed by a bunch of headers of its own.
The server then sends the body of the response.
When exactly is the client's POST data sent? Does any overlapping occur in this sequence as described above?
Second, when is it safe to assume that the "Cookie" header has been received by the server. Should all of the client headers have been received by the time self.send_response is called by the server? When in the HTTP communication is the appropriate time to peek at cookie headers in self.headers?
Thirdly, what is the canonical way to parse cookies in Python. I currently believe a Cookie.SimpleCookie should be instantiated, and then data from the cookie headers some how fed into it. Further clouding this problem, is the Cookie classes clunkiness when dealing with the HTTPRequestHandler interfaces. Why does the output from Cookie.output() not end with a line terminator to fit into self.wfile.write(cookie.output()), or instead drop the implicitly provided header name to fit nicely into self.send_header("Set-Cookie", cookie.output())
Finally, the cookie classes in the Cookie module, give the illusion that they're dictionaries of dictionaries. Assigning to different keys in the cookie, does not pack more data into the cookie, but rather generates more cookies... all apparently in the one class, and each generating its own Set-Cookie header. What is the best practise for packing containers of values into cookie(s)?
HTTP is a request/response protocol, without overlap; the body of a POST comes as part of the request (when the verb is POST, of course).
All headers also come as part of the request, including Cookie: if any (there might be no such header of course, e.g. when the browser is running with cookies disabled or whatever). So peek at the headers whenever you've received the request and are serving it.
I'm not sure what your "thirdly" problem is. No newline gets inserted if none is part of the cookie -- why ever should it be? Edit: see later.
On the fourth point, I think you may be confusing cookies with "morsels". There is no limit to the number of Set-Cookie headers in the HTTP response, so why's that a problem?
Edit: you can optionally pass to output up to three arguments: the set of morsel attributes you want in the output for each morsel (default None meaning all attributes), the header string you want to use in front of each morsel (default Set-Cookie:), the separator string you want between morsels (default \r\n). So it seems that your intended use of a cookie is single-morsel (otherwise you couldn't stick the string representation into a single header, which you appear most keen to do): in that case
thecookie.output(None, '')
will give you exactly the string you want. Just make multiple SimpleCookie instances with one morsel each (since one morsel is what fits into a single header!-).
Here's a quick way to get the cookies with no 3rd-party-libraries. While it only answers a section of the question, it may be answering the one which most "visitors" will be after.
import Cookie
def do_GET(self):
cookies = {}
cookies_string = self.headers.get('Cookie')
if cookies_string:
cookies = Cookie.SimpleCookie()
cookies.load(cookies_string)
if 'my-cookie' in cookies:
print(cookies['my-cookie'].value)

Categories

Resources