Simple way to detect browser or script - Python

Complexities aside, what is the simplest quick-and-dirty way to detect, in a request, whether that request was issued by a CLI program such as curl, or by a browser? Here is what I'm trying to figure out:
def view(request):
    if request.is_from_browser:
        return HTML_TEMPLATE
    else:
        return JSON

Request.is_ajax() checks whether the HTTP_X_REQUESTED_WITH header equals XMLHttpRequest. This is becoming an "industry standard" among web frameworks/libraries for separating Ajax calls from normal requests. But it depends on cooperation from the client side to actually set the header; there's no 100% foolproof way of detecting browser, client, Ajax, etc. without this cooperation.
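For what it's worth, a minimal sketch of that check done by hand (Django; note that request.is_ajax() was removed in Django 4.0, but the header test below is equivalent, and the template name is made up):

from django.http import JsonResponse
from django.shortcuts import render

def view(request):
    # Set by most JavaScript libraries (jQuery etc.), but only if the
    # client cooperates -- curl won't send it unless told to
    if request.META.get('HTTP_X_REQUESTED_WITH') == 'XMLHttpRequest':
        return JsonResponse({'detail': 'hello'})
    return render(request, 'index.html')  # hypothetical template

Note this separates Ajax calls from normal requests, which is not quite the browser-versus-curl split asked about: plain curl would land in the HTML branch too.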
Btw, why do you need to know what's calling?

Something in the HTTP request headers: I'd first try using the Accept header. With the Accept header, the client can specify what sort of content it wants, which puts the responsibility on the client.
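A minimal sketch of that idea in a Django view (the template name is hypothetical; browsers send Accept: text/html,... while curl defaults to Accept: */*):

from django.http import JsonResponse
from django.shortcuts import render

def view(request):
    accept = request.META.get('HTTP_ACCEPT', '')
    # Browsers advertise text/html; curl sends */* unless -H overrides it
    if 'text/html' in accept:
        return render(request, 'index.html')  # hypothetical template
    return JsonResponse({'detail': 'hello'})

A script can then opt in explicitly with something like curl -H 'Accept: application/json' on the same URL.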

Related

Unable to get complete source code of web page using Python [duplicate]

I would like to send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different ways, but still failed.
All other websites are OK.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to an http://httpbin.org endpoint, have it record the request, and then experiment.
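For example, httpbin's /headers endpoint echoes back exactly what was sent, which makes the defaults easy to inspect:

import requests

# httpbin echoes the request back, so you can see what requests sends by default
r = requests.get('https://httpbin.org/headers')
print(r.json())
# e.g. {'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate',
#       'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.x', ...}}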
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging in to the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplying credentials the same way the browser does); a minimal sketch follows this list.
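A minimal sketch of that cookie-capturing flow (the login URL and form field names here are made up for illustration):

import requests

session = requests.Session()  # persists cookies across requests
# Hypothetical login endpoint and form fields -- adapt to the real site
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret'})
# Subsequent requests carry the session cookie automatically
r = session.get('https://example.com/members-only')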
Everything else is fair game, but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case the site is filtering on the user agent; it looks like they are blacklisting Python, and setting the User-Agent to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
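A minimal sketch with requests-html (it downloads a headless Chromium on first use; the User-Agent override from above is assumed to still be needed):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
r.html.render()  # runs the page's JavaScript in headless Chromium
print(r.html.find('title', first=True).text)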
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
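A hedged sketch of hitting that endpoint directly; the same User-Agent workaround applies, but the endpoint may well demand extra headers or cookies, so treat this as a starting point rather than a working scraper:

import requests

url = ('https://rent.591.com.tw/home/search/rsList'
       '?is_new_list=1&type=1&kind=0&searchtype=1&region=1')
r = requests.get(url, headers={'User-Agent': 'Custom'})
print(r.status_code)
print(r.json())  # the listing data, if the server accepts the request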
Next, well-built sites will use security best practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve the form before a POST to the handler), and to handle cookies or otherwise extract the extra information the server expects to be passed from one request to another.
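A sketch of that GET-before-POST dance, assuming a hypothetical form whose token sits in a hidden input (the URL and field name are made up; parsing uses BeautifulSoup):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# 1. GET the form; the server sets a cookie and embeds a CSRF token
page = session.get('https://example.com/form')
token = BeautifulSoup(page.text, 'html.parser').find(
    'input', {'name': 'csrf_token'})['value']  # hypothetical field name
# 2. POST back with the token, reusing the same session cookies
session.post('https://example.com/form',
             data={'csrf_token': token, 'field': 'value'})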
Last but not least, if a site is blocking scripts from making requests, it is probably either trying to enforce terms of service that prohibit scraping, or it has an API it would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some web scraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of from a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used:
with open("filepath", 'r') as file:  # 'r', not 'w' -- we are reading the links
    links = file.read().splitlines()  # splitlines() drops trailing \n and \r\n
for link in links:
    response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I had been given the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)

How to send an HTTP response to a specific requestLine in Python?

Can I send an HTTP response to a specific requestLine outside the function in which I received the request?
For example: I receive the request, then pass it to some other functions, and I want to send the response to this request from there. Is that allowed?
If you're going to split the path your program takes, so that it's doing two different things at the same time, then you'll need control-flow constructs designed for this, like threads/processes/events/async.
You might want to look at Celery (http://www.celeryproject.org/) and Jobtastic (http://policystat.github.io/jobtastic/).
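With Celery, for instance, the request handler can hand the slow work off to a worker and respond immediately; a minimal sketch (the broker URL and task body are placeholders):

from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # placeholder broker

@app.task
def handle_request(payload):
    # Do the slow work here; store the result somewhere the client
    # can poll for it later (database, cache, ...)
    ...

# In the view that received the request: queue the job, answer right away
# handle_request.delay(payload)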

Can Django send multi-part responses for a single request?

I apologise if this is a daft question. I'm currently writing against a Django API (which I also maintain) and wish under certain circumstances to be able to generate multiple partial responses in the case where a single request yields a large number of objects, rather than sending the entire JSON structure as a single response.
Is there a technique to do this? It needs to follow a standard such that client systems using different request libraries would be able to make use of the functionality.
The issue is that the client system, at the point of asking, does not know the number of objects that will be present in the response.
If this is not possible, then I will have to chain requests on the client end - for example, getting the first 20 objects & if the response suggests there will be more, requesting the next 20 etc. This approach is an OK work-around, but any subsequent requests rely on the previous response. I'd rather ask once and have some kind of multi-part response.
As far as I know, no, you can't send a multipart HTTP response, not yet at least. Multipart content is only really viable in HTTP requests. Why? Because no browser I know of completely supports multipart responses:
Firefox 3.5: Renders only the last part, others are ignored.
IE 8: Shows all the content as if it were text/plain, including the boundaries.
Chrome 3: Saves all the content in a single file, nothing is rendered.
Safari 4: Saves all the content in a single file, nothing is rendered.
Opera 10.10: Something weird. Starts rendering the first part as text/plain, and then clears everything. The loading progress bar hangs at 31%.
(Data credit: Diego Jancic)
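Given that, the chained-request workaround described in the question is the practical route. A rough client-side sketch, assuming a hypothetical limit/offset-style API:

import requests

results, offset, limit = [], 0, 20
while True:
    r = requests.get('https://api.example.com/objects',  # hypothetical endpoint
                     params={'offset': offset, 'limit': limit})
    batch = r.json()
    results.extend(batch)
    if len(batch) < limit:  # a short page means we've reached the end
        break
    offset += limit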

Python-Scapy HTTP Traffic Manipulation

I need to intercept an HTTP response packet from the server and replace it with my own response, or at least modify that response, before it arrives at my browser.
I'm already able to sniff this response and print it; the problem is with manipulating/replacing it.
Is there a way to do so with the Scapy library?
Or do I have to connect my browser through a proxy to manipulate the response?
If you want to work from your ordinary browser, then you need a proxy between the browser and the server in order to manipulate the traffic. E.g. see https://portswigger.net/burp/, which is a proxy specifically created for penetration testing, with easy replacing of responses/requests (and which is scriptable, too).
If you want to script the whole session in scapy, then you can create requests and responses to your liking, but the response does not go to the browser. Also, you can record an ordinary web session (with tcpdump/wireshark/scapy) into a pcap, then use scapy to read the pcap, modify it, and send similar requests to the server.
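A minimal sketch of that pcap route ('session.pcap' is a placeholder capture; note that blindly replaying TCP segments won't maintain a valid session with the server, this only shows the read-modify-send mechanics):

from scapy.all import rdpcap, Raw, IP, TCP, send

packets = rdpcap('session.pcap')  # placeholder capture file
for pkt in packets:
    if pkt.haslayer(TCP) and pkt.haslayer(Raw) and \
            pkt[Raw].load.startswith(b'GET '):
        # Tweak the recorded request before replaying it
        pkt[Raw].load = pkt[Raw].load.replace(b'/old', b'/new')
        del pkt[IP].len, pkt[IP].chksum, pkt[TCP].chksum  # force recomputation
        send(pkt[IP])  # replay at layer 3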

Is there a way to partially return results from a Python web service?

I am new to Python. I am using Flask to create a web service which makes lots of API calls to LinkedIn. The problem is that assembling the final result set takes a lot of time, and the frontend stays idle while it waits. I was thinking of returning the partial results found up to that point and continuing the API calls on the server side. Is there any way to do this in Python? Thanks.
Flask has the ability to stream data back to the client. Sometimes this requires JavaScript modifications to do what you want, but it is possible to send content to a user in chunks using Flask and Jinja2. It requires some wrangling, but it's doable.
A view that uses a generator to break up content could look like this (though the linked-to SO answer is much more comprehensive):
from flask import Response

@app.route('/image')
def generate_large_image():
    def generate():
        # Yield empty keep-alive chunks until the real payload is ready
        while not processing_finished():
            yield ""
        yield get_image()
    return Response(generate(), mimetype='image/jpeg')
There are a few ways to do this. The simplest would be to return the initial response via Flask immediately, and then use JavaScript on the page you returned to make an additional request to another URL and load that when it comes back, maybe displaying a loading indicator in the meantime.
The additional URL's view would look like this:
import flask

@app.route("/linkedin-data")
def linkedin():
    # Make some call to the LinkedIn API which returns "data", probably as JSON
    data = fetch_linkedin_data()  # hypothetical helper for the actual API call
    return flask.jsonify(**data)
Fundamentally, no: you can't return a partial response, so you have to break your requests up into smaller units. You can stream data using websockets, but you would still be sending back an initial response, which would then create a websocket connection using JavaScript, which would then start streaming data back to the user.
