I apologise if this is a daft question. I'm currently writing against a Django API (which I also maintain) and wish under certain circumstances to be able to generate multiple partial responses in the case where a single request yields a large number of objects, rather than sending the entire JSON structure as a single response.
Is there a technique to do this? It needs to follow a standard such that client systems using different request libraries would be able to make use of the functionality.
The issue is that the client system, at the point of asking, does not know the number of objects that will be present in the response.
If this is not possible, then I will have to chain requests on the client end - for example, getting the first 20 objects & if the response suggests there will be more, requesting the next 20 etc. This approach is an OK work-around, but any subsequent requests rely on the previous response. I'd rather ask once and have some kind of multi-part response.
As far as I know, No you can't send Multipart http response not yet atleast. Multipart response is only valid in http requests. Why? Because no browser as I know of completely supports this.
Firefox 3.5: Renders only the last part, others are ignored.
IE 8: Shows all the content as if it were text/plain, including the boundaries.
Chrome 3: Saves all the content in a single file, nothing is rendered.
Safari 4: Saves all the content in a single file, nothing is rendered.
Opera 10.10: Something weird. Starts rendering the first part as plain/text, and then clears everything. The loading progress bar hangs on 31%.
(Data credits Diego Jancic)
Related
So I need to build an HTTP server that will contact a client and send him data like pictures or calculations and create a page with those things. I guess you understood that I do not really know what I'm doing... :(
I know python and the basic(+) of the client-server project but I don't understand that HTTP protocol and didn't understand anything from what I read on the internet...
Can anyone explain to me how to work with this protocol? What is the form of HTTP packets?
Here an example of 1 problem that I don't understand: I have been asked to get a packet (which I did) and understand what is the request there, then send back the name of the file the client wants and after it the file itself. I printed the packet and didn't understand where is the request or what the client wants...
Thank you very very much!
Can anyone explain to me how to work with this protocol? What is the form of HTTP packets?
The specification might be helpful.
Concerning the webz, you find a lot of specification on the RFCs.
More to HTTP below.
(Since you seem to be new to programming, I figured I might want to tell you the following:)
Usually one doesn't directly interact with HTTP(S) packets. Instead you use a framework, such as flask, django, aiohttp and many more. The choice of framework depends on the use-case. E.g.:
You need a database, authentication and any imaginable feature? Go with Django.
You just want to create a WebApplication without a bloated framework? Go with Flask.
You need the bare minimum or want to act as a client? Go with aiohttp.
More frameworks are listed here.
The advantage of using such frameworks is that they usually include useful things, that are battletested (i.e. usually no bugs), while you don't have to figure out pecularities of certain protocols.
You just import the framework and write awesomeness! :)
(Anyways, here is a little very oversimplified overview for completeness)
So, HTTP is an text protocol over TCP, which basically means that you send text over a simple tcp socket. When you receive your request you have to "parse" (i.e. comprehend its contents). Luckily for us the requests are standarized and follow the same scheme.
The smallest request would look like this:
GET / HTTP/1.0
Host: www.server.com
The first line starts with a verb (also called request method), in our example the verb is GET. The / denotes the path. Think of file paths on your HDD. The last part of the first line, namely HTTP/1.0, tells the receiver with which version of HTTP we are operating on. Currently the there is HTTP 1.0 and HTTP 1.1; however, I wouldn't bother with HTTP 1.1 yet and stick with HTTP 1.0, if you're implementing the requests your self.
Lastly the Host: www.server.com line tells us which server we want to talk to, since multiple instances of an HTTP server could be running under the same ip. This is used to revole the subdomain.
If you send this request to an HTTP Server, you're likely to receive an response like this:
HTTP/1.0 200 OK
Server: Apache/1.3.29 (Unix) PHP/4.3.4
Content-Length: 1337
Connection: close
Content-Type: text/html
<DATA>
This response contains the status in the first line HTTP/1.0 200 OK. The number and the 'OK' represent a status code, telling us that everything is fine. There are many status codes with their own meaning and usages.
The lines following the first are so-called Response-Headers. They provide additional useful information about the response. For instance, when we open a site like 'stackoverflow.com', the server transmits an HTML file to us for the browser to interpret. Before we can do that, we need to know the size of the HTML file.
Luckily the server tells us beforehand with Content-Length: 1337 line, that the file is 1337 bytes big. The file itself would be present where the <DATA> placeholder stands.
There are, yet again, many of these headers.
As you can see, there are many things to account for when working with HTTP, showing that it is not feasible, without a very good reason, to implement a HTTP client/server from scratch.
Instead it's preferred to use one of the frameworks (for python) listed above.
As a last note:
In the process of trying to explain the concepts as simple as possible I probably left-out or oversimplified some things. If you find any mistake, please let me know.
I would like to try send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I knew this is a common problem and tried different way but still failed.
but all of other website is ok.
any suggestion?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent, it looks like they are blacklisting Python, setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1®ion=1, take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they probably are either trying to enforce terms of service that prohibit scraping, or because they have an API they rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to fact that the website address was recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)
I am still fairly new to python and have some questions regarding using requests to sign in. I have read for hours but can't seem to get an answer to the following questions. If I choose a site such as www.amazon.com. I can sign in & determine the sign in link: https://www.amazon.com/gp/sign-in.html...
I can also find the sent form data, which includes items such as:
appActionToken:
appAction:SIGNIN
openid.pape.max_auth_age:ape:MA==
openid.return_to:
password: XXXX
email: XXXX
prevRID:
create:
metadata1: XXXX
my questions are as follows:
When finding form data, how do I know which items I must send back in a dictionary via post request. For the above, are email & password sufficient, and when browsing other sites, how do I know which ones are necessary?
The following code should work, but doesn't. What am I doing wrong?
The example includes a header category to determine the browser type. Another site, such as www.slashdot.org, does not need the header value to sign in. How do I know which sites require the header value and which ones don't?
Anyone who could provide input and help me sign in with requests would be doing me a great favor. I thank you very much.
import requests
session = requests.Session()
data = {'email':'xxxxx', 'password':'xxxxx'}
header={'User-Agent' : 'Mozilla/5.0'}
response = session.post('https://www.amazon.com/gp/sign-in.html', data,headers=header)
print response.content
When finding form data, how do I know which items I must send back in a dictionary via post request. For the above, are email & password sufficient, and when browsing other sites, how do I know which ones are necessary?
You generally need to either (a) read the documentation for the site you're using, if it's available, or (b) examine the HTML yourself (and possibly trace the http traffic) to see what parameters are necessary.
The following code should work, but doesn't. What am I doing wrong?
You didn't provide any details about how your code is not working.
The example includes a header category to determine the browser type. Another site, such as www.slashdot.org, does not need the header value to sign in. How do I know which sites require the header value and which ones don't?
The answer here is really the same as for the first question. Either you are using an API for which documentation exists that answers this question, or you're trying to automate a site that was designed primarily for human consumption via a web browser, which means you're going to have figure out through investigation, trial, and error exactly what parameters you need to provide to make the remote server happy.
Complexities aside, what is the simplest quirty-and-dirty way to detect in a request whether that request was issues by a CLI program, such as curl, or whether it was by a browser? Here is what I'm trying to figure out:
def view(request):
if request.is_from_browser:
return HTML_TEMPLATE
else:
return JSON
Request.is_ajax() checks if the HTTP_X_REQUESTED_WITH header equals XMLHttpRequest. This is becoming an "industry standard" among web frameworks/libraries to separate Ajax calls from normal requests. But it depends on cooperation from the client side to actually set the header. There's no 100 % foolproof way of detecting browser, client, Ajax etc without this cooperation.
Btw, why do you need to know what's calling?
Somthing in the HTTP request headers, I'd first try using the Accept header. with the accept header the client can specify what sort of content it wants.this puts the responsibily on the client.
I'm working on a REST API and I really like the idea of using content negotiation to determine what representations to send. My application is based on the Flask framework and so naturally I am working with the mimerender package. I have resource variant selection working for HTML, JSON, and XML. But then I tried it with a bogus mimetype, like application/foobar. I expected to see a 406 error code, but instead I got a 200 response code and the XML representation.
Looking at the source code, it appears that mimerender defaults to whatever mimetype is first in its list of mimetypes, which is XML at the moment.
My question is 2 parts:
The guy who wrote mimerender (I hope he reads this question) knows what he is doing, and he must have deliberately chosen to provide a default representation rather than a 406 error code for some good reason. What is the reason behind sending some (kinda random) representation rather than telling a client that you just don't have what they're asking for?
Assuming that I stubbornly don't want to send a default representation and that I prefer to send a 406 error instead, how can I do this within the confines of Flask and mimerender? One possibility I can think of is to register a fake mimetype, set that as the default, and call abort(406) in its handler. But that seems hacky.
I think I had not given this case enough thought. According to the spec, it's still ok to return an unacceptable response:
HTTP/1.1 servers are allowed to return responses which are
not acceptable according to the accept headers sent in the
request. In some cases, this may even be preferable to sending a
406 response. User agents are encouraged to inspect the headers of
an incoming response to determine if it is acceptable.
I have just added an optional boolean flag which makes mimerender fail with 406 instead. Hopefully that will cover your use case as well.