I currently need to make a single HTTP request to a bunch of servers that I have in a list. However, these HTTP requests contain a 'Connection' header, which I need to remove. How would I do this?
I had the same issue with Accept-Encoding, but I was able to comment out the section that automatically applies that header in httpclient.py (I'm using the requests lib for this). Is there any way around this aside from using sockets and sending raw HTTP requests? Is there maybe another snippet that can be commented out to prevent the header from being automatically added?
I realize that removing the header is a bad idea in the real world, but there's justification for this and it 100% needs to be removed.
I've tried assigning it as an empty string and as None; both appear to fail. I'm wondering if this is something I can't change.
It seems you can actually patch this out by commenting out line 1180 in urllib2.py:
headers["Connection"] = "close"
I would like to try sending requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but still failed.
All other websites work fine, though.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
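For example, a quick way to see exactly what requests sent is to ask httpbin to echo the request headers back:

import requests

# httpbin.org echoes the request it received, which makes it easy to
# compare what requests sends against what a working client sends.
response = requests.get('https://httpbin.org/headers')
print(response.json()['headers'])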
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that the server can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent. It looks like they are blacklisting Python; setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources.

If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
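If you do end up going the requests-html route, the flow looks roughly like this (a sketch only; the first render() call downloads a Chromium build, and the selector here is just an example):

from requests_html import HTMLSession  # third-party: pip install requests-html

session = HTMLSession()
r = session.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
r.html.render()  # executes the page's JavaScript in headless Chromium
print(r.html.find('title', first=True).text)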
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
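A typical flow looks something like this (purely illustrative; the URL, the form field names and the token-extraction regex are made up and will differ per site):

import re
import requests

session = requests.Session()  # keeps cookies across requests

# Fetch the form first so the server sets its cookies and embeds a CSRF token,
# then echo that token back in the POST.
form_page = session.get('https://example.com/login')
match = re.search(r'name="csrf_token" value="([^"]+)"', form_page.text)
token = match.group(1) if match else ''

session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret', 'csrf_token': token})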
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they'd rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
import requests

with open("filepath") as file:        # open for reading; the original 'w' mode would truncate the file
    links = file.read().splitlines()  # splitlines() drops the trailing \n on each line

for link in links:
    response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)
I apologise if this is a daft question. I'm currently writing against a Django API (which I also maintain) and wish under certain circumstances to be able to generate multiple partial responses in the case where a single request yields a large number of objects, rather than sending the entire JSON structure as a single response.
Is there a technique to do this? It needs to follow a standard such that client systems using different request libraries would be able to make use of the functionality.
The issue is that the client system, at the point of asking, does not know the number of objects that will be present in the response.
If this is not possible, then I will have to chain requests on the client end - for example, getting the first 20 objects & if the response suggests there will be more, requesting the next 20 etc. This approach is an OK work-around, but any subsequent requests rely on the previous response. I'd rather ask once and have some kind of multi-part response.
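For reference, the chained-request workaround would look something like this on the client side (a sketch only; the endpoint, the limit parameter and the 'next'/'results' keys are assumptions about the API, not part of it):

import requests

objects = []
url = 'https://example.com/api/items/?limit=20'  # hypothetical paginated endpoint
while url:
    page = requests.get(url).json()
    objects.extend(page['results'])  # assumed key holding this page's objects
    url = page.get('next')           # assumed link to the next page, None at the end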
As far as I know, no, you can't send a multipart HTTP response, not yet at least. Multipart is really only valid in HTTP requests. Why? Because no browser that I know of completely supports it:
Firefox 3.5: Renders only the last part, others are ignored.
IE 8: Shows all the content as if it were text/plain, including the boundaries.
Chrome 3: Saves all the content in a single file, nothing is rendered.
Safari 4: Saves all the content in a single file, nothing is rendered.
Opera 10.10: Something weird. Starts rendering the first part as text/plain, and then clears everything. The loading progress bar hangs at 31%.
(Data credits Diego Jancic)
I'm trying to code my own Python 3 HTTP library to learn more about sockets and the HTTP protocol. My question is: if I do a recv(bytesToRead) using my socket, how can I get only the header, and then, with the Content-Length information, continue receiving the page content? Isn't that the purpose of the Content-Length header?
Thanks in advance
In the past, to accomplish this, I would read a portion of the socket data into memory, and then read from that buffer until a "\r\n\r\n" sequence is encountered (you could use a state machine to do this, or simply use the string.find() function). Once you reach that sequence you know all of the headers have been read, and you can parse the headers and then read the full Content-Length worth of body. Be prepared to handle a response that does not include a Content-Length header, since not all responses contain it.
If you run out of buffer before seeing that sequence, simply read more data from the socket into your buffer and continue processing.
I can post a C# example if you would like to look at it.
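In the meantime, here is roughly what that approach looks like in Python (a minimal sketch: no chunked transfer-encoding handling, no error handling, and it assumes the server actually sends Content-Length):

import socket

def fetch(host, path='/', port=80):
    sock = socket.create_connection((host, port))
    request = 'GET {0} HTTP/1.1\r\nHost: {1}\r\nConnection: close\r\n\r\n'.format(path, host)
    sock.sendall(request.encode('ascii'))

    # Keep reading until the blank line that terminates the headers appears.
    buffer = b''
    while b'\r\n\r\n' not in buffer:
        chunk = sock.recv(4096)
        if not chunk:
            break
        buffer += chunk

    header_block, _, body = buffer.partition(b'\r\n\r\n')
    headers = {}
    for line in header_block.decode('iso-8859-1').split('\r\n')[1:]:  # [0] is the status line
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()

    # Read the rest of the body using the advertised Content-Length.
    length = int(headers.get('content-length', 0))
    while len(body) < length:
        chunk = sock.recv(4096)
        if not chunk:
            break
        body += chunk

    sock.close()
    return headers, body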
I'm still struggling to stream a file to the HTTP response in Pylons. In addition to the original problem, I'm finding that I cannot return the Content-Length header, which means that for large files the client cannot estimate how long the download will take. I've tried
response.content_length = 12345
and I've tried
response.headers['Content-Length'] = 12345
In both cases the HTTP response (viewed in Fiddler) simply does not contain the Content-Length header. How do I get Pylons to return this header?
(Oh, and if you have any ideas on making it stream the file please reply to the original question - I'm all out of ideas there.)
Edit: while not a generic solution, for serving static files FileApp allows sending the Content-Length header. For dynamic content it looks like Alex Martelli's answer is the only option.
There's a bit of middleware code here that ensures all responses get a content length header if they're missing it. You could tweak it so that you set some other header in your response (say 'X-The-Content-Length') and the middleware uses that to make the content length if the latter's missing. I view the whole thing as a workaround for what I consider a pylons bug (its cavalier attitude to content length!) but apparently the pylons authors disagree with me on that score, so it's nice to at least have workarounds for it!-)
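Something along these lines, perhaps (a rough WSGI sketch of that tweak; the X-The-Content-Length name is just the placeholder used above, and this is untested against Pylons itself):

def content_length_middleware(app):
    # Promote an 'X-The-Content-Length' hint set by the application into a
    # real Content-Length header whenever one is missing.
    def wrapper(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            names = [name.lower() for name, _ in headers]
            if 'content-length' not in names and 'x-the-content-length' in names:
                hint = [v for k, v in headers if k.lower() == 'x-the-content-length'][0]
                headers = [(k, v) for k, v in headers
                           if k.lower() != 'x-the-content-length']
                headers.append(('Content-Length', hint))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start_response)
    return wrapper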
Try:
response.headerlist.append((str("Content-Length"), str(" 123456")))
I wish to "retrieve" the cookies sent by the client in my subclass of BaseHTTPRequestHandler.
Firstly, I'm unsure of the exact sequence of sending of headers. In a typical HTTP request and response, this is my understanding of the sequence of events:
Client sends request (method, path, HTTP version, host, and ALL headers).
The server responds with a response code, followed by a bunch of headers of its own.
The server then sends the body of the response.
When exactly is the client's POST data sent? Does any overlapping occur in this sequence as described above?
Secondly, when is it safe to assume that the "Cookie" header has been received by the server? Should all of the client headers have been received by the time self.send_response is called by the server? When in the HTTP communication is the appropriate time to peek at cookie headers in self.headers?
Thirdly, what is the canonical way to parse cookies in Python? I currently believe a Cookie.SimpleCookie should be instantiated, and then data from the cookie headers somehow fed into it. Further clouding this problem is the Cookie class's clunkiness when dealing with the HTTPRequestHandler interfaces. Why does the output from Cookie.output() not end with a line terminator to fit into self.wfile.write(cookie.output()), or, instead, drop the implicitly provided header name to fit nicely into self.send_header("Set-Cookie", cookie.output())?
Finally, the cookie classes in the Cookie module give the illusion that they're dictionaries of dictionaries. Assigning to different keys in the cookie does not pack more data into the cookie, but rather generates more cookies... all apparently in the one class, and each generating its own Set-Cookie header. What is the best practice for packing containers of values into cookie(s)?
HTTP is a request/response protocol, without overlap; the body of a POST comes as part of the request (when the verb is POST, of course).
All headers also come as part of the request, including Cookie: if any (there might be no such header of course, e.g. when the browser is running with cookies disabled or whatever). So peek at the headers whenever you've received the request and are serving it.
I'm not sure what your "thirdly" problem is. No newline gets inserted if none is part of the cookie -- why ever should it be? Edit: see later.
On the fourth point, I think you may be confusing cookies with "morsels". There is no limit to the number of Set-Cookie headers in the HTTP response, so why's that a problem?
Edit: you can optionally pass to output up to three arguments: the set of morsel attributes you want in the output for each morsel (default None meaning all attributes), the header string you want to use in front of each morsel (default Set-Cookie:), the separator string you want between morsels (default \r\n). So it seems that your intended use of a cookie is single-morsel (otherwise you couldn't stick the string representation into a single header, which you appear most keen to do): in that case
thecookie.output(None, '')
will give you exactly the string you want. Just make multiple SimpleCookie instances with one morsel each (since one morsel is what fits into a single header!-).
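As a small sketch of that single-morsel pattern inside a BaseHTTPRequestHandler (Python 2 module names, matching the question; the cookie name and value are made up):

import Cookie  # http.cookies on Python 3
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        thecookie = Cookie.SimpleCookie()
        thecookie['session'] = 'abc123'  # one morsel per SimpleCookie instance
        self.send_response(200)
        # output(None, '') omits the "Set-Cookie:" prefix; lstrip() removes the
        # space left where the prefix used to be.
        self.send_header('Set-Cookie', thecookie.output(None, '').lstrip())
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write('cookie sent\n')

if __name__ == '__main__':
    HTTPServer(('localhost', 8000), Handler).serve_forever()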
Here's a quick way to get the cookies with no 3rd-party-libraries. While it only answers a section of the question, it may be answering the one which most "visitors" will be after.
import Cookie
def do_GET(self):
    cookies = {}
    cookies_string = self.headers.get('Cookie')
    if cookies_string:
        cookies = Cookie.SimpleCookie()
        cookies.load(cookies_string)
    if 'my-cookie' in cookies:
        print(cookies['my-cookie'].value)