How to send headers in order using requests? - python

The documentation says headers can be sent in a specific order:
http://docs.python-requests.org/en/master/user/advanced/#header-ordering
But for some reason requests never sends the headers in the order I provide.
Example code:
headers01 = OrderedDict([
    ("Connection", "close"),
    ("Upgrade-Insecure-Requests", "1"),
    ("User-Agent", "SomeAgent"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"),
    ("Accept-Encoding", "gzip, deflate"),
    ("Accept-Language", "Some Language"),
])
Result:
Connection: close
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; InfoPath.2)
Accept-Language: en-US,en;q=0.5
Upgrade-Insecure-Requests: 1
I am already sending the request through a Session, and it also doesn't work when the request is not sent through a Session.

If you read the documentation page you linked, it does point out this limitation of the default headers, and the workaround...
Running this code:
import requests
from collections import OrderedDict
headers = OrderedDict([
    ("Connection", "close"),
    ("Upgrade-Insecure-Requests", "1"),
    ("User-Agent", "SomeAgent"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"),
    ("Accept-Encoding", "gzip, deflate"),
    ("Accept-Language", "Some Language"),
])
s = requests.Session()
s.headers = headers
r = s.get("http://localhost:6000/foo")
Sends:
GET /foo HTTP/1.1\r\nHost: localhost:6000\r\nConnection: close\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: SomeAgent\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: Some Language\r\n\r\n
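Note that this only works because s.headers is replaced wholesale. Per the documentation, an OrderedDict passed to the headers= keyword of a request is merged into the session's default headers, and the ordering of the defaults is preferred. A minimal contrast, using the same headers dict as above:

# Merged into the session's default headers; their ordering wins:
r = requests.get("http://localhost:6000/foo", headers=headers)

# Replaces the defaults entirely; your ordering is kept:
s = requests.Session()
s.headers = headers
r = s.get("http://localhost:6000/foo")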

You are in fact wrong: header order does not matter, at least not according to the standard (https://www.rfc-editor.org/rfc/rfc2616).
The point you are trying to make (i.e. why it does matter) is that browsers can (somewhat unreliably) be identified by fingerprinting based on the header order they happen to use. This is fine, but it is by no means a reason for a Python library to implement specific ordering.
That you're disappointed you won't be able to use this library to impersonate some browser, or to get accurately fingerprinted by this kind of software, is too bad, but it hardly justifies the tone of the question.
The best suggestion here is to find an alternative HTTP library that does allow specific header ordering and guarantees it will maintain the order you provide.
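If you can't find one, a fallback is to write the request by hand over a plain socket, which gives byte-level control over header order. A rough sketch, reusing the host, port and header values from the question:

import socket

host, port = "localhost", 6000
lines = [
    "GET /foo HTTP/1.1",
    "Host: %s:%d" % (host, port),
    "Connection: close",
    "Upgrade-Insecure-Requests: 1",
    "User-Agent: SomeAgent",
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding: gzip, deflate",
    "Accept-Language: Some Language",
]
raw = ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

sock = socket.create_connection((host, port))
sock.sendall(raw)               # headers go out exactly as listed
response = sock.recv(65536)     # naive single read, enough for a quick check
sock.close()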

Related

Python socket - downloading files only works in Chrome

So I wrote code where a client uploads a file to the server folder and then has an option to download it back. It works perfectly fine in Chrome: I click on the item I want to download and it downloads it.
def send_image(request, cs):
    request = request.split('=')
    try:
        name = request[1]
    except IndexError:
        name = request[0]
    print('using send_image!')
    print('Na ' + name)
    path = 'C:\\Users\\x\\Desktop\\webroot\\uploads' + '\\file-name=' + name
    print(path)
    with open(path, 'rb') as re:
        print('exist!')
        read = re.read()
        cs.send(read)
The code above reads the file that you choose and sends the data back to the client as bytes.
In Chrome it downloads the file, as I showed already, but in Internet Explorer, for example, it just prints the data to the client and doesn't download it. The real question is: why does Chrome download it rather than printing it the way Internet Explorer does, and how can I fix it? (For your info: all the files that I download have the name file-name before them; that's why I put it in the path.)
UPDATE: here is the HTTP request:
POST /upload?file-name=Screenshot_2.png HTTP/1.1
Host: 127.0.0.1
Connection: keep-alive
Content-Length: 3534
Accept: */*
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
Content-Type: application/octet-stream
Origin: http://127.0.0.1
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Referer: http://127.0.0.1/
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en;q=0.9,en-US;q=0.8,he;q=0.7
It looks like you don't send an HTTP/1 response but an HTTP/0.9 response. (Note that I'm talking about the response sent from the server, not the request sent from the client.) An HTTP/1 response consists of an HTTP header and an HTTP body, similar to how an HTTP request is constructed. An HTTP/0.9 response instead consists only of the actual body, i.e. no header and thus no meta information in the header telling the browser what to do with the body.
HTTP/0.9 has been obsolete for 25 years, but some browsers still support it. When a browser gets an HTTP/0.9 response it could do anything with it, since there is no defined meaning from an HTTP header. Browsers might try to interpret it as HTML, or as plain text, offer it for download, refuse it entirely... whatever.
The way to fix the problem is to send an actual HTTP response header before sending the body, i.e. something like this:
cs.send(b"HTTP/1.0 200 OK\r\nContent-Type: application/octet-stream\r\n\r\n")  # bytes: socket.send() needs bytes in Python 3
with open(path, 'rb') as re:
    read = re.read()
    cs.send(read)
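Putting it together, a sketch of send_image with a proper response header. The Content-Length and Content-Disposition headers here are additions beyond the snippet above; Content-Disposition: attachment is what explicitly asks a browser to save the body as a file rather than display it:

def send_image(request, cs):
    # pull the file name out of e.g. "...file-name=Screenshot_2.png"
    parts = request.split('=')
    name = parts[1] if len(parts) > 1 else parts[0]
    path = 'C:\\Users\\x\\Desktop\\webroot\\uploads\\file-name=' + name

    with open(path, 'rb') as f:
        body = f.read()

    header = ("HTTP/1.0 200 OK\r\n"
              "Content-Type: application/octet-stream\r\n"
              "Content-Length: %d\r\n"
              "Content-Disposition: attachment; filename=\"%s\"\r\n"
              "\r\n" % (len(body), name))
    cs.sendall(header.encode('ascii') + body)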
In any case: HTTP is way more complex than you might think. There are established libraries that deal with this complexity. If you insist on not using any library, please study the standard in order to avoid such problems.

Returning 403 Forbidden from simple get but loads okay in browser

I'm trying to get some data from a page, but it returns the error [403 Forbidden].
I thought it was the user agent, but I tried several user agents and it still returns the error.
I also tried the fake-useragent library, but I did not succeed.
import requests
from fake_useragent import UserAgent

with requests.Session() as c:
    url = '...'
    #headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36'}
    ua = UserAgent()
    header = {'User-Agent': str(ua.chrome)}
    page = c.get(url, headers=header)
    print page.content
When I access the page manually, everything works.
I'm using Python 2.7.14 and the requests library. Any idea?
The site could be using anything in the request to trigger the rejection.
So, copy all the headers from the request that your browser makes. Then delete them one by one [1] to find out which are essential.
As per "Python requests. 403 Forbidden", to add custom headers to the request, do:
result = requests.get(url, headers={'header':'value', <etc>})
[1] A faster way would be to delete half of them each time instead, but that's more complicated since there are probably multiple essential headers.
These are all the headers I can see included by the browser for a generic GET request:
Host: <URL>
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Try including those in your request incrementally (one by one) in order to identify which one(s) are required for a successful request.
Also, take a look at the Cookies and/or Security tabs available in your browser console / developer tools, under the Network option.
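A rough sketch of that elimination process with requests, using the browser's header set from above (Host is omitted because requests derives it from the URL; the URL itself is a placeholder):

import requests

browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

url = '...'  # the page that returns 403
for name in browser_headers:
    # drop one header at a time and see whether the request still succeeds
    trimmed = dict(browser_headers)
    del trimmed[name]
    status = requests.get(url, headers=trimmed).status_code
    print("%s: %s (%d)" % (name, "essential" if status == 403 else "not essential", status))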

Apparently fine HTTP request treated as malformed when sent over a socket

I'm working with socket operations and have coded a basic interception proxy in Python. It works fine, but some hosts return 400 Bad Request responses.
These requests do not look malformed, though. Here's one:
GET http://www.baltour.it/ HTTP/1.1
Host: www.baltour.it
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Same request, raw:
GET http://www.baltour.it/ HTTP/1.1\r\nHost: www.baltour.it\r\nUser-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Language: en-US,en;q=0.5\r\nAccept-Encoding: gzip, deflate\r\nConnection: keep-alive\r\n\r\n
The code I use to send the request is the most basic socket operation (though I don't think the problem lies there; it works fine with most hosts):
socket_client.send(request_raw)
while socket_client.recv is used to get the response (no problems here either: the response is well-formed, though its status is 400).
Any ideas?
When not talking to a proxy, you are not supposed to put the http://hostname part in the request line; see section 5.1.2 of the HTTP 1.1 RFC 2616 spec:
The most common form of Request-URI is that used to identify a resource on an origin server or gateway. In this case the absolute path of the URI MUST be transmitted (see section 3.2.1, abs_path) as the Request-URI, and the network location of the URI (authority) MUST be transmitted in a Host header field.
(emphasis mine); abs_path is the absolute path part of the request URI, not the full absolute URI itself.
E.g. the server expects you to send:
GET / HTTP/1.1
Host: www.baltour.it
A receiving server should be tolerant of such incorrect behaviour, however, so the server seems to violate the RFC here as well. Further on in the same section it reads:
To allow for transition to absoluteURIs in all requests in future versions of HTTP, all HTTP/1.1 servers MUST accept the absoluteURI form in requests, even though HTTP/1.1 clients will only generate them in requests to proxies.
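Since an interception proxy receives requests in the absolute-URI form (that is how browsers talk to proxies), a practical fix is to rewrite the request line into origin-form before forwarding it upstream. A naive sketch of that rewrite, assuming the raw request is a well-formed str:

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2

def to_origin_form(request_raw):
    request_line, sep, rest = request_raw.partition('\r\n')
    method, uri, version = request_line.split(' ', 2)
    if uri.startswith('http://') or uri.startswith('https://'):
        parts = urlsplit(uri)
        path = parts.path or '/'   # abs_path; '/' when the URI has no path
        if parts.query:
            path += '?' + parts.query
        request_line = ' '.join((method, path, version))
    return request_line + sep + rest

socket_client.send(to_origin_form(request_raw))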

Running a Stack Overflow query from Python

I'm working on improving my answer to a question on Meta Stack Overflow. I want to run a search on some Stack Exchange site and detect whether I got any results. For example, I might run this query. When I run the query through my browser, I don't see the string "Your search returned no matches" anywhere in the html I get. But when I run this Python code:
"Your search returned no matches" in urllib2.urlopen("https://math.stackexchange.com/search?q=user%3Ame+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22+").read()
I get True, and in fact the string contains a page that is clearly different from the one I get in my browser. How can I run the search in a way that gets me the same result I get when running the query in the normal, human way (from a browser)?
UPDATE: here's the same thing done with requests, as suggested by @ThiefMaster♦. Unfortunately it gets the same result.
"Your search returned no matches" in requests.get("https://math.stackexchange.com/search?q=user%3Ame+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22").text
I used Firebug to view the headers of the GET request that runs when I run the search from my browser. Here they are:
GET /search?q=user%3A128043+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22 HTTP/1.1
Host: math.stackexchange.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: https://math.stackexchange.com/search?q=user%3A128043+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22
Cookie: __qca=P0-1687127815-1387065859894; __utma=27693923.779260871.1387065860.1393095534.1393101885.19; __utmz=27693923.1393095534.18.10.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/users/2829764/kuzzooroo; _ga=GA1.2.779260871.1387065860; mathuser=t=WM42SFDA5Uqr&s=OsFGcrXrl06g; sgt=id=bedc99bd-5dc9-42c7-85db-73cc80c4cc15; __utmc=27693923
Connection: keep-alive
Running requests.get with various pieces of this header didn't work for me, though I didn't try everything, and there are lots of possibilities.
Some sites produce different results depending on which client connects to them. I do not know whether this is the case with Stack Overflow, but I have noticed it with wikis.
Here is what I do to pretend I am an Opera browser:
import urllib

def openAsOpera(url):
    u = urllib.URLopener()
    u.addheaders = []
    u.addheader('User-Agent', 'Opera/9.80 (Windows NT 6.1; WOW64; U; de) Presto/2.10.289 Version/12.01')
    u.addheader('Accept-Language', 'de-DE,de;q=0.9,en;q=0.8')
    u.addheader('Accept', 'text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/webp, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1')
    f = u.open(url)
    content = f.read()
    f.close()
    return content
Surely you can adapt this to pretend the client is Firefox.
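For instance, here is a rough adaptation to the Firefox headers captured above, using requests rather than urllib. Note the browser request also carried a Cookie header with a logged-in session, which a fresh script won't have; a user%3Ame query plausibly depends on it:

import requests

firefox_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

url = ('https://math.stackexchange.com/search'
       '?q=user%3Ame+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22')
page = requests.get(url, headers=firefox_headers).text
print("Your search returned no matches" in page)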

Python Mechanize Prevent Connection:Close

I'm trying to use mechanize to get information from a web page. It's basically succeeding in getting the first bit of information, but the web page includes a button for "Next" to get more information. I can't figure out how to programmatically get the additional information.
By using Live HTTP Headers, I can see the http request that is generated when I click the next button within a browser. It seems as if I can issue the same request using mechanize, but in the latter case, instead of getting the next page, I am redirected to the home page of the website.
Obviously, mechanize is doing something different than my browser is, but I can't figure out what. In comparing the headers, I did find one difference, which was the browser used
Connection: keep-alive
while mechanize used
Connection: close
I don't know if that's the culprit, but when I tried to add the header ('Connection','keep-alive'), it didn't change anything.
[UPDATE]
When I click the button for "page 2" within Firefox, the generated http is (according to Live HTTP Headers):
GET /statistics/movies/ww_load/the-fast-and-the-furious-6-2012?authenticity_token=ItU38334Qxh%2FRUW%2BhKoWk2qsPLwYKDfiNRoSuifo4ns%3D&facebook_fans_page=2&tbl=facebook_fans&authenticity_token=ItU38334Qxh%2FRUW%2BhKoWk2qsPLwYKDfiNRoSuifo4ns%3D HTTP/1.1
Host: www.boxoffice.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:18.0) Gecko/20100101 Firefox/18.0
Accept: text/javascript, text/html, application/xml, text/xml, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
X-Requested-With: XMLHttpRequest
X-Prototype-Version: 1.6.0.3
Referer: http://www.boxoffice.com/statistics/movies/the-fast-and-the-furious-6-2012
Cookie: __utma=179025207.1680379428.1359475480.1360001752.1360005948.13; __utmz=179025207.1359475480.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __qca=P0-668235205-1359475480409; zip=13421; country_code=US; _boxoffice_session=2202c6a47fc5eb92cd0ba57ef6fbd2c8; __utmc=179025207; user_credentials=d3adbc6ecf16c038fcbff11779ad16f528db8ebd470befeba69c38b8a107c38e9003c7977e32c28bfe3955909ddbf4034b9cc396dac4615a719eb47f49cc9eac%3A%3A15212; __utmb=179025207.2.10.1360005948
Connection: keep-alive
When I try to request the same url within mechanize, it looks like this:
GET /statistics/movies/ww_load/the-fast-and-the-furious-6-2012?facebook_fans_page=2&tbl=facebook_fans&authenticity_token=ZYcZzBHD3JPlupj%2F%2FYf4dQ42Kx9ZBW1gDCBuJ0xX8X4%3D HTTP/1.1
Accept-Encoding: identity
Host: www.boxoffice.com
Accept: text/javascript, text/html, application/xml, text/xml, */*
Keep-Alive: 115
Connection: close
Cookie: _boxoffice_session=ced53a0ca10caa9757fd56cd89f9983e; country_code=US; zip=13421; user_credentials=d3adbc6ecf16c038fcbff11779ad16f528db8ebd470befeba69c38b8a107c38e9003c7977e32c28bfe3955909ddbf4034b9cc396dac4615a719eb47f49cc9eac%3A%3A15212
Referer: http://www.boxoffice.com/statistics/movies/the-fast-and-the-furious-6-2012
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1
The server was checking X-Requested-With and/or X-Prototype-Version, so adding those two headers to the mechanize request fixed it.
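A minimal sketch of that fix, assuming a mechanize.Browser object (mechanize sends everything in its addheaders list with each request):

import mechanize

br = mechanize.Browser()
br.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:18.0) Gecko/20100101 Firefox/18.0'),
    ('X-Requested-With', 'XMLHttpRequest'),   # the headers the server was checking
    ('X-Prototype-Version', '1.6.0.3'),
]
response = br.open('http://www.boxoffice.com/statistics/movies/the-fast-and-the-furious-6-2012')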
Maybe a little late with an answer, but I fixed this by adding a line in _urllib2_forked.py.
Line 1098 contains:
headers["Connection"] = "Close"
Change this to:
if "Connection" not in headers:
    headers["Connection"] = "Close"
and make sure you set the header in your script, and it will work.
