I'm working on improving my answer to a question on Meta Stack Overflow. I want to run a search on some Stack Exchange site and detect whether I got any results. For example, I might run this query. When I run the query through my browser, I don't see the string "Your search returned no matches" anywhere in the HTML I get. But when I run this Python code:
"Your search returned no matches" in urllib2.urlopen("https://math.stackexchange.com/search?q=user%3Ame+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22+").read()
I get True, and in fact the string contains a page that is clearly different from the one I get in my browser. How can I run the search in a way that gets me the same result I get when running the query in the normal, human way (from a browser)?
UPDATE: here's the same thing done with requests, as suggested by @ThiefMaster♦. Unfortunately it gets the same result.
"Your search returned no matches" in requests.get("https://math.stackexchange.com/search?q=user%3Ame+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22").text
I used FireBug to view the headers of the GET request my browser sends when I run the search. Here they are:
GET /search?q=user%3A128043+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22 HTTP/1.1
Host: math.stackexchange.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: https://math.stackexchange.com/search?q=user%3A128043+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22
Cookie: __qca=P0-1687127815-1387065859894; __utma=27693923.779260871.1387065860.1393095534.1393101885.19; __utmz=27693923.1393095534.18.10.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/users/2829764/kuzzooroo; _ga=GA1.2.779260871.1387065860; mathuser=t=WM42SFDA5Uqr&s=OsFGcrXrl06g; sgt=id=bedc99bd-5dc9-42c7-85db-73cc80c4cc15; __utmc=27693923
Connection: keep-alive
Running requests.get with various pieces of this header didn't work for me, though I didn't try everything, and there are lots of possibilities.
Some sites serve different results depending on which client connects to them. I do not know whether this is the case with Stack Overflow, but I have noticed it with wikis.
Here is what I do to pretend I am an Opera browser:
import urllib  # Python 2

def openAsOpera(url):
    u = urllib.URLopener()
    u.addheaders = []
    u.addheader('User-Agent', 'Opera/9.80 (Windows NT 6.1; WOW64; U; de) Presto/2.10.289 Version/12.01')
    u.addheader('Accept-Language', 'de-DE,de;q=0.9,en;q=0.8')
    u.addheader('Accept', 'text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/webp, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1')
    f = u.open(url)
    content = f.read()
    f.close()
    return content
Surely you can adapt this to pretend the client is Firefox.
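On Python 3, where urllib.URLopener is gone, the same trick can be sketched with urllib.request. The header values below are copied from the FireBug dump in the question; open_as_firefox is just an illustrative name, not anything from the original code:

```python
import urllib.request

def open_as_firefox(url):
    # Mimic the Firefox headers from the FireBug dump in the question.
    req = urllib.request.Request(url, headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) '
                      'Gecko/20100101 Firefox/27.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    })
    return urllib.request.urlopen(req).read()

# The headers can be inspected before anything is sent (urllib
# capitalizes stored header keys, hence the 'User-agent' lookup):
req = urllib.request.Request('https://example.com',
                             headers={'User-Agent': 'Opera/9.80'})
print(req.get_header('User-agent'))  # Opera/9.80
```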
Related
So I wrote some code where a client uploads a file to the server folder and has the option to download it back. It works perfectly fine in Chrome: I click on the item I want to download and it downloads it.
def send_image(request, cs):
    request = request.split('=')
    try:
        name = request[1]
    except IndexError:
        name = request[0]
    print('using send_image!')
    print('Na ' + name)
    path = 'C:\\Users\\x\\Desktop\\webroot\\uploads' + '\\file-name=' + name
    print(path)
    with open(path, 'rb') as re:
        print('exist!')
        read = re.read()
        cs.send(read)
The code above reads the chosen file and sends its bytes back to the client.
In Chrome, it downloads the file as I showed already, but in Internet Explorer, for example, it just prints the data to the client and doesn't download it. The real question is: why does Chrome download it rather than print it the way Internet Explorer does, and how can I fix this? (For your info: all the files I download have the name file-name prefixed to them; that's why I put it in the path.)
UPDATE: here is the HTTP request:
POST /upload?file-name=Screenshot_2.png HTTP/1.1
Host: 127.0.0.1
Connection: keep-alive
Content-Length: 3534
Accept: */*
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
Content-Type: application/octet-stream
Origin: http://127.0.0.1
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Referer: http://127.0.0.1/
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en;q=0.9,en-US;q=0.8,he;q=0.7
It looks like you are not sending an HTTP/1 response but an HTTP/0.9 response (note that I'm talking about the response sent from the server, not the request sent from the client). An HTTP/1 response consists of an HTTP header and an HTTP body, similar to how an HTTP request is constructed. An HTTP/0.9 response instead consists only of the actual body, i.e. no header and thus no meta information to tell the browser what to do with the body.
HTTP/0.9 has been obsolete for 25 years, but some browsers still support it. When a browser gets an HTTP/0.9 response, it can do anything with it, since there is no HTTP header to give it a defined meaning. Browsers might try to interpret it as HTML, or as plain text, offer it for download, refuse it entirely... whatever.
The way to fix the problem is to send an actual HTTP response header before sending the body, i.e. something like this (note the bytes literal, since socket.send expects bytes in Python 3):
cs.send(b"HTTP/1.0 200 OK\r\nContent-Type: application/octet-stream\r\n\r\n")
with open(path, 'rb') as re:
    ...
    cs.send(read)
In any case: HTTP is way more complex than you might think. There are established libraries that deal with this complexity. If you insist on not using any library, please study the standard in order to avoid such problems.
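Putting that together, here is a minimal sketch of a complete response builder. The Content-Length and Content-Disposition headers are my additions (the original code never sent them): Content-Length tells the browser where the body ends, and Content-Disposition: attachment is the standard way to ask any browser to download rather than render the body.

```python
def build_response(body: bytes, filename: str) -> bytes:
    # Status line and headers, each ending in CRLF, then a blank line,
    # then the raw body bytes.
    header = (
        f"HTTP/1.0 200 OK\r\n"
        f"Content-Type: application/octet-stream\r\n"
        f"Content-Length: {len(body)}\r\n"
        f'Content-Disposition: attachment; filename="{filename}"\r\n'
        f"\r\n"
    )
    return header.encode("ascii") + body

resp = build_response(b"hello", "Screenshot_2.png")
```

In the handler, the final send would then become cs.sendall(build_response(read, name)); sendall avoids the short writes that plain send permits.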
The documentation says headers can be sent in a specific order here:
http://docs.python-requests.org/en/master/user/advanced/#header-ordering
But for some unknown reason, requests never sends the headers in the order I provide.
Example code:
headers01 = OrderedDict([("Connection", "close"), ("Upgrade-Insecure-Requests", "1"), ("User-Agent", "SomeAgent"), ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"), ("Accept-Encoding", "gzip, deflate"), ("Accept-Language", "Some Language")])
Result:
Connection: close
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; InfoPath.2)
Accept-Language: en-US,en;q=0.5
Upgrade-Insecure-Requests: 1
My request is already sent through a session, and it also does not work when sent without a session.
If you read the documentation page you linked, it does indicate the limitation of default headers and the workaround...
Running this code:
import requests
from collections import OrderedDict
headers = OrderedDict([("Connection", "close"), ("Upgrade-Insecure-Requests", "1"), ("User-Agent", "SomeAgent"), ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"), ("Accept-Encoding", "gzip, deflate"), ("Accept-Language", "Some Language")])
s = requests.Session()
s.headers = headers
r = s.get("http://localhost:6000/foo")
Sends:
GET /foo HTTP/1.1\r\nHost: localhost:6000\r\nConnection: close\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: SomeAgent\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: Some Language\r\n\r\n
You are in fact wrong: header order does not matter, at least not according to the standards (RFC 2616: https://www.rfc-editor.org/rfc/rfc2616).
The point you are trying to make (i.e. why it does matter) is that browsers can (somewhat unreliably) be identified by fingerprinting based on the header order they happen to use. This is fine, but it is by no means a reason for a Python library to implement specific ordering.
That you're disappointed you won't be able to use this library to impersonate some browser, or to get accurately fingerprinted by this kind of software, is too bad, but it hardly justifies the tone of the question.
The best suggestion here would be to find an alternative HTTP request library that does allow specific header ordering and guarantees maintaining the order you provide.
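For what it's worth, the stdlib's http.client already sends headers in exactly the order you call putheader(), so it can serve as that alternative. A sketch, peeking at the private _buffer attribute only to show the buffered order without opening a socket:

```python
import http.client

conn = http.client.HTTPConnection("localhost", 6000)
# skip_host / skip_accept_encoding stop http.client from injecting its
# own headers, so the order below is exactly what goes on the wire.
conn.putrequest("GET", "/foo", skip_host=True, skip_accept_encoding=True)
conn.putheader("Host", "localhost:6000")
conn.putheader("Connection", "close")
conn.putheader("User-Agent", "SomeAgent")
conn.putheader("Accept", "text/html")
# conn.endheaders() would open the socket and transmit the request.

raw = b"\r\n".join(conn._buffer).decode()  # private attr, inspection only
print(raw)
```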
I'm trying to get some data from a page, but it returns the error [403 Forbidden].
I thought it was the user agent, but I tried several user agents and it still returns the error.
I also tried the fake-useragent library, but I did not succeed.
import requests
from fake_useragent import UserAgent

with requests.Session() as c:
    url = '...'
    #headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36'}
    ua = UserAgent()
    header = {'User-Agent': str(ua.chrome)}
    page = c.get(url, headers=header)
    print page.content
When I access the page manually, everything works.
I'm using Python 2.7.14 and the requests library. Any idea?
The site could be using anything in the request to trigger the rejection.
So, copy all headers from the request that your browser makes. Then delete them one by one1 to find out which are essential.
As per Python requests. 403 Forbidden, to add custom headers to the request, do:
result = requests.get(url, headers={'header':'value', <etc>})
1A faster way would be to delete half of them each time instead, but that's more complicated, since there are probably multiple essential headers.
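The one-by-one elimination can be sketched as a generic helper. Here, works is a hypothetical callback that would replay the request with a trial subset of headers and report success; the sketch assumes the server behaves consistently across retries:

```python
def find_essential(headers, works):
    """Return the headers (in order) whose removal breaks the request.

    `works(subset)` is a hypothetical callback: it should replay the
    request with only `subset` headers and return True on success.
    """
    essential = list(headers)
    i = 0
    while i < len(essential):
        trial = essential[:i] + essential[i + 1:]  # drop one header
        if works(trial):
            essential = trial   # request still works: header not needed
        else:
            i += 1              # request broke: keep it, move on
    return essential

# Toy demo: pretend the server only checks User-Agent and Cookie.
demo = find_essential(
    ["Accept", "User-Agent", "Accept-Language", "Cookie"],
    lambda subset: "User-Agent" in subset and "Cookie" in subset,
)
print(demo)  # ['User-Agent', 'Cookie']
```

The faster halving variant mentioned in the footnote is essentially delta debugging (ddmin) applied to the header list.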
These are all the headers I can see included by the browser for a generic GET request:
Host: <URL>
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Try to include all of those incrementally in your request (one by one) in order to identify which one(s) is/are required for a successful request.
On the other hand, take a look at the Cookies and/or Security tabs available in your browser console / developer tools under the Network option.
I'm trying to use mechanize to get information from a web page. It's basically succeeding in getting the first bit of information, but the web page includes a button for "Next" to get more information. I can't figure out how to programmatically get the additional information.
By using Live HTTP Headers, I can see the http request that is generated when I click the next button within a browser. It seems as if I can issue the same request using mechanize, but in the latter case, instead of getting the next page, I am redirected to the home page of the website.
Obviously, mechanize is doing something different than my browser is, but I can't figure out what. Comparing the headers, I did find one difference: the browser used
Connection: keep-alive
while mechanize used
Connection: close
I don't know if that's the culprit, but when I tried to add the header ('Connection','keep-alive'), it didn't change anything.
[UPDATE]
When I click the button for "page 2" within Firefox, the generated http is (according to Live HTTP Headers):
GET /statistics/movies/ww_load/the-fast-and-the-furious-6-2012?authenticity_token=ItU38334Qxh%2FRUW%2BhKoWk2qsPLwYKDfiNRoSuifo4ns%3D&facebook_fans_page=2&tbl=facebook_fans&authenticity_token=ItU38334Qxh%2FRUW%2BhKoWk2qsPLwYKDfiNRoSuifo4ns%3D HTTP/1.1
Host: www.boxoffice.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:18.0) Gecko/20100101 Firefox/18.0
Accept: text/javascript, text/html, application/xml, text/xml, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
X-Requested-With: XMLHttpRequest
X-Prototype-Version: 1.6.0.3
Referer: http://www.boxoffice.com/statistics/movies/the-fast-and-the-furious-6-2012
Cookie: __utma=179025207.1680379428.1359475480.1360001752.1360005948.13; __utmz=179025207.1359475480.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __qca=P0-668235205-1359475480409; zip=13421; country_code=US; _boxoffice_session=2202c6a47fc5eb92cd0ba57ef6fbd2c8; __utmc=179025207; user_credentials=d3adbc6ecf16c038fcbff11779ad16f528db8ebd470befeba69c38b8a107c38e9003c7977e32c28bfe3955909ddbf4034b9cc396dac4615a719eb47f49cc9eac%3A%3A15212; __utmb=179025207.2.10.1360005948
Connection: keep-alive
When I try to request the same url within mechanize, it looks like this:
GET /statistics/movies/ww_load/the-fast-and-the-furious-6-2012?facebook_fans_page=2&tbl=facebook_fans&authenticity_token=ZYcZzBHD3JPlupj%2F%2FYf4dQ42Kx9ZBW1gDCBuJ0xX8X4%3D HTTP/1.1
Accept-Encoding: identity
Host: www.boxoffice.com
Accept: text/javascript, text/html, application/xml, text/xml, */*
Keep-Alive: 115
Connection: close
Cookie: _boxoffice_session=ced53a0ca10caa9757fd56cd89f9983e; country_code=US; zip=13421; user_credentials=d3adbc6ecf16c038fcbff11779ad16f528db8ebd470befeba69c38b8a107c38e9003c7977e32c28bfe3955909ddbf4034b9cc396dac4615a719eb47f49cc9eac%3A%3A15212
Referer: http://www.boxoffice.com/statistics/movies/the-fast-and-the-furious-6-2012
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1
The server was checking X-Requested-With and/or X-Prototype-Version, so adding those two headers to the mechanize request fixed it.
Maybe a little late with an answer, but I fixed this by adding a line in _urllib2_forked.py.
On line 1098 stands the line: headers["Connection"] = "Close"
Change this to:
if not 'Connection' in headers:
    headers["Connection"] = "Close"
and make sure you set the header in your script, and it will work.
I am writing a very basic web server as a homework assignment, and I have it running on localhost port 14000. When I browse to localhost:14000, the server sends back an HTML page with a form on it (the form's action is the same address, localhost:14000; not sure if that's proper or not).
Basically, I want to be able to gather the data from the GET request once the page reloads after the submit. How can I do this? How can I access the stuff in the GET request in general?
NOTE: I already tried socket.recv(xxx); that doesn't work when the page is being loaded for the first time, because in that case we are not receiving anything from the client, so it just keeps spinning.
The secret lies in conn.recv, which will give you the headers sent by the browser/client with the request. If they look like the ones I generated with Safari, you can easily parse them (even without a complex regex pattern).
data = conn.recv(1024)
#Parse headers
"""
data will now be something like this:
GET /?banana=True HTTP/1.1
Host: localhost:50008
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.53.11 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: keep-alive
"""
#A simple parsing of the GET data would be:
GET = {i.split("=")[0]: i.split("=")[1] for i in data.split("\n")[0].split(" ")[1][2:].split("&")}
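That one-liner breaks on percent-encoded values, on values containing "=", and on a bare "/" with no query string; the stdlib parser handles all of those. The request line below is a made-up sample in the same shape as the Safari dump:

```python
from urllib.parse import urlsplit, parse_qs

data = "GET /?banana=True&name=a%20b HTTP/1.1\r\nHost: localhost:50008\r\n\r\n"
target = data.split("\r\n")[0].split(" ")[1]  # "/?banana=True&name=a%20b"
query = urlsplit(target).query                # empty string for a bare "/"
# parse_qs decodes %20 etc. and returns lists; take the first value each.
GET = {k: v[0] for k, v in parse_qs(query).items()}
print(GET)  # {'banana': 'True', 'name': 'a b'}
```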