I am writing a very basic web server as a homework assignment and I have it running on localhost port 14000. When I browse to localhost:14000, the server sends back an HTML page with a form on it (the form's action is the same address, localhost:14000 - not sure if that's proper or not).
Basically I want to be able to gather the data from the GET request once the page reloads after the submit - how can I do this? How can I access the contents of the GET request in general?
NOTE: I already tried socket.recv(xxx); that doesn't work if the page is being loaded for the first time - in that case we are not "receiving" anything from the client, so it just keeps spinning.
The secret lies in conn.recv(), which gives you the headers the browser/client sent with the request. If they look like the ones I generated with Safari, you can parse them easily (even without a complex regex pattern).
data = conn.recv(1024).decode()  # recv() returns bytes (in Python 3); decode before string parsing
"""
data will now be something like this:
GET /?banana=True HTTP/1.1
Host: localhost:50008
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.53.11 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: keep-alive
"""
# A simple parse of the GET data: take the request line, grab the path
# ("/?banana=True"), strip the leading "/?" and split the key=value pairs on "&"
GET = {i.split("=")[0]: i.split("=")[1]
       for i in data.split("\n")[0].split(" ")[1][2:].split("&")}
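For anything beyond a quick exercise, note that Python 3's standard library can do this parsing for you. A small sketch, assuming data holds the decoded request as above (urllib.parse also handles URL-decoding and repeated keys, which the one-liner does not):
from urllib.parse import urlparse, parse_qs

# Take the path from the request line, e.g. "/?banana=True"
path = data.split("\n")[0].split(" ")[1]

# parse_qs URL-decodes the values and maps each key to a list,
# e.g. {"banana": ["True"]}
params = parse_qs(urlparse(path).query)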
I was wondering why I am getting a 408 Request Timeout when sending an HTTP GET request using sockets. I just copied the GET request that Chrome sent and pasted it into Python, figuring I would get a 200 response, but clearly I am missing something.
def GET():
    headers = ("""GET / HTTP/1.1\r
Host: {insert host here}\r
Connection: close\r
Cache-Control: max-age=0\r
DNT: 1\r
Upgrade-Insecure-Requests: 1\r
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36\r
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r
Accept-Encoding: gzip, deflate\r
Accept-Language: en-US,en;q=0.9\r
Cookie: accept_cookies=1\r\n""").encode('ascii')
    payload = headers
    return payload

def activity1():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((HOST, PORT))
    user = GET()
    sock.sendall(user)
    poop = sock.recv(10000)
    print(poop)
    sock.close()
Assuming the hostname and port are defined correctly, is there anything wrong with this request that would cause it to time out? Thanks.
The initial problem is that the HTTP header is not properly terminated, i.e. it is missing the final \r\n (empty line). Once this is fixed you will likely run into multiple other problems, like:
You are assuming that everything can be read within a single recv, which will only be true for short responses.
You likely assume that the body is a single byte buffer. But it can be transferred in chunks, since HTTP/1.1 supports the chunked Transfer-Encoding.
You likely assume that the body is plain text. But it can be compressed, since you explicitly accept gzip-compressed responses.
HTTP is not as simple a protocol as it might look. Please read the actual standard before implementing it, see RFC 7230. Or just use a library which does the hard work for you.
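To illustrate just the first fix, here is a minimal sketch (host and port are placeholders): the header block ends with an empty line, Accept-Encoding is omitted so the server should not compress the body, and the response is read until the server closes the connection instead of with a single recv:
import socket

HOST, PORT = "example.com", 80  # placeholder host/port, as in the question

request = (
    "GET / HTTP/1.1\r\n"
    "Host: " + HOST + "\r\n"
    "Connection: close\r\n"
    "\r\n"                      # the final empty line that terminates the header
).encode("ascii")

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((HOST, PORT))
sock.sendall(request)

# With "Connection: close" the server marks the end of the response by
# closing the connection, so read until recv() returns b"".
chunks = []
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    chunks.append(chunk)
sock.close()

response = b"".join(chunks)
print(response[:200])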
So I wrote some code where a client uploads a file to the server folder and has an option to download it back. It works perfectly fine in Chrome: I click on the item I want to download and it downloads it.
def send_image(request, cs):
    # the request ends with "file-name=<name>"; take what follows the "="
    request = request.split('=')
    try:
        name = request[1]
    except IndexError:
        name = request[0]
    print('using send_image!')
    print('Na ' + name)
    path = 'C:\\Users\\x\\Desktop\\webroot\\uploads' + '\\file-name=' + name
    print(path)
    with open(path, 'rb') as re:
        print('exist!')
        read = re.read()
        cs.send(read)
The code above reads the chosen file and sends the data back to the client as bytes.
In Chrome it downloads the file, as I showed already, but in Internet Explorer, for example, it just prints the data to the client and doesn't download it. The real question is: why does Chrome download it rather than printing it the way Internet Explorer does, and how can I fix it? (For your info: all the files that I download have the name file-name before them; that's why I put it in the path.)
UPDATE: here is the HTTP request:
POST /upload?file-name=Screenshot_2.png HTTP/1.1
Host: 127.0.0.1
Connection: keep-alive
Content-Length: 3534
Accept: */*
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
Content-Type: application/octet-stream
Origin: http://127.0.0.1
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Referer: http://127.0.0.1/
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en;q=0.9,en-US;q=0.8,he;q=0.7
It looks like you don't send an HTTP/1 response but an HTTP/0.9 response (note that I'm talking about the response sent from the server, not the request sent from the client). An HTTP/1 response consists of an HTTP header and an HTTP body, similar to how an HTTP request is constructed. An HTTP/0.9 response instead consists only of the actual body, i.e. no header and thus no meta information to tell the browser what to do with the body.
HTTP/0.9 has been obsolete for 25 years, but some browsers still support it. When a browser gets an HTTP/0.9 response it could do anything with it, since there is no HTTP header to give the body a defined meaning. Browsers might try to interpret it as HTML, as plain text, offer it for download, refuse it entirely... whatever.
The way to fix the problem is to send an actual HTTP response header before sending the body, i.e. something like this:
cs.send(b"HTTP/1.0 200 ok\r\nContent-type: application/octet-stream\r\n\r\n")
with open(path, 'rb') as re:
    ...
    cs.send(read)
In any case: HTTP is way more complex than you might think. There are established libraries to deal with this complexity. If you insist on not using any library please study the standard in order to avoid such problems.
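As a sketch of how the fix could look for the download case (the helper name here is made up; Content-Disposition: attachment is the standard header that asks any browser to download the body instead of rendering it):
def send_file_as_download(cs, path, name):
    # Hypothetical helper: wraps the file in a minimal HTTP/1.0 response.
    with open(path, 'rb') as f:
        body = f.read()
    header = ("HTTP/1.0 200 OK\r\n"
              "Content-Type: application/octet-stream\r\n"
              "Content-Length: " + str(len(body)) + "\r\n"
              'Content-Disposition: attachment; filename="' + name + '"\r\n'
              "\r\n").encode('ascii')
    cs.send(header + body)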
I'm trying to log in to this website, seeking.com/login, through the Scrapy shell. I also installed Burp Suite to analyze its URL, headers, etc.
from scrapy.http import FormRequest
frmdata = {"captcha":"","email":"MYEMAIL.com","password":"MY_PASSWORD","is_rememberme":"0","locale":"en_US","auth_type":"bearer_token","date":"2018-12-13T09:56:22.957Z"}
url = "https://www.seeking.com/v3/auth/login"
r = FormRequest(url, formdata=frmdata)
fetch(r)
With this code I get an HTTP 401 error, which as far as I can tell is essentially an authentication error.
I forwarded the calls through Burp Suite and got the following intercept.
POST /v3/auth/login HTTP/1.1
Host: www.seeking.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0)
Gecko/20100101 Firefox/63.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: https://www.seeking.com/login?fromLogout=1
Content-Type: application/json;charset=utf-8
Web-Version: 3.59
Authorization: Basic NTI2ZTIwYzExMDI0NDYzNTk5OTI5MzUwZThiNWEzMTI6bHN0emd4ZzpSRzRzS3VmdEJMRTQxMm92TnMxbDR6L0ZkZ1dESHZuM2wwZWxtYWhyMGtnPQ==
Content-Length: 166
Connection: close
Cookie: __cfduid=dcf9fd66583d55382f362c18a83d904ca1544519479;
_gcl_au=1.1.2035701377.1544519485; _ga=GA1.2.1740241044.1544519486;
com.silverpop.iMAWebCookie=e88c45d1-3c24-11c6-089e-e287aae2c678;
__cfruid=3eebbdc1e401ed560c23a7c474c41e59b2e93018-1544520179;
device_cookie=1; __gads=ID=a1e437c03ddad1b3:T=1544519579:S=ALNI_MYb30xY4z76J4NniCK_ZtOyOdPMKA;_lb_user=gfpuzje6kg; seeking_session=eyJpdiI6Im4yMTNJNVNRZjkxbnZzMmNpYnQ4dkE9PSIsInZhbHVlIjoiVGhGVUJDejc1dElJbEwxekh5d2hXUnhjeDlpVWR2dW9IWWJqeDZvRmI3VU9Pc1lpZXZGWGJxejQ1alNXbGVXUGJqaEpORU9LNFJITVh0N3IwR1E0bUE9PSIsIm1hYyI6IjUyODU3MWIxYjM3MGU3M2E0YjI1YzM2MzNmNDc5ZDMzZDdjYTg1ZWMxYWU2ODJjY2JlMTJmZWJlNmUyZDkyNWMifQ%3D%3D

{"captcha":"","email":"MYEMAIL","password":"MYPASS","is_rememberme":0,"locale":"en_US","auth_type":"bearer_token","date":"2018-12-14T09:15:56.016Z"}
I am completely new to this, and have spent two days trying to figure out what I need to pass to this POST to log in.
My questions are:
1) Based on this intercept, what should my request via FormRequest look like?
2) I see there are cookies/authorization tokens (an Authorization token that changes with each POST, session cookies, etc.) being passed in the POST... Where do they come from? How do I get them when I am scraping so that I can successfully log in?
3) Do I need to store these session variables when scraping other pages on the site after login? Is there anything special I need to do to stay logged in to access other pages?
It looks like the login page is expecting to be passed JSON data, not a url-encoded string (which is what FormRequest will create).
Something like this should work:
import json
import scrapy

r = scrapy.Request(
    url=url,
    method='POST',
    body=json.dumps(frmdata),
    headers={'Content-Type': 'application/json'},
)
The tokens, cookies, etc. are probably created when you initially request the login page, so you might need to request the login page before trying to log in.
It is possible that some of it is generated with javascript (haven't checked), so you might need to dig through the js code to figure out what's going on, or even execute the js yourself (e.g. using a browser).
Scrapy will keep track of your session for you, so there's nothing you need to do to stay logged in.
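A sketch of how that flow could look in a spider (the URLs and form data come from the question; the spider class and callback names are made up, and whether the login page actually sets the needed tokens is unverified):
import json
import scrapy

class SeekingSpider(scrapy.Spider):
    name = "seeking"
    start_urls = ["https://www.seeking.com/login"]

    def parse(self, response):
        # Visiting the login page first lets Scrapy pick up any
        # cookies/tokens it sets before we attempt the login POST.
        frmdata = {"captcha": "", "email": "MYEMAIL.com",
                   "password": "MY_PASSWORD", "is_rememberme": "0",
                   "locale": "en_US", "auth_type": "bearer_token"}
        yield scrapy.Request(
            url="https://www.seeking.com/v3/auth/login",
            method="POST",
            body=json.dumps(frmdata),
            headers={"Content-Type": "application/json"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy's cookie middleware carries the session for any
        # requests yielded from here on.
        self.logger.info("login status: %s", response.status)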
I know there are tons of ways to add headers or cookies and the like. But what I want to do is add "\r\n" at the top of the request so that it looks like the following body.
Request Body >>
\r\n <-- technically invisible..
GET /path/ HTTP/1.1
Host: www.website.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22
Referer: https://www.google.com/
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,ko;q=0.6
Accept-Charset: windows-949,utf-8;q=0.7,*;q=0.3
\r\n is added on the first line of the GET request as you can see.
It's like adding an empty line.
How can I do this in Python?
I've spent hours on this topic but couldn't find any useful resources.
===================== ADD ============================================
It's about hacking.
In South Korea, the government restricts some sites, but the filters that prevent users from connecting to them can easily be evaded just by adding "\r\n" at the top of the request body.
httplib2, httplib, urllib, urllib2, etc., etc.
Whatever library is used, I just need to add "\r\n" to the request body.
You could do this by monkeypatching the httplib.HTTPConnection class; urllib, urllib2, requests etc. all use that class to handle the low-level HTTP conversation.
The easiest approach is to patch the HTTPConnection._output() method to insert the extra characters just before the request line (which ends with the HTTP version string):
from httplib import HTTPConnection, _CS_REQ_STARTED

orig_output = HTTPConnection._output

def add_initial_newline_output(self, s):
    # Only touch the request line: the request has just started, the line
    # ends with the HTTP version string, and nothing has been buffered yet.
    if (self._HTTPConnection__state == _CS_REQ_STARTED and
            s.endswith(self._http_vsn_str) and not self._buffer):
        self._buffer.append('')  # an empty entry is sent as an extra \r\n
    orig_output(self, s)

HTTPConnection._output = add_initial_newline_output
This will only insert the extra starting empty line when the connection is in the correct state (request started), the line ends with the current HTTP version string, and the buffer is still empty.
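A quick usage sketch (Python 2, matching the httplib import; the URL is a placeholder): once the patch is applied, any httplib-based client sends the extra leading empty line automatically:
import urllib2

# The patched HTTPConnection._output() is used under the hood, so this
# request now starts with an empty line before "GET / HTTP/1.1".
response = urllib2.urlopen('http://www.example.com/')
print(response.getcode())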
I'm working on improving my answer to a question on Meta Stack Overflow. I want to run a search on some Stack Exchange site and detect whether I got any results. For example, I might run this query. When I run the query through my browser, I don't see the string "Your search returned no matches" anywhere in the html I get. But when I run this Python code:
"Your search returned no matches" in urllib2.urlopen("https://math.stackexchange.com/search?q=user%3Ame+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22+").read()
I get True, and in fact the string contains a page that is clearly different from the one I get in my browser. How can I run the search in a way that gets me the same result I get when running the query in the normal, human way (from a browser)?
UPDATE: here's the same thing done with requests, as suggested by @ThiefMaster. Unfortunately it gets the same result.
"Your search returned no matches" in requests.get("https://math.stackexchange.com/search?q=user%3Ame+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22").text
I used FireBug to view the header of the GET that runs when I run the search from my browser. Here it is:
GET /search?q=user%3A128043+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22 HTTP/1.1
Host: math.stackexchange.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: https://math.stackexchange.com/search?q=user%3A128043+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22
Cookie: __qca=P0-1687127815-1387065859894; __utma=27693923.779260871.1387065860.1393095534.1393101885.19; __utmz=27693923.1393095534.18.10.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/users/2829764/kuzzooroo; _ga=GA1.2.779260871.1387065860; mathuser=t=WM42SFDA5Uqr&s=OsFGcrXrl06g; sgt=id=bedc99bd-5dc9-42c7-85db-73cc80c4cc15; __utmc=27693923
Connection: keep-alive
Running requests.get with various pieces of this header didn't work for me, though I didn't try everything, and there are lots of possibilities.
Some sites return different results depending on which client connects to them. I do not know whether this is the case with Stack Overflow, but I have noticed it with wikis.
Here is what I do to pretend I am an Opera browser:
import urllib

def openAsOpera(url):
    u = urllib.URLopener()
    u.addheaders = []
    u.addheader('User-Agent', 'Opera/9.80 (Windows NT 6.1; WOW64; U; de) Presto/2.10.289 Version/12.01')
    u.addheader('Accept-Language', 'de-DE,de;q=0.9,en;q=0.8')
    u.addheader('Accept', 'text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/webp, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1')
    f = u.open(url)
    content = f.read()
    f.close()
    return content
Surely you can adapt this to pretend the client is Firefox.
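For the question's setup, a sketch of the same idea using requests (the headers are copied from the FireBug capture above; whether a spoofed User-Agent alone is enough for this particular search is untested, since the logged-in cookies may also matter):
import requests

headers = {
    # Copied from the FireBug capture above
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

url = ("https://math.stackexchange.com/search"
       "?q=user%3Ame+hasaccepted%3Ano+answers%3A1"
       "+lastactive%3A2013-12-24..2014-02-22")

r = requests.get(url, headers=headers)
print("Your search returned no matches" in r.text)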