How to manipulate the content body of the GET request in Python - python

I know there are tons of ways to add headers or cookies something like this. But what I want to do is to add "\r\n" on the top of the request so as to look like the following body.
Request Body >>
\r\n <-- technically invisible..
GET /path/ HTTP/1.1
Host: www.website.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22
Referer: https://www.google.com/
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,ko;q=0.6
Accept-Charset: windows-949,utf-8;q=0.7,*;q=0.3
\r\n is added on the first line of the GET request as you can see.
It's like adding an empty line.
How can I do this in Python?
I've spent hours on this topic but couldn't find any useful resources.
===================== ADD ============================================
It's about hacking.
In South Korea, Government restricts some sites, but the filters preventing users from connecting to the sites can easily be evaded by just adding "\r\n" on the top of the request body.
httplib2, httplib, urllib, urllib2, etc.. etc..
Whatever library to be used, I just need to add "\r\n" to the request body.

You could do this by monkeypatching the httplib.HTTPConnection class; urllib, urllib2, requests etc. all use that class to handle the low-level HTTP conversation.
The easiest is to patch the HTTPConnection._output() method to insert the extra characters before a HTTP version message:
from httplib import HTTPConnection, _CS_REQ_STARTED
orig_output = HTTPConnection._output
def add_initial_newline_output(self, s):
if (self._HTTPConnection__state == _CS_REQ_STARTED and
s.endswith(self._http_vsn_str) and not self._buffer):
self._buffer.append('') # will insert extra \r\n
orig_output(self, s)
HTTPConnection._output = add_initial_newline_output
This will only insert the extra starting empty line when the connection is in the correct state (request started), the line ends with the current HTTP version string, and the buffer is still empty.

Related

Python socket - downloading files only work in chrome

So I created a code which a client uploads a file to the server folder and he has an option to download it back, it works perfectly fine in chrome, I click on the item I want to download and it downloads it
def send_image(request, cs):
request = request.split('=')
try:
name = request[1]
except:
name = request[0]
print('using send_iamge!')
print('Na ' + name)
path = 'C:\\Users\\x\\Desktop\\webroot\\uploads' + '\\file-name=' + name
print(path)
with open(path, 'rb') as re:
print('exist!')
read = re.read()
cs.send(read)
the code above reads the file that you choose and sends the data as bytes to the client back.
In chrome, it downloads the file as I showed you already but in for example internet explorer, it just prints the data to the client and doesn't download it The real question is why doesn't it just prints the data in chrome, why does it download it and doesn't print it as internet explorer does and how can I fix it?(for your info: all the files that I download have the name file-name before them that's why I put it there)
http request:
UPDATE:
POST /upload?file-name=Screenshot_2.png HTTP/1.1
Host: 127.0.0.1
Connection: keep-alive
Content-Length: 3534
Accept: */*
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
Content-Type: application/octet-stream
Origin: http://127.0.0.1
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Referer: http://127.0.0.1/
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en;q=0.9,en-US;q=0.8,he;q=0.7
It looks like that you don't send a HTTP/1 response but a HTTP/0.9 response (Note that I'm talking about the response send from the server not the request send from the client). A HTTP/1 response consists of a HTTP header and a HTTP body, similar to how a HTTP request is constructed. A HTTP/0.9 response instead only consists of the actual body, i.e. no header and thus no meta information in the header which tell the browser what to do with the body.
HTTP/0.9 is obsolete for 25 years but some browsers still support it. When a browser gets a HTTP/0.9 request it could anything with it since there is no defined meaning from the HTTP header. Browsers might try to interpret is as HTML, as plain text, offer it for download, refuse it in total ... - whatever.
The way to fix the problem is to send an actual HTTP response header before sending the body, i.e. something like this
cs.send("HTTP/1.0 200 ok\r\nContent-type: application/octet-stream\r\n\r\n")
with open(path, 'rb') as re:
...
cs.send(read)
In any case: HTTP is way more complex than you might think. There are established libraries to deal with this complexity. If you insist on not using any library please study the standard in order to avoid such problems.

How do i login to this site with scrapy shell and python - 401 Error?

Im trying to login to this website, seeking.com/login through scrapy shell. i also installed burp suite to analyze its url and headers, etc.
from scrapy.http import FormRequest
frmdata = {"captcha":"","email":"MYEMAIL.com","password":"MY_PASSWORD","is_rememberme":"0","locale":"en_US","auth_type":"bearer_token","date":"2018-12-13T09:56:22.957Z"}
url = "https://www.seeking.com/v3/auth/login"
r = FormRequest(url, formdata=frmdata)
fetch(r)
with this code i get a HTTP 401 Error, as far as i can tell essentially an authentication error.
I forwarded the calls through burpsuite and got the following intercept.
POST /v3/auth/login HTTP/1.1
Host: www.seeking.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0)
Gecko/20100101 Firefox/63.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: https://www.seeking.com/login?fromLogout=1
Content-Type: application/json;charset=utf-8
Web-Version: 3.59
Authorization: Basic NTI2ZTIwYzExMDI0NDYzNTk5OTI5MzUwZThiNWEzMTI6bHN0emd4ZzpSRzRzS3VmdEJMRTQxMm92TnMxbDR6L0ZkZ1dESHZuM2wwZWxtYWhyMGtnPQ==
Content-Length: 166
Connection: close
Cookie: __cfduid=dcf9fd66583d55382f362c18a83d904ca1544519479;
_gcl_au=1.1.2035701377.1544519485; _ga=GA1.2.1740241044.1544519486;
com.silverpop.iMAWebCookie=e88c45d1-3c24-11c6-089e-e287aae2c678;
__cfruid=3eebbdc1e401ed560c23a7c474c41e59b2e93018-1544520179;
device_cookie=1; __gads=ID=a1e437c03ddad1b3:T=1544519579:S=ALNI_MYb30xY4z76J4NniCK_ZtOyOdPMKA;_lb_user=gfpuzje6kg; seeking_session=eyJpdiI6Im4yMTNJNVNRZjkxbnZzMmNpYnQ4dkE9PSIsInZhbHVlIjoiVGhGVUJDejc1dElJbEwxekh5d2hXUnhjeDlpVWR2dW9IWWJqeDZvRmI3VU9Pc1lpZXZGWGJxejQ1alNXbGVXUGJqaEpORU9LNFJITVh0N3IwR1E0bUE9PSIsIm1hYyI6IjUyODU3MWIxYjM3MGU3M2E0YjI1YzM2MzNmNDc5ZDMzZDdjYTg1ZWMxYWU2ODJjY2JlMTJmZWJlNmUyZDkyNWMifQ%3D%3D {"captcha":"","email":"MYEMAIL","password":"MYPASS","is_rememberme":0,"locale":"en_US","auth_type":"bearer_token","date":"2018-12-14T09:15:56.016Z"}
I am completely new to this, and have spent 2 days trying to figure out what i need to pass to this POST to login.
My question is
1) based on this intercept what should my request via FormRequest look like?
2) I see there are cookies/authorization (Authorization token, that changes with each POST, session cookies, etc) tokens that are being passed in to the post... Where do they come from? How do i get them when i am scraping so that i can successfully login?
3) Do i need to store these session variables when scraping other pages on the site after login? Anything special i need to do to stay logged in to access other pages?
It looks like the login page is expecting to be passed soon data, and not a url-encoded string (which is what FormRequest will create).
Something like this should work:
r = scrapy.Request(
url=url,
method='POST',
body=json.dumps(frmdata),
headers={'Content-Type': 'application/json'},
)
The tokens, cookies, etc. are probably created when you initially request the login page, so you might need to request the login page before trying to log in.
It is possible that some of it is generated with javascript (haven't checked), so you might need to dig through the js code to figure out what's going on, or even execute the js yourself (e.g. using a browser).
Scrapy will keep track of your session for you, so there's nothing you need to do to stay logged in.

Apparently fine http request results malformed when sent over socket

I'm working with socket operations and have coded a basic interception proxy in python. It works fine, but some hosts return 400 bad request responses.
These requests do not look malformed though. Here's one:
GET http://www.baltour.it/ HTTP/1.1
Host: www.baltour.it
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Same request, raw:
GET http://www.baltour.it/ HTTP/1.1\r\nHost: www.baltour.it\r\nUser-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Language: en-US,en;q=0.5\r\nAccept-Encoding: gzip, deflate\r\nConnection: keep-alive\r\n\r\n
The code I use to send the request is the most basic socket operation (though I don't think the problem lies there, it works fine with most hosts)
socket_client.send(request_raw)
while socket_client.recv is used to get the response (but no problems here, the response is well-formed, though its status is 400).
Any ideas?
When not talking to a proxy, you are not supposed to put the http://hostname part in the HTTP header; see section 5.1.2 of the HTTP 1.1 RFC 2616 spec:
The most common form of Request-URI is that used to identify a resource on an origin server or gateway. In this case the absolute path of the URI MUST be transmitted (see section 3.2.1, abs_path) as the Request-URI, and the network location of the URI (authority) MUST be transmitted in a Host header field.
(emphasis mine); abs_path is the absolute path part of the request URI, not the full absolute URI itself.
E.g. the server expects you to send:
GET / HTTP/1.1
Host: www.baltour.it
A receiving server should be tolerant of the incorrect behaviour, however. The server seems to violate the RFC as well here too. Further on in the same section it reads:
To allow for transition to absoluteURIs in all requests in future versions of HTTP, all HTTP/1.1 servers MUST accept the absoluteURI form in requests, even though HTTP/1.1 clients will only generate them in requests to proxies.

Filling a strange webform in python

I am trying build a simple program that fills a webform and then extracts the a data from the resulted website, it should be pretty straight-forward and after a short web research (mostly from this website) I came to the conclusion that python would be my best choice.(with urllib)
I would give a specific example of what I am trying to do and hopefully it will clarify things:
Lets say I want an automatic script that gets the price for hotel, the website would be:
the Hilton website
In that webpage I want to fill the query as follows: the "where are you going" should have N.Y.C and departure and arrival date.
If used from the browser, I will get a the next page with the results for the query I just filled and from that webpage I want to scrape the prices.
Lets assume that scraping the data can be done from the html source code.
Well, that's how I thought to do it (very general description....)
import urllib
import urllib.request
url = 'http://www3.hilton.com/en/index.html'
query_args = { 'searchQuery':' New York, NY ', 'arrivalDate':'31 Oct 2013' , 'departureDate': '04 Nov 2013' }
print (query_args)
data = urllib.parse.urlencode(query_args)
print (data)
request = urllib.request.Request(url);
response = urllib.request.urlopen(request,data)
html = response.read()
print (html)
First of all I get an the following error when I am sending the request:" TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str." anyone knows why?
If I got it right I would assume that I would accept the response for the following website. hilton_website with the urlencode segment in the end (searchQuery=+New+York%2C+NY&arrivalDate=31+Oct+2013&departureDate=04+Nov+2013)
But there is no such website ( when I am typing the full address to the browser url)
it looks like all the results (and it doesn't matter what was your query or how you filled the form ) appear on the same webpage: hilton_web_site...en_US/hi/search/findhotels/results.htm?view=LIST
So what am I doing wrong? I thought that the way to fill a web form.
Thanks a lot for the help
Here's a dump of hiltons search form (which, i instructed you how to get but can not paste it all in the comment section so here you go), i forgot to mention that you should check the "Form data" raw data in Chrome tho which was my bad.. anyway..
RAW data (so you can understand how a POST request works)
Request Header
POST /en_US/hi/search/findhotels/index.htm?WT.bid=Home,,,find_button HTTP/1.1
Host: www3.hilton.com
Connection: keep-alive
Content-Length: 475
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Origin: http://www3.hilton.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Referer: http://www3.hilton.com/en/index.html
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Cookie: mmcore.tst=0.626; ClrCSTO=T; ClrSSID=1382526492268-11812; ClrOSSID=1382526492268-11812; ClrSCD=1382526492268; mmid=-1914085652%7CAQAAAAq0Q+DZtwkAAA%3D%3D; mmcore.pd=-1914085652%7CAQAAAAq0Q+DZtwkAAA%3D%3D; mmcore.srv=ldnvwcgus01; K3R7=3U24QMiCUbjeh65LoEI31TTjisY8czr4zkIUe06gsA4A5lc0bIKrEhQ; GWSESSIONID=qYDpSnSLNFnJ5CrJrwlSWW7CNBHL7vXSMndGmnxghGns1rLjt2lX!-1490734837; __atuvc=1%7C43; mm_pc_HHonorsPoints=false; mm_pers_storage=loggedin%3Ano%7CStayDuration%3A1-2%20nights%7CDaysToBooking%3A0-1%7CSatStay%3Ano%7CChildren%3Ano%7CBrand%3Ahi%7CHHonorsPoints%3Afalse%7CFlexDates%3Afalse%7CPromoCode%3Ano%7Cproperties8%3Ano%7Chotelcode%3Ano; WT_FPC=id=6f7c5181-8354-430b-b943-9bb95bf2c75c:lv=1382490515002:ss=1382490495203
Form data
searchQuery=N.Y.C&arrivalDate=23+Oct+2013&departureDate=24+Oct+2013&_flexibleDates=on&numberOfRooms=1&numberOfAdults%5B0%5D=1&numberOfChildren%5B0%5D=0&numberOfAdults%5B1%5D=1&numberOfChildren%5B1%5D=0&numberOfAdults%5B2%5D=1&numberOfChildren%5B2%5D=0&numberOfAdults%5B3%5D=1&numberOfChildren%5B3%5D=0&promoCode=&groupCode=&corporateId=&_travelAgentRate=on&_aaaRate=on&_aarpRate=on&_seniorRate=on&_governmentRate=on&offerId=&bookButton=false&searchType=ALL&roomKeyEnable=true
And these are they keys (not values) that you need to send with each search query:
searchQuery
arrivalDate
departureDate
_flexibleDates
numberOfRooms
numberOfAdults%5B0%5D
numberOfChildren%5B0%5D
numberOfAdults%5B1%5D
numberOfChildren%5B1%5D
numberOfAdults%5B2%5D
numberOfChildren%5B2%5D
numberOfAdults%5B3%5D
numberOfChildren%5B3%5D
promoCode
groupCode
corporateId
_travelAgentRate
_aaaRate
_aarpRate
_seniorRate
_governmentRate
offerId
bookButton
searchType
roomKeyEnable
And this is what a python build of the whole thing would look like:
from socket import *
s = socket()
POST = 'searchQuery=N.Y.C&arrivalDate=23+Oct+2013&departureDate=......'
header = ''
header += 'POST /en_US/hi/search/findhotels/index.htm?WT.bid=Home,,,find_button HTTP/1.1\r\n'
header += 'Host: www3.hilton.com\r\n'
header += 'Connection: keep-alive\r\n'
header += 'Content-Length: ' + str(len(POST)) + '\r\n'
header += 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\n'
header += 'Origin: http://www3.hilton.com\r\n'
header += 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) \r\n'
header += 'Chrome/30.0.1599.101 Safari/537.36\r\n'
header += 'Content-Type: application/x-www-form-urlencoded\r\n'
header += 'Referer: http://www3.hilton.com/en/index.html\r\n'
header += 'Accept-Encoding: gzip,deflate,sdch\r\n'
header += 'Accept-Language: en-US,en;q=0.8\r\n'
header += '\r\n'
s.connect(('www3.hilton.com', 80))
s.send(header+POST)
print(s.recv(8192))
And that would give you what you're looking for.
Note that i left out the cookie field in the header.. normally the server doesn't care to much about it because it's an ancient technique and there for the server usually assumes that if no cookies was given, just give the client new ones and assume it's the first visit even tho (well technically it is but) it is'nt really the actual first page.
Also note:
All data in the POST variable must be URL encoded as such: urlencode(key)=urlencode(val)& for each pair. Note how i did not url-encode the = or the & in this raw POST data.
This is where you wouldn't need to worry about it because urllib does it for you. But this gives you a understanding of how HTTP requests work.

How to accept the GET request in Python socket application?

I am writing a very basic web server as a homework assignment and I have it running on localhost port 14000. When I browser to localhost:14000, the server sends back an HTML page with a form on it (the form's action is the same address - localhost:14000, not sure if that's proper or not).
Basically I want to be able to gather the data from the GET request once the page reloads after the submit - how can I do this? How can i access the stuff in the GET in general?
NOTE: I already tried socket.recv(xxx), that doesn't work if the page is being loaded first time - in that case we are not "receiving" anything from the client so it just keeps spinning.
The secret lies in conn.recv which will give you the headers sent by the browser/client of the request. If they look like the one I generated with safari you can easily parse them (even without a complex regex pattern).
data = conn.recv(1024)
#Parse headers
"""
data will now be something like this:
GET /?banana=True HTTP/1.1
Host: localhost:50008
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.53.11 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: keep-alive
"""
#A simple parsing of the get data would be:
GET={i.split("=")[0]:i.split("=")[1] for i in data.split("\n")[0].split(" ")[1][2:].split("&")}

Categories

Resources