Filling a strange webform in Python

I am trying to build a simple program that fills a web form and then extracts data from the resulting page. It should be pretty straightforward, and after a short web search (mostly on this website) I came to the conclusion that Python would be my best choice (with urllib).
I'll give a specific example of what I am trying to do, which will hopefully clarify things:
Let's say I want an automatic script that gets the price for a hotel; the website would be:
the Hilton website
On that page I want to fill in the query as follows: the "where are you going" field should have N.Y.C, plus an arrival date and a departure date.
From a browser, I would get the next page with the results for the query I just filled in, and from that page I want to scrape the prices.
Let's assume that scraping the data can be done from the HTML source code.
Well, here is how I thought to do it (a very general description):
import urllib
import urllib.request
url = 'http://www3.hilton.com/en/index.html'
query_args = { 'searchQuery':' New York, NY ', 'arrivalDate':'31 Oct 2013' , 'departureDate': '04 Nov 2013' }
print (query_args)
data = urllib.parse.urlencode(query_args)
print (data)
request = urllib.request.Request(url);
response = urllib.request.urlopen(request,data)
html = response.read()
print (html)
First of all, I get the following error when I send the request: "TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str." Does anyone know why?
If I got it right, I would expect to receive the response from the following URL: hilton_website with the urlencoded segment at the end (searchQuery=+New+York%2C+NY&arrivalDate=31+Oct+2013&departureDate=04+Nov+2013).
But there is no such page (when I type the full address into the browser's URL bar).
It looks like all the results (no matter what your query was or how you filled in the form) appear on the same page: hilton_web_site...en_US/hi/search/findhotels/results.htm?view=LIST
So what am I doing wrong? I thought that was the way to fill in a web form.
Thanks a lot for the help.

Here's a dump of Hilton's search form (which I instructed you how to get, but couldn't paste in full in the comment section, so here you go). I forgot to mention that you should check the "Form data" raw data in Chrome, which was my bad.. anyway..
RAW data (so you can understand how a POST request works):
Request Header
POST /en_US/hi/search/findhotels/index.htm?WT.bid=Home,,,find_button HTTP/1.1
Host: www3.hilton.com
Connection: keep-alive
Content-Length: 475
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Origin: http://www3.hilton.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Referer: http://www3.hilton.com/en/index.html
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Cookie: mmcore.tst=0.626; ClrCSTO=T; ClrSSID=1382526492268-11812; ClrOSSID=1382526492268-11812; ClrSCD=1382526492268; mmid=-1914085652%7CAQAAAAq0Q+DZtwkAAA%3D%3D; mmcore.pd=-1914085652%7CAQAAAAq0Q+DZtwkAAA%3D%3D; mmcore.srv=ldnvwcgus01; K3R7=3U24QMiCUbjeh65LoEI31TTjisY8czr4zkIUe06gsA4A5lc0bIKrEhQ; GWSESSIONID=qYDpSnSLNFnJ5CrJrwlSWW7CNBHL7vXSMndGmnxghGns1rLjt2lX!-1490734837; __atuvc=1%7C43; mm_pc_HHonorsPoints=false; mm_pers_storage=loggedin%3Ano%7CStayDuration%3A1-2%20nights%7CDaysToBooking%3A0-1%7CSatStay%3Ano%7CChildren%3Ano%7CBrand%3Ahi%7CHHonorsPoints%3Afalse%7CFlexDates%3Afalse%7CPromoCode%3Ano%7Cproperties8%3Ano%7Chotelcode%3Ano; WT_FPC=id=6f7c5181-8354-430b-b943-9bb95bf2c75c:lv=1382490515002:ss=1382490495203
Form data
searchQuery=N.Y.C&arrivalDate=23+Oct+2013&departureDate=24+Oct+2013&_flexibleDates=on&numberOfRooms=1&numberOfAdults%5B0%5D=1&numberOfChildren%5B0%5D=0&numberOfAdults%5B1%5D=1&numberOfChildren%5B1%5D=0&numberOfAdults%5B2%5D=1&numberOfChildren%5B2%5D=0&numberOfAdults%5B3%5D=1&numberOfChildren%5B3%5D=0&promoCode=&groupCode=&corporateId=&_travelAgentRate=on&_aaaRate=on&_aarpRate=on&_seniorRate=on&_governmentRate=on&offerId=&bookButton=false&searchType=ALL&roomKeyEnable=true
And these are the keys (not values) that you need to send with each search query (see the dict sketch after the list):
searchQuery
arrivalDate
departureDate
_flexibleDates
numberOfRooms
numberOfAdults%5B0%5D
numberOfChildren%5B0%5D
numberOfAdults%5B1%5D
numberOfChildren%5B1%5D
numberOfAdults%5B2%5D
numberOfChildren%5B2%5D
numberOfAdults%5B3%5D
numberOfChildren%5B3%5D
promoCode
groupCode
corporateId
_travelAgentRate
_aaaRate
_aarpRate
_seniorRate
_governmentRate
offerId
bookButton
searchType
roomKeyEnable
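For urllib, a sketch of these keys collected into a dict, with values copied from the raw form data above (the %5B0%5D sequences are just the URL-encoded form of [0], so the plain bracketed names are used here):

query_args = {
    'searchQuery': 'N.Y.C',
    'arrivalDate': '23 Oct 2013',
    'departureDate': '24 Oct 2013',
    'numberOfRooms': '1',
    'numberOfAdults[0]': '1',
    'numberOfChildren[0]': '0',
    # ... and so on for the remaining keys in the list above ...
    'bookButton': 'false',
    'searchType': 'ALL',
    'roomKeyEnable': 'true',
}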
And this is what a Python build of the whole thing would look like:
from socket import *

s = socket()
POST = 'searchQuery=N.Y.C&arrivalDate=23+Oct+2013&departureDate=......'
header = ''
header += 'POST /en_US/hi/search/findhotels/index.htm?WT.bid=Home,,,find_button HTTP/1.1\r\n'
header += 'Host: www3.hilton.com\r\n'
header += 'Connection: keep-alive\r\n'
header += 'Content-Length: ' + str(len(POST)) + '\r\n'
header += 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\n'
header += 'Origin: http://www3.hilton.com\r\n'
header += 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36\r\n'
header += 'Content-Type: application/x-www-form-urlencoded\r\n'
header += 'Referer: http://www3.hilton.com/en/index.html\r\n'
header += 'Accept-Encoding: gzip,deflate,sdch\r\n'
header += 'Accept-Language: en-US,en;q=0.8\r\n'
header += '\r\n'  # blank line separates the headers from the body
s.connect(('www3.hilton.com', 80))
s.send((header + POST).encode())  # sockets take bytes, not str, in Python 3
print(s.recv(8192))
And that would give you what you're looking for.
Note that I left out the Cookie field in the header. Normally the server doesn't care too much about it: if no cookies are given, it just hands the client new ones and assumes this is the first visit (which technically it is, even though it isn't really the actual first page).
Also note:
All data in the POST variable must be URL-encoded, as urlencode(key)=urlencode(val)& for each pair. Note how I did not URL-encode the = or the & in this raw POST data.
This is where urllib saves you the trouble, because it does the encoding for you. But this gives you an understanding of how HTTP requests work.
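For comparison, a minimal sketch of the same request via urllib.request, using the form's action URL from the dump above; this also shows the fix for the TypeError in the question: urlencode() returns a str, while the request body must be bytes, hence the .encode():

import urllib.parse
import urllib.request

url = 'http://www3.hilton.com/en_US/hi/search/findhotels/index.htm'
# Trimmed to three fields just to show the encoding; the form expects the
# full set of keys listed above.
query_args = {'searchQuery': 'N.Y.C', 'arrivalDate': '23 Oct 2013', 'departureDate': '24 Oct 2013'}
data = urllib.parse.urlencode(query_args).encode('ascii')  # str -> bytes
request = urllib.request.Request(url, data)  # a request with a body is sent as POST
html = urllib.request.urlopen(request).read()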

Related

Python socket - downloading files only works in Chrome

So I created a program where a client uploads a file to the server folder and then has an option to download it back. It works perfectly fine in Chrome: I click on the item I want to download and it downloads it.
def send_image(request, cs):
    request = request.split('=')
    try:
        name = request[1]
    except:
        name = request[0]
    print('using send_image!')
    print('Na ' + name)
    path = 'C:\\Users\\x\\Desktop\\webroot\\uploads' + '\\file-name=' + name
    print(path)
    with open(path, 'rb') as re:
        print('exist!')
        read = re.read()
        cs.send(read)
The code above reads the file that you chose and sends the data back to the client as bytes.
In Chrome it downloads the file, as I showed already, but in Internet Explorer, for example, it just prints the data to the client and doesn't download it. The real question is: why does Chrome download it instead of printing it the way Internet Explorer does, and how can I fix it? (For your info: all the files that I download have the name file-name in front of them; that's why I put it in the path.)
UPDATE: here is the HTTP request:
POST /upload?file-name=Screenshot_2.png HTTP/1.1
Host: 127.0.0.1
Connection: keep-alive
Content-Length: 3534
Accept: */*
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
Content-Type: application/octet-stream
Origin: http://127.0.0.1
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Referer: http://127.0.0.1/
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en;q=0.9,en-US;q=0.8,he;q=0.7
It looks like you don't send an HTTP/1 response but an HTTP/0.9 response (note that I'm talking about the response sent from the server, not the request sent from the client). An HTTP/1 response consists of an HTTP header and an HTTP body, similar to how an HTTP request is constructed. An HTTP/0.9 response instead consists only of the actual body, i.e. no header and thus no meta information telling the browser what to do with the body.
HTTP/0.9 has been obsolete for 25 years, but some browsers still support it. When a browser gets an HTTP/0.9 response it could do anything with it, since without an HTTP header there is no defined meaning. Browsers might try to interpret it as HTML, as plain text, offer it for download, refuse it entirely... whatever.
The way to fix the problem is to send an actual HTTP response header before sending the body, i.e. something like this (as a bytes literal, since the socket expects bytes):
cs.send(b"HTTP/1.0 200 ok\r\nContent-type: application/octet-stream\r\n\r\n")
with open(path, 'rb') as re:
    ...
    cs.send(read)
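If the goal is specifically to make every browser download the file instead of displaying it, the usual way to ask for that is a Content-Disposition header; a minimal sketch, reusing the name variable from send_image above:

header = ('HTTP/1.0 200 OK\r\n'
          'Content-Type: application/octet-stream\r\n'
          'Content-Disposition: attachment; filename="' + name + '"\r\n'
          '\r\n')
cs.send(header.encode())
with open(path, 'rb') as re:
    cs.send(re.read())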
In any case: HTTP is way more complex than you might think. There are established libraries to deal with this complexity. If you insist on not using any library, please study the standard in order to avoid such problems.

How do I log in to this site with scrapy shell and Python - 401 error?

I'm trying to log in to this website, seeking.com/login, through scrapy shell. I also installed Burp Suite to analyze its URL, headers, etc.
from scrapy.http import FormRequest
frmdata = {"captcha":"","email":"MYEMAIL.com","password":"MY_PASSWORD","is_rememberme":"0","locale":"en_US","auth_type":"bearer_token","date":"2018-12-13T09:56:22.957Z"}
url = "https://www.seeking.com/v3/auth/login"
r = FormRequest(url, formdata=frmdata)
fetch(r)
With this code I get an HTTP 401 error, which as far as I can tell is essentially an authentication error.
I forwarded the calls through Burp Suite and got the following intercept:
POST /v3/auth/login HTTP/1.1
Host: www.seeking.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0)
Gecko/20100101 Firefox/63.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: https://www.seeking.com/login?fromLogout=1
Content-Type: application/json;charset=utf-8
Web-Version: 3.59
Authorization: Basic NTI2ZTIwYzExMDI0NDYzNTk5OTI5MzUwZThiNWEzMTI6bHN0emd4ZzpSRzRzS3VmdEJMRTQxMm92TnMxbDR6L0ZkZ1dESHZuM2wwZWxtYWhyMGtnPQ==
Content-Length: 166
Connection: close
Cookie: __cfduid=dcf9fd66583d55382f362c18a83d904ca1544519479;
_gcl_au=1.1.2035701377.1544519485; _ga=GA1.2.1740241044.1544519486;
com.silverpop.iMAWebCookie=e88c45d1-3c24-11c6-089e-e287aae2c678;
__cfruid=3eebbdc1e401ed560c23a7c474c41e59b2e93018-1544520179;
device_cookie=1; __gads=ID=a1e437c03ddad1b3:T=1544519579:S=ALNI_MYb30xY4z76J4NniCK_ZtOyOdPMKA;_lb_user=gfpuzje6kg; seeking_session=eyJpdiI6Im4yMTNJNVNRZjkxbnZzMmNpYnQ4dkE9PSIsInZhbHVlIjoiVGhGVUJDejc1dElJbEwxekh5d2hXUnhjeDlpVWR2dW9IWWJqeDZvRmI3VU9Pc1lpZXZGWGJxejQ1alNXbGVXUGJqaEpORU9LNFJITVh0N3IwR1E0bUE9PSIsIm1hYyI6IjUyODU3MWIxYjM3MGU3M2E0YjI1YzM2MzNmNDc5ZDMzZDdjYTg1ZWMxYWU2ODJjY2JlMTJmZWJlNmUyZDkyNWMifQ%3D%3D

{"captcha":"","email":"MYEMAIL","password":"MYPASS","is_rememberme":0,"locale":"en_US","auth_type":"bearer_token","date":"2018-12-14T09:15:56.016Z"}
I am completely new to this and have spent 2 days trying to figure out what I need to pass to this POST to log in.
My questions are:
1) Based on this intercept, what should my request via FormRequest look like?
2) I see there are cookies/authorization tokens being passed in the POST (an Authorization token that changes with each POST, session cookies, etc.). Where do they come from? How do I get them when I am scraping so that I can successfully log in?
3) Do I need to store these session variables when scraping other pages on the site after login? Is there anything special I need to do to stay logged in to access other pages?
It looks like the login page is expecting to be passed JSON data, and not a url-encoded string (which is what FormRequest will create).
Something like this should work:
import json
import scrapy

r = scrapy.Request(
    url=url,
    method='POST',
    body=json.dumps(frmdata),
    headers={'Content-Type': 'application/json'},
)
The tokens, cookies, etc. are probably created when you initially request the login page, so you might need to request the login page before trying to log in.
It is possible that some of it is generated with javascript (haven't checked), so you might need to dig through the js code to figure out what's going on, or even execute the js yourself (e.g. using a browser).
Scrapy will keep track of your session for you, so there's nothing you need to do to stay logged in.
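For questions 2 and 3, a rough sketch of how that flow could look inside a spider (the spider class, the after_login callback, and the hard-coded field values are illustrative assumptions, not confirmed details of this site):

import json
import scrapy

class SeekingSpider(scrapy.Spider):
    name = 'seeking'
    start_urls = ['https://www.seeking.com/login']

    def parse(self, response):
        # The login page response sets the initial cookies; Scrapy's
        # cookie middleware carries them over to the next request.
        frmdata = {"captcha": "", "email": "MYEMAIL.com", "password": "MY_PASSWORD",
                   "is_rememberme": "0", "locale": "en_US", "auth_type": "bearer_token"}
        yield scrapy.Request(
            url='https://www.seeking.com/v3/auth/login',
            method='POST',
            body=json.dumps(frmdata),
            headers={'Content-Type': 'application/json'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # From here, further requests reuse the logged-in session.
        pass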

How to manipulate the content body of the GET request in Python

I know there are tons of ways to add headers or cookies, something like this. But what I want to do is to add "\r\n" at the top of the request so that it looks like the following body.
Request Body >>
\r\n <-- technically invisible..
GET /path/ HTTP/1.1
Host: www.website.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22
Referer: https://www.google.com/
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,ko;q=0.6
Accept-Charset: windows-949,utf-8;q=0.7,*;q=0.3
\r\n is added on the first line of the GET request as you can see.
It's like adding an empty line.
How can I do this in Python?
I've spent hours on this topic but couldn't find any useful resources.
===================== ADD ============================================
It's about hacking.
In South Korea, the government restricts some sites, but the filters preventing users from connecting to those sites can easily be evaded just by adding "\r\n" at the top of the request body.
httplib2, httplib, urllib, urllib2, etc., etc.: whichever library is used, I just need to add "\r\n" to the request body.
You could do this by monkeypatching the httplib.HTTPConnection class; urllib, urllib2, requests etc. all use that class to handle the low-level HTTP conversation.
The easiest is to patch the HTTPConnection._output() method to insert the extra characters before the line that carries the HTTP version string:
from httplib import HTTPConnection, _CS_REQ_STARTED

orig_output = HTTPConnection._output

def add_initial_newline_output(self, s):
    if (self._HTTPConnection__state == _CS_REQ_STARTED and
            s.endswith(self._http_vsn_str) and not self._buffer):
        self._buffer.append('')  # will insert extra \r\n
    orig_output(self, s)

HTTPConnection._output = add_initial_newline_output
This will only insert the extra starting empty line when the connection is in the correct state (request started), the line ends with the current HTTP version string, and the buffer is still empty.
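Once patched, anything that goes through httplib picks the change up transparently; a quick usage sketch (Python 2, since httplib is the Python 2 name of the module; the URL is just the placeholder from the dump above):

import urllib2

# After applying the patch, the request goes out with an extra
# blank line before the request line.
response = urllib2.urlopen('http://www.website.com/path/')
print(response.getcode())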

Running a Stack Overflow query from Python

I'm working on improving my answer to a question on Meta Stack Overflow. I want to run a search on some Stack Exchange site and detect whether I got any results. For example, I might run this query. When I run the query through my browser, I don't see the string "Your search returned no matches" anywhere in the html I get. But when I run this Python code:
"Your search returned no matches" in urllib2.urlopen("https://math.stackexchange.com/search?q=user%3Ame+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22+").read()
I get True, and in fact the string contains a page that is clearly different from the one I get in my browser. How can I run the search in a way that gets me the same result I get when running the query in the normal, human way (from a browser)?
UPDATE: here's the same thing done with requests, as suggested by @ThiefMaster. Unfortunately it gets the same result.
"Your search returned no matches" in requests.get("https://math.stackexchange.com/search?q=user%3Ame+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22").text
I used Firebug to view the header of the GET that runs when I run the search from my browser. Here it is:
GET /search?q=user%3A128043+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22 HTTP/1.1
Host: math.stackexchange.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: https://math.stackexchange.com/search?q=user%3A128043+hasaccepted%3Ano+answers%3A1+lastactive%3A2013-12-24..2014-02-22
Cookie: __qca=P0-1687127815-1387065859894; __utma=27693923.779260871.1387065860.1393095534.1393101885.19; __utmz=27693923.1393095534.18.10.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/users/2829764/kuzzooroo; _ga=GA1.2.779260871.1387065860; mathuser=t=WM42SFDA5Uqr&s=OsFGcrXrl06g; sgt=id=bedc99bd-5dc9-42c7-85db-73cc80c4cc15; __utmc=27693923
Connection: keep-alive
Running requests.get with various pieces of this header didn't work for me, though I didn't try everything, and there are lots of possibilities.
Some sites generate different results depending on which client connects to them. I do not know whether this is the case with Stack Overflow, but I have noticed it with wikis.
Here is what I do to pretend I am an Opera browser:
import urllib

def openAsOpera(url):
    u = urllib.URLopener()
    u.addheaders = []
    u.addheader('User-Agent', 'Opera/9.80 (Windows NT 6.1; WOW64; U; de) Presto/2.10.289 Version/12.01')
    u.addheader('Accept-Language', 'de-DE,de;q=0.9,en;q=0.8')
    u.addheader('Accept', 'text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/webp, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1')
    f = u.open(url)
    content = f.read()
    f.close()
    return content
Surely you can adapt this to pretend the client is Firefox.
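For Firefox, the same function with the User-Agent and Accept values swapped in from the GET dump above would be a reasonable sketch:

import urllib

def openAsFirefox(url):
    u = urllib.URLopener()
    u.addheaders = []
    u.addheader('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0')
    u.addheader('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
    u.addheader('Accept-Language', 'en-US,en;q=0.5')
    f = u.open(url)
    content = f.read()
    f.close()
    return content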

How to accept a GET request in a Python socket application?

I am writing a very basic web server as a homework assignment and I have it running on localhost port 14000. When I browse to localhost:14000, the server sends back an HTML page with a form on it (the form's action is the same address, localhost:14000; not sure if that's proper or not).
Basically I want to be able to gather the data from the GET request once the page reloads after the submit. How can I do this? How can I access the stuff in the GET in general?
NOTE: I already tried socket.recv(xxx); that doesn't work if the page is being loaded for the first time, since in that case we are not "receiving" anything from the client, so it just keeps spinning.
The secret lies in conn.recv, which will give you the headers sent by the browser/client with the request. If they look like the ones I generated with Safari, you can easily parse them (even without a complex regex pattern).
data = conn.recv(1024)
# Parse headers
"""
data will now be something like this:
GET /?banana=True HTTP/1.1
Host: localhost:50008
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.53.11 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: keep-alive
"""
# A simple parsing of the GET data: take the request line, grab the path
# ("/?banana=True"), strip the leading "/?", then split the key=value pairs on "&".
GET = {i.split("=")[0]: i.split("=")[1] for i in data.split("\n")[0].split(" ")[1][2:].split("&")}
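A more robust sketch of the same parsing using the standard library (assuming Python 3, where recv returns bytes and therefore needs a decode first; note that parse_qs returns a dict of lists):

from urllib.parse import urlparse, parse_qs

request_line = data.decode().split("\r\n")[0]  # e.g. 'GET /?banana=True HTTP/1.1'
path = request_line.split(" ")[1]              # '/?banana=True'
GET = parse_qs(urlparse(path).query)           # {'banana': ['True']}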
