Sending a POST request on an already open URL in Python

Basically I want to send a POST request for the following form.
<form method="post" action="">
449 * 803 - 433 * 406 = <input size=6 type="text" name="answer" />
<input type="submit" name="submitbtn" value="Submit" />
</form>
What I basically want to do is read through the page, find out the equation in the form, calculate the answer, enter the answer as parameter to send with the POST request, but without opening a new URL for the page, as a new equation comes up every time the page is opened, hence the previously obtained result becomes obsolete. Finally I want to obtain the page that comes up as a result of sending the POST request. I'm stuck at the part where I have to send a POST request without opening a new URL instance. Also, I would appreciate help on how to read through the page again after the POST request. (would calling read() suffice?)
The python code I have currently looks something like this.
import urllib, urllib2
link = "http://www.websitetoaccess.com"
f = urllib2.urlopen(link)
line = f.readline().strip()
equation = ''
result = ''
file1 = open('firstPage.html', 'w')
file2 = open('FinalPage.html', 'w')
for line in f:
    if 'name="answer"' in line:
        result = getResult(line)
    file1.write(line)
file1.close()
raw_params = {'answer': str(result), 'submit': 'Submit'}
params = urllib.urlencode(raw_params)
request = urllib2.Request(link, params)
page = urllib2.urlopen(request)
file2.write(page.read())
file2.close()

Yeah, that last link really helped, turns out I just needed to create a new session from requests like so:
s = requests.session()
res1 = s.get(url)
And add this as the post request after
res2 = s.post(url, data=post_params)
I believe this achieves the result of storing the cookies from the get request and sending them with the post request, thus maintaining the same question as the previous get request. Many thanks for your help and assistance in this problem Loknar.
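The whole flow (GET the page, read the equation, POST the answer on the same session) can be sketched roughly as follows. This is only an illustration under assumptions: `extract_expression`, `evaluate` and `answer_challenge` are hypothetical helper names, the page is assumed to contain the form exactly as shown in the question, and `session` is assumed to be a `requests.Session()` (any object with compatible `get` and `post` methods would do). The submit field is sent as `submitbtn` because that is the name in the form's HTML.

```python
import ast
import operator
import re

# Arithmetic operators the evaluator is willing to apply.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


def extract_expression(html):
    """Return the arithmetic text that precedes '= <input ... name="answer"'."""
    match = re.search(r'([0-9][0-9\s*+\-]*?)\s*=\s*<input[^>]*name="answer"', html)
    if match is None:
        raise ValueError("no equation found on the page")
    return match.group(1).strip()


def evaluate(expression):
    """Evaluate +, - and * over integers without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


def answer_challenge(session, url):
    """GET the page, solve its equation, and POST the answer on the same session."""
    page = session.get(url)  # cookies set by the server stay on the session
    result = evaluate(extract_expression(page.text))
    payload = {"answer": str(result), "submitbtn": "Submit"}
    return session.post(url, data=payload)  # the same cookies are sent back
```

With requests installed, this would be called as `answer_challenge(requests.Session(), 'http://www.websitetoaccess.com')`, and the `.text` of the returned response would hold the result page.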

I'm a bit puzzled, the POST request will always be a new separate request so I don't understand what you mean by "without opening a new URL instance" ... have you tried taking a look at what happens when you do what you're trying to do in this script manually? Like open developer console in Chrome, go to the network tab, toggle preserve log to on, delete history, and do what you're trying to do manually? Then replicate that in python? Also I recommend you try out the requests module, it makes things simpler than using urllib. Simply pip install requests (and pip install lxml).
import requests
from lxml import etree
url = 'http://www.websitetoaccess.com'
res1 = requests.get(url)
# do something with res1.content
# you could try parsing the html page with lxml
root = etree.fromstring(res1.content, etree.HTMLParser())
# do something with root, find question and calc answer?
post_params = {'answer': str(42), 'submitbtn': 'Submit'}  # field names as they appear in the form
res2 = requests.post(url, data=post_params)
# check res2 for success or content?
edit:
You're possibly experiencing some header issue or cookies issue. You might be receiving some session ID which enables the server to determine what question you received in the previous GET request. The POST request is a separate request from the previous GET request, it can't be combined to one single request. You should check the headers received from the previous GET request and/or try setting up session/cookies handling (easy to do if using requests, see https://requests.readthedocs.io/en/master/user/advanced/).

Related

Python request post doesn't get redirected

When I use Chrome to post a form on this website: "http://xh.5156edu.com/index.php", I get redirected to a new page. However, when I use the Python requests module to do the post, like this:
r = requests.post("http://xh.5156edu.com/index.php", data="f_key=%B7%AB&SearchString.x=0&SearchString.y=0")
the status code is 200 and the content is not what I want. I'm sure the data is the same as the data sent by Chrome. I can't understand what's wrong with the script. I also tried adding some headers, which didn't work either.
What you're passing as data are actually query parameters.
This is what you need:
import requests
params = {'f_key': '%B7%AB', 'SearchString.x': '0', 'SearchString.y': '0'}
(r := requests.post("http://xh.5156edu.com/index.php", params=params)).raise_for_status()
with open('x.html', 'w') as html:
    html.write(r.text)
You can then open x.html to view the response
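One thing worth checking with either `data=` or `params=` is how the value gets encoded. The sketch below uses only the standard library and rests on the assumption that requests encodes dictionaries with the same `urllib.parse.urlencode` routine. It shows that a string which already contains percent escapes is escaped a second time, while raw bytes are escaped exactly once. Whether the server wants the raw bytes `b'\xb7\xab'` is also an assumption and has not been checked against the site.

```python
from urllib.parse import urlencode

# A value that is already percent-encoded is encoded again: '%' becomes '%25'.
pre_encoded = urlencode({"f_key": "%B7%AB"})

# Raw bytes are percent-encoded exactly once.
raw_bytes = urlencode({
    "f_key": b"\xb7\xab",
    "SearchString.x": "0",
    "SearchString.y": "0",
})

print(pre_encoded)  # f_key=%25B7%25AB
print(raw_bytes)    # f_key=%B7%AB&SearchString.x=0&SearchString.y=0
```

If the double-encoded form turns out to be the problem, passing the bytes value in the dictionary is the likely fix.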

How to post data to a website that immediately redirects you to another one

I want to batch work on this website, but it doesn't provide a batch mode for the user, so I am thinking about using python to submit tasks.
I am not really familiar with web scraping in Python. I watched several videos on YouTube and also checked many posts here, and I can successfully log in to some websites by clicking through in the browser, inspecting the elements, and going to the Network tab to see what data I should put in the POST request.
However, this website, after you submit a task, will immediately open a new url for you, and there is no sign of any POST in the Network flow. I have already spent hours trying but still don't know how to tackle this site. Can anyone help me with this?
Here in the data dictionary I have erased the email, and you can put in your own email address. If you successfully post a task to this server, you should get an email informing you when it's finished.
import requests
url1 = 'http://rna.physics.missouri.edu/vfold3D/index.html'
sequence = 'UCGGACCAUCAGGAGAAAUCCAAUGGAAAACAGGGAAACCCUAAAAGCAAUUUUGGAAGUUUAAAACCGA'
bps = '.((((((((..(((....))).)))).((((.(((...))).((((....))))....))))....))))'
jobname = 'A trial'
data = {}
data['sequence'] = sequence
data['bps'] = bps
data['jobname'] = jobname
data['email'] = '' # give an email address to receive the result
req1 = requests.post(url1, data=data)
print(req1.status_code)
The status code is 200, but I receive no email, so I don't think I successfully posted anything to it.
As I have said, I don't know what should be the correct data that should be sent to the server, since I didn't see any trace of POST in the Network flow, and I never learned anything about html and so know nothing about the structure of this website...
When you print the response body (req1.text) for the request with jobname = 'A trial', you get an error page saying that the job name contains spaces or other invalid characters:
Please wait...<br>
<br>
Your input jobname: <br>
A trial
<br>
<br>
<font color="red" size="3">contains non-alphanumeric characters.</font><br>
<br>
</body>
</html>
You must remove those spaces.
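A job name can be cleaned up before it is submitted. Here is a minimal sketch, assuming the server accepts only ASCII letters and digits (inferred from the "non-alphanumeric" message, the exact rule is not documented here) and using a hypothetical helper name:

```python
import re


def sanitize_jobname(name):
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", name)


print(sanitize_jobname("A trial"))  # Atrial
```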
The following script works and gives you the URL to check the results:
import re
import requests
server_url = 'http://rna.physics.missouri.edu/vfold3D/3D_run.pl'
sequence = 'UCGGACCAUCAGGAGAAAUCCAAUGGAAAACAGGGAAACCCUAAAAGCAAUUUUGGAAGUUUAAAACCGA'
bps = '.((((((((..(((....))).)))).((((.(((...))).((((....))))....))))....))))'
jobname = 'Atrial'
data = {
    'sequence': sequence,
    'bps': bps,
    'jobname': jobname,
    'email': ''
}
res = requests.post(server_url,
                    data=data,
                    headers={'referer': 'http://rna.physics.missouri.edu/vfold3D/index.html'})
result_url = re.search('<META HTTP-EQUIV=refresh CONTENT="0;URL=([^"]+)', res.text).group(1)
print(result_url)
output:
http://rna.physics.missouri.edu/OUTPUT/3D_Atrial.E6SY.html
You can then visit that URL and get the .pdb file:
import requests
import re
from urllib.parse import urljoin
res = requests.get('http://rna.physics.missouri.edu/OUTPUT/3D_Atrial.E6SY.html')
pdb_path = re.search(r'<a href="(.*\.pdb)">', res.text).group(1)
pdb_url = urljoin(res.url, pdb_path)
print(pdb_url)
output:
http://rna.physics.missouri.edu/OUTPUT/3D_Atrial.E6SY.3d_struct.pdb
Note: Since these pages don't look like they will be redesigned anytime soon, and have relatively simple structure, using re to scrape off some URL is perfectly OK. But using BeautifulSoup or another HTML parser is the proper way.
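For comparison, the same two lookups can be done with a real HTML parser. The sketch below uses the standard library's `html.parser` rather than BeautifulSoup so that it has no dependencies. `LinkExtractor` and `extract_links` are hypothetical names, and the page layout (a meta refresh tag and a plain `<a href="...pdb">` link) is assumed from the regular expressions above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the meta-refresh target and every link ending in .pdb."""

    def __init__(self):
        super().__init__()
        self.refresh_url = None
        self.pdb_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            # CONTENT looks like "0;URL=http://..."; keep what follows the first "=".
            _, _, target = (attrs.get("content") or "").partition("=")
            self.refresh_url = target.strip() or None
        elif tag == "a" and (attrs.get("href") or "").endswith(".pdb"):
            self.pdb_links.append(attrs["href"])


def extract_links(html, base_url):
    """Return (refresh URL or None, list of absolute .pdb URLs)."""
    parser = LinkExtractor()
    parser.feed(html)
    refresh = urljoin(base_url, parser.refresh_url) if parser.refresh_url else None
    return refresh, [urljoin(base_url, link) for link in parser.pdb_links]
```

Called as `extract_links(res.text, res.url)`, it returns absolute URLs, so no separate `urljoin` step is needed afterwards.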

Python: urllib2 get nothing which does exist

I'm trying to crawl my college website and I set cookie, add headers then:
homepage=opener.open("website")
content = homepage.read()
print content
Sometimes I get the source code, but sometimes I get nothing at all.
I can't figure out what is happening.
Is my code wrong?
Or is the website the cause?
Can a single geturl() call follow two or more redirects?
redirect = urllib2.urlopen(info_url)
redirect_url = redirect.geturl()
print redirect_url
It sometimes returns the final URL, but sometimes gives me an intermediate one.
Rather than working around redirects with urlopen, you're probably better off using the more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
import requests
r = requests.get('website', allow_redirects=True)
print(r.text)
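To see which hops were actually followed, the response's `history` list and final `url` can be inspected. Here is a small sketch with a hypothetical helper, `redirect_chain`, that accepts a requests response (or any object with the same `history`, `status_code` and `url` attributes):

```python
def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops
```

It would be used as `redirect_chain(requests.get(info_url))`. Only HTTP-level (3xx) redirects show up this way. If the site redirects through a meta refresh tag or JavaScript, the chain stops at that page, which may be why an intermediate URL sometimes comes back.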

Trying to access a password protected url using python

I've had problems accessing www.bizi.si or more specifically
http://www.bizi.si/BALMAR-D-O-O/ for instance. If you look at it without registering you won't see any financial data. But if you use the free registration, I used username: Lukec, password: lukec12345, you can see some of the financial data. I've used this next code:
import urllib.parse
import urllib.request
import re
import csv
username = 'Lukec'
password = 'lukec12345'
url = 'http://www.bizi.si/BALMAR-D-O-O/'
values = {'username':username, 'password':password}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url,data,values)
resp = urllib.request.urlopen(req,data)
respData = resp.read()
paragraphs = re.findall('<tbody>(.*?)</tbody>',str(respData))
And my len(paragraphs) is zero. I would be really grateful if anyone could tell me how to access the page correctly. I know that a length of zero isn't the best indicator, but len(respData) is also the same whether or not I pass values as in my code, so I know I have not accessed the page with the username and password.
Thank you for your help in advance and have a nice day.
There are two issues here:
You are not using POST, but GET for the request.
There is no <tbody> element in the HTML produced; any such tags have been automatically added by your browser, do not rely on them being there.
To create a POST request, use:
req = urllib.request.Request(url, data, method='POST')
resp = urllib.request.urlopen(req)
Note that I removed the values argument (the third positional argument to Request() is for headers, and these values are not headers), and that you don't pass a separate data argument to urlopen() when using a Request object.
The resulting HTML returned does not necessarily include the same data that is sent to a browser; you probably need to maintain a session here, return the cookies that the site sets.
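Staying with the standard library, cookies can be kept across requests by routing everything through an opener that owns a cookie jar. This is a minimal sketch. The field names `username` and `password` are copied from the question and are probably not what the site really expects, and `build_login_request` is a hypothetical helper.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# An opener that stores cookies from every response and sends them back later.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))


def build_login_request(url, username, password):
    """Build an explicit POST carrying a form-encoded username and password."""
    data = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(url, data.encode("utf-8"), method="POST")


req = build_login_request("http://www.bizi.si/BALMAR-D-O-O/", "user", "secret")
print(req.get_method())  # POST
# resp = opener.open(req)           # would send the request and fill cookie_jar
# later = opener.open(another_url)  # would send the stored cookies back
```

The network calls are left commented out so that the snippet can be run without contacting the site.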
It is far easier to do this with better tools such as the requests library and BeautifulSoup (the latter lets you parse HTML without having to resort to regular expressions), which can be combined with the robobrowser project to help you fill and submit forms on websites.
Note however that the page forms and state are managed by ASP.NET JavaScript code and are not easily reverse-engineered even by robobrowser. When you log in with a browser (which has run the JavaScript code for you), the POST looks like this:
{'__EVENTTARGET': ['ctl00$ctl00$loginBoxPopup$loginBox1$ButtonLogin'],
'__VSTATE': [''],
'ctl00$ctl00$SearchAdvanced1$ActivitiesAndProductsSearch1$RadioButtonList1': ['TSMEDIA '
'dejavnost'],
'ctl00$ctl00$SearchAdvanced1$DropDownListYearSelection': ['2013'],
'ctl00$ctl00$SearchAdvanced1$SteviloZaposlenihDo': ['do'],
'ctl00$ctl00$SearchAdvanced1$SteviloZaposlenihOd': ['od'],
'ctl00$ctl00$SearchAdvanced1$ddlLegalEvents': ['0'],
'ctl00$ctl00$loginBoxPopup$loginBox1$Password': ['lukec12345'],
'ctl00$ctl00$loginBoxPopup$loginBox1$UserName': ['Lukec'],
'ctl00_ctl00_ScriptManager1_HiddenField': [';;AjaxControlToolkit, '
'Version=3.5.40412.0, '
'Culture=neutral, '
'PublicKeyToken=28f01b0e84b6d53e:sl:1547e793-5b7e-48fe-8490-03a375b13a33:475a4ef5:effe2a26:3ac3e789:5546a2b:d2e10b12:37e2e5c9:5a682656:12bbc599:1d3ed089:497ef277:a43b07eb:751cdd15:dfad98a5:3cf12cf1'],
'hiddenInputToUpdateATBuffer_CommonToolkitScripts': ['1']}
That's a lot more information than a simple username / password combination.
See post request using python to asp.net page for approaches on how to handle such pages instead.
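One common way to get closer to what the browser sends is to read every named `<input>` from the page first and then overwrite only the fields you care about. Here is a sketch using the standard library. `FormFieldCollector` and `collect_form_fields` are hypothetical names, and fields that JavaScript fills in at submit time (such as `__EVENTTARGET` above, presumably) are not in the HTML and have to be supplied by hand.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of every <input> that carries a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if attrs.get("name"):
            # Inputs without a value attribute are recorded as empty strings.
            self.fields[attrs["name"]] = attrs.get("value") or ""


def collect_form_fields(html, overrides):
    """Return the page's input fields with `overrides` applied on top."""
    collector = FormFieldCollector()
    collector.feed(html)
    return {**collector.fields, **overrides}
```

The resulting dictionary can be passed as `data=` to a session's `post()` call, so hidden state fields travel along with the credentials.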

How to "log in" to a website using Python's Requests module?

I am trying to post a request to log in to a website using the Requests module in Python but it's not really working. I'm new to this... so I can't figure out if I should make my Username and Password cookies or some type of HTTP authorization thing I found (??).
from pyquery import PyQuery
import requests
url = 'http://www.locationary.com/home/index2.jsp'
So now, I think I'm supposed to use "post" and cookies....
ck = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}
r = requests.post(url, cookies=ck)
content = r.text
q = PyQuery(content)
title = q("title").text()
print title
I have a feeling that I'm doing the cookies thing wrong...I don't know.
If it doesn't log in correctly, the title of the home page should come out to "Locationary.com" and if it does, it should be "Home Page."
If you could maybe explain a few things about requests and cookies to me and help me out with this, I would greatly appreciate it. :D
Thanks.
...It still didn't really work yet. Okay...so this is what the home page HTML says before you log in:
</td><td><img src="http://www.locationary.com/img/LocationaryImgs/icons/txt_email.gif"> </td>
<td><input class="Data_Entry_Field_Login" type="text" name="inUserName" id="inUserName" size="25"></td>
<td><img src="http://www.locationary.com/img/LocationaryImgs/icons/txt_password.gif"> </td>
<td><input class="Data_Entry_Field_Login" type="password" name="inUserPass" id="inUserPass"></td>
So I think I'm doing it right, but the output is still "Locationary.com"
2nd EDIT:
I want to be able to stay logged in for a long time and whenever I request a page under that domain, I want the content to show up as if I were logged in.
I know you've found another solution, but for those like me who find this question, looking for the same thing, it can be achieved with requests as follows:
Firstly, as Marcus did, check the source of the login form to get three pieces of information - the url that the form posts to, and the name attributes of the username and password fields. In his example, they are inUserName and inUserPass.
Once you've got that, you can use a requests.Session() instance to make a post request to the login url with your login details as a payload. Making requests from a session instance is essentially the same as using requests normally, it simply adds persistence, allowing you to store and use cookies etc.
Assuming your login attempt was successful, you can simply use the session instance to make further requests to the site. The cookie that identifies you will be used to authorise the requests.
Example
import requests
# Fill in your details here to be posted to the login form.
payload = {
    'inUserName': 'username',
    'inUserPass': 'password'
}
# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post('LOGIN_URL', data=payload)
    # print the html returned or something more intelligent to see if it's a successful login page.
    print(p.text)
    # An authorised request.
    r = s.get('A protected web page url')
    print(r.text)
    # etc...
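As one example of "something more intelligent", the page title can be compared with the title expected after a successful login. This sketch uses hypothetical names (`TitleParser`, `looks_logged_in`) and assumes, based on the question, that the logged-in page is titled "Home Page".

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Accumulate the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_logged_in(html, expected_title="Home Page"):
    """Return True when the page title matches the expected logged-in title."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() == expected_title
```

Inside the session block it would be called as `looks_logged_in(p.text)`.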
If the information you want is on the page you are directed to immediately after login...
Let's call your ck variable payload instead, like in the python-requests docs:
payload = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}
url = 'http://www.locationary.com/home/index2.jsp'
requests.post(url, data=payload)
Otherwise...
See https://stackoverflow.com/a/17633072/111362 below.
Let me try to make it simple. Suppose the URL of the site is http://example.com/ and you need to log in by filling in a username and password. Go to the login page, say http://example.com/login.php, view its source code, and search for the action URL. It will be in the form tag, something like:
<form name="loginform" method="post" action="userinfo.php">
Now combine userinfo.php with the site address to get the absolute URL, 'http://example.com/userinfo.php', and run a simple Python script:
import requests
url = 'http://example.com/userinfo.php'
values = {'username': 'user',
          'password': 'pass'}
r = requests.post(url, data=values)
print(r.content)
I hope that this helps someone somewhere someday.
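The step of turning a form's action attribute into an absolute URL can be left to `urllib.parse.urljoin`, which resolves it against the address of the page that contains the form. Here is a short sketch with a hypothetical helper name:

```python
from urllib.parse import urljoin


def resolve_action(page_url, action):
    """Resolve a form's action attribute against the page it appears on."""
    return urljoin(page_url, action)


login_page = "http://example.com/login.php"
print(resolve_action(login_page, "userinfo.php"))  # http://example.com/userinfo.php
print(resolve_action(login_page, "/auth/login"))   # http://example.com/auth/login
print(resolve_action(login_page, ""))              # http://example.com/login.php
```

The last line covers forms with an empty action, which post back to the page itself.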
The requests.Session() solution assisted with logging into a form with CSRF Protection (as used in Flask-WTF forms). Check if a csrf_token is required as a hidden field and add it to the payload with the username and password:
import requests
from bs4 import BeautifulSoup
payload = {
    'email': 'email@example.com',
    'password': 'passw0rd'
}
# server_name is assumed to hold the site's base URL (it is not defined in this snippet).
with requests.Session() as sess:
    res = sess.get(server_name + '/signin')
    signin = BeautifulSoup(res.content, 'html.parser')
    payload['csrf_token'] = signin.find('input', id='csrf_token')['value']
    res = sess.post(server_name + '/auth/login', data=payload)
Find out the name of the inputs used on the websites form for usernames <...name=username.../> and passwords <...name=password../> and replace them in the script below. Also replace the URL to point at the desired site to log into.
login.py
#!/usr/bin/env python
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
payload = { 'username': 'user@email.com', 'password': 'blahblahsecretpassw0rd' }
url = 'https://website.com/login.html'
requests.post(url, data=payload, verify=False)
The use of disable_warnings(InsecureRequestWarning) silences the warnings that requests would otherwise print when logging into sites with unverified SSL certificates. Note that verify=False turns off certificate verification for the request.
Extra:
To run this script from the command line on a UNIX-based system, place it in a directory, e.g. ~/home/scripts, and add this directory to your path in ~/.bash_profile or a similar file used by the terminal.
# Custom scripts
export CUSTOM_SCRIPTS="$HOME/home/scripts"
export PATH=$CUSTOM_SCRIPTS:$PATH
Then make the script executable and create a link to it inside ~/home/scripts:
chmod +x ~/home/scripts/login.py
ln -s ~/home/scripts/login.py ~/home/scripts/login
Close your terminal, start a new one, and run login.
Some pages may require more than login/pass. There may even be hidden fields. The most reliable way is to use inspect tool and look at the network tab while logging in, to see what data is being passed on.
