Python unable to retrieve form with urllib or mechanize - python

I'm trying to fill out and submit a form using Python, but I'm not able to retrieve the resulting page. I've tried both mechanize and urllib/urllib2 methods to post the form, but both run into problems.
The form I'm trying to retrieve is here: http://zrs.leidenuniv.nl/ul/start.php. The page is in Dutch, but this is irrelevant to my problem. It may be noteworthy that the form action redirects to http://zrs.leidenuniv.nl/ul/query.php.
First of all, this is the urllib/urllib2 method I've tried:
import urllib, urllib2
import socket, cookielib
url = 'http://zrs.leidenuniv.nl/ul/start.php'
params = {'day': 1, 'month': 5, 'year': 2012, 'quickselect': "unchecked",
          'res_instantie': '_ALL_', 'selgebouw': '_ALL_', 'zrssort': "locatie",
          'submit': "Uitvoeren"}
http_header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11",
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
               "Accept-Language": "nl-NL,nl;q=0.8,en-US;q=0.6,en;q=0.4"}
timeout = 15
socket.setdefaulttimeout(timeout)
request = urllib2.Request(url, urllib.urlencode(params), http_header)
response = urllib2.urlopen(request)
cookies = cookielib.CookieJar()
cookies.extract_cookies(response, request)
cookie_handler = urllib2.HTTPCookieProcessor(cookies)
redirect_handler = urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler, cookie_handler)
response = opener.open(request)
html = response.read()
However, when I try to print the retrieved html I get the original page, not the one the form action refers to. So any hints as to why this doesn't submit the form would be greatly appreciated.
Because the above didn't work, I also tried to use mechanize to submit the form. However, this results in a ParseError with the following code:
import mechanize
url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)
br.select_form(nr = 0)
where the last line exits with the following: "ParseError: unexpected '-' char in declaration". Now I realize that this error may indicate an error in the DOCTYPE declaration, but since I can't edit the form page I'm not able to try different declarations. Any help on this error is also greatly appreciated.
Thanks in advance for your help.

It's because the DOCTYPE part is malformed.
Also it contains some strange tags like:
<!Co Dreef / Eelco de Graaff Faculteit der Rechtsgeleerdheid Universiteit Leiden><!e-mail j.dreef#law.leidenuniv.nl >
Try validating the page yourself...
Nonetheless, you can just strip off the junk to make mechanize's HTML parser happy:
import mechanize
url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)
response.set_data(response.get_data()[177:])
br.set_response(response)
br.select_form(nr = 0)
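If the amount of junk in front of the markup ever changes, the hard-coded 177-byte offset will break. A slightly more defensive variant (just a sketch, not tested against the live page) looks for the first <html tag and throws away everything before it:

import mechanize

url = 'http://zrs.leidenuniv.nl/ul/start.php'

br = mechanize.Browser()
response = br.open(url)

# Discard everything before the opening <html> tag so the malformed
# DOCTYPE and the stray <!...> declarations never reach the parser.
data = response.get_data()
start = data.lower().find('<html')
if start != -1:
    response.set_data(data[start:])
    br.set_response(response)

br.select_form(nr=0)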

Related

Python web scraping login

I am trying to login to a website using python.
The login URL is :
https://login.flash.co.za/apex/f?p=pwfone:login
and the 'form action' url is shown as :
https://login.flash.co.za/apex/wwv_flow.accept
When I use 'Inspect Element' in Chrome while logging in manually, these are the form fields that show up in the POST (p_t02 = password):
There are a few hidden items that I'm not sure how to add to the Python code below.
When I use this code, the login page is returned:
import requests
url = 'https://login.flash.co.za/apex/wwv_flow.accept'
values = {'p_flow_id': '1500',
          'p_flow_step_id': '101',
          'p_page_submission_id': '3169092211412',
          'p_request': 'LOGIN',
          'p_t01': 'solar',
          'p_t02': 'password',
          'p_checksum': ''
          }
r = requests.post(url, data=values)
print r.content
How can I adjust this code to perform a login?
Chrome network:
This is more or less what your script should look like. Use a session to handle the cookies automatically. Fill in the username and password fields manually.
import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    res = s.get(logurl)
    soup = BeautifulSoup(res.text, "lxml")
    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t01': 'username',
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }
    r = s.post(posturl, data=values)
    print r.content
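One caveat (an assumption, not verified against the real page): the login page may contain more than one hidden input named p_arg_names, and a dict literal can only keep one of them. requests encodes a list value as a repeated form field, so if that turns out to be the case you could collect them all inside the with block before posting:

    values['p_arg_names'] = [inp['value'] for inp in soup.select("input[name='p_arg_names']")]
    r = s.post(posturl, data=values)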
Since I cannot recreate your case I can't tell you exactly what to change, but when I was doing this kind of thing I used Postman to intercept all the requests my browser sends. I'd install that, along with its browser extension, and then perform the login. You can then view the request and the response it received in Postman; what's more, it can generate the Python code for the request, so you can simply copy and use it.
In short: use Postman, perform the login, and clone the resulting request.

Using MechanicalSoup behind proxy

I am trying to build a simple webbot in Python, on Windows, using MechanicalSoup. Unfortunately, I am sitting behind a (company-enforced) proxy. I could not find a way to provide a proxy to MechanicalSoup. Is there such an option at all? If not, what are my alternatives?
EDIT: Following Eytan's hint, I added proxies and verify to my code, which got me a step further, but I still cannot submit a form:
import mechanicalsoup
proxies = {
    'https': 'my.https.proxy:8080',
    'http': 'my.http.proxy:8080'
}
url = 'https://stackoverflow.com/'
browser = mechanicalsoup.StatefulBrowser()
front_page = browser.open(url, proxies=proxies, verify=False)
form = browser.select_form('form[action="/search"]')
form.print_summary()
form["q"] = "MechanicalSoup"
form.print_summary()
browser.submit(form, url=url)
The code hangs in the last line, and submit doesn't accept proxies as an argument.
It seems that proxies have to be specified at the session level. Then they are not required in browser.open, and submitting the form also works:
import mechanicalsoup
proxies = {
    'https': 'my.https.proxy:8080',
    'http': 'my.http.proxy:8080'
}
url = 'https://stackoverflow.com/'
browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = proxies # THIS IS THE SOLUTION!
front_page = browser.open(url, verify=False)
form = browser.select_form('form[action="/search"]')
form["q"] = "MechanicalSoup"
result = browser.submit(form, url=url)
result.status_code
returns 200 (i.e. "OK").
According to their doc, this should work:
browser.get(url, proxies=proxy)
Try passing the 'proxies' argument to your requests.
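If you'd rather not pass proxies= to every call, requests (which MechanicalSoup uses under the hood) also honours the standard proxy environment variables as long as the session's trust_env stays enabled (the default). A small sketch, reusing the proxy hosts from the question:

import os
import mechanicalsoup

# requests picks these up automatically for every request in the session
os.environ['HTTP_PROXY'] = 'http://my.http.proxy:8080'
os.environ['HTTPS_PROXY'] = 'http://my.https.proxy:8080'

browser = mechanicalsoup.StatefulBrowser()
front_page = browser.open('https://stackoverflow.com/', verify=False)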

programmatically sign into amazon

I am looking for a way to sign into amazon.com programmatically, via a POST request with the login data in the request body. I am new to Python and still learning the specifics. When I sniff the HTTP traffic of a login to amazon.com I see several form fields, which I copied into my code using requests. I verified the names of each field and that none of them change. I have the script write the result to a file which I can load as a webpage to check; it loads Amazon's homepage, but shows that I am not signed in. When I change my email address or password to something incorrect it makes no difference and I still get code 200. I don't know what I'm doing wrong to sign in via POST. I would greatly appreciate any help. My code is as follows:
import requests
s = requests.Session()
login_data = {
'email':'myemail',
'password':'mypasswd',
'appAction' : 'SIGNIN',
'appActionToken' : 'RjZHAvZ7X4o8bm0eM2vFJFj2BYqZMj3D',
'openid.pape.max_auth_age' : 'ape:MA==',
'openid.ns' : 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjA=',
'openid.ns.pape' : 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvZXh0ZW5zaW9ucy9wYXBlLzEuMA==',
'prevRID' : 'ape:MVQ4MVBLWDVEMUI0QjA3WlkyMEE=', # changes
'pageId' : 'ape:dXNmbGV4',
'openid.identity' : 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjAvaWRlbnRpZmllcl9zZWxlY3Q=',
'openid.claimed_id' : 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjAvaWRlbnRpZmllcl9zZWxlY3Q=',
'openid.mode' : 'ape:Y2hlY2tpZF9zZXR1cA==',
'openid.assoc_handle' : 'ape:dXNmbGV4',
'openid.return_to' : 'ape:aHR0cHM6Ly93d3cuYW1hem9uLmNvbS9ncC95b3Vyc3RvcmUvaG9tZT9pZT1VVEY4JmFjdGlvbj1zaWduLW91dCZwYXRoPSUyRmdwJTJGeW91cnN0b3JlJTJGaG9tZSZyZWZfPW5hdl95b3VyYWNjb3VudF9zaWdub3V0JnNpZ25Jbj0xJnVzZVJlZGlyZWN0T25TdWNjZXNzPTE=',
'create' : '0',
'metadata1' : `'j1QDaKdThighCUrU/Wa7pRWvjeOlkT+BQ7QFj2d9TXIlN6VtoSYle6l9S2z2chZ/EwnMx1xy2EjY5NMRS6hW+jgTFNBJlwSlw+HphykjlOm8Ejx47RrNRXOrwwdFHiUBSB6Jt+GGCp/+5gPXrt8DyrgYFxqVlGRkupsqFjMjtGWaYhEvNsBQmUZ4932fJpeYRJzWxi5KbqG4US53JQCB++aGYgicdOcSY2aTMGEqEkrAPu+xBE9NEboMBepju2OPcAsCzkf+hpugIPYl4xYhFpBarAYlRMgpbypi82+zX6TMCj5/b2k9ky1/fV4LvvvOd3embFEJXrzMIH9enK8F1BH5LJiRQ7e45ld+KqlcSb/cdJMqIXPIeJAy8r88LX3pB65IxR46Z89Sol1XmaOSfP0P626nIvIYv7a0oEyg9SHiAJ2LUpZLlqCh1Tax9f/Mz1ZiqMmFRal8UxyUZbcc5qWI24xKY86P/jwG007kWQiXhcBPUbQnjCWXVPcRV14FAyLQ9OOsiu7OyjElo/Sd1NvqaKVtz6DKy2RyDkZo82WvgPsj6BkUr0oIEUAYggSHHZxhsDicaIzgfJ5/OQeCLEvLUPjLJbnl2FK8xOG0FQAsl2Floso4Lgqd46V38frC4kD87bqQezQqmCr343FgT4uCRgP0LGBf5iSP70kj9JZolGQRSfHy+7DqwLsoEtZa9kJgrJlUNWTrC/XQZZyvA8yzWvz9jaEcPH3nGrUEMLI8airUMwpvK0JrGEHNzJZjPIWzz4bI5G1f1TL8F2eY+iZ6jDPK8mBQHh4zO2b/InaKn2NydtX42QF5lGYagsMEPn3vMgvYgrHwu5YS38ywDUobDxjDhnUCbCvNy4Cot2XMjBz1/S3X5tv4b540yDXhgJWH4h6OZOgs9gjlotv3rk24xYYPlZp+6WgrAQPPZ5VZJorh4dyMvkM7KNhzaK+ejQDMTIZG7096kGf+iQkfudXzg8k8YAXoerRvKpWgckUeZyY2cEwXpZCBsK2zZvuvuyuaHdKbVr8VTgJo'`
}
r = s.post('https://www.amazon.com/ap/signin', login_data)
s.cookies
r = s.get('https://www.amazon.com')
for cookie in s.cookies:
    print 'cookie name ' + cookie.name
    print 'cookie value ' + cookie.value
with open("requests_results.html", "w") as f:
    f.write(r.content)
If you just want to log in to Amazon, try using the Python mechanize module. It is much simpler and neater.
For reference, check out this sample code:
import mechanize
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]
sign_in = browser.open('https://www.amazon.com/gp/sign-in.html')
browser.select_form(name="sign-in")
browser["email"] = '' #provide email
browser["password"] = '' #provide passowrd
logged_in = browser.submit()
===============================
Edited: requests module
import requests
session = requests.Session()
data = {'email':'', 'password':''}
header = {'User-Agent': 'Mozilla/5.0'}
response = session.post('https://www.amazon.com/gp/sign-in.html', data, headers=header)
print response.content
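As the question itself shows, the sign-in form carries a number of hidden inputs (appActionToken, the openid.* values, metadata1, and so on), and at least some of them change per request, so hard-coding them is unlikely to work. If the plain requests post above keeps returning the signed-out homepage, one thing to try is scraping those hidden fields from the form first and only adding the credentials on top. A rough sketch, not a verified Amazon login (the form name "sign-in" and the post URL are taken from the question and the mechanize answer above and may be out of date):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'

# Fetch the sign-in page and echo back every hidden input the form carries
# (appActionToken, metadata1, openid.* ...), since they change per request.
signin_page = session.get('https://www.amazon.com/gp/sign-in.html')
soup = BeautifulSoup(signin_page.text, 'lxml')
form = soup.find('form', {'name': 'sign-in'})
data = {inp.get('name'): inp.get('value', '')
        for inp in form.find_all('input', type='hidden')}

data['email'] = ''     # provide email
data['password'] = ''  # provide password

response = session.post('https://www.amazon.com/ap/signin', data=data)
print response.content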

How to request the "load more" function with Python?

Hi, I'm doing research on Zhihu, a Chinese Q&A website similar to Quora, using social network analysis, and I'm writing a crawler in Python, but I've run into a problem:
I want to scrape the info of the users who follow a specific user, like Kaifu-Lee. Kaifu-Lee's followers page is http://www.zhihu.com/people/kaifulee/followers
The load-more button is at the bottom of the followers list, and I need to get the full list.
Here's the way I do with python requests:
import requests
import re
s = requests.session()
login_data = {'email': '***', 'password': '***', }
# post the login data.
s.post('http://www.zhihu.com/login', login_data)
# verify that I've logged in successfully. This step has definitely succeeded.
r = s.get('http://www.zhihu.com')
Then, I jumped to the target page:
r = s.get('http://www.zhihu.com/people/kaifulee/followers')
and get 200 return:
In [7]: r
Out[7]: <Response [200]>
So the next step is to analyze the load-more request under the "Network" tab of Chrome's developer tools. Here's the information:
Request URL: http://www.zhihu.com/node/ProfileFollowersListV2
Request Method: POST
Request Headers
Connection:keep-alive
Host:www.zhihu.com
Origin:http://www.zhihu.com
Referer:http://www.zhihu.com/people/kaifulee/followers
Form data
method:next
params:{"hash_id":"12135f10b08a64c54e8bfd537dd7bee7","order-by":"created","offset":20}
_xsrf:ea63beee3a3444bfb853f36b7d968ad1
So I try to POST:
global header_info
header_info = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1581.2 Safari/537.36',
    'Host': 'www.zhihu.com',
    'Origin': 'http://www.zhihu.com',
    'Connection': 'keep-alive',
    'Referer': 'http://www.zhihu.com/people/zihaolucky/followers',
    'Content-Type': 'application/x-www-form-urlencoded',
}
# form data.
data = r.text
raw_hash_id = re.findall('hash_id(.*)',data)
hash_id = raw_hash_id[0][14:46]
payload={"method":next,"hash_id":str(hash_id),"order_by":"created","offset":20}
# post with parameters.
url = 'http://www.zhihu.com/node/ProfileFollowersListV2'
r = requests.post(url,data=payload,headers=header_info)
BUT, it returns <Response [404]>.
Did I make a mistake somewhere?
Someone said I made a mistake in dealing with the params. The Form Data has 3 parameters: method, params and _xsrf. I had left out _xsrf, so I put all three into a dictionary.
So I modified the code:
# form data.
data = r.text
raw_hash_id = re.findall('hash_id(.*)',data)
hash_id = raw_hash_id[0][14:46]
raw_xsrf = re.findall('xsrf(.*)',r.text)
_xsrf = raw_xsrf[0][9:-3]
payload = {"method":"next","params":{"hash_id":hash_id,"order_by":"created","offset":20,},"_xsrf":_xsrf,}
# reuse the session object, but still error.
>>> r = s.post(url, data=payload, headers=header_info)
<Response [500]>
You can't pass nested dictionaries to the data parameter. Requests just doesn't know what to do with them.
It's not clear, but it looks like the value of the params key is probably JSON. This means your payload code should look like this:
import json

params = json.dumps({"hash_id": hash_id, "order_by": "created", "offset": 20})
payload = {"method": "next", "params": params, "_xsrf": _xsrf}
Give that a try.
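Putting it together: with params and payload built as above, the POST itself and the paging would look roughly like this (a sketch; the stop condition is a guess, since the response format isn't shown here, but the offset presumably grows in steps of 20, matching the devtools capture):

r = s.post('http://www.zhihu.com/node/ProfileFollowersListV2', data=payload, headers=header_info)
print r.status_code

# To fetch the rest of the followers, repeat the POST with
# "offset": 40, 60, 80, ... in the params JSON until the response comes back empty.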

Logging into website with multiple pages using Python (urllib2 and cookielib)

I am writing a script to retrieve transaction information from my bank's home banking website for use in a personal mobile application.
The website is laid out like so:
https://homebanking.purduefed.com/OnlineBanking/Login.aspx
-> Enter username -> Submit form ->
https://homebanking.purduefed.com/OnlineBanking/AOP/Password.aspx
-> Enter password -> Submit form ->
https://homebanking.purduefed.com/OnlineBanking/AccountSummary.aspx
The problem I am having is that, since there are 2 separate pages to POST to, I first thought the session information was being lost. But I use urllib2's HTTPCookieProcessor to store the cookies and to make the GET and POST requests to the website, and I have found that this isn't the issue.
My current code is:
import urllib
import urllib2
import cookielib
loginUrl = 'https://homebanking.purduefed.com/OnlineBanking/Login.aspx'
passwordUrl = 'https://homebanking.purduefed.com/OnlineBanking/AOP/Password.aspx'
acctUrl = 'https://homebanking.purduefed.com/OnlineBanking/AccountSummary.aspx'
LoginName = 'sample_username'
Password = 'sample_password'
values = {'LoginName': LoginName,
          'Password': Password}

class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        print "Cookie Manipulation Right Here"
        return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
    http_error_301 = http_error_303 = http_error_307 = http_error_302
login_cred = urllib.urlencode(values)
jar = cookielib.CookieJar()
cookieprocessor = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor)
urllib2.install_opener(opener)
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5')]
opener.addheader = [('Referer', loginUrl)]
response = opener.open(loginUrl, login_cred)
reqPage = opener.open(passwordUrl)
opener.addheader = [('Referer', passwordUrl)]
response2 = opener.open(passwordUrl, login_cred)
reqPage2 = opener.open(acctUrl)
content = reqPage2.read()
Currently, the script makes it to the passwordUrl page, so the username is POSTed correctly. But when the POST is made to the passwordUrl page, instead of going on to acctUrl it is redirected back to the login page (the redirect location when acctUrl is opened without proper credentials).
Any thoughts or comments on how to move forward are greatly appreciated at this point!
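One thing worth checking (a guess, since the page source isn't shown here): .aspx pages built on ASP.NET WebForms usually require the hidden __VIEWSTATE / __EVENTVALIDATION fields of the page being submitted to be posted back along with the visible fields, and they change on every page load. A sketch of pulling them off the password page before the second POST, reusing the opener above and BeautifulSoup for parsing:

from bs4 import BeautifulSoup

# Re-read the password page through the same opener (so the cookies match)
# and copy every hidden input into the POST data before sending it back.
password_page = opener.open(passwordUrl).read()
soup = BeautifulSoup(password_page, 'html.parser')
hidden = dict((inp.get('name'), inp.get('value', ''))
              for inp in soup.find_all('input', type='hidden'))
hidden.update(values)

response2 = opener.open(passwordUrl, urllib.urlencode(hidden))
reqPage2 = opener.open(acctUrl)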
