I am writing a script to retrieve transaction information from my bank's home banking website for use in a personal mobile application.
The website is laid out like so:
https://homebanking.purduefed.com/OnlineBanking/Login.aspx
-> Enter username -> Submit form ->
https://homebanking.purduefed.com/OnlineBanking/AOP/Password.aspx
-> Enter password -> Submit form ->
https://homebanking.purduefed.com/OnlineBanking/AccountSummary.aspx
Since there are two separate pages that each take a POST, my first thought was that the session information was being lost along the way. But I use urllib2's HTTPCookieProcessor to store cookies across my GET and POST requests, and I have confirmed that this isn't the issue.
My current code is:
import urllib
import urllib2
import cookielib

loginUrl = 'https://homebanking.purduefed.com/OnlineBanking/Login.aspx'
passwordUrl = 'https://homebanking.purduefed.com/OnlineBanking/AOP/Password.aspx'
acctUrl = 'https://homebanking.purduefed.com/OnlineBanking/AccountSummary.aspx'

LoginName = 'sample_username'
Password = 'sample_password'

values = {'LoginName': LoginName,
          'Password': Password}

class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        print "Cookie Manipulation Right Here"
        return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
    http_error_301 = http_error_303 = http_error_307 = http_error_302

login_cred = urllib.urlencode(values)

jar = cookielib.CookieJar()
cookieprocessor = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor)
urllib2.install_opener(opener)
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'
opener.addheaders = [('User-agent', user_agent), ('Referer', loginUrl)]  # addheaders (plural) is the attribute urllib2 openers actually read

response = opener.open(loginUrl, login_cred)
reqPage = opener.open(passwordUrl)

opener.addheaders = [('User-agent', user_agent), ('Referer', passwordUrl)]
response2 = opener.open(passwordUrl, login_cred)
reqPage2 = opener.open(acctUrl)
content = reqPage2.read()
Currently the script makes it to the passwordUrl page, so the username is POSTed correctly. But when the POST is made to passwordUrl, instead of landing on acctUrl I am redirected back to the Login page (which is where acctUrl redirects when it is opened without valid credentials).
Any thoughts or comments on how to move forward are greatly appreciated at this point!
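One direction I have not fully explored: .aspx (ASP.NET WebForms) pages normally render hidden state fields such as __VIEWSTATE and __EVENTVALIDATION that have to be echoed back with the POST, and a password form that doesn't get them may well bounce back to Login. A minimal sketch of pulling them out before posting the password (the hidden-field names and the naive regex are assumptions, not something I have verified against this site):

import re
import urllib

def hidden_fields(html):
    # Grab every <input type="hidden" name="..." value="..."> (assumes this attribute order).
    return dict(re.findall(
        r'<input[^>]*type="hidden"[^>]*name="([^"]+)"[^>]*value="([^"]*)"', html))

page = opener.open(passwordUrl).read()
post_data = hidden_fields(page)          # e.g. __VIEWSTATE, __EVENTVALIDATION
post_data['Password'] = Password         # same field name as in the values dict above
response2 = opener.open(passwordUrl, urllib.urlencode(post_data))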
Related
Hi I am seeking some help after going back and forth trying to figure this out.
Summary:
I want to open a URL and then make the GET request it issues, which returns XML-like HTML content. I need to scrape that whole response.body.
Sample: https://mpv.tickets.com/api/pvodc/v1/events/navmap/availability/?pid=9016700&agency=MLB_MPV&orgId=10&supportsVoucherRedemption=true
Loading it in a browser does not give me any 503 errors, but I am getting 503 errors in Scrapy.
I have tried using scrapy-selenium (SeleniumRequest) in combination with a plain, basic spider.
I did get results on the first few tries, but after that it stopped producing results and always ends in a 503 error.
I am using the BrightData Web Unlocker proxy, and headers have also been added. So I am not sure what else I can do to make it load the first URL, which is the main page from which this GET request is issued. (I can also go to the API URL directly, since I have the parameters.)
import random

import scrapy
from scrapy_selenium import SeleniumRequest


class MpvticketSpider(scrapy.Spider):
    name = 'mpvticket'

    urlin = "https://mpv.tickets.com/?agency=MLB_MPV&orgid=10&pid=9016700"
    eventid = urlin.strip().split("pid=")[1]
    urlout = ("https://mpv.tickets.com/api/pvodc/v1/events/navmap/availability/?"
              "pid=" + eventid + "&agency=MLB_MPV&orgId=10&supportsVoucherRedemption=true")
    start_urls = [urlin]
    print("\n START URL BEING RUN: ", start_urls)

    def parse(self, response):
        url = "https://mpv.tickets.com/api/pvodc/v1/events/navmap/availability/?pid=9016700&agency=MLB_MPV&orgId=10&supportsVoucherRedemption=true"
        print("\n FIRST URL BEING RUN: ", url)

        username = 'lum-customer-XXXXX-zone-zone6ticket-route_err-pass_dyn'
        password = 'XXXXX'
        port = XXXX  # placeholder
        session_id = random.random()
        # credentials go before '@' in the proxy URL
        super_proxy_url = ('http://%s-country-us-session-%s:%s@zproxy.lum-superproxy.io:%d' %
                           (username, session_id, password, port))

        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        yield SeleniumRequest(url=url, callback=self.parse_api, meta={'proxy': super_proxy_url}, headers=headers)

    def parse_api(self, response):
        raw_data = response.text
        print(raw_data)
        # More data extraction code. Only need help with the top block and how to avoid the 503 error.
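Since a 503 on a proxied request usually means the target (or the proxy) is rejecting it as automated traffic, one thing worth trying is to skip Selenium for the API endpoint and issue a plain scrapy.Request through the proxy with full browser-like headers and a few retries. A sketch only; the settings are standard Scrapy, but the header values, the credential placeholders, and the assumption that the endpoint works without JavaScript are mine:

import random
import scrapy

class MpvApiSpider(scrapy.Spider):
    # Hypothetical variant of the spider above: requests the availability endpoint directly.
    name = 'mpv_api'
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'RETRY_TIMES': 5,       # retry 503s a few times before giving up
        'DOWNLOAD_DELAY': 2,
    }
    api_url = ("https://mpv.tickets.com/api/pvodc/v1/events/navmap/availability/"
               "?pid=9016700&agency=MLB_MPV&orgId=10&supportsVoucherRedemption=true")

    def start_requests(self):
        # placeholder credentials/port for the BrightData super proxy
        proxy = 'http://USERNAME-session-%s:PASSWORD@zproxy.lum-superproxy.io:PORT' % random.random()
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
            'Accept': 'application/json, text/plain, */*',
            'Referer': 'https://mpv.tickets.com/?agency=MLB_MPV&orgid=10&pid=9016700',
        }
        yield scrapy.Request(self.api_url, headers=headers,
                             meta={'proxy': proxy}, callback=self.parse_api)

    def parse_api(self, response):
        self.logger.info('status=%s, body length=%d', response.status, len(response.text))
        yield {'body': response.text}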
I am looking for a way to sign into amazon.com by sending a POST with the login data in the request body. I am new to Python and still learning the specifics. When I sniff the HTTP traffic of a login to amazon.com I see several form fields, which I put into my code using requests. I verified the field names and that none of them change. I write the response to a file that I can load as a web page to verify, and it loads Amazon's homepage but shows I am not signed in. When I change my email address or password to something incorrect it makes no difference, and I still get code 200. I don't know what I'm doing wrong with this POST sign-in. I would greatly appreciate any help. My code is as follows:
import requests
s = requests.Session()
login_data = {
    'email': 'myemail',
    'password': 'mypasswd',
    'appAction': 'SIGNIN',
    'appActionToken': 'RjZHAvZ7X4o8bm0eM2vFJFj2BYqZMj3D',
    'openid.pape.max_auth_age': 'ape:MA==',
    'openid.ns': 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjA=',
    'openid.ns.pape': 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvZXh0ZW5zaW9ucy9wYXBlLzEuMA==',
    'prevRID': 'ape:MVQ4MVBLWDVEMUI0QjA3WlkyMEE=',  # changes
    'pageId': 'ape:dXNmbGV4',
    'openid.identity': 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjAvaWRlbnRpZmllcl9zZWxlY3Q=',
    'openid.claimed_id': 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjAvaWRlbnRpZmllcl9zZWxlY3Q=',
    'openid.mode': 'ape:Y2hlY2tpZF9zZXR1cA==',
    'openid.assoc_handle': 'ape:dXNmbGV4',
    'openid.return_to': 'ape:aHR0cHM6Ly93d3cuYW1hem9uLmNvbS9ncC95b3Vyc3RvcmUvaG9tZT9pZT1VVEY4JmFjdGlvbj1zaWduLW91dCZwYXRoPSUyRmdwJTJGeW91cnN0b3JlJTJGaG9tZSZyZWZfPW5hdl95b3VyYWNjb3VudF9zaWdub3V0JnNpZ25Jbj0xJnVzZVJlZGlyZWN0T25TdWNjZXNzPTE=',
    'create': '0',
    'metadata1': 'j1QDaKdThighCUrU/Wa7pRWvjeOlkT+BQ7QFj2d9TXIlN6VtoSYle6l9S2z2chZ/EwnMx1xy2EjY5NMRS6hW+jgTFNBJlwSlw+HphykjlOm8Ejx47RrNRXOrwwdFHiUBSB6Jt+GGCp/+5gPXrt8DyrgYFxqVlGRkupsqFjMjtGWaYhEvNsBQmUZ4932fJpeYRJzWxi5KbqG4US53JQCB++aGYgicdOcSY2aTMGEqEkrAPu+xBE9NEboMBepju2OPcAsCzkf+hpugIPYl4xYhFpBarAYlRMgpbypi82+zX6TMCj5/b2k9ky1/fV4LvvvOd3embFEJXrzMIH9enK8F1BH5LJiRQ7e45ld+KqlcSb/cdJMqIXPIeJAy8r88LX3pB65IxR46Z89Sol1XmaOSfP0P626nIvIYv7a0oEyg9SHiAJ2LUpZLlqCh1Tax9f/Mz1ZiqMmFRal8UxyUZbcc5qWI24xKY86P/jwG007kWQiXhcBPUbQnjCWXVPcRV14FAyLQ9OOsiu7OyjElo/Sd1NvqaKVtz6DKy2RyDkZo82WvgPsj6BkUr0oIEUAYggSHHZxhsDicaIzgfJ5/OQeCLEvLUPjLJbnl2FK8xOG0FQAsl2Floso4Lgqd46V38frC4kD87bqQezQqmCr343FgT4uCRgP0LGBf5iSP70kj9JZolGQRSfHy+7DqwLsoEtZa9kJgrJlUNWTrC/XQZZyvA8yzWvz9jaEcPH3nGrUEMLI8airUMwpvK0JrGEHNzJZjPIWzz4bI5G1f1TL8F2eY+iZ6jDPK8mBQHh4zO2b/InaKn2NydtX42QF5lGYagsMEPn3vMgvYgrHwu5YS38ywDUobDxjDhnUCbCvNy4Cot2XMjBz1/S3X5tv4b540yDXhgJWH4h6OZOgs9gjlotv3rk24xYYPlZp+6WgrAQPPZ5VZJorh4dyMvkM7KNhzaK+ejQDMTIZG7096kGf+iQkfudXzg8k8YAXoerRvKpWgckUeZyY2cEwXpZCBsK2zZvuvuyuaHdKbVr8VTgJo'
}

r = s.post('https://www.amazon.com/ap/signin', login_data)
s.cookies
r = s.get('https://www.amazon.com')

for cookie in s.cookies:
    print 'cookie name ' + cookie.name
    print 'cookie value ' + cookie.value

with open("requests_results.html", "w") as f:
    f.write(r.content)
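One way to see what is actually happening, rather than trusting the 200, is to look at the redirect chain and check the final page for a signed-in marker. Just a sketch; the markers are guesses about Amazon's markup:

# Sketch: inspect where the requests actually ended up and whether the page looks signed in.
print r.status_code, r.url            # final URL after any redirects
for resp in r.history:                # the redirect chain, if any
    print 'redirected via', resp.status_code, resp.url
if 'Sign Out' in r.content or 'nav-signout' in r.content:   # assumed "signed in" markers
    print 'looks signed in'
else:
    print 'still anonymous - a required (hidden or dynamic) form field is probably missing'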
If you just want to log in to Amazon, then try the Python mechanize module. It is much simpler and neater.
For reference, check out this sample code:
import mechanize
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]
sign_in = browser.open('https://www.amazon.com/gp/sign-in.html')
browser.select_form(name="sign-in")
browser["email"] = '' #provide email
browser["password"] = '' #provide passowrd
logged_in = browser.submit()
===============================
Edited: requests module
import requests
session = requests.Session()
data = {'email':'', 'password':''}
header={'User-Agent' : 'Mozilla/5.0'}
response = session.post('https://www.amazon.com/gp/sign-in.html', data, headers=header)
print response.content
I am trying to write a Python script to POST a multipart form to a site that requires authentication through CAS.
There are two approaches that both solve part of the problem:
The Python requests library works well for submitting multipart forms.
There is caslib, with a login function. It returns an OpenerDirector that can presumably be used for further requests.
Unfortunately, I can't figure out how to get a complete solution out of what I have so far.
These are just some ideas from a couple of hours of research; I am open to just about any solution that works.
Thanks for the help.
I accepted J.F. Sebastian's answer because I think it was closest to what I'd asked, but I actually wound up getting it to work by using mechanize, a Python library for web browser automation.
import argparse
import mechanize
import re
import sys
# (SENSITIVE!) Authentication info
username = r'username'
password = r'password'
# Command line arguments
parser = argparse.ArgumentParser(description='Submit lab to CS 235 site (Winter 2013)')
parser.add_argument('lab_num', help='Lab submission number')
parser.add_argument('file_name', help='Submission file (zip)')
args = parser.parse_args()
# Go to login site
br = mechanize.Browser()
br.open('https://cas.byu.edu/cas/login?service=https%3a%2f%2fbeta.cs.byu.edu%2f~sub235%2fsubmit.php')
# Login and forward to submission site
br.form = br.forms().next()
br['username'] = username
br['password'] = password
br.submit()
# Submit
br.form = br.forms().next()
br['labnum'] = list(args.lab_num)
br.add_file(open(args.file_name), 'application/zip', args.file_name)
r = br.submit()
for s in re.findall('<h4>(.+?)</?h4>', r.read()):
    print s
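Invoked from the command line, it looks something like this (the script and zip file names are hypothetical):

python submit_lab.py 3 lab3.zip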
You could use poster to prepare the multipart/form-data. Try passing poster's opener to caslib and then using caslib's opener to make the request (not tested):
import urllib2
import caslib
import poster.encode
import poster.streaminghttp
opener = poster.streaminghttp.register_openers()
r, opener = caslib.login_to_cas_service(login_url, username, password,
opener=opener)
params = {'file': open("test.txt", "rb"), 'name': 'upload test'}
datagen, headers = poster.encode.multipart_encode(params)
response = opener.open(urllib2.Request(upload_url, datagen, headers))
print response.read()
You could write an authentication handler for Requests using caslib. Then you could do something like:
auth = CasAuthentication("url", "login", "password")
response = requests.get("http://example.com/cas_service", auth=auth)
Or if you're making tons of requests against the website:
s = requests.session()
s.auth = auth
s.post('http://casservice.com/endpoint', data={'key': 'value'}, files={'filename': '/path/to/file'})
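A lighter-weight alternative to a full handler is to log in once through caslib against a CookieJar you control and hand those cookies to a requests session. Just a sketch, not tested; it assumes caslib's login_to_cas_service accepts the opener argument shown in the previous answer and that the CAS session is cookie-based:

import cookielib
import urllib2

import caslib
import requests

jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# caslib does the CAS dance through our opener, so the session cookies land in `jar`
r, opener = caslib.login_to_cas_service(login_url, username, password, opener=opener)

session = requests.Session()
for cookie in jar:                      # copy the CAS/service cookies over
    session.cookies.set_cookie(cookie)

# requests handles the multipart encoding natively
response = session.post(upload_url,
                        data={'name': 'upload test'},
                        files={'file': open('test.txt', 'rb')})
print response.text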
I'm trying to fill out and submit a form using Python, but I'm not able to retrieve the resulting page. I've tried both mechanize and urllib/urllib2 methods to post the form, but both run into problems.
The form I'm trying to retrieve is here: http://zrs.leidenuniv.nl/ul/start.php. The page is in Dutch, but that is irrelevant to my problem. It may be worth noting that the form action posts to http://zrs.leidenuniv.nl/ul/query.php.
First of all, this is the urllib/urllib2 method I've tried:
import urllib, urllib2
import socket, cookielib
url = 'http://zrs.leidenuniv.nl/ul/start.php'
params = {'day': 1, 'month': 5, 'year': 2012, 'quickselect' : "unchecked",
'res_instantie': '_ALL_', 'selgebouw': '_ALL_', 'zrssort': "locatie",
'submit' : "Uitvoeren"}
http_header = { "User-Agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11",
"Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language" : "nl-NL,nl;q=0.8,en-US;q=0.6,en;q=0.4" }
timeout = 15
socket.setdefaulttimeout(timeout)
request = urllib2.Request(url, urllib.urlencode(params), http_header)
response = urllib2.urlopen(request)
cookies = cookielib.CookieJar()
cookies.extract_cookies(response, request)
cookie_handler = urllib2.HTTPCookieProcessor(cookies)
redirect_handler = urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler, cookie_handler)
response = opener.open(request)
html = response.read()
However, when I try to print the retrieved html I get the original page, not the one the form action refers to. So any hints as to why this doesn't submit the form would be greatly appreciated.
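One quick experiment might be to POST the same parameters straight to the form's action URL and compare the result. A sketch only, assuming query.php accepts a direct POST without a prior session:

# Sketch: post the form fields directly to the action URL noted above.
import urllib, urllib2

query_url = 'http://zrs.leidenuniv.nl/ul/query.php'
req = urllib2.Request(query_url, urllib.urlencode(params), http_header)
print urllib2.urlopen(req).read()[:500]   # peek at the start of the response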
Because the above didn't work, I also tried to use mechanize to submit the form. However, this results in a ParseError with the following code:
import mechanize
url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)
br.select_form(nr = 0)
where the last line exits with the following: "ParseError: unexpected '-' char in declaration". Now I realize that this error may indicate an error in the DOCTYPE declaration, but since I can't edit the form page I'm not able to try different declarations. Any help on this error is also greatly appreciated.
Thanks in advance for your help.
It's because the DOCTYPE part is malformed.
Also it contains some strange tags like:
<!Co Dreef / Eelco de Graaff Faculteit der Rechtsgeleerdheid Universiteit Leiden><!e-mail j.dreef#law.leidenuniv.nl >
Try validating the page yourself...
Nonetheless, you can just strip off the junk to make mechanize's HTML parser happy:
import mechanize
url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)
response.set_data(response.get_data()[177:])
br.set_response(response)
br.select_form(nr = 0)
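Hardcoding the 177-character offset is fragile; a slightly more robust variant of the same idea (a sketch) is to cut everything before the <html> tag, wherever it happens to start:

# Sketch: strip everything before the real <html> tag instead of using a fixed offset.
body = response.get_data()
response.set_data(body[body.lower().find('<html'):])
br.set_response(response)
br.select_form(nr=0)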
I am trying to write a function to post form data and save returned cookie info in a file so that the next time the page is visited, the cookie information is sent to the server (i.e. normal browser behavior).
I wrote this relatively easily in C++ using libcurl, but have spent almost an entire day trying to write it in Python using urllib2, still with no success.
This is what I have so far:
import urllib, urllib2
import logging

# the path and filename to save your cookies in
COOKIEFILE = 'cookies.lwp'

cj = None
ClientCookie = None
cookielib = None
logger = logging.getLogger(__name__)

# Let's see if cookielib is available
try:
    import cookielib
except ImportError:
    logger.debug('importing cookielib failed. Trying ClientCookie')
    try:
        import ClientCookie
    except ImportError:
        logger.debug('ClientCookie isn\'t available either')
        urlopen = urllib2.urlopen
        Request = urllib2.Request
    else:
        logger.debug('imported ClientCookie successfully')
        urlopen = ClientCookie.urlopen
        Request = ClientCookie.Request
        cj = ClientCookie.LWPCookieJar()
else:
    logger.debug('Successfully imported cookielib')
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    # This is a subclass of FileCookieJar
    # that has useful load and save methods
    cj = cookielib.LWPCookieJar()
login_params = {'name': 'anon', 'password': 'pass' }
def login(theurl, login_params):
    init_cookies();
    data = urllib.urlencode(login_params)
    txheaders = {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    try:
        # create a request object
        req = Request(theurl, data, txheaders)
        # and open it to return a handle on the url
        handle = urlopen(req)
    except IOError, e:
        log.debug('Failed to open "%s".' % theurl)
        if hasattr(e, 'code'):
            log.debug('Failed with error code - %s.' % e.code)
        elif hasattr(e, 'reason'):
            log.debug("The error object has the following 'reason' attribute :" + e.reason)
        sys.exit()
    else:
        if cj is None:
            log.debug('We don\'t have a cookie library available - sorry.')
        else:
            print 'These are the cookies we have received so far :'
            for index, cookie in enumerate(cj):
                print index, ' : ', cookie
            # save the cookies again
            cj.save(COOKIEFILE)
        # return the data
        return handle.read()
# FIXME: I need to fix this so that it takes into account any cookie data we may have stored
def get_page(*args, **query):
    if len(args) != 1:
        raise ValueError(
            "post_page() takes exactly 1 argument (%d given)" % len(args)
        )
    url = args[0]
    query = urllib.urlencode(list(query.iteritems()))
    if not url.endswith('/') and query:
        url += '/'
    if query:
        url += "?" + query
    resource = urllib.urlopen(url)
    logger.debug('GET url "%s" => "%s", code %d' % (url,
                                                    resource.url,
                                                    resource.code))
    return resource.read()
When I attempt to log in, I pass the correct username and password, yet the login fails and no cookie data is saved.
My two questions are:
Can anyone see what's wrong with the login() function, and how may I fix it?
How may I modify the get_page() function to make use of any cookie info I have saved?
There are quite a few problems with the code that you've posted. Typically you'll want to build a custom opener that can handle redirects, HTTPS, etc., otherwise you'll run into trouble. As for the cookies themselves, you need to call the load and save methods on your cookie jar, and use one of its subclasses, such as MozillaCookieJar or LWPCookieJar.
Here's a class I wrote to log in to Facebook, back when I was playing silly web games. I just modified it to use a file-based cookie jar rather than an in-memory one.
import cookielib
import os
import urllib
import urllib2

# set these to whatever your fb account is
fb_username = "your@facebook.login"
fb_password = "secretpassword"

cookie_filename = "facebook.cookies"


class WebGamePlayer(object):

    def __init__(self, login, password):
        """ Start up... """
        self.login = login
        self.password = password

        self.cj = cookielib.MozillaCookieJar(cookie_filename)
        if os.access(cookie_filename, os.F_OK):
            self.cj.load()
        self.opener = urllib2.build_opener(
            urllib2.HTTPRedirectHandler(),
            urllib2.HTTPHandler(debuglevel=0),
            urllib2.HTTPSHandler(debuglevel=0),
            urllib2.HTTPCookieProcessor(self.cj)
        )
        self.opener.addheaders = [
            ('User-agent', ('Mozilla/4.0 (compatible; MSIE 6.0; '
                            'Windows NT 5.2; .NET CLR 1.1.4322)'))
        ]

        # need this twice - once to set cookies, once to log in...
        self.loginToFacebook()
        self.loginToFacebook()
        self.cj.save()

    def loginToFacebook(self):
        """
        Handle login. This should populate our cookie jar.
        """
        login_data = urllib.urlencode({
            'email': self.login,
            'pass': self.password,
        })
        response = self.opener.open("https://login.facebook.com/login.php", login_data)
        return ''.join(response.readlines())


test = WebGamePlayer(fb_username, fb_password)
After you've set your username and password, you should see a file, facebook.cookies, with your cookies in it. In practice you'll probably want to modify it to check whether you have an active cookie and use that, then log in again if access is denied.
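A sketch of what that check might look like, as a method added to the class above (the markers used to detect a bounced request are assumptions about Facebook's markup at the time):

# Sketch: reuse saved cookies and only log in again when the site bounces us to the login page.
def fetch(self, url):
    response = self.opener.open(url)
    html = response.read()
    if 'login.php' in response.geturl() or 'name="pass"' in html:   # assumed "not logged in" markers
        self.loginToFacebook()
        self.cj.save()
        html = self.opener.open(url).read()
    return html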
If you are having a hard time getting your POST requests to work (as I had with a login form), it definitely pays to quickly install the Live HTTP headers extension for Firefox (http://livehttpheaders.mozdev.org/index.html). This small extension can, among other things, show you the exact POST data that is sent when you log in manually.
In my case, I had banged my head against the wall for hours because the site insisted on an extra field with 'action=login' (doh!).
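In code, the fix amounted to adding that one field to the POST data (the field name is specific to my case; your site's will differ):

login_params = {'name': 'anon', 'password': 'pass', 'action': 'login'}  # the extra field the site insisted on
data = urllib.urlencode(login_params)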
Use ignore_discard and ignore_expires when saving the cookies; in my case that made the save work:
self.cj.save(cookie_file, ignore_discard=True, ignore_expires=True)
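The same flags exist on the load side; if the saved cookies include session cookies, they also have to be passed when loading the file back (a sketch, using the same cookie jar as above):

self.cj.load(cookie_file, ignore_discard=True, ignore_expires=True)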