I need to fill in a login form on a webpage that requires cookies and get some information about the resultant page. Since this needs to be done at very odd hours of the night, I'd like to automate the process, and am therefore using mechanize (other suggestions are welcome; note that I have to run my script on a school server on which I cannot install new software, but mechanize is pure Python, so I can get around that restriction).
The problem is that the page hosting the login form requires that I be able to accept and send cookies. Ideally, I'd like to accept and send all cookies that the server sends me, rather than hard-code my own.
So, I set out to write my script with mechanize, but I seem to be handling cookies wrong. Since I can't find helpful documentation anywhere (please point it out if I'm blind), I am asking here.
Here is my mechanize script:
import mechanize as mech
br = mech.Browser()
br.set_handle_robots(False)
print "No Robots"
br.set_handle_redirect(True)
br.open("some internal uOttawa website")
br.select_form(nr=0)
br.form['j_username'] = 'my username'
print "Login: ************"
br.form['j_password'] = 'my password'
print "Password: ************"
response = br.submit()
print response.read()
This prints the following:
No Robots
Login: ************
Password: ************
<html>
<body>
<img src="/idp/images/uottawa-logo-dark.png" />
<h3>ERROR</h3>
<p>
An error occurred while processing your request. Please contact your helpdesk or
user ID office for assistance.
</p>
<p>
This service requires cookies. Please ensure that they are enabled and try your
request again.
</p>
<p>
Use of your browser's back button may cause specific errors that can be resolved by
going back to your desired resource and trying to login again.
</p>
<p>
If you think you were sent here in error,
please contact technical support
</p>
</body>
</html>
This is indeed the page that I would get if I disabled cookies on my Chrome browser and attempted the same thing.
I've tried adding a cookie jar as follows, with no luck.
import cookielib

br = mech.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
I took a look at multiple mechanize documentation sources. One of them mentions:
A common mistake is to use mechanize.urlopen(), and the .extract_cookies() and
.add_cookie_header() methods on a cookie object themselves.
If you use mechanize.urlopen() (or OpenerDirector.open()),
the module handles extraction and adding of cookies by itself,
so you should not call .extract_cookies() or .add_cookie_header().
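For illustration, here is the minimal pattern that passage describes, as I understand it (a sketch; the URL is a placeholder):

import cookielib
import mechanize

# mechanize mirrors the urllib2 API, so an opener built with
# HTTPCookieProcessor extracts Set-Cookie headers and re-sends the
# cookies on every request made through it, with no manual calls to
# .extract_cookies() or .add_cookie_header().
cj = cookielib.LWPCookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
response = opener.open("http://example.com/login-page")
print len(cj)  # the jar now holds whatever cookies the server set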
This seems to say that my first method should work, but it doesn't.
I'd appreciate any help with this - it's confusing, and there seems to be a severe lack of documentation.
I came across the exact same message while authenticating to a Shibboleth website with mechanize, because I made the same mistake as you. It looks like I figured it out.
Short answer
The link you need to open is:
br.open("https://web30.uottawa.ca/Shibboleth.sso/Login?target=https://web30.uottawa.ca/hr/web/post-register")
Instead of:
br.open("https://idp.uottawa.ca/idp/login.jsp?actionUrl=%2Fidp%2FAuthn%2FUserPassword")
Why?
Shibboleth: Connect easily and securely to a variety of services with
one simple login.
The Shibboleth login itself is useless if you don't tell it which service you want to log in to. Let's analyse the HTTP headers and compare the cookies you get for each of the two queries.
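If you want to reproduce this comparison yourself, mechanize can log the raw HTTP traffic so the Cookie and Set-Cookie headers are visible (a short sketch, using the URL from this answer):

import mechanize as mech

br = mech.Browser()
# Print each HTTP request and response to stdout, headers included.
br.set_debug_http(True)
br.open("https://web30.uottawa.ca/Shibboleth.sso/Login?target=https://web30.uottawa.ca/hr/web/post-register")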
1. Opening https://idp.uottawa.ca/idp/login.jsp?actionUrl=%2Fidp%2FAuthn%2FUserPassword
Cookie: JSESSIONID=C2D4A19B2994BFA287A328F71A281C49; _ga=GA1.2.1233451770.1401374115; arp_scroll_position=-1; tools-resize=tools-resize-small; lang-prev-page=en; __utma=251309913.1233451770.1401374115.1401375882.1401375882.1; __utmb=251309913.14.9.1401376471057; __utmz=251309913.1401375882.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); lang=en
2. Opening https://web30.uottawa.ca/Shibboleth.sso/Login?target=https://web30.uottawa.ca/hr/web/post-register
Cookie: JSESSIONID=8D6BEA53823CC1C3045B2CE3B1D61DB0; _idp_authn_lc_key=fc18251e-e5aa-4f77-bb17-5e893d8d3a43; _ga=GA1.2.1233451770.1401374115; arp_scroll_position=-1; tools-resize=tools-resize-small; lang-prev-page=en; __utma=251309913.1233451770.1401374115.1401375882.1401375882.1; __utmb=251309913.16.9.1401378064938; __utmz=251309913.1401375882.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); lang=en
What's the difference? You got one more cookie: _idp_authn_lc_key. I suppose this is the cookie that says "I want to log in there".
During the authentication process, the IdP will set a cookie named
_idp_authn_lc_key. This cookie contains only information necessary to identify the current authentication process (which usually spans
multiple requests/responses) and is deleted after the authentication
process completes.
Source: https://wiki.shibboleth.net/confluence/display/SHIB2/IdPCookieUsage
How did I find that link? I dug around the site and found that https://web30.uottawa.ca/hr/web/en/user/registration points to the login form with the following link:
<a href="https://web30.uottawa.ca/Shibboleth.sso/Login?target=https://web30.uottawa.ca/hr/web/post-register"
class="button standard"><span>Create your account using infoweb</span></a>
So this was not a problem with mechanize, but rather that Shibboleth is a little hard to understand at first glance. You will find more information on the Shibboleth authentication flow here.
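Putting it together, the original script should work once it starts from the service-initiated login URL; here is a sketch (the form field names are the ones from the question):

import mechanize as mech

br = mech.Browser()  # Browser keeps its own cookie jar by default
br.set_handle_robots(False)
br.set_handle_redirect(True)

# Start at the Shibboleth service login URL, not the bare IdP page, so
# the _idp_authn_lc_key cookie gets set for this authentication attempt.
br.open("https://web30.uottawa.ca/Shibboleth.sso/Login?target=https://web30.uottawa.ca/hr/web/post-register")

br.select_form(nr=0)
br.form['j_username'] = 'my username'
br.form['j_password'] = 'my password'
response = br.submit()
print response.read()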
The website you're submitting your form data to probably needs a CSRF token (a value delivered via a cookie or a hidden field on the form page whose download you're skipping).
Try using Requests:
http://docs.python-requests.org/en/latest/user/quickstart/#cookies
Look for the cookies and/or hidden form fields and then fire away.
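A minimal sketch of that approach with Requests, assuming a hypothetical /login URL and a hidden field named csrf_token (check the actual page source for the real names):

import requests
from bs4 import BeautifulSoup  # hypothetical helper for parsing the form

session = requests.Session()

# GET the login page first: the session stores any cookies it sets, and
# the HTML contains the hidden fields the server expects to get back.
login_page = session.get("https://example.com/login")
soup = BeautifulSoup(login_page.text, "html.parser")
token = soup.find("input", {"name": "csrf_token"})["value"]

# POST the credentials together with the token; the session re-sends
# the cookies it collected on the first request automatically.
resp = session.post("https://example.com/login", data={
    "username": "my username",
    "password": "my password",
    "csrf_token": token,
})
print(resp.status_code)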
Related
I'm trying to make a script to auto-login to this website and I'm having some trouble. I was hoping I could get assistance with making this work. I have the code below assembled, but I get 'Your request cannot be processed at this time\n' at the bottom of what's returned, when I should be getting different HTML if the login were successful:
from pyquery import PyQuery
import requests

url = 'https://licensing.gov.nl.ca/miriad/sfjsp?interviewID=MRlogin'
values = {'d_1553779889165': 'email#email.com',
          'd_1553779889166': 'thisIsMyPassw0rd$$$',
          'd_1618409713756': 'true',
          'd_1642075435596': 'Sign in'}

r = requests.post(url, data=values)
print(r.content)
I do this in .NET, but I think the logic can be written in Python as well.
Firstly, I always use Fiddler to capture the requests that a webpage sends, then identify the request you want to replicate and add all the cookies and headers that are sent with it to your code.
After sending the login request you will get some cookies that identify that you've logged in, and you use those cookies to proceed further on the site. For example, if you want to retrieve a user's info after logging in, you first need to convince the server that you are logged in, and that is where those login cookies help you, as sketched below.
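In Python, a requests.Session does that cookie bookkeeping for you; a sketch, assuming the site uses a plain form POST (the URLs and field names are placeholders):

import requests

session = requests.Session()

# The session records whatever cookies the server sets on login ...
session.post("https://example.com/login",
             data={"username": "user", "password": "pass"},
             headers={"User-Agent": "Mozilla/5.0"})  # mimic a real browser

# ... and sends them back automatically, so this request is made as the
# logged-in user.
profile = session.get("https://example.com/account/info")
print(profile.text)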
Also, I don't think the login would be so simple through a script, because if you're trying to automate a government site, they may have some anti-bot security in place, some kind of fingerprinting or captcha.
Hope this helps!
I want to log in via a Python script to a website and do some operations there. The process will be:
Log in to the website.
Press a specific button to get forwarded to a new page.
Get specific data from the forwarded page and do operations like putting values in fields and pressing the save button.
My problem is, I can't get access to the website.
The error message in PyCharm (IDE):
<div class="content-container"><fieldset>
<h2>401 - Unauthorized: Access denied due to invalid credentials.</h2>
<h3>The credentials provided do not authorize you to view this directory or page.</h3>
</fieldset></div>
I linked an image of the login form of the website in question:
Login window.
I am unsure whether I need an HTTP(S) request or whether it's done with JavaScript, because I have no knowledge of either.
I can reach some kind of success with this:
Result on main page.
But it gives me only about 10% of the information I need, and since it's hard to visualize, I can't really tell whether it is the page I expected.
I have used the requests module for this:
import requests
from requests.auth import HTTPBasicAuth

user_name = file[0]  # credentials are read in elsewhere
password = file[1]
login_url = r"https://.../..."

response = requests.get(login_url,
                        auth=HTTPBasicAuth(user_name, password))
print(response.text)
What I have used:
PyCharm IDE
Python module Requests
The wanted website
I also tried to get it to work with the mechanize module, but I could not even log in to the website at all.
I'm quite unsure how to explain my issue. I'm trying to scrape a schedule page (of my school) to make it easier to read. Unfortunately, I couldn't figure out how to pass the credentials to the login prompt with Python.
url = "https://www.diltheyschule.de/vertretungsplan/
or rather this one due to it contains the actual data.
url = https://www.diltheyschule.de/vertretungsplan/f1/subst_001.htm
I do know the password and username.
The login prompt looks like this:
As you might have guessed, I want to pass the password and username to this prompt.
This code doesn't work for me; it returns an unauthorized error.
import requests

session = requests.Session()
r = session.post("https://www.diltheyschule.de/vertretungsplan/",
                 data={"log": "xxx", "pwd": "xxx"})
# or
r = session.post("https://www.diltheyschule.de/vertretungsplan/f1/subst_001.htm",
                 data={"log": "xxx", "pwd": "xxx"})
print(r.content)
Output:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>401 Unauthorized</title>
</head><body>
<h1>Unauthorized</h1>
<p>This server could not verify that you
are authorized to access the document
requested. Either you supplied the wrong
credentials (e.g., bad password), or your
browser doesn't understand how to supply
the credentials required.</p>
<hr>
<address>Apache Server at www.diltheyschule.de Port 443</address>
</body></html>
Probably essential information:
The goal is to scrape 'https://www.diltheyschule.de/vertretungsplan/f1/subst_001.htm'.
I want to pass pwd and log to the prompt, most likely without GUI support (e.g. Selenium).
This directory is secured by HTTP Basic authentication. This is the simplest authentication method: you log in by sending the appropriate headers.
Also, are you sure you want to use the POST method just to see what is in the .html page?
Please, try this:
import requests

session = requests.Session()
r = session.get("https://www.diltheyschule.de/vertretungsplan/f1/subst_001.htm",
                auth=requests.auth.HTTPBasicAuth('user', 'pass'))
print(r.content)
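As a side note, Requests also accepts a plain (user, pass) tuple as shorthand for HTTPBasicAuth, so an equivalent sketch is:

import requests

session = requests.Session()
# A (user, pass) tuple is shorthand for requests.auth.HTTPBasicAuth.
r = session.get("https://www.diltheyschule.de/vertretungsplan/f1/subst_001.htm",
                auth=('user', 'pass'))
print(r.status_code)  # expect 200 once the credentials are right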
Here is a piece of code that I use to fetch a web page HTML source (code) by its URL using Google App Engine:
from google.appengine.api import urlfetch

url = "http://www.google.com/"
result = urlfetch.fetch(url)
if result.status_code == 200:
    print "content-type: text/plain"
    print
    print result.content
Everything is fine here, but sometimes I need to get the HTML source of a page on a site where I am registered, and I can only access that page after submitting my ID and password. (It can be any site, actually; for example, a mail provider like Yahoo: https://login.yahoo.com/config/mail?.src=ym&.intl=us, or any other site where users get free accounts by registering.)
Can I somehow do it in Python (through Google App Engine)?
You can check for an HTTP status code of 401, "authorization required", and provide the kind of HTTP authorization (basic, digest, whatever) that the site is asking for -- see e.g. here for more details (there's not much that's GAE specific here -- it's a matter of learning HTTP details and obeying them!-).
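For basic authorization specifically, the pattern looks roughly like this with urlfetch (a sketch; the URL and credentials are placeholders):

import base64
from google.appengine.api import urlfetch

url = "http://www.example.com/protected-page"
result = urlfetch.fetch(url)

if result.status_code == 401:
    # The WWW-Authenticate response header says which scheme the site
    # wants; for Basic auth, retry with an Authorization header.
    auth = base64.b64encode("my_id:my_password")
    result = urlfetch.fetch(url,
                            headers={"Authorization": "Basic " + auth})

print result.status_code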
As Alex said, you can check the status code and see what type of authorization the site wants, but you cannot generalize this: some sites will not give any hint, or will only allow login through a non-standard form. In those cases you may have to automate the login process using forms; for that you can use a library like twill (http://twill.idyll.org/) or code a specific form submission for each site.
I've been googling around for quite some time now and can't seem to get this to work. A lot of my searches have pointed me to similar problems, but they all seem to be related to cookie grabbing/storing. I think I've set that up properly, but when I try to open the 'hidden' page, it keeps bringing me back to the login page saying my session has expired.
import urllib, urllib2, cookielib, webbrowser
username = 'userhere'
password = 'passwordhere'
url = 'http://example.com'
webbrowser.open(url, new=1, autoraise=1)
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'j_password' : password})
opener.open('http://example.com', login_data)
resp = opener.open('http://example.com/afterlogin')
print resp
webbrowser.open(url, new=1, autoraise=1)
First off, when doing cookie-based authentication, you need a CookieJar to store your cookies in, much in the same way that your browser stores its cookies in a place where it can find them again.
After opening a login page through Python and saving the cookie from a successful login, you can use MozillaCookieJar to write the Python-created cookies in a format a Firefox browser can parse. Note, however, that Firefox 3.x no longer uses the cookie format that MozillaCookieJar produces, and I have not been able to find viable alternatives.
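For reference, saving cookies in the Mozilla format looks roughly like this (a sketch; whether your browser version can still read the file is a separate question, as noted above):

import cookielib
import urllib2

# MozillaCookieJar writes a cookies.txt in the old Netscape/Mozilla format.
cj = cookielib.MozillaCookieJar("cookies.txt")
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open("http://example.com/login")  # cookies land in cj

# ignore_discard keeps session cookies that would otherwise be dropped.
cj.save(ignore_discard=True, ignore_expires=True)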
If all you need to do is retrieve specific data in a format known in advance, then I suggest you keep all your HTTP interactions within Python. It is much easier, and you don't have to rely on specific browsers being available. If it is absolutely necessary to show things in a browser, you could fetch the so-called 'hidden' page through urllib2 (which incidentally integrates very nicely with cookielib), save the HTML to a temporary file, and pass that to webbrowser.open, which will then render that specific page. Further redirects are not possible.
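A sketch of that temporary-file approach (the URLs are placeholders):

import cookielib
import tempfile
import urllib2
import webbrowser

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Fetch the 'hidden' page with the logged-in opener ...
html = opener.open("http://example.com/afterlogin").read()

# ... write it to a temp file and let the browser render the snapshot.
# Relative links and further redirects will not work from here.
f = tempfile.NamedTemporaryFile(suffix=".html", delete=False)
f.write(html)
f.close()
webbrowser.open("file://" + f.name)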
I've generally used the mechanize library to handle stuff like this. That doesn't answer your question about why your existing code isn't working, but it's something else to play with.
The provided code calls:
opener.open('http://example.com', login_data)
but throws away the response. I would look at this response to see if it says "Bad password" or "I only accept IE" or similar.
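That is, something like this (a sketch built on the question's code, with the form field names left as guesses):

import urllib, urllib2, cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username': 'userhere',
                               'j_password': 'passwordhere'})

resp = opener.open('http://example.com', login_data)
body = resp.read()

# A failed login often still comes back as status 200 with an error
# page, so look at the body itself before requesting more pages.
print resp.getcode()
print body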