I was reading the urllib2 tutorial, which mentions that in order to access a page that requires authentication (e.g. a valid username and password), the server first sends an HTTP header with error code 401 and the (Python) client then sends a request with the authentication details.
Now, the problem in my case is that there are two different versions of a webpage: one that can be accessed without supplying any authentication details, and one that is quite different when authentication details are supplied (i.e. when the user is logged in to the system). As an example, think about the URL www.gmail.com: when you are not logged in you get a log-in page, but if your browser remembers you from your last login then the result is your email account homepage with your inbox displayed.
I followed all the details to set up a handler for authentication and install an opener. However, every time I request the page I get back the version of the webpage that does not have the user logged in.
How can I access the other version of the webpage, the one that has the user logged in?
Requests makes this easy. As its creators say:
Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken.
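For example, a minimal sketch of both cases (the URLs and field names are placeholders, not the asker's actual site):

    import requests

    # for the 401-challenge case described above, requests handles
    # basic auth in a single call
    r = requests.get("https://example.com/protected", auth=("user", "pass"))

    # for a form-based login (the Gmail-style case), keep the cookies
    # in a Session so later requests get the logged-in version
    s = requests.Session()
    s.post("https://example.com/login",
           data={"username": "...", "password": "..."})
    r = s.get("https://example.com/inbox")  # returns the logged-in page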
Try using Mechanize. It has cookie handling features that would allow your program to be "logged in" even though it's not a real person.
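For illustration, a sketch of a form login with Mechanize; the URL and form field names are assumptions, so inspect the real page to find them:

    import mechanize

    br = mechanize.Browser()
    br.open("https://example.com/login")
    br.select_form(nr=0)       # assumes the login form is the first on the page
    br["username"] = "you"     # field names are hypothetical; check the page source
    br["password"] = "secret"
    br.submit()

    # mechanize keeps the session cookies, so this fetch is "logged in"
    page = br.open("https://example.com/account").read()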
So I'm currently learning the Python requests module, but I'm a bit confused and was wondering if someone could steer me in the right direction. I've seen some people post headers when they want to log into a website, but where do they get these headers from, and when do you need them? I've also seen some people say you need an authentication token, yet other solutions don't use headers or an authentication token at all. Below is supposedly the authentication token, but I'm not sure where to go from here after I post my username and password.
<input type="hidden" name="lt" value="LT-970332-9KawhPFuLomjRV3UQOBWs7NMUQAQX7" />
Although your question is a bit vague, I'll try to help you.
Authentication
A web browser (the client) can authenticate on the target server by providing data, usually a login/password pair, which is typically encoded for security reasons.
This data can be passed from client to server using the following parts of HTTP request:
URL parameters (http://httpbin.org/get?foo=bar)
headers
body (this is where POST parameters from HTML forms usually go)
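With requests, each of those three channels looks like this (httpbin.org simply echoes back what it receives, so it's handy for experimenting; the header value is base64 of "user:pass"):

    import requests

    # URL parameters
    requests.get("http://httpbin.org/get", params={"foo": "bar"})

    # headers
    requests.get("http://httpbin.org/headers",
                 headers={"Authorization": "Basic dXNlcjpwYXNz"})

    # body (form-encoded, like an HTML login form)
    requests.post("http://httpbin.org/post",
                  data={"login": "user", "password": "pass"})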
Tokens
After successful authentication, the server generates a unique token and sends it to the client. If the server wants the client to store the token as a cookie, it includes a Set-Cookie header in its response.
A token usually represents a unique identifier of a user session. In most cases a token has an expiration date for security reasons.
Web browsers usually store tokens as cookies in internal cookie storage and use them in all subsequent requests to the corresponding website. A single website can use multiple tokens and other cookies for a single user.
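You can watch this mechanism in action with httpbin, which will set any cookie you ask for (the token value here is made up):

    import requests

    # httpbin echoes a Set-Cookie header so you can see the mechanism work
    resp = requests.get("http://httpbin.org/cookies/set?token=abc123",
                        allow_redirects=False)
    print(resp.headers["Set-Cookie"])   # what the server asks the client to store
    print(resp.cookies.get("token"))    # requests' cookie jar plays the browser's role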
Research
Every website has its own authentication format, rules, and restrictions, so the first thing you need to do is a little research on the target website. You need to find out how the client sends auth information to the server, what the server replies, and where the session data is stored (usually you can find it in the client request headers).
In order to do that, you may use a proxy (Burp for example) to intercept browser traffic. It can help you to get the data passed from client to server and back.
Try to authenticate and then browse some pages on the target site using your web browser with the proxy. After that, using your proxy, examine which parts of the HTTP requests and responses the client and server use to store information about sessions and authentication.
After that you can finally use python and requests to do what you want.
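As a side note, you can also point requests at the same intercepting proxy to compare your script's traffic with the browser's; the address below assumes Burp's default listener, and verify=False is needed because Burp re-signs TLS:

    import requests

    # route the script's traffic through the intercepting proxy
    proxies = {"http": "http://127.0.0.1:8080",
               "https": "http://127.0.0.1:8080"}
    requests.get("http://httpbin.org/get", proxies=proxies,
                 verify=False)  # skip certificate checks for the proxy's TLS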
From this question, the last responder seems to think that it is possible to use Python to open a webpage, let me sign in manually, go through a bunch of menus, and then let Python parse the page when I get where I want. The website has a weird sign-in procedure, so using requests and passing a username and password will not be sufficient.
However, it seems from this question that it's not a possibility.
So the question is: is it possible? If so, do you know of some example code out there?
The way to approach this problem is to log in normally with the developer tools open next to you and see what the request is sending.
When logging in to bandcamp, the XHR request being sent is a POST containing your credentials (you can see it in the Network tab).
From the response to that request you can see that an identity cookie is being set. That's probably how they identify that you are logged in. So once you've got that cookie set, you are authorized to view logged-in pages.
So in your program you could log in using requests, save the cookie from the response in a variable, and then apply that cookie to further requests.
Of course login procedures and how this authorization mechanism works may differ, but that's the general gist of it.
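A rough sketch of that flow; the endpoint and field names are hypothetical, so copy the real ones from the XHR request you observed:

    import requests

    # log in; field names must match the Form Data in the developer tools
    login = requests.post("https://bandcamp.com/login",
                          data={"username": "you@example.com",
                                "password": "secret"})
    identity = login.cookies  # the identity cookie from the response

    # apply the saved cookie to further requests
    page = requests.get("https://bandcamp.com/profile", cookies=identity)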
So when do you actually need Selenium? You need it if a lot of the page is being rendered by JavaScript. requests is only able to get the HTML. So if the menus and such are rendered with JavaScript, you won't ever be able to see that information using requests.
I am trying to log into a website with the requests module. The input field for the username has no "name" attribute, nor does the password field. However, that is what requests needs to know. So what should I do to log in?
The website is:
https://passport.lagou.com/login/login.html
Thank you!
Open your favorite browser's Developer Tools (F12 on Chrome, Ctrl+Shift+I on Firefox, etc.), and reproduce the HTTP request displayed in the Network tab when trying to log in.
In your case, some parameters like username and password are being sent as Form Data in a POST request to https://passport.lagou.com/login/login.json
Depending on the web application's implementation, you might also need to send some request headers, and it could also simply not work at all for various reasons.
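If it does work, the request might look roughly like this; the form field names are assumptions, so copy the exact Form Data keys from the Network tab (the site may also hash the password client-side or require extra fields):

    import requests

    resp = requests.post(
        "https://passport.lagou.com/login/login.json",
        data={"username": "you@example.com", "password": "secret"},
        headers={"User-Agent": "Mozilla/5.0"},  # some sites reject requests' default UA
    )
    print(resp.status_code, resp.text[:200])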
I have an application with many users, some of these users have an account on an external website with data I want to scrape.
This external site has a members area protected with a email/password form. This sets some cookies when submitted (a couple of ASP ones). You can then pull up the needed page and grab the data the external site holds for the user that just logged in.
The external site has no API.
I envisage my application asking users for their credentials to the external site, logging in on their behalf and grabbing the data we want.
How would I go about this in Python, i.e. do I need to run a GUI web browser on the server that Python prods to handle the cookies (I'd rather not)?
Find the call the page makes to the backend by inspecting the format of the login call in your browser's inspector.
Make the same request, collecting the user's credentials either with getpass in the terminal or via a GUI. You can use urllib2 to make the requests.
Save all the cookies from the response in a cookiejar.
Reuse the cookies in subsequent requests and fetch data.
Then, profit.
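Putting those steps together, a sketch with urllib2 and a cookiejar (Python 2, since that's where urllib2 lives; the URL and field names are placeholders to be taken from the browser's inspector):

    import cookielib
    import getpass
    import urllib
    import urllib2

    email = raw_input("Email: ")
    password = getpass.getpass("Password: ")

    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

    # log in; the ASP session cookies from the response land in the jar
    body = urllib.urlencode({"email": email, "password": password})
    opener.open("https://example.com/members/login", body)

    # the opener resends the jar's cookies on subsequent requests
    page = opener.open("https://example.com/members/data").read()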
Usually, this is performed with a session.
I recommend using the requests library (http://docs.python-requests.org/en/latest/) in order to do that.
You can use the Session feature (http://docs.python-requests.org/en/latest/user/advanced/#session-objects). Simply perform an authentication HTTP request (the URL and parameters depend on the website you want to request), and then perform a request towards the resource you want to scrape.
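A minimal sketch, assuming a plain form-based login (the URL and parameters are placeholders):

    import requests

    session = requests.Session()
    session.post("https://example.com/login",
                 data={"email": "you@example.com", "password": "secret"})

    # the same Session resends the auth cookies automatically
    html = session.get("https://example.com/members/data").text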
Without further information, we cannot help you more.
Can anyone explain the steps for using the openid library mentioned here?
I have imported all the packages of the Janrain openid library in my program, but I can't understand the actual flow of the code.
The process should basically follow this plan:
Add an OpenID login field somewhere on your site. When an OpenID is entered in that field and the form is submitted, it should make a request to your site which includes that OpenID URL.
First, the application should instantiate a Consumer with a session for per-user state and a store for shared state, using the store of your choice.
Next, the application should call the 'begin' method on the Consumer instance. This method takes the OpenID URL. The begin method returns an AuthRequest object.
Next, the application should call the redirectURL method on the AuthRequest object. The parameter return_to is the URL that the OpenID server will send the user back to after attempting to verify his or her identity. The realm parameter is the URL (or URL pattern) that identifies your web site to the user when he or she is authorizing it. Send the user's browser a redirect to the resulting URL.
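Here is a sketch of that first half with python-openid (the Janrain library); the in-memory store and the URLs are illustrative only:

    from openid.consumer import consumer
    from openid.store.memstore import MemoryStore

    store = MemoryStore()   # shared state; use a file or DB store in production
    session = {}            # per-user state, normally your web session dict

    c = consumer.Consumer(session, store)
    # may raise DiscoveryFailure if the URL isn't a valid OpenID
    auth_request = c.begin("https://user.example.com/")

    redirect_url = auth_request.redirectURL(
        realm="https://www.yoursite.com/",                     # identifies your site
        return_to="https://www.yoursite.com/openid/complete",  # where the provider sends the user back
    )
    # now send the user's browser an HTTP redirect to redirect_url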
That's the first half of the authentication process. The second half of the process is done after the user's OpenID Provider sends the user's browser a redirect back to your site to complete their login.
When that happens, the user will contact your site at the URL given as the return_to URL to the redirectURL call made above. The request will have several query parameters added to the URL by the OpenID provider as the information necessary to finish the request.
Get a Consumer instance with the same session and store as before and call its complete method, passing in all the received query arguments.
There are multiple possible return types from that method. These indicate whether or not the login was successful, and include any additional information appropriate for their type.
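Continuing the sketch, the second half might look like this; session and store are the same objects as in the first half, and query_args stands for the dict of query parameters your web framework hands you at the return_to URL:

    from openid.consumer import consumer

    c = consumer.Consumer(session, store)
    response = c.complete(query_args, "https://www.yoursite.com/openid/complete")

    if response.status == consumer.SUCCESS:
        print("Logged in as " + response.getDisplayIdentifier())
    elif response.status == consumer.CANCEL:
        print("Login cancelled by the user")
    elif response.status == consumer.FAILURE:
        print("Verification failed: " + str(response.message))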
web.py includes a webopenid module that implements a complete, if basic, OpenID authentication system using the Janrain library. Using the included host class, you can add OpenID-backed authentication to your project. To actually pull interesting data from the OpenID provider, though, you'll need to add AX and SReg request/response handling.