Scraping data from external site with username and password - python

I have an application with many users, some of these users have an account on an external website with data I want to scrape.
This external site has a members area protected by an email/password form. Submitting the form sets some cookies (a couple of ASP ones). You can then pull up the needed page and grab the data the external site holds for the user who just logged in.
The external site has no API.
I envisage my application asking users for their credentials to the external site, logging in on their behalf and grabbing the data we want.
How would I go about this in Python? That is, do I need to run a GUI web browser on the server that Python drives to handle the cookies (I'd rather not)?

Find the call the page makes to the backend by inspecting the format of the login request in your browser's inspector.
Make the same request, using either getpass to collect the user's credentials from the terminal or a GUI. You can use urllib2 to make the requests.
Save all the cookies from the response in a cookiejar.
Reuse the cookies in subsequent requests to fetch the data.
Then, profit.
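The steps above can be sketched as follows. Note that urllib2 became urllib.request in Python 3; the URLs and the form-field names ("email", "password") below are placeholders you would replace with the ones shown in your browser's inspector.

```python
import getpass
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

def fetch_member_data(login_url, data_url, field_names=("email", "password")):
    """Log in, let a CookieJar capture the ASP session cookies, then
    reuse them to fetch the protected data page."""
    # A CookieJar attached to the opener stores cookies from the login
    # response and replays them automatically on later requests.
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    # Collect credentials from the terminal; getpass hides the password.
    creds = {field_names[0]: input("Email: "),
             field_names[1]: getpass.getpass("Password: ")}
    form = urllib.parse.urlencode(creds).encode()

    opener.open(login_url, data=form)    # login response sets the cookies
    return opener.open(data_url).read()  # cookies are sent automatically
```

The same opener must be used for both requests, since it is the opener (not the module-level functions) that carries the cookiejar.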

Usually, this is done with a session.
I recommend using the requests library (http://docs.python-requests.org/en/latest/) for this.
You can use its Session feature (http://docs.python-requests.org/en/latest/user/advanced/#session-objects). Simply perform an authentication HTTP request (the URL and parameters depend on the website you want to request), and then perform a request for the resource you want to scrape.
Without further information, we cannot help you more.
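As a minimal sketch, assuming a plain form-based login (the URLs and field names are placeholders to be read from the site's login form):

```python
import requests

def scrape_with_session(login_url, data_url, credentials):
    """Authenticate once, then request the protected resource. The
    Session keeps the cookies set by the login response and sends
    them with every later request."""
    with requests.Session() as session:
        session.post(login_url, data=credentials)
        return session.get(data_url).text
```

For example: `scrape_with_session("https://example.com/login", "https://example.com/data", {"email": "...", "password": "..."})`.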

Related

Python SAML Authentication Automation

I am trying to work on a project that collects all my monthly utility amounts and splits the amounts owed among my roommates. I have managed to programmatically log into two websites, but I am having trouble with the last, as it uses SAML (https://www.blackhillsenergy.com/). I have inspected the web requests with Chrome's Developer Tools but am not getting any breakthroughs. I attempted to use requests_ecp but am not having any luck with that either. I get the idea of SAML but am having a hard time understanding their implementation and how I can use it in my script. Below is my sample code. Any ideas?
import requests
from bs4 import BeautifulSoup
from requests_ecp import HTTPECPAuth

def get_bh_bill():
    url = 'https://www.blackhillsenergy.com/cpm/v1/user/accounts?username={fill here}'
    bh_login = ''
    bh_pass = ''
    # Start a session so we can have persistent cookies
    session = requests.session()
    session.auth = HTTPECPAuth('https://sso.blackhillsenergy.com', username=bh_login, password=bh_pass)
    acc_res = session.get(url)
    acc_soup = BeautifulSoup(acc_res.text, "html.parser")
    print(acc_soup.prettify())
    return '0000'
SAML typically works like this.
You hit the desired site; it sees you are not authenticated, so it creates a SAML request, routes it through your browser, and sends you to an IdP (Identity Provider).
The IdP reads the SAML request and then asks you for credentials. Once you are authenticated, it creates a SAML response and routes that back to the original site, again through your browser.
The routing is done by presenting a simple HTML form containing the SAML request/response, plus a teeny bit of JavaScript to submit it. This is how it moves information across domains (SAML is typically done across domains, which is why it doesn't use cookies).
What your script needs to do is basically follow the workflow: submit the forms automatically, log in when asked, and submit the forms back. It's a multi-step workflow, and there may well be a bunch of redirects involved as well.
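One building block for following that workflow is a helper that finds the auto-submitting form in a response and submits it yourself, in place of the JavaScript the browser would run. This is a sketch, not Black Hills-specific: the exact sequence of forms and redirects has to be discovered by stepping through the login in the developer tools, and the form's action URL may be relative and need resolving.

```python
import requests
from bs4 import BeautifulSoup

def extract_form_fields(html):
    """Parse the first form in `html` (the one a browser would
    auto-POST via JavaScript) and return its action URL plus all input
    name/value pairs (hidden SAMLRequest/SAMLResponse, RelayState, etc.)."""
    form = BeautifulSoup(html, "html.parser").find("form")
    if form is None:
        raise ValueError("no form found -- the workflow may have changed")
    fields = {i.get("name"): i.get("value", "")
              for i in form.find_all("input") if i.get("name")}
    return form["action"], fields

def follow_saml_step(session, html, overrides=None):
    """Submit the form manually. `overrides` lets the caller inject
    username/password fields when the IdP's login form appears."""
    action, fields = extract_form_fields(html)
    if overrides:
        fields.update(overrides)
    return session.post(action, data=fields)
```

You would call `follow_saml_step` repeatedly on each response, passing credentials only at the IdP's login step, until you land back on the original site authenticated.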

Using python to parse a webpage that is already open

From this question, the last responder seems to think it is possible to use Python to open a webpage, let me sign in manually, go through a bunch of menus, and then let Python parse the page when I get where I want. The website has a weird sign-in procedure, so using requests and passing a username and password will not be sufficient.
However, it seems from this question that it's not possible.
So the question is: is it possible? If so, do you know of some example code out there?
The way to approach this problem is to log in normally with the developer tools open and see what request is being sent.
When logging in to Bandcamp, you can see the XHR request that's sent in the Network tab.
From that response you can see that an identity cookie is being sent. That's probably how they identify that you are logged in. So when you've got that cookie set you would be authorized to view logged in pages.
So in your program you could log in normally using requests, save the cookie in a variable, and then apply that cookie to further requests.
Of course login procedures and how this authorization mechanism works may differ, but that's the general gist of it.
So when do you actually need Selenium? You need it if a lot of the page is rendered by JavaScript. requests is only able to get the HTML, so if the menus and such are rendered with JavaScript, you won't ever be able to see that information using requests.
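The login-then-reuse flow described above might look like this. The URL and field names are assumptions; read the real ones off the XHR request in the developer tools.

```python
import requests

def login_and_fetch(login_url, credentials, page_url):
    """POST the credentials the way the XHR request does, save the
    cookies the response sets in a variable, and apply them to a
    further request for a logged-in page."""
    cookies = requests.post(login_url, data=credentials).cookies
    return requests.get(page_url, cookies=cookies)
```

A `requests.Session` does the same cookie bookkeeping for you automatically; the explicit variable just makes the mechanism visible.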

How To Reflect The Website Authentication Done With Python Script Into The Browser?

My question may sound stupid, but I just want to know if it is possible to browse web pages that require authentication after doing the authentication with the Python requests library.
I have a script to log into the application, which successfully authenticates the user, but is there a way to reflect that in a browser like Chrome? That way the user could directly access the authenticated page without having to fill in the form and log in. It's all happening inside my application, so I'm not breaching any privacy policies.
Any suggestions would be great.
For example, I authenticated myself into http://example.com/login through the script, and I want to be able to directly browse http://example.com/user/home in the browser. How could this be done?
The simplest way is to have your Python application log into the web application and then have it request a token. The script then opens the browser, passing along the token. Your web application takes that token and uses it in lieu of the login form to authenticate the user and create their session.
Take a look at OAuth, I think it has specific workflows for this kind of scenario. Otherwise you can craft your own.
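A sketch of that handoff is below. Both the token endpoint and the `token` query parameter are hypothetical: your own web application would have to implement an endpoint that issues a one-time token to an authenticated session, and another that exchanges it for a browser session.

```python
import webbrowser

import requests

def open_authenticated_browser(login_url, credentials, app_url):
    """Log in from the script, request a one-time token, then open the
    browser on a URL that exchanges the token for a normal session."""
    with requests.Session() as session:
        session.post(login_url, data=credentials)
        # Hypothetical endpoint that returns a short-lived token.
        token = session.get(app_url + "/token").text.strip()
    # The server validates the token, creates a browser session, and
    # redirects to the user's home page.
    webbrowser.open(f"{app_url}/consume?token={token}")
```

The token should be single-use and short-lived, since it appears in the URL and in browser history.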

Python login a web by requests module

I am trying to log into a website with the requests module. The username input has no "name" attribute, and neither does the password field. However, that is what requests needs to know. So what should I do to log in?
The website is:
https://passport.lagou.com/login/login.html
Thank you!
Open your favorite browser's Developer Tools (F12 on Chrome, Ctrl+Shift+I on Firefox, etc.), and reproduce the HTTP request displayed in the Network tab when trying to login.
In your case, some parameters like username and password are being sent as Form Data in a POST request to https://passport.lagou.com/login/login.json
Depending on the web application's implementation, you might also need to send some request headers, and it could also simply not work at all for various reasons.
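A sketch of that POST is below. The field names, any extra hidden fields, and any headers the site checks must all be copied from the request shown in the Network tab; the ones here are assumptions, and the site may additionally hash or encrypt the password client-side, in which case a plain POST will not work.

```python
import requests

def login_lagou(username, password):
    """POST the form data the browser sends to login.json and return
    the session (which now holds any cookies) plus the response."""
    session = requests.Session()
    # Some sites reject requests carrying the default User-Agent.
    session.headers["User-Agent"] = "Mozilla/5.0"
    resp = session.post("https://passport.lagou.com/login/login.json",
                        data={"username": username, "password": password})
    return session, resp
```

If the login succeeds, reuse the returned session for subsequent page fetches so the cookies come along.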

Python urllib2 accesses page without sending authentication details

I was reading the urllib2 tutorial, which mentions that in order to access a page that requires authentication (e.g. a valid username and password), the server first sends an HTTP response with error code 401, and the (Python) client then sends a request with authentication details.
Now, the problem in my case is that there exist two different versions of a webpage: one that can be accessed without supplying any authentication details, and one that is quite different when authentication details are supplied (i.e. when the user is logged into the system). As an example, think about the URL www.gmail.com: when you are not logged in you get a log-in page, but if your browser remembers you from your last login, the result is your email account homepage with your inbox displayed.
I followed all the steps to set up a handler for authentication and install an opener. However, every time I request the page, I get back the version of the webpage that does not have the user logged in.
How can I access the other version of webpage that has the user logged-in?
Requests makes this easy. As its creators say:
Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken.
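The likely root cause is that urllib2's auth handler only covers HTTP Basic/Digest authentication (the 401 challenge the tutorial describes), while Gmail-style sites use a login form that sets session cookies, so the handler never fires. A hedged sketch of the cookie-based approach with requests (the login URL and field names are placeholders):

```python
import requests

# If the site really used HTTP Basic auth, this would be enough:
#   requests.get(url, auth=("user", "pass"))

def get_logged_in_page(login_url, credentials, page_url):
    """Form-based login: POST the credentials so the response sets
    session cookies, which the Session then carries on later requests,
    yielding the logged-in version of the page."""
    with requests.Session() as session:
        session.post(login_url, data=credentials)
        return session.get(page_url)
```

If the page still comes back logged out, the site probably requires extra hidden form fields or headers, which you can find in the browser's network inspector.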
Try using Mechanize. It has cookie handling features that would allow your program to be "logged in" even though it's not a real person.

Categories

Resources