Using Python to parse a webpage that is already open

From this question, the last responder seems to think it is possible to use Python to open a webpage, let me sign in manually, go through a bunch of menus, and then let Python parse the page once I get where I want. The website has an unusual sign-in procedure, so using requests and passing a username and password will not be sufficient.
However, it seems from this question that it isn't a possibility.
So the question is: is it possible? If so, do you know of some example code out there?

The way to approach this problem is to log in normally with your browser's developer tools open and watch what request is being sent.
When logging in to Bandcamp, for example, you can inspect the XHR request that is sent on login.
From that response you can see that an identity cookie is being set. That is most likely how the site identifies that you are logged in, so once you have that cookie set you are authorized to view logged-in pages.
So in your program you can log in normally using requests, save the cookie from the response in a variable, and then attach that cookie to further requests.
Of course, login procedures and authorization mechanisms differ between sites, but that's the general gist of it.
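
A minimal sketch of that flow with requests; the URLs, form field names, and credentials below are assumptions, not the real ones for any particular site:

    import requests

    # Hypothetical URLs and form field names -- inspect the real ones
    # in your browser's developer tools (Network tab).
    LOGIN_URL = "https://example.com/login"
    ACCOUNT_URL = "https://example.com/account"

    resp = requests.post(LOGIN_URL, data={"username": "me", "password": "secret"})
    resp.raise_for_status()

    # The identity cookie the server set lives in resp.cookies;
    # send it along explicitly on every further request.
    page = requests.get(ACCOUNT_URL, cookies=resp.cookies)
    print(page.status_code)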
So when do you actually need Selenium? You need it when a lot of the page is rendered by JavaScript. requests can only fetch the raw HTML, so if the menus and other content are rendered with JavaScript, you will never see that information using requests.
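
For the original question (sign in manually, then let Python parse the page), a Selenium sketch along these lines would work; the URL is a placeholder and the "parsing" step is just a print:

    from selenium import webdriver

    driver = webdriver.Firefox()  # or webdriver.Chrome()
    driver.get("https://example.com/login")  # placeholder URL

    # Sign in by hand and click through the menus in the opened browser,
    # then come back here and press Enter to let the script continue.
    input("Press Enter once you are on the page you want to parse... ")

    html = driver.page_source  # the fully rendered HTML, JavaScript included
    print(len(html))
    driver.quit()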

Related

Downloading URL To file... Not returning JSON data but Login HTML instead

I am writing a web scraping application. When I enter the URL directly into a browser, it displays the JSON data I want.
However, if I use Python's requests lib, or URLDownloadToFile in C++, it simply downloads the HTML for the login page.
The site I am trying to scrape (DraftKings.com) requires a login. The other sites I scrape from don't.
I am 100% sure this is related, since if I paste the URL when I am logged out, I get the login page rather than the JSON data. Once I log in, if I paste the URL again, I get the JSON data again.
The thing is, even if I remain logged in in the browser and then use the Python script or C++ app to download the JSON data, it still downloads the login HTML.
Anyone know how I can fix this issue?
Please don't ask us to help with an activity that violates the terms of service of the site you are trying to (ab-)use:
Using automated means (including but not limited to harvesting bots, robots, parser, spiders or screen scrapers) to obtain, collect or access any information on the Website or of any User for any purpose.
Even if that kind of usage were allowed, the answer would be boring:
You'd need to implement the login functionality in your scraper.

Scraping data from external site with username and password

I have an application with many users; some of them have an account on an external website with data I want to scrape.
This external site has a members area protected with a email/password form. This sets some cookies when submitted (a couple of ASP ones). You can then pull up the needed page and grab the data the external site holds for the user that just logged in.
The external site has no API.
I envisage my application asking users for their credentials to the external site, logging in on their behalf and grabbing the data we want.
How would I go about this in Python, i.e. do I need to run a GUI web browser on the server that Python prods to handle the cookies (I'd rather not)?
1. Find the call the page makes to the backend by inspecting the format of the login request in your browser's inspector.
2. Make the same request from Python, collecting the user's credentials either with getpass in the terminal or via a GUI. You can use urllib2 to make the requests.
3. Save all the cookies from the response in a cookiejar.
4. Reuse the cookies in subsequent requests and fetch the data.
5. Then, profit.
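
A rough sketch of steps 2-4 using http.cookiejar with urllib.request (the Python 3 successor to urllib2); the endpoints and field names are placeholders:

    import getpass
    import http.cookiejar
    import urllib.parse
    import urllib.request

    # Placeholder endpoints -- find the real ones in the inspector.
    LOGIN_URL = "https://members.example.com/login"
    DATA_URL = "https://members.example.com/data"

    email = input("Email: ")
    password = getpass.getpass("Password: ")

    # The cookiejar keeps whatever cookies the login response sets
    # (e.g. the ASP session cookies) and replays them automatically.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    form = urllib.parse.urlencode({"email": email, "password": password}).encode()
    opener.open(LOGIN_URL, form)         # steps 2-3: log in; cookies land in the jar
    page = opener.open(DATA_URL).read()  # step 4: authenticated fetch
    print(len(page))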
Usually this is done with a session.
I recommend using the requests library (http://docs.python-requests.org/en/latest/) for this.
You can use its Session feature (http://docs.python-requests.org/en/latest/user/advanced/#session-objects). Simply perform an authentication HTTP request (the URL and parameters depend on the website you want to query), and then perform a request for the resource you want to scrape.
Without further information, we cannot help you more.
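
A minimal sketch of that Session-based flow, with placeholder URLs and form fields; the Session does the cookie bookkeeping for you:

    import requests

    session = requests.Session()

    # Authenticate once; the session stores any cookies that are set.
    session.post(
        "https://members.example.com/login",  # placeholder URL and fields
        data={"email": "user@example.com", "password": "secret"},
    )

    # The stored cookies are sent automatically on every later request.
    resp = session.get("https://members.example.com/data")
    print(resp.text[:200])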

Python: How to retrieve POST parameters from HTML? [duplicate]

I have been using mechanize to fill in a form on a website, but the site has now changed and some of the required fields seem to be hidden; they can no longer be accessed with mechanize when printing all available forms.
I assume it has been modified to use more current methods (application/x-www-form-urlencoded) but I have not found a way to update my script to continue using this form programmatically.
From what I have read, I should be able to send a dict (key/value pair) to the submit button directly rather than filling the form in the first place - please correct me if I am wrong.
But I have not been able to find a way to obtain which keys are required.
I would massively appreciate it if someone could point me in the right direction or put me straight in case this is no longer possible.
You cannot, in all circumstances, extract all fields a server expects.
The post target, the code handling the POST, is a black box. You cannot look inside the code that the server runs. The best information you have about what it expects is what the original form tells your browser to post. That original form consists not only of the HTML, but also of the headers that were sent with it (cookies for example) and any JavaScript code that is run by the browser.
In many cases, parsing the HTML sent for the form is enough; that's what Mechanize (or a more modern framework like RoboBrowser) does, plus a little cookie handling and making sure typical headers such as the referrer are included. But if any JavaScript code manipulates the HTML or intercepts the form submission to add or remove data, then Mechanize or other Python form parsers cannot replicate that step.
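
To see which keys the static HTML alone would submit, you can parse the form's fields yourself; a sketch using BeautifulSoup, where the URL and the assumption that the first form is the right one are placeholders:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/form-page").text  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")

    form = soup.find("form")
    # Collect every named field, hidden ones included, with any preset values.
    fields = {
        field.get("name"): field.get("value", "")
        for field in form.find_all(["input", "select", "textarea"])
        if field.get("name")
    }
    print(fields)  # the keys the static form would POST

Any keys added or changed by JavaScript will not show up here, which also tells you whether you are in the second, harder case.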
Your options then are to:
Reverse engineer what the JavaScript code does and replicate that in Python code. The development tools of your browser can help here; observe what is being posted on the network tab, for example, or use the debugger to step through the JavaScript code to see what it does.
Use an actual browser, controlled from Python. Selenium could do this for you; it can drive a desktop browser (Chrome, Firefox, etc.) or it can be used to drive a headless browser implementation such as PhantomJS. This is heavier on the resources, but will actually run the JavaScript code and let you post a form just as your browser would, in each and every way.
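
A sketch of the second option with Selenium. PhantomJS is no longer maintained, so this uses Firefox's headless mode instead; the URL and element names are assumptions:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run without opening a window
    driver = webdriver.Firefox(options=options)

    driver.get("https://example.com/form-page")  # placeholder URL
    # Fill in the visible fields; hidden fields and any JavaScript
    # manipulation happen exactly as they would in a real browser.
    driver.find_element(By.NAME, "username").send_keys("me")
    driver.find_element(By.NAME, "password").send_keys("secret")
    driver.find_element(By.CSS_SELECTOR, "form [type=submit]").click()

    print(driver.page_source[:200])
    driver.quit()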

Python - reading the returnURL

I'm using Python without a server to deploy to. I'm trying to test the accept-payment flow for PayPal.
In the code, after sending a POST request, I store the result in a local file and then open this file using webbrowser.
However, I suspect I am missing something here, since once I log in as a user I am not automatically redirected to authorize the transaction; I just log in.
Now I suspect this is because once I hit the Payment API endpoint, it redirects me to the login API with some parameters:
http://www.paypal.com?hypotheticalredirecturl=etc
I am capturing the response, i.e. the HTML of the PayPal login page, but the whole URL cannot be captured. Hence the
hypotheticalredirecturl=etc
part is lost, and I think this is what stops the flow from reaching the authorization page after I log in as a user.
I think if I appended the "hypotheticalredirect" part to my webpage before opening it using webbrowser, I might be able to make the flow work normally.
Does anyone know of a way to capture the URL of the response?
I tried looking in the page itself, but I don't think it's there.
Any help will be appreciated.
Thanks,
Ashwin
EDIT: I'm using urllib and urllib2. Should I be looking at httplib?
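
One thing worth checking with those libraries: the response object returned by urllib2 (and by Python 3's urllib.request, sketched below) exposes the final, post-redirect URL via geturl(), which sounds like exactly the part being lost here. The request below is a placeholder:

    import urllib.request  # the Python 3 successor to urllib2

    resp = urllib.request.urlopen("https://www.paypal.com/")  # placeholder request
    # geturl() returns the URL that was actually reached after any
    # redirects, query string included.
    print(resp.geturl())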

Login to site programmatically using Python

I am wondering how to log in to a specific site, but have had no luck so far.
The way it happens in a browser is that you click a button, which triggers a jQuery AJAX request to /ajax/authorize_ajax.html with the POST variables login and pass. When it returns result = true the document reloads and you are logged in.
When I go to /ajax/authorize_ajax.html in my browser it gives me {"data": [{"result":false}]} in response. Using C# I went to this address, posted login and pass, and it gave me {"data": [{"result":true}]} in response. However, when I then go back to the main folder of the website, I'm not logged in.
Can anyone help me solve this problem? I think the cookies are set via JavaScript; is it even possible in that case? I did some research and all I could do is this; please help me get around this problem. I used urllib in Python and web libraries in .NET.
EDIT 0
It is setting a cookie in the response headers: sid, path & domain.
Example: sid=bf32b9ff0dfd24059665bf1d767ad401; path=/; domain=site
I don't know how to save this cookie and go back to / using it. I've never done anything like this before; can someone give me an example using Python?
EDIT 1
All done, thanks to this post - How to use Python to login to a webpage and retrieve cookies for later usage?
Here's a blog post I did a while ago about using an HttpWebRequest to post to a site when cookies are involved:
http://crazorsharp.blogspot.com/2009/06/c-html-screen-scraping-part-2.html
The idea is that when you get a response using the HttpWebRequest, you have access to the cookies that were sent down. For every subsequent request, you can create a new CookieContainer on the request object and add the cookies you received into that container.
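
Since the question asks for a Python example, here is the same cookie-reuse pattern sketched with requests; the domain is a placeholder, and login/pass are the field names from the question. A requests.Session would do this bookkeeping automatically:

    import requests

    BASE = "https://site.example"  # placeholder domain

    # Post the credentials the same way the jQuery AJAX call does.
    resp = requests.post(BASE + "/ajax/authorize_ajax.html",
                         data={"login": "me", "pass": "secret"})

    # Grab the sid cookie the server set in the response headers...
    sid = resp.cookies.get("sid")

    # ...and present it when going back to /.
    home = requests.get(BASE + "/", cookies={"sid": sid})
    print(home.status_code)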
