Advanced screen-scraping using curl - python

I need to create a script that will log into an authenticated page and download a pdf.
However, the pdf that I need to download is not at a URL, but it is generated upon clicking on a specific input button on the page. When I check the HTML source, it only gives me the url of the button graphic and some obscure name of the button input, and action=".".
In addition, both the URL of the page containing the button and the input name are obscured, for example:
url = /WebObjects/MyStore.woa/wo/5.2.0.5.7.3
input name = 0.0.5.7.1.1.11.19.1.13.13.1.1
How would I log into the page, 'click' that button, and download the pdf file within a script?

Maybe the Mechanize module can help.
I think the URL requested when the button is clicked may be generated by JavaScript. To run JavaScript code from a Python script, take a look at Spidermonkey.

Try mechanize or twill. HttpFox or Firebug can help you build your queries. Remember that you can also pickle cookies from the browser and reuse them later with Python libraries. If the request is generated by JavaScript, it may be possible to reverse engineer it. If not, you can run a JavaScript interpreter, or use Selenium or Windmill to script a real browser.

You could observe what requests are made when you click the button (using Firebug in Firefox or Developer Tools in Chrome). You may then be able to request the PDF directly.
It's difficult to help without seeing the page in question.

As Acorn said, you should try monitoring the actual requests and see if you can spot a pattern.
If not, then your best bet is to automate a fully featured browser that can run JavaScript, so that you exactly mimic what a regular user would do. Have a look at this page on the Python Wiki for ideas; check the section Python Wrappers around Web "Libraries" and Browser Technology.
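If the monitored request turns out to be an ordinary form POST, the "click" can often be reproduced by submitting the form's hidden fields plus the button itself. The sketch below is a minimal illustration under that assumption, using only the standard library. Browsers submit an `<input type="image">` button as two fields, `<name>.x` and `<name>.y`, holding the click coordinates. The sample HTML and the `token` field are hypothetical; only the input name comes from the question. Checkboxes and radio buttons are not treated specially here.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect the form action and the name/value pairs of non-button inputs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("name"):
            input_type = (attrs.get("type") or "text").lower()
            # Buttons are only submitted when clicked, so they are added later.
            if input_type not in ("image", "submit", "button"):
                self.fields[attrs["name"]] = attrs.get("value") or ""


def build_click_payload(html, button_name, x=1, y=1):
    """Return (form action, fields to POST) for a click on an image button."""
    parser = FormFieldCollector()
    parser.feed(html)
    payload = dict(parser.fields)
    # An image button is sent as "<name>.x" and "<name>.y" click coordinates.
    payload[button_name + ".x"] = str(x)
    payload[button_name + ".y"] = str(y)
    return parser.action, payload


# Hypothetical page source; "token" is an invented hidden field.
SAMPLE_HTML = """
<form method="post" action=".">
  <input type="hidden" name="token" value="abc123">
  <input type="image" name="0.0.5.7.1.1.11.19.1.13.13.1.1" src="/img/pdf.gif">
</form>
"""

action, payload = build_click_payload(SAMPLE_HTML, "0.0.5.7.1.1.11.19.1.13.13.1.1")
body = urlencode(payload).encode("ascii")  # bytes ready to POST to the form URL
```

An action of "." resolves relative to the page URL, so the POST would go back to the same obscured `/WebObjects/...` path, using the same logged-in session that fetched the page.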

Related

Scraping a js based website for a javascript generated .csv download from python

I'm trying to scrape an AJAX/JS-based website for a .csv download.
If I visit the site in a browser, I need to click a button that calls a JS function "downloadChartCSV()". This function generates and downloads a .csv file. That's the file I'm after.
How can I automate this?
I have tried:
requests-html, which can login and render the page, but I miss an option to call that onClick event. There seems to be an open issue about exactly this: Github Issue
Selenium which probably works, but I don't want a full browser that is flashing up.
If you want the functionality of Selenium but don't want a full browser window flashing up, use the PhantomJS driver.
https://realpython.com/headless-selenium-testing-with-python-and-phantomjs/
This is a very good introduction to it.
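PhantomJS is no longer maintained, and recent Selenium releases have dropped its driver, so the sketch below swaps in headless Chrome instead. It is an illustration, not a tested recipe for this site: it assumes Selenium 4 and Chrome are installed, that any login has already been handled, and that `downloadChartCSV()` is callable from the page's global scope as the question describes. The helper names and the fixed five-second wait are placeholders.

```python
import time


def chrome_download_prefs(download_dir):
    """Chrome preferences that save downloads to download_dir without a dialog."""
    return {
        "download.default_directory": download_dir,  # should be an absolute path
        "download.prompt_for_download": False,
    }


def download_chart_csv(page_url, download_dir, wait_seconds=5):
    """Open the page in headless Chrome and trigger its CSV download function."""
    # Imported lazily so this module loads even where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # no visible browser window
    options.add_experimental_option("prefs", chrome_download_prefs(download_dir))
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(page_url)
        # Call the same function that the button's onClick handler calls.
        driver.execute_script("downloadChartCSV();")
        # Crude wait for the file to be written; polling the directory is more robust.
        time.sleep(wait_seconds)
    finally:
        driver.quit()
```

Calling `download_chart_csv("https://example.com/chart", "/tmp/downloads")` (both arguments hypothetical) should leave the generated file in the download directory if the assumptions above hold.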

miss part of html by getting html by requests in python [duplicate]

I need to scrape a site with Python. I obtain the source HTML with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does "in" the site is that when you press a button it outputs some HTML code. How can I "press" this button with Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it on the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we need a browser that has a JavaScript engine (i.e., one that supports JavaScript rendering).
Options like Mechanize and urllib2 will not work since they DO NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered website.
from bs4 import BeautifulSoup
from selenium import webdriver

# PhantomJS must be installed; webdriver.PhantomJS only exists in older Selenium releases
driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source returns the page HTML after the browser has loaded and rendered it
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
I use webkit, which is the browser renderer behind Chrome and Safari. There are Python bindings to webkit through Qt. And here is a full example to execute JavaScript and extract the final HTML.
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader handler / middleware that can scrape JavaScript-generated content.
It's based on the WebKit engine, via pygtk, python-webkit, and python-jswebkit, and it's quite simple.

Using Mechanize for python, need to be able to right click

My script logs in to my account and navigates the links it needs to, but I need to download an image. This seems easy enough to do using urlretrieve. The problem is that the src attribute for the image contains a link to a page that initiates a download prompt, so my only foreseeable option is to right-click and select 'Save as'. I'm using mechanize, and from what I can tell it doesn't have this functionality. My question is: should I switch to something like Selenium?
Mechanize, last I checked, was pretty poorly maintained and documented. Selenium has a much more active community.
That being said: why do you need mechanize to do this? Why not just use urllib?
I would try to watch Chrome's network tab, and try to imitate the final request to get the image. If it turned out to be too difficult, then I would use selenium as you suggested.
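A download prompt in the browser usually just means the server answered with a `Content-Disposition: attachment` header, so fetching the same URL from a script returns the file bytes directly. The following is a minimal sketch of imitating that final request with the standard library. It assumes the URL can be fetched with a plain GET and that `opener` (if given) already carries the logged-in session's cookies; the function names and the fallback filename are illustrative.

```python
import os
import urllib.request
from email.message import Message


def filename_from_content_disposition(header_value, default="download.bin"):
    """Extract the suggested filename from a Content-Disposition header value."""
    if not header_value:
        return default
    msg = Message()
    msg["Content-Disposition"] = header_value
    # basename() guards against a server-supplied name containing a path.
    return os.path.basename(msg.get_filename() or "") or default


def save_download(url, opener=None):
    """Fetch url and write the body to the filename the server suggests."""
    opener = opener or urllib.request.build_opener()
    with opener.open(url) as response:
        name = filename_from_content_disposition(
            response.headers.get("Content-Disposition", "")
        )
        with open(name, "wb") as out:
            out.write(response.read())
    return name
```

If the server also checks headers such as `Referer`, the request seen in the network tab shows which ones need to be copied.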

Using Python's Requests library to navigate webpages / Click buttons

I'm new to web programming, and have recently begun looking into using Python to automate some manual processes. What I'm trying to do is log into a site, click some drop-down menus to select settings, and run a report.
I've found the acclaimed requests library: http://docs.python-requests.org/en/latest/user/advanced/#request-and-response-objects
and have been trying to figure out how to use it.
I've successfully logged in using bpbp's answer on this page: How to use Python to login to a webpage and retrieve cookies for later usage?
My understanding of "clicking" a button is to write a post() command that mimics a click: Python - clicking a javascript button
My question (since I'm new to web programming and this library) is how I would go about pulling the data I need to figure out how I would construct these commands. I've been looking into [RequestObject].headers, .text, etc. Any examples would be great.
As always, thanks for your help!
EDIT:::
To make this question more concrete, I'm having trouble interacting with different aspects of a web-page. The following image shows what I'm actually trying to do:
I'm on a web page that looks like this. There is a drop-down menu with clickable dates that can be changed. My goal is to automate changing the date to the most recent date, "click" 'Save and Run', and download the report when it's finished running.
The only solution to this I have found is Selenium. If it weren't a JavaScript-heavy website you could try mechanize, but for this you need to render the JavaScript and then inject JavaScript, as Selenium does.
Upside: You can record actions in Firefox (using selenium) then export those actions to python. The downside is that this code has to open a browser window to run.
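On the question of how to find the data needed to build the POST: if the drop-down is a plain HTML `<select>` rather than a JavaScript widget, its name and option values can be read from the page source (`response.text` in requests). The sketch below shows that idea under stated assumptions: the select name `report_date` and the sample HTML are hypothetical, and the option values are assumed to be ISO dates (YYYY-MM-DD), which sort correctly as strings.

```python
from html.parser import HTMLParser


class SelectOptionCollector(HTMLParser):
    """Collect the option values of the <select> with a given name."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self.inside = attrs.get("name") == self.select_name
        elif tag == "option" and self.inside and attrs.get("value"):
            self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.inside = False


def latest_date_payload(html, select_name):
    """Return the form field that selects the most recent date."""
    parser = SelectOptionCollector(select_name)
    parser.feed(html)
    if not parser.values:
        raise ValueError("no options found for select %r" % select_name)
    # ISO dates compare correctly as plain strings, so max() is the latest.
    return {select_name: max(parser.values)}


# Hypothetical page fragment standing in for the real report page.
SAMPLE_HTML = (
    '<select name="report_date">'
    '<option value="2014-01-05">Jan 5</option>'
    '<option value="2014-02-09">Feb 9</option>'
    "</select>"
)

payload = latest_date_payload(SAMPLE_HTML, "report_date")
```

The resulting dictionary would be passed as the `data` argument of a POST made with the logged-in session, together with whatever other fields the browser's network tab shows for 'Save and Run'.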

python open web page and get source code

We have developed a web-based application with user login etc., and we developed a Python application that has to get some data from a page of it.
Is there any way for Python to communicate with the system default browser?
Our main goal is to open a web page with the system browser and get its HTML source code. We tried the Python webbrowser module and opened the page successfully, but could not get the source code. We also tried urllib2, but in that case I think we would have to use the system default browser's cookies etc., and I don't want to do this for security reasons.
https://pypi.python.org/pypi/selenium
You can try Selenium. It was built for testing, but nothing prevents you from using it for other purposes.
If your web site is navigable without Javascript, then you could try Mechanize or zope.testbrowser. These tools offer a higher level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful in navigating a site that uses cookie based authentication with HTML forms for login, for example.
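For a site with form-based login and cookie authentication, the same flow can also be sketched with the standard library alone, without touching the browser's cookie store: the script logs in itself and keeps the session cookie in its own jar. The code uses Python 3's `urllib.request` (the successor to urllib2). The field names `username` and `password` and the function names are assumptions; the real names come from the login form's HTML.

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build the POST request that submits the login form."""
    # Field names are assumptions; check the <input name="..."> attributes.
    data = urlencode({"username": username, "password": password}).encode("ascii")
    return urllib.request.Request(login_url, data=data)  # data present => POST


def fetch_page_source(login_url, page_url, username, password):
    """Log in, then fetch page_url with the session cookie and return its HTML."""
    opener, _jar = make_session()
    opener.open(build_login_request(login_url, username, password)).close()
    with opener.open(page_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

This only returns the HTML as the server sends it; content built by JavaScript after page load still needs a real browser such as one driven by Selenium.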
Have a look at the nltk module---they have some utilities for looking at web pages and getting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm---they're pretty widely used modules, so that means you can find lots of hints here :)
