I need to scrape a site with Python. I obtain the HTML source with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (included in the HTML source). What this function does on the site is: when you press a button, it outputs some HTML code. How can I "press" this button from Python? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it on the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
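For the button-pressing case in the question, a minimal sketch with the Selenium RC (1.0) Python client might look like the following. It assumes a Selenium server is running on localhost:4444; the page URL and the 'id=generate' button locator are hypothetical placeholders:

from selenium import selenium

sel = selenium('localhost', 4444, '*firefox', 'http://example.com/')
sel.start()
sel.open('/page-with-button')
sel.click('id=generate')              # "press" the button that emits the HTML
sel.wait_for_page_to_load('30000')    # for AJAX updates, poll for the new content instead
html = sel.get_html_source()          # the DOM after the JavaScript has run
sel.stop()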
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we will need a browser that has a JavaScript engine (i.e., one that supports JavaScript rendering).
Options like Mechanize and urllib2 will not work, since they DO NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them, you can use the following code as an example to fetch the fully rendered website.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source returns the DOM after JavaScript rendering is complete
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()
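If the page keeps loading content after the initial render, the sketch above can read the DOM too early. A small, hedged guard; the element id 'joke-list' is a hypothetical placeholder for whatever your target page generates:

from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for a JavaScript-generated element to appear
# before reading page_source.
WebDriverWait(driver, 10).until(lambda d: d.find_element_by_id('joke-list'))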
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
I use WebKit, the browser renderer behind Safari (and, at the time, Chrome). There are Python bindings to WebKit through Qt, and a full example that executes the JavaScript and extracts the final HTML follows.
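A sketch of that approach, assuming PyQt4 with its QtWebKit module is installed; QWebPage loads the URL, runs the page's JavaScript, and hands back the final DOM as HTML (the target URL is a placeholder):

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()                           # block until loadFinished fires

    def _load_finished(self, ok):
        self.html = self.mainFrame().toHtml()      # DOM after JavaScript ran
        self.app.quit()

html = unicode(Render('http://example.com/').html)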
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader handler / middleware handler able to scrape JavaScript-generated content.
It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple to use.
Related
I am trying to scrape a website where the targeted items are populated using the document.write method. How can I get the fully rendered HTML of the website, as a browser would produce it, in Scrapy?
You can't do this, as Scrapy will not execute the JavaScript code.
What you can do:
Use a real browser driven by Selenium, which will execute the JavaScript. Afterwards, use XPath (or plain DOM access) as before to query the page once the scripts have run.
Understand where the contents come from, and load and parse the source directly instead. Chrome Dev Tools / Firebug might help you with that, have a look at the "Network" panel that shows fetched data.
Especially look for JSON, sometimes also XML.
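If the Network panel shows the data arriving as JSON, you can often skip the browser entirely and request the endpoint yourself. A minimal sketch; the endpoint URL, header, and field names are hypothetical stand-ins for whatever your panel shows:

import json
import urllib2

# Some servers answer their own AJAX endpoints only when this header is set.
req = urllib2.Request('http://example.com/api/items?page=1',
                      headers={'X-Requested-With': 'XMLHttpRequest'})
data = json.load(urllib2.urlopen(req))
for item in data['items']:
    print(item['title'])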
We have developed a web-based application, with user login etc., and we developed a Python application that has to get some data from this page.
Is there any way for Python to communicate with the system default browser?
Our main goal is to open a web page with the system browser and get its HTML source code. We tried Python's webbrowser module: it opened the web page successfully, but we could not get the source code. We also tried urllib2, but in that case I think we would have to reuse the system default browser's cookies etc., and I don't want to do that, for security reasons.
https://pypi.python.org/pypi/selenium
You can try Selenium; it was made for testing, but nothing prevents you from using it for other purposes.
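A hedged sketch of that: instead of the system default browser, drive a real Firefox with Selenium and read the rendered source (the URL is a placeholder):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://example.com/protected-page')
# ... drive the site's login form here if the page requires it ...
html = driver.page_source    # the rendered HTML, after any JavaScript has run
driver.quit()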
If your web site is navigable without Javascript, then you could try Mechanize or zope.testbrowser. These tools offer a higher level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful in navigating a site that uses cookie based authentication with HTML forms for login, for example.
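As an illustration, a minimal mechanize login sketch; the URL and the form/field names ('login', 'username', 'password') are hypothetical:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)            # ignore robots.txt if the site serves one
br.open('http://example.com/login')
br.select_form(name='login')           # select the login form by its name attribute
br['username'] = 'me'
br['password'] = 'secret'
br.submit()                            # session cookies persist in the Browser
html = br.open('http://example.com/members-only').read()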
Have a look at the nltk module; it has some utilities for looking at web pages and extracting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm; they're pretty widely used modules, so you can find lots of hints out there :)
I have a basic structure set up for a crawler. I have released it on some PHP-driven websites and it works like a charm, but now I want it to build datasheets from AJAX content.
At the moment I am using Mechanize (from Python and Perl) to build my crawler, but the Mechanize module does not execute JavaScript, so it cannot trigger the AJAX requests. How do I get to content that is built by asynchronous AJAX calls?
I know there is Selenium, which automates a real browser. But is that my only option?
You can run a headless browser, e.g. PhantomJS, which understands JavaScript, the DOM, etc., but you will have to write your automation code in JavaScript; the benefit is that you can do whatever you want.
There is another way, but it's messy.
You could observe what requests are made when you click the button (using Firebug in Firefox or Developer Tools in Chrome), then try to reverse-engineer the JavaScript running behind the page and replicate the same request from your Python code. If you need to actually execute JavaScript from Python, take a look at Spidermonkey.
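To replay the captured request yourself, a hedged urllib2 sketch; the URL, form fields, and header values are placeholders to copy from Firebug's Net panel. A 403, as in the first question above, often means the server checks headers such as User-Agent, Referer, or X-Requested-With:

import urllib
import urllib2

# Hypothetical endpoint and form fields; copy the real ones from the capture.
data = urllib.urlencode({'action': 'showMore', 'page': '2'})
req = urllib2.Request('http://example.com/ajax/endpoint', data, {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'http://example.com/page-with-button',
    'X-Requested-With': 'XMLHttpRequest',
})
fragment = urllib2.urlopen(req).read()  # the HTML the button would have inserted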
I need to create a script that will log into an authenticated page and download a pdf.
However, the PDF that I need to download is not at a fixed URL; it is generated upon clicking a specific input button on the page. When I check the HTML source, it only gives me the URL of the button graphic, an obscure name for the button input, and action=".".
In addition, both the URL where the button is and the form name are obscured, for example:
url = /WebObjects/MyStore.woa/wo/5.2.0.5.7.3
input name = 0.0.5.7.1.1.11.19.1.13.13.1.1
How would I log into the page, 'click' that button, and download the pdf file within a script?
Maybe the Mechanize module can help.
I think the URL produced by clicking the button may be generated by JavaScript. To run JavaScript code from a Python script, take a look at Spidermonkey.
Try mechanize or twill. HttpFox or Firebug can help you build your queries. Remember that you can also pickle cookies from the browser and reuse them later with Python libraries. If the URL is generated by JavaScript, it may be possible to 'reverse engineer' it; if not, you can run a JavaScript interpreter, or use Selenium or Windmill to script a real browser. A mechanize sketch for this particular form follows.
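For instance, a hedged mechanize sketch for the form in the question, selecting the form by position because its name is obscured (the host and the output filename are hypothetical, and any login steps are omitted):

import mechanize

br = mechanize.Browser()
# ... perform the site's login steps here first ...
br.open('http://example.com/WebObjects/MyStore.woa/wo/5.2.0.5.7.3')
br.select_form(nr=0)     # the form has no usable name, so pick it by index
response = br.submit(name='0.0.5.7.1.1.11.19.1.13.13.1.1')
with open('statement.pdf', 'wb') as f:
    f.write(response.read())   # the generated PDF is the response body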
You could observe what requests are made when you click the button (using Firebug in Firefox or Developer Tools in Chrome). You may then be able to request the PDF directly.
It's difficult to help without seeing the page in question.
As Acorn said, you should try monitoring the actual requests and see if you can spot a pattern.
If not, then your best bet is actually to automate a fully-featured browser, that will be able to run Javascript, so you'll exactly mimic what a regular user would do. Have a look at this page on the Python Wiki for ideas, check the section Python Wrappers around Web "Libraries" and Browser Technology.
I use Selenium RC to open a URL; how can I then save that web page, the way urllib.urlretrieve does? urllib can't execute the JavaScript in the page. One more question: will it save the whole page as I see it when Selenium RC opens it?
It sounds like you are confusing two very different libraries.
urllib:
This module provides a high-level interface for fetching data across the World Wide Web. In particular, the urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of filenames.
You can use python's urllib library to retrieve the raw markup from a valid URL. The library doesn't invoke any embedded javascript on the page, because the library never attempts to parse or render anything.
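For example, a minimal sketch; no JavaScript is ever executed:

import urllib

# Fetch the raw, un-rendered markup of a (placeholder) URL.
raw_html = urllib.urlopen('http://example.com/').read()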
Selenium RC:
Selenium Remote Control (RC) is a test tool that allows you to write automated web application UI tests in any programming language against any HTTP website using any mainstream JavaScript-enabled browser.
Selenium RC is used to automate testing. Execution of your tests occurs in a web browser via JavaScript, but this is a testing suite: you receive information about the status of your tests. Selenium RC does not provide any functionality to save an image of the rendered page.
Unless I've misinterpreted your question, you seem to be looking for a library that will allow you to retrieve an image of a rendered HTML page (including javascript DOM manipulation). If this is indeed the case, I would suggest looking into PyWebShot, which seems to provide exactly that functionality. You can view screenshots of it in action here (along with some additional info about it).
If it doesn't necessarily need to be a python library, there are a number of web services around that provide screenshots:
IE Web Renderer
Browsershots
BrowsrCamp
BrowserCam