I am using BeautifulSoup and urllib2 for downloading HTML pages and parsing them. The problem is with malformed HTML pages. Though BeautifulSoup is good at handling malformed HTML, it's still not as good as Firefox.
Considering that Firefox and WebKit are more up to date and resilient at handling HTML, I think it would be ideal to use one of them to construct and normalize the DOM tree of a page and then manipulate it through Python.
However, I can't find any Python binding for this. Can anyone suggest a way?
I ran into some solutions that run a headless Firefox process and manipulate it through Python, but is there a more Pythonic solution available?
Perhaps pywebkitgtk would do what you need.
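If it does, a minimal sketch might look like the following. The load-finished signal and the title-swap trick (stashing the serialized DOM in the document title, since execute_script returns nothing) are my assumptions from memory of the pywebkitgtk API, so verify them against its docs:

import gtk
import webkit

view = webkit.WebView()
window = gtk.Window()
window.add(view)

def on_load_finished(view, frame):
    # serialize the browser-normalized DOM and smuggle it out via the title
    view.execute_script("document.title = document.documentElement.innerHTML;")
    print frame.get_title()
    gtk.main_quit()

view.connect("load-finished", on_load_finished)
view.open("http://example.com/")  # placeholder URL
gtk.main()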
See http://wiki.python.org/moin/WebBrowserProgramming
There are quite a lot of options; I'm maintaining the page above so that I don't keep repeating myself.
You should look at pyjamas-desktop: see the examples/uitest example, because we use exactly this trick to get copies of the HTML page "out", so that the Python-to-JavaScript compiler can be tested by comparing the page results after each unit test.
Each of the runtimes supported and used by pyjamas-desktop is capable of allowing access to the "innerHTML" property of the document's body element (and a great deal more).
Bottom line: it is trivial to do what you want to do, but you have to know where to look to find out how to do it.
l.
You might like PyWebkitDFB from http://www.gnu.org/software/pythonwebkit/
Full disclosure - I'm not a programmer. I'm trying to get the 12-month rent price (currently 1,976) by scraping the following webpage: https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing. My problem is that when I enter the commands below into my shell terminal, no results are returned, even though I expected some sort of information. I thought this would be relatively straightforward from the tutorials I've watched, but this website looks to be structured differently (perhaps it's more complex). I used SelectorGadget to verify that the CSS selector is correct. What am I missing?
scrapy shell "https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing"
response.css('.pricing-list::text').extract()
It's not going to be that easy, since the linked page relies heavily on JavaScript. You have two options:
You can use a rendering engine like Splash to render the JavaScript after you load the page and see if you can extract the data.
Or you can find the endpoints the site uses to fetch the data and fetch them yourself manually (a hypothetical sketch follows below).
Either way, it's not going to be as trivial as you thought, and it might be a good idea to consult someone with experience.
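For the second option: once you find the JSON/XHR request in your browser's network tab, you can usually replay it directly. The URL below is made up purely for illustration; the real endpoint and response shape have to be discovered in the developer tools:

import requests

# Hypothetical endpoint; substitute the real one from the network tab.
url = "https://www.essexapartmenthomes.com/api/pricing/bonita-cedars"
data = requests.get(url).json()
print(data)  # inspect the structure, then drill down to the 12-month price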
What I'm looking for should give me something like this ->
There are many APIs available that can accomplish your task (more precisely, the task you describe in your question, not the one in the image :) ). I personally use Diffbot, which I discovered after reading this. Be warned, though: this kind of "content" extraction does not always succeed, because of the nature of web pages; it relies on heuristics and training, and thus may not suffice for your specific purposes.
If you want an entire screenshot of the page, then something like https://stackoverflow.com/questions/1041371/alexa-api may help you.
Otherwise, if you just want to get a few key images from the page...
You could use mechanize to assist you. When you connect to a webpage, you can search through all the links on the page using:
for link in br.links():
where br is your browser object.
You can see an example here:
Download all the links (related documents) on a webpage using Python
If you print dir(link), it will show you various properties, such as link.text and link.url. Furthermore, you can import urlparse.urlsplit and use it on the URL. You can direct the browser towards the URL and scrape the images as shown in the example above; a minimal sketch follows.
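Putting that together (mechanize and urlparse are Python 2 era; the URL is a placeholder):

import mechanize
from urlparse import urlsplit

br = mechanize.Browser()
br.open("http://example.com/")  # placeholder URL
for link in br.links():
    # each Link object exposes properties such as .text and .url
    print link.text, link.url
    print urlsplit(link.url)  # (scheme, netloc, path, query, fragment)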
You should really use a search engine's interpretation of the page and the images in it.
You could use the Python wrapper for the Bing API, or the xGoogle library.
Beware: the xGoogle library presents itself to Google as if it were a browser, and may not be an endorsed way to consume Google's data.
This one should help: http://palewi.re/posts/2008/04/20/python-recipe-grab-a-page-scrape-a-table-download-a-file/
It shows you how to scrape content and images and store them.
I know there are lxml and BeautifulSoup, but they won't work for my project, because I don't know in advance what the HTML format of the site I'm trying to scrape an article from will be. Is there a Python module, similar to Readability, that does a pretty good job of finding the content of an article and returning it?
It's possible to do this using PhantomJS (C++) or PyPhantomJS (Python).
They're both headless WebKit-based browsers which you can fully control from JavaScript. Because you can control them from JavaScript, I find it really easy to do things such as scraping the content of an article.
PyPhantomJS also has a plugin system, so that's definitely a plus. :)
Extracting the real content from a content page cannot be done automatically - at least not with the standard tools. You have to define/identify where the real content is stored (by specifying the related CSS ID or class in your own HTML extraction code).
Using HTQL, the query is:
&html_main_text
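A short sketch of running that query from Python; the htql.query call is my recollection of the HTQL binding's interface, so check it against the HTQL documentation:

import htql

page = open("article.html").read()  # HTML you downloaded earlier
for row in htql.query(page, "&html_main_text"):
    print(row)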
I'm trying to get the current contract prices on this page into a string: http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html
I would really like a Python 2.6 solution.
It was easy to get the page's HTML using urllib, but it seems like this number is live and not in the HTML. I inspected the element in Chrome, and it's in some td element with a class.
But I don't know how to get at this with Python. I tried BeautifulSoup (but after several attempts gave up on getting a tar.gz to work on my Windows x64 system), and then ElementTree, but really my programming interest is data analysis. I'm not a website designer and don't really want to become one, so it's all kind of a foreign language. Is this live price XML?
Any assistance gratefully received. Ideally a simple-to-install module and some actual code, but all hints and tips are very welcome.
It looks like the numbers in the table are filled in by JavaScript, so just fetching the HTML with urllib or another library won't be enough, since that doesn't run the JavaScript. You'll need to use a library like PyQt to simulate the browser rendering the page and executing the JS to fill in the numbers, then scrape the resulting HTML.
See this blog post on working with PyQt: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/
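The linked post boils down to something like this sketch (PyQt4 with QtWebKit, which fits the Python 2.6 requirement; exact class names depend on your PyQt version):

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    # Load a URL, let WebKit run the page's JavaScript, then keep the HTML.
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _load_finished(self, ok):
        self.html = self.mainFrame().toHtml()  # HTML after JS has run
        self.app.quit()

r = Render('http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html')
print r.html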
If you look at that website with something like Firebug, you can see the AJAX calls it's making. For instance, the initial values are filled in (at least for me) by an AJAX call to:
http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/XCME/FOI/FUT/Product/ES?currentTime=1292780678142&contractCDs=,ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2
This returns a JSON response, which is then parsed by JavaScript to fill in the table. It would be pretty simple to do that yourself with urllib and then use simplejson to parse the response.
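Roughly like this; note I've dropped the query parameters from the captured URL, and the JSON structure is undocumented, so start by printing the whole response:

import urllib2
import simplejson  # on Python 2.6+ the stdlib json module also works

url = ('http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/'
       'XCME/FOI/FUT/Product/ES')
data = simplejson.load(urllib2.urlopen(url))
print data  # inspect the structure to find the contract price fields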
Also, you should read this disclaimer very carefully. What you are trying to do is probably not cool with the owners of the website.
It's hard to know what to tell you without knowing where the number is coming from. It could also be generated by PHP or ASP, so you are going to have to figure out which language produces it.
I would like to get the dimensions (coordinates) of all the HTML elements of a webpage as they are rendered by a browser, that is, the positions they are rendered at, for example (top-left, top-right, bottom-left, bottom-right).
I could not find this in lxml. So, is there any library in Python that does this? I had also looked at Mechanize::Mozilla in Perl, but that seems difficult to configure/set up.
I think the best way to do this for my requirement is to use a rendering engine - like WebKit or Gecko.
Are there any Perl/Python bindings available for the above two rendering engines? Google searches for tutorials on how to "plug in" to the WebKit rendering engine are not very helpful.
lxml isn't going to help you at all; it isn't concerned with front-end rendering.
To accurately work out how something renders, you need to render it. For that you need to hook into a browser, spawn the page and run some JS on the page to find the DOM element and get its attributes.
It's totally possible but I think you should start by looking at how website screenshot factories work (as they'll share 90% of the code you need to get a browser launching and showing the right page).
You may still want to use lxml to inject your JavaScript into the page.
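For example, a rough sketch of that injection step (assuming the fetched page was saved to page.html; the measuring script itself is left as a stub):

import lxml.html
from lxml import etree

doc = lxml.html.parse("page.html").getroot()
script = etree.SubElement(doc.body, "script")
script.text = "/* your JS that walks the DOM and reports element offsets */"
print(lxml.html.tostring(doc))  # feed this to the browser you spawn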
I agree with Oli: rendering the page in question and inspecting the DOM via JavaScript is the most practical way, IMHO.
You might find jQuery very useful here:
$(document).ready(function() {
    var elem = $("div#some_container_id h1");
    var elem_offset = elem.offset();
    /* elem_offset is an object literal:
       elem_offset = { top: 140, left: 25 } */
    var elem_height = elem.height();
    var elem_width = elem.width();
    /* the bottom-right corner is then:
       { left: elem_offset.left + elem_width,
         top: elem_offset.top + elem_height } */
});
Related documentation is here.
Yes, JavaScript is the way to go:
var allElements = document.getElementsByTagName("*");
will select all the elements in the page. Then you can loop through these and extract the information you need from each element. Good documentation about getting the dimensions and positions of an element is here.
getElementsByTagName returns a NodeList, not an array (so if your JS changes your HTML, those changes will be reflected in the NodeList), so I'd be tempted to build the data into an AJAX POST and send it to a server when it's done.
I was not able to find any easy solution (i.e. Java/Perl/Python :) to hook into WebKit/Gecko to solve the above rendering problem. The best I could find was the Lobo rendering engine, written in Java, which has a very clear API that does exactly what I want - access to both the DOM and the rendering attributes of HTML elements.
JRex is a Java wrapper for the Gecko rendering engine.
You have three main options:
1) http://www.gnu.org/software/pythonwebkit is WebKit-based;
2) python-comtypes, for accessing MSHTML (Windows only);
3) hulahop (python-xpcom), which is XULRunner-based.
You should get the pyjamas-desktop source code and look in the pyjd/ directory for the "startup" code, which will allow you to create a web browser application and begin manipulating the DOM once the "page loaded" callback has been called by the engine.
You can perform node-walking and access the properties of the DOM elements that you require. Look at the pyjamas/library/pyjamas/DOM.py module to see many of the things that you will need to be using in order to do what you want.
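As a hedged sketch of that node-walking idea (the DOM helper names are my recollection of pyjamas/library/pyjamas/DOM.py, so verify them against the module itself):

from pyjamas import DOM

def dump_positions(elem):
    # recursively walk the DOM, printing each element's rendered rectangle
    for i in range(DOM.getChildCount(elem)):
        child = DOM.getChild(elem, i)
        print(DOM.getAbsoluteLeft(child), DOM.getAbsoluteTop(child),
              DOM.getIntAttribute(child, "offsetWidth"),
              DOM.getIntAttribute(child, "offsetHeight"))
        dump_positions(child)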
But if the three options above are not enough, then you should read the page http://wiki.python.org/moin/WebBrowserProgramming for further options, many of which have been mentioned here by other people.
l.
You might consider looking at WWW::Selenium. With it (and Selenium RC) you can drive IE, Firefox, or Safari like a puppet from inside Perl.
The problem is that current browsers don't render things quite the same. If you're looking for the standards compliant way of doing things, you could probably write something in Python to render the page, but that's going to be a hell of a lot of work.
You could use the wxHTML control from wxWidgets to render each part of a page individually to get an idea of its size.
If you have a Mac you could try WebKit. That same article has some suggestions for solutions on other platforms too.