I would like to get the dimensions (coordinates) of all the HTML elements of a webpage as they are rendered by a browser, that is, the positions at which they are rendered, for example (top-left, top-right, bottom-left, bottom-right).
I could not find this in lxml. So, is there any library in Python that does this? I have also looked at Mechanize::Mozilla in Perl, but that seems difficult to configure/set up.
I think the best way to do this for my requirement is to use a rendering engine - like WebKit or Gecko.
Are there any Perl/Python bindings available for the above two rendering engines? Google searches for tutorials on how to "plug in" to the WebKit rendering engine have not been very helpful.
lxml isn't going to help you here; it isn't concerned with front-end rendering at all.
To accurately work out how something renders, you need to render it. For that you need to hook into a browser, load the page, and run some JS on it to find the DOM element and read its attributes.
It's totally possible but I think you should start by looking at how website screenshot factories work (as they'll share 90% of the code you need to get a browser launching and showing the right page).
You may still want to use lxml to inject your JavaScript into the page.
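For illustration, here is a minimal sketch of that approach using Selenium WebDriver, which later answers also mention; the URL, selector, and choice of Firefox are placeholder assumptions, not part of the original suggestion:

# Sketch: read an element's rendered position and size via Selenium WebDriver.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()                          # any WebDriver-backed browser works
driver.get("http://example.com/")                     # placeholder URL

elem = driver.find_element(By.CSS_SELECTOR, "h1")     # placeholder selector
top_left = (elem.location["x"], elem.location["y"])   # location is {'x': ..., 'y': ...}
size = elem.size                                      # size is {'width': ..., 'height': ...}
bottom_right = (top_left[0] + size["width"], top_left[1] + size["height"])
print(top_left, bottom_right)

driver.quit()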
I agree with Oli; rendering the page in question and inspecting the DOM via JavaScript is the most practical way IMHO.
You might find jQuery very useful here:
$(document).ready(function() {
    var elem = $("div#some_container_id h1");
    var elem_offset = elem.offset();
    /* elem_offset is an object literal:
       elem_offset = { top: 140, left: 25 }
    */
    var elem_height = elem.height();
    var elem_width = elem.width();
    /* the bottom-right corner is then
       { left: elem_offset.left + elem_width,
         top: elem_offset.top + elem_height }
    */
});
Related documentation is here.
Yes, JavaScript is the way to go:
var allElements = document.getElementsByTagName("*"); will select all the elements in the page.
Then you can loop through this and extract the information you need from each element. Good documentation about getting the dimensions and positions of an element is here.
getElementsByTagName returns a nodelist, not an array (so if your JS changes your HTML, those changes will be reflected in the nodelist), so I'd be tempted to build the data into an AJAX POST and send it to a server when it's done.
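If you are driving the page from Python anyway, a hedged alternative to the AJAX round-trip is to collect the rectangles inside the browser and return them directly, for example with Selenium's execute_script; the URL and choice of Firefox are placeholders:

# Sketch: collect the bounding box of every element and return the list to Python.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/")          # placeholder URL

boxes = driver.execute_script("""
    var els = document.getElementsByTagName('*');
    var out = [];
    for (var i = 0; i < els.length; i++) {
        var r = els[i].getBoundingClientRect();
        out.push({tag: els[i].tagName,
                  left: r.left, top: r.top,
                  width: r.width, height: r.height});
    }
    return out;
""")
for box in boxes:
    print(box)                             # each box is a plain Python dict

driver.quit()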
I was not able to find any easy solution (i.e. Java/Perl/Python :)) to hook into WebKit/Gecko to solve the above rendering problem. The best I could find was the Lobo rendering engine, written in Java, which has a very clear API that does exactly what I want: access to both the DOM and the rendering attributes of HTML elements.
JRex is a Java wrapper for the Gecko rendering engine.
you have three main options:
1) http://www.gnu.org/software/pythonwebkit, which is webkit-based;
2) python-comtypes, for accessing MSHTML (windows only);
3) hulahop (python-xpcom), which is xulrunner-based.
you should get the pyjamas-desktop source code and look in the pyjd/ directory for "startup" code which will allow you to create a web browser application and begin, once the "page loaded" callback has been called by the engine, to manipulate the DOM.
you can perform node-walking, and can access the properties of the DOM elements that you require. you can look at the pyjamas/library/pyjamas/DOM.py module to see many of the things that you will need to be using in order to do what you want.
but if the three options above are not enough then you should read the page http://wiki.python.org/moin/WebBrowserProgramming for further options, many of which have been mentioned here by other people.
l.
You might consider looking at WWW::Selenium. With it (and Selenium RC) you can puppet-string IE, Firefox, or Safari from inside Perl.
The problem is that current browsers don't render things quite the same. If you're looking for the standards compliant way of doing things, you could probably write something in Python to render the page, but that's going to be a hell of a lot of work.
You could use the wxHTML control from wxWidgets to render each part of a page individually to get an idea of its size.
If you have a Mac you could try WebKit. That same article has some suggestions for solutions on other platforms too.
I am using python3 in combination with beautifulsoup.
I want to check if a website is responsive or not. First I thought of checking the meta tags of a website and seeing if there is something like this in them:
content="width=device-width, initial-scale=1.0"
Accuracy is not that good using this method, but I have not found anything better.
Does anybody have an idea?
Basically I want to do the same as Google does here: https://search.google.com/test/mobile-friendly, reduced to a yes/no output of whether the website is responsive or not.
(Just a suggestion)
I am not an expert on this but my first thought is that you need to render the website and see if it "responds" to different screen sizes. I would normally use something like phantomjs to do this.
Apparently, you can do this in Python with Selenium (more info at https://stackoverflow.com/a/15699761/3727050). A more comprehensive list of technologies that can be used for this task can be found here. Note that these resources seem a bit old/outdated, and some solutions fall back to a Python subprocess calling PhantomJS.
The linked Google test seems to load the page in a small browser and check:
that the font size is readable
that the distance between clickable elements is large enough for the page to be usable
I would, however, do the following:
Load the page in desktop mode and record each div's style.
Gradually reduce the size of the screen and see what percentage of them change style.
In most cases, going from a large screen to phone size, you should see 1-3 distinct layouts, which should be identifiable from the percentage of elements changing style.
The above does not guarantee that the page is "mobile-friendly" (i.e. usable on a mobile device), but it shows whether the CSS is responsive.
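A rough sketch of that idea with Selenium follows; the viewport widths, the choice of Firefox, and the decision to compare only rendered sizes (rather than full computed styles) are my own assumptions, so treat it as an illustration rather than a definitive responsiveness test:

# Sketch: snapshot each div's rendered size at a few viewport widths and
# count how many change between the widest and the narrowest viewport.
from selenium import webdriver
from selenium.webdriver.common.by import By

WIDTHS = [1280, 768, 375]                  # assumed desktop/tablet/phone breakpoints

driver = webdriver.Firefox()
driver.get("http://example.com/")          # placeholder URL

snapshots = []
for width in WIDTHS:
    driver.set_window_size(width, 800)
    divs = driver.find_elements(By.TAG_NAME, "div")
    snapshots.append([(d.size["width"], d.size["height"]) for d in divs])

# Naive comparison by index; a real test would match elements more carefully.
changed = sum(1 for a, b in zip(snapshots[0], snapshots[-1]) if a != b)
total = max(len(snapshots[0]), 1)
print("%.0f%% of divs changed between desktop and phone widths" % (100.0 * changed / total))

driver.quit()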
I'd like to use Python to scrape the contents of the "Were you looking for these authors:" box on web pages like this one: http://academic.research.microsoft.com/Search?query=lander
Unfortunately the contents of the box get loaded dynamically by JavaScript. Usually in this situation I can read the JavaScript to figure out what's going on, or I can use a browser extension like Firebug to figure out where the dynamic content is coming from. No such luck this time: the JavaScript is pretty convoluted and Firebug doesn't give many clues about how to get at the content.
Are there any tricks that will make this task easy?
Instead of trying to reverse engineer it, you can use ghost.py to directly interact with JavaScript on the page.
If you run the following query in a chrome console, you'll see it returns everything you want.
document.getElementsByClassName('inline-text-org');
Returns
[<div class="inline-text-org" title="University of Manchester">University of Manchester</div>,
<div class="inline-text-org" title="University of California Irvine">University of California ...</div>
etc...
You can run JavaScript through python in a real life DOM using ghost.py.
This is really cool:
from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://academic.research.microsoft.com/Search?query=lander')
result, resources = ghost.evaluate(
"document.getElementsByClassName('inline-text-org');")
A very similar question was asked earlier here.
The answer quoted there is Selenium, originally a testing environment for web apps.
I usually use Chrome's Developer Mode, which IMHO already gives even more details than Firefox.
For scraping dynamic content, you need not a simple scraper but a full-fledged headless browser.
dhamaniasad/HeadlessBrowsers: A list of (almost) all headless web browsers in existence is the fullest list of these that I've seen; it lists which languages each has bindings for.
(Note that more than a few of the listed projects are abandoned!)
I have an HTML file that has various HTML tags in it. This HTML also has a bunch of tables in it. I am processing this file using Python. How do I find out what size (length x width in pixels) the tables have when the file is rendered by a browser (preferably Chrome or Firefox)?
I am essentially looking for the information you see when you do "inspect element" in a browser and can view the size of the various elements. I want to access this size in my Python code.
I am using lxml to parse my html and can use selenium if needed.
edit: added the node.js tag in case I can use it to spit out the sizes of all the tables from a shell script and grab them in Python.
You're going to want to use Selenium WebDriver to open the HTML file in an actual browser installed on the computer that your Python code is running on.
I'm not sure how you'd use the Selenium WebDriver API to find out how tall a rendered table is, but the value_of_css_property method might do it.
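For what it's worth, WebDriver elements also expose .size and .location directly, which may be simpler than reading individual CSS properties; this is my own suggestion rather than part of the answer above, and the file name is a placeholder:

# Sketch: print the rendered size and position of every table in a local HTML file.
import os
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("file://" + os.path.abspath("page.html"))   # placeholder file name

for table in driver.find_elements(By.TAG_NAME, "table"):
    print(table.size["width"], "x", table.size["height"], "at", table.location)

driver.quit()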
If you can call out to a shell script, and you can use Node.js, I'm assuming you could also install and use PhantomJS, which is a headless WebKit port. (I.e. an actual honest-to-goodness WebKit renderer that just doesn't require a window to work.) This will let you use JavaScript and the familiar web libraries to manipulate the document. As an example, the following gets you the width of the logo element towards the upper left of the Stack Overflow site:
var page = require('webpage').create(); // create a new "browser"
page.open('http://stackoverflow.com/', function() {
    // callback when loading completes
    var logoWidth = page.evaluate(function() {
        // This runs in the rendered page and uses the version of jQuery that SO loads.
        return $('#hlogo').width();
    });
    console.log(logoWidth); // prints 250, the same as Chrome.
    phantom.exit(); // for some reason you need to exit manually
});
The documentation for PhantomJS will tell you more about what you can do with it and how.
One caveat, however, is that loading a page takes a while, since it needs to fetch CSS and scripts and generally do everything a browser does. I'm not sure if and how PhantomJS does any caching; if it does, it might make sense to reuse the same process for multiple scrapes of the same site.
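Since the question mentions calling out to a shell script, here is a hedged sketch of wiring PhantomJS into Python via subprocess; it assumes phantomjs is on the PATH and that the script above has been saved as measure.js (both assumptions, not part of the original answer):

# Sketch: run a PhantomJS script from Python and capture whatever it prints.
import subprocess

output = subprocess.check_output(["phantomjs", "measure.js"])
print(output.decode().strip())   # e.g. the width printed by console.log in the script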
What I'm looking for should give me something like this ->
There are many APIs available that can accomplish your task (more precisely, the task you describe in your question, not the image :)). I personally use diffbot, which I discovered after reading this. Beware though: this kind of "content" extraction does not always succeed, because of the nature of web pages; it relies on heuristics and training and thus may not suffice for your specific purposes...
If you want an entire screenshot of the page, then something like https://stackoverflow.com/questions/1041371/alexa-api may help you.
Otherwise, if you just want to get a few key images from the page..
you could use mechanize to assist you. When you connect to a webpage, you can search through all the links on the page using:
for link in br.links():
where br is your browser object.
You can see an example here:
Download all the links(related documents) on a webpage using Python
If you print dir(link), it will show you various properties such as link.text and link.url. Furthermore, you can import urlparse.urlsplit and use it on the URL. You can direct the browser towards the URL and scrape the images as shown in the above example.
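A minimal sketch of that flow; mechanize's Browser, links(), link.text and link.url are real, while the URL and the robots.txt setting are placeholder assumptions:

# Sketch: list the text and URL of every link on a page with mechanize.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)          # assumption: skip robots.txt handling for the sketch
br.open("http://example.com/")       # placeholder URL

for link in br.links():
    print("%s -> %s" % (link.text, link.url))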
You should really use a search engine's interpretation of the page and the images in it.
You could use the Python wrapper for the Bing API, or the xGoogle library.
Beware that the xGoogle library presents itself to Google as if it were a browser, which may not be an endorsed way to consume Google's data.
This one should help: http://palewi.re/posts/2008/04/20/python-recipe-grab-a-page-scrape-a-table-download-a-file/
It teaches you how to scrape content and images and store them.
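In the same spirit, here is my own rough sketch (not the linked recipe) of grabbing the images from a page with BeautifulSoup and saving them locally; the URL is a placeholder and file-extension handling is omitted:

# Sketch: download every <img> on a page (Python 3, requires beautifulsoup4).
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup

url = "http://example.com/"                        # placeholder URL
soup = BeautifulSoup(urlopen(url), "html.parser")

for i, img in enumerate(soup.find_all("img")):
    src = img.get("src")
    if src:
        urlretrieve(urljoin(url, src), "image_%d" % i)   # saves without an extension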
I am using BeautifulSoup and urllib2 for downloading HTML pages and parsing them. The problem is with malformed HTML pages. Though BeautifulSoup is good at handling malformed HTML, it's still not as good as Firefox.
Considering that Firefox or WebKit are more up to date and resilient at handling HTML, I think it's ideal to use them to construct and normalize the DOM tree of a page and then manipulate it through Python.
However, I can't find any Python binding for this. Can anyone suggest a way?
I ran into some solutions that run a headless Firefox process and manipulate it through Python, but is there a more Pythonic solution available?
Perhaps pywebkitgtk would do what you need.
see http://wiki.python.org/moin/WebBrowserProgramming
there are quite a lot of options - i'm maintaining the page above so that i don't keep repeating myself.
you should look at pyjamas-desktop: see the examples/uitest example because we use exactly this trick to get copies of the HTML page "out", so that the python-to-javascript compiler can be tested by comparing the page results after each unit test.
each of the runtimes supported and used by pyjamas-desktop is capable of allowing access to the "innerHTML" property of the document's body element (and a hell of a lot more).
bottom line: it is trivial to do what you want to do, but you have to know where to look to find out how to do it.
l.
You might like PyWebkitDFB from http://www.gnu.org/software/pythonwebkit/