Given an online webpage:
https://stackoverflow.com/users/1974961
and a target element in that webpage with id="REPUTATION" (artificially bordered in red in the question's screenshot):
How can I print this element to an image file reputation_1974961.ext?
Take a look at this library: https://www.npmjs.com/package/html2png
The html2png library lets you pass in an HTML string to its render method, and it will render the HTML into a PNG (returned as a buffer in its callback). You should then be able to save the buffer contents to a file using standard file I/O.
As for grabbing the HTML string of just that element: grab the full page with request or your request library of choice, then use something like Cheerio to target just the element you want and get its HTML. (Cheerio: https://www.npmjs.com/package/cheerio ).
There may be some gotchas; for example, you may also need to grab some styling from the returned HTML and copy it into the string you render. But this should point you in the right direction :)
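If you'd rather stay in Python, here is a rough sketch of the same fetch-isolate-render flow using requests, lxml, and imgkit (the library used in the next answer) rather than html2png itself. The element id comes from the question, and imgkit needs wkhtmltoimage installed:

import imgkit
import requests
from lxml import html

# Fetch the page and isolate the target element's HTML fragment.
page = html.fromstring(requests.get("https://stackoverflow.com/users/1974961").text)
element = page.get_element_by_id("REPUTATION")  # the id from the question
fragment = html.tostring(element).decode()

# Gotcha from above: the fragment loses the page's CSS, so inline any
# styling you need into the fragment string before rendering it.
imgkit.from_string(fragment, "reputation_1974961.png")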
Not exactly using a div id, but I was able to get this much using imgkit and playing around with the wkhtmltopdf options. You need to install imgkit and wkhtmltopdf as mentioned in the link.
The crop options given might be different for you, so play around with them. You can find all the wkhtmltopdf options here.
import imgkit

# Crop to the region of the page that contains the target element.
# Values are pixels; adjust them for your page.
options = {
    'crop-h': '300',   # height of the cropped region
    'crop-w': '400',   # width of the cropped region
    'crop-x': '100',   # left offset
    'crop-y': '430'    # top offset
}

imgkit.from_url('https://stackoverflow.com/users/1974961/hugolpz?tab=questions', 'out.jpg', options=options)
Output (out.jpg)
This is not perfect as you can see, but is certainly one of the options you can consider.
I'm using Selenium WebDriver with Python to create automation scripts for web application testing. I need to implement a verification that compares two base64-encoded PNG strings: a saved baseline image and the current image of the same web element on the page. There is a method in Selenium that allows getting the page screenshot as a base64 object:
driver.get_screenshot_as_base64()
But how do I get a base64 screenshot of just a particular image element on the page, rather than the whole page, without downloading it?
P.S. Other ways of comparing two images are acceptable too :)
There is an answer to another question that explains how to take a screenshot of an element here. Once you have that, you should be able to do a pixel by pixel comparison of the two images. You can google and find code examples for that.
I don't see a lot of info on base64 images. It seems like it would be a really cool, easy way to compare two images, since you'd just do a quick string compare, but Selenium doesn't seem to support taking a screenshot of an element in base64. You could probably do some work to take the screenshot and convert it and the reference image to base64, but that would likely be more work than just using a library to compare the two images, which has been done many times before and is all over the web.
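For what it's worth, the string-compare idea itself is trivial once you have both PNGs on disk; a minimal sketch (file names are placeholders):

import base64

# Base64-encode both PNG files and compare the resulting strings.
with open("reference.png", "rb") as f:
    reference_b64 = base64.b64encode(f.read())
with open("element.png", "rb") as f:
    current_b64 = base64.b64encode(f.read())

print("identical" if reference_b64 == current_b64 else "different")

Note that comparing the base64 strings is equivalent to comparing the raw bytes, so this only detects exact matches; any antialiasing or encoding difference will make visually identical images compare as different, which is why a pixel-level comparison via a library is usually more practical.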
The following should work according to the docs, but does not, and there is an open issue for it here: https://github.com/SeleniumHQ/selenium/issues/912. In the meantime, I would suggest https://stackoverflow.com/a/15870708/1415130
Find your web page element however you want - see the docs on Locating Elements
login_form = driver.find_element_by_id('loginForm')
Then screen grab the element
screenshot = login_form.screenshot_as_base64  # a property in the Python bindings, not a method
To compare screenshots, I'm using Pillow.
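A minimal sketch of such a Pillow comparison, assuming both screenshots are saved as same-sized PNG files (file names are placeholders):

from PIL import Image, ImageChops

# Diff the baseline element screenshot against the current one.
# The two images must have the same dimensions.
baseline = Image.open("baseline.png").convert("RGB")
current = Image.open("current.png").convert("RGB")

diff = ImageChops.difference(baseline, current)
if diff.getbbox() is None:   # getbbox() is None when the diff is all black
    print("element unchanged")
else:
    print("element differs within bounding box:", diff.getbbox())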
I am having some unknown trouble using XPath to retrieve text from an HTML page with the lxml library.
The page url is www.mangapanda.com/one-piece/1/1
I want to extract the selected chapter name text from the drop-down select tag. For now I just want the first option, so the XPath to find it is pretty easy:
.//*[@id='chapterMenu']/option[1]/text()
I verified the above using FirePath and it gives the correct data, but when I try to use lxml for this purpose I get no data at all.
from lxml import html
import requests
r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)
name = page.xpath(".//*[@id='chapterMenu']/option[1]/text()")
But nothing is stored in name. I even tried other XPaths, like:
//div/select[@id='chapterMenu']/option[1]/text()
//select[@id='chapterMenu']/option[1]/text()
These were also verified using FirePath. I am unable to figure out what the problem could be, and would appreciate some assistance.
It's not that nothing works, though. An XPath that does work with lxml here is:
.//img[@id='img']/@src
Thank you.
I've had a look at the html source of that page and the content of the element with the id chapterMenu is empty.
I think your problem is that it is filled in by JavaScript, and JavaScript will not be evaluated just by reading the HTML with lxml.html.
You might want to have a look at this:
Evaluate javascript on a local html file (without browser)
Maybe you're able to trick it, though... In the end, the JavaScript also has to fetch the information using a GET request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919
which is JSON and can easily be turned into a Python dict/list using the json library.
But you have to find out how to get the id and the which parameter if you want to automate this.
The id is part of the HTML: look for document['mangaid'] within one of the script tags. The which parameter has to be 0; I couldn't find that documented in any source, but when it is 0 you will be redirected to the proper URL.
So there you go ;)
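A minimal sketch of that request in Python, assuming the endpoint still behaves as described (id=103 comes from the URL above, which=0 per the redirect note):

import json
import requests

# Call the selector endpoint the page's JavaScript uses; requests follows
# the redirect that which=0 triggers.
resp = requests.get(
    "http://www.mangapanda.com/actions/selector/",
    params={"id": 103, "which": 0},
)
chapters = json.loads(resp.text)  # the JSON payload as a Python structure
print(chapters)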
The source document of the page you are requesting is in a default namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
even if Firepath does not tell you about this. The proper way to deal with namespaces is to redeclare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.
name = page.xpath('//*[@id="chapterMenu"]/xhtml:option[1]/text()',
                  namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
Then, the piece of the document the path expression above is concerned with is:
<select id="chapterMenu" name="chapterMenu"></select>
As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.
I'm working on extracting the main content from a web page in Python without removing anything, such as images, yet most libraries just give me back the text itself or cleaned DOM elements.
I need the DOM elements themselves that contain the main content of the article, including images.
Is there any library for that purpose?
Thanks
If you mean getting the whole DOM node with its img src attributes intact, then I believe beautifulsoup4 can do that (a minimal sketch follows below).
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
As for the actual images, I believe you'd have to make a separate request for each image file.
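A minimal sketch of the beautifulsoup4 approach; the URL is a placeholder and the selector assumes the main content lives in an <article> tag:

import requests
from bs4 import BeautifulSoup

# Pull a content node out of the page with its children, <img> tags included.
html = requests.get("https://example.com/some-article").text
soup = BeautifulSoup(html, "html.parser")

main = soup.find("article")          # assumption: content is in <article>
print(main)                          # the DOM subtree, images included
for img in main.find_all("img"):
    print(img.get("src"))            # the image files need separate requests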
Or you can use Selenium (https://pypi.python.org/pypi/selenium); it drives your actual browser (Firefox, Chrome), so it can handle anything involved in extracting web content.
I am building a screen clipping app.
So far:
I can get the html mark up of the part of the web page the user has selected including images and videos.
I then send them to a server that processes the HTML with BeautifulSoup to sanitize it and convert any relative paths to absolute paths.
Now I need to render that part of the page, but I have no way to render the styling. Is there any library to help me with this, or any other way in Python?
One way would be to fetch the whole webpage with urllib2, remove the parts of the body I don't need, and then render it.
But there must be a more pythonic way :)
Note: I don't want a screenshot. I am trying to render proper html with styling.
Thanks :)
Download the complete webpage, extract the style elements and the stylesheet link elements, and download the files referenced by the latter. That should give you the CSS used on the page.
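A minimal sketch of that using requests and BeautifulSoup (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Collect inline <style> blocks and download the stylesheets referenced
# by <link rel="stylesheet"> tags.
url = "https://example.com/page"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

css_chunks = [style.get_text() for style in soup.find_all("style")]
for link in soup.find_all("link", rel="stylesheet"):
    href = link.get("href")
    if href:
        # resolve relative hrefs against the page URL before fetching
        css_chunks.append(requests.get(urljoin(url, href)).text)

page_css = "\n".join(css_chunks)  # the CSS used on the page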
I would like to get the dimensions (coordinates) of all the HTML elements of a webpage as rendered by a browser, that is, the positions they are rendered at: for example, (top-left, top-right, bottom-left, bottom-right).
Could not find this in lxml. So, is there any library in Python that does this? I had also looked at Mechanize::Mozilla in Perl, but that seems difficult to configure and set up.
I think the best way to do this for my requirement is to use a rendering engine - like WebKit or Gecko.
Are there any Perl/Python bindings available for the above two rendering engines? Google searches for tutorials on how to "plug in" to the WebKit rendering engine are not very helpful.
lxml isn't going to help you here; it isn't concerned with front-end rendering at all.
To accurately work out how something renders, you need to render it. For that you need to hook into a browser, spawn the page and run some JS on the page to find the DOM element and get its attributes.
It's totally possible but I think you should start by looking at how website screenshot factories work (as they'll share 90% of the code you need to get a browser launching and showing the right page).
You may want to still use lxml to inject your javascript into the page.
I agree with Oli: rendering the page in question and inspecting the DOM via JavaScript is the most practical way, IMHO.
You might find jQuery very useful here:
$(document).ready(function() {
    var elem = $("div#some_container_id h1");
    // offset() returns the element's position within the document as an
    // object with top/left properties, e.g. { top: 140, left: 25 }
    var elem_offset = elem.offset();
    var elem_height = elem.height();
    var elem_width = elem.width();
    // the bottom-right corner is then:
    var bottom_right = {
        left: elem_offset.left + elem_width,
        top: elem_offset.top + elem_height
    };
});
Related documentation is here.
Yes, JavaScript is the way to go:
var allElements = document.getElementsByTagName("*"); will select all the elements in the page.
Then you can loop through this and extract the information you need from each element. Good documentation about getting the dimensions and positions of an element is here.
getElementsByTagName returns a nodelist not an array (so if your JS changes your HTML those changes will be reflected in the nodelist), so I'd be tempted to build the data into an AJAX post and send it to a server when it's done.
I was not able to find any easy solution (i.e. Java/Perl/Python :) to hook into WebKit/Gecko to solve the above rendering problem. The best I could find was the Lobo rendering engine, written in Java, which has a very clear API that does exactly what I want: access to both the DOM and the rendering attributes of HTML elements.
JRex is a Java wrapper to Gecko rendering engine.
You have three main options:
1) http://www.gnu.org/software/pythonwebkit, which is webkit-based;
2) python-comtypes, for accessing MSHTML (Windows only);
3) hulahop (python-xpcom), which is xulrunner-based.
You should get the pyjamas-desktop source code and look in the pyjd/ directory for the "startup" code, which will allow you to create a web browser application and begin, once the "page loaded" callback has been called by the engine, to manipulate the DOM.
You can perform node-walking and access the properties of the DOM elements that you require. Look at the pyjamas/library/pyjamas/DOM.py module to see many of the things you will need to use in order to do what you want.
But if the three options above are not enough, you should read http://wiki.python.org/moin/WebBrowserProgramming for further options, many of which have been mentioned here by other people.
You might consider looking at WWW::Selenium. With it (and Selenium RC) you can puppet-string IE, Firefox, or Safari from inside of Perl.
The problem is that current browsers don't render things quite the same. If you're looking for the standards-compliant way of doing things, you could probably write something in Python to render the page, but that's going to be a hell of a lot of work.
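If Python is an option, the modern selenium bindings (a later incarnation of the Selenium RC mentioned above) expose each element's rendered geometry directly; a minimal sketch, with a placeholder URL and element id:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com")  # placeholder URL

elem = driver.find_element(By.ID, "some_id")       # hypothetical element id
x, y = elem.location["x"], elem.location["y"]      # top-left, as rendered
w, h = elem.size["width"], elem.size["height"]
corners = {
    "top_left": (x, y),
    "top_right": (x + w, y),
    "bottom_left": (x, y + h),
    "bottom_right": (x + w, y + h),
}
print(corners)
driver.quit()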
You could use the wxHTML control from wxWidgets to render each part of a page individually to get an idea of its size.
If you have a Mac you could try WebKit. That same article has some suggestions for solutions on other platforms too.