Capturing browser-specific rendering of a webpage? - python

Is there any way to capture (as an image, PDF, etc.) how a webpage will look in, let's say, Chrome or IE? I am guessing there will be different ways to do this for different browsers, but is there any API, library, or add-on that does this?

Use Selenium WebDriver (it has a Python API) to remote-control a browser and take a screenshot. It supports all major browsers as far as I'm aware.
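A minimal sketch of that approach, assuming a suitable driver (chromedriver here) is installed or can be managed by Selenium itself:

    from selenium import webdriver

    # Start a real browser (Chrome here; Firefox, Edge, etc. work the same way),
    # load the page, and save a screenshot of the rendered result.
    driver = webdriver.Chrome()
    try:
        driver.get("http://example.com/")
        driver.save_screenshot("example.png")  # PNG of the page as Chrome rendered it
    finally:
        driver.quit()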

Yes, there are a few websites providing this service, as well as some API services (ranging from primitive to fairly advanced) for capturing browser screenshots.
Browsershots.org
It's quite slow most of the time, maybe due to the heavy traffic it has to withstand. However, it's one of the better screenshot providers.
Check http://browsershots.org/xmlrpc/ to see how to use the XML-RPC based API for Browsershots.
If you want a more primitive and straightforward thumbnailing service, the following sites may work well for you:
http://www.thumbalizr.com/
http://api1.thumbalizr.com/?url=http://acpmasquerade.com&width=some_width
I checked another website, webshotspro.com, and when I queued one for a snapshot, it said my queue was behind 7053 other requests. The loading icon keeps rotating :P
Give the XML-RPC call from Browsershots.org a try.
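For the Thumbalizr-style URL pattern shown above, a minimal sketch of fetching and saving a thumbnail (the target URL and width are placeholders taken from the example, and the service may rate-limit or require registration):

    from urllib.request import urlopen
    from urllib.parse import urlencode

    # Build a request following the URL pattern shown above.
    params = urlencode({"url": "http://acpmasquerade.com", "width": 640})
    image = urlopen("http://api1.thumbalizr.com/?" + params).read()

    # Save the returned image bytes to disk.
    with open("thumbnail.png", "wb") as f:
        f.write(image)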

Related

Python interact with a web browser (opened by a user)

I am searching for a way that allows me to interact with a web browser (Firefox, Chrome/Chromium, and Edge are the most important).
I am currently using pyautogui to locate the login and password fields and put the login data into them. But since it is much easier to extract information when you can use IDs, XPath, or other identifiers on web pages, it would make sense to use those.
I tried Firefox with Selenium but ran into some problems. Can I attach it to a user-created session (do I need the process ID or something like that)? Can I choose between the normal and the private session of the current profile?
I need a solution which works on Windows and Linux (support for the major Linux distros would be nice; the most important ones for me are Fedora/Ubuntu). macOS would be optional, but since I do not have a Mac I am not able to test it anyway.
The approach with debugger mode or similar does not work well for me, since the browser needs to be started in a special way.
Would it be possible to use something like this:
Can Selenium interact with an existing browser session?
Can I retrieve this information somehow from the existing browser?
driver.command_executor._url
driver.session_id
(But as I understand it, this currently only works with browsers started by Selenium?)
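For reference, a minimal sketch of reading those two values from a browser that Selenium itself started (they are not available for a window the user opened manually, and _url is a private attribute that may differ between Selenium versions):

    from selenium import webdriver

    # Works only for a browser that Selenium launched itself.
    driver = webdriver.Firefox()
    print(driver.command_executor._url)  # URL of the local WebDriver server, e.g. http://127.0.0.1:<port>
    print(driver.session_id)             # identifier of this WebDriver session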
When I use Selenium and start a browser window with it, can I log in to a website and have the user be logged in on that website in their own browser window too (if they use the same profile)? Or does Selenium keep its cookies separate?
If you need additional information or have some hints, please post them so I can see them.
Thank you in advance for your help.
To my understanding, it is not possible to connect to a web browser which was opened by the user. However, I found two possible solutions which I am currently evaluating:
Using pyautogui to locate elements in the web browser via reference screenshots and control it with keyboard and mouse (it is possible to reach the console with the right key combinations too); a sketch of this follows below.
The other solution is maybe more stable: writing a browser extension which controls the browser.
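A minimal sketch of the pyautogui approach, assuming you have saved a reference screenshot of the login field as login_field.png (a hypothetical file you would have to capture yourself):

    import pyautogui

    # Locate the login field on screen from the reference image.
    # The confidence keyword needs opencv-python; drop it if OpenCV is not installed.
    try:
        location = pyautogui.locateCenterOnScreen("login_field.png", confidence=0.9)
    except pyautogui.ImageNotFoundException:
        location = None

    if location is None:
        raise RuntimeError("login field not found on screen")

    pyautogui.click(location)           # focus the field
    pyautogui.typewrite("my_username")  # hypothetical credentials
    pyautogui.press("tab")              # jump to the password field
    pyautogui.typewrite("my_password")
    pyautogui.press("enter")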

What information do I need when scraping a website that requires logging in?

I want to access my business's database on some site and scrape it using Python (I'm using Requests and BS4; I can go further if needed), but I couldn't.
Can someone provide info and simple resources on how to scrape such sites?
I'm not just talking about providing a username and password; the site requires much more than this.
How do I know what information I am required to provide for my script aside from the username and password (e.g., how do I know that I must provide, say, an auth token)?
How do I deal with the site when there are no HTTP URLs, but hrefs in the form of javascript:__doPostBack?
And in this regard, how do I transition from the login page to the page I want (the one contained in the aforementioned javascript:__doPostBack)?
Are the libraries I'm using enough, or do you recommend using (and, in my case, learning) something else?
Your help is greatly appreciated.
Since it sounds like a lot of the interaction on this site is based on client-side code, I'd suggest using a real browser to do the scraping, and interacting with the site not via low-level HTTP requests but via client-side interaction (such as typing into elements or clicking buttons). This way, you don't need to worry about what form data to send or how to work out the URLs of links yourself.
One recommended way of doing this is to use BeautifulSoup with Selenium / WebDriver. There are multiple resources on how to do this, for example: How can I parse a website using Selenium and Beautifulsoup in python?
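A minimal sketch of that combination; the URLs, element IDs, link text, and credentials below are hypothetical placeholders, so inspect the real page to find the actual ones:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        # Log in through the real browser so the site's JavaScript runs as usual.
        driver.get("https://example.com/login")                      # placeholder URL
        driver.find_element(By.ID, "username").send_keys("my_user")  # placeholder ids/credentials
        driver.find_element(By.ID, "password").send_keys("my_pass")
        driver.find_element(By.ID, "login-button").click()

        # Navigate by clicking, so javascript:__doPostBack links behave as they do for a user.
        driver.find_element(By.LINK_TEXT, "Reports").click()         # placeholder link text

        # Hand the rendered HTML to BeautifulSoup for parsing.
        soup = BeautifulSoup(driver.page_source, "html.parser")
        for row in soup.select("table tr"):
            print(row.get_text(strip=True))
    finally:
        driver.quit()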

Read Internet Explorer console output with Selenium

I'm working on test automation in Internet Explorer 11 with Selenium, and I'm looking to read any console output for errors. However, all the research I pulled up led to a two-year-old response saying that the IE driver doesn't support reading logs of any kind (see here). Has there been any update on this issue? If not, is there any workaround for reading JS errors in IE with Selenium?
No, there is no change to the log API not being implemented in the IE driver. One reason for this is the arrival of the W3C WebDriver Specification, which does not define any logging end points. Moreover, even if the driver were to implement the logging API, getting the console log in IE would still be impossible, since Internet Explorer does not provide any programmatic access to its debugging tools.
One approach to capturing JavaScript errors in IE is to set window.onerror and read any errors that occur there. Of course, this will not retrieve any JavaScript errors that occur during onLoad, or before the error handler is attached to the onError event. To accomplish that, another approach I've seen used is to use a proxy to inject the event handler script into the page before it gets to the browser. This blog post shows an example of how to do that. Even though the example is written in C#, the same technique can be applied in any of the other language bindings.
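A minimal sketch of the window.onerror approach, run through Selenium's execute_script; the __jsErrors collector name is just a hypothetical choice, and errors thrown before the handler is installed will be missed:

    # Assumes `driver` is an existing Selenium WebDriver connected to IE.
    driver.execute_script("""
        window.__jsErrors = window.__jsErrors || [];
        window.onerror = function (message, source, line) {
            window.__jsErrors.push(message + ' (' + source + ':' + line + ')');
            return false;  // let the default error handling run as well
        };
    """)

    # ... interact with the page, triggering whatever behaviour you want to test ...

    errors = driver.execute_script("return window.__jsErrors;")
    print(errors)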

How can I programmatically interact with a website that uses an AJAX JBoss RichTree component?

I'm writing a Python script to do some screen scraping of a public website. This is going fine, until I want to interact with an AJAX-implemented tree control. Evidently, there is a large amount of JavaScript controlling the AJAX requests. It seems that the tree control is a JBoss RichFaces RichTree component.
How should I interact with this component programmatically?
Are there any tricks I should know about?
Should I try to implement a subset of the RichFaces AJAX?
Or would I be better served wrapping some code around an existing web browser? If so, is there a Python library that can help with this?
PhantomJS is probably the most interesting project for doing JavaScript work in a headless environment with a decent API. While it doesn't support Python natively anymore, there are options for interacting with it. Check out the discussion here for more info.
There is also vanilla WebKit (wrapped by Qt and then PyQt). Check out an example here.
Hope that helps :)
You need to make the AJAX calls from your client to the server and interpret the data. Interpreting the AJAX data is easier and less error-prone than scraping HTML anyway.
It can be a bit tricky to figure out the AJAX API if it isn't documented, though. A network sniffer such as Wireshark can be helpful there, and nowadays your browser's developer tools or plugins can do the same. I haven't needed to do that for years. :-)
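A minimal sketch of replaying such a call with the requests library, assuming you have already identified the endpoint and form fields by watching the network traffic; the URL and field names below are hypothetical, not the real RichFaces API:

    import requests

    session = requests.Session()
    response = session.post(
        "https://example.com/app/tree.faces",                        # placeholder endpoint
        data={"treeNodeId": "node_42", "AJAXREQUEST": "true"},       # hypothetical form fields
    )
    print(response.status_code)
    print(response.text[:500])  # inspect the returned partial-update payload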

How to save a webpage with Selenium RC

I use Selenium RC to open a URL; how can I then save that web page? How do I do what urllib.urlretrieve does, given that urllib can't run the JavaScript in the page? One more question: will it save the whole page exactly as I see it when Selenium RC opens it?
It sounds like you are confusing two very different libraries.
urllib:
This module provides a high-level interface for fetching data across the World Wide Web. In particular, the urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of filenames.
You can use Python's urllib library to retrieve the raw markup from a valid URL. The library doesn't invoke any embedded JavaScript on the page, because it never attempts to parse or render anything.
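A minimal sketch of that, analogous to the urllib.urlretrieve call mentioned in the question (Python 3 syntax):

    from urllib.request import urlopen

    # Raw markup only; no JavaScript is executed.
    html = urlopen("http://example.com/").read()
    with open("page.html", "wb") as f:
        f.write(html)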
Selenium RC:
Selenium Remote Control (RC) is a test tool that allows you to write automated web application UI tests in any programming language against any HTTP website using any mainstream JavaScript-enabled browser.
Selenium RC is used to automate testing. Execution of your tests occurs in a web browser via JavaScript, but this is a testing suite: you receive information about the status of your tests. Selenium RC does not provide any functionality to save an image of the rendered page.
Unless I've misinterpreted your question, you seem to be looking for a library that will allow you to retrieve an image of a rendered HTML page (including javascript DOM manipulation). If this is indeed the case, I would suggest looking into PyWebShot, which seems to provide exactly that functionality. You can view screenshots of it in action here (along with some additional info about it).
If it doesn't necessarily need to be a Python library, there are a number of web services around that provide screenshots:
IE Web Renderer
Browsershots
BrowsrCamp
BrowserCam
