PhantomJS loads much less HTML than other drivers

I'm trying to load one web page and get some elements from it. So the first thing I do is to check the page using "inspect element". When I search for the tags I'm looking for, I can see them (in Chrome).
But when I try to do driver.get(url) and then driver.find_element_by_..., it doesn't find those elements because they aren't in the source code.
I think that it is probably because it doesn't load the whole page but only a part.
Here is an example:
I'm trying to find ads on the web page.
from selenium import webdriver
# Outermost <div> elements whose id contains "taboola"
PREPARED_TABOOLA_BLOCK = """//div[contains(@id,'taboola') and not(ancestor::div[contains(@id,'taboola')])]"""
driver = webdriver.PhantomJS(service_args=["--load-images=false"])
# driver = webdriver.Chrome()
driver.maximize_window()
def find_taboola_blocks_selenium(url):
    driver.get(url)
    taboola_blocks = driver.find_elements_by_xpath(PREPARED_TABOOLA_BLOCK)
    return taboola_blocks
print(len(find_taboola_blocks_selenium('http://www.breastfeeding-problems.com/breastfeeding-a-sick-baby.html')))
driver.get('http://www.breastfeeding-problems.com/breastfeeding-a-sick-baby.html')
print(len(driver.page_source))
OUTPUTS:
Using PhantomJS:
0
85103
Using ChromeDriver:
3
420869
Do you know how to make PhantomJS load as much HTML as possible, or is there another way to solve this?

Can you compare the request that ChromeDriver is making versus the request you are making in PhantomJS? Since you are only doing GET for the specified url, you may not be including other request parameters that are needed to get the advertisements.
The open() method may give you a better representation of what you are looking for here: http://phantomjs.org/api/webpage/method/open.html

This happens because PhantomJS, by default, renders in a very small window, which makes the site serve its mobile version. With PhantomJSDriver, calling maximizeWindow() (or maximize_window() in Python) does nothing, since there is no rendered window to maximize. You have to set the window size explicitly with:
edit: Below is the Java solution. I'm not entirely sure what the Python solution would be when setting the window size, but it should be similar.
driver.manage().window().setSize(new Dimension(1920, 1200));
edit again: Found the python version:
driver.set_window_size(1920, 1200)
Hope that helps!
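A minimal Python sketch of the fix above: set the size before loading the page, so the site sees a desktop-sized viewport from the first request. The helper name open_at_size is hypothetical; set_window_size and get are the standard Selenium WebDriver methods, and driver is assumed to be an already created WebDriver instance.

```python
def open_at_size(driver, url, width=1920, height=1200):
    """Resize the window first, then load the page and return its source.

    `driver` is assumed to be a Selenium WebDriver instance (for example
    webdriver.PhantomJS()); the helper name is illustrative only.
    """
    # Resize before navigating so the first request already uses this size.
    driver.set_window_size(width, height)
    driver.get(url)
    return driver.page_source
```

Comparing len(open_at_size(driver, url)) against the ChromeDriver figure is a quick way to check whether the viewport size was the cause.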

PhantomJS 1.x is a really old browser. It only uses SSLv3 (now disabled on most sites) by default and doesn't implement most cutting edge functionality.
Advertisement scripts are usually delivered over HTTPS (SSLv3/TLS) and usually use some obscure feature of JavaScript which is not well tested or simply not implemented in PhantomJS.
If you use PhantomJS < v1.9.8, pass these command-line options (as service_args): --ignore-ssl-errors=true --ssl-protocol=any.
If iframes or strange cross-domain requests are necessary for the page/ads to work, then add --web-security=false to the service_args.
If this still doesn't solve the problem, then try upgrading to PhantomJS 2.0.0. You might need to compile it yourself on Linux.
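The flags mentioned in this answer can be collected into a service_args list before the driver is created. The sketch below is only one way to organise that; the helper name and its keyword arguments are hypothetical, while the four flag strings are the ones quoted in this thread.

```python
def phantomjs_service_args(ignore_ssl=True, any_ssl_protocol=True,
                           web_security=True, load_images=True):
    """Build a PhantomJS service_args list (hypothetical helper)."""
    args = []
    if ignore_ssl:
        args.append("--ignore-ssl-errors=true")
    if any_ssl_protocol:
        args.append("--ssl-protocol=any")
    if not web_security:
        # Only needed when iframes or cross-domain requests must work.
        args.append("--web-security=false")
    if not load_images:
        args.append("--load-images=false")
    return args

# Usage (needs a Selenium release that still ships the PhantomJS driver):
# from selenium import webdriver
# driver = webdriver.PhantomJS(
#     service_args=phantomjs_service_args(web_security=False))
```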

Related

First paint time in Python (may be using Selenium or without it)?

I want to measure how long it takes for the first thing (an object, image, text, link, DB result, or anything else) to load on a requested website, using Python and Selenium.

Check out performance.timing; it's a JavaScript API that is built into your browser. It exposes many timestamps, such as:
navigationStart
connectStart
connectEnd
domLoading
domInteractive
domComplete
Just go to your console window in your browser and type performance.timing. Might be of use to you.
If you find something you can use, you can use selenium to execute the JavaScript inside the browser using execute_script:
driver.execute_script('return performance.timing.domComplete')
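The raw values are absolute timestamps in milliseconds, so they are easier to read as offsets from navigationStart. The sketch below assumes the whole timing object has been fetched as a dict; the helper name timing_deltas is hypothetical. Note that these are navigation milestones rather than a true first-paint measurement.

```python
def timing_deltas(timing):
    """Return milliseconds elapsed since navigationStart for each milestone."""
    start = timing["navigationStart"]
    milestones = ("connectStart", "connectEnd", "domLoading",
                  "domInteractive", "domComplete")
    # Skip milestones the browser has not reported yet (missing or 0).
    return {name: timing[name] - start
            for name in milestones if timing.get(name)}

# Usage, assuming `driver` is a Selenium WebDriver with a page loaded:
# timing = driver.execute_script("return performance.timing.toJSON()")
# print(timing_deltas(timing))
```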

Selenium: How to disable image loading with firefox and python?

I have read similar questions and one was supposed to be the answer, but when I tried it, it only gave a partial solution.
I refer to this question: Disable images in Selenium Python
My problem is that I tried the solution and some of the images do not appear, but images that arrive from:
<img href="www.xxx.png">
Are being loaded.
Is there a way to tell firefox/selenium not to get it?
If not, is there a way to discard it from the dom element that I get back, via:
self._browser.get(url)
content=self._browser.page_source
for example by doing some kind of find replace on the dom tree?
The browser configuration is the same browser from the previous question:
firefox_profile = webdriver.FirefoxProfile()
# Disable CSS
firefox_profile.set_preference('permissions.default.stylesheet', 2)
# Disable images
firefox_profile.set_preference('permissions.default.image', 2)
# Disable Flash
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', False)
# Set the modified profile while creating the browser object
self._browser = webdriver.Firefox(firefox_profile=firefox_profile)
Update:
I kept digging and found that when I inspect the page source returned by the Selenium/Firefox combination, the images were not fetched and remain as links.
But when I did:
self._browser.save_screenshot("info.png")
I got a 24 MB file with all the img links loaded.
Can anyone explain this?
Thanks!

You can disable images using the following code:
firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference('permissions.default.image', 2)
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', False)
driver = webdriver.Firefox(firefox_profile=firefox_profile)
If you need to block a specific URL, one option is to add a line such as:
127.0.0.1 www.someSpecificUrl.com
to the hosts file before the test starts and remove it after the test finishes.

In the latest Firefox versions, permissions.default.image can't be changed. To disable images, either switch to ChromeDriver or use alternative extensions as suggested here.
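For the find-and-replace fallback the question asks about, a rough sketch is to strip img tags from the page_source string after it has been fetched. This is a regex shortcut under the assumption that no attribute value contains a ">" character; an HTML parser would be more robust. The helper name strip_img_tags is hypothetical.

```python
import re

# Matches <img ...> and <img ... /> in any letter case.
_IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

def strip_img_tags(html):
    """Return the HTML string with every <img> tag removed."""
    return _IMG_TAG.sub("", html)

# Usage, assuming `browser` is a Selenium WebDriver with a page loaded:
# content = strip_img_tags(browser.page_source)
```

This only cleans the returned text; it does not stop the browser from downloading the images.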

Follow a link with Ghost.py

I'm trying to use Ghost.py to do some web scraping. I'm trying to follow a link, but Ghost doesn't seem to actually evaluate the JavaScript and follow the link. My problem is that I'm in an HTTPS session and cannot use redirection. I've also looked at other options (like Selenium), but I cannot install a browser on the machine that will run the script. I also need to evaluate some JavaScript later on, so I cannot use mechanize.
Here's what I do...
## Open the website
page,resources = ghost.open('https://my.url.com/')
## Fill textboxes of the form (the form didn't have a name)
result, resources = ghost.set_field_value("input[name=UserName]", "myUser")
result, resources = ghost.set_field_value("input[name=Password]", "myPass")
## Submitting the form
result, resources = ghost.evaluate( "document.getElementsByClassName('loginform')[0].submit();", expect_loading=True)
## Print the link to make sure that's the one I want to follow
#result, resources = ghost.evaluate( "document.links[4].href")
## Click the link
result, resources = ghost.evaluate( "document.links[4].click()")
#print ghost.content
When I look at ghost.content, I'm still on the same page and result is empty. I noticed that when I add expect_loading=True when trying to evaluate the click, I get a timeout error.
When I run the JavaScript in the Chrome Developer Tools console, I get
event.returnValue is deprecated. Please use the standard
event.preventDefault() instead.
but the page does load up the linked url correctly.
Any ideas are welcome.
Charles

I think you are using the wrong methods for that.
If you want to submit the form there's a special method for that:
page, resources = ghost.fire_on(".loginform", "submit", expect_loading=True)
Also there's a special ghost.py method for performing a click:
ghost.click('#some-selector')
Another possibility, if you just want to open that link, could be:
link_url = ghost.evaluate("document.links[4].href")[0]
ghost.open(link_url)
You only have to find the right selectors for that.
I don't know on which page you want to perform the task, thus I can't fix your code. But I hope this will help you.

Python webdriver in IE returns None for all find_elements

I am developing tests using the latest version of Selenium 2 in python, installed with pip install -U selenium. I have a series of tests that run correctly using webdriver.Firefox(), but do not with webdriver.Ie(). It opens and navigates to the page correctly, but any attempt to access elements in that page fails. It does not appear to be a problem in other pages, but I cannot identify what would be causing the problem with mine.
I can generate the problem easily by building an instance of the webdriver with:
from selenium import webdriver
browser = webdriver.Ie()
browser.get("page url")
browser.find_elements_by_tag_name("html") #returns None!
I'm looking for any clues as to why this might be.

If you are using exactly this code, there are a few problems.
First, it's better to assign the return value of the find call to a variable and then check its value, i.e. html_tags = browser.find_elements_by_tag_name("html").
Second, I think you should use browser.find_element_by_tag_name("html"), because every page has only one "html" tag. Your code is appropriate only if you want to retrieve all elements with the tag name "html".
Finally, I think you want to access the page source; in that case, use one of these:
html_text = browser.find_element_by_tag_name("html").text
html_text = browser.find_elements_by_tag_name("html")[0].text
To cover all the cases, could you tell me which URL you are trying to access? I tried multiple web pages with your code (exactly yours, without my edits) and everything worked fine.

How to delete Firefox cookies from webdriver in python?

I can't delete Firefox cookies from WebDriver. When I use the .delete_all_cookies method, it returns None. And when I try get_cookies, I get the following error:
webdriver_common.exceptions.ErrorInResponseException: Error occurred when processing
packet:Content-Length: 120
{"elementId": "null", "context": "{9b44672f-d547-43a8-a01e-a504e617cfc1}", "parameters": [], "commandName": "getCookie"}
response:Length: 266
{"commandName":"getCookie","isError":true,"response":{"lineNumber":576,"message":"Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIDOMLocation.host]","name":"NS_ERROR_FAILURE"},"elementId":"null","context":"{9b44672f-d547-43a8-a01e-a504e617cfc1} "}
How can I fix it?
Update:
This happens with clean installation of webdriver with no modifications. The changes I've mentioned in another post were made later than this post being posted (I was trying to fix the issue myself).

Hmm, I actually haven't worked with Webdriver so this may be of no help at all... but in your other post you mention that you're experimenting with modifying the delete cookie webdriver js function. Did get_cookies fail before you were modifying the delete function? What happens when you get cookies before deleting them? I would guess that the modification you're making to the delete function in webdriver-read-only\firefox\src\extension\components\firefoxDriver.js could break the delete function. Are you doing it just for debugging or do you actually want the browser itself to show a pop up when the driver tells it to delete cookies? It wouldn't surprise me if this modification broke.
My real advice, though, would be to start using Selenium instead of Webdriver, since Webdriver is being discontinued in its current incarnation, or morphed into Selenium. Selenium is more actively developed and has pretty active and responsive forums. It will continue to be developed and stable while the merge is happening, while I take it Webdriver might not have as many bugfixes going forward. I've had success using the Selenium commands that control cookies. They seem to be revamping their documentation and for some reason there isn't any link to the Python API, but if you download Selenium RC, you can find the Python API doc in selenium-client-driver-python. You'll see there are a good 5 or so useful methods for controlling cookies, which you can use in your own custom Python methods if you want to, say, delete all the cookies with a name matching a certain regexp. If for some reason you do want the browser to alert() some info about the deleted cookies too, you could do that by getting the cookie names/values from the Python method and then passing them to Selenium's getEval() statement, which will execute arbitrary JS you feed it (like "alert()"). If you do go the Selenium route, feel free to contact me if you get a blocker; I might be able to assist.
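As an illustration of the "delete all cookies whose name matches a regexp" idea, here is a minimal sketch. It assumes the cookie methods of the current Selenium WebDriver Python bindings (get_cookies and delete_cookie) rather than the Selenium RC client this answer refers to, and the helper name delete_cookies_matching is hypothetical.

```python
import re

def delete_cookies_matching(driver, pattern):
    """Delete every cookie whose name matches `pattern`; return the names."""
    regex = re.compile(pattern)
    deleted = []
    # get_cookies() is assumed to return a list of dicts with a "name" key.
    for cookie in driver.get_cookies():
        name = cookie["name"]
        if regex.search(name):
            driver.delete_cookie(name)
            deleted.append(name)
    return deleted

# Usage, assuming `driver` is a Selenium WebDriver on a loaded page:
# delete_cookies_matching(driver, r"^session_")
```

The returned list of names can be logged or passed on, for example if the deleted cookies should be reported somewhere.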
