Selenium and Python - Page timeout without throwing an error

Selenium and Python - Page timeout without throwing an error - python

There is a lot that has been written about timeout and selenium and page loads.
But almost none of it works in chromedriver.
And all of what works is not exactly what I am looking for.
Note: I am not looking for set_page_load_timeout()
What do I want:
I say: driver.get("some-weird-slow-place")
chromedriver says: yes, yes... on my way
[15 seg later...] still on my way
[20 seg later...] okay sir... please do javascript window.stop();
But! Keep working as usual with whatever loaded elements you have.
Why I want this:
Because maybe I just want to get the url of the site and its title... and not the fancy huge background image or the crunchy brunchy punchy animated banners and multiple thousand jquery magics that are still loading.
What did I try:
driver.get(url)
driver.execute_script("setInterval(function(){ window.stop(); }, 20000);")
But it does not work, because driver.get() will wait until page is loaded before executing the script.

Why do you need selenium for this ? How about https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Related

Selenium doesn't realize page finished loading

I'm writing a trying to scrape some data from the following website:
http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/historico/renda-fixa/
It worked as expected for a while, but now it get stuck in loading the page at line 3.
url = 'http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/historico/renda-fixa/'
driver = webdriver.Chrome()
driver.get(url)
What is weird is that the page is in fact fully loaded, as I can browse through it without a problem, but chrome keeps showing me a "Connecting..." message in the bottom.
When selenium finally gives up and raises the TimeoutException, the "Connecting..." message dissapears and Chrome understands that the page is in fact fully loaded.
If I try to manually open the link in another tab, it does so in less than a second.
Is there a way I can overide the built in "wait until loaded" and just get to next steps, as everything i need is already loaded?

http://www.b3.com.br/lumis/portal/controller/html/SetLocale.jsp?lumUserLocale=pt_BR
This link loads inifinitely.
Report a bug an ask developers to fix.

Getting Wepage data same as browser INSPECT option

When I go to the following website: https://www.bvl.com.pe/mercado/movimientos-diarios and use Selenium's page_source option, or urllib.request.urlopen what I get is a different string than if I go to Google Chrome, and open the INSPECT option in the contextual menu and copy the entire thing.
From my research, I understand it has to do with Javascript running on the webpage and what I am getting is the base HTML.
What code can I use (Python) to get the same information?

That behavior entirely browser-dependent. The browser takes the raw HTML, processes it, runs a JS script (usually), styles it with CSS and does many other things. So to get such a result in Python you'd have to make your own web browser.

After much digging around, I came upon a solution that works in most cases. Use Headless Chrome with the --dump-dom switch.
https://developers.google.com/web/updates/2017/04/headless-chrome
Programmatically in Python use the subprocess module to run Chrome in a shell and either assign the output to a variable or direct the output to a text file.

selenium - webdriver won't go back to previous page

I am trying to click on a link, scrape data from that webpage, go back again, click on the next link and so on. But, I am not able to go back to the previous page for some reason. I observed that I can execute the code to go back if I am outside the loop, and I can't figure out what is wrong with the loop. I tried to use driver.back() too and yet it won't work. Any help is appreciated!! TYI
x = 0 #counter
contents=[]
for link in soup_level1.find_all('a', href=re.compile(r"^/new-homes/arizona/phoenix/"), tabindex=-1):
python_button =driver.find_element_by_xpath("//div[#class='clearfix len-results-items len-view-list']//a[contains(#href,'/new-homes/arizona/phoenix/')]")
driver.execute_script("arguments[0].click();",python_button)
driver.implicitly_wait(50)
soup_level2=BeautifulSoup(driver.page_source, 'lxml')
a=soup_level2.find('ul', class_ ='plan-info-lst')
for names in a.find('li'):
contents.append(names.span.next_sibling.strip())
driver.execute_script("window.history.go(-1)")
driver.implicitly_wait(50)
x += 1

Some more information about your usecase interms of:
Selenium client version
WebDriver variant and version
Browser type and version
would have helped us to debug the issue in a better way.
However to go back to the previous page you can use either of the following solutions:
Using back(): Goes one step backward in the browser history.
Usage:
driver.back()
Using execute_script(): Synchronously Executes JavaScript in the current window/frame.
Usage:
driver.execute_script("window.history.go(-1)")
Usecase Internet Explorer
As per #james.h.evans.jr's comment in the discussion driver.navigate().back() blocks when back button triggers a javascript alert on the page if you are using internet-explorer at times back() may not work and is pretty much expected as ie navigates back in the history by using the COM GoBack() method of the IWebBrowser interface. Given that, if there are any modal dialogs that appear during the execution of the method, the method will block.
You may even face similar issues while invoking forward() in the history, and submitting forms. The GoBack method can be executed on a separate thread which would involve calling a few not-very-intuitive COM object marshaling functions e.g. CoGetInterfaceAndReleaseStream() and CoMarshalInterThreadInterfaceInStream() but there seems we can't do much about that.

Instead of using
driver.execute_script("window.history.go(-1)")
You can try using
driver.back() see here
Please be aware that this functionality depends entirely on the underlying driver. It’s just possible that something unexpected may happen when you call these methods if you’re used to the behavior of one browser over another.

PhantomJS loads much less HTML than other drivers

I'm trying to load one web page and get some elements from it. So the first thing I do is to check the page using "inspect element". When I search for the tags I'm looking for, I can see them (in Chrome).
But when I try to do driver.get(url) and then driver.find_element_by_..., it doesn't find those elements because they aren't in the source code.
I think that it is probably because it doesn't load the whole page but only a part.
Here is an example:
I'm trying to find ads on the web page.
PREPARED_TABOOLA_BLOCK = """//div[contains(#id,'taboola') and not(ancestor::div[contains(#id,'taboola')])]"""
driver = webdriver.PhantomJS(service_args=["--load-images=false"])
# driver = webdriver.Chrome()
driver.maximize_window()
def find_taboola_blocks_selenium(url):
driver.get(url)
taboola_blocks = driver.find_elements_by_xpath(PREPARED_TABOOLA_BLOCK)
return taboola_blocks
print len(find_taboola_blocks_selenium('http://www.breastfeeding-problems.com/breastfeeding-a-sick-baby.html'))
driver.get('http://www.breastfeeding-problems.com/breastfeeding-a-sick-baby.html')
print len(driver.page_source)
OUTPUTS:
Using PhantomJS:
0
85103
Using ChromeDriver:
3
420869
Do you know how to make PhantomJS to load as much Html as possible or any other way to solve this?

Can you compare the request that ChromeDriver is making versus the request you are making in PhantomJS? Since you are only doing GET for the specified url, you may not be including other request parameters that are needed to get the advertisements.
The open() method may give you a better representation of what you are looking for here: http://phantomjs.org/api/webpage/method/open.html

The reason for this is because PhantomJS, by default, renders in a really small window, which makes it load the mobile version of the site. And with the PhantomJSDriver, calling maximizeWindow() (or maximize_window() in python) does absolutely nothing, since there is no rendered window to maximize. You will have to explicitly set the window's render size with:
edit: Below is the Java solution. I'm not entirely sure what the Python solution would be when setting the window size, but it should be similar.
driver.manage().window().setSize(new Dimension(1920, 1200));
edit again: Found the python version:
driver.set_window_size(1920, 1200)
Hope that helps!

PhantomJS 1.x is a really old browser. It only uses SSLv3 (now disabled on most sites) by default and doesn't implement most cutting edge functionality.
Advertisement scripts are usually delivered over HTTPS (SSLv3/TLS) and usually use some obscure feature of JavaScript which is not well tested or simply not implemented in PhantomJS.
If you use PhantomJS < v1.9.8 then you should use those commandline options (service_args): --ignore-ssl-errors=true --ssl-protocol=any.
If iframes or strange cross-domain requests are necessary for the page/ads to work, then add --web-security=false to the service_args.
If this still doesn't solve the problem, then try upgrading to PhantomJS 2.0.0. You might need to compile it yourself on Linux.

How to delete Firefox cookies from webdriver in python?

when I can't delete FF cookies from webdriver. When I use the .delete_all_cookies method, it returns None. And when I try to get_cookies, I get the following error:
webdriver_common.exceptions.ErrorInResponseException: Error occurred when processing
packet:Content-Length: 120
{"elementId": "null", "context": "{9b44672f-d547-43a8-a01e-a504e617cfc1}", "parameters": [], "commandName": "getCookie"}
response:Length: 266
{"commandName":"getCookie","isError":true,"response":{"lineNumber":576,"message":"Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIDOMLocation.host]","name":"NS_ERROR_FAILURE"},"elementId":"null","context":"{9b44672f-d547-43a8-a01e-a504e617cfc1} "}
How can I fix it?
Update:
This happens with clean installation of webdriver with no modifications. The changes I've mentioned in another post were made later than this post being posted (I was trying to fix the issue myself).

Hmm, I actually haven't worked with Webdriver so this may be of no help at all... but in your other post you mention that you're experimenting with modifying the delete cookie webdriver js function. Did get_cookies fail before you were modifying the delete function? What happens when you get cookies before deleting them? I would guess that the modification you're making to the delete function in webdriver-read-only\firefox\src\extension\components\firefoxDriver.js could break the delete function. Are you doing it just for debugging or do you actually want the browser itself to show a pop up when the driver tells it to delete cookies? It wouldn't surprise me if this modification broke.
My real advice though would be actually to start using Selenium instead of Webdriver since it's being discontinued in it's current incarnation, or morphed into Selenium. Selenium is more actively developed and has pretty active and responsive forms. It will continue to be developed and stable while the merge is happening, while I take it Webdriver might not have as many bugfixes going forward. I've had success using the Selenium commands that control cookies. They seem to be revamping their documentation and for some reason there isn't any link to the Python API, but if you download selenium rc, you can find the Python API doc in selenium-client-driver-python, you'll see there are a good 5 or so useful methods for controlling cookies, which you use in your own custom Python methods if you want to, say, delete all the cookies with a name matching a certain regexp. If for some reason you do want the browser to alert() some info about the deleted cookies too, you could do that by getting the cookie names/values from the python method, and then passing them to selenium's getEval() statement which will execute arbitrary js you feed it (like "alert()"). ... If you do go the selenium route feel free to contact me if you get a blocker, I might be able to assist.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.