How can i click Google search result with selenium-webdriver - python

I have problem with following task:
Open Google start page
Type request in search form
Choose result where url matches some given url(for example http://www.theguardian.com)
Currently i have this script:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://google.com/")
search_form = driver.find_element_by_xpath("/html/body/div/div[3]/form/div[2]/div[2]/div[1]/div[1]/div[3]/div/div[3]/div/input[1]")
search_form.send_keys("guardian")
search_form.send_keys(Keys.ENTER)
driver.find_element_by_xpath('//a[starts-with(#href,"http://www.theguardian.com")]').click()
It succesfully executes first 2 subtasks but when on last line throws exception:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: {"method":"xpath","selector":"//a[starts-with(#href,\"http://www.theguardian.com\")]"}
Also i have this script which satisfies only last subtask:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
q = "guardian"
browser = webdriver.Firefox()
body = browser.find_element_by_tag_name("body")
body.send_keys(Keys.CONTROL + 't')
browser.get("https://www.google.com/search?q=" + q + "&start=" + str(counter))
browser.find_element_by_xpath('//a[starts-with(#href,"http://www.theguardian.com")]').click()
I works OK. My question is why first script throws exception on how can i modify it so it opens search result as second script does?
UPDATE:
As Bart and Shubham mentioned in comments, problem was in that i was trying to find element on page that wasn't yet loaded. So solution is to use 'wait'.
Selenium-webdriver provides 2 types of 'wait' -- explicit and implicit more on that in documentation.
For my solution i used implicit wait. Basically, it's telling WebDriver to wait for certain amount of time to find an element if it's not immediately available.
For that i just added 1 line to script:
driver.implicitly_wait(5)

Probably this is because the elements the first version is not yet on the page. If you create a "wait until element is present" kind of loop (do not know if it exists by heart) then it should work.
The second example does work because browser.get only returns if the page is loaded.

You can do something like below ...
Try to put some wait from first place
The code is in java but it is very similar/near to python, take a reference from it
You can check everytime that your element is present or not in your HTML DOM to prevent from error/failer of script. like below:-
if (driver.findElements(By.xpath('//a[starts-with(#href,"http://www.theguardian.com')).size() != 0) {
YOUR FIRST Working code
System.out.println("element exists");
}
else{
Your second working code
}
Hope it will help you :)

In this video, you can see how it can be done:
https://www.youtube.com/watch?v=IUBwtLG9hbs
Move to 12:30 to see it in action: Go to google.com, type something in the search box and click on a search result.

Related

Webscraping Click Button Selenium

I am trying to webscrape indeed.com to search for jobs using python, with selenium and beautifulsoup. I want to click next page but cant seem to figure out how to do this. Looked at many threads but it is unclear to me which element I am supposed to perform on. Here is the web page html and the code marked with grey comes up when I inspect the next button.
Also just to mention I tried first to follow what happens to the url when mousedown is executed. After reading the addppurlparam function and adding the strings in the function and using that url I just get thrown back to page one.
Here is my code for the class with selenium meant to click on the button:
from selenium import webdriver
from selenium.webdriver import ActionChains
driver = webdriver.Chrome("C:/Users/alleballe/Downloads/chromedriver.exe")
driver.get("https://se.indeed.com/Internship-jobb")
print(driver.title)
#assert "Python" in driver.title
elem = driver.find_element_by_class_name("pagination-list")
elem = elem.find_element_by_xpath("//li/a[#aria-label='Nästa']")
print(elem)
assert "No results found." not in driver.page_source
assert elem
action = ActionChains(driver).click(elem)
action.perform()
print(elem)
driver.close()
The indeed site is formatted so that it shows 10 per page.
Your photo shows the wrong section of HTML instead you can see the links contain start=0 for the first page, start=10 for the second, start=20 for the third,...
You could use this knowledge to do a code like this:
while True:
i = 0
driver.get(f'https://se.indeed.com/jobs?q=Internship&start={i}')
# code here
i = i + 10
But, to directly answer to your question you should do:
next_page_link = driver.find_element_by_xpath('/html/head/link[6]')
driver.get(next_page_link)
This will find the link and then get it.
its work. paginated to next page.
driver.find_element_by_class_name("pagination-list").find_element_by_tag_name('a').click()

xpath returns more than one result, how to handle in python

I have started selenium using python. I am able to change the message text using find_element_by_id. I want to do the same with find_element_by_xpath which is not successful as the xpath has two instances. want to try this out to learn about xpath.
I want to do web scraping of a page using python in which I need clarity on using Xpath mainly needed for going to next page.
#This code works:
import time
import requests
import requests
from selenium import webdriver
driver = webdriver.Chrome()
url = "http://www.seleniumeasy.com/test/basic-first-form-demo.html"
driver.get(url)
eleUserMessage = driver.find_element_by_id("user-message")
eleUserMessage.clear()
eleUserMessage.send_keys("Testing Python")
time.sleep(2)
driver.close()
#This works fine. I wish to do the same with xpath.
#I inspect the the Input box in chrome, copy the xpath '//*[#id="user-message"]' which seems to refer to the other box as well.
# I wish to use xpath method to write text in this box as follows which does not work.
driver = webdriver.Chrome()
url = "http://www.seleniumeasy.com/test/basic-first-form-demo.html"
driver.get(url)
eleUserMessage = driver.find_elements_by_xpath('//*[#id="user-message"]')
eleUserMessage.clear()
eleUserMessage.send_keys("Test Python")
time.sleep(2)
driver.close()
To elaborate on my comment you would use a list like this:
eleUserMessage_list = driver.find_elements_by_xpath('//*[#id="user-message"]')
my_desired_element = eleUserMessage_list[0] # or maybe [1]
my_desired_element.clear()
my_desired_element.send_keys("Test Python")
time.sleep(2)
The only real difference between find_elements_by_xpath and find_element_by_xpath is the first option returns a list that needs to be indexed. Once it's indexed, it works the same as if you had run the second option!

Python: how i can print all the source code by using Selenium

driver.page_source don't returns all the source code.It is detaily printing only some parts of code, but it's missing a big part of code. How can i fix this?
This is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
def htmlToLuna():
url ='https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A'
driver = webdriver.Chrome('C:\\Python27\\chromedriver\\chromedriver.exe')
driver.get(url)
web=open('web.txt','w')
web.write(driver.page_source)
print driver.page_source
web.close()
print htmlToLuna()
Here is a simple code all it does is it opens the url and gets the length page source and waits for five seconds and will get the length of page source again.
if __name__=="__main__":
browser = webdriver.Chrome()
browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
initial = len(browser.page_source)
print(initial)
time.sleep(5)
new_source = browser.page_source
print(len(new_source)
see the output:
15722
48800
you see that the length of the page source increases after a wait? you must make sure that the page is fully loaded before getting the source. But this is not a proper implementation since it blindly waits.
Here is a nice way to do this, The browser will wait until the element of your choice is found. Timeout is set for 10 sec.
if __name__=="__main__":
browser = webdriver.Chrome()
browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
try:
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.CodeMirror > div:nth-child(1) > textarea:nth-child(1)'))) # 10 seconds delay
print("Result:")
print(len(browser.page_source))
except TimeoutException:
print("Your exception message here!")
The output: Result: 52195
Reference:
https://stackoverflow.com/a/26567563/7642415
http://selenium-python.readthedocs.io/locating-elements.html
Hold on! even that wont make any guarantees for getting full page source, since individual elements are loaded dynamically. If the browser finds the element it moves on. So make sure you find the proper element to make sure the page has been loaded fully.
P.S Mine is Python3 & webdriver is in my environment PATH. So my code needs to be modified a bit to make it work for Python 2.x versions. I guess only print statements are to be modified.

Get visible content of a page using selenium and BeautifulSoup

I want to retrieve all visible content of a web page. Let say for example this webpage. I am using a headless firefox browser remotely with selenium.
The script I am using looks like this
driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
dom = BeautifulSoup(driver.page_source, parser)
f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))
with open('out.html', 'w') as fe:
fe.write(dom.encode('utf-8'))
This is supposed to load the page, parse the dom, and then replace the iframe with id dsq-app1 with it's visible content. If I execute those commands one by one via my python command line it works as expected. I can then see the paragraphs with all the visible content. When instead I execute all those commands at once, either by executing the script or by pasting all this snippet in my interpreter, it behaves differently. The paragraphs are missing, the content still exists in json format, but it's not what I want.
Any idea why this may happening? Something to do with replace_with maybe?
Sounds like the dom elements are not yet loaded when your code try to reach them.
Try to wait for the elements to be fully loaded and just then replace.
This works for your when you run it command by command because then you let the driver load all the elements before you execute more commands.
To add to Or Duan's answer I provide what I ended up doing. The problem of finding whether a page or parts of a page have loaded completely is an intricate one. I tried to use implicit and explicit waits but again I ended up receiving half-loaded frames. My workaround is to check the readyState of the original document and the readyState of iframes.
Here is a sample function
def _check_if_load_complete(driver, timeout=10):
elapsed_time = 1
while True:
if (driver.execute_script('return document.readyState') == 'complete' or
elapsed_time == timeout):
break
else:
sleep(0.0001)
elapsed_time += 1
then I used that function right after I changed the focus of the driver to the iframe
driver.switch_to_frame('dsq-app1')
_check_if_load_complete(driver, timeout=10)
Try to get the Page Source after detecting the required ID/CSS_SELECTOR/CLASS or LINK.
You can always use explicit wait of Selenium WebDriver.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
f = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.ID,idName)
# here 10 is time for which script will try to find given id
# provide the id name
dom = BeautifulSoup(driver.page_source, parser)
f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))
with open('out.html', 'w') as fe:
fe.write(dom.encode('utf-8'))
Correct me if this not work

How to use find_element_by_link_text() properly to not raise NoSuchElementException?

I have a HTML code like this:
<div class="links nopreview"><span><a class="csiAction"
href="/WebAccess/home.html#URL=centric://REFLECTION/INSTANCE/_CS_Data/null">Home</a></span> • <span><span><a class="csiAction"
href="/WebAccess/home.html#URL=centric://SITEADMIN/_CS_Site">Setup</a></span> • </span><span><a
title="Sign Out" class="csiAction csiActionLink">Sign Out</a></span></div>
I would like to click on the link that has the text Home. As this Home link appears after login, I have a code like this:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re
browser = webdriver.Firefox() # Get local session of firefox
browser.get("http://myServer/WebAccess/login.html") # Load App page
elem = browser.find_element_by_name("LoginID") # Find the Login box
elem.send_keys("Administrator")
elem = browser.find_element_by_name("Password") # Find the Password box
elem.send_keys("Administrator" + Keys.RETURN)
#try:
elem = browser.find_element_by_link_text("Home")
elem.click()
The part till login works great. However the last but one line is problematic
elem = browser.find_element_by_link_text("Home")
It raises this NoSuchElementException where the Home link is there as you can see from the HTML code.
raise exception_class(message, screen, stacktrace)
NoSuchElementException: Message: u'Unable to locate element: {"method":"link text","selector":"Home"}'
Any guidance as to what I am doing wrong, please?
Have you tried adding an implicit wait to this so that it waits instead of running to quickly.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re
browser = webdriver.Firefox() # Get local session of firefox
browser.implicitly_wait(10) #wait 10 seconds when doing a find_element before carrying on
browser.get("http://myServer/WebAccess/login.html") # Load App page
elem = browser.find_element_by_name("LoginID") # Find the Login box
elem.send_keys("Administrator")
elem = browser.find_element_by_name("Password") # Find the Password box
elem.send_keys("Administrator" + Keys.RETURN)
#try:
elem = browser.find_element_by_link_text("Home")
elem.click()
The implicitly_wait call makes the browser poll until the item is on the page and visible to be interacted with.
The most common issues with NoSuchElementException while the element is there are:
the element is in different window/frame, so you've to switch to it first,
your page is not loaded or your method of page load is not reliable.
Solution could include:
check if you're using the right frame/window by: driver.window_handles,
write a wait wrapper to wait for an element to appear,
try XPath instead, like: driver.find_element_by_xpath(u'//a[text()="Foo"]').click(),
use pdb to diagnose your problem more efficiently.
See also: How to find_element_by_link_text while having: NoSuchElement Exception?
Maybe the element you are looking for doesn't exactly match that text string?
I know it can be tricky if it looks like it does on-screen, but sometimes there are oddities embedded like this simple markup "Home" or "Home" which makes the first char italic:
"<i>H</i>ome" is visually identical to "<em>H</em>ome" but does not match text.
Edit: after writing the above answer, I studied the question closer and discovered the HTML sample does show "Home" in plain text, but was not visible due to long lines not wrapping. So I edited the OP to wrap the line for readability.
New observation: I noticed that the Logout element has a "title" attribute, but the Home link element lacks such--try giving it one and using that.
Try adding an implicit wait to this in order to wait, instead of running too quickly.
Or
else you can import time and use time.sleep(25)

Categories

Resources