I am using the Python bindings for Selenium with PhantomJS to scrape the contents of a website, as shown below.
The element I want to access is in the DOM but not in the HTML source. I understand that if I want to access elements in the DOM itself, I need to use the PhantomJS evaluate() function. (e.g. http://www.crmarsh.com/phantomjs/ ; http://phantomjs.org/quick-start.html)
How can I do this from within Selenium?
Here is part of my code (which is currently not able to access the element using a PhantomJS driver):
time.sleep(60)
driver.set_window_size(1024, 768)
todays_points = driver.find_elements_by_xpath("//div/a[contains(text(),'Today')]/preceding-sibling::span")
total = 0
for today in todays_points:
    driver.set_window_size(1024, 768)
    points = today.find_elements_by_class_name("stream_total_points")[0].text
    points = int(points[:-4])  # strip the trailing ' pts'
    total += points
driver.close()
print total
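From what I can tell, Selenium exposes JavaScript execution through execute_script, which seems to play the role of PhantomJS's evaluate(). Here is a minimal sketch of what I mean, using the class name from my code above:
# execute_script runs JavaScript inside the page and returns the result,
# much like PhantomJS's evaluate().
points_text = driver.execute_script(
    "var el = document.querySelector('span.stream_total_points');"
    "return el ? el.textContent : null;")
print(points_text)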
Chrome version: 105.0.5195.102
Selenium == 4.4.3
Python == 3.9.12
On a certain page, element.text takes roughly 0.x seconds, which is unbearably slow. I would expect element.text to simply return text from the already-loaded page, so I can't understand why it takes so long. How can I make it faster?
Here are similar Q&As, but I need to solve the problem with Selenium alone.
Parse text with BeautifulSoup
Parse text with lxml
Another question: why does every element.text call take a different amount of time?
For example,
import time

import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.common.by import By

chromedriver_autoinstaller.install(cwd=True)

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument('--no-sandbox')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--disable-dev-shm-usage')
options.add_experimental_option("excludeSwitches", ["enable-logging"])

wd = webdriver.Chrome(options=options)
wd.get("https://www.bbc.com/")

t0 = time.time()
e = wd.find_element(By.CSS_SELECTOR, "#page > section.module.module--header > h2")
print(time.time() - t0)

for i in range(10):
    t0 = time.time()
    txt = e.text
    print(time.time() - t0)  # prints a different result on every loop

wd.quit()
Selenium can be a bit slow because it does NOT talk to Chrome directly: every command goes through a channel, the ChromeDriver process, so each call pays a round-trip cost.
If you wish to work with a faster tool for automation, try using Playwright.
Another thing you can try is to locate your element more directly, rather than with a long CSS or XPath expression: the longer the expression, the longer it takes to find the element and read its text (see the sketch below).
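For illustration, a minimal sketch of a more direct locator (the id here is made up; use whatever stable attribute the element actually has):
from selenium.webdriver.common.by import By

# A short, direct locator resolves faster than a long descendant chain.
# "promo-heading" is a hypothetical id, used only for illustration.
e = wd.find_element(By.ID, "promo-heading")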
I see the following output for your code:
0.0139617919921875
0.01196908950805664
0.003988742828369141
0.004987955093383789
0.003988027572631836
0.0039899349212646484
0.003989219665527344
0.004987955093383789
0.003987789154052734
0.003989696502685547
0.0049860477447509766
The first 2 times are about 12-14 milliseconds while the others are about 4 milliseconds.
The first action
wd.find_element(By.CSS_SELECTOR, "#page > section.module.module--header > h2")
polls the DOM until it finds the element matching the given locator.
The txt = e.text line, by contrast, uses an already existing reference to the element on the page, so it performs no polling or searching; it just accesses the element via the existing reference (pointer), which is why it takes significantly less time.
Why the second call takes about as long as the first, I'm not sure.
I ran this test several times and got different outputs, but the overall picture was roughly the same.
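As an aside, each e.text call is a full round-trip through the driver channel. If that cost matters, one possible alternative is to fetch the text with a single JavaScript call, reusing the element reference already found; note that textContent can differ from .text in whitespace and hidden-element handling:
# One JavaScript round-trip instead of the WebElement .text protocol call.
# textContent skips the rendered-text computation that .text performs.
txt = wd.execute_script("return arguments[0].textContent;", e)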
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(driver_path, options=chrome_options)
wait = WebDriverWait(driver, 20)
driver.get('https://%s/' % asset_id)
wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='dev_diaginfo_fid']")))
print(driver.find_element_by_xpath("//*[@id='dev_diaginfo_fid']").get_attribute("innerHTML"))
I'm able to log into the website, and Selenium returns the WebElement, but it is not consistent about returning the text from that WebElement. Sometimes it returns the text; other times it seems the page isn't loading fast enough (the network where this runs is slow) and it returns no data at all, even though I can still see the WebElement itself, just not the data. The data is dynamically loaded via JS. Probably not relevant, but I am using send_keys to pass the credentials needed to log in, and then the page with the version is loaded.
Is there a way to use an ExpectedCondition (EC) to wait until it sees text before moving on? I'm attempting to pull the firmware version from a network device; it finds the Firmware element, but it is not consistent about returning the actual firmware version. As stated before, there are occasional issues with network speed, so my suspicion is that it moves on before the firmware number has loaded. This device does not have internet access, so I can't share the URL. I can confirm that I have pulled the firmware version; it's just not consistent.
I have tried passing the page to BeautifulSoup and can verify that it sees "Firmware Version:", but the inner tags are empty.
Edit: I have also tried EC.visibility_of_all_elements_located and EC.visibility_of_element_located, with no luck.
Here's an idea.
Try a while loop until you see the text.
import time

counter = 0
elem = driver.find_element_by_xpath("//*[@id='dev_diaginfo_fid']").get_attribute("innerHTML")
while elem == "":
    time.sleep(0.5)  # wait half a second before checking again
    elem = driver.find_element_by_xpath("//*[@id='dev_diaginfo_fid']").get_attribute("innerHTML")
    if elem != "":
        print("Success! The text is: " + elem)
        break
    if counter > 20:
        print("Text still not found!")
        break
    counter += 1
Obviously, adjust the loop to suit your needs.
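Alternatively, a sketch using an explicit wait instead of a hand-rolled loop: wait.until retries the callable until it returns something truthy (the locator comes from the question; the lambda is a custom condition, not a built-in EC):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 20 s for the element's innerHTML to become non-empty;
# an empty string makes the lambda return False, which triggers a retry.
wait = WebDriverWait(driver, 20)
firmware = wait.until(
    lambda d: d.find_element(By.XPATH, "//*[@id='dev_diaginfo_fid']")
               .get_attribute("innerHTML").strip() or False)
print("Firmware version: " + firmware)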
In Python 3 with Selenium, I have this script that automates searching for terms on a site with public information:
from selenium import webdriver
# Driver Path
CHROME = '/usr/bin/google-chrome'
CHROMEDRIVER = '/home/abraji/Documentos/Code/chromedriver_linux64/chromedriver'
# Chosen browser options
chrome_options = webdriver.chrome.options.Options()
chrome_options.add_argument('--window-size=1920,1080')
chrome_options.binary_location = CHROME
# Website accessed
link = 'https://pjd.tjgo.jus.br/BuscaProcessoPublica?PaginaAtual=2&Passo=7'
# Search term
nome = "MARCONI FERREIRA PERILLO JUNIOR"
# Waiting time
wait = 60
# Open browser
browser = webdriver.Chrome(CHROMEDRIVER, options = chrome_options)
# Implicit wait
browser.implicitly_wait(wait)
# Access the link
browser.get(link)
# Search by term
browser.find_element_by_xpath("//*[#id='NomeParte']").send_keys(nome)
browser.find_element_by_xpath("//*[#id='btnBuscarProcPublico']").click()
# Searches for the text of the last icon - the last page button
element = browser.find_element_by_xpath("//*[#id='divTabela']/div[2]/div[2]/div[4]/div[2]/ul/li[9]/a").text
element
'»'
When you search for terms, this site paginates the results and always shows the "»" button as the last pagination button.
The next-to-last button will be "›".
So I need to capture the text of the button two positions before the last one, in this case the number "8", to automate the page changes: that way I will know how many clicks on "next page" will be needed.
Please, when searching with XPath, how do I capture the element two positions before the last?
I know this is not an answer to the original question, but clicking the next button several times is not good practice.
I checked the network traffic and saw that they call their API URL with an offset parameter. You should be able to use this URL with the proper offset as needed.
If you really need to access the last-but-two button, you need to get all the navigation buttons first and then pick one by index, as follows.
elems = browser.find_elements_by_xpath(xpath)
elems[-3]  # third from the end: the last numbered page button ("8")
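Alternatively, XPath itself can index from the end; a sketch using the divTabela container from the question:
# li[last()-2] selects the list item two positions before the last one.
elem = browser.find_element_by_xpath("//*[@id='divTabela']//ul/li[last()-2]/a")
print(elem.text)  # expected: '8'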
EDIT
I just tested their API and it works when the proper cookie value is given.
This way is much faster than automation using Selenium.
Use Selenium just to extract the cookie value to be used in the header of the web request.
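A rough sketch of that approach (the offset parameter name is an assumption; inspect the real request in your browser's network tab for the exact URL and parameters):
import requests

# Reuse the session cookies that Selenium collected after the search.
cookies = {c['name']: c['value'] for c in browser.get_cookies()}

# Hypothetical parameter names, for illustration only.
resp = requests.get(
    'https://pjd.tjgo.jus.br/BuscaProcessoPublica',
    params={'PaginaAtual': 2, 'Passo': 7, 'offset': 10},
    cookies=cookies)
print(resp.status_code)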
I realise that this question is very similar to this, and other SO questions. However, I've played around with the screen size, and also the wait times, and that has not fixed the problem.
I am
starting a driver
opening a website and logging in with Selenium
scraping specific values from the home page
This is the code that isn't working in PhantomJS (but works fine if I use chromedriver):
time.sleep(60)
driver.set_window_size(1024, 768)
todays_points = driver.find_elements_by_xpath("//div/a[contains(text(),'Today')]/preceding-sibling::span")
total = 0
for today in todays_points:
    driver.set_window_size(1024, 768)
    points = today.find_elements_by_class_name("stream_total_points")[0].text
    points = int(points[:-4])  # strip the trailing ' pts'
    total += points
driver.close()
print total
The HTML that I'm trying to access is inside a div element:
<span class="stream-type">tracked 7 Minutes-Recode for <span class="stream_total_points">349 pts</span></span>
<a class="action_time gray_link" href="/entry/42350781/">Today</a>
I want to grab the '349 pts' text. However, with PhantomJS the value returned is always 0, so I think it's not finding that element.
EDIT: When I print the HTML source using print(driver.page_source), the correct page is output, but the element is not visible. Checking in Chrome with the view-source tool, I can't see the element there either (but I can with inspect element). Could this be why PhantomJS cannot access the element?
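For reference, here is a small sketch of how I can dump the live, JavaScript-rendered DOM (as opposed to the original source) to check whether the element is present:
# The live DOM as currently rendered; compare with driver.page_source.
live_dom = driver.execute_script("return document.documentElement.outerHTML;")
print("stream_total_points" in live_dom)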
I'm trying to find a specific piece of text on the page https://play.google.com/store/apps/details?id=org.codein.filemanager&hl=en using Selenium with Python. I'm looking for the "Current Version" element on the above URL. I used the code below:
browser = webdriver.Firefox()  # Get local session of Firefox
browser.get(sampleURL)  # Load page
elem = browser.find_elements_by_class_name("Current Version")  # Find the query box
print elem
time.sleep(2)  # Let the page load, will be added to the API
browser.close()
I don't seem to get the output printed. Am I doing anything wrong here?
There is no class with the name "Current Version". If you want to capture the version number that is below the "Current Version" text, then you can use this XPath expression:
browser.find_element_by_xpath("//div[#itemprop='softwareVersion']")