In the Selenium docs we can see that we must set some timeout for a wait.
For example, code from those docs:
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID,'someid')))
I wonder, do we always have to set a fixed timeout? Or is there some method that waits until all of the AJAX code has finished loading and only then lets the driver interact with the web elements (I mean without any fixed timeout: it just loads everything and only after that starts interacting)?
Hopefully this code will help you. This is how I solved this issue.
# Check with jQuery if it has any outstanding AJAX requests
def ajax_complete(self):
    try:
        return 0 == self.execute_script("return jQuery.active")
    except:
        pass

# Create a method to wait for AJAX to complete
driver.wait_for_ajax = lambda: WebDriverWait(driver, 10).until(ajax_complete, "")
driver.implicitly_wait(30)
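To show how the helper is meant to be used, here is a minimal, self-contained sketch (the example.com URL is a placeholder, and the page must actually load jQuery for jQuery.active to exist):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")  # hypothetical URL

# Same idea as the helper above: wait (up to 10 s) until jQuery reports
# no outstanding AJAX requests.
driver.wait_for_ajax = lambda: WebDriverWait(driver, 10).until(
    lambda d: d.execute_script("return jQuery.active") == 0
)
driver.wait_for_ajax()

# Only then interact with the element (ID taken from the question above).
element = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "someid"))
)
element.click()

Note that even here the wait still has an upper bound (10 seconds in this sketch); an explicit wait returns as soon as the condition becomes true, so the timeout is only a ceiling, not a fixed pause.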
I am a new learner of Python and Selenium. I have written code to extract data from multiple pages, but there is a problem in it.
I am not able to break the while loop that clicks on the next page while one is available. The next-page element becomes disabled after reaching the last page, but the code still runs.
xpath: '//button[@aria-label="Next page"]'
Full SPAN: class="awsui_icon_h11ix_31bp4_98 awsui_size-normal-mapped-height_h11ix_31bp4_151 awsui_size-normal_h11ix_31bp4_147 awsui_variant-normal_h11ix_31bp4_219"
I am able to get the list of data that I want to extract from the webpage, but I only get the last page's data when I close the browser from my end, which ends the while loop.
Full Code:
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install())
base_url = "XYZ"
driver.maximize_window()
driver.get(base_url)
driver.set_page_load_timeout(50)
element = WebDriverWait(driver, 50).until(EC.presence_of_element_located((By.ID, 'all-my-groups')))
driver.find_element(by=By.XPATH, value='//*[@id="sim-issueListContent"]/div[1]/div/div/div[2]/div[1]/span/div/input').send_keys('No Stock')
dfs = []
page_counter = 0
while True:
    wait = WebDriverWait(driver, 30)
    wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]")))
    cards = driver.find_elements_by_xpath("//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]")
    sims = []
    for card in cards:
        sims.append([card.text])
    df = pd.DataFrame(sims)
    dfs.append(df)
    print(page_counter)
    page_counter += 1
    try:
        wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@aria-label="Next page"]'))).click()
    except:
        break
driver.close()
driver.quit()
I am also attaching an image of the class. Sorry, I cannot share the URL as it is a private domain.
The easiest option is to let your wait.until() fail via timeout when the "Next page" button is missing. Right now wait = WebDriverWait(driver, 30) sets the timeout to 30 seconds; assuming the page normally loads much faster than that, you could drop the timeout to 5 seconds so the loop ends sooner once you're on the last page. If your page load times are sometimes slow, make sure the timeout won't cut off too early; if they are consistently fast, you may be able to get away with an even shorter interval.
Alternatively, you could look through the specific target webpage more carefully to find some element that a) is always present and b) can be used to determine whether we're on the final page or not. Then you could read the value of that element and decide whether to break the loop before trying to find the "Next page" button. This could save a couple of seconds of waiting on the final loop iteration (avoid waiting for timeout) but may not be worth the trouble.
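A minimal sketch of the first option, assuming the rest of the loop from the question stays the same (the 5-second figure is just an illustrative choice):

from selenium.common.exceptions import TimeoutException

content_wait = WebDriverWait(driver, 30)  # generous wait for the page content
button_wait = WebDriverWait(driver, 5)    # short wait just for the "Next page" button

while True:
    content_wait.until(EC.visibility_of_all_elements_located(
        (By.XPATH, "//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]")))
    # ... scrape the cards exactly as in the question ...
    try:
        # On the last page this times out after only 5 seconds and ends the loop
        button_wait.until(EC.element_to_be_clickable(
            (By.XPATH, '//button[@aria-label="Next page"]'))).click()
    except TimeoutException:
        break

Catching TimeoutException explicitly (instead of a bare except) also makes it clearer why the loop ended.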
Change the below condition
try:
    wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@aria-label="Next page"]'))).click()
except:
    break
as shown in the snippet below; the @disabled check is the difference that makes sure we exit the while loop once the button is disabled.
if len(driver.find_elements_by_xpath('//button[@aria-label="Next page"][@disabled]')) > 0:
    break
else:
    driver.find_element_by_xpath('//button[@aria-label="Next page"]').click()
I am trying to use some explicit waits with undetected-chromedriver (v2). Rather than executing the statements once the element has loaded, it appears to pause until the wait time expires.
When I use the normal Selenium chromedriver everything works as expected (the "opt-in" is closed in 1-2 seconds), and when I use sleeps instead of waits the statements are executed much quicker.
Can anyone see the problem?
Here's the code:
class My_Chrome(uc.Chrome):
    def __del__(self):
        pass

options = uc.ChromeOptions()
arguments = [
    '--log-level=3', '--no-first-run', '--no-service-autorun', '--password-store=basic',
    '--start-maximized',
    '--window-size=1920, 1080',
    '--credentials_enable_service=False',
    '--profile.password_manager_enabled=False,'
    '--add_experimental_option("detach", True)'
]
for argument in arguments:
    options.add_argument(argument)

driver = My_Chrome(options=options)
wait = WebDriverWait(driver, 20)
driver.get('https://www.oddschecker.com')

try:
    opt_in = wait.until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Not Now']/..")))
    VirtualClick(driver, opt_in)
    current_time('Closing opt-in')
except:
    pass
I have made a proxy checker in Python in combination with Selenium, so every time it opens the Selenium browser it uses a different proxy. But not all the proxies work, and I'm stuck loading the page forever if the proxy is slow. So using strings as a key doesn't work, because the page never finishes loading. Is there a function in Python that lets me do something like: if the page is not fully loaded in 10 seconds, go to the next proxy? Thanks in advance!
My code so far:
# PROXY SETUP FOR THIS PROGRAM
def fly_setup(fly_url):
    fly_options = webdriver.ChromeOptions()
    fly_options.add_experimental_option("prefs", {
        "profile.default_content_setting_values.notifications": 1
    })
    with open("proxies.txt") as fly_proxies:
        lines = fly_proxies.readlines()
        counter = 0
        for proxy in lines:
            fly_options.add_argument('--proxy-server=%s' % proxy.rstrip())
            ad_chrome = webdriver.Chrome(options=fly_options)
            ad_chrome.get(fly_url)
            ad_source = ad_chrome.page_source
            key = 'Vind ik leuk'
            time.sleep(10)
            if ad_chrome.set_page_load_timeout(10):
                print("Page load took too long.. Going to next proxy ")
            else:
                if key not in ad_source:
                    print("Proxy not working! Going to next one ...")
                    ad_chrome.quit()
                    time.sleep(3)
                else:
                    time.sleep(10)
                    ad_chrome.find_element_by_xpath('//*[@id="skip_bu2tton"]').click()
                    counter += 1
                    print("Total views : " + str(counter))
                    print("")
                    ad_chrome.quit()
                    time.sleep(3)
You can set a timeout limit using set_page_load_timeout like
driver.set_page_load_timeout(10)
If the page cannot be loaded within 10 seconds, it will throw a TimeoutException (see the docs); catch it and then switch to your next proxy.
In your code, if I assume lines contains all proxies, you can do something like this:
from selenium.common.exceptions import TimeoutException

for proxy in lines:
    fly_options.add_argument('--proxy-server=%s' % proxy.rstrip())
    ad_chrome = webdriver.Chrome(options=fly_options)
    ad_chrome.set_page_load_timeout(10)
    try:
        ad_chrome.get(fly_url)
    except TimeoutException:
        continue
This solution doesn't always work, especially when the page loads data using AJAX calls. In this case, bet on selenium's waits, wait for something that is only presented/clickable when the whole page finishes loading, then same idea, catch TimeoutException and continue your loop.
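A minimal sketch of that idea combined with the proxy loop; it reuses the skip_bu2tton id from the question as the element that signals the page has really finished loading (treat that as an assumption about the target page):

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for proxy in lines:
    fly_options.add_argument('--proxy-server=%s' % proxy.rstrip())
    ad_chrome = webdriver.Chrome(options=fly_options)
    ad_chrome.set_page_load_timeout(10)
    try:
        ad_chrome.get(fly_url)
        # Wait for an element that only exists once the page is really usable
        WebDriverWait(ad_chrome, 10).until(
            EC.presence_of_element_located((By.ID, "skip_bu2tton"))
        )
    except TimeoutException:
        print("Proxy too slow, moving to the next one ...")
        ad_chrome.quit()
        continue
    # ... the rest of the per-proxy work goes here ...
    ad_chrome.quit()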
I am using Firefox, and my code is working just fine, except that it's very slow. I prevent images from loading, just to speed things up a little bit:
firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference('permissions.default.image', 2)
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
driver = webdriver.Firefox(firefox_profile=firefox_profile)
but the performance is still slow. I have tried going headless but, unfortunately, it did not work, as I receive NoSuchElement errors. So is there any way to speed up Selenium web scraping? I can't use Scrapy, because this is a dynamic scrape: I need to click the next button several times, until no clickable buttons exist, and I need to click pop-up buttons as well.
Here is a snippet of the code:
a = []
b = []
c = []
d = []
e = []
f = []
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        time.sleep(2)
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        time.sleep(2)
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for j in B:
            b.append(j.text)
        time.sleep(3)
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for k in C:
            c.append(k.text)
        time.sleep(3)
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for l in D:
            d.append(l.text)
        time.sleep(3)
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for m in E:
            e.append(m.text)
    try:
        time.sleep(2)
        next = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next.click()
        time.sleep(2)
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException) as e:
        break
Here is an edited version, but speed does not improve.
========================================================================
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"ui_bubble_rating bubble_")]')))
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"recommend-titleInline noRatings")]')))
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for i in B:
            b.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"noQuotes")]')))
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for i in C:
            c.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"ratingDate")]')))
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for i in D:
            d.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"partial_entry")]')))
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for i in E:
            e.append(i.text)
    try:
        #time.sleep(2)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"nav next taLnk ui_button primary")]')))
        next = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next.click()
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"taLnk ulBlueLinks")]')))
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException) as e:
        break
For dynamic webpages (pages rendered or augmented using JavaScript), I suggest you use scrapy-splash.
Not that you can't use Selenium, but for scraping purposes scrapy-splash is a better fit.
Also, if you have to use Selenium for scraping, a good idea would be to use the headless option, and you can use Chrome. A while back I ran some benchmarks where headless Chrome was faster than headless Firefox.
Also, rather than sleep, it would be better to use WebDriverWait with an expected condition, as it waits only as long as necessary, whereas a thread sleep makes you wait for the full specified time.
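A minimal sketch of those two suggestions together, headless Chrome plus an explicit wait instead of time.sleep; the review-container XPath is reused from the question, the URL is a placeholder:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/reviews")  # hypothetical URL

# Returns as soon as the containers are present instead of always sleeping
containers = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, './/*[contains(@class,"review-container")]'))
)
print(len(containers))
driver.quit()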
Edit: adding this as an edit while trying to answer @QHarr, as the answer is pretty long.
It is a suggestion to evaluate scrapy-splash.
I gravitate towards Scrapy because of the whole ecosystem around scraping: middleware, proxies, deployment, scheduling, scaling. So basically, if you are looking to do some serious scraping, Scrapy might be the better starting position. That suggestion comes with that caveat.
When it comes to speed, I can't give an objective answer, as I have never benchmarked Scrapy against Selenium time-wise on a project of any size.
But I would assume you will get more or less comparable times on a serial run if you are doing the same things, as in most cases the time you spend is waiting for responses.
If you are scraping any considerable number of items, the speed-up generally comes from parallelising the requests, and, where possible, from falling back to a basic HTTP request/response instead of rendering the page in a user agent.
Also, anecdotally, some of the in-page actions can be performed using the underlying HTTP request/response. So if time is a priority, you should be looking to get as many things as possible done with plain HTTP requests.
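As an illustration of that last point, a sketch of grabbing a page with a plain HTTP request via the requests library instead of a rendered browser (the URL and the marker string are placeholders):

import requests

# One plain HTTP request: no browser startup, no JavaScript rendering.
response = requests.get("https://example.com/reviews", timeout=10)  # hypothetical URL
response.raise_for_status()

# This only works when the data is already in the server-rendered HTML
# (or available from a JSON endpoint), not injected later by JavaScript.
if "review-container" in response.text:
    print("Found review markup in the raw HTML")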
driver.page_source doesn't return all of the source code. It prints only some parts of the page in detail, but a big part of the code is missing. How can I fix this?
This is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
def htmlToLuna():
    url = 'https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A'
    driver = webdriver.Chrome('C:\\Python27\\chromedriver\\chromedriver.exe')
    driver.get(url)
    web = open('web.txt', 'w')
    web.write(driver.page_source)
    print driver.page_source
    web.close()

print htmlToLuna()
Here is a simple snippet: all it does is open the URL, get the length of the page source, wait five seconds, and then get the length of the page source again.
import time
from selenium import webdriver

if __name__ == "__main__":
    browser = webdriver.Chrome()
    browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
    initial = len(browser.page_source)
    print(initial)
    time.sleep(5)
    new_source = browser.page_source
    print(len(new_source))
See the output:
15722
48800
You see that the length of the page source increases after a wait? You must make sure that the page is fully loaded before getting the source. But this is not a proper implementation, since it blindly waits.
Here is a nicer way to do this: the browser will wait until the element of your choice is found. The timeout is set to 10 seconds.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

if __name__ == "__main__":
    browser = webdriver.Chrome()
    browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
    try:
        # Wait up to 10 seconds for the editor's textarea to be present
        WebDriverWait(browser, 10).until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '.CodeMirror > div:nth-child(1) > textarea:nth-child(1)')))
        print("Result:")
        print(len(browser.page_source))
    except TimeoutException:
        print("Your exception message here!")
The output: Result: 52195
Reference:
https://stackoverflow.com/a/26567563/7642415
http://selenium-python.readthedocs.io/locating-elements.html
Hold on! Even that won't guarantee getting the full page source, since individual elements are loaded dynamically. If the browser finds the element, it moves on. So make sure you wait for the proper element, to be certain the page has been loaded fully.
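One complementary check, as a sketch rather than a guarantee: wait for document.readyState to report 'complete' before looking for specific elements. Note that this only covers the initial document, not content injected later by JavaScript:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")

# Wait up to 10 seconds for the browser to report the document as fully loaded
WebDriverWait(browser, 10).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)
print(len(browser.page_source))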
P.S. Mine is Python 3 and the webdriver is on my environment PATH, so my code needs to be modified a bit to work with Python 2.x versions. I guess only the print statements need to be modified.