Improve Web Scraping for Elements in a Container Using Selenium - python

I am using Firefox, and my code works just fine, except that it's very slow. I prevent images from loading, just to speed things up a little:
firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference('permissions.default.image', 2)
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
driver = webdriver.Firefox(firefox_profile=firefox_profile)
but the performance is still slow. I have tried going headless but unfortunately it did not work, as I receive NoSuchElement errors. So is there any way to speed up Selenium web scraping? I can't use Scrapy, because this is a dynamic scrape: I need to click through the next button several times, until no clickable buttons exist, and I need to click pop-up buttons as well.
Here is a snippet of the code:
a = []
b = []
c = []
d = []
e = []
f = []
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        time.sleep(2)
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        time.sleep(2)
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for j in B:
            b.append(j.text)
        time.sleep(3)
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for k in C:
            c.append(k.text)
        time.sleep(3)
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for l in D:
            d.append(l.text)
        time.sleep(3)
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for m in E:
            e.append(m.text)
    try:
        time.sleep(2)
        next = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next.click()
        time.sleep(2)
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):
        break
Here is an edited version, but speed does not improve.
========================================================================
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"ui_bubble_rating bubble_")]')))
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"recommend-titleInline noRatings")]')))
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for i in B:
            b.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"noQuotes")]')))
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for i in C:
            c.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"ratingDate")]')))
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for i in D:
            d.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"partial_entry")]')))
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for i in E:
            e.append(i.text)
    try:
        #time.sleep(2)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"nav next taLnk ui_button primary")]')))
        next = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next.click()
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"taLnk ulBlueLinks")]')))
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):
        break

For dynamic webpages (pages rendered or augmented using JavaScript), I would suggest you use scrapy-splash.
Not that you can't use Selenium, but for scraping purposes scrapy-splash is a better fit.
Also, if you have to use Selenium for scraping, a good idea would be to use the headless option. You can also use Chrome; I saw benchmarks some time back where headless Chrome was faster than headless Firefox.
Also, rather than sleep, it is better to use WebDriverWait with an expected condition, as it waits only as long as necessary, whereas a thread sleep makes you wait for the full specified time.
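As a rough sketch of that combination (headless Chrome plus explicit waits instead of time.sleep): the review-container XPath is taken from the question, the URL is a placeholder, and the rest is the generic pattern rather than a drop-in replacement for your loop:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")               # no visible browser window
options.add_argument("--window-size=1920,1080")  # a real viewport helps avoid NoSuchElement errors in headless mode

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")                # placeholder URL

# Wait only as long as needed for the containers to appear, then read them in one go.
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, '//*[contains(@class,"review-container")]')))
containers = driver.find_elements(By.XPATH, '//*[contains(@class,"review-container")]')
print(len(containers))
driver.quit()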
Edit: Adding this as an edit while trying to answer @QHarr, as the answer is pretty long.
It is a suggestion to evaluate scrapy-splash.
I gravitate towards Scrapy because of the whole ecosystem around scraping: middleware, proxies, deployment, scheduling, scaling. So basically, if you are looking at some serious scraping, Scrapy might be the better starting position. That suggestion comes with a caveat.
When it comes to speed, I can't give any objective answer, as I have never contrasted and benchmarked Scrapy against Selenium time-wise on any project of size.
But I would assume you will be able to get more or less comparable times on a serial run, if you are doing the same things, as in most cases the time spent is spent waiting for responses.
If you are scraping any considerable number of items, the speed-up you get generally comes from parallelising the requests, and from falling back to plain HTTP request/response where rendering the page in a user agent is not necessary.
Also, anecdotally, some of the in-page actions can be performed using the underlying HTTP request/response. So if time is a priority, you should be looking to get as many things as possible done with plain HTTP request/response.
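To illustrate the parallelisation point, a minimal sketch of fanning plain HTTP requests out over a thread pool; the URLs and worker count here are made-up placeholders, not anything from the question:
import requests
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of pages that could be fetched without a browser.
urls = [f"https://example.com/reviews?page={n}" for n in range(1, 11)]

def fetch(url):
    # A plain GET is far cheaper than rendering the page in a browser.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

# Fetch several pages concurrently instead of one after another.
with ThreadPoolExecutor(max_workers=5) as pool:
    pages = list(pool.map(fetch, urls))

print(f"fetched {len(pages)} pages")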

Related

How to break a loop if certain element is disabled and get text from multiple pages in Selenium Python

I am a new learner of Python and Selenium. I have written code to extract data from multiple pages, but there is a certain problem in it.
I am not able to break the while loop that clicks on the next page for as long as that option exists. The next-page element becomes disabled after reaching the last page, but the code still runs.
xpath: '//button[@aria-label="Next page"]'
Full SPAN: class="awsui_icon_h11ix_31bp4_98 awsui_size-normal-mapped-height_h11ix_31bp4_151 awsui_size-normal_h11ix_31bp4_147 awsui_variant-normal_h11ix_31bp4_219"
I am able to get the list of data I want to extract from the webpage, but I only get the last page's data when I close the browser from my end, which ends the while loop.
Full Code:
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install())
base_url = "XYZ"
driver.maximize_window()
driver.get(base_url)
driver.set_page_load_timeout(50)
element = WebDriverWait(driver, 50).until(EC.presence_of_element_located((By.ID, 'all-my-groups')))
driver.find_element(by=By.XPATH, value='//*[@id="sim-issueListContent"]/div[1]/div/div/div[2]/div[1]/span/div/input').send_keys('No Stock')
dfs = []
page_counter = 0
while True:
    wait = WebDriverWait(driver, 30)
    wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]")))
    cards = driver.find_elements_by_xpath("//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]")
    sims = []
    for card in cards:
        sims.append([card.text])
    df = pd.DataFrame(sims)
    dfs.append(df)
    print(page_counter)
    page_counter += 1
    try:
        wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@aria-label="Next page"]'))).click()
    except:
        break
driver.close()
driver.quit()
I am also attaching an image of the class, and sorry, I cannot share the URL as it is a private domain.
The easiest option is to let your wait.until() fail via timeout when the "Next page" button is missing. Right now your line wait = WebDriverWait(driver, 30) is setting the timeout to 30 seconds; assuming the page normally loads much faster than that, you could change the timeout to be 5 seconds and then the loop will end faster once you're at the last page. If your page load times are sometimes slow then you should make sure the timeout won't accidentally cut off too early; if the load times are consistently fast then you might be able to get away with an even shorter timeout interval.
Alternatively, you could look through the specific target webpage more carefully to find some element that a) is always present and b) can be used to determine whether we're on the final page or not. Then you could read the value of that element and decide whether to break the loop before trying to find the "Next page" button. This could save a couple of seconds of waiting on the final loop iteration (avoid waiting for timeout) but may not be worth the trouble.
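A minimal sketch of the first suggestion, assuming the rest of the loop stays as in the question: the generous 30-second wait is kept for the page data, and a separate, shorter wait is used only for the pagination click:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

wait = WebDriverWait(driver, 30)      # generous wait for the page data
nav_wait = WebDriverWait(driver, 5)   # short wait just for the "Next page" button

while True:
    # ... scrape the current page as before ...
    try:
        nav_wait.until(EC.element_to_be_clickable(
            (By.XPATH, '//button[@aria-label="Next page"]'))).click()
    except TimeoutException:
        # No clickable "Next page" button within 5 seconds: assume we are on the last page.
        break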
Change the below condition:
try:
    wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@aria-label="Next page"]'))).click()
except:
    break
as shown in the pseudocode below; the @disabled check is the difference that will make sure the while loop exits once the button is disabled.
if len(driver.find_elements_by_xpath('//button[@aria-label="Next page"][@disabled]')) > 0:
    break
else:
    driver.find_element_by_xpath('//button[@aria-label="Next page"]').click()

Elements take too much time to load in a popup div

Trying to scrape subscriber data from this page: https://happs.tv/@Pablo. This is exactly like Facebook's likes box, which opens when we click on the likes of a post. I need to scroll inside the pop-up which shows all those who liked a post. That works. However, the issue is that after 3,000-4,000 names, the new names start taking an awfully long time to load, sometimes 40 seconds for a single name. Even so, the script effectively fails: it doesn't exit, because there is no break, and it keeps repeating the same names. What could I improve to get past this? I tried increasing the driver wait; should I increase it more? Kind of stuck here.
Here is the part after the pop-up div with all the subscribers is open. Perhaps there is a better way to scroll inside the div? Could it be because of the cache? Just a stab in the dark.
current_len = len(driver.find_elements_by_xpath('//*[@id="userInfo"]/a'))
while True:
    driver.find_element_by_xpath('//*[@id="userInfo"]/a').send_keys(Keys.END)
    try:
        WebDriverWait(driver, 35).until(lambda x: len(driver.find_elements_by_xpath('//*[@id="userInfo"]/a')) > current_len)
        current_len = len(driver.find_elements_by_xpath('//*[@id="userInfo"]/a'))
    except TimeoutException:
        name_eles = [name_ele for name_ele in driver.find_elements_by_xpath('//*[@id="userInfo"]/a')]
        time.sleep(5)
        for name in name_eles:
            nt = name.text
            n_li = name.get_attribute('href')
            print(nt)
            print(n_li)
            dict1 = {"Given Name": nt, "URI": n_li}
            with open('happstv.csv', 'a+', encoding='utf-8-sig') as f:
                w = csv.DictWriter(f, dict1.keys())
                if not header_added:
                    w.writeheader()
                    header_added = True
                w.writerow(dict1)
INFO: Just changed the driver to Firefox; it seems to be going better. Will update the question details if there are any issues.
A response from an API should take less than 3 s per request. If your request returns too much data, load a smaller subset first (e.g. "SELECT top 10"). You should ask your team about performance first.
You can try a FluentWait-style wait (WebDriverWait with a poll frequency and ignored exceptions), as in:
driver = Firefox()
driver.get("http://somedomain/url_that_delays_loading")
wait = WebDriverWait(driver, 10, poll_frequency=1, ignored_exceptions=[ElementNotVisibleException, ElementNotSelectableException])
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//div")))

Python wait for document to be ready in selenium browser?

I have made a proxy checker in Python in combination with Selenium, so every time it opens the Selenium browser it uses a different proxy. But not all the proxies work, and I'm stuck loading the page forever if the proxy is slow. So my string key check doesn't work, because the page never finishes loading. Is there a function in Python that lets me do something like: if the page is not fully loaded within 10 seconds, go to the next proxy? Thanks in advance!
My code so far:
# PROXY SETUP FOR THIS PROGRAM
def fly_setup(fly_url):
    fly_options = webdriver.ChromeOptions()
    fly_options.add_experimental_option("prefs", {
        "profile.default_content_setting_values.notifications": 1
    })
    with open("proxies.txt") as fly_proxies:
        lines = fly_proxies.readlines()
        counter = 0
        for proxy in lines:
            fly_options.add_argument('--proxy-server=%s' % proxy.rstrip())
            ad_chrome = webdriver.Chrome(options=fly_options)
            ad_chrome.get(fly_url)
            ad_source = ad_chrome.page_source
            key = 'Vind ik leuk'
            time.sleep(10)
            if ad_chrome.set_page_load_timeout(10):
                print("Page load took too long.. Going to next proxy ")
            else:
                if key not in ad_source:
                    print("Proxy not working! Going to next one ...")
                    ad_chrome.quit()
                    time.sleep(3)
                else:
                    time.sleep(10)
                    ad_chrome.find_element_by_xpath('//*[@id="skip_bu2tton"]').click()
                    counter += 1
                    print("Total views : " + str(counter))
                    print("")
                    ad_chrome.quit()
                    time.sleep(3)
You can set a timeout limit using set_page_load_timeout like
driver.set_page_load_timeout(10)
If the page cannot be loaded within 10 seconds, it will throw a TimeoutException (see the Selenium docs); catch it and then switch to your next proxy.
In your code, if I assume lines contains all proxies, you can do something like this:
for proxy in lines:
    fly_options.add_argument('--proxy-server=%s' % proxy.rstrip())
    ad_chrome = webdriver.Chrome(options=fly_options)
    ad_chrome.set_page_load_timeout(10)
    try:
        ad_chrome.get(fly_url)
    except TimeoutException:
        continue
This solution doesn't always work, especially when the page loads data using AJAX calls. In that case, rely on Selenium's waits: wait for something that is only present/clickable once the whole page has finished loading, then, same idea, catch the TimeoutException and continue your loop.
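A rough sketch of that variant, continuing the names from the question (lines, fly_options, fly_url, ad_chrome) and assuming some element with a known id only appears once the AJAX content is in place; the id "content-loaded" here is purely hypothetical:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

for proxy in lines:
    fly_options.add_argument('--proxy-server=%s' % proxy.rstrip())
    ad_chrome = webdriver.Chrome(options=fly_options)
    try:
        ad_chrome.get(fly_url)
        # Wait for an element that only exists after the AJAX content has loaded.
        WebDriverWait(ad_chrome, 10).until(
            EC.presence_of_element_located((By.ID, "content-loaded")))  # hypothetical id
    except TimeoutException:
        ad_chrome.quit()
        continue
    # ... proxy works, proceed with the rest of the checks ...
    ad_chrome.quit()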

Do Selenium's waits always need a timeout? (Python)

In the Selenium docs we can see that we must set some timeout for a wait.
For example, code from that doc:
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID,'someid')))
I wonder, must we always set up some timeout? Or is there some method that will wait until all of the AJAX code has been downloaded, and only then will our driver interact with the web elements (I mean without any fixed timeout: it just loads everything and only afterwards starts interacting)?
Hopefully this code will help you. This is how I solved this issue.
# Check with jQuery if it has any outstanding ajax
def ajax_complete(self):
    try:
        return 0 == self.execute_script("return jQuery.active")
    except:
        pass

# Create a method to wait for ajax to complete
driver.wait_for_ajax = lambda: WebDriverWait(driver, 10).until(ajax_complete, "")
driver.implicitly_wait(30)
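For what it's worth, a short usage sketch of the helper above, assuming a page that fires its AJAX requests through jQuery after a click; the URL, button id, and CSS selector are placeholders:
# Navigate and trigger something that starts AJAX requests.
driver.get("https://example.com")                        # placeholder URL
driver.find_element_by_id("load-more").click()           # hypothetical button id

# Blocks until jQuery.active reports 0, or raises TimeoutException after 10 s.
driver.wait_for_ajax()

# Now it is safe to read the freshly loaded elements.
items = driver.find_elements_by_css_selector(".item")    # hypothetical selector
print(len(items))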

Scraper: Try skips code in while loop (Python)

I am working on my first scraper and ran into an issue. My scraper accesses a website and saves links from each result page. Now, I only want it to go through 10 pages. The problem comes when the search results have fewer than 10 pages. I tried using a while loop along with a try statement, but it does not seem to work. After the scraper goes through the first page of results, it does not return any links on the successive pages; however, it does not give me an error and stops once it reaches 10 pages or the exception.
Here is a snippet of my code:
links = []
page = 1
while(page <= 10):
    try:
        # Get information from the propertyInfo class
        properties = WebDriverWait(driver, 10).until(lambda driver: driver.find_elements_by_xpath('//div[@class = "propertyInfo item"]'))
        # For each listing
        for p in properties:
            # Find all elements with a tags
            tmp_link = p.find_elements_by_xpath('.//a')
            # Get the link from the second element to avoid error
            links.append(tmp_link[1].get_attribute('href'))
        page += 1
        WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath('//*[@id="paginador_siguiente"]/a').click())
    except ElementNotVisibleException:
        break
I really appreciate any pointers on how to fix this issue.
You are explicitly catching the ElementNotVisibleException exception and stopping on it. This way you won't see any error message. The error is probably in this line:
WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath('//*[@id="paginador_siguiente"]/a').click())
I assume the lambda here should be a test, which is run until it succeeds, so it shouldn't perform any action like a click. I actually believe that you don't need to wait here at all; the page should already be fully loaded, so you can just click on the link:
driver.find_element_by_xpath('//*[@id="paginador_siguiente"]/a').click()
This will either pass to the next page (and the WebDriverWait at the start of the loop will wait for it) or raise an exception if no next link is found.
Also, you'd better minimize the try ... except scope; this way you won't capture something unintentionally. E.g. here you only want to surround the next-link-finding code, not the whole loop body:
# ...
while(page <= 10):
    # Scrape this page
    properties = WebDriverWait(driver, 10).until(...)
    for p in properties:
        # ...
    page += 1

    # Try to pass to next page
    try:
        driver.find_element_by_xpath('//*[@id="paginador_siguiente"]/a').click()
    except ElementNotVisibleException:
        # Break if no next link is found
        break
