Elements take too much time to load in a popup div - python

Trying to scrape subscriber data from this page: https://happs.tv/#Pablo. This is exactly like Facebook's likes box, which opens when you click on the likes of a post. I need to scroll inside the pop-up that lists everyone who liked a post, and that part works. However, the issue is that after 3-4000 names, new names start taking an awful amount of time to load, sometimes 40 seconds per name. Even so, the script fails, and since there is no break it doesn't exit but keeps repeating the same names. What could I improve to get past this? I tried increasing the driver wait; should I increase it more? Kind of stuck here.
Here is the part that runs after the pop-up div with all the subscribers is open. Perhaps there is a better way to scroll inside the div? Could it be because of the cache? Just a stab in the dark.
# header_added is assumed to be initialised to False earlier in the script
current_len = len(driver.find_elements_by_xpath('//*[@id="userInfo"]/a'))
while True:
    # scroll the subscriber list by sending END to one of the name links
    driver.find_element_by_xpath('//*[@id="userInfo"]/a').send_keys(Keys.END)
    try:
        # wait until more names have loaded than we had before
        WebDriverWait(driver, 35).until(lambda x: len(driver.find_elements_by_xpath('//*[@id="userInfo"]/a')) > current_len)
        current_len = len(driver.find_elements_by_xpath('//*[@id="userInfo"]/a'))
    except TimeoutException:
        name_eles = driver.find_elements_by_xpath('//*[@id="userInfo"]/a')
        time.sleep(5)
        for name in name_eles:
            nt = name.text
            n_li = name.get_attribute('href')
            print(nt)
            print(n_li)
            dict1 = {"Given Name": nt, "URI": n_li}
            with open('happstv.csv', 'a+', encoding='utf-8-sig') as f:
                w = csv.DictWriter(f, dict1.keys())
                if not header_added:
                    w.writeheader()
                    header_added = True
                w.writerow(dict1)
INFO: Just changed the driver to Firefox; it seems to be going better. Will update the question details if any issues come up.

A response from an API should normally take less than about 3 seconds per request. If your request returns too much data, limit it (something like "SELECT TOP 10") and ask your team to look at the performance first.
You can try a FluentWait, for example:
from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotVisibleException, ElementNotSelectableException

driver = Firefox()
driver.get("http://somedomain/url_that_delays_loading")
wait = WebDriverWait(driver, 10, poll_frequency=1, ignored_exceptions=[ElementNotVisibleException, ElementNotSelectableException])
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//div")))
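For the original question, it may also help to scroll the pop-up container itself with JavaScript instead of sending END to a link, and to write only the names that are new on each pass instead of re-reading the whole list every time. A rough sketch, assuming the names still live under the userInfo element, that at least one name is visible when the pop-up opens, and that the old find_elements_by_xpath API is in use as in the question:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

seen = 0
while True:
    links = driver.find_elements_by_xpath('//*[@id="userInfo"]/a')
    for name in links[seen:]:                           # only the names we have not processed yet
        print(name.text, name.get_attribute('href'))    # or write to CSV as in the question
    seen = len(links)
    # scroll the last loaded name into view to trigger the next batch
    driver.execute_script("arguments[0].scrollIntoView();", links[-1])
    try:
        WebDriverWait(driver, 35, poll_frequency=1).until(
            lambda d: len(d.find_elements_by_xpath('//*[@id="userInfo"]/a')) > seen)
    except TimeoutException:
        break   # nothing new arrived within the wait, assume the list is complete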

Related

How to break a loop if certain element is disabled and get text from multiple pages in Selenium Python

I am a new learner of Python and Selenium. I have written code to extract data from multiple pages, but there is a problem in it.
I am not able to break out of the while loop that clicks on the next page for as long as one is available. The next-page element becomes disabled after reaching the last page, but the code still runs.
xpath: '//button[@aria-label="Next page"]'
Full SPAN: class="awsui_icon_h11ix_31bp4_98 awsui_size-normal-mapped-height_h11ix_31bp4_151 awsui_size-normal_h11ix_31bp4_147 awsui_variant-normal_h11ix_31bp4_219"
I am able to get the list of data I want to extract from the webpage, but I only end up with the last page's data, and only when I close the webpage from my end, which ends the while loop.
Full Code:
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install())
base_url = "XYZ"
driver.maximize_window()
driver.get(base_url)
driver.set_page_load_timeout(50)
element = WebDriverWait(driver, 50).until(EC.presence_of_element_located((By.ID, 'all-my-groups')))
driver.find_element(by=By.XPATH, value='//*[@id="sim-issueListContent"]/div[1]/div/div/div[2]/div[1]/span/div/input').send_keys('No Stock')
dfs = []
page_counter = 0
while True:
    wait = WebDriverWait(driver, 30)
    wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]")))
    cards = driver.find_elements_by_xpath("//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]")
    sims = []
    for card in cards:
        sims.append([card.text])
    df = pd.DataFrame(sims)
    dfs.append(df)
    print(page_counter)
    page_counter += 1
    try:
        wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@aria-label="Next page"]'))).click()
    except:
        break
driver.close()
driver.quit()
I am also attaching an image of the class; sorry, I cannot share the URL, as it is a private domain.
The easiest option is to let your wait.until() fail via timeout when the "Next page" button is missing. Right now your line wait = WebDriverWait(driver, 30) is setting the timeout to 30 seconds; assuming the page normally loads much faster than that, you could change the timeout to be 5 seconds and then the loop will end faster once you're at the last page. If your page load times are sometimes slow then you should make sure the timeout won't accidentally cut off too early; if the load times are consistently fast then you might be able to get away with an even shorter timeout interval.
Alternatively, you could look through the specific target webpage more carefully to find some element that a) is always present and b) can be used to determine whether we're on the final page or not. Then you could read the value of that element and decide whether to break the loop before trying to find the "Next page" button. This could save a couple of seconds of waiting on the final loop iteration (avoid waiting for timeout) but may not be worth the trouble.
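A minimal sketch of the first option, assuming the scraping part of the loop stays as in the question: keep the generous wait for the page content, but use a second, much shorter wait just for the Next page button.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

wait = WebDriverWait(driver, 30)       # generous wait for the page content
next_wait = WebDriverWait(driver, 5)   # short wait just for the Next page button
while True:
    # ... scrape the current page as before ...
    try:
        next_wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@aria-label="Next page"]'))).click()
    except TimeoutException:
        break   # no clickable Next page button within 5 s, assume we are on the last page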
Change the condition below
try:
    wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@aria-label="Next page"]'))).click()
except:
    break
as shown below; the @disabled check is the difference that makes sure the while loop exits once the button is disabled.
if len(driver.find_elements_by_xpath('//button[@aria-label="Next page"][@disabled]')) > 0:
    break
else:
    driver.find_element_by_xpath('//button[@aria-label="Next page"]').click()

Python loop through pages of website using Selenium

I've spent quite a bit of time on this and am hoping to get some help... I'm new to Python and web scraping.
I'm accessing a website using credentials so I won't be able to share the link, but it's fairly straightforward and I have most of the code. Using Selenium, I'm able to access the website, input my credentials, access a table, pull in data I want, create a data frame, and go to the next page. But, I would like to automatically loop through all pages (with some pauses and being kind to the site) and append each page to a master. This is what I have so far:
driver = webdriver.Chrome()
driver.get('website')
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("username")
password.send_keys("password" + "\n")
driver.implicitly_wait(20)
table = driver.find_element_by_id('preblockBody')
information = []
job_elems = table.find_elements_by_xpath("//*[contains(@class,'pbListingTable')]")
for value in job_elems:
    #print(value.text)
    information.append(value.text)
nxt = driver.find_element_by_xpath("//a[contains(@href, 'gotoNextPage(2)')]")
driver.execute_script("arguments[0].click();", nxt)
I think the best route is finding all the 'gotoNextPage' references and creating a loop, but I'm unsure how to do so. Any help is appreciated very much.
UPDATE 1:
I've found something helpful where I use 'Next' instead of clicking the specific 'gotoNextPage' element. Here is my new code; however, it only keeps the last page of info rather than appending as it goes through the pages. This is very close!
driver = webdriver.Chrome()
driver.get('website')
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("user name")
password.send_keys("password" + "\n")
while True:
    driver.implicitly_wait(30)
    table = driver.find_element_by_id('preblockBody')
    information = []
    job_elems = table.find_elements_by_xpath("//*[contains(@class,'pbListingTable')]")
    for value in job_elems:
        #print(value.text)
        information.append(value.text)
    try:
        driver.find_element_by_partial_link_text('Next').click()
    except:
        break
driver.quit()
print(information)
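A likely reason the update only keeps the last page is that information = [] is re-created on every pass through the while loop, so each page overwrites the previous one. A sketch with the list created once before the loop, everything else assumed unchanged from the code above:
information = []          # create the master list once, before the loop
while True:
    driver.implicitly_wait(30)
    table = driver.find_element_by_id('preblockBody')
    for value in table.find_elements_by_xpath("//*[contains(@class,'pbListingTable')]"):
        information.append(value.text)   # append every page's rows to the same list
    try:
        driver.find_element_by_partial_link_text('Next').click()
    except:
        break
driver.quit()
print(information)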

Selenium web scraping with "next" button clicking results in duplicate values

I'm using selenium and BeautifulSoup to scrape data from a website (http://www.grownjkids.gov/ParentsFamilies/ProviderSearch) with a next button, which I'm clicking in a loop. I was struggling with StaleElementReferenceException previously but overcame this by looping to refind the element on the page. However, I ran into a new problem - it's able to click all the way to the end now. But when I check the csv file it's written to, even though the majority of the data looks good, there's often duplicate rows in batches of 5 (which is the number of results that each page shows).
Pictorial example of what I mean: https://www.dropbox.com/s/ecsew52a25ihym7/Screen%20Shot%202019-02-13%20at%2011.06.41%20AM.png?dl=0
I have a hunch this is due to my program re-extracting the current data on the page every time I attempt to find the next button. I was confused why this would happen, since from my understanding, the actual scraping part happens only after you break out of the inner while loop which attempts to find the next button and into the larger one. (Let me know if I'm not understanding this correctly as I'm comparatively new to this stuff.)
Additionally, the data I output after every run of my program is different (which makes sense considering the error, since in the past the StaleElementReferenceExceptions were occurring at sporadic locations; if it duplicates results every time this exception occurs, it would make sense for duplications to occur sporadically as well). Even worse, a different batch of results ends up being skipped each time I run the program - I cross-compared the results from 2 different outputs and there were some results present in one and not the other.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from bs4 import BeautifulSoup
import csv

chrome_options = Options()
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--headless")

url = "http://www.grownjkids.gov/ParentsFamilies/ProviderSearch"
driver = webdriver.Chrome('###location###')
driver.implicitly_wait(10)
driver.get(url)

#clears text box
driver.find_element_by_class_name("form-control").clear()
#clicks on search button without putting in any parameters, getting all the results
search_button = driver.find_element_by_id("searchButton")
search_button.click()

df_list = []
headers = ["Rating", "Distance", "Program Type", "County", "License", "Program Name", "Address", "Phone", "Latitude", "Longitude"]

while True:
    #keeps on clicking next button to fetch each group of 5 results
    try:
        nextButton = driver.find_element_by_class_name("next")
        nextButton.send_keys('\n')
    except NoSuchElementException:
        break
    except StaleElementReferenceException:
        attempts = 0
        while attempts < 100:
            try:
                nextButton = driver.find_element_by_class_name("next")
                if nextButton:
                    nextButton.send_keys('\n')
                    break
            except NoSuchElementException:
                break
            except StaleElementReferenceException:
                attempts += 1

    #finds table of center data on the page
    table = driver.find_element_by_id("results")
    html_source = table.get_attribute('innerHTML')
    soup = BeautifulSoup(html_source, "lxml")

    #iterates through centers, extracting the data
    for center in soup.find_all("div", {"class": "col-sm-7 fields"}):
        mini_list = []
        #all fields except latlong
        for row in center.find_all("div", {"class": "field"}):
            material = row.find("div", {"class": "value"})
            if material is not None:
                mini_list.append(material.getText().encode("utf8").strip())
        #parses latlong from link
        for link in center.find_all('a', href=True):
            content = link['href']
            latlong = content[34:-1].split(',')
            mini_list.append(latlong[0])
            mini_list.append(latlong[1])
        df_list.append(mini_list)

#writes content into csv
with open('output_file.csv', "wb") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in df_list if row)
Anything would help! If there's other recommendations you have about the way I've used selenium/BeautifulSoup/python in order to improve my programming for the future, I would appreciate it.
Thanks so much!
I would use selenium to grab the results count, then do an API call to get the actual results. If the result count is greater than the limit of the pageSize argument in the API's query string, you can loop in batches and increment the currentPage argument until you have reached the total count; or, as I do below, simply request all results in one go. Then extract what you want from the JSON.
import requests
import json
from bs4 import BeautifulSoup as bs
from selenium import webdriver
initUrl = 'http://www.grownjkids.gov/ParentsFamilies/ProviderSearch'
driver = webdriver.Chrome()
driver.get(initUrl)
numResults = driver.find_element_by_css_selector('#totalCount').text
driver.quit()
newURL = 'http://www.grownjkids.gov/Services/GetProviders?latitude=40.2171&longitude=-74.7429&distance=10&county=&toddlers=false&preschool=false&infants=false&rating=&programTypes=&pageSize=' + numResults + '&currentPage=0'
data = requests.get(newURL).json()
You have a collection of dictionaries to iterate over in the response:
An example of writing out some values:
if len(data) > 0:
    for item in data:
        print(item['Name'], '\n', item['Address'])
If you are worried about the lat and long values, you can grab them from one of the script tags when using selenium.
You can find the alternate URL I use for the XHR jQuery GET by opening dev tools (F12) on the page, refreshing with F5, and inspecting the jQuery requests made in the network tab.
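If the result count ever exceeds what the pageSize parameter accepts, the batching option mentioned above could look roughly like this; the per-request limit of 100 and the assumption that an empty page marks the end are guesses, only pageSize and currentPage come from the URL above:
import requests

base = ('http://www.grownjkids.gov/Services/GetProviders?latitude=40.2171&longitude=-74.7429'
        '&distance=10&county=&toddlers=false&preschool=false&infants=false&rating=&programTypes=')
page_size = 100          # assumed per-request limit
all_results = []
current_page = 0
while True:
    batch = requests.get(base + '&pageSize={}&currentPage={}'.format(page_size, current_page)).json()
    if not batch:
        break            # an empty page is assumed to mean there is nothing left
    all_results.extend(batch)
    current_page += 1
print(len(all_results))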
You should read the HTML contents inside every iteration of the while loop, for example:
while counter < page_number_limit:
    counter = counter + 1
    new_data = driver.page_source
    page_contents = BeautifulSoup(new_data, 'lxml')
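Tying that back to the duplicate rows: scrape the page you are currently on before clicking Next, and make sure the next read of page_source really is the new page. A sketch, assuming the extraction code from the question is wrapped in a hypothetical parse_page helper and that the results container is replaced when the page changes:
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

while True:
    old_results = driver.find_element_by_id("results")
    soup = BeautifulSoup(driver.page_source, "lxml")   # fresh HTML for the page we are on
    df_list.extend(parse_page(soup))                   # hypothetical helper wrapping the extraction above
    try:
        driver.find_element_by_class_name("next").send_keys('\n')
    except NoSuchElementException:
        break
    # wait for the old results block to go stale so the next page_source read is the new page
    WebDriverWait(driver, 10).until(EC.staleness_of(old_results))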

Improve Web Scraping for Elements in a Container Using Selenium

I am using Firefox, and my code is working just fine, except that it's very slow. I prevent images from loading, just to speed things up a little bit:
firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference('permissions.default.image', 2)
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
driver = webdriver.Firefox(firefox_profile=firefox_profile)
but the performance is still slow. I have tried going headless, but unfortunately it did not work, as I receive NoSuchElement errors. So is there any way to speed up Selenium web scraping? I can't use Scrapy, because this is a dynamic scrape: I need to click through the next button several times until no clickable buttons exist, and I need to click pop-up buttons as well.
Here is a snippet of the code:
a = []
b = []
c = []
d = []
e = []
f = []
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        time.sleep(2)
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        time.sleep(2)
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for j in B:
            b.append(j.text)
        time.sleep(3)
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for k in C:
            c.append(k.text)
        time.sleep(3)
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for l in D:
            d.append(l.text)
        time.sleep(3)
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for m in E:
            e.append(m.text)
    try:
        time.sleep(2)
        next = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next.click()
        time.sleep(2)
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException) as e:
        break
Here is an edited version, but speed does not improve.
========================================================================
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"ui_bubble_rating bubble_")]')))
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"recommend-titleInline noRatings")]')))
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for i in B:
            b.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"noQuotes")]')))
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for i in C:
            c.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"ratingDate")]')))
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for i in D:
            d.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"partial_entry")]')))
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for i in E:
            e.append(i.text)
    try:
        #time.sleep(2)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"nav next taLnk ui_button primary")]')))
        next = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next.click()
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, './/*[contains(@class,"taLnk ulBlueLinks")]')))
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException) as e:
        break
For dynamic webpages (pages rendered or augmented using JavaScript), I would suggest you use scrapy-splash.
Not that you can't use Selenium, but for scraping purposes scrapy-splash is a better fit.
Also, if you have to use Selenium for scraping, a good idea would be to use the headless option. You can also use Chrome; I ran some benchmarks a while back where headless Chrome was faster than headless Firefox.
Also, rather than sleep, it would be better to use WebDriverWait with an expected condition, as it waits only as long as necessary, whereas a thread sleep makes you wait for the full specified time.
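A minimal sketch of those two suggestions combined (headless Chrome plus an explicit wait in place of time.sleep), reusing the review-container locator from the question; the 30-second timeout and the url variable are assumptions:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")          # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get(url)                             # url is assumed to be defined elsewhere

# wait only as long as needed for the review containers, instead of a fixed sleep
wait = WebDriverWait(driver, 30)
containers = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, './/*[contains(@class,"review-container")]')))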
Edit: Adding this as an edit while trying to answer @QHarr, as the answer is pretty long.
It is a suggestion to evaluate scrapy-splash.
I gravitate towards Scrapy because of the whole ecosystem around scraping: middleware, proxies, deployment, scheduling, scaling. So basically, if you are looking to do some serious scraping, Scrapy might be the better starting position. That suggestion comes with that caveat.
When it comes to speed, I can't give an objective answer, as I have never contrasted and benchmarked Scrapy against Selenium from a time perspective on any project of size.
But I would assume you will get more or less comparable times on a serial run if you are doing the same things, since in most cases the time is spent waiting for responses.
If you are scraping any considerable number of items, the speed-up you get generally comes from parallelising the requests, and, where it isn't strictly necessary to render the page in a user agent, from falling back to basic HTTP requests and responses.
Also, anecdotally, some in-page actions can be performed using the underlying HTTP request/response. So if time is a priority, you should be looking to get as many things as possible done with plain HTTP requests and responses.

Selenium in Python - open every link within a drop down menu

I'm new to Python, but I've been searching for the past hour for how to do this, and this code almost works. I need to open up every category in a collapsing (dropdown) menu, and then Ctrl+T every link within the now-.active class. The browser opens and all the categories open as well, but none of the .active links are being opened in new tabs. I would appreciate any help.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("pioneerdoctor.com/productpage.cfm")
cat = driver.find_elements_by_css_selector("a[href*='Product_home']")
for i in cat:
    i.click()
    child = driver.find_elements_by_css_selector("li.active > a[href*='ProductPage']")
    for m in child:
        m.send_keys(Keys.CONTROL + 't')
EDIT:
Here's the current workaround I got going by writing to a text file and using webbrowser. The only issue I'm seeing is that it's writing duplicates of the results multiple times. I'll be looking through the comments later to see if I can get it working a better way (which I'm sure there is).
from selenium import webdriver
import webbrowser

print("Opening Google Chrome..")
driver = webdriver.Chrome()
driver.get("http://pioneerdoctor.com/productpage.cfm")
driver.implicitly_wait(.5)
driver.maximize_window()
cat = driver.find_elements_by_css_selector("a[href*='Product_home']")

print("Writing URLS to file..")
for i in cat:
    i.click()
    child = driver.find_elements_by_css_selector("a[href*='ProductPage']")
    for i in child:
        child = i.get_attribute("href")
        file = open("Output.txt", "a")
        file.write(str(child) + '\n')
        file.close()
driver.quit()

file = open("Output.txt", "r")
Loop = input("Loop Number, Enter 0 to quit: ")
Loop = int(Loop)
x = 0
if Loop == 0:
    print("Quitting..")
else:
    for z in file:
        if x == Loop:
            print("Done.\n")
            break
        else:
            webbrowser.open_new_tab(z)
            x += 1
None of the links in those categories are found because the CSS selector for the links is incorrect. Remove the > in li.active > a[href*='ProductPage']. Why? p > q gives you the immediate children of p, while "p q" (with a space) gives you all the q anywhere inside p. The links you are interested in are NOT the immediate children of the li; they are inside a UL which is inside the li.
The other problem is the way you open links in new tabs. In Java I would send a key chord:
combo = Keys.chord(Keys.CONTROL, Keys.RETURN)
m.sendKeys(combo)
The Python bindings don't have Keys.chord, but you can concatenate the keys instead (see the sketch below). If I were you, though, I would open the links in another browser instance; I have seen that switching between tabs and windows is not supported well by Selenium itself, and bad things can happen.
Before you try any tabbing, make a simple example that opens a new tab and switches back to the previous tab. Do that back and forth 3-4 times. Does it work smoothly? Good. Then do it with 3-5 tabs. Tell me how it goes.
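A sketch of both fixes together (descendant selector plus the Python key combination); this assumes the Firefox driver setup from the question, and whether the site actually opens these links in background tabs on Ctrl+Enter is untested:
from selenium.webdriver.common.keys import Keys

for category in driver.find_elements_by_css_selector("a[href*='Product_home']"):
    category.click()
    # descendant selector (space) instead of child selector (>), so links nested in the UL are matched
    for link in driver.find_elements_by_css_selector("li.active a[href*='ProductPage']"):
        link.send_keys(Keys.CONTROL + Keys.RETURN)   # Ctrl+Enter: open link in a new background tab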
