Scraper doesn't stop clicking on the next page button

Scraper doesn't stop clicking on the next page button - python

I've written a script in python in combination with selenium to get some names and corresponding addresses displayed upon a search and the search keyword is "Saskatoon". However, the data, in this case, traverse multiple pages. My script almost does everything except for one thing.
It still runs even though there are no more pages to traverse. The last page also holds ">" sign for next page option and is not grayed out.
Here is the link: Page_link
Search_keyword: Saskatoon (in the city/town field).
Here is what I've written:
from selenium import webdriver; import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("above_link")
time.sleep(3)
search_input = driver.find_element_by_id("cityField")
search_input.clear()
search_input.send_keys("Saskatoon")
search_input.send_keys(Keys.ENTER)
while True:
try:
wait.until(EC.visibility_of_element_located((By.LINK_TEXT, "›"))).click()
time.sleep(2)
except:
break
driver.quit()
BTW, I've just taken out the name and address part form this script which I suppose is not relevant here. Thanks.

You can use class attribute of > button as on last page it is "ng-scope disabled" while on rest pages - "ng-scope":
wait.until(EC.visibility_of_element_located((By.XPATH, "//li[#class='ng-scope']/a[.='›']"))).click()

Related

Can't select element from page [Selenium[

I have tried to scrap info from that site - specifically, from a table. Every time I occur, info that elements doesn't exist.
https://polygonscan.com/token/0x64a795562b02830ea4e43992e761c96d208fc58d
I try to add time.slep(5) to my code or scrolling down function to load all element - ineffective.
Do you have any advice for me?
EDIT
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
# Options
chrome_options = Options()
chrome_options.add_argument("--headless")
# Set drive
chrome_driver_path = r"C:\Users\kacpe\OneDrive\Pulpit\Python\Projekty\chromedriver.exe"
driver = webdriver.Chrome(chrome_driver_path, options=chrome_options)
driver.get("https://polygonscan.com/token/0x64a795562b02830ea4e43992e761c96d208fc58d")
try:
element = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.XPATH, "//table/tbody/tr[0]")))
print(element)
except TimeoutException as e:
print(e)
I added code in regard to your request. So my main goal is to scrap content from the table at this site. I add Explicit Waits to my code and still I can't select anything from that table - it's looking like the script doesn't see anything from that area.

One way to try solve it, its using the Xpath of the element or the relative position of the same, to make Selenium, get allways the same "line" of position to return the "value" of the information that you are searching.
Ex1:find_element(:xpath,"//*[#id="wmd-input"]")#in that case it's the input of this check box.
If it doesnt work, try this one.
Ex2: browser.implicitly_wait(30) #makes a timer to load all the informations from the web to your machine.

Wait for class to load value after clicking button selenium python

After the website is loaded I click a button successfully which will then generate some numbers in this class
<div class="styles__Value-sc-1bfbyy7-2 eVmhyz"></div>
but not instantly, it will put them in one by one. Selenium will instantly grab the first value that gets put into the class but doesn't wait for the other values to get added. Any way to wait for it to load all the values in there before grabbing it.
Here is the python code I use for grabbing the value:
total = driver.find_element_by_xpath("//div[#class='styles__Value-sc-1bfbyy7-2 eVmhyz']").text

Selenium has a WebDriverWait method:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
browser = webdriver.Chrome()
delay = 5
total = WebDriverWait(browser, delay).until(expected_conditions.presence_of_element_located(<locator>)
I haven't tested it locally but it may work. There is also presence_of_all_elements_located method, you can find the details on this page.
Hope this helps!

Unable to use "explicit wait" in the right way

I've written a script using python with selenium to click on some links listed in the sidebar of google maps. When any of the items get clicked, the related information attached to each lead shows up in the right sided area. The script is doing fine. However, I've used hardcoded delay to do the job. How can I get rid of hardcoded delay by achieving the same with explicit wait. Thanks in advance.
Link to the site: website
The script I'm trying with:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "replace_with_above_link"
driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)
for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "[id^='rlimg0_']"))):
item.location
time.sleep(3) #wish to try with explicit wait but can't find any idea
item.click()
driver.quit()
I tried with wait.until(EC.staleness_of(item)) instead of hardcoded delay but no luck.

If you want to wait until new data displayed after each clcik you may try below:
for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "[id^='rlimg0_']"))):
div = driver.find_element_by_xpath("//div[#class='xpdopen']")
item.location
item.click()
wait.until(EC.staleness_of(div))

Can't get rid of hardcoded delay even when Explicit Wait is already there

I've written some code in python in combination with selenium to parse the different questions from quora.com. My scraper is doing it's job at this moment. The thing is I've used here hardcoded delay for the scraper to work, even when Explicit Wait has already been defined. As the page is an infinite scrolling one, i tried to make the scrolling process to a limited number. Now, I have got two questions:
Why wait.until(EC.staleness_of(page)) is not working within my scraper. It is commented out now.
If i use something else instead of page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link"))) the scraper throws an error: can't focus element.
Btw, I do not wish to go for page = driver.find_element_by_tag_name('body') this option.
Here is what I've written so far:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.quora.com/topic/C-programming-language")
wait = WebDriverWait(driver, 10)
page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link")))
for scroll in range(10):
page.send_keys(Keys.PAGE_DOWN)
time.sleep(2)
# wait.until(EC.staleness_of(page))
for item in wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "rendered_qtext"))):
print(item.text)
driver.quit()

You can try below code to get as much XHR as possible and then parse the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
driver = webdriver.Chrome()
driver.get("https://www.quora.com/topic/C-programming-language")
wait = WebDriverWait(driver, 10)
page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link")))
links_counter = len(wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "question_link"))))
while True:
page.send_keys(Keys.END)
try:
wait.until(lambda driver: len(driver.find_elements_by_class_name("question_link")) > links_counter)
links_counter = len(driver.find_elements_by_class_name("question_link"))
except TimeoutException:
break
for item in wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "rendered_qtext"))):
print(item.text)
driver.quit()
Here we scroll page down and wait up to 10 seconds for more links to be loaded or break the while loop if the number of links remains the same
As for your questions:
wait.until(EC.staleness_of(page)) is not working because when you scroll page down you don't get the new DOM - you just make XHR which adds more links into existed DOM, so the first link (page) will not be stale in this case
(I'm not quite confident about this, but...) I guess you can send keys only to nodes that can be focused (user can set focus manually), e.g. links, input fields, textareas, buttons..., but not content division (div), paragraphs (p), etc

Scraper unable to get names from next pages

I've written a script in python in combination with selenium to parse names from a webpage. The data from that site is not javascript enabled. However, the next page links are within javascript. As the next page links of that webpage are of no use if I go for requests library, I have used selenium to parse the data from that site traversing 25 pages. The only problem I'm facing here is that although my scraper is able to reach the last page clicking through 25 pages, it only fetches the data from the first page only. Moreover, the scraper keeps running even though it has done clicking the last page. The next page links look exactly like javascript:nextPage();. Btw, the url of that site never changes even if I click on the next page button. How can i get all the names from 25 pages? The css selector I've used in my scraper is flawless. Thanks in advance.
Here is what I've written:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
while True:
for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
print(name.text)
try:
n_link = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='nextPage']")))
driver.execute_script(n_link.get_attribute("href"))
except: break
driver.quit()

You don't have to handle "Next" button or somehow change page number - all entries are already in page source. Try below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
print(name.get_attribute('textContent'))
driver.quit()
You can also try this solution if it's not mandatory for you to use Selenium:
import requests
from lxml import html
r = requests.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
source = html.fromstring(r.content)
for name in source.xpath("//table[#class='greygeneraltxt']//td[text() and position()>1]"):
print(name.text)

It appears this can actually be done more simply than the current approach. After the driver.get method, you can simply use the page_source property to get the html behind it. From there you can get out data from all 25 pages at once. To see how it's structured, just right click and "view source" in chrome.
html_string=driver.page_source

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scraper doesn't stop clicking on the next page button - python

You can use class attribute of > button as on last page it is "ng-scope disabled" while on rest pages - "ng-scope": wait.until(EC.visibility_of_element_located((By.XPATH, "//li[#class='ng-scope']/a[.='›']"))).click()

Related

Can't select element from page [Selenium[

Wait for class to load value after clicking button selenium python

Unable to use "explicit wait" in the right way

Can't get rid of hardcoded delay even when Explicit Wait is already there

Scraper unable to get names from next pages

Categories

Resources