Scraper unable to get names from next pages - python

I've written a script in Python with Selenium to parse names from a webpage. The data on that site is not generated by JavaScript, but the next-page links are, so they're of no use with the requests library; that's why I used Selenium to traverse the 25 pages. The only problem I'm facing is that although my scraper can reach the last page by clicking through all 25 pages, it only fetches the data from the first page. Moreover, the scraper keeps running even after it has finished clicking the last page. The next-page links look exactly like javascript:nextPage();. Btw, the URL of that site never changes even when I click the next-page button. How can I get all the names from the 25 pages? The CSS selector I've used in my scraper is fine. Thanks in advance.
Here is what I've written:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
while True:
    for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
        print(name.text)
    try:
        n_link = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='nextPage']")))
        driver.execute_script(n_link.get_attribute("href"))
    except:
        break
driver.quit()

You don't have to handle the "Next" button or somehow change the page number - all the entries are already in the page source. Try the code below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
    print(name.get_attribute('textContent'))
driver.quit()
You can also try this solution if it's not mandatory for you to use Selenium:
import requests
from lxml import html
r = requests.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
source = html.fromstring(r.content)
for name in source.xpath("//table[@class='greygeneraltxt']//td[text() and position()>1]"):
    print(name.text)

It appears this can actually be done more simply than the current approach. After the driver.get call, you can simply use the page_source property to get the HTML behind it. From there you can extract the data from all 25 pages at once. To see how it's structured, just right-click and choose "view source" in Chrome.
html_string=driver.page_source
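A minimal sketch of that idea, assuming BeautifulSoup is available and that the cell classes from the question also apply to the raw source (neither is stated in the original answer):
from bs4 import BeautifulSoup

# Parse the HTML returned by driver.page_source; the answer suggests the
# rows for all 25 pages are already present in it.
soup = BeautifulSoup(html_string, "html.parser")
# The selector mirrors the one used in the question.
for cell in soup.select("table.greygeneraltxt td.greygeneraltxt, td.lightbluebg"):
    print(cell.get_text(strip=True))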

Webdriver not returning some data

I am trying to get some information from a website. The Web Inspector shows the HTML source with what JavaScript has rendered into it, so I wanted to use chromedriver to render the page and extract certain information that cannot be accessed by simply requesting the website.
What confuses me is that even the driver is not returning anything.
My code looks like this:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome('path/Chromedriver')
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all("tr", class_="odd")
And the website is:
https://www.amundietf.co.uk/professional/product/view/LU1681038243
Is there anything else that gets rendered into the html, when the Web Inspector is opened, which Chromedriver is not able to handle?
Thanks for your answers in advance!
First you need to accept the privacy settings, then click validateDisclaimer to enter the site:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
url = "https://www.amundietf.co.uk/professional/product/view/LU1681038243"
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.implicitly_wait(10)
driver.get(url)
driver.find_element_by_id("footer_tc_privacy_button_3").click()
driver.find_element_by_id("validateDisclaimer").click()
WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fpFrame.fpBannerMore #blockleft>#part_principale_1")))
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all("tr", class_="odd")
print(results)
After that you need to wait for the page to load and to define the elements you are looking for correctly.
Your question really contains several questions that should be solved one by one.
I've just pointed out the first of the problems.
Update
I solved the issue.
You will need to parse the result yourself.
So, the problems you had were:
You did not click the two buttons.
You did not wait for the table you need to load.
You did not use any waits at all. In Selenium you must use them.
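As a hedged illustration of the "parse the result yourself" step, here is a minimal sketch assuming each matched row simply contains td cells (the exact cell structure is an assumption, not something the answer confirms):
# Sketch: turn each matched <tr class="odd"> row into a list of cell texts.
# Assumes plain <td> cells; adjust if the real markup differs.
for row in results:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    print(cells)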

Python, Selenium. Google Chrome. Web Scraping. How to navigate between 'tabs' in website

I'm quite a noob in Python and I'm currently building a web scraper in Selenium that should collect all the product URLs from the clicked 'tab' on the web page. But my code takes the URLs from the first 'tab' instead. Code below. Thank you guys, I'm starting to get kind of frustrated lol.
Screenshot
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from lxml import html
PATH = r'C:\Program Files (x86)\chromedriver.exe'
driver = webdriver.Chrome(PATH)
url = 'https://www.alza.sk/vypredaj-akcia-zlava/e0.htm'
driver.get(url)
driver.find_element_by_xpath('//*[@id="tabs"]/ul/li[2]').click()
links = []
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'blockFilter')))
    link = driver.find_elements_by_xpath("//a[@class='name browsinglink impression-binded']")
    for i in link:
        links.append(i.get_attribute('href'))
finally:
    driver.quit()
print(links)
To select current tab:
current_tab = driver.current_window_handle
To switch between tabs (switch_to_window is deprecated in current Selenium; switch_to.window is the supported form):
driver.switch_to.window(driver.window_handles[1])
driver.switch_to.window(driver.window_handles[-1])
Assuming you have located the link that opens the new tab as tab_link (ActionChains.click needs a WebElement, not a URL string), you should try:
from selenium.webdriver.common.action_chains import ActionChains
action = ActionChains(driver)
action.key_down(Keys.CONTROL).click(tab_link).key_up(Keys.CONTROL).perform()
Also, the li apparently doesn't have a click event. Are you sure the element you are getting with '//*[@id="tabs"]/ul/li[2]' has the aria-selected property set to true, or any of these classes: ui-tabs-active ui-state-active?
If not, you should call click on the a tag inside this li.
Then you should increase the timeout parameter of your WebDriverWait to guarantee that the div is loaded.
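A minimal sketch of those two suggestions, reusing the XPath and class names from the question; the trailing /a and the longer timeout are the only additions, and they are assumptions:
# Click the <a> inside the second tab's <li> instead of the <li> itself.
driver.find_element_by_xpath('//*[@id="tabs"]/ul/li[2]/a').click()
# Give the tab's content more time to load before collecting the links.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'blockFilter')))
links = [a.get_attribute('href') for a in
         driver.find_elements_by_xpath("//a[@class='name browsinglink impression-binded']")]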

Python Selenium Webdriver doesn't refresh html after changing dropdown value in AJAX pages

I'm trying to scrape an AJAX webpage using Python and Selenium. The problem is that when I change the dropdown value, the page content changes according to my selection, but Selenium returns the same old HTML code for the page. I'd appreciate it if anyone can help. Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
url = "https://myurl.com/PATH"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)
# change the dropdown value
sprintSelect = Select(driver.find_element_by_id("dropdown-select"))
sprintSelect.select_by_visible_text("DropDown_Value2")
html = driver.execute_script("return document.documentElement.outerHTML")
print(html)
You need to wait for the AJAX call to load the content after your selection.
Try putting an implicit or explicit wait after the selection.
driver.implicitly_wait(10) # 10 seconds
or, if you know the tag/id etc. of the web element you want, try an explicit wait:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "some_ID")))
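Putting the two pieces together, a possible sketch for the original snippet ("some_ID" remains a placeholder, and which element signals that the AJAX refresh has finished is an assumption):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

sprintSelect = Select(driver.find_element_by_id("dropdown-select"))
sprintSelect.select_by_visible_text("DropDown_Value2")
# Wait until an element that only appears after the refresh is present,
# then re-read the HTML so it reflects the new selection.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "some_ID")))
html = driver.execute_script("return document.documentElement.outerHTML")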

Can't grab all the pdf links within a table from a webpage

I've written a script in Python with Selenium to scrape the different PDF links that are revealed upon clicking the different numbers, such as 110015710, 110015670 etc., located within a table on a webpage.
Site link
My script can click on those links and reveal the PDF files, but it parses only 5 of them out of many.
How can I get them all?
I've tried so far:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "replace_with_above_link"
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(link)
[driver.execute_script("arguments[0].click();",item) for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"tr.Iec")))]
for elem in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".IecAttachments li a[href$='.pdf']"))):
    print(elem.get_attribute("href"))
driver.quit()
When you click the element it fires an XHR request for the PDF links, so add a delay after every click:
import time

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "tr.Iec"))):
    driver.execute_script("arguments[0].click();", item)
    time.sleep(1)
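If a fixed one-second sleep turns out to be fragile, an explicit wait on the number of revealed PDF links is a possible alternative; this is only a sketch, and the assumption that each click reveals at least one more link may not hold for every row:
pdf_css = ".IecAttachments li a[href$='.pdf']"
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "tr.Iec"))):
    seen = len(driver.find_elements(By.CSS_SELECTOR, pdf_css))
    driver.execute_script("arguments[0].click();", item)
    # Wait until more pdf links are present than before this click.
    wait.until(lambda d: len(d.find_elements(By.CSS_SELECTOR, pdf_css)) > seen)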

Scraper doesn't stop clicking on the next page button

I've written a script in Python with Selenium to get some names and corresponding addresses displayed upon a search; the search keyword is "Saskatoon". However, the data in this case spans multiple pages. My script does almost everything except for one thing.
It keeps running even when there are no more pages to traverse: the last page still shows the ">" sign for the next-page option and it is not grayed out.
Here is the link: Page_link
Search_keyword: Saskatoon (in the city/town field).
Here is what I've written:
from selenium import webdriver; import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("above_link")
time.sleep(3)
search_input = driver.find_element_by_id("cityField")
search_input.clear()
search_input.send_keys("Saskatoon")
search_input.send_keys(Keys.ENTER)
while True:
    try:
        wait.until(EC.visibility_of_element_located((By.LINK_TEXT, "›"))).click()
        time.sleep(2)
    except:
        break
driver.quit()
BTW, I've just taken the name-and-address part out of this script, as I suppose it is not relevant here. Thanks.
You can use the class attribute of the ">" button: on the last page it is "ng-scope disabled", while on the rest of the pages it is "ng-scope":
wait.until(EC.visibility_of_element_located((By.XPATH, "//li[#class='ng-scope']/a[.='›']"))).click()
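A sketch of how that locator could slot into the existing loop, using a TimeoutException instead of a bare except to detect the last page (that exception choice is an assumption, not part of the original answer):
from selenium.common.exceptions import TimeoutException

while True:
    # ... collect the names and addresses for the current page here ...
    try:
        # Matches the ">" link only while its parent li still has class "ng-scope";
        # on the last page the class becomes "ng-scope disabled" and the wait times out.
        wait.until(EC.visibility_of_element_located(
            (By.XPATH, "//li[@class='ng-scope']/a[.='›']"))).click()
        time.sleep(2)
    except TimeoutException:
        break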
