How to get a new page's source code in Selenium with Python

Let's say there is a page and there is a "next page" button in it.
The browser is initialized using webdriver.Chrome(chrome_driver_path).
I use the page_source attribute to get the original page's source code and locate the button for the next page. Then I click it, and the browser moves to the next page in the same tab in Chrome.
browser.get(origin_page_url)
browser.find_element_by_xpath(btn_xpath).click()
time.sleep(5)
After the code above runs, I call page_source again, but what I get is the same as the original page's source code.
How can I get the newly opened page's source code?

You need to explicitly wait for the new page to load before getting the page source. How to do so usually depends on the webpage you are working with. For instance, you may wait until a particular element becomes clickable:
element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "myXpath")))
element.click()
After that, call page_source.
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
I have not tested this, but it should work for you.
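For completeness, here is a minimal end-to-end sketch (untested; origin_page_url and btn_xpath are the placeholders from the question). The trick is to wait for the old button reference to go stale, which signals that the browser has left the original page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get(origin_page_url)  # placeholder URL from the question

# Click the "next page" button, then wait until the old reference goes
# stale -- a sign that the original page has been replaced.
button = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, btn_xpath)))
button.click()
WebDriverWait(driver, 20).until(EC.staleness_of(button))

html = driver.page_source  # now reflects the newly loaded page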

Related

Having trouble referencing to a certain element on page with Selenium

I am having a terribly hard time referencing a certain "next page" button on a website that I am trying to scrape links from [https://www.sreality.cz/adresar?strana=2]. If you scroll down you can see a red right-arrow button that you can click to go to the next page, so that the website loads new dynamic content. Every approach seems to report the same exact error, and I don't know how I am supposed to point to the element without running into it.
This is the code that I currently have :
from selenium import webdriver
chromedriver_path = "/home/user/Dokumenty/iCloud/RealityScraper/chromedriver"
driver = webdriver.Chrome(chromedriver_path)
print("WebDriver Successfully Initialized")
driver.get("https://www.sreality.cz/adresar?strana=2")
links = driver.find_elements_by_css_selector("h2.title a")
nextPage = driver.find_element_by_css_selector("li.paging-item a.btn-paging-pn.icof.icon-arr-right.paging-next")
for link in links:
    print(link.get_attribute("href"))
nextPage.click()
The "nextPage" variable is holding a supposed value to be clicked on once the "links" variable search finishes scraping all the links from the company titles. However when I run this code I get an error :
selenium.common.exceptions.StaleElementReferenceException: Message:
stale element reference: element is not attached to the page document
I have been searching for various fixes online, but none of them seemed to resolve the issue. I think the issue at this point is not caused by the element not loading quickly enough, but rather by Selenium having trouble finding the element because of a wrong reference.
Because of this I have tried using XPath to accurately point to the actual element, so I changed the "nextPage" variable to:
nextPage = driver.find_element_by_xpath("""/html/body/div[2]/div[1]/div[2]/div[2]/div[4]/div/div/div/div[2]/div/div[2]/ul[1]/li[12]/a""")
This returns exactly the same error as stated above. I have been trying to find a solution for hours now and I can't understand where the issue lies. I would be grateful if anyone could explain what I am doing wrong. Thanks to anyone.
This is how you can get all the ng-href attributes from every page. Alternatively, you could look into their API.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
driver.get("https://www.sreality.cz/adresar?strana=2")
wait = WebDriverWait(driver, 10)
while True:
    try:
        links = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h2.title > a")))
        #print(len(links))
        for link in links:
            print(link.get_attribute("ng-href"))
        nextPage = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.btn-paging-pn.icof.icon-arr-right.paging-next")))
        nextPage.click()
        sleep(10)
    except Exception as e:
        print(e)
        break
First of all, never use an absolute XPath; it breaks easily. Use a relative XPath instead.
Secondly, I think the error you are getting is because clicking the "Next" button for the first time loads a new page with a different DOM structure, which is why you can no longer find that element.
You can try searching for the element again after every new page load (after clicking the "Next" button each time), as in the loop sketch after the code below.
# imports
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
# initialize
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 20)
action = ActionChains(driver)
# Try the code below and see if it works.
Next_btn = wait.until(EC.presence_of_element_located((By.XPATH, '(//li[@class="paging-item"])[2]')))
action.move_to_element(Next_btn).click().perform()
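And here is the loop sketch mentioned above (untested; the selectors are the ones from the question and from this answer). Re-locating the "Next" button on every iteration is what avoids the StaleElementReferenceException:
while True:
    links = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h2.title a")))
    for link in links:
        print(link.get_attribute("href"))
    try:
        # Re-find the button on the freshly loaded page instead of
        # reusing a reference captured before the click.
        next_btn = wait.until(EC.element_to_be_clickable((By.XPATH, '(//li[@class="paging-item"])[2]')))
        action.move_to_element(next_btn).click().perform()
    except Exception:
        break  # no "Next" button left; assume it's the last page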

HTML page seems to add an iframe element when opening Chrome with Selenium in Python

I am new to Selenium and web automation tasks, and I am trying to code an application for automating paper searches on PubMed, using chromedriver.
My aim is to click the top right "Sign in" button in the PubMed main page, https://www.ncbi.nlm.nih.gov/pubmed. So the problem is:
when I open the PubMed main page manually, there are no iframe tags in the HTML source, and therefore the "Sign in" element should be simply accessible by its XPath "//*[@id="sign_in"]".
when the same page is opened by Selenium instead, I cannot find that "Sign in" element by its XPath, and if I try to inspect it, the HTML seems to have embedded it in an <iframe> tag, so that it cannot be found anymore unless a driver.switch_to.frame call is executed. But if I open the HTML source code with Ctrl+U and search for an <iframe> element, there is still none of them.
["Sign in" inspection][1]
I already got round this problem by:
bot = PubMedBot()
bot.driver.get('https://www.ncbi.nlm.nih.gov/pubmed')
sleep(2)
frames = bot.driver.find_elements_by_tag_name('iframe')
bot.driver.switch_to.frame(frames[0])
btn = bot.driver.find_element_by_xpath('/html/body/a')
btn.click()
But what I would like to understand is why the inspected code is different from the HTML source code, whether this <iframe> element is really appearing from nowhere, and if so, why.
Thank you in advance.
You are facing this issue due to synchronization. Kindly find the solution below:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r"path of chromedriver.exe")
driver.maximize_window()
wait = WebDriverWait(driver, 10)
driver.get("https://www.ncbi.nlm.nih.gov/pubmed")
iframe = wait.until(EC.presence_of_element_located((By.TAG_NAME, "iframe")))
driver.switch_to.frame(iframe)
wait.until(EC.element_to_be_clickable((By.XPATH, "//a[contains(text(), 'Sign in to NCBI')]"))).click()

Get page source after each button click in a loop with get_elements_by_xpath using Selenium

I would like to automatically download text files from the ATS Blocks Download section on the FINRA website. The problem is that while I am able to click on the icon and open the file in the browser, I cannot get the page source after the click. driver.page_source returns the page source for the ATS Blocks Download section page (the one before the click).
Here is a piece of code I was trying out:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
driver = webdriver.Chrome(ChromeDriverManager().install())
URL = 'https://otctransparency.finra.org/otctransparency/'
driver.get(URL)
# Agree to the general terms
driver.find_element_by_xpath('//*[@class="btn btn-warning"]').click()
#go to ATS Blocks Download section
driver.find_element_by_xpath('//*[@href="/otctransparency/AtsBlocksDownload"]').click()
#wait for the page to fully load
time.sleep(5)
#click on each download icon
for element in driver.find_elements_by_xpath('//*[@src="./assets/icon_download.png"]'):
    element.click()
    print(driver.page_source)
How to get the page source after every element.click()?
To get the page_source of all the pages, you need to:
Induce WebDriverWait and element_to_be_clickable()
Induce WebDriverWait and visibility_of_all_elements_located()
Induce WebDriverWait and number_of_windows_to_be()
Code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
URL = 'https://otctransparency.finra.org/otctransparency/'
driver.get(URL)
driver.maximize_window()
# Agree to the general terms
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@class="btn btn-warning"]'))).click()
#go to ATS Blocks Download section
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//a[@href="/otctransparency/AtsBlocksDownload"]'))).click()
#click on each download icon
elements = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, '//img[@src="./assets/icon_download.png"]')))
for link in range(len(elements)):
    elements = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, '//img[@src="./assets/icon_download.png"]')))
    elements[link].click()
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
    windowhandles = driver.window_handles
    driver.switch_to.window(windowhandles[-1])
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, "pre")))
    print(driver.page_source)
    driver.close()
    driver.switch_to.window(windowhandles[0])
Every time you click a download element, another browser tab is opened. To get the page source from the other tab, use:
for element in driver.find_elements_by_xpath('//*[@src="./assets/icon_download.png"]'):
    element.click()
    driver.switch_to.window(driver.window_handles[1])
    driver.set_page_load_timeout(120)
    print(driver.page_source)
    driver.switch_to.window(driver.window_handles[0])
    driver.set_page_load_timeout(120)
PS. Instead of doing:
time.sleep(5)
You can do:
driver.set_page_load_timeout(120)
Be sure not to mix various "waiting" mechanisms, as it can result in unexpected behavior (See this StackOverflow post for "why").
Be careful when setting an implicit wait time, since once it is set, it is set for the lifetime of the driver instance (source, although it has been said in various places across the web).
If you intend to have your driver wait on multiple pages, you should use WebDriverWait. As shown in other replies, WebDriverWait(driver, timeout) accepts a WebDriver instance as well as an integer which represents the amount of time to wait before throwing a TimeoutException; in other words, it accepts a timeout.
You can create a new WebDriverWait instance every time you're trying to find an element, without having to create a new WebDriver instance with a new implicit wait time. Since each element may need to be waited on for a differing duration, this is ideal. You could go as far as to create a wrapper function to encapsulate the use of WebDriverWait:
def PatientlyClick(by, path, driver, timeout):
    WebDriverWait(driver, timeout).until(EC.element_to_be_clickable((by, path))).click()
The above snippet of code could be made prettier if you designed a class which encapsulated your WebDriver instance, but that might be unnecessary for your purposes (see Page Object Model Design Pattern).
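For instance, a hypothetical call reusing the "Agree" button XPath from the question:
# Hypothetical usage of the wrapper defined above.
PatientlyClick(By.XPATH, '//*[@class="btn btn-warning"]', driver, 10)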

How can I "click" download button on selenium in python

I want to download user data from Google Analytics by using a crawler, so I wrote some code using Selenium. However, I cannot click the "Export" button; it always shows the error "no such element". I tried find_element_by_xpath, by_name, and by_id.
I TRIED:
driver.find_element_by_xpath("//*[#class='download-link']").click()
driver.find_element_by_xpath('//*[#id="ID-activity-userActivityTable"]/div/div[2]/span[6]/button')
driver.find_element_by_xpath('//*[#class='_GAD.W_DECORATE_ELEMENT.C_USER_ACTIVITY_TABLE_CONTROL_ITEM_DOWNLOAD']')
Python Code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('/Users/parkjunhong/Downloads/chromedriver')
driver.implicitly_wait(3)
usrid = '1021'
url = 'https://analytics.google.com/analytics/web/#/report/app-visitors-user-activity/a113876882w169675624p197020837/_u.date00=20190703&_u.date01=20190906&_r.userId='+usrid+'&_r.userListReportStates=%3F_u.date00=20190703%2526_u.date01=20190906%2526explorer-table.plotKeys=%5B%5D%2526explorer-table.rowStart=0%2526explorer-table.rowCount=1000&_r.userListReportId=app-visitors-user-id'
driver.get(url)
driver.find_element_by_name('identifier').send_keys('ID')
idlogin = driver.find_element_by_xpath('//*[#id="identifierNext"]/span/span')
idlogin.click()
driver.find_element_by_name('password').send_keys('PASSWD')
element = driver.find_element_by_id('passwordNext')
driver.execute_script("arguments[0].click();", element)
#login
driver.find_element_by_xpath("//*[#class='download-link']").click()
#click the download button
ERROR:
Message: no such element: Unable to locate element
Your click element is inside an iframe (iframe id="galaxyIframe" ...). Therefore, you need to tell the driver to switch from the "main" page to said iframe. If you add this line of code after your #login comment, it should work:
driver.switch_to.frame("galaxyIframe")
(If the frame did not have a name, you would use iframe = driver.find_element_by_xpath("xpath-to-frame") and then driver.switch_to.frame(iframe).)
To get back to your default frame, use:
driver.switch_to.default_content()
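Putting the pieces together, a minimal sketch (untested; it assumes the frame id "galaxyIframe" and the download-link class mentioned in the question):
wait = WebDriverWait(driver, 10)
# Wait for the iframe, switch into it, click, then switch back out.
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "galaxyIframe")))
wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@class='download-link']"))).click()
driver.switch_to.default_content()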
Crawling GA is generally a pain. Not just because you have these iFrames everywhere.
Apart from that, I would recommend looking into puppeteer, the new kid on the crawler block. Even though the prospect of switching to javascript from python may be daunting, it is worth it! Once you get into it, selenium will have felt super clunky.
You can also try locating the button by its text. If you want to click 'Export':
//button[contains(text(),'Export')]
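For example, combined with an explicit wait (a sketch; if the button sits inside the iframe mentioned above, switch to that frame first):
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[contains(text(),'Export')]"))).click()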

Scraper unable to get names from next pages

I've written a script in Python, in combination with Selenium, to parse names from a webpage. The data on that site is not JavaScript-rendered; however, the next-page links are within JavaScript. Since the next-page links are of no use with the requests library, I have used Selenium to traverse the 25 pages. The only problem I'm facing is that although my scraper is able to reach the last page by clicking through all 25 pages, it only fetches the data from the first page. Moreover, the scraper keeps running even though it has finished clicking the last page. The next-page links look exactly like javascript:nextPage();. Btw, the URL of that site never changes even if I click on the next page button. How can I get all the names from the 25 pages? The CSS selector I've used in my scraper is flawless. Thanks in advance.
Here is what I've written:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
while True:
    for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
        print(name.text)
    try:
        n_link = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='nextPage']")))
        driver.execute_script(n_link.get_attribute("href"))
    except:
        break
driver.quit()
You don't have to handle the "Next" button or somehow change the page number: all entries are already in the page source. Note that rows on the non-visible pages are hidden, so .text returns an empty string for them, which is why the code below reads the textContent attribute instead. Try this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
    print(name.get_attribute('textContent'))
driver.quit()
You can also try this solution if it's not mandatory for you to use Selenium:
import requests
from lxml import html
r = requests.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
source = html.fromstring(r.content)
for name in source.xpath("//table[@class='greygeneraltxt']//td[text() and position()>1]"):
    print(name.text)
It appears this can actually be done more simply than with the current approach. After the driver.get call, you can simply use the page_source property to get the HTML behind it. From there you can get the data from all 25 pages at once. To see how it's structured, just right-click and "view source" in Chrome.
html_string = driver.page_source
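From there, a short sketch (borrowing the lxml parsing from the answer above) that pulls the names out of that single snapshot:
from lxml import html

# Parse the snapshot taken after driver.get and extract the name cells.
source = html.fromstring(html_string)
for name in source.xpath("//table[@class='greygeneraltxt']//td[text() and position()>1]"):
    print(name.text)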
