Selenium Web Scraping: Find element by text not working in script

Selenium Web Scraping: Find element by text not working in script - python

I am working on a script to gather information off Newegg to look at changes over time in graphics card prices. Currently, my script will open a Newegg search on RTX 3080's through Chromedriver and then click on the link for Desktop Graphics Cards to narrow down my search. The part that I am struggling with is developing a for item in range loop that will let me iterate through all 8 search result pages. I know that I could do this by simply changing the page number in the URL, but as this is an exercise that I'm trying to use to learn Relative Xpath better, I want to do it using the Pagination buttons at the bottom of the page. I know that each button should contain inner text of "1,2,3,4 etc." but whenever I use text() = {item} in my for loop, it doesn't click the button. The script runs and doesn't return any exceptions, but doesn't do what I want it too. Below I have attached the HTML for the page as well as my current script. Any suggestions or hints are appreciated.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
import time
options = Options()
PATH = 'C://Program Files (x86)//chromedriver.exe'
driver = webdriver.Chrome(PATH)
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
driver.maximize_window()
driver.get(url)
card_path = '/html/body/div[8]/div[3]/section/div/div/div[1]/div/dl[1]/dd/ul[2]/li/a'
desktop_graphics_cards = driver.find_element(By.XPATH, card_path)
desktop_graphics_cards.click()
time.sleep(5)
graphics_card = []
shipping_cost = []
price = []
total_cost = []
for item in range(9):
try:
#next_page_click = driver.find_element(By.XPATH("//button[text() = '{item + 1}']"))
print(next_page_click)
next_page_click.click()
except:
pass

The pagination buttons are out of the initially visible area.
In order to click these elements you will have to scroll the page until the element appears.
Also, you will need to click next page buttons starting from 2 up to 9 (including) while you trying to do this with numbers from 1 up to 9.
I think this should work better:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
import time
options = Options()
PATH = 'C://Program Files (x86)//chromedriver.exe'
driver = webdriver.Chrome(PATH)
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
actions = ActionChains(driver)
driver.maximize_window()
driver.get(url)
card_path = '/html/body/div[8]/div[3]/section/div/div/div[1]/div/dl[1]/dd/ul[2]/li/a'
desktop_graphics_cards = driver.find_element(By.XPATH, card_path)
desktop_graphics_cards.click()
time.sleep(5)
graphics_card = []
shipping_cost = []
price = []
total_cost = []
for item in range(2,10):
try:
next_page_click = driver.find_element(By.XPATH(f"//button[text() = '{item}']"))
actions.move_to_element(next_page_click).perform()
time.sleep(2)
#print(next_page_click) - printing a web element itself will not give you usable information
next_page_click.click()
#let the next page loaded, it takes some time
time.sleep(5)
except:
pass

Related

How to extract the comments count correctly

I am trying to extract number of youtube comments and tried several methods.
My Code:
from selenium import webdriver
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
DRIVER_PATH = <your chromedriver path>
wd = webdriver.Chrome(executable_path=DRIVER_PATH)
url = 'https://www.youtube.com/watch?v=5qzKTbnhyhc'
wd.get(url)
wait = WebDriverWait(wd, 100)
time.sleep(40)
v_title = wd.find_element_by_xpath('//*[#id="container"]/h1/yt-formatted-string').text
print("title Is ")
print(v_title)
comments_xpath = '//h2[#id="count"]/yt-formatted-string/span[1]'
v_comm_cnt = wait.until(EC.visibility_of_element_located((By.XPATH, comments_xpath)))
#wd.find_element_by_xpath(comments_xpath)
print(len(v_comm_cnt))
I get the following error:
selenium.common.exceptions.TimeoutException: Message:
I get correct value for title but not for comment_cnt. Can any one please guide me what is wrong with my code?
Please note that comments count path - //h2[#id="count"]/yt-formatted-string/span[1] point to correct place if I search the value in inspect element.

Updated answer
Well, it was tricky!
There are several issues here:
This page has some bad java scripts on it making the Selenium webdriver driver.get() method to wait until the timeout for the page loaded while it looks like the page is loaded. To overcome that I used Eager page load strategy.
This page has several blocks of code for the same areas so as sometimes one of them is used (visible) and sometimes the second. This makes working with element locators difficultly. So, here I am waiting for visibility of title element from one of that blocks. In case it was visible - I'm extracting the text from there, otherwise I'm waiting for the visibility of the second element (it comes immediately) and extracting the text from there.
There are several ways to make page scrolling. Not all of them worked here. I found the one that is working and scrolling not too much.
The code below is 100% working, I run it several times.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument("--start-maximized")
caps = DesiredCapabilities().CHROME
caps["pageLoadStrategy"] = "eager"
s = Service('C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(options=options, desired_capabilities=caps, service=s)
url = 'https://www.youtube.com/watch?v=5qzKTbnhyhc'
driver.get(url)
driver.maximize_window()
wait = WebDriverWait(driver, 10)
title_xpath = "//div[#class='style-scope ytd-video-primary-info-renderer']/h1"
alternative_title = "//*[#id='title']/h1"
v_title = ""
try:
v_title = wait.until(EC.visibility_of_element_located((By.XPATH, title_xpath))).text
except:
v_title = wait.until(EC.visibility_of_element_located((By.XPATH, alternative_title))).text
print("Title is " + v_title)
comments_xpath = "//div[#id='title']//*[#id='count']//span[1]"
driver.execute_script("window.scrollBy(0, arguments[0]);", 600)
try:
v_comm_cnt = wait.until(EC.visibility_of_element_located((By.XPATH, comments_xpath)))
except:
pass
v_comm_cnt = driver.find_element(By.XPATH, comments_xpath).text
print("Video has " + v_comm_cnt + " comments")
The output is:
Title is Music for when you are stressed 🍀 Chil lofi | Music to Relax, Drive, Study, Chill
Video has 834 comments
Process finished with exit code 0

Getting Trouble in Dismissing Ads from Website using Selenium

I am trying to do web automation where I am using selenium library to moves towards one page for finding title of that page but when I am trying to click on find button suddenly ads pop up and it disturbs the flow and it will not allow the find button to click on it. Let me know that how can I close that ad so that I can move towards the next page and get the tile of that page.
Here is my code:
#Using Selenium to move towards the next pages by clicking on button
#Libs Included
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
#Path to Chrome Driver
path='chromedriver.exe'
driver=webdriver.Chrome(path)
#Main_Url Page
main_url='https://www.zameen.com/'
#Getting the MainPage
driver.get(main_url)
print(driver.title)
#Selecting the Drop Down Menu First
search=driver.find_element_by_class_name('eedc221b').click()
#How To Move to Specific Area using Finding Box To get All the List of Cities
list_of_cities=[]
Cities=driver.find_elements_by_class_name("d92d11c7")
#print(Cities)
for i in Cities:
city=i.text
list_of_cities.append(city)
#print("List of Cities are: \n",list_of_cities)
#Reach towards the first Location by sending the citname to the combobox and then hit enter
driver.find_element_by_css_selector("button[aria-label='"+Cities[0].text+"']").click()
time.sleep(3)
driver.find_element_by_css_selector("a[aria-label='Find button'][class='c3901770 _22dc5e0a']").click()
try:
WebDriverWait(driver,10).until(EC.presence_of_element_located((By.TAG_NAME,"html")))
print("Tilte of next Page is: {0}".format(driver.title))
time.sleep(5)
driver.quit()
finally:
driver.quit()

That add close button can be identified with the help of below css selector :
# Path to Chrome Driver
path = 'chromedriver.exe'
driver = webdriver.Chrome(path)
wait = WebDriverWait(driver, 10)
# Main_Url Page
main_url = 'https://www.zameen.com/'
driver.maximize_window()
# Getting the MainPage
driver.get(main_url)
try:
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "img.close_cross_big"))).click()
except:
print("could not click")
pass
Imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
and then you can continue with the rest of your code.

Loop is not working properly for Selenium Python

i'm trying to run the following piece of code :
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome('C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe')
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')
years_urls = list()
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
years_elements = driver.find_element_by_id('ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years').find_elements_by_tag_name('a')
for i in range(len(years_elements)):
years_urls.append(years_elements[i].get_attribute('href'))
newslinks = list()
for k in range(len(years_urls)):
url = years_urls[k]
driver.get(url)
#link-detailpage --> id for the newslinks in each year
news = driver.find_elements_by_class_name('link-detailpage')
for j in range(len(news)):
newslinks.append(news[j].find_element_by_tag_name('a').get_attribute('href'))
when I run this code, the newslinks list is empty at the end of execution. But if I run it line by line, by assigning the value of 'k' one by one, on my own, it runs successfully.
Where am I going wrong in the logic. Please help.

It seems there is too much redundant code. I would suggest use either linear xpath or css selector to identify the elements.
However some of the pages the new link not appeared you need to handle this using try..except.
Since you need to navigate each url I would suggest use explicit wait WebDriverWait()
Code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver=webdriver.Chrome("C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe")
driver.get("https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx")
allyears=WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,"div#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years a")))
yearsurl=[url.get_attribute("href") for url in allyears]
newslinks = list()
for yr in yearsurl:
driver.get(yr)
try:
for element in WebDriverWait(driver,5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"div.link-detailpage >a"))):
newslinks.append(element.get_attribute("href"))
except:
continue
print(newslinks)
OutPut:
['https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/a-problem-solved-at-a-rate-of-knots-the-latest-Top-Story-available-on-CNHIndustrial-com.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-acquires-a-minority-stake-in-Augmenta.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-presents-YOUNIVERSE.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/Calling-of-the-Annual-General-Meeting.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-completes-minority-investment-in-Monarch-Tractor.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/CNH-Industrial-N-V--announces-the-extension-by-one-additional-year-to-March-2026-of-its-syndicated-credit-facility.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/Working-for-a-safer-future-with-World-Class-Manufacturing.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/Behind-the-Wheel-CNH-Industrial-supports-the-growing-hemp-industry-in-North-America.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/CNH-Industrial-employees-in-Italy-to-receive-contractual-bonus-for-2020-results.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/2020-Fourth-Quarter-and-Full-Year-Results.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/The-Iveco-Defence-Vehicles-plant-in-Sete-Lagoas,-Brazil-and-the-New-Holland-Agriculture-facility-in-Croix,-France.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-to-announce-2020-Fourth-Quarter-and-Full-Year-financial-results-on-February-3-2021.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-publishes-its-2021-Corporate-Calendar.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/Iveco-Defence-Vehicles-supplies-third-generation-protected-military-GTF8x8-(ZLK-15t)-trucks-to-the-German-Army.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/STEYR-New-Holland-Agriculture-CASE-Construction-Equipment-and-FPT-Industrial-win-prestigious-2020-Good-Design%C2%AE-Awards.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-completes-the-acquisition-of-four-divisions-of-CEG-in-South-Africa.aspx',so on...]
Update:
If you don't want use webdriverwait which is best practice then use time.sleep() since page needs some time to load and element should be visible before interacting it.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome("C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe")
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')
years_urls = list()
time.sleep(5)
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
years_elements = driver.find_elements_by_xpath('//div[#id="ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years"]//a')
for i in range(len(years_elements)):
years_urls.append(years_elements[i].get_attribute('href'))
print(years_urls)
newslinks = list()
for k in range(len(years_urls)):
url = years_urls[k]
driver.get(url)
time.sleep(3)
news = driver.find_elements_by_xpath('//div[#class="link-detailpage"]/a')
for j in range(len(news)):
newslinks.append(news[j].get_attribute('href'))
print(newslinks)

There is a popup asking you to accept cookies that you need to click beforehand.
Add this to your script:
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyButtonAccept")))
driver.find_element_by_id("CybotCookiebotDialogBodyButtonAccept").click()
So the final result will be:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe')
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')
# this part is added, together with the necessary imports
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyButtonAccept")))
driver.find_element_by_id("CybotCookiebotDialogBodyButtonAccept").click()
years_urls = list()
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
# years_elements = driver.find_element_by_css_selector("#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years")
years_elements = driver.find_element_by_id('ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years').find_elements_by_tag_name('a')
for i in range(len(years_elements)):
years_urls.append(years_elements[i].get_attribute('href'))
newslinks = list()
for k in range(len(years_urls)):
url = years_urls[k]
driver.get(url)
#link-detailpage --> id for the newslinks in each year
news = driver.find_elements_by_class_name('link-detailpage')
for j in range(len(news)):
newslinks.append(news[j].find_element_by_tag_name('a').get_attribute('href'))

Selenium after clicking load more i cant get the newly loaded content

I am working on a project where am required to fetch data from a site using selenium.
The website has a load more clickable div.
i have managed to make selenium click the div and it works you can see it do the clicking when its running on none --headless mode
However when i try to get all the items i don't get the newly loaded
items after clicking.
Here is my code snippet
driver.get('https://jamboshop.com/search/tv')
i=1
maximum=4
while i<maximum:
try:
i += 1
el=driver.find_element_by_css_selector("div.showMoreLoaderPanel")
action=ActionChains(driver)
action.move_to_element(el).click().perform()
driver.implicitly_wait(3)
except:
break
products =driver.find_elements_by_css_selector("div.col-xs-6.col-sm-4.col-md-4.col-lg-3")
for product in products:
print({"item_name":product.find_element_by_css_selector("h6.prd-title").text})
This only prints the items that were present before the clicks...how do i get all the items in the page including ones loaded after clicking load more?
extra
# My imports and chrome settings
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--window-size=1420,1080')
#chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(chrome_options=chrome_options)

I think this is a lazy loading application.So when go bottom of the page it seems lost the previous elements it has capture and that is why you can see only current elements on the page available.
There is an alternative way to handle this by checking with a list and then capture those data while iterating the while loop.
Code:
import time
driver.get('https://jamboshop.com/search/tv')
i=1
maximum=4
itemlist=[]
while i<maximum:
try:
products = driver.find_elements_by_css_selector("div.col-xs-6.col-sm-4.col-md-4.col-lg-3")
for product in products:
if product.find_element_by_css_selector("h6.prd-title").text in itemlist:
continue
else:
itemlist.append(product.find_element_by_css_selector("h6.prd-title").text)
i += 1
el=driver.find_element_by_css_selector("div.showMoreLoaderPanel")
action=ActionChains(driver)
action.move_to_element(el).click().perform()
time.sleep(3)
except:
break
print(len(itemlist))
print(itemlist)
Let me know if this works for you.Website is not accessible at my end.

Web page is loaded in selenium and reaches to end but does not contain all the elements inside the div

This is the site.
https://www.talabat.com/uae/top-selling.
There are somewhat 100 products and only 30 gets loaded. I was trying to fetch all the links and page reaches to end but only display 30 products and when clicked somewhere in the webdriver then loads the rest of the products. How can I print the links of all the products?
Thanks in advance!!
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from bs4 import BeautifulSoup
HOME_PAGE_URL = "https://www.talabat.com/uae/top-selling"
PATIENCE_TIME = 60
LOAD_MORE_XPATH = '//*[#id="comment-ajx"]/div'
driver = webdriver.Chrome(executable_path='C:\\Users\\Mansi Dhingra\\Downloads\\chromedriver.exe')
driver.get(HOME_PAGE_URL)
soup=BeautifulSoup(driver.page_source)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# sleep for 30s
res=[]
results = driver.find_elements_by_xpath("/html/body/div[3]/ui-view/section/div/div[2]/div/div[2]/div/div[2]")
html_code = driver.find_element_by_tag_name("section").text
print(html_code)
for res in results:
link=res.find_elements_by_tag_name('a')
for x in link:
product_link = x.get_attribute("href")
print(product_link)
print(results)

The main point is that selenium reads the page before the page has loaded all the items, you need a wait.
Just read the docs:
https://selenium-python.readthedocs.io/waits.html
Choose the best condition for you case and go for it.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Selenium Web Scraping: Find element by text not working in script - python

Related

How to extract the comments count correctly

Getting Trouble in Dismissing Ads from Website using Selenium

Loop is not working properly for Selenium Python

Selenium after clicking load more i cant get the newly loaded content

Web page is loaded in selenium and reaches to end but does not contain all the elements inside the div

Categories

Resources