find certain text in page source selenium python

I am using selenium to build a texting program for a website. At the moment I'm trying to find certain text in a page. This is what I have tried so far.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.blackhempfamily.com/what-are-the-effects-of-cbd")
a = ActionChains(driver)
driver.maximize_window()
time.sleep(5)
if "What is Hemp & CBD?" in driver.page_source:
    result = 1
else:
    result = 0
print(result)
Every time I run it, it gives me 0 instead of 1, even though the text is clearly on the site in big bold letters.

You should try changing the if statement to
if "What is Hemp &amp; CBD?" in driver.page_source:
    result = 1
else:
    result = 0
because driver.page_source returns the raw HTML, where the & symbol is written as the entity &amp; rather than as a bare &.

Try this:
matched = driver.execute_script('''
    return !!document.body.innerText.match('What is Hemp & CBD?')
''')
Note that if you change that to innerHTML.match it will fail, because in the HTML the & is written as &amp;.
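Both answers come down to the same point: the serialized HTML contains the entity, while the visible text contains the plain character. As a minimal sketch of a third option, the page source can be decoded with the standard library's html.unescape before searching. The helper name text_in_page_source and the sample fragment are made up for illustration; in a real script the string would come from driver.page_source.

```python
import html


def text_in_page_source(page_source: str, needle: str) -> bool:
    """Return True if `needle` appears in the HTML once entities are decoded."""
    # html.unescape turns entities such as &amp; back into plain characters.
    return needle in html.unescape(page_source)


# Hypothetical fragment standing in for driver.page_source
sample_source = "<h1>What is Hemp &amp; CBD?</h1>"

# Raw substring test against the HTML: the entity gets in the way
print("What is Hemp & CBD?" in sample_source)

# Entity-aware test
print(text_in_page_source(sample_source, "What is Hemp & CBD?"))
```

This only helps when the whole phrase sits in one text node. If the heading is split across several tags, matching against the rendered text (innerText or an element's .text) is likely the safer route.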

Related

Selenium Web Scraping: Find element by text not working in script

I am working on a script to gather information from Newegg to look at changes in graphics card prices over time. Currently, my script opens a Newegg search for RTX 3080s through Chromedriver and then clicks the link for Desktop Graphics Cards to narrow down the search. The part I am struggling with is writing a for item in range(...) loop that lets me iterate through all 8 search result pages. I know I could do this by simply changing the page number in the URL, but since this is an exercise to learn relative XPath better, I want to do it using the pagination buttons at the bottom of the page. I know each button should contain the inner text "1", "2", "3", "4", etc., but whenever I use text() = {item} in my for loop, it doesn't click the button. The script runs and doesn't raise any exceptions, but it doesn't do what I want it to. Below I have attached the HTML for the page as well as my current script. Any suggestions or hints are appreciated.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
import time
options = Options()
PATH = 'C://Program Files (x86)//chromedriver.exe'
driver = webdriver.Chrome(PATH)
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
driver.maximize_window()
driver.get(url)
card_path = '/html/body/div[8]/div[3]/section/div/div/div[1]/div/dl[1]/dd/ul[2]/li/a'
desktop_graphics_cards = driver.find_element(By.XPATH, card_path)
desktop_graphics_cards.click()
time.sleep(5)
graphics_card = []
shipping_cost = []
price = []
total_cost = []
for item in range(9):
    try:
        #next_page_click = driver.find_element(By.XPATH("//button[text() = '{item + 1}']"))
        print(next_page_click)
        next_page_click.click()
    except:
        pass

The pagination buttons are outside the initially visible area.
To click these elements you have to scroll the page until they come into view.
Also, you need to click the next-page buttons from 2 up to 9 inclusive, whereas you are trying to do this with the numbers 1 through 9.
I think this should work better:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import ActionChains
import pandas as pd
import time
options = Options()
PATH = 'C://Program Files (x86)//chromedriver.exe'
driver = webdriver.Chrome(PATH)
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
actions = ActionChains(driver)
driver.maximize_window()
driver.get(url)
card_path = '/html/body/div[8]/div[3]/section/div/div/div[1]/div/dl[1]/dd/ul[2]/li/a'
desktop_graphics_cards = driver.find_element(By.XPATH, card_path)
desktop_graphics_cards.click()
time.sleep(5)
graphics_card = []
shipping_cost = []
price = []
total_cost = []
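for item in range(2, 10):
    try:
        # By.XPATH is a constant passed as the first argument, not a function
        next_page_click = driver.find_element(By.XPATH, f"//button[text() = '{item}']")
        # scroll the button into view before clicking it
        actions.move_to_element(next_page_click).perform()
        time.sleep(2)
        # print(next_page_click) - printing a web element itself will not give you usable information
        next_page_click.click()
        # let the next page load, it takes some time
        time.sleep(5)
    except Exception:
        # skip pages whose button cannot be found or clicked
        pass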
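The two string-level details in this answer, the XPath built from the page number and the fact that clicking starts at page 2, can be checked without opening a browser. The following is only a sketch; pagination_xpath and pages_to_click are hypothetical helper names, not part of Selenium.

```python
def pagination_xpath(page_number: int) -> str:
    """Build the XPath for the pagination button whose text is `page_number`."""
    return f"//button[text() = '{page_number}']"


def pages_to_click(last_page: int) -> range:
    """Page 1 is already open after the search, so clicking starts at page 2."""
    return range(2, last_page + 1)


# Show the locators that would be passed to driver.find_element(By.XPATH, ...)
for page in pages_to_click(4):
    print(pagination_xpath(page))
```

Printing the locators first makes it easy to paste one into the browser's developer tools and confirm that it matches a button before running the full loop.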

Web scraping text returns 0

Whenever I try to scrape a number from a website and print it, it always returns 0, even if I add a delay to let the window load first.
Here's my code:
from selenium import webdriver
import time
url = 'https://hytrack.me/'
browser = webdriver.Chrome(r'C:\Users\kinet\OneDrive\Documents\webscraper\chromedriver.exe')
browser.get(url)
text = browser.find_element_by_xpath('//*[@id="stat_totalPlayers"]').text
time.sleep(10)
print(text)
All I need it to do is print some text that it takes from a website.
Have I done something wrong or am I just completely missing something?

You should put the delay before getting the element!
from selenium import webdriver
import time
url = 'https://hytrack.me/'
browser = webdriver.Chrome(r'C:\Users\kinet\OneDrive\Documents\webscraper\chromedriver.exe')
browser.get(url)
time.sleep(10)
text = browser.find_element_by_xpath('//*[@id="stat_totalPlayers"]').text
print(text)
It is better, though, to use an explicit wait, like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
url = 'https://hytrack.me/'
browser = webdriver.Chrome(r'C:\Users\kinet\OneDrive\Documents\webscraper\chromedriver.exe')
wait = WebDriverWait(browser, 20)
browser.get(url)
text = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="stat_totalPlayers"]'))).text
print(text)
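The difference between a fixed sleep and an explicit wait is that the latter polls a condition and returns as soon as it holds. The sketch below illustrates that polling idea in plain Python. It is a simplified stand-in, not Selenium's actual WebDriverWait implementation, and wait_until and read_counter are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for an element whose text is filled in by JavaScript
# shortly after the page loads.
ready_at = time.monotonic() + 0.3


def read_counter():
    return "1234" if time.monotonic() >= ready_at else ""


# Returns as soon as the value appears instead of always sleeping the full timeout
print(wait_until(read_counter, timeout=2.0, poll_interval=0.05))
```

With a fixed time.sleep(10) the script always pays the full ten seconds; with polling it pays only as long as the page actually needs, and it fails loudly if the value never shows up.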

Loop is not working properly for Selenium Python

I'm trying to run the following piece of code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome('C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe')
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')
years_urls = list()
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
years_elements = driver.find_element_by_id('ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years').find_elements_by_tag_name('a')
for i in range(len(years_elements)):
    years_urls.append(years_elements[i].get_attribute('href'))
newslinks = list()
for k in range(len(years_urls)):
    url = years_urls[k]
    driver.get(url)
    #link-detailpage --> id for the newslinks in each year
    news = driver.find_elements_by_class_name('link-detailpage')
    for j in range(len(news)):
        newslinks.append(news[j].find_element_by_tag_name('a').get_attribute('href'))
When I run this code, the newslinks list is empty at the end of execution. But if I run it line by line, assigning the values of 'k' one by one myself, it runs successfully.
Where am I going wrong in the logic? Please help.

It seems there is too much redundant code. I would suggest using either a single XPath or a CSS selector to identify the elements.
On some of the pages the news links do not appear, so you need to handle that case with try..except.
Since you need to navigate to each URL, I would suggest using an explicit wait with WebDriverWait().
Code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver=webdriver.Chrome("C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe")
driver.get("https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx")
allyears=WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,"div#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years a")))
yearsurl=[url.get_attribute("href") for url in allyears]
newslinks = list()
for yr in yearsurl:
    driver.get(yr)
    try:
        for element in WebDriverWait(driver,5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"div.link-detailpage >a"))):
            newslinks.append(element.get_attribute("href"))
    except:
        continue
print(newslinks)
Output:
['https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/a-problem-solved-at-a-rate-of-knots-the-latest-Top-Story-available-on-CNHIndustrial-com.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-acquires-a-minority-stake-in-Augmenta.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-presents-YOUNIVERSE.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/Calling-of-the-Annual-General-Meeting.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-completes-minority-investment-in-Monarch-Tractor.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/CNH-Industrial-N-V--announces-the-extension-by-one-additional-year-to-March-2026-of-its-syndicated-credit-facility.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/Working-for-a-safer-future-with-World-Class-Manufacturing.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/Behind-the-Wheel-CNH-Industrial-supports-the-growing-hemp-industry-in-North-America.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/CNH-Industrial-employees-in-Italy-to-receive-contractual-bonus-for-2020-results.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/2020-Fourth-Quarter-and-Full-Year-Results.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/The-Iveco-Defence-Vehicles-plant-in-Sete-Lagoas,-Brazil-and-the-New-Holland-Agriculture-facility-in-Croix,-France.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-to-announce-2020-Fourth-Quarter-and-Full-Year-financial-results-on-February-3-2021.aspx', 
'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-publishes-its-2021-Corporate-Calendar.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/Iveco-Defence-Vehicles-supplies-third-generation-protected-military-GTF8x8-(ZLK-15t)-trucks-to-the-German-Army.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/STEYR-New-Holland-Agriculture-CASE-Construction-Equipment-and-FPT-Industrial-win-prestigious-2020-Good-Design%C2%AE-Awards.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-completes-the-acquisition-of-four-divisions-of-CEG-in-South-Africa.aspx',so on...]
Update:
If you don't want to use WebDriverWait (which is best practice), use time.sleep() instead, since the page needs some time to load and the elements should be visible before you interact with them.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome("C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe")
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')
years_urls = list()
time.sleep(5)
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
years_elements = driver.find_elements_by_xpath('//div[@id="ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years"]//a')
for i in range(len(years_elements)):
    years_urls.append(years_elements[i].get_attribute('href'))
print(years_urls)
newslinks = list()
for k in range(len(years_urls)):
    url = years_urls[k]
    driver.get(url)
    time.sleep(3)
    news = driver.find_elements_by_xpath('//div[@class="link-detailpage"]/a')
    for j in range(len(news)):
        newslinks.append(news[j].get_attribute('href'))
print(newslinks)

There is a popup asking you to accept cookies that you need to click beforehand.
Add this to your script:
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyButtonAccept")))
driver.find_element_by_id("CybotCookiebotDialogBodyButtonAccept").click()
So the final result will be:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe')
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')
# this part is added, together with the necessary imports
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyButtonAccept")))
driver.find_element_by_id("CybotCookiebotDialogBodyButtonAccept").click()
years_urls = list()
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
# years_elements = driver.find_element_by_css_selector("#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years")
years_elements = driver.find_element_by_id('ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years').find_elements_by_tag_name('a')
for i in range(len(years_elements)):
    years_urls.append(years_elements[i].get_attribute('href'))
newslinks = list()
for k in range(len(years_urls)):
    url = years_urls[k]
    driver.get(url)
    #link-detailpage --> id for the newslinks in each year
    news = driver.find_elements_by_class_name('link-detailpage')
    for j in range(len(news)):
        newslinks.append(news[j].find_element_by_tag_name('a').get_attribute('href'))

Getting an empty list when scraping with Selenium

I am trying to create a python function that can scrape the article titles of a search result on Popular Science's website.
I have written this code, which has worked for a similar science-related website but when I run it specifically for Popular Science, it returns an empty list.
Code:
from selenium import webdriver
import pandas as pd
def scraper(text):
    driver = webdriver.Chrome(executable_path='chromedriver.exe')
    wired_dict = []
    driver.get("https://www.popsci.com/search-results/" + text + "/")
    search = driver.find_elements_by_class_name("siq-partner-result")
    for words in search:
        wired_dict.append(words.text)
    return wired_dict
print(scraper("science"))

You can use driver.implicitly_wait(10) to wait while the page loads.
from selenium import webdriver
def scrapper(text):
    driver = webdriver.Chrome('./chromedriver')
    driver.get(f"https://www.popsci.com/search-results/{text}/")
    driver.implicitly_wait(10)
    search = driver.find_elements_by_class_name("siq-partner-result")
    wired_dict = [word.text for word in search]
    print(wired_dict)
scrapper('sample')

This page takes a while to load. You are using driver.find_elements_by_class_name before the page has finished loading, so it's not finding those elements.
You can test this theory by adding import time and calling time.sleep(5) just before the search code.
The best solution is to use WebDriverWait(), which keeps checking until the elements have loaded.
from selenium import webdriver
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
def scraper(text):
    driver = webdriver.Chrome(executable_path='chromedriver.exe')
    wired_dict = []
    driver.get("https://www.popsci.com/search-results/" + text + "/")
    delay = 3
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'siq-partner-result')))
    search = driver.find_elements_by_class_name("siq-partner-result")
    for words in search:
        wired_dict.append(words.text)
    return wired_dict

You can use WebDriverWait to wait for the desired element to become visible and then find the elements.
Using XPath:
WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, "//*[@class='siq-partner-result']")))
search = driver.find_elements_by_class_name("siq-partner-result")
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
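One detail worth keeping in mind when mixing the two locators above: an XPath test such as @class='siq-partner-result' compares the whole attribute string, while a class-name lookup matches any one of the space-separated class tokens. The sketch below shows the difference on a made-up fragment using only the standard library. The markup and the extra "featured" class are assumptions for illustration, not taken from the Popular Science page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the second result carries an additional class
snippet = """
<div>
  <div class="siq-partner-result">First title</div>
  <div class="siq-partner-result featured">Second title</div>
</div>
"""
root = ET.fromstring(snippet)

# Exact attribute comparison, as in //*[@class='siq-partner-result']
exact = root.findall(".//*[@class='siq-partner-result']")

# Token-based comparison, closer to how a class-name lookup behaves
by_token = [
    el for el in root.iter()
    if "siq-partner-result" in el.get("class", "").split()
]

print([el.text for el in exact])
print([el.text for el in by_token])
```

If the wait uses the exact-match XPath but the elements carry extra classes, the wait can time out even though the class-name lookup would have found them, so using the same locator strategy for both steps is probably the simpler choice.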

Scraping a specific table in selenium

I am trying to scrape a table found inside a div on a page.
Basically here's my attempt so far:
# NOTE: Download the chromedriver driver
# Then move exe file on C:\Python27\Scripts
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import sys
driver = webdriver.Chrome()
driver.implicitly_wait(10)
URL_start = "http://www.google.us/trends/explore?"
date = '&date=today%203-m' # Last 90 days
location = "&geo=US"
symbol = sys.argv[1]
query = 'q='+symbol
URL = URL_start+query+date+location
driver.get(URL)
table = driver.find_element_by_xpath('//div[@class="line-chart"]/table/tbody')
print table.text
If I run the script, with an argument like "stackoverflow" I should be able to scrape this site: https://www.google.us/trends/explore?date=today%203-m&geo=US&q=stackoverflow
Apparently the xpath I have there is not working, the program is not printing anything, it's just plain blank.
I basically need the values of the chart that appears on that website. Those values (and dates) are inside a table; here is a screenshot:
Could you help me locate the correct xpath of the table to retrieve those values using selenium on python?
Thanks in advance!

You can use the following XPath:
//div[@class="line-chart"]/div/div[1]/div/div/table/tbody/tr
I have refined my answer and made some changes to your code; now it works.
# NOTE: Download the chromedriver driver
# Then move exe file on C:\Python27\Scripts
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import sys
from lxml.html import fromstring,tostring
driver = webdriver.Chrome()
driver.implicitly_wait(20)
'''
URL_start = "http://www.google.us/trends/explore?"
date = '&date=today%203-m' # Last 90 days
location = "&geo=US"
symbol = sys.argv[1]
query = 'q='+symbol
URL = URL_start+query+date+location
'''
driver.get("https://www.google.us/trends/explore?date=today%203-m&geo=US&q=stackoverflow")
table_trs = driver.find_elements_by_xpath('//div[@class="line-chart"]/div/div[1]/div/div/table/tbody/tr')
for tr in table_trs:
    #print tr.get_attribute("innerHTML").encode("UTF-8")
    td = tr.find_elements_by_xpath(".//td")
    if len(td)==2:
        print td[0].get_attribute("innerHTML").encode("UTF-8") +"\t"+td[1].get_attribute("innerHTML").encode("UTF-8")
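The row-filtering logic in the loop above (keep only rows with exactly two cells and print them tab-separated) can be tried out on a static fragment without a browser. This is a rough sketch written for Python 3 with the standard library; the markup is invented and much simpler than the real Trends page, and two_cell_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real page is far larger and rendered by JavaScript
snippet = """
<div class="line-chart">
  <div><table><tbody>
    <tr><td>Jan 1</td><td>75</td></tr>
    <tr><td>Jan 2</td><td>80</td></tr>
    <tr><td>incomplete row</td></tr>
  </tbody></table></div>
</div>
"""


def two_cell_rows(markup):
    """Return (first cell, second cell) pairs for rows that have exactly two <td> cells."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.iter("tr"):
        cells = [td.text for td in tr.findall("td")]
        if len(cells) == 2:
            rows.append((cells[0], cells[1]))
    return rows


for date, value in two_cell_rows(snippet):
    print(f"{date}\t{value}")
```

Saving driver.page_source to a file and testing the extraction offline in this way can make it easier to tell a wrong locator apart from a page that simply has not finished rendering.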

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.
