I am trying to scrape a table found inside a div on a page.
Basically here's my attempt so far:
# NOTE: Download the ChromeDriver executable
# and place it in C:\Python27\Scripts
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import sys
driver = webdriver.Chrome()
driver.implicitly_wait(10)
URL_start = "http://www.google.us/trends/explore?"
date = '&date=today%203-m' # Last 90 days
location = "&geo=US"
symbol = sys.argv[1]
query = 'q='+symbol
URL = URL_start+query+date+location
driver.get(URL)
table = driver.find_element_by_xpath('//div[@class="line-chart"]/table/tbody')
print table.text
If I run the script, with an argument like "stackoverflow" I should be able to scrape this site: https://www.google.us/trends/explore?date=today%203-m&geo=US&q=stackoverflow
Apparently the XPath I have there is not working; the program does not print anything, the output is just blank.
I basically need the values of the chart that appears on that website. Those values (and dates) are inside a table; here is a screenshot:
Could you help me locate the correct XPath for the table so I can retrieve those values using Selenium in Python?
Thanks in advance!
You can use an XPath as follows:
//div[@class="line-chart"]/div/div[1]/div/div/table/tbody/tr
Here I have refined my answer and made some changes to your code; now it works.
# NOTE: Download the ChromeDriver executable
# and place it in C:\Python27\Scripts
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import sys
from lxml.html import fromstring,tostring
driver = webdriver.Chrome()
driver.implicitly_wait(20)
'''
URL_start = "http://www.google.us/trends/explore?"
date = '&date=today%203-m' # Last 90 days
location = "&geo=US"
symbol = sys.argv[1]
query = 'q='+symbol
URL = URL_start+query+date+location
'''
driver.get("https://www.google.us/trends/explore?date=today%203-m&geo=US&q=stackoverflow")
table_trs = driver.find_elements_by_xpath('//div[@class="line-chart"]/div/div[1]/div/div/table/tbody/tr')
for tr in table_trs:
    #print tr.get_attribute("innerHTML").encode("UTF-8")
    td = tr.find_elements_by_xpath(".//td")
    if len(td)==2:
        print td[0].get_attribute("innerHTML").encode("UTF-8") +"\t"+td[1].get_attribute("innerHTML").encode("UTF-8")
I'm trying to run the following piece of code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome('C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe')
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')
years_urls = list()
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
years_elements = driver.find_element_by_id('ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years').find_elements_by_tag_name('a')
for i in range(len(years_elements)):
    years_urls.append(years_elements[i].get_attribute('href'))
newslinks = list()
for k in range(len(years_urls)):
    url = years_urls[k]
    driver.get(url)
    #link-detailpage --> id for the newslinks in each year
    news = driver.find_elements_by_class_name('link-detailpage')
    for j in range(len(news)):
        newslinks.append(news[j].find_element_by_tag_name('a').get_attribute('href'))
When I run this code, the newslinks list is empty at the end of execution. But if I run it line by line, assigning the value of 'k' one by one myself, it runs successfully.
Where am I going wrong in the logic? Please help.
There is quite a bit of redundant code. I would suggest using either a linear XPath or a CSS selector to identify the elements.
However, on some pages the news links do not appear, so you need to handle that with try..except.
Since you need to navigate to each URL, I would suggest using an explicit wait, WebDriverWait().
Code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver=webdriver.Chrome("C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe")
driver.get("https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx")
allyears=WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,"div#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years a")))
yearsurl=[url.get_attribute("href") for url in allyears]
newslinks = list()
for yr in yearsurl:
    driver.get(yr)
    try:
        for element in WebDriverWait(driver,5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"div.link-detailpage >a"))):
            newslinks.append(element.get_attribute("href"))
    except:
        continue
print(newslinks)
Output:
['https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/a-problem-solved-at-a-rate-of-knots-the-latest-Top-Story-available-on-CNHIndustrial-com.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-acquires-a-minority-stake-in-Augmenta.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-presents-YOUNIVERSE.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/Calling-of-the-Annual-General-Meeting.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-completes-minority-investment-in-Monarch-Tractor.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/CNH-Industrial-N-V--announces-the-extension-by-one-additional-year-to-March-2026-of-its-syndicated-credit-facility.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/Working-for-a-safer-future-with-World-Class-Manufacturing.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/Behind-the-Wheel-CNH-Industrial-supports-the-growing-hemp-industry-in-North-America.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/CNH-Industrial-employees-in-Italy-to-receive-contractual-bonus-for-2020-results.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/2020-Fourth-Quarter-and-Full-Year-Results.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/The-Iveco-Defence-Vehicles-plant-in-Sete-Lagoas,-Brazil-and-the-New-Holland-Agriculture-facility-in-Croix,-France.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-to-announce-2020-Fourth-Quarter-and-Full-Year-financial-results-on-February-3-2021.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-publishes-its-2021-Corporate-Calendar.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/Iveco-Defence-Vehicles-supplies-third-generation-protected-military-GTF8x8-(ZLK-15t)-trucks-to-the-German-Army.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/STEYR-New-Holland-Agriculture-CASE-Construction-Equipment-and-FPT-Industrial-win-prestigious-2020-Good-Design%C2%AE-Awards.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-completes-the-acquisition-of-four-divisions-of-CEG-in-South-Africa.aspx',so on...]
Update:
If you don't want to use WebDriverWait, which is the best practice, then use time.sleep(), since the page needs some time to load and the element should be visible before you interact with it.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome("C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe")
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')
years_urls = list()
time.sleep(5)
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
years_elements = driver.find_elements_by_xpath('//div[@id="ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years"]//a')
for i in range(len(years_elements)):
    years_urls.append(years_elements[i].get_attribute('href'))
print(years_urls)
newslinks = list()
for k in range(len(years_urls)):
    url = years_urls[k]
    driver.get(url)
    time.sleep(3)
    news = driver.find_elements_by_xpath('//div[@class="link-detailpage"]/a')
    for j in range(len(news)):
        newslinks.append(news[j].get_attribute('href'))
print(newslinks)
There is a popup asking you to accept cookies that you need to click beforehand.
Add this to your script:
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyButtonAccept")))
driver.find_element_by_id("CybotCookiebotDialogBodyButtonAccept").click()
So the final result will be:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe')
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')
# this part is added, together with the necessary imports
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyButtonAccept")))
driver.find_element_by_id("CybotCookiebotDialogBodyButtonAccept").click()
years_urls = list()
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
# years_elements = driver.find_element_by_css_selector("#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years")
years_elements = driver.find_element_by_id('ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years').find_elements_by_tag_name('a')
for i in range(len(years_elements)):
    years_urls.append(years_elements[i].get_attribute('href'))
newslinks = list()
for k in range(len(years_urls)):
    url = years_urls[k]
    driver.get(url)
    #link-detailpage --> id for the newslinks in each year
    news = driver.find_elements_by_class_name('link-detailpage')
    for j in range(len(news)):
        newslinks.append(news[j].find_element_by_tag_name('a').get_attribute('href'))
I am using Selenium to build a testing program for a website. At the moment I'm trying to find certain text on a page. This is what I have tried so far.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.blackhempfamily.com/what-are-the-effects-of-cbd")
a = ActionChains(driver)
driver.maximize_window()
time.sleep(5)
if "What is Hemp & CBD?" in driver.page_source:
result = 1
else:
result = 0
print(result)
Every time I run it, it gives me 0 instead of 1, even though the text is clearly on the site in big bold letters.
You should try changing the if statement to
if "What is Hemp &amp; CBD?" in driver.page_source:
    result = 1
else:
    result = 0
because driver.page_source returns the raw HTML, where the & character is encoded as the entity &amp;.
try this:
matched = driver.execute_script('''
return !!document.body.innerText.match('What is Hemp & CBD?')
''')
Note that if you change that to innerHTML.match it will fail. Why? Because the & in the HTML will be &amp;.
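To see the difference concretely, here is a small sketch of the distinction both answers rely on (assuming the page's markup encodes the ampersand as &amp;):
rendered = driver.execute_script("return document.body.innerText")  # decoded text: "&" stays "&"
source = driver.page_source                                         # raw HTML: "&" may appear as "&amp;"
print("What is Hemp & CBD?" in rendered)      # expected True, since innerText is entity-decoded
print("What is Hemp & CBD?" in source)        # expected False if the markup uses &amp;
print("What is Hemp &amp; CBD?" in source)    # expected True in that case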
I am currently scrolling through a TripAdvisor page to scrape images and need to scroll to the bottom of the page, but I keep getting the error:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document.
I am assuming it is because it is moving through the page very fast, but even when I increase the implicit wait time, it does not solve the issue. I also tried making sure the new location is visible first before parsing it to get the URL, but that did not do any good either. Any help would be greatly appreciated!
# import dependencies
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import re
import selenium
import io
import pandas as pd
import urllib.request
import urllib.parse
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
import time
from _datetime import datetime
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options.headless = False
prefs = {"profile.default_content_setting_values.notifications": 2}
options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome("/Users/rishi/Downloads/chromedriver 3", options=options)
driver.maximize_window()
#open up website
driver.get(
    "https://www.tripadvisor.com/Hotel_Review-g28970-d84078-Reviews-Hyatt_Regency_Washington_on_Capitol_Hill-Washington_DC_District_of_Columbia.html#/media/84078/?albumid=101&type=2&category=101")
image_url = []
end = False
while not(end):
    #wait until element is found and then store all webelements into list
    images = WebDriverWait(driver, 20).until(
        EC.presence_of_all_elements_located(
            (By.XPATH, '//*[@class="media-viewer-dt-root-GalleryImageWithOverlay__galleryImage--1Drp0"]')))
    #iterate through visible images and acquire their url based on background image style
    for index, image in enumerate(images):
        image_url.append(images[index].value_of_css_property("background-image"))
    #if you are at the end of the page then leave loop
    # if(length == end_length):
    #     end = True
    #move to next visible images in the array
    driver.execute_script("arguments[0].scrollIntoView();", images[-1])
    #time.sleep(1)
    #wait until the new web element is visible
    driver.implicitly_wait(10)
    #WebDriverWait(driver, 20).until(EC.visibility_of_element_located(images[-1]))
#clean the list to provide clear links
for i in range(len(image_url)):
    start = image_url[i].find("url(\"") + len("url(\"")
    end = image_url[i].find("\")")
    print(image_url[i][start:end])
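One common way to work around this kind of StaleElementReferenceException is to tolerate stale handles and re-locate the elements on every pass. A rough sketch against the question's own locator follows; max_scrolls is a made-up bound used for illustration, and this shows the general pattern rather than a verified fix for this page:
from selenium.common.exceptions import StaleElementReferenceException
max_scrolls = 20  # hypothetical bound, instead of the open-ended while loop
for _ in range(max_scrolls):
    # re-locate the gallery images on every pass so no old handles are reused
    images = WebDriverWait(driver, 20).until(
        EC.presence_of_all_elements_located(
            (By.XPATH, '//*[@class="media-viewer-dt-root-GalleryImageWithOverlay__galleryImage--1Drp0"]')))
    for image in images:
        try:
            image_url.append(image.value_of_css_property("background-image"))
        except StaleElementReferenceException:
            # the gallery re-rendered underneath us; the element will be
            # picked up again on the next pass
            continue
    try:
        driver.execute_script("arguments[0].scrollIntoView();", images[-1])
    except StaleElementReferenceException:
        pass
    time.sleep(1)  # give the lazy-loaded images a moment to attach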
This code, when given a list of cities, searches Google, extracts the data, and then converts it into a DataFrame.
In some cases I have to use different XPaths to extract the data; there are three XPaths in total.
I am trying to do this:
if 1 doesn't work, go to 2;
if 2 doesn't work, go to 3;
if 3 doesn't work, use driver.quit().
I tried this code, using NoSuchElementException:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
import pandas as pd
from selenium.common.exceptions import NoSuchElementException
df_output = pd.DataFrame(columns=["City", "pincode"])
url = "https://www.google.com/"
chromedriver = ('/home/me/chromedriver/chromedriver.exe')
driver = webdriver.Chrome(chromedriver)
driver.implicitly_wait(30)
driver.get(url)
search = driver.find_element_by_name('q')
mlist1=['polasa']
for i in mlist1:
    try:
        search.send_keys(i,' pincode')
        search.send_keys(Keys.RETURN)
        WebDriverWait(driver, 10).until(expected_conditions.visibility_of_element_located((By.XPATH, '//div[@class="IAznY"]//div[@class="title"]')))
        elmts = driver.find_elements_by_xpath('//div[@class="IAznY"]//div[@class="title"]')
        df_output = df_output.append(pd.DataFrame(columns=["City", "pincode"], data=[[i,elmts[0].text]]))
        driver.quit()
    except NoSuchElementException:
        try:
            elements=driver.find_element_by_xpath("//div[@class='Z0LcW']")
            df_output = df_output.append(pd.DataFrame(columns=["City", "pincode"], data=[[i,elements.text]]))
            driver.quit()
        except NoSuchElementException:
            try:
                elements=driver.find_element_by_xpath("//div[@class='Z0LcW AZCkJd']")
                df_output = df_output.append(pd.DataFrame(columns=["City", "pincode"], data=[[i,elements.text]]))
                driver.quit()
            except:
                driver.quit()
This code works using one of the 3 tags;
I need to combine the 3 tags in a single piece of code.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os
import html5lib
import json
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
url = "https://www.google.com/"
chromedriver = ('/home/me/chromedriver/chromedriver.exe')
driver = webdriver.Chrome(chromedriver)
driver.implicitly_wait(30)
driver.get(url)
search = driver.find_element_by_name('q')
search.send_keys('polasa',' pincode')
search.send_keys(Keys.RETURN)
elements=driver.find_element_by_xpath("//div[@class='Z0LcW']")
elements.text
You don't really need three try/except blocks. You can do this without throwing exceptions by locating elements (plural) given a locator and then checking the length of the collection returned. If the length is 0, no elements were found.
The locators you are using don't require XPath, so you can instead use a CSS selector and combine all three with an OR, avoiding the three separate checks. (Note: you can do the same thing with XPath, but the result is messier and harder to read.)
Here are your 3 locators combined into one using OR (the comma) in CSS selector syntax
div.IAznY div.title, div.Z0LcW, div.Z0LcW.AZCkJd
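For comparison, a sketch of the same three locators combined as an XPath union (assuming the same class names as above) shows why the CSS version reads more cleanly:
//div[@class="IAznY"]//div[@class="title"] | //div[@class="Z0LcW"] | //div[@class="Z0LcW AZCkJd"]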
...and the updated code using the combined locator and without the nested try-catch.
...
locator = (By.CSS_SELECTOR, 'div.IAznY div.title, div.Z0LcW, div.Z0LcW.AZCkJd')
for i in mlist1:
    search.send_keys(i,' pincode')
    search.send_keys(Keys.RETURN)
    WebDriverWait(driver, 10).until(expected_conditions.visibility_of_element_located(locator))
    elements = driver.find_elements(*locator)
    df_output = df_output.append(pd.DataFrame(columns=["City", "pincode"], data=[[i,elements[0].text]]))
driver.quit()
NOTE: I used your original locators and wasn't returning any results with any of the three. Are you sure they are correct?
Also note... I pulled the driver.quit() out of the loop. I'm not sure if you intended it to be inside or not but from the code provided, if the try succeeds in the first iteration, the browser will quit. You only have one item so you probably didn't notice this yet but would have been confused when you added another item to the iteration.
I've been trying to select the filter for "pitchers" and download to Excel from here: https://www.rotowire.com/baseball/stats.php
I've tried the following, but I'm getting an error and am not sure how to select the necessary items:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://www.rotowire.com/baseball/stats.php")
elem = driver.find_elements_by_xpath("//div[contains(@class,'filter-tab is-selected')]")
Ideally (for now), the script runs and downloads the file locally.
This downloads the pitcher data. There seems to be some hidden HTML on the site; that's why the code finds the whole stats container first, and then the Excel button inside it.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("https://www.rotowire.com/baseball/stats.php")
pitchers = driver.find_element_by_xpath("//div[@data-name='P']")
pitchers.click()
player_stats_elem = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "//div[@data-pos='p']")))
excel_download = WebDriverWait(player_stats_elem, 5).until(EC.presence_of_element_located((By.XPATH, ".//img[@alt='Excel']/..")))
excel_download.click()