Web scraping: missing href attribute - are simulated mouse clicks necessary? - python

For a fun web scraping project, I want to collect NHL data from https://www.nhl.com/stats/teams.
There is a clickable Excel Export tag which I can find using selenium and bs4.
Unfortunately, this is where it ends:
Since there is no href attribute it seems that I cannot access the data.
I got what I wanted by using pynput to simulate a mouse click, but I wonder:
Could I do that differently? It feels so clumsy.
-> The tag with the export icon can be found here:
a class="styles__ExportIcon-sc-16o6kz0-0 dIDMgQ"
-> Here is my code
import time
from pynput.mouse import Button, Controller
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'somepath\chromedriver.exe')
URL = 'https://www.nhl.com/stats/teams'
driver.get(URL)
html = driver.page_source  # DOM with JavaScript execution complete
soup = BeautifulSoup(html, 'html.parser')
body = soup.find('body')
print(body.prettify())
mouse = Controller()
time.sleep(5)  # Sleep for 5 seconds until page is loaded
mouse.position = (1204, 669)  # that's where the icon is on my screen
mouse.click(Button.left, 1)  # executes download

There is no href attribute; the download is triggered via JS. Since you are already working with Selenium, find your element and use .click() to download the file:
driver.find_element(By.CSS_SELECTOR,'h2>a').click()
This uses a CSS selector to get the direct child <a> of the <h2>; alternatively, select it directly by a class starting with styles__ExportIcon:
driver.find_element(By.CSS_SELECTOR,'a[class^="styles__ExportIcon"]').click()
Example
You may have to deal with the OneTrust cookie banner, so dismiss it first and then download the sheet.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://www.nhl.com/stats/teams'
driver.get(url)
driver.find_element(By.CSS_SELECTOR,'#onetrust-reject-all-handler').click()
driver.find_element(By.CSS_SELECTOR,'h2>a').click()
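If the cookie banner or the export link is not rendered yet when the click runs, the lines above can throw NoSuchElementException. A slightly more robust sketch (same selectors as above, just wrapped in explicit waits):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# Dismiss the OneTrust banner once it is clickable, then trigger the export download
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#onetrust-reject-all-handler'))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'a[class^="styles__ExportIcon"]'))).click()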

Related

Is it possible to use selenium to click this button?

The XPath is correct and I've tried both variations, but it returns NoneType.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pyautogui
# Set up a webdriver instance using the Chrome browser
driver = webdriver.Chrome()
# Navigate to the staking page
driver.get('https://staking.chain.link/')
# Wait for the page to load
time.sleep(15)
# Find the "Next" button
next_button = driver.find_element_by_xpath('/html/body/div[1]/main/div/div[3]/div[3]/button[2]')
# Click the "Next" button to retrieve the next 10 results
next_button.click()
I want to be able to click through the table displaying the oracles answers. Is it possible to use Selenium to click this button?
I also tried:
next_button = class_element= driver.find_element_by_class('button_secondary__WYcyn paragraph-100')
find_element_by_xpath has been deprecated (and removed in Selenium 4) in favour of find_element.
Add this import:
from selenium.webdriver.common.by import By
and change the line to:
next_button = driver.find_element(By.XPATH, '/html/body/div[1]/main/div/div[3]/div[3]/button[2]')
https://selenium-python.readthedocs.io/
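Putting the pieces together, a minimal sketch that also swaps the fixed time.sleep(15) for an explicit wait (the XPath is taken from the question):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://staking.chain.link/')

# Wait until the "Next" button is actually clickable instead of sleeping a fixed time
next_button = WebDriverWait(driver, 15).until(
    EC.element_to_be_clickable((By.XPATH, '/html/body/div[1]/main/div/div[3]/div[3]/button[2]')))
next_button.click()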

How can I get the HTML of a website as seen in the browser?

A website loads part of its content after the page is opened. When I use libraries such as requests and urllib3, I cannot get the part that is loaded later. How can I get the HTML of this website as seen in the browser? I can't open a browser using Selenium to get the HTML, because that would slow the process down.
I tried httpx, httplib2, urllib and urllib3, but I couldn't get the later-loaded section.
You can use Selenium to simulate a user-like page load, wait for the additional HTML elements to appear, and then parse the result with BeautifulSoup.
I would suggest Selenium since it provides the WebDriverWait class, which lets you wait for those additional HTML elements before scraping them.
This is my simple example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Replace with the URL of the website you want
url = "https://www.example.com"
# Add the option for a headless browser
options = webdriver.ChromeOptions()
options.add_argument("--headless")
# Create a new instance of the Chrome webdriver
driver = webdriver.Chrome(options=options)
driver.get(url)
# Wait for the additional HTML elements to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[contains(@class, 'lazy-load')]")))
# Get HTML
html = driver.page_source
print(html)
driver.close()
In the example above you can see that I'm using an explicit wait (10 seconds) for a specific condition to occur. More specifically, I'm waiting until the elements with the 'lazy-load' class are located by XPath, and then I retrieve the HTML.
Finally, I would recommend checking both BeautifulSoup and Selenium since both have tremendous capabilities for scraping websites and automating web-based tasks.
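If starting a browser is really not an option, another route (the same idea is used in a later answer on this page) is to open the browser DevTools once, look in the Network tab for the XHR/JSON request that delivers the late-loaded data, and call that endpoint directly with requests. The endpoint below is a placeholder you would replace with the one you find:
import requests

# Hypothetical endpoint discovered in the DevTools Network tab
endpoint = "https://www.example.com/api/late-loaded-data"
data = requests.get(endpoint).json()
print(data)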

web scraping table with selenium gets only html elements but no content

I am trying to scrape tables using Selenium and BeautifulSoup from these 3 websites:
https://www.erstebank.hr/hr/tecajna-lista
https://www.otpbanka.hr/tecajna-lista
https://www.sberbank.hr/tecajna-lista/
For all 3 websites the result is the HTML code for the table, but without the text.
My code is below:
import requests
from bs4 import BeautifulSoup
import pyodbc
import datetime
from selenium import webdriver
PATH = r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe'
driver = webdriver.Chrome(PATH)
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
driver.implicitly_wait(10)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
print(table)
driver.close()
Please help what am I missing?
Thank you
The website takes time to load the data into the table.
Either apply time.sleep:
import time
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)...
Or apply an explicit wait so that the rows are loaded in the table:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.get('https://www.erstebank.hr/hr/tecajna-lista')
wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_all_elements_located((By.XPATH,"//table/tbody/tr[@class='ng-scope']")))
# driver.find_element_by_id("popin_tc_privacy_button_2").click() # Cookie setting pop-up. Works fine even without dealing with this pop-up.
soup = BeautifulSoup(driver.page_source, 'html5lib')
table = soup.find_all('table')
print(table)
BeautifulSoup will not find the table because it doesn't exist from its reference point. Here, you tell Selenium to pause the Selenium element matcher if it notices that an element is not present yet:
# This only works for the Selenium element matcher
driver.implicitly_wait(10)
Then, right after that, you get the current HTML state (table still does not exist) and put it into BeautifulSoup's parser. BS4 will not be able to see the table, even if it loads in later, because it will use the current HTML code you just gave it:
# You now move the CURRENT STATE OF THE HTML PAGE to BeautifulSoup's parser
soup = BeautifulSoup(driver.page_source, 'lxml')
# As this is now in BS4's hands, it will parse it immediately (won't wait 10 seconds)
table = soup.find_all('table')
# BS4 finds no tables as, when the page first loads, there are none.
To fix this, you can ask Selenium to find the HTML table itself. As Selenium will use the implicitly_wait you specified earlier, it will wait until the table exists, and only then allow the rest of the code to continue. At that point, when BS4 receives the HTML code, the table will be there.
driver.implicitly_wait(10)
# Selenium will wait until the element is found
# I used XPath, but you can use any other matching sequence to get the table
driver.find_element_by_xpath("/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
However, this is a bit overkill. Yes, you can use Selenium to parse the HTML, but you could also just use the requests module (which, from your code, I see you already have imported) to get the table data directly.
The data is asynchronously loaded from this endpoint (you can use the Chrome DevTools to find it yourself). You can pair this with the json module to turn it into a nicely formatted dictionary. Not only is this method faster, but it is also much less resource intensive (Selenium has to open a whole browser window).
from requests import get
from json import loads
# Get data from URL
data_as_text = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").text
# Turn to dictionary
data_dictionary = loads(data_as_text)
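As a small simplification, requests can decode the JSON for you, so the json.loads step can be folded into a single call:
from requests import get

data_dictionary = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").json()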
You can use this as the foundation for further work:
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
TDCLASS = 'ng-binding'
options = webdriver.ChromeOptions()
options.add_argument('--headless')
with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.erstebank.hr/hr/tecajna-lista')
    try:
        # There may be a cookie request dialogue which we need to click through
        WebDriverWait(driver, 5).until(EC.presence_of_element_located(
            (By.ID, 'popin_tc_privacy_button_2'))).click()
    except Exception:
        pass  # Probably timed out, so ignore on the basis that the dialogue wasn't presented
    # The relevant <td> elements all seem to be of class 'ng-binding' so look for those
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, TDCLASS)))
    soup = BS(driver.page_source, 'lxml')
    for td in soup.find_all('td', class_=TDCLASS):
        print(td)
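If you want whole rows rather than individual cells, a small follow-on sketch (assuming the cells sit inside ordinary <tr> elements):
for tr in soup.find_all('tr'):
    # Collect the text of the 'ng-binding' cells in this row, skipping empty rows
    cells = [td.get_text(strip=True) for td in tr.find_all('td', class_=TDCLASS)]
    if cells:
        print(cells)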

how to scrape links from hidden span class HTML?

I'm learning web scraping as I scrape real world data from real websites.
Yet, I've never run into this type of issue until now.
One can usually find the HTML source for a part of a website by right-clicking it and then choosing the Inspect option. I'll jump to the example right away to explain the issue.
From the above picture, the span class marked in red is not there originally, but when I put (did not even click) my cursor on a user's name, a small box for that user pops up and that span class shows up. What I ultimately want to scrape is the link address for a user's profile, which is embedded inside that span class. I'm not sure, but IF I can parse that span class, I guess I can try to scrape the link address; however, I keep failing to parse that hidden span class.
As expected, my code gave me an empty list, because that span class doesn't show up when my cursor is not on the user's name. But I show my code to show what I've done.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
#Incognito Mode
option=webdriver.ChromeOptions()
option.add_argument("--incognito")
#Open Chrome
driver=webdriver.Chrome(executable_path="C:/Users/chromedriver.exe",options=option)
driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html")
time.sleep(3)
#parse html
html =driver.page_source
soup=BeautifulSoup(html,"html.parser")
hidden=soup.find_all("span", class_="ui_overlay ui_popover arrow_left")
print (hidden)
Are there any simple and intuitive ways to parse that hidden span class using Selenium? If I can parse it, I can use the 'find' function to get the link address for a user and then loop over all the users to get all the link addresses.
Thank you.
=======================updated the question by adding below===================
To add some more detailed explanations on what I want to retrieve, I want to get the link that is pointed with a red arrow from the below picture. Thank you for pointing out that I need more explanations.
==========================updated code so far=====================
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
#Incognito Mode
option=webdriver.ChromeOptions()
option.add_argument("--incognito")
#Open Chrome
driver=webdriver.Chrome(executable_path="C:/Users/chromedriver.exe",options=option)
driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html")
time.sleep(3)
profile=driver.find_element_by_xpath("//div[@class='mainContent']")
profile_pic=profile.find_element_by_xpath("//div[@class='ui_avatar large']")
ActionChains(driver).move_to_element(profile_pic).perform()
ActionChains(driver).move_to_element(profile_pic).click().perform()
#So far I could successfully hover over the first user. A few issues occur after this line.
#The error message says "type object 'By' has no attribute 'xpath'". I thought this would work since I searched on the internet how to enable this function.
waiting=wait(driver, 5).until(EC.element_to_be_clickable((By.xpath,('//span//a[contains(@href,"/Profile/")]'))))
#This also gives me an error message saying "unable to locate the element".
#Python and Java syntax differ in places, so I searched how to get the href value via an xpath that contains "/Profile/", but this also gives an error.
profile_box=driver.find_element_by_xpath('//span//a[contains(@href,"/Profile/")]').get_attribute("href")
print (profile_box)
Also, is there any way to iterate through xpath in this case?
I think you can use the requests library instead of Selenium.
When you hover over a username, you will see a request URL (visible in the browser's network tab) like the one used below.
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html')
print(html.status_code)
soup = BeautifulSoup(html.content, 'html.parser')
# Find all UID of username
# Split the string "UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293" into UID, SRC
# And recombine to Request URL
name = soup.find_all('div', class_="memberOverlayLink")
for i in name:
    print(i.get('id'))
# Use url to get profile link
response = requests.get('https://www.tripadvisor.com/MemberOverlay?Mode=owa&uid=805E0639C29797AEDE019E6F7DA9FF4E&c=&src=507403702&fus=false&partner=false&LsoId=&metaReferer=')
soup = BeautifulSoup(response.content, 'html.parser')
result = soup.find('a')
print(result.get('href'))
This is output:
200
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
/Profile/JLERPercy
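To automate the split-and-recombine step described in the comments above (rather than hard-coding one overlay URL), here is a sketch; the id format and the query parameters other than uid and src are copied from the hard-coded example, so treat them as assumptions:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html')
soup = BeautifulSoup(page.content, 'html.parser')

seen = set()
for div in soup.find_all('div', class_='memberOverlayLink'):
    # id looks like "UID_<uid>-SRC_<src>"; each user appears several times, so deduplicate
    member_id = div.get('id')
    if not member_id or member_id in seen:
        continue
    seen.add(member_id)
    uid = member_id.split('-SRC_')[0].replace('UID_', '')
    src = member_id.split('-SRC_')[1]
    overlay_url = ('https://www.tripadvisor.com/MemberOverlay?Mode=owa'
                   '&uid=' + uid + '&c=&src=' + src +
                   '&fus=false&partner=false&LsoId=&metaReferer=')
    overlay = BeautifulSoup(requests.get(overlay_url).content, 'html.parser')
    link = overlay.find('a')
    if link:
        print(link.get('href'))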
If you want to use Selenium to get the popup box,
you can use ActionChains to hover over the element,
but I think it's less efficient than using requests.
from selenium.webdriver.common.action_chains import ActionChains
ActionChains(driver).move_to_element(element).perform()
Python
The code below will extract the href value. Try it and let me know how it goes.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
driver = webdriver.Chrome('/usr/local/bin/chromedriver') # Optional argument, if not specified will search path.
driver.implicitly_wait(15)
driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html");
# finds all the comments or profile pics
profile_pic = driver.find_elements(By.XPATH, "//div[@class='prw_rup prw_reviews_member_info_hsx']//div[@class='ui_avatar large']")
for i in profile_pic:
    # hovers over and clicks each profile pic one by one
    ActionChains(driver).move_to_element(i).perform()
    ActionChains(driver).move_to_element(i).click().perform()
    # print the href or link value
    profile_box = driver.find_element_by_xpath('//span//a[contains(@href,"/Profile/")]').get_attribute("href")
    print(profile_box)
driver.quit()
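The implicit wait above may be enough, but mirroring the Java example below, you could also wait explicitly for the profile link inside each popup before reading its href. A sketch for the body of the loop:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the popup's profile link is visible, then read its href
link = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.XPATH, "//span//a[contains(@href, '/Profile/')]")))
print(link.get_attribute("href"))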
Java example:
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.interactions.Actions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;
public class Selenium {
    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "./lib/chromedriver");
        WebDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
        driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html");
        // finds all the comments or profiles
        List<WebElement> profile = driver.findElements(By.xpath("//div[@class='prw_rup prw_reviews_member_info_hsx']//div[@class='ui_avatar large']"));
        for (int i = 0; i < profile.size(); i++) {
            // Hover on user profile photo
            Actions builder = new Actions(driver);
            builder.moveToElement(profile.get(i)).perform();
            builder.moveToElement(profile.get(i)).click().perform();
            // Wait for user details pop-up
            WebDriverWait wait = new WebDriverWait(driver, 10);
            wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath("//span//a[contains(@href,'/Profile/')]")));
            // Extract the href value
            String hrefvalue = driver.findElement(By.xpath("//span//a[contains(@href,'/Profile/')]")).getAttribute("href");
            // Print the extracted value
            System.out.println(hrefvalue);
        }
        // close the browser
        driver.quit();
    }
}
output
https://www.tripadvisor.com/Profile/861kellyd
https://www.tripadvisor.com/Profile/JLERPercy
https://www.tripadvisor.com/Profile/rayn817
https://www.tripadvisor.com/Profile/grossla
https://www.tripadvisor.com/Profile/kapmem

Can't View Complete Page Source in Selenium

When I view the HTML source after manually navigating to the site via Chrome, I can see the full page source, but when loading the page via Selenium I don't get the complete page source.
from bs4 import BeautifulSoup
from selenium import webdriver
import sys,time
driver = webdriver.Chrome(executable_path=r"C:\Python27\Scripts\chromedriver.exe")
driver.get('http://www.magicbricks.com/')
driver.find_element_by_id("buyTab").click()
time.sleep(5)
driver.find_element_by_id("keyword").send_keys("Navi Mumbai")
time.sleep(5)
driver.find_element_by_id("btnPropertySearch").click()
time.sleep(30)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"lxml")
print soup.prettify()
The website is possibly blocking or restricting the user agent used by Selenium. An easy test is to change the user agent and see if that helps. More info at this question:
Change user agent for selenium driver
Quoting:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("user-agent=whatever you want")
driver = webdriver.Chrome(chrome_options=opts)
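One small note: recent Selenium 4 releases have removed the chrome_options keyword, so on a current install you would pass the options with the options keyword instead (as far as I know this also works on late 3.x versions):
# Same idea, but with the current keyword argument
driver = webdriver.Chrome(options=opts)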
Try something like:
import time
time.sleep(5)
content = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
instead of driver.page_source.
Dynamic web pages often need to be rendered by JavaScript before the full HTML is available.
