I would like to scrape the reviews from this website: https://www.sephora.com/product/the-porefessional-face-primer-P264900. Here is an example of the syntax I find when I inspect a review:
<div class="css-7rv8g1 " data-comp="Ellipsis Box ">So good! This primer smooths my skin and blurs my pores so well! But, it is pretty mattifying so if you want a dewy look, this might not be for you.</div>
I have tried the following code, which returns an empty list:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome('/…/chromedriver')
url = 'https://www.sephora.com/product/the-porefessional-face-primer-P264900'
driver.get(url)
reviews = driver.find_elements_by_xpath("//div[@id='ratings-reviews']//div[@data-comp='Ellipsis Box']")
I have tried calling other find_elements methods on driver without success. I have also tried the solution outlined at this answer, but got a TimeoutException from running the following code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get(url)
driver.execute_script("arguments[0].scrollIntoView(true);", WebDriverWait(driver,20).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='tabpanel0']/div//b[contains(., 'What Else You Need to Know')]"))))
reviews = WebDriverWait(driver,20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@data-comp='GridCell Box']//div[@data-comp='Ellipsis Box']")))
How can I use Selenium to scrape reviews from this page on Sephora’s website?
You need to add a space in your XPath: you have 'Ellipsis Box' when it should be 'Ellipsis Box ' (the data-comp attribute value ends with a trailing space).
//div[@id='ratings-reviews']//div[@data-comp='Ellipsis Box ']
I was able to find 6 elements using the corrected xpath.
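Putting it together, here is a minimal sketch of a corrected scrape (the 20-second timeout is an arbitrary choice, and chromedriver is assumed to be on PATH):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.sephora.com/product/the-porefessional-face-primer-P264900')
# Note the trailing space in 'Ellipsis Box ' -- it is part of the attribute value
reviews = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located(
    (By.XPATH, "//div[@id='ratings-reviews']//div[@data-comp='Ellipsis Box ']")))
for review in reviews:
    print(review.text)

Alternatively, //div[contains(@data-comp, 'Ellipsis Box')] sidesteps the trailing-space issue entirely.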
I am trying to scrape the following record table on familysearch.org. I am using the Chrome webdriver with Python, along with BeautifulSoup and Selenium.
Upon inspecting the page I am interested in, I wanted to scrape the following bit of HTML. Note this is only one element of a familysearch.org table that has 100 names.
<span role="cell" class="td " name="name" aria-label="Name"> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> <span><sr-cell-name name="Jame Junior " url="ZS" relationship="Principal" collection-name="Index"></sr-cell-name></span> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> </span>
Alternatively, the name also shows in this bit of HTML
<a class="name" href="/ark:ZS">Jame Junior </a>
From all of this, I only want to get the name "Jame Junior". I have tried using driver.find_elements_by_class_name("name"), but it prints nothing.
This is the code I used
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
from getpass import getpass
username = input("Enter Username: ")
password = getpass("Enter Password: ")
chrome_path= r"C:\Users...chromedriver_win32\chromedriver.exe"
driver= webdriver.Chrome(chrome_path)
driver.get("https://www.familysearch.org/search/record/results?q.birthLikeDate.from=1996&q.birthLikeDate.to=1996&f.collectionId=...")
usernamet = driver.find_element_by_id("userName")
usernamet.send_keys(username)
passwordt = driver.find_element_by_id("password")
passwordt.send_keys(password)
login = driver.find_element_by_id("login")
login.submit()
driver.get("https://www.familysearch.org/search/record/results?q.birthLikeDate.from=1996&q.birthLikeDate.to=1996&f.collectionId=.....")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "name")))
#for tag in driver.find_elements_by_class_name("name"):
#    print(tag.get_attribute('innerHTML'))
soup = BeautifulSoup(driver.page_source, "html.parser")
for tag in soup.find_all("sr-cell-name"):
    print(tag["name"])
Try to access the sr-cell-name tag.
Selenium:
for tag in driver.find_elements_by_tag_name("sr-cell-name"):
    print(tag.get_attribute("name"))
BeautifulSoup (build the soup from the rendered page source first):
soup = BeautifulSoup(driver.page_source, "html.parser")
for tag in soup.find_all("sr-cell-name"):
    print(tag["name"])
EDIT: You might need to wait for the element to fully appear on the page before parsing it. You can do this using the presence_of_element_located method:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("...")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "name")))
for tag in driver.find_elements_by_class_name("name"):
    print(tag.get_attribute('innerHTML'))
I was looking to do something very similar and have semi-decent Python/Selenium scraping experience. Long story short, FamilySearch (and many other sites, I'm sure) uses some kind of technology (I'm not a JS or web guy) that involves a shadow host, i.e. a shadow DOM. The tags are essentially invisible to BeautifulSoup or Selenium.
Solution: pyshadow
https://github.com/sukgu/pyshadow
You may also find this link helpful:
How to handle elements inside Shadow DOM from Selenium
I have now been able to successfully find elements I couldn't before, but am still not all the way where I'm trying to get. Good luck!
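For reference, a minimal pyshadow sketch, reusing the sr-cell-name elements from the question (the selector is taken from the question's HTML; I haven't verified it against the live site):

from selenium import webdriver
from pyshadow.main import Shadow

driver = webdriver.Chrome()
# log in and navigate to the results page first, as in the question
shadow = Shadow(driver)  # wraps the driver so lookups can pierce shadow roots
# find_elements takes a CSS selector and searches across shadow boundaries
for tag in shadow.find_elements("sr-cell-name"):
    print(tag.get_attribute("name"))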
I am attempting to scrape the website basketball-reference and am running into an issue I can't seem to solve. I am trying to grab the box score element for each game played. This is something I was able to do easily with urlopen, but because other portions of the site require Selenium, I thought I would rewrite the entire process with Selenium.
The issue seems to be that even if I wait to scrape until I see the first element load using WebDriverWait, when I then move on to grabbing the elements I get nothing returned.
One thing I found interesting: if I did a full site print using my results from urlopen with something like print(uClient.read()), I would get roughly 300 more lines of HTML after beautifying, compared to doing the same with print(driver.page_source), even with an implicit wait set to 5 minutes.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.wait = WebDriverWait(driver, 10)
driver.get('https://www.basketball-reference.com/boxscores/')
driver.wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="content"]/div[3]/div[1]')))
box = driver.find_elements_by_class_name('game_summary expanded nohover')
print (box)
driver.quit()
Try the below code; it works on my computer. Do let me know if you still face a problem.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.wait = WebDriverWait(driver, 60)
driver.get('https://www.basketball-reference.com/boxscores/')
driver.wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="content"]/div[3]/div[1]')))
boxes = driver.wait.until(
    EC.presence_of_all_elements_located((By.XPATH, "//div[@class=\"game_summary expanded nohover\"]")))
print("Number of Elements Located : ", len(boxes))
for box in boxes:
    print(box.text)
    print("-----------")
driver.quit()
Actually, the site doesn't require Selenium at all. All the data is there through a simple requests call (much of it is just in HTML comments, which you'd need to parse). Secondly, you can grab the box scores quite easily with pandas:
import pandas as pd
dfs = pd.read_html('https://www.basketball-reference.com/boxscores/')
for idx, table in enumerate(dfs[:-2]):
    print(table)
    if (idx + 1) % 3 == 0:
        print("-----------")
I am trying to get Selenium to web scrape the first paragraph of wiki pages using CSS selectors. When I run this code, it seems to select only paragraphs from the original web page (https://en.wikipedia.org) and not from what I am searching for, in this case 'cats'. Any help with this would be awesome!
browser = webdriver.Firefox(executable_path=r'D:\Import Files that I also want backed up\Jupyter Notebooks\Python Projects\Selenium\driverss\geckodriver.exe')
browser.get('https://en.wikipedia.org')
search_elem = browser.find_element_by_css_selector('#searchInput')
search_elem.send_keys('cats')
search_elem.submit()
results_elem = browser.find_element_by_css_selector('p')
print(results_elem.text)
output:
Adventure Time is an American fantasy animated television series created .....
To get the first paragraph text from the wiki page, induce WebDriverWait() with visibility_of_element_located() and the following CSS selector.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox(executable_path=r'D:\Import Files that I also want backed up\Jupyter Notebooks\Python Projects\Selenium\driverss\geckodriver.exe')
browser.get('https://en.wikipedia.org')
search_elem = browser.find_element_by_css_selector('#searchInput')
search_elem.send_keys('cats')
search_elem.submit()
results_elem=WebDriverWait(browser,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"div.mw-parser-output p:nth-of-type(3)")))
print(results_elem.text)
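Note that p:nth-of-type(3) happens to be the first non-empty paragraph inside div.mw-parser-output on this particular article; other pages may put the lead paragraph at a different index, so adjust the selector if the output looks off.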
Hoping you can help. I'm relatively new to Python and Selenium. I'm trying to pull together a simple script that will automate news searching on various websites. The primary focus is football: go and get me the latest Manchester United news from a couple of places, and save the list of link titles and URLs for me. I can then look through the links myself and choose anything I want to review.
In trying The Independent newspaper (https://www.independent.co.uk/), I seem to have come up against an "element not interactable" problem when using the following approaches:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('chromedriver')
driver.get('https://www.independent.co.uk')
time.sleep(3)
#accept the cookies/privacy bit
OK = driver.find_element_by_id('qcCmpButtons')
OK.click()
#wait a few seconds, just in case
time.sleep(5)
search_toggle = driver.find_element_by_class_name('icon-search.dropdown-toggle')
search_toggle.click()
This throws selenium.common.exceptions.ElementNotInteractableException: Message: element not interactable.
I've also tried with XPath:
search_toggle = driver.find_element_by_xpath('//*[@id="quick-search-toggle"]')
and I also tried ID.
I did a lot of reading on here and then also tried using WebDriverWait and execute_script methods:
element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="quick-search-toggle"]')))
driver.execute_script("arguments[0].click();", element)
This didn't seem to error but the search box never appeared, i.e. the appropriate click didn't happen.
Any help you could give would be fantastic.
Thanks,
Pete
Your locator is //*[@id="quick-search-toggle"], and there are 2 matches on the page. The first is invisible and the second is visible. By default Selenium refers to the first element; sadly the element you mean is the second one, so you need another, unique locator. Try this:
search_toggle = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//div[@class="row secondary"]//a[@id="quick-search-toggle"]')))
search_toggle.click()
First you need to open the search box, then send the search keys:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import os
chrome_options = Options()
chrome_options.add_argument("--start-maximized")
browser = webdriver.Chrome(executable_path=os.path.abspath(os.getcwd()) + "/chromedriver", options=chrome_options)
link = 'https://www.independent.co.uk'
browser.get(link)
# accept privacy
browser.find_element_by_xpath('//*[@id="qcCmpButtons"]/button').click()
# open search box
li = browser.find_element_by_xpath('//*[@id="masthead"]/div[3]/nav[2]/ul/li[1]')
li.find_element_by_tag_name('a').click()
# send keys to search box
search = browser.find_element_by_xpath('//*[@id="gsc-i-id1"]')
search.send_keys("python")
search.send_keys(Keys.RETURN)
Can you try the below steps?
search_toggle = driver.find_element_by_xpath('//*[@class="row secondary"]/nav[2]/ul/li[1]/a')
search_toggle.click()
Hello all, my task is to scrape source URLs from an offer link like this one.
But when I try to get the elements like this (note that I request the URL twice to get the cookies, because the first request redirects me to the main page):
driver = webdriver.Firefox(executable_path="C:\\selenium-drivers\\geckodriver.exe")
driver.get("http://www.kmart.com/joe-boxer-men-s-pajama-shirt-pants-plaid/p-046VA92629712P")
driver.get("http://www.kmart.com/joe-boxer-men-s-pajama-shirt-pants-plaid/p-046VA92629712P")
img_element = driver.find_elements_by_class_name("main-image")
No elements are found, and when I try to search for them in the page source in the browser with Ctrl+U, they are missing. Why is this happening? Can anyone tell me how to get these images?
You just need to tell Selenium to be patient and wait for the element's visibility:
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# driver definition here
driver.get("http://www.kmart.com/joe-boxer-men-s-pajama-shirt-pants-plaid/p-046VA92629712P")
wait = WebDriverWait(driver, 10)
# get the main image element
img_element = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'main-image')))
print(img_element.get_attribute("alt"))
driver.close()
For demonstration purposes, it prints the alt attribute of the image, which is:
Joe Boxer Men's Pajama Shirt & Pants - Plaid
Or you can just find it by XPath and then get the image URL:
>>> driver.get('http://www.kmart.com/joe-boxer-men-s-pajama-shirt-pants-plaid/p-046VA92629712P')
>>> s = driver.find_element_by_xpath('//*[@id="overview"]/div[1]/img')
>>> s.get_attribute('src')
'http://c.shld.net/rpx/i/s/i/spin/-122/prod_2253990712?hei=624&wid=624&op_sharpen=1'