I'm trying to scrape more than 10 pages of reviews from https://www.innisfree.com/kr/ko/ProductReviewList.do
However, when I move to the next page and try to get that page's reviews, I still get only the first page's reviews.
I used driver.execute_script("goPage(2)") followed by time.sleep(5), but my code still returns only the first page's reviews.
(I did not use a for loop here, just to see whether the results differ between page 1 and page 2. I imported BeautifulSoup and Selenium.)
Here is my code:
url = "https://www.innisfree.com/kr/ko/ProductReviewList.do"
chromedriver = r'C:\Users\hhm\Downloads\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(chromedriver)
driver.get(url)
print("this is page 1")
driver.execute_script("goPage(1)")
nTypes = soup.select('.reviewList ul .newType div[class^=reviewCon] .reviewConTxt')
for nType in nTypes:
    product = nType.select_one('.pdtName').text
    print(product)
print('\n')
print("this is page 2")
driver.execute_script("goPage(2)")
time.sleep(5)
nTypes = soup.select('.reviewList ul .newType div[class^=reviewCon] .reviewConTxt')
for nType in nTypes:
    product = nType.select_one('.pdtName').text
    print(product)
If your second page opens in a new window, you need to switch Selenium's control to that window.
Example:
# Opens a new tab
self.driver.execute_script("window.open()")
# Switch to the newly opened tab
self.driver.switch_to.window(self.driver.window_handles[1])
Source:
How to switch to new window in Selenium for Python?
https://www.techbeamers.com/switch-between-windows-selenium-python/
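When the site itself opens the new window (rather than your script calling window.open()), one way to find it is to compare the window handles before and after the click. A minimal sketch, assuming `driver` is a Selenium WebDriver; the helper name `switch_to_new_window` is made up for illustration:

```python
def switch_to_new_window(driver, handles_before):
    """Switch to the first window handle that was not open before.

    `driver` is assumed to be a Selenium WebDriver; `handles_before` is the
    list saved from driver.window_handles before the new window was opened.
    Returns the new handle, or None if no new window appeared.
    """
    for handle in driver.window_handles:
        if handle not in handles_before:
            driver.switch_to.window(handle)
            return handle
    return None
```

Save `before = list(driver.window_handles)` ahead of the click, then call `switch_to_new_window(driver, before)` afterwards.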
Try the following code. You need to click each pagination link to reach the next page; this way you will get all 100 review comments.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
url = "https://www.innisfree.com/kr/ko/ProductReviewList.do"
chromedriver = r'C:\Users\hhm\Downloads\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(chromedriver)
driver.get(url)
for i in range(2, 12):
    time.sleep(2)
    # Re-parse the page source on every iteration so the soup reflects the current page
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    nTypes = soup.select('.reviewList ul .newType div[class^=reviewCon] .reviewConTxt')
    for nType in nTypes:
        product = nType.select_one('.pdtName').text
        print(product)
    if i == 11:
        break
    nextbutton = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='num']/a[text()='" + str(i) + "']")))
    driver.execute_script("arguments[0].click();", nextbutton)
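The key point in the loop above is that driver.page_source is read again after every page change; a soup built once keeps showing page 1. A minimal sketch of that pattern, assuming the page exposes the goPage(n) JavaScript function used in the question. `collect_reviews` and `parse_page` are hypothetical names, and the fixed sleep is a crude stand-in for an explicit wait:

```python
import time


def collect_reviews(driver, parse_page, last_page, pause=2.0):
    """Return parsed items from pages 1..last_page.

    `driver` is assumed to be a Selenium WebDriver already showing page 1.
    `parse_page` is any function that turns an HTML string into a list,
    for example one wrapping BeautifulSoup(html, 'html.parser').select(...).
    """
    items = []
    for page in range(1, last_page + 1):
        if page > 1:
            # goPage(n) is the site's own pagination function (from the question)
            driver.execute_script("goPage({})".format(page))
            time.sleep(pause)  # give the new page time to render
        # Read page_source *after* the page change so the HTML is fresh
        items.extend(parse_page(driver.page_source))
    return items
```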
I am trying to get all the data from this table. However, the last row is a "Load More" table row that I do not know how to load. So far I have tried different approaches that did not work.
First, I tried to click on the row itself:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find('table', {"class": "competition-leaderboard__table"})
i = 0
for team in table.find_all('tbody'):
    rows = team.find_all('tr')
    for row in rows:
        i = i + 1
        if (i == 51):
            row.click()
        # the scraping code for the first 50 elements
The code above throws an error saying "'NoneType' object is not callable".
Another thing I tried that did not work is the following:
I tried to get the "Load More" table row by its class and click on it.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get(url)
load_more = driver.find_element_by_class_name('competition-leaderboard__load-more-wrapper')
load_more.click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
The code above also did not work.
So my question is: how can I make Python click the "Load More" table row? In the HTML structure of the site, "Load More" does not seem to be a clickable button.
In your code, you have to accept the cookies first; then you can click the 'Load more' button.
CSS selectors are the most suitable in this case.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.implicitly_wait(10)
driver.get('https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/leaderboard')
wait = WebDriverWait(driver, 30)
# Wait until the cookie-consent button is clickable, then click it
cookies = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".sc-pAyMl.dwWbEz .sc-AxiKw.kOAUSS>.sc-AxhCb.gsXzyw")))
cookies.click()
# Click the "Load more" row
load_more = driver.find_element(By.CSS_SELECTOR, ".competition-leaderboard__load-more-count")
load_more.click()
time.sleep(10) # Added for you to make sure that both buttons were clicked
driver.close()
driver.quit()
I tested this snippet and it clicked the desired button.
Note that I've added WebDriverWait in order to wait until the first button is clickable.
UPDATE:
I added time.sleep(10) so you could see that both buttons are clicked.
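If the table needs more than one click before everything is shown (an assumption; the snippet above clicks once), the click can be repeated until the "Load more" element is gone. A minimal sketch, assuming `driver` is a Selenium WebDriver; `click_until_gone` is a hypothetical helper name:

```python
import time


def click_until_gone(driver, css_selector, max_clicks=100, pause=1.0):
    """Click the element matching css_selector until it disappears.

    find_elements returns an empty list (instead of raising) when nothing
    matches, which makes it a convenient stop condition.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        # "css selector" is the string value of By.CSS_SELECTOR
        buttons = driver.find_elements("css selector", css_selector)
        if not buttons:
            break
        # Click via JavaScript so an overlay cannot intercept the click
        driver.execute_script("arguments[0].click();", buttons[0])
        clicks += 1
        time.sleep(pause)  # let the next batch of rows load
    return clicks
```

For the leaderboard above this could be called as `click_until_gone(driver, ".competition-leaderboard__load-more-count")` after the cookie banner has been accepted.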
I am scraping reviews and skin types from Sephora and have run into a problem identifying how to get elements off the page.
Sephora.com loads reviews dynamically after you scroll down the page, so I have switched from Beautiful Soup to Selenium to get the reviews.
The reviews have no ID, no name, and no CSS identifier that seems stable. The XPath copied from Chrome or Firefox doesn't seem to be recognized whenever I try to use it.
Here is an example of the HTML from the inspected element that I loaded in Chrome:
[Image: Inspect Element view from the desired page]
My attempts thus far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome("/Users/myName/Downloads/chromedriver")
url = 'https://www.sephora.com/product/the-porefessional-face-primer-P264900'
driver.get(url)
reviews = driver.find_elements_by_xpath(
    "//div[@id='ratings-reviews']//div[@data-comp='Ellipsis Box ']")
print("REVIEWS:", reviews)
Output:
| => /Users/myName/anaconda3/bin/python "/Users/myName/Documents/ScrapeyFile Group/attempt32.py"
REVIEWS: []
(base)
So basically an empty list.
ATTEMPT 2:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
# Open up a Firefox browser and navigate to web page.
driver = webdriver.Firefox()
driver.get(
"https://www.sephora.com/product/squalane-antioxidant-cleansing-oil-P416560?skuId=2051902&om_mmc=ppc-GG_1165716902_56760225087_pla-420378096665_2051902_257731959107_9061275_c&country_switch=us&lang=en&ds_rl=1261471&gclid=EAIaIQobChMIisW0iLbK6AIVaR6tBh005wUTEAYYBCABEgJVdvD_BwE&gclsrc=aw.ds"
)
#Scroll to bottom of page b/c its dynamically loading
html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.END)
#scrape stats and comments
comments = driver.find_elements_by_css_selector("div.css-7rv8g1")
print("!!!!!!Comments!!!!!")
print(comments)
OUTPUT:
| => /Users/MYNAME/anaconda3/bin/python /Users/MYNAME/Downloads/attempt33.py
!!!!!!Comments!!!!!
[]
(base)
Empty again. :(
I get the same results when I try to use different element selectors:
#scrape stats and comments
comments = driver.find_elements_by_class_name("css-7rv8g1")
I also get nothing when I try this:
comments = driver.find_elements_by_xpath(
    "//div[@data-comp='GridCell Box']//div[@data-comp='Ellipsis Box ']")
and this (notice that the space after Ellipsis Box is gone):
comments = driver.find_elements_by_xpath(
    "//div[@data-comp='GridCell Box']//div[@data-comp='Ellipsis Box']")
I have tried the solutions outlined here and here, but to no avail. I think there is something I don't understand about the page or about Selenium, since this is my first time using Selenium and I'm a complete newbie. :(
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Chrome(executable_path=r"")
driver.maximize_window()
wait = WebDriverWait(driver, 20)
driver.get("https://www.sephora.fr/p/black-ink---classic-line-felt-liner---eyeliner-feutre-precis-waterproof-P3622017.html")
# Scroll to the bottom of the page so that the reviews get loaded
scrolls = 1
while True:
    scrolls -= 1
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    if scrolls < 0:
        break
reviewText = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//ol[@class='bv-content-list bv-content-list-reviews']//li//div[@class='bv-content-summary-body']//div[1]")))
for textreview in reviewText:
    print(textreview.text)
I've been scraping reviews from Sephora and, even though there is plenty of room for improvement, it basically works like this:
Clicks on "reviews" to access the reviews
Loads all reviews by scrolling until there aren't any reviews left to load
Finds the review text and skin type by CSS selector
def load_all_reviews(driver):
    while True:
        try:
            driver.execute_script(
                "arguments[0].scrollIntoView(true);",
                WebDriverWait(driver, 10).until(
                    EC.visibility_of_element_located(
                        (By.CSS_SELECTOR, ".bv-content-btn-pages-load-more")
                    )
                ),
            )
            driver.execute_script(
                "arguments[0].click();",
                WebDriverWait(driver, 20).until(
                    EC.element_to_be_clickable(
                        (By.CSS_SELECTOR, ".bv-content-btn-pages-load-more")
                    )
                ),
            )
        except Exception as e:
            # Stop on any error, e.g. a timeout because no "load more" button is left
            break
def get_review_text(review):
    try:
        return review.find_element(By.CLASS_NAME, "bv-content-summary-body-text").text
    except:
        return "NA"  # in case it doesn't find a review
def get_skin_type(review):
    try:
        # Note: an XPath starting with // is evaluated against the whole document, not just this review
        return review.find_element(By.XPATH, '//*[@id="BVRRContainer"]/div/div/div/div/ol/li[2]/div[1]/div/div[2]/div[5]/ul/li[4]/span[2]').text
    except:
        return "NA"  # in case it doesn't find a skin type
To use those, you've got to create a webdriver and first call the load_all_reviews() function.
Then you've got to find the reviews with:
reviews = driver.find_elements(By.CSS_SELECTOR, ".bv-content-review")
and finally you can call the get_review_text() and get_skin_type() functions for each review:
for review in reviews:
    print(get_review_text(review))
    print(get_skin_type(review))
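For pages that load more content on scroll rather than through a button, as the question describes, an alternative to a fixed number of scrolls is to keep scrolling until the page height stops growing. A minimal sketch, assuming `driver` is a Selenium WebDriver; `scroll_until_stable` is a hypothetical helper, and the fixed pause is only a rough guess at how long the site needs:

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that actually made the page taller.
    """
    height = driver.execute_script("return document.body.scrollHeight")
    grew = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # wait for dynamically loaded content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == height:
            break  # nothing new was loaded
        height = new_height
        grew += 1
    return grew
```

`max_rounds` is there so that a page with endless content cannot keep the loop running forever.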
I've written a script in Python using Selenium to scrape the links of different posts across several pages by clicking the next-page button, and to get the title of each post from its inner page. Although the content I'm dealing with here is static, I used Selenium to see how it parses items while clicking through the next pages. I'm only after a solution that uses Selenium.
Website address
If I define a blank list and extend it with all the links, I can eventually parse all the titles by reusing those links once the clicking on the next-page button is done, but that is not what I want.
What I intend to do instead is collect all the links from each page and parse the title of each post from its inner page while clicking on the next-page button. In short, I wish to do the two things simultaneously.
I've tried with:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://stackoverflow.com/questions/tagged/web-scraping"
def get_links(url):
    driver.get(url)
    while True:
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]
        yield from get_info(items)
        try:
            elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            driver.execute_script("arguments[0].scrollIntoView();",elem)
            elem.click()
            time.sleep(2)
        except Exception:
            break
def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name
if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for item in get_links(link):
        print(item)
When I run the above script, it parses the titles of the posts by reusing the links from the first page, but then breaks, throwing the error raise TimeoutException(message, screen, stacktrace)
when it hits the line elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']"))).
How can I scrape the title of each post from its inner page, collecting the links from the first page, and then click on the next-page button to repeat the process until it is done?
You can't find the next button because, after get_info has visited each inner link, the browser is left on the last inner page, which has no next button.
Instead, build the URL of each next page as shown below and load it directly.
urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(pageno)  # where pageno starts from 2
Try the code below.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://stackoverflow.com/questions/tagged/web-scraping"
def get_links(url):
    urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'
    npage = 2
    driver.get(url)
    while True:
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]
        yield from get_info(items)
        driver.get(urlnext.format(npage))
        try:
            elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            npage=npage+1
            time.sleep(2)
        except Exception:
            break
def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name
if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for item in get_links(link):
        print(item)
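Loading listing pages by URL works because navigation no longer depends on an element of a page the browser has already left. The URL-building step on its own can be sketched as below. The template is the one from the answer; the helper `page_urls` and the page range 2 to 4 are illustrative assumptions:

```python
def page_urls(template, first_page, last_page):
    """Yield listing-page URLs for first_page..last_page (inclusive)."""
    for page in range(first_page, last_page + 1):
        yield template.format(page)


# URL pattern taken from the answer above; the page range is only an example
template = "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30"
urls = list(page_urls(template, 2, 4))
```

Each URL can then be passed to driver.get() before collecting the links on that page.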
I am trying to scrape data from a website that returns search results spanning multiple pages, using Selenium and BeautifulSoup in Python. The first page is easy to read. Moving to the next page requires clicking the '>' button. The element looks like this:
<a href ng-click="selectPage(page + 1, $event)" class="ng-binding">Next
I tried the following:
browser = webdriver.Chrome()
browser.get ("https:www....com/search/?lat=dfdfd ")
page = browser.page_source
soup = BeautifulSoup(page, 'html.parser')
# scraping the first page
# now I need to click on the ">" so that it takes me to the next page
Control should go to the next page so that I can scrape it. There are about 250 pages of results.
In Chrome, right-click the page and choose "Inspect" from the context menu. Find the element in the HTML, right-click it, and choose Copy > Copy XPath. You can then pass that XPath to the browser.find_element_by_xpath method to assign the element to a variable, and call element.click() to click it.
Since you haven't provided the URL, I'll show an example of how to solve this.
I'm assuming the button has an ID, but you can change this to find it by class, etc.
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = Chrome()
browser.get("https:www....com/search/?lat=dfdfd ")
page = browser.page_source
soup = BeautifulSoup(page, 'html.parser')
wait = WebDriverWait(browser, 30)
wait.until(EC.visibility_of_element_located((By.ID, 'next-button')))
# Next page
browser.find_element_by_id('next-button').click()
# Continue your code ...
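With about 250 pages, the click has to be repeated in a loop. A minimal sketch, assuming `driver` is a Selenium WebDriver and that the Next link can be located by the ng-click attribute shown in the question. The helper name and the fixed pause are illustrative, and if the site keeps a disabled Next link on the last page, only `max_pages` will stop the loop:

```python
import time

# XPath built from the <a ng-click="selectPage(page + 1, $event)"> element in the question
NEXT_XPATH = "//a[contains(@ng-click, 'selectPage(page + 1')]"


def scrape_all_pages(driver, max_pages=250, pause=2.0):
    """Return the HTML of every results page, following the Next link."""
    pages = [driver.page_source]
    while len(pages) < max_pages:
        # "xpath" is the string value of By.XPATH
        links = driver.find_elements("xpath", NEXT_XPATH)
        if not links:
            break  # no Next link: this was the last page
        links[0].click()
        time.sleep(pause)  # wait for the next page of results to render
        pages.append(driver.page_source)  # re-read the source after the click
    return pages
```

Each returned HTML string can be handed to BeautifulSoup(html, 'html.parser') for parsing.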
I am trying to scrape links to song pages for some artists on genius.com, but I'm running into issues because the links to the individual song pages are displayed inside a popup modal window.
The modal window doesn't load all links in one go, and instead loads more content via ajax when you scroll down to the bottom of the modal.
I tried using code to scroll to the bottom of the page but unfortunately that just scrolled in the window behind the modal rather than the modal itself:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
So then I tried selecting the last element in the modal and scrolling to that (with the idea of doing that a few times until all song pages had been loaded), but it wouldn't scroll far enough to get the website to load more content:
last_element = driver.find_elements_by_xpath('//div[@class="mini_card-metadata"]')[-1]
last_element.location_once_scrolled_into_view
Here is my code so far:
import os
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_driver = "/Applications/chromedriver"
os.environ["webdriver.chrome.driver"] = chrome_driver
driver = webdriver.Chrome(chrome_driver)
base_url = 'https://genius.com/artists/Stormzy'
driver.get(base_url)
xpath_str = '//div[contains(text(),"Show all songs by Stormzy")]'
driver.find_element_by_xpath(xpath_str).click()
Is there a way to extract all the song page links for the artist?
Try the code below to get the required output:
from selenium import webdriver as web
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
driver = web.Chrome()
base_url = 'https://genius.com/artists/Stormzy'
driver.get(base_url)
# Open modal
driver.find_element_by_xpath('//div[normalize-space()="Show all songs by Stormzy"]').click()
song_locator = By.CSS_SELECTOR, 'a.mini_card.mini_card--small'
# Wait for first XHR complete
wait(driver, 10).until(EC.visibility_of_element_located(song_locator))
# Get current length of songs list
current_len = len(driver.find_elements(*song_locator))
while True:
    # Trigger a new XHR for as long as more songs can be loaded
    driver.find_element(*song_locator).send_keys(Keys.END)
    try:
        wait(driver, 3).until(lambda x: len(driver.find_elements(*song_locator)) > current_len)
        current_len = len(driver.find_elements(*song_locator))
    # Return full list of songs
    except TimeoutException:
        songs_list = [song.get_attribute('href') for song in driver.find_elements(*song_locator)]
        break
print(songs_list)
This should let you request new XHRs until the length of the songs list stops changing, and finally return the list of links.
When you scroll to the bottom of the modal dialog, it calls
$scrollable_data_ctrl.load_next();
As an option, you can try executing it repeatedly to load new results into the modal:
driver.execute_script("$scrollable_data_ctrl.load_next();")
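Whichever way the next batch is triggered, the loop needs a stop condition: wait briefly for the item count to grow, and give up when it does not. A minimal sketch of that check in plain Python; `wait_for_growth` is a hypothetical helper, and with Selenium `count_items` could be something like `lambda: len(driver.find_elements(*song_locator))`:

```python
import time


def wait_for_growth(count_items, previous, timeout=3.0, poll=0.2):
    """Poll count_items() until it exceeds `previous` or `timeout` elapses.

    Returns the new count, or None if nothing new appeared in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = count_items()
        if current > previous:
            return current
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

A return value of None plays the same role as the TimeoutException in the answer above: no more songs arrived, so the list is complete.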