I am trying to get all the data from this table. However, the last row is a "Load More" table row that I do not know how to load. So far I have tried different approaches that did not work.
First, I tried to click on the row itself:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find('table', {"class": "competition-leaderboard__table"})
i = 0
for team in table.find_all('tbody'):
    rows = team.find_all('tr')
    for row in rows:
        i = i + 1
        if (i == 51):
            row.click()
        # the scraping code for the first 50 elements
The code above throws an error saying that "'NoneType' object is not callable".
Another thing I tried that did not work is the following:
I tried to get the "Load More" table row by its class and click on it.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)
load_more = driver.find_element_by_class_name('competition-leaderboard__load-more-wrapper')
load_more.click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
The code above also did not work.
So my question is: how can I make Python click on the "Load More" table row, given that in the site's HTML structure "Load More" does not seem to be a clickable button?
In your code you have to accept the cookies first, and then you can click the 'Load more' button.
CSS selectors are the most suitable in this case.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.implicitly_wait(10)
driver.get('https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/leaderboard')
wait = WebDriverWait(driver, 30)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".sc-pAyMl.dwWbEz .sc-AxiKw.kOAUSS>.sc-AxhCb.gsXzyw")))
cookies = driver.find_element_by_css_selector(".sc-pAyMl.dwWbEz .sc-AxiKw.kOAUSS>.sc-AxhCb.gsXzyw").click()
load_more = driver.find_element_by_css_selector(".competition-leaderboard__load-more-count").click()
time.sleep(10) # Added for you to make sure that both buttons were clicked
driver.close()
driver.quit()
I tested this snippet and it clicked the desired button.
Note that I've added WebDriverWait in order to wait until the first button is clickable.
UPDATE:
I added time.sleep(10) so you could see that both buttons are clicked.
https://fbref.com/en/squads/0cdc4311/Augsburg-Stats provides buttons to transform a table to CSV, which I would like to scrape. I click the buttons like this:
elements = driver.find_elements(By.XPATH, '//button[text()="Get table as CSV (for Excel)"]')
for element in elements:
    element.click()
but I get an exception
ElementNotInteractableException: Message: element not interactable
This is the element I am trying to click.
Here's the full code (I added Adblock plus as a Chrome extension, which should be configured to test locally):
import pandas as pd
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
import time
import os
#activate adblock plus
path_to_extension = '/home/andreas/.config/google-chrome/Default/Extensions/cfhdojbkjhnklbpkdaibdccddilifddb/3.11.4_0'
options = Options()
options.add_argument('load-extension=' + path_to_extension)
#uses Chrome driver in usr/bin/ from https://chromedriver.chromium.org/downloads
driver = webdriver.Chrome(options=options)
#wait and switching back to tab with desired source
time.sleep(5)
driver.switch_to.window(driver.window_handles[0])
NO_OF_PREV_SEASONS = 5
df = pd.DataFrame()
urls = ['https://fbref.com/en/squads/247c4b67/Arminia-Stats']
for url in urls:
    driver.get(url)
    html = driver.page_source
    soup = bs4.BeautifulSoup(html, 'html.parser')

    #click button -> accept cookies
    element = driver.find_element(By.XPATH, '//button[text()="AGREE"]')
    element.click()

    for i in range(NO_OF_PREV_SEASONS):
        elements = driver.find_elements(By.XPATH, '//button[text()="Get table as CSV (for Excel)"]')
        for element in elements:
            element.click()
        #todo: get data

        #click button -> navigate to next page
        time.sleep(5)
        element = driver.find_element(By.LINK_TEXT, "Previous Season")
        element.click()

driver.quit()
The button is inside a drop-down list (i.e. <span>Share & Export</span>), so you need to hover over it first, e.g.:
from selenium.webdriver.common.action_chains import ActionChains
action_chain = ActionChains(driver)
hover = driver.find_element_by_xpath("//span[contains(text(),'Share & Export')]")
action_chain.move_to_element(hover).perform() # hover to show drop down list
driver.execute_script("window.scrollTo(0, 200)") # scroll down a bit
time.sleep(1) # wait for scrolling
button = driver.find_element_by_xpath("//button[contains(text(),'Get table as CSV (for Excel)')]")
action_chain.move_to_element(button).click().perform() # move to button and click
time.sleep(3)
This also happens to me sometimes. One way to overcome this problem is by getting the X and Y coordinates of this button and clicking on it.
import pyautogui

for element in elements:
    element_pos = element.location
    element_size = element.size
    x_coordinate, y_coordinate = element_pos['x'], element_pos['y']
    e_width, e_height = element_size['width'], element_size['height']
    click_x = x_coordinate + e_width/2
    click_y = y_coordinate + e_height/2
    pyautogui.click(click_x, click_y)
Another solution you may try is to click on the tag that contains this button.
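For instance, a minimal sketch of that second idea, reusing the CSV-button XPath from the question (the JavaScript click avoids Selenium's "element not interactable" check, so it is worth trying even while the drop-down is closed):

from selenium.webdriver.common.by import By

# Sketch: click the tag that wraps the CSV button via JavaScript,
# which bypasses Selenium's interactability check.
parents = driver.find_elements(By.XPATH, '//button[text()="Get table as CSV (for Excel)"]/parent::*')
for parent in parents:
    driver.execute_script("arguments[0].click();", parent)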
There are several issues here:
You have to click and open the Share & Export tab and then click the Get table as CSV button.
You have to scroll the page to access the tables after the first one.
So, your code can be something like this:
import pandas as pd
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options
import time
import os
#activate adblock plus
path_to_extension = '/home/andreas/.config/google-chrome/Default/Extensions/cfhdojbkjhnklbpkdaibdccddilifddb/3.11.4_0'
options = Options()
options.add_argument('load-extension=' + path_to_extension)
options.add_argument("window-size=1920,1080")
#uses Chrome driver in usr/bin/ from https://chromedriver.chromium.org/downloads
driver = webdriver.Chrome(options=options)
actions = ActionChains(driver)
wait = WebDriverWait(driver, 10)
#wait and switching back to tab with desired source
time.sleep(5)
#driver.switch_to.window(driver.window_handles[0])
NO_OF_PREV_SEASONS = 5
df = pd.DataFrame()
urls = ['https://fbref.com/en/squads/247c4b67/Arminia-Stats']
for url in urls:
    driver.get(url)
    html = driver.page_source
    soup = bs4.BeautifulSoup(html, 'html.parser')

    #click button -> accept cookies
    element = driver.find_element(By.XPATH, '//button[text()="AGREE"]')
    element.click()

    for i in range(NO_OF_PREV_SEASONS):
        elements = driver.find_elements(By.XPATH, "//div[@class='section_heading_text']//li[@class='hasmore']")
        for element in elements:
            actions.move_to_element(element).perform()
            time.sleep(0.5)
            element.click()
            wait.until(EC.visibility_of_element_located((By.XPATH, "//button[@tip='Get a link directly to this table on this page']"))).click()
        #todo: get data
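As an aside, if the goal is just the tabular data rather than the CSV text itself, a simpler route (my suggestion, not part of the answer above) is to let pandas parse whatever tables are present in the rendered DOM:

import pandas as pd

# Parse every <table> currently in the rendered page into a DataFrame.
# Requires lxml or html5lib to be installed.
tables = pd.read_html(driver.page_source)
print(len(tables), "tables found")
print(tables[0].head())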
I basically want to click every "load more" button on the page before running the rest of my code, because otherwise I won't be able to access each profile.
There are 2 problems:
First, how do I even access the button? I tried methods similar to the fancyCompLabel part of my code, but it won't work.
Second, I'm not sure how I should loop through all the buttons, since I would assume the second button only starts loading after the first one is clicked.
Here is the relevant HTML part of the button:
<span type="button" class="md-text-button button-orange-white" onclick="loadFollowing();">mehr anzeigen</span>
Here's the code to access each profile, but as you can see it only runs up to the first load button.
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
time.sleep(3)
# Set some Selenium Options
options = webdriver.ChromeOptions()
# options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Webdriver
wd = webdriver.Chrome(executable_path='/usr/bin/chromedriver', options=options)
# URL
url = 'https://www.techpilot.de/zulieferer-suchen?laserschneiden%202d%20(laserstrahlschneiden)'
# Load URL
wd.get(url)
# Get HTML
soup = BeautifulSoup(wd.page_source, 'html.parser')
wait = WebDriverWait(wd, 15)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#bodyJSP #CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll"))).click()
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "#efficientSearchIframe")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".hideFunctionalScrollbar #CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll"))).click()
#wd.switch_to.default_content() # you do not need to switch to default content because iframe is closed already
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".fancyCompLabel")))
results = wd.find_elements_by_css_selector(".fancyCompLabel")
for profil in results:
    print(profil.text) #heres the rest of my code but its not relevant
wd.close()
As far as I can see, the second pop-up element, located by .hideFunctionalScrollbar #CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll, initially appears outside the visible screen.
So after switching to the iframe you need to scroll to that element before trying to click on it.
Also, presence_of_all_elements_located doesn't actually wait for the presence of all the elements; it doesn't even know how many such elements there will be. It returns once it finds at least one element matching the passed locator.
So I'd advise adding a short sleep after that line to allow all those elements to actually load.
from selenium.webdriver.common.action_chains import ActionChains
wait = WebDriverWait(wd, 15)
actions = ActionChains(wd)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#bodyJSP #CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll"))).click()
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "#efficientSearchIframe")))
second_pop_up = wd.find_element_by_css_selector(".hideFunctionalScrollbar #CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll")
actions.move_to_element(second_pop_up).perform()
time.sleep(0.5)
second_pop_up.click()
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".fancyCompLabel")))
time.sleep(0.5)
results = wd.find_elements_by_css_selector(".fancyCompLabel")
for profil in results:
    print(profil.text) #heres the rest of my code but its not relevant
wd.close()
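If you'd rather avoid a fixed sleep, a small sketch of an alternative (my own addition, with a hypothetical helper name) is to poll until the number of matching elements stops growing between checks:

import time

def wait_for_stable_count(wd, css, timeout=15, interval=0.5):
    # Poll until the number of elements matching `css` is non-zero and
    # unchanged between two consecutive checks, or the timeout expires.
    end = time.time() + timeout
    last = -1
    while time.time() < end:
        current = len(wd.find_elements_by_css_selector(css))
        if current == last and current > 0:
            return current
        last = current
        time.sleep(interval)
    return last

wait_for_stable_count(wd, ".fancyCompLabel")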
So I am scraping reviews and skin types from Sephora and have run into a problem identifying how to get elements off the page.
Sephora.com loads reviews dynamically as you scroll down the page, so I have switched from BeautifulSoup to Selenium to get the reviews.
The reviews have no ID, no name, nor a CSS identifier that seems to be stable. The XPath doesn't seem to be recognized each time I try to use it, whether copied from Chrome or from Firefox.
Here is an example of the HTML from the inspected element that I loaded in Chrome:
Inspect Element view from the desired page
My Attempts thus far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome("/Users/myName/Downloads/chromedriver")
url = 'https://www.sephora.com/product/the-porefessional-face-primer-P264900'
driver.get(url)
reviews = driver.find_elements_by_xpath(
    "//div[@id='ratings-reviews']//div[@data-comp='Ellipsis Box ']")
print("REVIEWS:", reviews)
Output:
| => /Users/myName/anaconda3/bin/python "/Users/myName/Documents/ScrapeyFile Group/attempt32.py"
REVIEWS: []
(base)
So basically an empty list.
ATTEMPT 2:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
# Open up a Firefox browser and navigate to web page.
driver = webdriver.Firefox()
driver.get(
"https://www.sephora.com/product/squalane-antioxidant-cleansing-oil-P416560?skuId=2051902&om_mmc=ppc-GG_1165716902_56760225087_pla-420378096665_2051902_257731959107_9061275_c&country_switch=us&lang=en&ds_rl=1261471&gclid=EAIaIQobChMIisW0iLbK6AIVaR6tBh005wUTEAYYBCABEgJVdvD_BwE&gclsrc=aw.ds"
)
#Scroll to bottom of page b/c its dynamically loading
html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.END)
#scrape stats and comments
comments = driver.find_elements_by_css_selector("div.css-7rv8g1")
print("!!!!!!Comments!!!!!")
print(comments)
OUTPUT:
| => /Users/MYNAME/anaconda3/bin/python /Users/MYNAME/Downloads/attempt33.py
!!!!!!Comments!!!!!
[]
(base)
Empty again. :(
I get the same results when I try to use different element selectors:
#scrape stats and comments
comments = driver.find_elements_by_class_name("css-7rv8g1")
I also get nothing when I tried this:
comments = driver.find_elements_by_xpath(
    "//div[@data-comp='GridCell Box']//div[@data-comp='Ellipsis Box ']")
And this (notice the space after 'Ellipsis Box' is gone):
comments = driver.find_elements_by_xpath(
    "//div[@data-comp='GridCell Box']//div[@data-comp='Ellipsis Box']")
I have tried using the solutions outlined here and here, but to no avail. I think there is something about the page or Selenium that I don't understand and am missing, since this is my first time using Selenium, so I'm a super newbie :(
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Chrome(executable_path=r"")
driver.maximize_window()
wait = WebDriverWait(driver, 20)
driver.get("https://www.sephora.fr/p/black-ink---classic-line-felt-liner---eyeliner-feutre-precis-waterproof-P3622017.html")
scrolls = 1
while True:
    scrolls -= 1
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    if scrolls < 0:
        break

reviewText = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//ol[@class='bv-content-list bv-content-list-reviews']//li//div[@class='bv-content-summary-body']//div[1]")))
for textreview in reviewText:
    print(textreview.text)
I've been scraping reviews from Sephora and basically, even if there is plenty of room for improvement, it works like this:
Clicks on "reviews" to access the reviews
Loads all reviews by scrolling until there aren't any reviews left to load
Finds review text and skin type by CSS selector
def load_all_reviews(driver):
    while True:
        try:
            driver.execute_script(
                "arguments[0].scrollIntoView(true);",
                WebDriverWait(driver, 10).until(
                    EC.visibility_of_element_located(
                        (By.CSS_SELECTOR, ".bv-content-btn-pages-load-more")
                    )
                ),
            )
            driver.execute_script(
                "arguments[0].click();",
                WebDriverWait(driver, 20).until(
                    EC.element_to_be_clickable(
                        (By.CSS_SELECTOR, ".bv-content-btn-pages-load-more")
                    )
                ),
            )
        except Exception as e:
            break
def get_review_text(review):
    try:
        return review.find_element(By.CLASS_NAME, "bv-content-summary-body-text").text
    except:
        return "NA"  # in case it doesn't find a review

def get_skin_type(review):
    try:
        return review.find_element(By.XPATH, '//*[@id="BVRRContainer"]/div/div/div/div/ol/li[2]/div[1]/div/div[2]/div[5]/ul/li[4]/span[2]').text
    except:
        return "NA"  # in case it doesn't find a skin type
To use those, you've got to create a webdriver and first call the load_all_reviews() function.
Then you've got to find the reviews with:
reviews = driver.find_elements(By.CSS_SELECTOR, ".bv-content-review")
and finally, for each review, you can call the get_review_text() and get_skin_type() functions:
for review in reviews:
    print(get_review_text(review))
    print(get_skin_type(review))
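Putting it together, a minimal end-to-end sketch under those assumptions (the product URL is the one from the question's first attempt; any cookie banner handling is left out):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.sephora.com/product/the-porefessional-face-primer-P264900")

load_all_reviews(driver)  # scroll and click "load more" until no button is left
reviews = driver.find_elements(By.CSS_SELECTOR, ".bv-content-review")
for review in reviews:
    print(get_review_text(review))
    print(get_skin_type(review))
driver.quit()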
I'm trying to scrape more than 10 pages of reviews from https://www.innisfree.com/kr/ko/ProductReviewList.do
However, when I move to the next page and try to get the new page's reviews, I still get only the first page's reviews.
I used driver.execute_script("goPage(2)") and also time.sleep(5), but my code only gives me the first page's reviews.
''' I did not use a for-loop, just to see whether the results differ between page 1 and page 2 '''
''' I imported BeautifulSoup and Selenium '''
Here is my code:
url = "https://www.innisfree.com/kr/ko/ProductReviewList.do"
chromedriver = r'C:\Users\hhm\Downloads\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(chromedriver)
driver.get(url)
print("this is page 1")
driver.execute_script("goPage(1)")
nTypes = soup.select('.reviewList ul .newType div[class^=reviewCon] .reviewConTxt')
for nType in nTypes:
    product = nType.select_one('.pdtName').text
    print(product)
print('\n')
print("this is page 2")
driver.execute_script("goPage(2)")
time.sleep(5)
nTypes = soup.select('.reviewList ul .newType div[class^=reviewCon] .reviewConTxt')
for nType in nTypes:
    product = nType.select_one('.pdtName').text
    print(product)
If your second page opens as a new window, then you need to switch to it and point your Selenium control at the new window.
Example:
# Opens a new tab
self.driver.execute_script("window.open()")
# Switch to the newly opened tab
self.driver.switch_to.window(self.driver.window_handles[1])
Source:
How to switch to new window in Selenium for Python?
https://www.techbeamers.com/switch-between-windows-selenium-python/
Try the following code. You need to click on each pagination link to reach the next page. You will get all 100 review comments.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
url = "https://www.innisfree.com/kr/ko/ProductReviewList.do"
chromedriver = r'C:\Users\hhm\Downloads\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(chromedriver)
driver.get(url)
for i in range(2, 12):
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    nTypes = soup.select('.reviewList ul .newType div[class^=reviewCon] .reviewConTxt')
    for nType in nTypes:
        product = nType.select_one('.pdtName').text
        print(product)
    if i == 11:
        break
    nextbutton = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='num']/a[text()='" + str(i) + "']")))
    driver.execute_script("arguments[0].click();", nextbutton)
I am trying to scrape data from a website that returns search results spanning multiple pages, using Selenium and BeautifulSoup in Python. The first page is easy to read. Moving to the next page requires clicking the '>' button. The element looks like this:
<a href ng-click="selectPage(page + 1, $event)" class="ng-binding">Next
I tried the following:
browser = webdriver.Chrome()
browser.get ("https:www....com/search/?lat=dfdfd ")
page = browser.page_source
soup = BeautifulSoup(page, 'html.parser')
# scraping the first page
#now need to click on the ">" , so that it can take me to the next page
Control should go to the next page so that I can scrape it. There are about 250 pages of results.
In Chrome, if you right-click the page, there will be an option called "Inspect" in the context menu. Click that and find the element in the HTML. Once you find it, right-click it and go Copy > Copy XPath. You can then use the browser.find_element_by_xpath method to assign that element to a variable, and use element.click() to click it.
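For example, assuming the copied XPath matched the anchor shown in the question (hypothetical; your copied XPath will differ), it would look like this:

# Hypothetical XPath based on the <a> element shown in the question.
next_link = browser.find_element_by_xpath('//a[@ng-click="selectPage(page + 1, $event)"]')
next_link.click()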
Well, since you didn't provide the URL, I'll show an example of how to solve this.
I'm assuming the button has an ID, but you can change this to find it by class, etc.
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = Chrome()
browser.get("https:www....com/search/?lat=dfdfd ")
page = browser.page_source
soup = BeautifulSoup(page, 'html.parser')
wait = WebDriverWait(browser, 30)
wait.until(EC.visibility_of_element_located((By.ID, 'next-button')))
# Next page
browser.find_element_by_id('next-button').click()
# Continue your code ...
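And since the question mentions about 250 pages, one way to extend this (a sketch, still assuming the hypothetical 'next-button' ID) is to loop until the button stops appearing:

# Sketch: page through all results, re-parsing each page's source.
while True:
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    # ... scrape the current page with soup here ...
    try:
        wait.until(EC.element_to_be_clickable((By.ID, 'next-button'))).click()
    except Exception:
        break  # no clickable next button; last page reached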