Scraper getting only a few names out of numerous - python

I've written a scraper in Python in combination with Selenium to get all the product names from redmart.com. Every time I run my code, I get only 27 names from that page, although the page has numerous names. FYI, the page uses lazy loading. My scraper can reach the bottom of the page but scrapes only 27 names. I can't see where the logic in my scraper goes wrong. I hope to find a workaround.
Here is the script I've written so far:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://redmart.com/new")

check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        wait.until(lambda driver: driver.execute_script("return document.body.scrollHeight;") > check_height)
        check_height = driver.execute_script("return document.body.scrollHeight;")
    except:
        break

for names in driver.find_elements_by_css_selector('.description'):
    item_name = names.find_element_by_css_selector('h4 a').text
    print(item_name)

driver.quit()

You have to wait for new content to be loaded.
Here is a very simple example:
driver.get('https://redmart.com/new')
products = driver.find_elements_by_xpath('//div[@class="description"]/h4/a')
print(len(products))  # 18 products
driver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
time.sleep(5)  # wait for new content to be loaded
products = driver.find_elements_by_xpath('//div[@class="description"]/h4/a')
print(len(products))  # 36 products
It works.
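If you would rather keep scrolling until nothing new loads, a minimal sketch of the same idea in a loop (using the selector above and a fixed sleep as a crude wait; this is just one way to do it) could be:
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://redmart.com/new')

product_xpath = '//div[@class="description"]/h4/a'
last_count = 0
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(5)  # give the lazy loader time to fetch the next batch
    products = driver.find_elements_by_xpath(product_xpath)
    if len(products) == last_count:  # nothing new appeared, assume we reached the end
        break
    last_count = len(products)

for product in products:
    print(product.text)
driver.quit()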
You can also look at the XHR requests and try to scrape whatever you want without using "time.sleep()" and "driver.execute_script".
For example, while scrolling their website, new products are loaded from this URL:
https://api.redmart.com/v1.6.0/catalog/search?q=new&pageSize=18&page=1
As you can see, it is possible to modify parameters like pageSize (max 100 products) and page. With this URL you can scrape all products without even using Selenium and Chrome. You can do all of this with Python Requests.
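For instance, a minimal sketch with requests might look like the following; the "products" and "title" field names are assumptions about the JSON shape, so inspect the real response in your browser's network tab before relying on them:
import requests

url = "https://api.redmart.com/v1.6.0/catalog/search"
page = 0
while True:
    resp = requests.get(url, params={"q": "new", "pageSize": 100, "page": page})
    resp.raise_for_status()
    # "products" and "title" are assumed key names -- check the actual JSON.
    products = resp.json().get("products", [])
    if not products:
        break
    for product in products:
        print(product.get("title"))
    page += 1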

Related

How to do pagination with scroll in Selenium?

I need to do pagination for this page:
I read this question and I tried this:
scrolls = 10
while True:
    scrolls -= 1
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    if scrolls < 0:
        break
I need to scroll down to get all the products, but I don't know how many times I need to scroll.
I also tried using a very tall window
'SELENIUM_DRIVER_ARGUMENTS': ['--no-sandbox', '--window-size=1920,30000'],
and scroll down
time.sleep(10)
self.driver.execute_script("window.scrollBy(0, 30000);")
Does anyone have an idea how to get all the products?
I'm open to another solution, if Selenium is not the best for this case.
Thanks.
UPDATE 1:
I need to have all the product IDs. To get the product IDs I use this:
products = response.css('div.jfJiHa > .iepIep')
for product in products:
    detail_link = product.css('a.jXwbaQ::attr("href")').get()
    product_id = re.findall(r'products/(\d+)', detail_link)[0]
As commented, without seeing your whole spider it is hard to see where you are going wrong here, but if we assume that your parsing uses the Scrapy response, then that is why you are always getting just 30 products.
You need to create a new selector from the driver after each scroll and query that. A full example of code that gets 300 items from the page is:
import re
import time
from pprint import pprint

import parsel
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import Firefox

with Firefox() as driver:
    driver.get("https://www.compraonline.bonpreuesclat.cat/products/search?q=pasta")
    all_items = {}
    while True:
        sel = parsel.Selector(driver.page_source)
        for product in sel.css("div[data-test] h3 > a"):
            name = product.css("::text").get()
            product_id = re.search(r"(\d+)", product.attrib["href"]).group()
            all_items[product_id] = name
        try:
            element = driver.find_element_by_css_selector(
                "div[data-test] + div.iepIep:not([data-test])"
            )
        except NoSuchElementException:
            break
        driver.execute_script("arguments[0].scrollIntoView(true);", element)
        time.sleep(1)

pprint(all_items)
print("Number of items =", len(all_items))
The key bits of this:
After getting the page using driver.get we start looping
We create a new Selector (here I directly use parsel.Selector which is what scrapy uses internally)
We extract the info we need. Displayed products all have a data-test attribute. If this was a scrapy.Spider I'd yield the information, but here I just add it to a dictionary of all items.
After getting all the visible items, we try to find the first following sibling of a div with a data-test attribute that doesn't itself have a data-test attribute (using the CSS + combinator)
If no such element exists (because we have seen all items) then break out of the loop, otherwise scroll that element into view and pause a second
Repeat until all items have been parsed
Try scrolling down one visible screen height at a time, reading the presented products each time, until the //button[@data-test='footer-feedback-button'] or any other element located at the bottom becomes visible
This code may help -
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 30)
driver.get('https://www.compraonline.bonpreuesclat.cat/products/search?q=pasta')

BaseDivs = driver.find_elements_by_xpath("//div[contains(@class,\"base__Wrapper\")]")
for div in BaseDivs:
    try:
        wait.until(EC.visibility_of_element_located((By.XPATH, "./descendant::img")))
        driver.execute_script("return arguments[0].scrollIntoView(true);", div)
    except StaleElementReferenceException:
        continue
This code will wait for the image to load and then focus on the element. This way it will automatically scroll down until the end of the page.
Mark it as the answer if this is what you are looking for.
I solved my problem, but not with Selenium. We can get all the products of the search with another request:
https://www.compraonline.bonpreuesclat.cat/api/v4/products/search?limit=1000&offset=0&sort=favorite&term=pasta
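A rough sketch of that approach with requests; the "products" key below is an assumption about the JSON layout, so check the actual response before using it:
import requests

url = "https://www.compraonline.bonpreuesclat.cat/api/v4/products/search"
params = {"limit": 100, "offset": 0, "sort": "favorite", "term": "pasta"}
all_products = []
while True:
    data = requests.get(url, params=params).json()
    batch = data.get("products", [])  # assumed key name -- inspect the JSON
    if not batch:
        break
    all_products.extend(batch)
    params["offset"] += params["limit"]
print("Total products:", len(all_products))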

Not sure how to get elements from dynamically loading webpage using selenium

So I am scraping reviews and skin type from Sephora and have run into a problem figuring out how to get elements off the page.
Sephora.com loads reviews dynamically after you scroll down the page, so I have switched from Beautiful Soup to Selenium to get the reviews.
The reviews have no ID, no name, nor a CSS identifier that seems to be stable. The XPath doesn't seem to be recognized whenever I try to use it by copying it from Chrome or Firefox.
Here is an example of the HTML from the inspected element that I loaded in chrome:
Inspect Element view from the desired page
My Attempts thus far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("/Users/myName/Downloads/chromedriver")
url = 'https://www.sephora.com/product/the-porefessional-face-primer-P264900'
driver.get(url)
reviews = driver.find_elements_by_xpath(
    "//div[@id='ratings-reviews']//div[@data-comp='Ellipsis Box ']")
print("REVIEWS:", reviews)
Output:
| => /Users/myName/anaconda3/bin/python "/Users/myName/Documents/ScrapeyFile Group/attempt32.py"
REVIEWS: []
(base)
So basically an empty list.
ATTEMPT 2:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
# Open up a Firefox browser and navigate to web page.
driver = webdriver.Firefox()
driver.get(
    "https://www.sephora.com/product/squalane-antioxidant-cleansing-oil-P416560?skuId=2051902&om_mmc=ppc-GG_1165716902_56760225087_pla-420378096665_2051902_257731959107_9061275_c&country_switch=us&lang=en&ds_rl=1261471&gclid=EAIaIQobChMIisW0iLbK6AIVaR6tBh005wUTEAYYBCABEgJVdvD_BwE&gclsrc=aw.ds"
)
#Scroll to bottom of page b/c its dynamically loading
html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.END)
#scrape stats and comments
comments = driver.find_elements_by_css_selector("div.css-7rv8g1")
print("!!!!!!Comments!!!!!")
print(comments)
OUTPUT:
| => /Users/MYNAME/anaconda3/bin/python /Users/MYNAME/Downloads/attempt33.py
!!!!!!Comments!!!!!
[]
(base)
Empty again. :(
I get the same results when I try to use different element selectors:
#scrape stats and comments
comments = driver.find_elements_by_class_name("css-7rv8g1")
I also get nothing when I tried this:
comments = driver.find_elements_by_xpath(
    "//div[@data-comp='GridCell Box']//div[@data-comp='Ellipsis Box ']")
and this (notice the space after 'Ellipsis Box' is gone):
comments = driver.find_elements_by_xpath(
    "//div[@data-comp='GridCell Box']//div[@data-comp='Ellipsis Box']")
I have tried using the solutions outlined here and here, but to no avail -- I think there is something I don't understand about the page or Selenium that I am missing, since this is my first time using Selenium, so I'm a super newbie :(
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
import time

driver = webdriver.Chrome(executable_path=r"")
driver.maximize_window()
wait = WebDriverWait(driver, 20)
driver.get("https://www.sephora.fr/p/black-ink---classic-line-felt-liner---eyeliner-feutre-precis-waterproof-P3622017.html")

scrolls = 1
while True:
    scrolls -= 1
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    if scrolls < 0:
        break

reviewText = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//ol[@class='bv-content-list bv-content-list-reviews']//li//div[@class='bv-content-summary-body']//div[1]")))
for textreview in reviewText:
    print(textreview.text)
I've been scraping reviews from Sephora and basically, even if there is plenty of room for improvement, it works like this:
Clicks on "reviews" to access reviews
Loads all reviews by scrolling until there aren't any review left to load
Finds review text and skin type by CSS SELECTOR
def load_all_reviews(driver):
    while True:
        try:
            driver.execute_script(
                "arguments[0].scrollIntoView(true);",
                WebDriverWait(driver, 10).until(
                    EC.visibility_of_element_located(
                        (By.CSS_SELECTOR, ".bv-content-btn-pages-load-more")
                    )
                ),
            )
            driver.execute_script(
                "arguments[0].click();",
                WebDriverWait(driver, 20).until(
                    EC.element_to_be_clickable(
                        (By.CSS_SELECTOR, ".bv-content-btn-pages-load-more")
                    )
                ),
            )
        except Exception as e:
            break
def get_review_text(review):
    try:
        return review.find_element(By.CLASS_NAME, "bv-content-summary-body-text").text
    except:
        return "NA"  # in case it doesn't find a review

def get_skin_type(review):
    try:
        return review.find_element(By.XPATH, '//*[@id="BVRRContainer"]/div/div/div/div/ol/li[2]/div[1]/div/div[2]/div[5]/ul/li[4]/span[2]').text
    except:
        return "NA"  # in case it doesn't find a skin type
To use those, you've got to create a webdriver and first call the load_all_reviews() function.
Then you've got to find the reviews with:
reviews = driver.find_elements(By.CSS_SELECTOR, ".bv-content-review")
and finally, for each review, you can call the get_review_text() and get_skin_type() functions:
for review in reviews:
    print(get_review_text(review))
    print(get_skin_type(review))
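Putting it together, a rough usage sketch might look like this (it assumes the three functions above are defined in the same module, reuses the sephora.fr product URL from the earlier answer, and leaves out the site-specific click on the "reviews" tab):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.sephora.fr/p/black-ink---classic-line-felt-liner---eyeliner-feutre-precis-waterproof-P3622017.html")

# Step 1 from the list above (clicking "reviews") is site-specific and omitted here.
load_all_reviews(driver)  # keep clicking "load more" until no button is left

reviews = driver.find_elements(By.CSS_SELECTOR, ".bv-content-review")
for review in reviews:
    print(get_review_text(review))
    print(get_skin_type(review))

driver.quit()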

Scraper unable to get names from next pages

I've written a script in Python in combination with Selenium to parse names from a webpage. The data on that site is not JavaScript-driven; however, the next page links are within JavaScript. As the next page links of that webpage are of no use with the requests library, I have used Selenium to parse the data from that site, traversing 25 pages. The only problem I'm facing here is that although my scraper is able to reach the last page by clicking through all 25 pages, it only fetches the data from the first page. Moreover, the scraper keeps running even though it has finished clicking the last page. The next page links look exactly like javascript:nextPage();. Btw, the URL of that site never changes even if I click on the next page button. How can I get all the names from the 25 pages? The CSS selector I've used in my scraper is flawless. Thanks in advance.
Here is what I've written:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
while True:
    for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
        print(name.text)
    try:
        n_link = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='nextPage']")))
        driver.execute_script(n_link.get_attribute("href"))
    except:
        break
driver.quit()
You don't have to handle the "Next" button or somehow change the page number - all entries are already in the page source. Try the code below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
    print(name.get_attribute('textContent'))

driver.quit()
You can also try this solution if it's not mandatory for you to use Selenium:
import requests
from lxml import html
r = requests.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
source = html.fromstring(r.content)
for name in source.xpath("//table[@class='greygeneraltxt']//td[text() and position()>1]"):
    print(name.text)
It appears this can actually be done more simply than the current approach. After the driver.get call, you can simply use the page_source property to get the HTML behind it. From there you can pull out data from all 25 pages at once. To see how it's structured, just right-click and "view source" in Chrome.
html_string=driver.page_source
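For example, a minimal sketch that feeds page_source into lxml (reusing the table XPath from the requests answer above) might look like this:
from lxml import html
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")

# Parse the full rendered page in one go instead of querying elements one by one.
source = html.fromstring(driver.page_source)
for name in source.xpath("//table[@class='greygeneraltxt']//td[text() and position()>1]"):
    print(name.text)
driver.quit()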

Scroll modal window using Selenium in Python

I am trying to scrape links to song pages for some artists on genius.com, but I'm running into issues because the links to the individual song pages are displayed inside a popup modal window.
The modal window doesn't load all links in one go, and instead loads more content via ajax when you scroll down to the bottom of the modal.
I tried using code to scroll to the bottom of the page but unfortunately that just scrolled in the window behind the modal rather than the modal itself:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
So then I tried selecting the last element in the modal and scrolling to that (with the idea of doing that a few times until all song pages had been loaded), but it wouldn't scroll far enough to get the website to load more content:
last_element = driver.find_elements_by_xpath('//div[@class="mini_card-metadata"]')[-1]
last_element.location_once_scrolled_into_view
Here is my code so far:
import os
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_driver = "/Applications/chromedriver"
os.environ["webdriver.chrome.driver"] = chrome_driver
driver = webdriver.Chrome(chrome_driver)
base_url = 'https://genius.com/artists/Stormzy'
driver.get(base_url)
xpath_str = '//div[contains(text(),"Show all songs by Stormzy")]'
driver.find_element_by_xpath(xpath_str).click()
Is there a way to extract all the song page links for the artist?
Try the code below to get the required output:
from selenium import webdriver as web
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
driver = web.Chrome()
base_url = 'https://genius.com/artists/Stormzy'
driver.get(base_url)
# Open modal
driver.find_element_by_xpath('//div[normalize-space()="Show all songs by Stormzy"]').click()
song_locator = By.CSS_SELECTOR, 'a.mini_card.mini_card--small'
# Wait for first XHR complete
wait(driver, 10).until(EC.visibility_of_element_located(song_locator))
# Get current length of songs list
current_len = len(driver.find_elements(*song_locator))
while True:
    # Load new XHR until it's possible
    driver.find_element(*song_locator).send_keys(Keys.END)
    try:
        wait(driver, 3).until(lambda x: len(driver.find_elements(*song_locator)) > current_len)
        current_len = len(driver.find_elements(*song_locator))
    # Return full list of songs
    except TimeoutException:
        songs_list = [song.get_attribute('href') for song in driver.find_elements(*song_locator)]
        break

print(songs_list)
This should allow you to request new XHR content until the length of the songs list becomes constant, and finally return the list of links.
When you scroll to the bottom of the modal dialog, it calls
$scrollable_data_ctrl.load_next();
As an option, you can try executing it repeatedly as long as new results keep appearing in the modal:
driver.execute_script("$scrollable_data_ctrl.load_next();")

iterate over result pages using selenium and python: StaleElementReferenceException

I think people who understand the Selenium tool will laugh at this, but maybe you can share your knowledge, because I really want to be able to laugh about it too.
My code is this:
def getZooverLinks(country):
    global countries
    countries = country
    zooverWeb = "http://www.zoover.nl/"
    url = zooverWeb + country
    driver = webdriver.Firefox()
    driver.get(url)
    button = driver.find_element_by_class_name('next')
    links = []
    for page in xrange(1, 4):
        WebDriverWait(driver, 60).until(lambda driver: driver.find_element_by_class_name('next'))
        divList = driver.find_elements_by_class_name('blue2')
        for div in divList:
            hrefTag = div.find_element_by_css_selector('a').get_attribute('href')
            print(hrefTag)
            newLink = zooverWeb + hrefTag
            links.append(newLink)
        button.click()
        driver.implicitly_wait(10)
        time.sleep(60)
    return links
So I want to iterate over all result pages, each time collecting the links from the divs with class="blue2", and then follow the "next" link to get to the next result page.
But I always get a StaleElementReferenceException saying:
"Message: Element not found in the cache - perhaps the page has changed since it was looked up"
But the layout of the pages is always the same. So what is the problem here? Is the URL after the click not handed over to the driver because the page changes? How can I fix this?
It is a little bit tricky to follow the pagination on this particular site.
Here is the set of things that helped me to overcome the issue with StaleElementReferenceException:
find elements inside the loop since the page changes
use Explicit Waits to wait for the specific page numbers to become active
Working code:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

country = "albanie"
zooverWeb = "http://www.zoover.nl/"
url = zooverWeb + country

driver = webdriver.Firefox()
driver.get(url)
driver.implicitly_wait(10)

links = []
for page in xrange(1, 4):
    # tricky part - waiting for the page number on the top to appear
    if page > 1:
        WebDriverWait(driver, 60).until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, 'div.entityPagingTop strong'), str(page)))
    else:
        WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.CLASS_NAME, 'next')))

    divList = driver.find_elements_by_class_name('blue2')
    for div in divList:
        hrefTag = div.find_element_by_css_selector('a').get_attribute('href')
        newLink = zooverWeb + hrefTag
        links.append(newLink)

    driver.find_element_by_class_name("next").click()

print(links)
