I have been doing web scraping of review websites, but the time spent running the code is exhausting: it takes at least a day for roughly 20k reviews.
To solve this problem I've been looking for alternatives, and the most common suggestion is to parallelize the tasks.
So I tried to use ThreadPool from multiprocessing.dummy, but I always get the same error and I don't know how to solve it.
The common error obtained is:
Message: stale element reference: element is not attached to the page document (Session info: chrome=96.0.4664.110)
The code I developed is the following (all the extracted info is stored in lists, though I've considered using dicts; any suggestions would be appreciated):
Step 1: Start with an input about the search item you want to obtain reviews.
Step 2: Obtain all the items (urls) related to the input through pagination (click buttons if they enable extraction of more items) and also quick information shown as cover.
Step 3: For each item URL, load the page with driver.get(url) and extract detailed info about the product (current and previous price, average rating, color, shape, description... as much info as possible); all of this is stored in another list holding the details of every item.
Step 4: Still on the same URL, try/except whether "view all reviews" is accessible, click the corresponding button, and again paginate through all the URLs containing reviews for that specific item.
All of these steps complete successfully, and the time they take is nothing compared to extracting the reviews.
For example, for an item with 2K reviews the previous steps take around 10 minutes, but extracting the information from the reviews takes about 4 hours.
I have to admit that for each review I look up the elements (title, rating, verified, content, useful votes, review...) through the corresponding user id, so this might be part of why it takes so long.
Anyway, I tried to fix this problem with ThreadPool.
The function extract_info_per_review(url) is the following:
def extract_info_per_review(url):
    #header_reviews = ['Username','Rating','Title','Date','Size','Verified','Review','Images','Votes']
    driver.get(url)
    all_reviews_page = list()
    all_reviews = driver.find_elements(By.XPATH, "//div[@class='a-section review aok-relative']")
    for review in all_reviews:
        ### Option 1:
        user_id = review.get_attribute('id')
        try_data_review = list()
        ### Option 2:
        # try:
        #     user_id = review.get_attribute('id')
        #     try_data_review = list()
        # except:
        #     print('Not id found')
        #     pass
        ### Option 3:
        # try_data_review = list()
        # ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)
        # user_id = WebDriverWait(driver, 60, ignored_exceptions=ignored_exceptions).until(expected_conditions.presence_of_element_located(review.get_attribute('id')))
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='a-profile-name']".format(user_id)).text)
        except:
            try_data_review.append('Not username')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//i[@data-hook='review-star-rating']//span[@class='a-icon-alt']".format(user_id)).get_attribute('innerHTML'))
        except:
            try_data_review.append('Not rating')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='cr-original-review-content']".format(user_id)).text)
        except:
            try_data_review.append('Not title')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@data-hook='review-date']".format(user_id)).text)
        except:
            try_data_review.append('Not date')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//a[@data-hook='format-strip']".format(user_id)).text)
        except:
            try_data_review.append('Not size')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@data-action='reviews:filter-action:push-state']".format(user_id)).text)
        except:
            try_data_review.append('Not verified')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='a-size-base review-text review-text-content']".format(user_id)).text)
        except:
            try_data_review.append('Not review')
        try:
            try_images = list()
            images = driver.find_elements(By.XPATH, "//div[@id='{}']//div[@class='review-image-tile-section']//img[@alt='Imagen del cliente']".format(user_id))
            for image in images:
                try_images.append(image.get_attribute('src'))
            try_data_review.append(try_images)
        except:
            try_data_review.append('Not image')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='a-size-base a-color-tertiary cr-vote-text']".format(user_id)).text)
        except:
            try_data_review.append('Not votes')
        all_reviews_page.append(try_data_review)
    print('Review extraction done')
    return all_reviews_page
And the ThreadPool implementation is:
data_items_reviews = list()
try:
    reviews_view_sort()
    reviews_urls = list()
    urls_review_from_item = pagination_data_urls()
    time.sleep(2)
    ### V1: Too much time spent
    # for url in urls_review_from_item:
    #     reviews_urls.append(extract_info_per_review(url))
    #     time.sleep(2)
    # data_items_reviews.append(reviews_urls)
    ### V2: Try ThreadPool
    pool = ThreadPool(3) # 1,3,5,7
    results = pool.map(extract_info_per_review, urls_review_from_item)
    data_items_reviews.append(results)
except:
    print('Item without reviews')
    data_items_reviews.append('Item without reviews')
All the imports:
import random
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from multiprocessing.dummy import Pool as ThreadPool
import time
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
from selectorlib import Extractor
import os
from datetime import date
import shutil
import json
import pandas as pd
from datetime import datetime
import csv
The last imports come from another task related to storing the information more efficiently; I should ask that in a separate question.
I'm stuck, so I'm open to any recommendations.
Thanks to all of you.
'Message: stale element reference: element is not attached to the page document (Session info: chrome=96.0.4664.110)'
I believe you may need to create and manage separate Selenium driver instances for each thread or process. If thread 1 loads a page and then thread 2 loads a page, all the state attached to the driver on thread 1 is invalidated (including elements, URL, etc.). You need each thread to create its own driver, which will be a separate browser instance. You should be able to create quite a few browsers at the same time, each processing one URL at a time.
You could implement the multiprocessing via ThreadPool or ThreadPoolExecutor as you've started. I've had luck in the past using multiprocessing.Process along with multiprocessing.Queue (your queue could be a list of URLs, with each process parsing one URL at a time). Regardless, each thread or process needs to maintain its own Selenium driver instance.
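For illustration, here is a minimal sketch of that idea; the extraction inside the worker is deliberately simplified and the URLs are placeholders, not the question's real ones:

from multiprocessing.dummy import Pool as ThreadPool
from selenium import webdriver
from selenium.webdriver.common.by import By

def extract_info_per_review(url):
    driver = webdriver.Chrome()  # each call gets its own browser instance
    try:
        driver.get(url)
        reviews = driver.find_elements(By.XPATH, "//div[@class='a-section review aok-relative']")
        return [r.get_attribute('id') for r in reviews]  # simplified extraction
    finally:
        driver.quit()  # always release the browser, even on errors

urls_review_from_item = ["https://example.com/reviews?page=1",  # placeholder URLs
                         "https://example.com/reviews?page=2"]
pool = ThreadPool(3)
results = pool.map(extract_info_per_review, urls_review_from_item)
pool.close()
pool.join()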
A simple implementation could even skip any multiprocessing in Python by putting the whole job in a single scrape_one_url.py script (which creates its own browser/driver and does the work), with a separate script executing scrape_one_url.py at the system level. You could get parallelism just by starting 5-10 scrape_one_url.py processes at the same time on different URLs.
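A rough sketch of that system-level approach, assuming a hypothetical scrape_one_url.py that takes the target URL as its first command-line argument:

import subprocess

urls = ["https://example.com/item/1",  # placeholder URLs
        "https://example.com/item/2",
        "https://example.com/item/3"]

# one independent Python process (and therefore one browser) per URL
procs = [subprocess.Popen(["python", "scrape_one_url.py", url]) for url in urls]
for p in procs:
    p.wait()  # block until every scraper process has finished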
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.ThreadPool
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process
One note on threading vs multiprocessing in Python: threading will not help CPU-bound performance. This is well documented; see Multiprocessing vs Threading Python.
Related
I am using this code to scrape emails from Google search results. However, it only scrapes the first 10 results despite having 100 search results loaded.
Ideally, I would like for it to scrape all search results.
Is there a reason for this?
from selenium import webdriver
import time
import re
import pandas as pd
PATH = 'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)
time.sleep(10)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_excel('email_addresses_.xlsx',index=False)
# print(emails)
driver.close()
The code is working as expected and scraping 10 results, which is the default for Google search. You can use methods like find_element_by_xpath to find the 'Next' button and click it.
This operation needs to be repeated in a loop until sufficient results are collected. Refer to this for more details: selenium locating elements
For how to use the Selenium commands you can look them up on the web; I found one similar question which can provide some reference.
Following up on Bijendra's answer,
you could update the code as below:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import re
import pandas as pd
PATH = 'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
emails = []
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
for i in range(2):
    html = driver.page_source
    for e in re.findall(email_pattern, html):
        emails.append(e)
    a_attr = driver.find_element(By.ID, "pnnext")
    a_attr.click()
    time.sleep(2)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_csv('email_addresses_.csv',index=False)
driver.close()
You could either change the range value passed to the for loop, or replace the for loop entirely with a while loop, so instead of
for i in range(2):
You could do:
while len(emails) < 100:
Make sure to manage the timing around page navigation: wait for the next page to load before extracting the available emails, and only then click the next button on the search results page.
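A hedged sketch of such a wait, reusing the driver from the code above (the pnnext and search element ids are the locators Google currently exposes and may change):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
next_button = wait.until(EC.element_to_be_clickable((By.ID, "pnnext")))  # next-page link
next_button.click()
wait.until(EC.presence_of_element_located((By.ID, "search")))  # results container rendered again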
Make sure to refer to docs to get a clear idea of what you should do to achieve what you want to. Happy Hacking!!
Selenium loads its own clean browser profile, so your Google setting for 100 results has to be set in code; the default is 10 results, which is what you're getting. You will have better luck using query parameters and adding the one for the number of results to the end of your URL.
If you need further information on query parameters to achieve this, it's the second method described here:
tldevtech.com/how-to-show-100-results-per-page-in-google-search
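As a rough sketch of that second method, you can append the num parameter to the search URL to request up to 100 results in a single page load (Google may still cap or ignore it in some regions); this assumes the driver and target_url from the question's code:

# &num=100 asks Google for up to 100 results per page instead of paginating
driver.get(target_url + "&num=100")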
Experts here, I am looking for your help if you don't mind.
Recently I have been working on a web crawler using Scrapy and Selenium in Python, and I'm stuck.
I just want to ask whether it is possible that you still get empty results even if you've used the statement
WebDriverWait(driver, 100, 0.1).until(EC.presence_of_all_elements_located((By.XPATH,xxxxx)))
to get those elements. Also, it doesn't even take 100 seconds before it comes back empty. Why?
By the way, it is random: this can happen anywhere, anytime.
Does getting empty results have something to do with my network connection?
Could you help me or give me some opinions or suggestions about the question above?
Thanks a lot!
-----------------------supplementary notes-----------------------
Thanks for the heads up.
In summary, I used Scrapy and Selenium to crawl a site of reviews and write the username, posting time, comment content, etc. to a .xlsx file via pipeline.py. I wanted it to be as fast as possible while gathering complete information.
Each page has many comments, and because the review text is too long it gets collapsed, which means that almost every one of the ~20 comments per page has an expand button.
Therefore, I need to use Selenium to click the expand button and then use the driver to fetch the complete comment. Common sense dictates that it takes a bit of time to load after the expand button is clicked, and I believe the time it takes depends on the speed of the network, so using WebDriverWait seems to be a wise choice here. In my experience the default parameters timeout=10 and poll_frequency=0.5 seem too slow and error-prone, so I switched to timeout=100 and poll_frequency=0.1.
However, every time I run the project with scrapy crawl spider, several of the crawled comments come back empty, and each time the empty ones are in different places. I've thought about using time.sleep() to force a pause, and while that is certainly a reliable way to get complete information, it would add a lot of time if every page did that; it also looks inelegant and a little clumsy to me.
Have I expressed my question clearly?
-------------------------------add something--------------------------------
What I mean exactly by getting empty results is shown in the picture below.
---------------------------add my code--------------------------2022/5/18
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
full_content, words = [], []
unfolds = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//a[@class='unfold']")))
# Here's how I think about and design my loop body.
# I click the expand button, then grab the text, then put it away, then move on to the next one.
for i in range(len(unfolds)):
    unfolds[i].click()
    time.sleep(1)
    # After the javascript, the `div[@class='review-content clearfix']` appears,
    # and some of the full review content will be put in a `<p></p>` label
    find_full_content_p = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//div[@class='review-content clearfix']/p")))
    full_content_p = [j.text for j in find_full_content_p]
    # and some of it will just be put in `div[@class='review-content clearfix']` itself.
    find_full_content_div = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//div[@class='review-content clearfix']")))
    full_content_div = [j.text for j in find_full_content_div]
    # and I made a list merge
    full_content_p.extend(full_content_div)
    full_content.append("".join(full_content_p))
    words.append(len("".join(full_content_p)))
    time.sleep(1)
    # then put it away
    WebDriverWait(driver,100,0.1).until(EC.element_to_be_clickable((By.XPATH,"//a[@class='fold']"))).click()
driver.close()
pd.DataFrame({"users":users, "dates":dates, "full_content":full_content, "words":words})
AND this is the code from an expert I genuinely respect, named sound wave (slightly modified; the core code has not been changed).
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
# from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome()
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
reviews, words = [], []
for review in driver.find_elements(By.CSS_SELECTOR, 'div.review-short'):
    show_more = review.find_elements(By.CSS_SELECTOR, 'a.unfold')
    if show_more:
        # scroll to the show more button, needed to avoid ElementClickInterceptedException
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', show_more[0])
        show_more[0].click()
        review = review.find_element(By.XPATH, 'following-sibling::div')
        while review.get_attribute('class') == 'hidden':
            time.sleep(0.2)
    review = review.find_element(By.CSS_SELECTOR, 'div.review-content')
    reviews.append(review.text)
    words.append(len(review.text))
    print('done',len(reviews),end='\r')
pd.DataFrame({"users":users,"dates":dates,"reviews":reviews,"words":words})
NEW
Added code for the site douban. To export the scraped data to a csv see the pandas code in the OLD section below
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('...'))
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
reviews = []
for review in driver.find_elements(By.CSS_SELECTOR, 'div.review-short'):
    show_more = review.find_elements(By.CSS_SELECTOR, 'a.unfold')
    if show_more:
        # scroll to the show more button, needed to avoid ElementClickInterceptedException
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', show_more[0])
        show_more[0].click()
        review = review.find_element(By.XPATH, 'following-sibling::div')
        while review.get_attribute('class') == 'hidden':
            time.sleep(0.2)
    review = review.find_element(By.CSS_SELECTOR, 'div.review-content')
    reviews.append(review.text)
    print('done',len(reviews),end='\r')
OLD
For the website you mentioned (imdb.com), in order to scrape the hidden content there is no need to click on the show more button because the text is already loaded in the HTML code; it is simply not shown on the site. So you can scrape all the comments in a single pass. The code below stores users, dates and reviews in separate lists, and finally saves the data to a .csv file.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(chromedriver_path))
driver.get('https://www.imdb.com/title/tt1683526/reviews')
# sets a maximum waiting time for .find_element() and similar commands
driver.implicitly_wait(10)
reviews = [el.get_attribute('innerText') for el in driver.find_elements(By.CSS_SELECTOR, 'div.text')]
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span.display-name-link')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span.review-date')]
# store data in a csv file
import pandas as pd
df = pd.DataFrame(list(zip(users,dates,reviews)), columns=['user','date','review'])
df.to_csv(r'C:\Users\your_name\Desktop\data.csv', index=False)
To print a single review you can do something like this
i = 0
print(f'User: {users[i]}\nDate: {dates[i]}\n{reviews[i]}')
the output (truncated) is
User: dschmeding
Date: 26 February 2012
Wow! I was not expecting this movie to be this engaging. Its one of those films...
I am trying to scrape YouTube comments using Selenium with Python. Below is the code, which scrapes just one comment and throws an error.
driver = webdriver.Chrome()
url="https://www.youtube.com/watch?v=MNltVQqJhRE"
driver.get(url)
wait(driver, 5500)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 500);")
driver.implicitly_wait(5000)
#content = driver.find_element_by_xpath('//*[#id="contents"]')
comm=driver.find_element_by_xpath('//div[#class="style-scope ytd-item-section-renderer"]')
comm1=comm.find_elements_by_xpath('//yt-formatted-string[#id="content-text"]')
#print(comm.text)
for i in range(50):
    print(comm1[i].text,end=' ')
This is the output I am getting. How do I get all the comments on that page? Can anyone help me with this?
Being a sucessful phyton freelancer really mean to me because if I able to make $2000 in month I can really help my family financial, improve my skill, and have a lot of time to refreshing. So thanks Qazi, you really help me :D
Traceback (most recent call last):
File "C:\Python36\programs\Web scrap\YT_Comm.py", line 19, in <module>
print(comm1[i].text,end=' ')
IndexError: list index out of range
An IndexError means you’re attempting to access a position in a list that doesn’t exist. You’re iterating over your list of elements (comm1) exactly 50 times, but there are fewer than 50 elements in the list, so eventually you attempt to access an index that doesn’t exist.
Superficially, you can solve your problem by changing your iteration to loop over exactly as many elements as exist in your list—no more and no less:
for element in comm1:
    print(element.text, end=' ')
But that leaves you with the problem of why your list has fewer than 50 elements. The video you’re scraping has over 90 comments. Why doesn’t your list have all of them?
If you take a look at the page in your browser, you'll see that the comments load progressively using the infinite scroll technique: when the user scrolls to the bottom of the document, another "page" of comments are fetched and rendered, increasing the length of the document. To load more comments, you will need to trigger this behavior.
But depending on the number of comments, one fetch may not be enough. In order to trigger the fetch and rendering of all of the content, then, you will need to:
attempt to trigger a fetch of additional content, then
determine whether additional content was fetched, and, if so,
repeat (because there might be even more).
Triggering a fetch
We already know that additional content is fetched by scrolling to the bottom of the content container (the element with id #contents), so let's do that:
driver.execute_script(
    "window.scrollTo(0, document.querySelector('#contents').scrollHeight);")
(Note: Because the content resides in an absolute-positioned element, document.body.scrollHeight will always be 0 and will not trigger a scroll.)
Waiting for the content container
But as with any browser automation, we're in a race with the application: What if the content container hasn't rendered yet? Our scroll would fail.
Selenium provides WebDriverWait() to help you wait for the application to be in a particular state. It also provides, via its expected_conditions module, a set of common states to wait for, such as the presence of an element. We can use both of these to wait for the content container to be present:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
TIMEOUT_IN_SECONDS = 10
wait = WebDriverWait(driver, TIMEOUT_IN_SECONDS)
wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#contents")))
Determining whether additional content was fetched
At a high level, we can determine whether additional content was fetched by:
counting the content before we trigger the fetch,
counting the content after we trigger the fetch, then
comparing the two.
Counting the content
Within our container (with id "#contents"), each piece of content has id #content. To count the content, we can simply fetch each of those elements and use Python's built-in len():
count = len(driver.find_elements_by_css_selector("#contents #content"))
Handling a slow render
But again, we're in a race with the application: What happens if either the fetch or the render of additional content is slow? We won't immediately see it.
We need to give the web application time to do its thing. To do this, we can use WebDriverWait() with a custom condition:
def get_count():
    return len(driver.find_elements_by_css_selector("#contents #content"))

count = get_count()
# ...
wait.until(
    lambda _: get_count() > count)
Handling no additional content
But what if there isn't any additional content? Our wait for the count to increase will timeout.
As long as our timeout is high enough to allow sufficient time for the additional content to appear, we can assume that there is no additional content and ignore the timeout:
try:
    wait.until(
        lambda _: get_count() > count)
except TimeoutException:
    # No additional content appeared. Abort our loop.
    break
Putting it all together
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

TIMEOUT_IN_SECONDS = 10
wait = WebDriverWait(driver, TIMEOUT_IN_SECONDS)

driver.get(URL)
wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#contents")))

def get_count():
    return len(driver.find_elements_by_css_selector("#contents #content"))

while True:
    count = get_count()
    driver.execute_script(
        "window.scrollTo(0, document.querySelector('#contents').scrollHeight);")
    try:
        wait.until(
            lambda _: get_count() > count)
    except TimeoutException:
        # No additional content appeared. Abort our loop.
        break

elements = driver.find_elements_by_css_selector("#contents #content")
Bonus: Simplifying with capybara-py
With capybara-py, this becomes a bit simpler:
import capybara
from capybara.dsl import page
from capybara.exceptions import ExpectationNotMet

@capybara.register_driver("selenium_chrome")
def init_selenium_chrome_driver(app):
    from capybara.selenium.driver import Driver
    return Driver(app, browser="chrome")

capybara.current_driver = "selenium_chrome"
capybara.default_max_wait_time = 10

page.visit(URL)

contents = page.find("#contents")

elements = []
while True:
    try:
        elements = contents.find_all("#content", minimum=len(elements) + 1)
    except ExpectationNotMet:
        # No additional content appeared. Abort our loop.
        break
    page.execute_script(
        "window.scrollTo(0, arguments[0].scrollHeight);", contents)
I've written a script in Python using Selenium to scrape the name and price of different products from the Redmart website. My scraper clicks on a link, goes to its target page and parses the data from there. However, the issue I'm facing with this crawler is that it scrapes very few items from a page because of the webpage's lazy-loading behaviour. How can I get all the data from each page while controlling the lazy-loading process? I tried the execute_script method but I did it wrongly. Here is the script I'm trying with:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://redmart.com/bakery")
wait = WebDriverWait(driver, 10)
counter = 0
while True:
    try:
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "li.image-facets-pill")))
        driver.find_elements_by_css_selector('img.image-facets-pill-image')[counter].click()
        counter += 1
    except IndexError:
        break

    # driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    for elems in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.productPreview"))):
        name = elems.find_element_by_css_selector('h4[title] a').text
        price = elems.find_element_by_css_selector('span[class^="ProductPrice__"]').text
        print(name, price)
    driver.back()
driver.quit()
I guess you could use Selenium for this, but if speed is your concern, after @Andersson crafted the code for you in another question on Stack Overflow, you should instead replicate the API calls that the site uses and extract the data from the JSON, like the site does.
If you use Chrome Inspector you'll see that, for each of those categories in your outer while loop (the try block in your original code), the site calls an API that returns the overall categories of the site. All this data can be retrieved like so:
import requests

categories_api = 'https://api.redmart.com/v1.5.8/catalog/search?extent=0&depth=1'
r = requests.get(categories_api).json()
For the next API calls you need to grab the uris concerning the bakery stuff. This can be done like so:
bakery_item = [e for e in r['categories'] if e['title'] == 'Bakery']
children = bakery_item[0]['children']
uris = [c['uri'] for c in children]
Uris will now be a list of strings (['bakery-bread', 'breakfast-treats-212', 'sliced-bread-212', 'wraps-pita-indian-breads', 'rolls-buns-212', 'baked-goods-desserts', 'loaves-artisanal-breads-212', 'frozen-part-bake', 'long-life-bread-toast', 'speciality-212']) that you'll pass on to another API found by Chrome Inspector, and that the site uses to load content.
This API has the following form (default returns a smaller pageSize but I bumped it to 500 to be somewhat sure you get all data in one request):
items_API = 'https://api.redmart.com/v1.5.8/catalog/search?pageSize=500&sort=1024&category={}'
for uri in uris:
    r = requests.get(items_API.format(uri)).json()
    products = r['products']
    for product in products:
        name = product['title']
        # testing for promo_price - if it's 0.0 go with the normal price
        price = product['pricing']['promo_price']
        if price == 0.0:
            price = product['pricing']['price']
        print("Name: {}. Price: {}".format(name, price))
Edit: If you still want to stick to Selenium, you could insert something like this to handle the lazy loading. Questions on scrolling have been answered several times before, so yours is actually a duplicate. In the future you should show what you tried (your own effort on the execute_script part) and include the traceback.
check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = driver.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height
I've been googling this all day without finding the answer, so apologies in advance if this is already answered.
I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.
After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:
from selenium import webdriver
import codecs
filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')
driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")
allelements = driver.find_elements_by_xpath("//*")
ferdigtxt = []
for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)
filen.close()
driver.quit()
The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times; however, it only works as planned on some webpages (and it also makes the script A LOT slower).
I'm guessing the reason for my problem is that - when asking for the inner text of an element - I also get the inner text of the elements nested inside the element in question.
Is there any way around this? Is there some sort of master element I grab the inner text of? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated as I'm out of ideas for this one.
Edit: the reason I used Selenium and not Mechanize and Beautiful Soup is because I wanted JavaScript-rendered text.
Using lxml, you might try something like this:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean
url="http://www.yahoo.com"
ignore_tags=('script','noscript','style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url) # Load page
    content = browser.page_source
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    with open('/tmp/source.html','w') as f:
        f.write(content.encode('utf-8'))
    doc = LH.fromstring(content)
    with open('/tmp/result.txt','w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags: continue
            text = elt.text or ''
            tail = elt.tail or ''
            words = ' '.join((text,tail)).strip()
            if words:
                words = words.encode('utf-8')
                f.write(words+'\n')
This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
Here's a variation on @unutbu's answer:
#!/usr/bin/env python
import sys
from contextlib import closing
import lxml.html as html # pip install 'lxml>=2.3.1'
from lxml.html.clean import Cleaner
from selenium.webdriver import Firefox # pip install selenium
from werkzeug.contrib.cache import FileSystemCache # pip install werkzeug
cache = FileSystemCache('.cachedir', threshold=100000)
url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"
# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7) # week in seconds
# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>,<style>, etc
Cleaner(kill_tags=['noscript'], style=True)(root) # lxml >= 2.3.1
print root.text_content() # extract text
I've separated your task into two:
get page (including elements generated by javascript)
extract text
The code is connected only through the cache. You can fetch pages in one process and extract text in another process or defer to do it later using a different algorithm.
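For example, here is a minimal sketch of a separate extract-only pass that reuses the cache filled by the fetch step above; no browser is started, and the cache directory and calls are the same ones used in the script:

import sys
import lxml.html as html
from lxml.html.clean import Cleaner
from werkzeug.contrib.cache import FileSystemCache

cache = FileSystemCache('.cachedir', threshold=100000)  # same cache dir as the fetch step

url = sys.argv[1]
page_source = cache.get(url)
if page_source is not None:  # only process pages that were already fetched
    root = html.document_fromstring(page_source)
    Cleaner(kill_tags=['noscript'], style=True)(root)
    print(root.text_content())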