I have working code that can access Tenor.com, scroll through the website and scrape GIFs. My issue is that it only scrapes and saves up to 24 GIFs, no matter how many it scrolls past.
This exact same code works for saving images on other websites (without the same issues presented here).
I've also tried using BeautifulSoup to find all divs with the class "Gif " and then extract the img from each div, but that leads to the exact same result (only 24 GIFs being downloaded).
Here's my code below. What might the issue be?
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
import requests
from urllib.parse import urljoin
from selenium.webdriver.common.by import By
import urllib.request
options = Options()
options.add_experimental_option("detach", True)
options.add_argument("--disable-notifications")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
search_url = 'https://tenor.com/'
driver.get(search_url)
time.sleep(5) # Allow 5 seconds for the web page to open
scroll_pause_time = 2 # You can set your own pause time. My laptop is a bit slow so I use 2 seconds
screen_height = driver.execute_script("return window.screen.height;") # get the screen height
i = 1
start_time = time.time()
while True:
if time.time() - start_time >= 60:
break
# scroll one screen height each time
driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
i += 1
time.sleep(scroll_pause_time)
# update scroll height each time after scrolled, as the scroll height can change after we scrolled the page
scroll_height = driver.execute_script("return document.body.scrollHeight;")
# Break the loop when the height we need to scroll to is larger than the total scroll height
if (screen_height) * i > scroll_height:
break
media = []
media_elements = driver.find_elements(By.XPATH, "//div[contains(@class,'Gif ')]//img")
for m in media_elements:
src = m.get_attribute("src")
media.append(src)
print("Total Number of Animated GIFs and Videos Stored is", len(media))
print("The Sequence of Pages we Have Scrolled is", i)
for i in range(len(media)):
urllib.request.urlretrieve(str(media[i]),"tenor/media{}.gif".format(i))
If you scroll down with DevTools open, you can see that the number of figure elements stops increasing after a certain point, i.e. old images are removed from the HTML as new ones are added.
So you have to run .get_attribute("src") inside the scrolling loop. Also, I suggest using a set instead of a list to store the URLs, since set.add(url) adds the URL only if it is not already contained in the set.
The code below scrapes the images, collects the URLs and scrolls to the last visible image.
media = set()
for i in range(6):
    images = driver.find_elements(By.XPATH, "//div[contains(@class,'Gif ')]//img")
[media.add(img.get_attribute('src')) for img in images]
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', images[-1])
time.sleep(1)
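After the loop you can download the collected URLs, for example with urllib.request.urlretrieve as in the original code (a minimal sketch; it assumes the tenor/ folder already exists and skips entries whose src was missing):

import urllib.request

gif_urls = [url for url in media if url]  # drop None values from images without a src
for idx, url in enumerate(gif_urls):
    urllib.request.urlretrieve(url, "tenor/media{}.gif".format(idx))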
I'm trying to take a screenshot of the product detail section of an Amazon item. I found that the div with id="aplus" is the product detail description I'm looking for.
So I wrote code using Python and Selenium to take a full screenshot of that div.
However, the result is cropped and only shows the top part of the div.
options = webdriver.ChromeOptions()
options.headless = True
driver = webdriver.Chrome()
URL = "https://www.amazon.co.jp/-/en/Figuarts-Dragon-Saiyan-Approx-Painted/dp/B08S7KVHMP/ref=sr_1_1?crid=3O3TF6V9FJHS5¤cy=JPY&keywords=b08s7kvhmp&qid=1668143838&qu=eyJxc2MiOiIwLjAwIiwicXNhIjoiMC4wMCIsInFzcCI6IjAuMDAifQ%3D%3D&sprefix=%2Caps%2C140&sr=8-1"
driver.get(URL)
time.sleep(5)
S = lambda X: driver.execute_script('return document.body.parentNode.scroll' +X)
time.sleep(1)
driver.set_window_size(S('Width'), S('Height'))
image = driver.find_element('id','aplus')
image.screenshot('yes.png')
And if I put
options=options
inside webdriver.Chrome(), then depending on the product it takes a full screenshot of the div, but it does not contain any image.
I have no idea how to take a full screenshot of the div :S
For this example you need to install the PIL (Pillow) library:
pip install Pillow
from selenium import webdriver
from PIL import Image
from io import BytesIO
options = webdriver.ChromeOptions()
options.headless = True
driver = webdriver.Chrome(options=options)
URL = "https://www.amazon.co.jp/-/en/Figuarts-Dragon-Saiyan-Approx-Painted/dp/B08S7KVHMP/ref=sr_1_1?crid=3O3TF6V9FJHS5¤cy=JPY&keywords=b08s7kvhmp&qid=1668143838&qu=eyJxc2MiOiIwLjAwIiwicXNhIjoiMC4wMCIsInFzcCI6IjAuMDAifQ%3D%3D&sprefix=%2Caps%2C140&sr=8-1"
driver.get(URL)
# now that we have the preliminary stuff out of the way time to get that image :D
element = driver.find_element_by_id('aplus') # find the part of the page you want an image of
location = element.location
size = element.size
png = driver.get_screenshot_as_png() # saves screenshot of entire page
driver.quit()
im = Image.open(BytesIO(png)) # uses PIL library to open image in memory
left = location['x']
top = location['y']
right = location['x'] + size['width']
bottom = location['y'] + size['height']
im = im.crop((left, top, right, bottom)) # defines crop points
im.save('screenshot.png') # saves new cropped image
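If the div sits below the fold, its coordinates can fall outside the captured full-page screenshot and the crop comes out blank. One workaround (a sketch, reusing the window-resizing idea from the question, and assuming headless mode so the window can grow beyond the physical screen) is to enlarge the browser window to the full page size before the get_screenshot_as_png() call:

# resize the (headless) window so the whole page is rendered at once
S = lambda X: driver.execute_script('return document.body.parentNode.scroll' + X)
driver.set_window_size(S('Width'), S('Height'))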
I am trying to scrape a website that populates a list of providers. The site makes you go through a list of options, and then it finally populates the list of providers in a pop-up that has an endless/continuous scroll.
I have tried:
from selenium.webdriver.common.action_chains import ActionChains
element = driver.find_element_by_id("my-id")
actions = ActionChains(driver)
actions.move_to_element(element).perform()
But this code didn't work.
I tried something similar to this:
driver.execute_script("arguments[0].scrollIntoView();", list )
But this didn't move anything; it just stayed on the first 20 providers.
I tried this alternative:
main = driver.find_element_by_id('mainDiv')
recentList = main.find_elements_by_class_name('nameBold')
for list in recentList :
driver.execute_script("arguments[0].scrollIntoView(true);", list)
time.sleep(20)
But I ended up with this error message:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
The code that worked the best was this one:
while True:
# Scroll down to bottom
    element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
element_inside_popup.send_keys(Keys.END)
# Wait to load page
time.sleep(3)
But this is an endless scroll that I don't know how to stop, since "while True:" will always be true.
Any help with this would be great and thanks in advance.
This is my code so far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
import pandas as pd
PATH = '/Users/AnthemScraper/venv/chromedriver'
driver = webdriver.Chrome(PATH)
#location for the website
driver.get('https://shop.anthem.com/sales/eox/abc/ca/en/shop/plans/medical/snq?execution=e1s13')
print(driver.title)
#entering the zipcode
search = driver.find_element_by_id('demographics.zip5')
search.send_keys(90210)
#making the scraper sleep for 5 seconds while the page loads
time.sleep(5)
#entering first name and DOB then hitting next
search = driver.find_element_by_id('demographics.applicants0.firstName')
search.send_keys('juelz')
search = driver.find_element_by_id('demographics.applicants0.dob')
search.send_keys('01011990')
driver.find_element_by_xpath('//*[#id="button/shop/getaquote/next"]').click()
#hitting the next button
driver.find_element_by_xpath('//*[#id="hypertext/shop/estimatesavings/skipthisstep"]').click()
#making the scraper sleep for 2 seconds while the page loads
time.sleep(2)
#clicking the no option to view all the health plans
driver.find_element_by_xpath('//*[#id="radioNoID"]').click()
driver.find_element_by_xpath('/html/body/div[4]/div[11]/div/button[2]/span').click()
#making the scraper sleep for 2 seconds while the page loads
time.sleep(2)
driver.find_element_by_xpath('//*[#id="hypertext/shop/medical/showmemydoctorlink"]/span').click()
time.sleep(2)
#section to choose the specialist. here we are choosing all
find_specialist = driver.find_element_by_xpath('//*[@id="specializedin"]')
#this is the method for a dropdown
select_provider = Select(find_specialist)
select_provider.select_by_visible_text('All Specialties')
#choosing the distance. Here we click on 50 miles
choose_mile_radius = driver.find_element_by_xpath('//*[@id="distanceInMiles"]')
select_provider = Select(choose_mile_radius)
select_provider.select_by_visible_text('50 miles')
driver.find_element_by_xpath('/html/body/div[4]/div[11]/div/button[2]/span').click()
#handling the endless scroll
while True:
time.sleep(20)
# Scroll down to bottom
    element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
element_inside_popup.send_keys(Keys.END)
# Wait to load page
time.sleep(3)
#The block below allows us to grab the majority of the data. We would have to split it up in pandas since this info
#is nested within inner classes
time.sleep(5)
main = driver.find_element_by_id('mainDiv')
sections = main.find_elements_by_class_name('firstRow')
pcp_info = []
#print(section.text)
for pcp in sections:
#the site stores the information inside inner classes which make it difficult to scrape.
    #the solution would be to pull the entire text in the block and hope to clean it afterwards
#innerText allows to pull just the text inside the blocks
first_blox = pcp.find_element_by_class_name('table_content_colone').get_attribute('innerText')
second_blox = pcp.find_element_by_class_name('table_content_coltwo').get_attribute('innerText')
#creating columns and rows and assigning them
pcp_items = {
'first_block' : [first_blox],
'second_block' : [second_blox]
}
pcp_info.append(pcp_items)
df = pd.DataFrame(pcp_info)
print(df)
df.to_csv('yerp.csv',index=False)
#driver.quit()
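One idea I had (just a sketch, I haven't confirmed it works on this site) is to compare the pop-up's scrollHeight before and after each Keys.END and break once it stops growing:

element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
last_height = driver.execute_script("return arguments[0].scrollHeight;", element_inside_popup)
while True:
    element_inside_popup.send_keys(Keys.END)
    time.sleep(3)  # wait for the next batch of providers to load
    new_height = driver.execute_script("return arguments[0].scrollHeight;", element_inside_popup)
    if new_height == last_height:  # nothing new loaded, so we reached the end of the list
        break
    last_height = new_height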
I'm trying to build a corpus of the comments on a certain YouTube video with Selenium and BeautifulSoup. (I'm not trying to use the YouTube Data API because of its quota limit.)
I almost did it, but I could only get results for the comments and IDs...
I inspected the element that contains the like count info and added it to my code. The script runs fine otherwise, but it doesn't retrieve anything for the likes; it gives me just nothing, and I don't know why.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import re
from collections import Counter
from konlpy.tag import Twitter
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path=r'C:\chrome\chromedriver_win32\chromedriver.exe', options=options)
url = 'https://www.youtube.com/watch?v=D4pxIxGdR_M&t=2s'
driver.get(url)
driver.implicitly_wait(10)
SCROLL_PAUSE_TIME = 3
# Get scroll height
last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
# Scroll down to bottom
driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
# Wait to load page
time.sleep(SCROLL_PAUSE_TIME)
# Calculate new scroll height and compare with last scroll height
new_height = driver.execute_script("return document.documentElement.scrollHeight")
if new_height == last_height:
break
last_height = new_height
html_source = driver.page_source
driver.close()
soup = BeautifulSoup(html_source, 'lxml')
ids = soup.select('div#header-author > a > span')
comments = soup.select('div#content > yt-formatted-string#content-text')
likes = soup.select('ytd-comment-action-buttons-renderer#action-buttos > div#tollbar > span#vote-count-middle')
print('ID :', len(ids), 'Comments : ', len(comments), 'Likes : ' ,len(likes))
And 0 is just printed out... I have searched for ways to deal with this, but most of the answers just tell me to use the API.
I actually wouldn't use BeautifulSoup for the extraction, just go with the built-in selenium tools, i.e.:
ids = driver.find_elements_by_xpath('//*[@id="author-text"]/span')
comments = driver.find_elements_by_xpath('//*[@id="content-text"]')
likes = driver.find_elements_by_xpath('//*[@id="vote-count-middle"]')
This way you can still use len(), since the results are lists. You can also iterate over likes and add the .text values together:
total_likes = 0
for like in likes:
total_likes += int(like.text)
To make this more Pythonic, you might as well go with a proper comprehension.
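For example (a minimal sketch; it assumes the counts are plain integer strings and skips anything else, so abbreviated counts like "1.2K" would need extra parsing):

total_likes = sum(int(like.text) for like in likes if like.text.strip().isdigit())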
Not sure if this entirely lines up with the requirement, but sites like Social Blade and moreofit can be combined to get a general overview and might be a better starting point for crawling, depending on your purpose. I personally found this useful.
I've written a script in Python using Selenium to scrape the name and price of different products from the Redmart website. My scraper clicks on a link, goes to its target page and parses the data from there. However, the issue I'm facing with this crawler is that it scrapes very few items from a page because of the webpage's lazy-loading behaviour. How can I get all the data from each page while controlling the lazy-loading process? I tried the execute_script method but I did it wrongly. Here is the script I'm trying with:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://redmart.com/bakery")
wait = WebDriverWait(driver, 10)
counter = 0
while True:
try:
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "li.image-facets-pill")))
driver.find_elements_by_css_selector('img.image-facets-pill-image')[counter].click()
counter += 1
except IndexError:
break
# driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
for elems in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.productPreview"))):
name = elems.find_element_by_css_selector('h4[title] a').text
price = elems.find_element_by_css_selector('span[class^="ProductPrice__"]').text
print(name, price)
driver.back()
driver.quit()
I guess you could use Selenium for this, but if speed is your concern, then, since @Andersson already crafted the scrolling code for you in another question on Stack Overflow, you should instead replicate the API calls that the site uses and extract the data from the JSON, like the site itself does.
If you use the Chrome Inspector you'll see that, for each of those categories in your outer while-loop (the try-block in your original code), the site calls an API that returns the overall categories of the site. All this data can be retrieved like so:
import requests

categories_api = 'https://api.redmart.com/v1.5.8/catalog/search?extent=0&depth=1'
r = requests.get(categories_api).json()
For the next API calls you need to grab the uris concerning the bakery stuff. This can be done like so:
bakery_item = [e for e in r['categories'] if e['title'] == 'Bakery']
children = bakery_item[0]['children']
uris = [c['uri'] for c in children]
uris will now be a list of strings (['bakery-bread', 'breakfast-treats-212', 'sliced-bread-212', 'wraps-pita-indian-breads', 'rolls-buns-212', 'baked-goods-desserts', 'loaves-artisanal-breads-212', 'frozen-part-bake', 'long-life-bread-toast', 'speciality-212']) that you'll pass on to another API endpoint, also found with the Chrome Inspector, which the site uses to load content.
This API has the following form (by default it returns a smaller pageSize, but I bumped it to 500 to be reasonably sure you get all the data in one request):
items_API = 'https://api.redmart.com/v1.5.8/catalog/search?pageSize=500&sort=1024&category={}'
for uri in uris:
r = requests.get(items_API.format(uri)).json()
products = r['products']
for product in products:
name = product['title']
# testing for promo_price - if its 0.0 go with the normal price
price = product['pricing']['promo_price']
if price == 0.0:
price = product['pricing']['price']
print("Name: {}. Price: {}".format(name, price))
Edit: If you still want to stick with Selenium, you could insert something like this to handle the lazy loading. Questions on scrolling have been answered several times before, so yours is actually a duplicate. In the future you should show what you tried (your own effort on the execute_script part) and include the traceback.
check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
height = driver.execute_script("return document.body.scrollHeight;")
if height == check_height:
break
check_height = height
I have been using Python Selenium for quite some time and I have been happy with it, until I got this new requirement in which I am supposed to set sliders on a web page (here) to certain values and then let the page run its scripts to update the page with the results.
My problem is how to set the slider's min and max knobs using Python Selenium. I have tried the example here and my code is below.
#! /usr/bin/python2.7
import os
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains
import datetime
import time
import mysql.connector
def check2(driver, slidebar, sliderknob, percent):
height = slidebar.size['height']
width = slidebar.size['width']
move = ActionChains(driver);
    # slidebar = driver.find_element_by_xpath("//div[@id='slider']/a")
if width > height:
#highly likely a horizontal slider
print "off set: ", percent * width / 100
move.click_and_hold(sliderknob).move_by_offset(500, 0).release().perform()
else:
#highly likely a vertical slider
move.click_and_hold(sliderknob).move_by_offset(percent * height / 100, 0).release().perform()
driver.switch_to_default_content()
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-proxy-server')
os.environ["PATH"] += ":/home/mike/software"
os.environ["PATH"] += ":/usr/local/bin/"
try:
driver = webdriver.Chrome()
driver.get("http://99.243.40.11/#/HouseSold")
    els = driver.find_elements_by_xpath('//input[@class="input high"]')
print 'els.len = ', len(els)
e = els[0]
    ens = driver.find_elements_by_xpath('//span[@class="pointer high"]')
en = ens[0]
check2(driver, e, en, 70)
time.sleep(20)
finally:
driver.close()
Unfortunately this is not working for me.
Please let me know if you have any clue.
I much appreciate your help.
Regards,
Well, I think you can follow the last comment and it will give you the clue.
Actually I did, and I got some good results. First you need to use Selenium IDE to find the knob you would like to move, and then do something like the code below to move it.
Let me know if that helps you.
Cheers,
try:
driver = webdriver.Chrome()
driver.get("http://99.243.40.11/#/HouseSold")
en = driver.find_element_by_xpath("//span[6]")
move = ActionChains(driver)
move.click_and_hold(en).move_by_offset(10, 0).release().perform()
time.sleep(5)
move.click_and_hold(en).move_by_offset(10, 0).release().perform()
time.sleep(5)
move.click_and_hold(en).move_by_offset(10, 0).release().perform()
time.sleep(5)
finally:
driver.close()
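If you need the knob at a specific percentage instead of nudging it in fixed 10-pixel steps, one rough idea (a sketch based on the check2 helper from the question; element geometry and browser zoom can throw the maths off, so treat it as a starting point) is to compute the horizontal offset from the slider track's width:

from selenium.webdriver import ActionChains

def move_knob_to_percent(driver, track, knob, percent):
    # offset from the knob's current position to the target position on the track
    target_x = track.size['width'] * percent / 100.0
    current_x = knob.location['x'] - track.location['x']
    ActionChains(driver).click_and_hold(knob).move_by_offset(int(target_x - current_x), 0).release().perform()

# usage with the elements located in the question
track = driver.find_element_by_xpath('//input[@class="input high"]')
knob = driver.find_element_by_xpath('//span[@class="pointer high"]')
move_knob_to_percent(driver, track, knob, 70)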