My code works for the most part:
I currently get all the titles from a YouTube channel page and scroll through it.
How would I also get the number of views for each video?
Would CSS or XPath work?
import time
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install())

url = 'https://www.youtube.com/user/OakDice/videos'
driver.get(url)

last_height = driver.execute_script("return document.documentElement.scrollHeight")
SCROLL_PAUSE_TIME = 2

while True:
    # Scroll down to bottom
    time.sleep(2)
    driver.execute_script("window.scrollTo(0, arguments[0]);", last_height)

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, 'lxml')
titles = soup.findAll('a', id='video-title')

for title in titles:
    print(title.text)
It would probably be more robust to use the YouTube API to get JSON data about the videos. You can get a list of all public videos uploaded by a given user (see for instance YouTube API to fetch all videos on a channel), then use the videos API to get the statistics for each video in that playlist and read the view count from statistics.viewCount.
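For illustration, here is a minimal sketch of that approach with the Data API v3 and requests. YOUR_API_KEY is a placeholder for your own key, and forUsername is used because the channel URL in the question is a legacy /user/ one; adjust if your channel only has an ID.

import requests

API_KEY = 'YOUR_API_KEY'  # placeholder - supply your own Data API v3 key

# 1. Resolve the channel's "uploads" playlist from the legacy username
resp = requests.get('https://www.googleapis.com/youtube/v3/channels',
                    params={'part': 'contentDetails', 'forUsername': 'OakDice', 'key': API_KEY}).json()
uploads_id = resp['items'][0]['contentDetails']['relatedPlaylists']['uploads']

# 2. Page through the uploads playlist to collect every video ID
video_ids, page_token = [], None
while True:
    resp = requests.get('https://www.googleapis.com/youtube/v3/playlistItems',
                        params={'part': 'contentDetails', 'playlistId': uploads_id,
                                'maxResults': 50, 'pageToken': page_token, 'key': API_KEY}).json()
    video_ids += [item['contentDetails']['videoId'] for item in resp['items']]
    page_token = resp.get('nextPageToken')
    if not page_token:
        break

# 3. Fetch title and view count in batches of up to 50 IDs
for i in range(0, len(video_ids), 50):
    resp = requests.get('https://www.googleapis.com/youtube/v3/videos',
                        params={'part': 'snippet,statistics',
                                'id': ','.join(video_ids[i:i + 50]), 'key': API_KEY}).json()
    for item in resp['items']:
        print(item['snippet']['title'], item['statistics']['viewCount'])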
I would loop through all the videos (parent tag ytd-grid-video-renderer)
and then pluck out the titles & counts from there.
Something like:
allvideos = driver.find_elements_by_tag_name('ytd-grid-video-renderer')
for video in allvideos:
    title = video.find_element_by_id('video-title').text
    count = video.find_element_by_xpath('.//*[@id="metadata-line"]/span').text
    print(title, count)
I don't have a BeautifulSoup solution for you, as Selenium will do most of the work here.
And a word of caution on using driver.page_source: it doesn't really return a full snapshot of the live DOM, so it may not be doing what you think it's doing.
So I've been using Selenium in Chrome to go to a social media profile and scrape the usernames of its followers. However, the list is in the 100s of thousands and the page only loads a limited amount. My solution was to tell Selenium to scroll down endlessly and scrape usernames using 'driver.find_elements' as it goes, but after a few hundred usernames Chrome soon crashes with the error code "Ran out of memory".
Am I even capable of getting that entire list?
Is Selenium even the right tool to use or should I use Scrapy? Maybe both?
I'm at a loss on how to move forward from here.
Here's my code just in case
from easygui import *
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

choice = ccbox("Run the test?", "", ("Run it", "I'm not ready yet"))
if choice == False:
    quit()

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

time.sleep(60)  # wait to give me time to manually log in and go to the followers list

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        driver.execute_script("window.scrollTo(0, 1080);")
        time.sleep(1)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
    last_height = new_height
I figured it out! Every "follower" has its own element, and endless scrolling keeps storing all of these elements in memory until Chrome hits a limit. I solved it by deleting the already-scraped elements with JavaScript after scrolling a certain amount, then rinse and repeat until reaching the bottom :)
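A minimal sketch of that idea, reusing the driver from the code above. The CSS selector here is hypothetical (the real one depends on the site's markup): scrape whatever is currently rendered, then remove those nodes from the DOM before scrolling again so memory stays bounded.

import time
from selenium.webdriver.common.by import By

FOLLOWER_SELECTOR = 'div[role="dialog"] li'  # hypothetical selector - adjust to the real markup

usernames = []
while True:
    # Scrape whatever is currently rendered
    rows = driver.find_elements(By.CSS_SELECTOR, FOLLOWER_SELECTOR)
    if not rows:
        break
    usernames.extend(row.text for row in rows)

    # Delete the processed elements so they stop accumulating in memory
    driver.execute_script(
        "document.querySelectorAll(arguments[0]).forEach(el => el.remove());",
        FOLLOWER_SELECTOR)

    # Scroll again to trigger loading of the next batch
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)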
I want to scrape all match links from this page 'https://m.aiscore.com/basketball/20210610' but can only get a limited number of matches.
I tried this code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)
url = 'https://m.aiscore.com/basketball/20210610'
driver.get(url)
driver.maximize_window()
driver.implicitly_wait(60)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
soup = BeautifulSoup(driver.page_source, 'html.parser')
links = [i['href'] for i in soup.select('.w100.flex a')]
links_length = len(links) #always return 16
driver.quit()
When I run the code, I always get only 16 match links, but the page has 35 matches.
I need to get all the match links on the page.
Since the site loads content as you scroll, I have tried scrolling one screen at a time until the height we need to scroll to is larger than the total scroll height of the page.
I have used a set for storing the match links to avoid adding already existing match links.
At the time of running this, I was able to find all the links. Hope this will work for you as well.
import requests
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=r"C:\Users\User\Downloads\chromedriver.exe", options=options)

url = 'https://m.aiscore.com/basketball/20210610'
driver.get(url)

# Wait till the webpage is loaded
time.sleep(2)

# wait for 1 sec after scrolling
scroll_wait = 1

# Gets the screen height
screen_height = driver.execute_script("return window.screen.height;")
driver.implicitly_wait(60)

# Number of scrolls. Initially 1
ScrollNumber = 1

# Set to store all the match links
ans = set()

while True:
    # Scrolling one screen at a time
    driver.execute_script(f"window.scrollTo(0, {screen_height * ScrollNumber})")
    ScrollNumber += 1

    # Wait for some time after scroll
    time.sleep(scroll_wait)

    # Updating the scroll_height after each scroll
    scroll_height = driver.execute_script("return document.body.scrollHeight;")

    # Fetching the data that we need - Links to Matches
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for j in soup.select('.w100 .flex a'):
        if j['href'] not in ans:
            ans.add(j['href'])

    # Break when the height we need to scroll to is larger than the scroll height
    if screen_height * ScrollNumber > scroll_height:
        break

print(f'Links found: {len(ans)}')
Output:
Links found: 61
You're not adding any implicit waits into your code. You might want to start there. But try using driver.find_elements_by_link_text() in addition to adding some sleep time; that should create a list for you.
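For what it's worth, a tiny sketch of that suggestion, reusing the driver above. The link text here is a made-up example; use whatever visible text the match links actually have.

import time

driver.implicitly_wait(30)  # implicit wait applied to every find_* call
time.sleep(5)               # crude extra pause for lazy-loaded content

# find_elements_* always returns a list (empty if nothing matched)
links = driver.find_elements_by_link_text('Full match stats')  # hypothetical link text
print(len(links))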
I'm trying to scrape a website (https://harleytherapy.com/therapists?page=1) that looks like it's been generated by JavaScript, and the element I'm trying to scrape (the ul with id="downshift-7-menu") doesn't appear in the page source, only after I click on "Inspect element".
I tried to find a solution here, and so far this is the code I was able to come up with (a combination of Selenium + BeautifulSoup):
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
url = "https://harleytherapy.com/therapists?page=1"
options = webdriver.ChromeOptions()
options.add_argument('headless')
capa = DesiredCapabilities.CHROME
capa["pageLoadStrategy"] = "none"
driver = webdriver.Chrome(chrome_options=options, desired_capabilities=capa)
driver.set_window_size(1440,900)
driver.get(url)
time.sleep(15)
plain_text = driver.page_source
soup = BeautifulSoup(plain_text, 'html')
therapist_menu_id = "downshift-7-menu"
print(soup.find(id=therapist_menu_id))
I thought that allowing Selenium to wait for 15 seconds would make sure that all elements are loaded but I still can't find any element with id downshift-7-menu in the soup. Do you guys know what's wrong with my code?
The element with ID downshift-7-menu is loaded only after opening the THERAPIST dropdown menu. You can do that by scrolling it into view to load it and then clicking on it. You should also consider replacing sleep with an explicit wait:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 15)

# scroll the dropdown into view to load it
side_menu = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'inner-a377b5')))
last_height = driver.execute_script("return arguments[0].scrollHeight", side_menu)
while True:
    driver.execute_script("arguments[0].scrollTo(0, arguments[0].scrollHeight);", side_menu)
    new_height = driver.execute_script("return arguments[0].scrollHeight", side_menu)
    if new_height == last_height:
        break
    last_height = new_height

# open the menu
wait.until(EC.visibility_of_element_located((By.ID, 'downshift-7-input'))).click()

# wait for the menu to load
therapist_menu_id = 'downshift-7-menu'
wait.until(EC.presence_of_element_located((By.ID, therapist_menu_id)))

# re-parse the page source now that the menu is present in the DOM
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(id=therapist_menu_id))
I want to scrape all tweets from Twitter using Selenium. For this I want to get to the bottom of the page. I tried a lot, but it keeps showing "Back to top", as shown in the image.
How can I get to the bottom of the page / make "Back to top" disappear using Selenium, or how can I scrape all the tweets with some other approach?
import pandas as pd
import selenium
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox(executable_path="/home/piyush/geckodriver")

url = "https://twitter.com/narendramodi"
driver.get(url)
time.sleep(6)

lastHeight = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(6)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight

soup = BeautifulSoup(driver.page_source.encode("utf-8"), "html.parser")
tweet = [p.text for p in soup.find_all("p", class_="tweet-text")]
Here is an image of the inspected "Back to top" element.
Here is the output image.
Just briefly looking at Twitter, it appears that the content is generated on scrolling, meaning you need to scrape and parse the data as you scroll rather than after.
I would suggest moving
soup = BeautifulSoup(driver.page_source.encode("utf-8"),"html.parser")
tweet = [p.text for p in soup.find_all("p",class_="tweet-text")]
into your while loop after the scroll:
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    ###
    soup = BeautifulSoup(driver.page_source.encode("utf-8"), "html.parser")
    tweet = [p.text for p in soup.find_all("p", class_="tweet-text")]
    ###

    time.sleep(6)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
If this doesn't work, you are probably being fingerprinted and labeled as a bot by Twitter.
I've written a scraper in Python in combination with Selenium to get all the product names from redmart.com. Every time I run my code, I get only 27 names from that page although the page has got numerous names. FYI, the page has lazy loading enabled. My scraper can reach the bottom of the page but scrapes only 27 names. I can't understand where I'm going wrong with the logic I've applied in my scraper. I hope to find a workaround.
Here is the script I've written so far:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://redmart.com/new")

check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        wait.until(lambda driver: driver.execute_script("return document.body.scrollHeight;") > check_height)
        check_height = driver.execute_script("return document.body.scrollHeight;")
    except:
        break

for names in driver.find_elements_by_css_selector('.description'):
    item_name = names.find_element_by_css_selector('h4 a').text
    print(item_name)

driver.quit()
You have to wait for new content to be loaded.
Here is a very simple example:
driver.get('https://redmart.com/new')
products = driver.find_elements_by_xpath('//div[@class="description"]/h4/a')
print(len(products))  # 18 products

driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
time.sleep(5)  # wait for new content to be loaded

products = driver.find_elements_by_xpath('//div[@class="description"]/h4/a')
print(len(products))  # 36 products
It works.
You can also look at the XHR requests and try to scrape what you want without using time.sleep() and driver.execute_script.
For example, while scrolling their website, new products are loaded from this URL:
https://api.redmart.com/v1.6.0/catalog/search?q=new&pageSize=18&page=1
As you can see, it is possible to modify parameters like pageSize (max 100 products) and page. With this URL you can scrape all the products without even using Selenium and Chrome; you can do it all with the Python requests library.
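A rough sketch of that approach with requests. The exact JSON layout of the response isn't shown here, so the 'products' and 'title' field names below are assumptions; inspect the real payload and adjust.

import requests

BASE = 'https://api.redmart.com/v1.6.0/catalog/search'
page = 1          # the example URL starts at page=1; confirm how the API indexes pages
products = []

while True:
    resp = requests.get(BASE, params={'q': 'new', 'pageSize': 100, 'page': page})
    resp.raise_for_status()
    data = resp.json()

    # NOTE: 'products' and 'title' are assumed field names - check the real response
    batch = data.get('products', [])
    if not batch:
        break
    products.extend(item.get('title') for item in batch)
    page += 1

print(len(products), 'products collected')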