How to crawl like counts of comments on a YouTube video? - python

I'm trying to build a corpus of comments on a certain YouTube video with Selenium and BeautifulSoup. (I'm not using the YouTube Data API because of its quota limit.)
I almost have it working, but I can only get the comments and the IDs.
I inspected the element that contains the like counts and added its selector to my code. The script runs without errors, but it doesn't retrieve anything for the likes; the result is just empty and I don't know why.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import re
from collections import Counter
from konlpy.tag import Twitter

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path=r'C:\chrome\chromedriver_win32\chromedriver.exe', options=options)

url = 'https://www.youtube.com/watch?v=D4pxIxGdR_M&t=2s'
driver.get(url)
driver.implicitly_wait(10)

SCROLL_PAUSE_TIME = 3

# Get scroll height
last_height = driver.execute_script("return document.documentElement.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html_source = driver.page_source
driver.close()

soup = BeautifulSoup(html_source, 'lxml')
ids = soup.select('div#header-author > a > span')
comments = soup.select('div#content > yt-formatted-string#content-text')
likes = soup.select('ytd-comment-action-buttons-renderer#action-buttos > div#tollbar > span#vote-count-middle')
print('ID :', len(ids), 'Comments : ', len(comments), 'Likes : ', len(likes))
For the likes, 0 is just printed out. I've searched for ways to deal with this, but most of the answers just tell me to use the API.

I actually wouldn't use BeautifulSoup for the extraction; just go with the built-in Selenium tools, i.e.:
ids = driver.find_elements_by_xpath('//*[@id="author-text"]/span')
comments = driver.find_elements_by_xpath('//*[@id="content-text"]')
likes = driver.find_elements_by_xpath('//*[@id="vote-count-middle"]')
This way you can still use len() on the results, since they are lists. You can also iterate over likes and add up the .text values:
total_likes = 0
for like in likes:
    total_likes += int(like.text)
To make this more Pythonic, you might as well go with a proper list comprehension or generator expression.
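A minimal sketch of that, assuming every visible like element's .text parses as an integer (YouTube abbreviates large counts as e.g. "1.2K", which would need extra handling):
# Skip empty strings (comments with zero likes) and sum the rest.
total_likes = sum(int(like.text) for like in likes if like.text.strip())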

Not sure if this entirely lines up with the requirement, but sites like Socialblade and moreofit can be combined to get a general overview and might be a better starting point for crawling, depending on your purpose. I personally found this to be useful.

Related

Scraping Gifs Using Selenium

I have working code that accesses Tenor.com, scrolls through the website, and scrapes GIFs. My issue is that it only scrapes and saves up to 24 GIFs, no matter how many it scrolls past.
This exact same code works for saving images on other websites (without the issues presented here).
I've also tried using BeautifulSoup to find all divs with the class "Gif " and then extract the img from each one, but that leads to the exact same result (only 24 GIFs being downloaded).
Here's my code. What might the issue be?
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
import requests
from urllib.parse import urljoin
from selenium.webdriver.common.by import By
import urllib.request

options = Options()
options.add_experimental_option("detach", True)
options.add_argument("--disable-notifications")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

search_url = 'https://tenor.com/'
driver.get(search_url)
time.sleep(5)  # Allow 5 seconds for the web page to open

scroll_pause_time = 2  # You can set your own pause time; my laptop is a bit slow
screen_height = driver.execute_script("return window.screen.height;")  # get the screen height of the web page
i = 1
start_time = time.time()

while True:
    if time.time() - start_time >= 60:
        break
    # scroll one screen height each time
    driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
    i += 1
    time.sleep(scroll_pause_time)
    # update scroll height each time after scrolling, as it can change as the page loads
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    # Break the loop when the height we need to scroll to is larger than the total scroll height
    if screen_height * i > scroll_height:
        break

media = []
media_elements = driver.find_elements(By.XPATH, "//div[contains(@class,'Gif ')]//img")
for m in media_elements:
    src = m.get_attribute("src")
    media.append(src)

print("Total Number of Animated GIFs and Videos Stored is", len(media))
print("The Sequence of Pages we Have Scrolled is", i)

for i in range(len(media)):
    urllib.request.urlretrieve(str(media[i]), "tenor/media{}.gif".format(i))
If you scroll down with DevTools open, you can see that the number of figure elements stops increasing after a certain point, i.e. old images are removed from the HTML as new ones are added.
So you have to run .get_attribute("src") inside the scrolling loop. Also, I suggest using a set instead of a list to save the URLs, since set.add(url) adds a URL only if it is not already contained in the set.
The code below finds the images, gets their URLs, and scrolls to the last visible image.
media = set()
for i in range(6):
    images = driver.find_elements(By.XPATH, "//div[contains(@class,'Gif ')]//img")
    [media.add(img.get_attribute('src')) for img in images]
    driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', images[-1])
    time.sleep(1)
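After the loop you can download the collected URLs the same way the original code does; a minimal sketch (it assumes the tenor/ folder already exists and skips empty src values):
for n, src in enumerate(media):
    if src:  # lazy-loaded images may briefly have no src attribute
        urllib.request.urlretrieve(src, "tenor/media{}.gif".format(n))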

How to get a YouTube video's duration/length using Selenium and Python?

I am trying to extract the title, duration, and link of all the videos that a YouTube channel has. I used Selenium and Python in the following way:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
results = []
url = "https://www.youtube.com/channel/<channel name>/videos"
driver.get(url)

ht = driver.execute_script("return document.documentElement.scrollHeight;")
while True:
    prev_ht = driver.execute_script("return document.documentElement.scrollHeight;")
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(2)
    ht = driver.execute_script("return document.documentElement.scrollHeight;")
    if prev_ht == ht:
        break

links = driver.find_elements_by_xpath('//*[@class="style-scope ytd-grid-renderer"]')
for link in links:
    print()
    print(link.get_attribute("href"), link.get_attribute("text"))
When I try to get the duration of the video using the class="style-scope ytd-thumbnail-overlay-time-status-renderer" class, the driver reports that the element doesn't exist. I managed to get the other two features, though.
Your XPath locator is not correct, so please use the following:
links = driver.find_elements_by_xpath('//*[name() = "ytd-grid-video-renderer" and @class="style-scope ytd-grid-renderer"]')
Now, to get the video length for each link you defined, you can do the following:
links = driver.find_elements_by_xpath('//*[name() = "ytd-grid-video-renderer" and @class="style-scope ytd-grid-renderer"]')
for link in links:
    duration = link.find_element_by_xpath('.//span[contains(@class,"time-status")]').text
    print(duration)
Good Morning!
Selenium can have trouble getting the video duration if the cursor is not in the perfect spot; here's a GIF that shows it. You can get around this by using some of YouTube's built-in JavaScript functions. Here's an example:
video_dur = self.driver.execute_script(
    "return document.getElementById('movie_player').getCurrentTime()")
video_len = self.driver.execute_script(
    "return document.getElementById('movie_player').getDuration()")
video_len = int(video_len) / 60  # getDuration() returns seconds, so this gives minutes
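If you want the length as minutes and seconds rather than a decimal number of minutes, a minimal sketch along the same lines (it assumes the same movie_player element; use driver or self.driver depending on your setup):
total_seconds = int(driver.execute_script(
    "return document.getElementById('movie_player').getDuration()"))
minutes, seconds = divmod(total_seconds, 60)  # split the total seconds into m:ss
print("{}:{:02d}".format(minutes, seconds))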
Have a great day!

Get multiple elements by tag with Python and Selenium

My code goes into a website and scrapes rows of information (title and time).
However, there is one tag ('p') that I am not sure how to get using 'get element by'.
On the website, it is the information under each title.
Here is my code so far:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
import requests

driver = webdriver.Chrome()
driver.get('https://www.nutritioncare.org/ASPEN21Schedule/#tab03_19')
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

eachRow = driver.find_elements_by_class_name('timeline__item')
time.sleep(1)
for item in eachRow:
    time.sleep(1)
    title = item.find_element_by_class_name('timeline__item-title')
    tim = item.find_element_by_class_name('timeline__item-time')
    tex = item.find_element_by_tag_name('p')  # This is the part I don't know how to scrape
    print(title.text, tim.text, tex.text)
I checked the page and there are several p tags. I suggest using find_elements_by_tag_name instead of find_element_by_tag_name (to get all the p tags, including the one you want), iterating over the elements, then joining their text content and stripping it.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import requests

driver = webdriver.Chrome()
driver.get('https://www.nutritioncare.org/ASPEN21Schedule/#tab03_19')
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

eachRow = driver.find_elements_by_class_name('timeline__item')
time.sleep(1)
for item in eachRow:
    time.sleep(1)
    title = item.find_element_by_class_name('timeline__item-title')
    tim = item.find_element_by_class_name('timeline__item-time')
    tex = item.find_elements_by_tag_name('p')
    text = " ".join([i.text for i in tex]).strip()
    print(title.text, tim.text, text)
Since the webpage has several p tags, it would be better to use the plural find_elements_by_tag_name() method. Replace the print call in the code with the following:
print(title.text, tim.text)
for t in tex:
    if t.text == '':
        continue
    print(t.text)
Maybe try using a different find_elements_by_class_name... I don't use Python that much, but try this unless you already have.

How to scroll div to get all dynamically loading items?

I'm trying to scrape the site https://cs.money to get all items and prices, but my script loads only the first 180 skins and I have no idea how to load the rest. Can someone give me a tip on what I should use to load all the items and what the best approach is?
from selenium import webdriver
import time
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_argument('window-size=1200x600')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://cs.money/en')
time.sleep(5)

asd = driver.find_elements_by_class_name("item")
qwe = []
for a in asd:
    if a.get_attribute("ar"):
        qwe.append([a.get_attribute("hash"), a.get_attribute("cost"), a.get_attribute("ar")])
    else:
        qwe.append([a.get_attribute("hash"), a.get_attribute("cost")])
driver.close()

lables = ['name', 'price', 'float_bonus']
dataas = pd.DataFrame.from_records(qwe, columns=lables)
Instead of time.sleep(5) you could add:
for i in range(0, 5):  # here you will need to tune to see exactly how many scrolls you need
    driver.execute_script('window.scrollBy(0, 400)')
    time.sleep(1)
The above is a general solution when you need to scroll dynamic content on a page.
In your case I think the best approach, time-wise, would be to always scroll the last element into view. You could use:
for i in range(0, 25):  # here you will need to tune to see exactly how many scrolls you need
    driver.execute_script('items = document.querySelectorAll(".item");i = items[items.length-1];i.scrollIntoView();')
This is the JS snippet you can try in the browser console:
items = document.querySelectorAll(".item");i = items[items.length-1];i.scrollIntoView();
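If you don't want to hard-code the number of scrolls, here is a minimal sketch of the same idea that keeps scrolling the last .item into view until the item count stops growing (the one-second pause is an assumption you may need to tune):
last_count = 0
while True:
    items = driver.find_elements_by_class_name("item")
    if not items or len(items) == last_count:
        break  # nothing new was loaded since the previous scroll
    last_count = len(items)
    driver.execute_script("arguments[0].scrollIntoView();", items[-1])
    time.sleep(1)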

How to get all the data from a webpage manipulating lazy-loading method?

I've written a script in Python using Selenium to scrape the name and price of different products from the Redmart website. My scraper clicks on a link, goes to its target page, and parses the data from there. However, the issue I'm facing is that it scrapes very few items from a page because of the webpage's lazy-loading behaviour. How can I get all the data from each page by controlling the lazy-loading process? I tried the "execute_script" method but I did it wrongly. Here is the script I'm working with:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://redmart.com/bakery")
wait = WebDriverWait(driver, 10)

counter = 0
while True:
    try:
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "li.image-facets-pill")))
        driver.find_elements_by_css_selector('img.image-facets-pill-image')[counter].click()
        counter += 1
    except IndexError:
        break

    # driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    for elems in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.productPreview"))):
        name = elems.find_element_by_css_selector('h4[title] a').text
        price = elems.find_element_by_css_selector('span[class^="ProductPrice__"]').text
        print(name, price)

    driver.back()

driver.quit()
I guess you could use Selenium for this, but if speed is your concern after @Andersson crafted the code for you in another question on Stack Overflow, well, you should replicate the API calls that the site uses instead and extract the data from the JSON, like the site does.
If you use the Chrome Inspector, you'll see that for each of the categories in your outer while-loop (the try-block in your original code) the site calls an API that returns the overall categories of the site. All this data can be retrieved like so:
import requests

categories_api = 'https://api.redmart.com/v1.5.8/catalog/search?extent=0&depth=1'
r = requests.get(categories_api).json()
For the next API calls you need to grab the uris concerning the bakery stuff. This can be done like so:
bakery_item = [e for e in r['categories'] if e['title'] == 'Bakery']
children = bakery_item[0]['children']
uris = [c['uri'] for c in children]
Uris will now be a list of strings (['bakery-bread', 'breakfast-treats-212', 'sliced-bread-212', 'wraps-pita-indian-breads', 'rolls-buns-212', 'baked-goods-desserts', 'loaves-artisanal-breads-212', 'frozen-part-bake', 'long-life-bread-toast', 'speciality-212']) that you'll pass on to another API found by Chrome Inspector, and that the site uses to load content.
This API has the following form (the default returns a smaller pageSize, but I bumped it to 500 to be reasonably sure you get all the data in one request):
items_API = 'https://api.redmart.com/v1.5.8/catalog/search?pageSize=500&sort=1024&category={}'

for uri in uris:
    r = requests.get(items_API.format(uri)).json()
    products = r['products']
    for product in products:
        name = product['title']
        # testing for promo_price - if it's 0.0 go with the normal price
        price = product['pricing']['promo_price']
        if price == 0.0:
            price = product['pricing']['price']
        print("Name: {}. Price: {}".format(name, price))
Edit: If you still want to stick to Selenium, you could insert something like this to handle the lazy loading. Questions on scrolling have been answered several times before, so yours is actually a duplicate. In the future you should showcase what you tried (your own effort on the execute_script part) and show the traceback.
check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = driver.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height
