Not scrolling down in a website having dynamic scroll - python

I'm scraping news articles from a website whose category pages have no load-more button; the article links are generated as I scroll down. I wrote a function that takes a category page URL and limit_loading (how many times I want to scroll down) and returns all the article links displayed on that page.
Category page link: https://www.scmp.com/topics/trade
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
def get_article_links(url, limit_loading):
    options = webdriver.ChromeOptions()
    lists = ['disable-popup-blocking']
    caps = DesiredCapabilities().CHROME
    caps["pageLoadStrategy"] = "normal"
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-Advertisement")
    options.add_argument("--disable-popup-blocking")
    driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)  # add your chromedriver path
    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")
    loading = 0
    while loading < limit_loading:
        loading += 1
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(8)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    article_links = []
    bsObj = BeautifulSoup(driver.page_source, 'html.parser')
    for i in bsObj.find('div', {'class': 'content-box'}).find('div', {'class': 'topic-article-container'}).find_all('h2', {'class': 'article__title'}):
        article_links.append(i.a['href'])
    return article_links
Assuming I want to scroll 5 times on this category page:
get_article_links('https://www.scmp.com/topics/trade', 5)
But even if I change limit_loading, it returns only the links from the first page. I must have made a mistake in the scrolling part. Please help me with this.

Instead of scrolling by the body's scrollHeight property, I looked for an appropriate element after the list of articles to scroll to. I noticed this aptly named div:
<div class="topic-content__load-more-anchor" data-v-db98a5c0=""></div>
Accordingly, I changed the while loop in your get_article_links function to scroll to this div using location_once_scrolled_into_view, locating the div once before the loop starts:
loading = 0
end_div = driver.find_element('class name', 'topic-content__load-more-anchor')
while loading < limit_loading:
    loading += 1
    print(f'scrolling to page {loading}...')
    end_div.location_once_scrolled_into_view
    time.sleep(2)
If we now call the function with different limit_loading values, we get different counts of unique news links. Here are a couple of runs:
>>> ar_links = get_article_links('https://www.scmp.com/topics/trade', 2)
scrolling to page 1...
scrolling to page 2...
>>> len(ar_links)
90
>>> ar_links = get_article_links('https://www.scmp.com/topics/trade', 3)
scrolling to page 1...
scrolling to page 2...
scrolling to page 3...
>>> len(ar_links)
120
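The loop above always scrolls limit_loading times, even if the page ran out of articles earlier. A minimal sketch of one way to stop early is shown here, assuming you can count the loaded items after each scroll. The function name scroll_until_stable and its two callables are hypothetical and not part of the answer's code:

```python
import time


def scroll_until_stable(scroll_once, count_items, max_rounds=10, pause=2.0):
    """Call scroll_once() until count_items() stops growing or max_rounds is hit.

    Both callables are hypothetical hooks supplied by the caller, for example:
        scroll_once = lambda: end_div.location_once_scrolled_into_view
        count_items = lambda: len(driver.find_elements('class name', 'article__title'))
    """
    previous = count_items()
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)  # give the page time to append new items
        current = count_items()
        if current == previous:
            break  # nothing new was loaded, so assume the end was reached
        previous = current
    return previous
```

The return value is the last item count seen, which can be compared against runs like the ones above.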

Related

Get 'src' link from image using selenium

My problem is that I am trying to get the links of YouTube thumbnails using Selenium. What I found online suggested calling .get_attribute("src"), which does not work.
I tried this (everything works if I remove .get_attribute("src"); well, I get no errors, but I cannot get the thumbnails either):
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.youtube.com/@MrBeast/videos")
SCROLL_PAUSE_TIME = 3
last_height = driver.execute_script("return document.documentElement.scrollHeight")
n = 0
while n < 4:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, arguments[0]);", last_height)
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
    n += 1
titles = driver.find_elements(By.ID, "video-title")
views = driver.find_elements(By.XPATH, '//*[@id="metadata-line"]/span[1]')
year = driver.find_elements(By.XPATH, '//*[@id="metadata-line"]/span[2]')
thumbnail = driver.find_elements(By.XPATH, '//*[@id="thumbnail"]/yt-image/img').get_attribute("src")
data = []
for i, j, k, l in zip(titles, views, year, thumbnail):
    data.append([i.text, j.text, k.text, l.text])
df = pd.DataFrame(data, columns=['Title', 'views', 'date', 'thumbnail'])
df.to_csv('MrBeastThumbnails.csv')
driver.quit()
find_elements returns a list of web elements, while .get_attribute() can be applied only to a single web element object.
To get the src attribute values, you need to iterate over the list of web elements and extract the src attribute of each, as follows:
src_values = []
thumbnails = driver.find_elements(By.XPATH, '//*[@id="thumbnail"]/yt-image/img')
for thumbnail in thumbnails:
    src_values.append(thumbnail.get_attribute("src"))
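The same iteration can be wrapped in a small helper. This is only a sketch: the name collect_attribute is hypothetical, and it assumes some elements may have no src yet (get_attribute returns None when the attribute is absent), so empty values are skipped:

```python
def collect_attribute(elements, name="src"):
    """Return the non-empty values of attribute `name` from a list of elements."""
    values = []
    for element in elements:
        value = element.get_attribute(name)  # None if the attribute is absent
        if value:
            values.append(value)
    return values
```

The result is a list of plain strings, so in the zip loop from the question the thumbnail entry would be used directly rather than through .text. Keep in mind that zip stops at the shortest list, so skipped thumbnails can shorten the output.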

Can't get all xpath elements from dynamic webpage

This is my first time asking here. I hope someone can help me with this; it's driving me crazy!
I'm trying to scrape a used-car webpage from my country. The data loads as you scroll down, so the first part of the code scrolls down to load the whole page.
I'm trying to get the link of every car published there, which is why I'm using find_elements_by_xpath in the try-except part.
The problem is that the cars are shown in packs of 11 per load (scroll down), so the same 11 XPaths repeat on every scroll,
meaning XPaths from
"//*[@id='w1']/div[1]/div/div[1]/a"
to
"//*[@id='w11']/div[1]/div/div[1]/a"
All libraries are imported at the start of the code, don't worry.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
links = []
url = ('https://buy.olxautos.cl/buscar?VehiculoEsSearch%5Btipo_valor%5D=1&VehiculoEsSearch%5Bprecio_range%5D=3990000%3B15190000')
driver = webdriver.Chrome('')
driver.get(url)
time.sleep(5)
SCROLL_PAUSE_TIME = 3
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
try:
    zelda = driver.find_elements_by_xpath("//*[@id='w1']/div[1]/div/div[1]/a").get_attribute('href')
    links.append(zelda)
except:
    pass
print(links)
So the expected output of this code would be something like this:
['link_car_1', 'link_car_12', 'link_car_23', '...']
But when I run this code, it returns an empty list. When I run it with find_element_by_xpath, it returns the first link. What am I doing wrong? 😭😭 I just can't figure it out!
Thanks!
You get only one link because the XPath is not the same for all the links. You can use bs4 to extract the links from the driver's page source, as shown below.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import lxml
links = []
url = ('https://buy.olxautos.cl/buscar?VehiculoEsSearch%5Btipo_valor%5D=1&VehiculoEsSearch%5Bprecio_range%5D=3990000%3B15190000')
driver = webdriver.Chrome(executable_path = Path)  # Path: location of your chromedriver
driver.get(url)
time.sleep(5)
SCROLL_PAUSE_TIME = 3
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Keep the page source as it was before this scroll
    page_source_ = driver.page_source
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    # Use BeautifulSoup to extract links
    sup = BeautifulSoup(page_source_, 'lxml')
    sub_ = sup.findAll('div', {'class': 'owl-item active'})
    for link_ in sub_:
        link = link_.find('a', href=True)
        # link = 'https://buy.olxautos.cl' + link  # if needed (adding prefix)
        links.append(link['href'])
    if new_height == last_height:
        break
    last_height = new_height
print('>> Total length of list : ', len(links))
print('\n', links)
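Because the loop above parses the page source on every pass, the same link can be appended more than once. A minimal sketch of a clean-up step follows, assuming the hrefs may be relative. The helper name unique_absolute_links and the example paths in the comment are hypothetical:

```python
from urllib.parse import urljoin


def unique_absolute_links(hrefs, base_url="https://buy.olxautos.cl"):
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict.fromkeys keeps the first occurrence of each key in insertion order
    return list(dict.fromkeys(absolute))


# Hypothetical usage with the list built above:
# links = unique_absolute_links(links)
```

urljoin leaves hrefs that are already absolute untouched, so the helper can be applied whether or not the prefix is needed.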

Scrolling in a different layer in Chrome using Selenium in Python

I am writing Python code using the selenium module, and I want to scroll a list that is on a different layer in the same window. Imagine you go to Instagram, click on followers, and then want to scroll down to the bottom so that Selenium can make a list of all the users who follow that page.
My problem is that my code scrolls the layer below, which is the user's wall.
def readingFollowers(self):
    self.driver.find_element_by_xpath("//a[contains(@href, '/followers')]")\
        .click()
    sleep(2.5)
    scroll_box = self.driver.find_element_by_xpath('/html/body/div[4]/div/div[2]')
    # Get scroll height
    last_height = self.driver.execute_script("return arguments[0].scrollHeight", scroll_box)
    while True:
        # Scroll down to bottom
        self.driver.execute_script("window.scrollTo(0, arguments[0].scrollHeight);", scroll_box)
        # Wait to load page
        sleep(1)
        # Calculate new scroll height and compare with last scroll height
        new_height = self.driver.execute_script("return arguments[0].scrollHeight", scroll_box)
        if new_height == last_height:
            break
        last_height = new_height
I have used Google Chrome, and the elements shown by Inspect should be the same on all systems (most probably).
If you cannot understand the problem, leave me a comment and I can share the complete code required to recreate the situation.
I assume that you are already logged in to the IG account.
def readingFollowers(self):
    # click followers
    self.driver.find_element_by_xpath('//a[@class="-nal3 "]').click()
    time.sleep(5)
    pop_up = self.driver.find_element_by_xpath('//div[@class="isgrP"]')
    height = self.driver.execute_script("return arguments[0].scrollHeight", pop_up)
    initial_height = height
    # default follower count is 12
    followers_count = 12
    while True:
        # scroll the pop-up itself, not the window behind it
        self.driver.execute_script("arguments[0].scrollBy(0,arguments[1])", pop_up, initial_height)
        time.sleep(5)
        # count loaded followers
        count = len(self.driver.find_elements_by_xpath('//div[@class="PZuss"]/li'))
        if count == followers_count:
            break
        followers_count = count
        # add height because the list is expanding
        initial_height += initial_height
It took me some time but it works.
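The key point in the code above is that the script scrolls the pop-up element instead of the window. A minimal sketch of the same idea as a reusable helper is given here, assuming the element's scrollHeight grows while more rows load. The name scroll_element_to_bottom and its parameters are hypothetical:

```python
import time


def scroll_element_to_bottom(driver, element, pause=2.0, max_rounds=50):
    """Scroll inside `element` (not the window) until its scrollHeight stops growing."""
    last_height = driver.execute_script("return arguments[0].scrollHeight", element)
    for _ in range(max_rounds):
        # Setting scrollTop moves the element's own scrollbar to its current bottom
        driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", element)
        time.sleep(pause)  # wait for more rows to be appended
        new_height = driver.execute_script("return arguments[0].scrollHeight", element)
        if new_height == last_height:
            break  # no growth, so assume everything is loaded
        last_height = new_height
    return last_height
```

It would be called with the driver and the located pop-up, for example scroll_element_to_bottom(driver, pop_up). The max_rounds cap is there so a list that never stops growing cannot loop forever.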

Scraping all comments under an Instagram post

My previous question was closed, but the suggested answer doesn't help me. Instagram comments have very specific behaviour! I know how to programmatically scroll a website down, but the comments on Instagram work a bit differently. I would appreciate it if my question were not closed immediately, because that really doesn't help. I would be grateful for help rather than being shut down! Thank you.
Here it is again:
I am trying to build a scraper that saves the comments under an Instagram post. I manage to log in to Instagram through my code, so I can access all comments under a post, but I cannot seem to scroll down enough times to view all the comments and scrape them. I only get around 20 comments every time.
Can anyone please help me? I am using Selenium WebDriver.
Thank you in advance for your help! I will be grateful.
This is my function for saving the comments:
import time
from selenium.webdriver.firefox.options import Options
from selenium.webdriver import Firefox
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
class Instagram_comments():
    def __init__(self):
        self.firefox_options = Options()
        self.browser = Firefox(options=self.firefox_options)
    def get_comments(self, url):
        self.browser.get(url)
        time.sleep(3)
        while True:
            try:
                self.load_more_comments = self.browser.find_element_by_class_name(
                    'glyphsSpriteCircle_add__outline__24__grey_9')
                self.action = ActionChains(self.browser)
                self.action.move_to_element(self.load_more_comments)
                self.load_more_comments.click()
                time.sleep(4)
                self.body_elem = self.browser.find_element_by_class_name('Mr508')
                for _ in range(100):
                    self.body_elem.send_keys(Keys.END)
                    time.sleep(3)
            except Exception as e:
                pass
            time.sleep(5)
            self.comment = self.browser.find_elements_by_class_name('gElp9 ')
            for c in self.comment:
                self.container = c.find_element_by_class_name('C4VMK')
                self.name = self.container.find_element_by_class_name('_6lAjh').text
                self.content = self.container.find_element_by_tag_name('span').text
                self.content = self.content.replace('\n', ' ').strip().rstrip()
                self.time_of_post = self.browser.find_element_by_xpath('//a/time').get_attribute("datetime")
                self.comment_details = {'profile name': self.name, 'comment': self.content, 'time': self.time_of_post}
                print(self.comment_details)
            time.sleep(5)
            return self.comment_details
This chunk worked multiple times for me:
def scroll():
    SCROLL_PAUSE_TIME = 1
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)
        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        print('page height:', new_height)
        last_height = new_height
scroll()
For some sites (such as Twitter) you will need to scrape as you scroll, because not all elements are still present once you reach the bottom.
This is what my code looked like for Twitter:
account_names = []
account_tags = []
account_link = []
def scroll():
    SCROLL_PAUSE_TIME = 1
    global account_name
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)
        account_name = driver.find_elements_by_xpath('//*[@id="react-root"]/div/div/div/main/div/div/div/div/div/div[2]/div/div/section/div/div/div/div/div/div/div/div[2]/div[1]/div[1]/a/div/div[1]/div[1]/span/span')
        for act_name in account_name:
            global acctname
            acctname = act_name.text
            account_names.append(acctname)
        account_handle = driver.find_elements_by_xpath('//*[@id="react-root"]/div/div/div/main/div/div/div/div/div/div[2]/div/div/section/div/div/div/div/div/div/div/div[2]/div[1]/div[1]/a/div/div[2]/div/span')
        for act_handle in account_handle:
            global account_tags
            acct_handles = act_handle.text
            account_tags.append(acct_handles)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        account_links = soup.find_all('a', href=True, class_='css-4rbku5 css-18t94o4 css-1dbjc4n r-1loqt21 r-1wbh5a2 r-dnmrzs r-1ny4l3l')
        for acct_links in account_links:
            global act_link
            act_link = acct_links['href']
            account_link.append(act_link)
        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
scroll()
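A scrape-as-you-scroll loop like the one above reads the visible elements on every pass, so items that stay on screen across two passes get appended twice. The sketch below shows one possible way to keep each item once, in first-seen order. The function scrape_while_scrolling and its three callables are hypothetical hooks, not part of the code above:

```python
import time


def scrape_while_scrolling(scroll_once, read_visible, get_height, pause=1.0, max_rounds=100):
    """Collect items on every pass while scrolling; each item is kept only once.

    scroll_once, read_visible and get_height are caller-supplied callables, e.g.
    read_visible = lambda: [e.text for e in driver.find_elements_by_xpath(some_xpath)]
    """
    seen = {}  # dict keys preserve insertion order and ignore repeats
    last_height = get_height()
    for _ in range(max_rounds):
        for item in read_visible():
            seen[item] = None
        scroll_once()
        time.sleep(pause)
        new_height = get_height()
        if new_height == last_height:
            break  # page stopped growing
        last_height = new_height
    # One final read picks up whatever the last scroll revealed
    for item in read_visible():
        seen[item] = None
    return list(seen)
```

Deduplicating by text assumes that two distinct accounts never share exactly the same text; for links, the href is usually a safer key.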
Just a note: as another user commented, Instagram is very difficult to scrape because of its dynamic HTML variables, so they are right that no one, myself included, is very interested in writing that for Instagram.
The first function returns True if an element is found.
The second function keeps loading comments until the last one (by clicking the load-more button for as long as it exists), and then BeautifulSoup is used to scrape all comments.
def check_exists_by_xpath(self, xpath):
    try:
        self.driver.find_element_by_xpath(xpath)
    except NoSuchElementException:
        return False
    return True
def get_comments(self):
    # Keep clicking the load-more button until it no longer exists
    while self.check_exists_by_xpath("//div/ul/li/div/button"):
        load_more_comments_element = self.driver.find_element_by_xpath("//div/ul/li/div/button")
        load_more_comments_element.click()
        sleep(1)
    sleep(2)
    soup = BeautifulSoup(self.driver.page_source, 'lxml')
    comms = soup.find_all('div', attrs={'class': 'C4VMK'})
    print(len(comms))
    soup_2 = BeautifulSoup(str(comms), 'lxml')
    spans = soup_2.find_all('span')
    comments = [i.text.strip() for i in spans if i != '']
    print(comments)
I hope this helps - be aware I'm also still learning.
This worked for me; it programmatically clicks the "load more" button for as long as it is displayed.
try:
    load_more_comment = driver.find_element_by_css_selector('.MGdpg > button:nth-child(1)')
    print("Found {}".format(str(load_more_comment)))
    while load_more_comment.is_displayed():
        load_more_comment.click()
        time.sleep(1.5)
        load_more_comment = driver.find_element_by_css_selector('.MGdpg > button:nth-child(1)')
        print("Found {}".format(str(load_more_comment)))
except Exception as e:
    print(e)
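The click-until-gone pattern above can also be written with an explicit upper bound, so a button that never disappears cannot loop forever. This is a minimal sketch, not the answer's method. The name click_until_gone and the find_button callable are hypothetical:

```python
import time


def click_until_gone(find_button, pause=1.5, max_clicks=200):
    """Click the element returned by find_button() until it is missing or hidden.

    find_button is a caller-supplied callable that returns the element, returns
    None, or raises once the button no longer exists.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            button = find_button()
        except Exception:
            break  # e.g. NoSuchElementException once the button is removed
        if button is None or not button.is_displayed():
            break
        button.click()
        clicks += 1
        time.sleep(pause)  # let the next batch of comments load
    return clicks
```

With the selector from the snippet above, find_button could be lambda: driver.find_element_by_css_selector('.MGdpg > button:nth-child(1)'). The returned count shows how many batches were loaded.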

Selenium: scroll down of page and parse with python

I am trying to parse a page on ozon.ru, and I have a problem.
I need to scroll the page and then get all of the HTML.
When I scroll the page, the height changes, but the parsing results are wrong: only the results from the first page are returned.
I don't understand whether I need to refresh the page's HTML, and how I can do that.
import os
import time
from bs4 import BeautifulSoup
from selenium import webdriver
def get_link_product_ozon(url):
    chromedriver = "chromedriver"
    os.environ["webdriver.chrome.driver"] = chromedriver
    driver = webdriver.Chrome(chromedriver)
    driver.get(url)
    i = 0
    last_height = driver.execute_script("return document.body.scrollHeight")
    while i < 80:
        try:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(3)
            new_height = driver.execute_script("return document.body.scrollHeight")
            i += 1
            last_height = new_height
        except:
            time.sleep(3)
            continue
    soup = BeautifulSoup(driver.page_source, "lxml")
    all_links = soup.findAll('div', class_='bOneTile inline jsUpdateLink mRuble ')
    for link in all_links:
        print(link.attrs['data-href'])
    driver.close()
The divs loaded after scrolling don't have the class mRuble, and you are doing exact string matching on the class attribute. Maybe try something like:
all_links = soup.select('div.bOneTile.inline.jsUpdateLink')
all_links = soup.select('div[data-href]')
...
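To illustrate why matching individual class tokens is more robust than matching the whole class string, here is a small dependency-free sketch. It uses the standard library's html.parser in place of BeautifulSoup, and the sample HTML and the DataHrefCollector class are made up for the example:

```python
from html.parser import HTMLParser


class DataHrefCollector(HTMLParser):
    """Collect data-href from <div> tags whose class list contains all required tokens."""

    def __init__(self, required=("bOneTile", "inline", "jsUpdateLink")):
        super().__init__()
        self.required = set(required)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attributes = dict(attrs)
        # Split the class attribute into tokens; order and extra classes do not matter
        classes = set((attributes.get("class") or "").split())
        if self.required <= classes and attributes.get("data-href"):
            self.hrefs.append(attributes["data-href"])


# Made-up sample: the second div lacks mRuble, like the divs loaded after scrolling
html = (
    '<div class="bOneTile inline jsUpdateLink mRuble " data-href="/item/1"></div>'
    '<div class="bOneTile inline jsUpdateLink" data-href="/item/2"></div>'
    '<div class="other" data-href="/item/3"></div>'
)
collector = DataHrefCollector()
collector.feed(html)
print(collector.hrefs)
```

Both of the first two divs are collected even though their class strings differ, which is the same effect the select('div.bOneTile.inline.jsUpdateLink') suggestion aims for.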
