Need to Scrape Paginated Pages in Python Selenium

I have a selenium / python script that scrapes page titles and some other information. At the bottom of the page is a "next" button along with some pagination that loads the next 20 results or so when I click next. This all happens without a page load. I need to be able to scrape the remaining pages until the "next" button is no longer visible, which indicates there are no more results to be loaded. Below is the logic I have so far to give you an idea. I have simplified it so it's easily followed. I can scrape the first page of titles, but once the browser clicks "next" the script terminates. How do I get it to scrape the remaining pages? Thanks!
# load the web page
browser.get("URL")
# scrape the titles on the current page
deal_title = browser.find_elements_by_xpath("element xpath")
titles = []
for title in deal_title:
    titles.append(title.text)
# click the next button
browser.find_element_by_xpath("button xpath").click()
print(titles)

You need a loop to repeat the process. This should work. You might also want to add sufficient sleeps or waits to make sure all the elements on the page get loaded. And try not to rely on XPath so much; if you can target a class or id instead, that would be better.
from selenium.common.exceptions import NoSuchElementException

titles = []
while True:
    deal_title = browser.find_elements_by_xpath("element xpath")
    for title in deal_title:
        titles.append(title.text)
    try:
        browser.find_element_by_xpath("xpath of the next button").click()
    except NoSuchElementException:
        break
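For instance, the same loop with explicit waits and class-based locators might look like the sketch below. Note that "deal-title" and "next-button" are placeholder class names for whatever the real page uses, browser is assumed to be an already-initialized driver, and the newer find_element(By, ...) API is used.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

titles = []
while True:
    # wait up to 10s for the titles to be present instead of sleeping blindly
    # ("deal-title" and "next-button" are placeholder class names)
    deal_titles = WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "deal-title"))
    )
    titles.extend(t.text for t in deal_titles)
    try:
        browser.find_element(By.CLASS_NAME, "next-button").click()
    except NoSuchElementException:
        break  # no "next" button left, so the last page was reached
print(titles)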

Related

Selenium - HTML doesn't always update after a click. Content in the browser changes, but I often get the same HTML from prior to the click

I'm trying to set up a simple webscraping script to pull every hyperlink from the discover cards on Bandcamp.
Here is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

browser = webdriver.Chrome()
all_links = []
page = 1
url = "https://bandcamp.com/?g=all&s=new&p=0&gn=0&f=digital&w=-1"
browser.get(url)
while page < 6:
    page += 1
    # wait until discover cards are loaded
    test = WebDriverWait(browser, 20).until(EC.element_to_be_clickable(
        (By.XPATH, '//*[@id="discover"]/div[9]/div[2]/div/div[1]/div/table/tbody/tr[1]/td[1]/a/div')))
    # scrape hyperlinks for each of the 8 albums shown
    titles = browser.find_elements(By.CLASS_NAME, "item-title")
    links = [title.get_attribute('href') for title in titles[-8:]]
    all_links = all_links + links
    print(links)
    # pagination - click through the page buttons as the links are scraped
    page_nums = browser.find_elements(By.CLASS_NAME, 'item-page')
    for page_num in page_nums:
        if page_num.text.isnumeric():
            if int(page_num.text) == page:
                page_num.click()
                time.sleep(20)  # I've tried multiple long wait times as well as WebDriverWaits on different elements to see if the HTML will update, but I haven't seen a positive effect
                break
I'm using print(links) to see where this is going wrong. In the Selenium-driven browser it clicks through the pages fine. Note that pagination via the url parameters doesn't seem possible, as the discover cards often won't load unless you click the page buttons towards the bottom of the page. BeautifulSoup and Requests don't work either, for the same reason. The print function returns the following:
['https://neubauten.bandcamp.com/album/stimmen-reste-musterhaus-7?from=discover-new', 'https://cirka1.bandcamp.com/album/time?from=discover-new', 'https://futuramusicsound.bandcamp.com/album/yoga-meditation?from=discover-new', 'https://deathsoundbatrecordings.bandcamp.com/album/real-mushrooms-dsbep092?from=discover-new', 'https://riacurley.bandcamp.com/album/take-me-album?from=discover-new', 'https://terracuna.bandcamp.com/album/el-origen-del-viento?from=discover-new', 'https://hyper-music.bandcamp.com/album/hypermusic-vol-4?from=discover-new', 'https://defisis1.bandcamp.com/album/priceless?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://finitysounds.bandcamp.com/album/kreme?from=discover-new', 'https://mylittlerobotfriend.bandcamp.com/album/amen-break?from=discover-new', 'https://electrinityband.bandcamp.com/album/rise?from=discover-new', 'https://abyssal-void.bandcamp.com/album/ritualist?from=discover-new', 'https://plataformarecs.bandcamp.com/album/v-a-david-lynch-experience?from=discover-new', 'https://hurricaneturtles.bandcamp.com/album/industrial-synth?from=discover-new', 'https://blackwashband.bandcamp.com/album/2?from=discover-new', 'https://worldwide-bitchin-records.bandcamp.com/album/wack?from=discover-new']
Each time it correctly pulls the first 8 albums on page 1, then for pages 2-4 it repeats the 8 albums on page 2, for pages 5-7 it repeats the 8 albums on page 5, and so on. Even though the page is updating (and the url changes) in the selenium browser, for some reason selenium is not recognizing any changes to the html so it repeats the same titles. Any idea where I've gone wrong?
Your definition of titles, i.e.
titles = browser.find_elements(By.CLASS_NAME, "item-title")
is a bad idea because item-title is the class of many elements on the page. Another bad idea is picking titles[-8:]. It may sound good because you think that each time you click a page the new elements are added at the end, but this is not always the case; yours is one of those cases where elements are not added sequentially.
So let's start by considering a class exclusive to the cards, for example discover-item. Open the DevTools, press CTRL+F and enter .discover-item. When the url is first loaded, it will find 8 results. Click next page and it finds 16 results; click again and it will find 24 results. To better see what's going on, I suggest you run the following each time you click the "next" button.
el = driver.find_elements(By.CSS_SELECTOR, '.discover-item')
for i, e in enumerate(el):
    print(i, e.get_attribute('innerText').replace('\n', ' - '))
In particular, when you arrive at page 3, you will see that the first item shown on page 1 (which in my case is 0 and friends - Jacob Stanley - alternative) is now printed at a different position (in my case 8 and friends - Jacob Stanley - alternative). What happened is that the items of page 3 were added at the beginning of the list, and so you can see why titles[-8:] was a bad choice.
So a better choice is to consider all cards each time you go to the next page, instead of only the last 8 (notice that the HTML of this site contains no more than 24 cards at a time), and then add all current cards to a set (since a set cannot contain duplicates, only new elements will be added).
# scroll to the cards
cards = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".discover-item")))
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', cards[0])
time.sleep(1)

items = set()
while True:
    links = driver.find_elements(By.CSS_SELECTOR, '.discover-item .item-title')
    # extract info and store it
    for idx, card in enumerate(cards):
        tit_art_gen = card.get_attribute('innerText').replace('\n', ' - ')
        href = links[idx].get_attribute('href')
        # print(idx, tit_art_gen)
        items.add(tit_art_gen + ' - ' + href)
    # click 'next' button if it is not disabled
    next_button = driver.find_element(By.XPATH, "//a[.='next']")
    if 'disabled' in next_button.get_attribute('class'):
        print('last page reached')
        break
    else:
        next_button.click()
    # wait until new elements are loaded
    cards_new = cards.copy()
    while cards_new == cards:
        cards_new = driver.find_elements(By.CSS_SELECTOR, '.discover-item')
        time.sleep(.5)
    cards = cards_new.copy()

How to scrape a page that is dynamically loaded?

So here's my problem. I wrote a program that is perfectly able to get all of the information I want on the first page that I load. But when I click on the nextPage button, it runs a script that loads the next batch of products without actually moving to another page.
So when the next iteration of my loop runs, all I get is the same content as the first page, even though what's shown in the browser I'm emulating has changed.
This is the code I run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://www.my-website.com/search/results-34y1i")
soup = BeautifulSoup(driver.page_source, 'html.parser')
time.sleep(2)

# /////////// code to find total number of pages
currentPage = 0
button_NextPage = driver.find_element(By.ID, 'nextButton')
while currentPage != totalPages:
    # ///////// code to find the products
    currentPage += 1
    button_NextPage = driver.find_element(By.ID, 'nextButton')
    button_NextPage.click()
    time.sleep(5)
Is there any way for me to scrape exactly what's loaded on my browser?
The issue seems to be that you're just fetching page 1, as shown in the following line:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page=1&view=grid")
But as you can see, there's a query parameter called page in the url that determines which page's html you are fetching. So every time you loop to a new page, you'll have to fetch the new html content with the driver by changing the page query parameter. For example, in your loop it would be something like this:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page={page}&view=grid".format(page = currentPage))
After you fetch the new html structure, you'll be able to access the new elements that are present on the different pages, as you require.
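Putting it together, the loop might look roughly like the sketch below. Here totalPages and the product-scraping code are assumed to come from your existing logic, and the sleep is just a crude wait for the dynamic content.
base_url = ("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna"
            "?productLineName=magic&setName=commander-streets-of-new-capenna"
            "&page={page}&view=grid")

currentPage = 1
while currentPage <= totalPages:  # totalPages comes from your existing code
    driver.get(base_url.format(page=currentPage))
    time.sleep(5)  # crude wait for the dynamic content to load
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ///////// your existing code to find the products goes here
    currentPage += 1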

Python/Selenium - How to webscrape this dropdown

My code runs fine and prints the title for all rows except the rows with dropdowns.
For example, row 4 has a dropdown if clicked. I implemented a try which would, in theory, initiate the dropdown and then pull the titles.
But my click/scrape for the rows with these dropdowns is not printing anything.
Expected output: print all titles, including the ones in the dropdowns.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
productlist = soup.find_all('div', class_='card item-container session')
for property in productlist:
    sessiontitle = property.find('h4', class_='session-title card-title').text
    print(sessiontitle)
    try:
        ifDropdown = driver.find_elements_by_class_name('item-expand-action expand')
        ifDropdown.click()
        time.sleep(4)
        newTitle = driver.find_element_by_class_name('card-title').text
        print(newTitle)
    except:
        newTitle = 'none'
There were a couple of issues. First, when you locate an element by class name and it has more than one class, you need to separate the classes with dots, not spaces, so that the driver knows it's dealing with another class.
Second, find_elements returns a list, and a list has no .click(), so you get an error, which your except catches but treats as meaning there was no link to click.
I rewrote it (without soup for now) so that it instead checks (with the dot replacing the space) for a link to open within the session, and then loops over the new titles that appear.
Here is what I have and tested. Note that this only gets the sessions and subsessions currently in view; you will need to add logic to scroll to get the rest.
# stuff to initialize driver is above here, I used firefox
# Open the website page
URL = "https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list"
driver.get(URL)
time.sleep(4)  # time for page to populate
product_list = driver.find_elements_by_css_selector('div.card.item-container.session')
# above line gets all top level sessions
for product in product_list:
    session_title = product.find_element_by_css_selector('h4.card-title').text
    print(session_title)
    dropdowns = product.find_elements_by_class_name('item-expand-action.expand')
    # above line finds dropdown within this session, if any
    if len(dropdowns) == 0:  # nothing more for this session
        continue  # move to next session
    # still here, click on the dropdown, using execute because link can overlap chevron
    driver.execute_script("arguments[0].scrollIntoView(true); arguments[0].click();",
                          dropdowns[0])
    time.sleep(4)  # wait for subsessions to appear
    session_titles = product.find_elements_by_css_selector('h4.card-title')
    session_index = 0  # suppress reprinting title of master session
    for session_title in session_titles:
        if session_index > 0:
            print("    " + session_title.text)  # indent for clarity
        session_index = session_index + 1
# still to do: deal with other sessions that only get paged into view when you scroll
# that is a different question

Python- WebScraping a page

My code is supposed to go into a website, navigate through 2 pages, and print all the titles and URL/href within each row.
Currently, my code goes into these 2 pages fine; however, it only prints the first title of each page and not the title of each row.
The page does have some JavaScript, and I think maybe this is why it does not show any links/urls/hrefs within these rows. Ideally I'd like to print the URLs of each row.
from selenium import webdriver
import time

driver = webdriver.Chrome()
for x in range(1, 3):
    driver.get(f'https://www.abstractsonline.com/pp8/#!/9325/presentations/endometrial/{x}')
    time.sleep(3)
    page_source = driver.page_source
    eachrow = driver.find_elements_by_xpath("//li[@class='result clearfix']")
    for item in eachrow:
        title = driver.find_element_by_xpath("//span[@class='bodyTitle']").text
        print(title)
You're using driver inside your for loop, meaning you're searching the whole page each time, so you will always get the same first element.
You want to search from each item instead:
for item in eachrow:
    title = item.find_element_by_xpath(".//span[@class='bodyTitle']").text
    print(title)
Also, there are no "URLs" in the rows as mentioned; when you click on a row, the data-id attribute is used in the request.
<h1 class="name" data-id="1989" data-key="">
Which sends a request to https://www.abstractsonline.com/oe3/Program/9325/Presentation/694
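So if you need to follow a row, one option is to read its data-id and reproduce that request yourself. A rough sketch is below; note that how data-id maps to the numeric Presentation id would need to be confirmed in the network tab, since the two values differ in the example above.
for item in eachrow:
    # the h1.name element carries the data-id the site uses in its request
    data_id = item.find_element_by_xpath(".//h1[@class='name']").get_attribute('data-id')
    print(data_id)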

Webscraping Click Button Selenium

I am trying to webscrape indeed.com to search for jobs using python, with selenium and beautifulsoup. I want to click the next page button but can't seem to figure out how to do this. I've looked at many threads, but it is unclear to me which element I am supposed to act on. Here is the web page html; the code marked in grey comes up when I inspect the next button.
Also, just to mention: I first tried to follow what happens to the url when mousedown is executed. After reading the addppurlparam function and adding the strings from the function to the url, I just get thrown back to page one.
Here is my code for the class with selenium meant to click on the button:
from selenium import webdriver
from selenium.webdriver import ActionChains
driver = webdriver.Chrome("C:/Users/alleballe/Downloads/chromedriver.exe")
driver.get("https://se.indeed.com/Internship-jobb")
print(driver.title)
#assert "Python" in driver.title
elem = driver.find_element_by_class_name("pagination-list")
elem = elem.find_element_by_xpath("//li/a[#aria-label='Nästa']")
print(elem)
assert "No results found." not in driver.page_source
assert elem
action = ActionChains(driver).click(elem)
action.perform()
print(elem)
driver.close()
The indeed site is formatted so that it shows 10 results per page.
Your photo shows the wrong section of HTML; instead, you can see that the links contain start=0 for the first page, start=10 for the second, start=20 for the third, and so on.
You could use this knowledge to write code like this:
i = 0
while True:
    driver.get(f'https://se.indeed.com/jobs?q=Internship&start={i}')
    # code here
    i = i + 10
But, to directly answer your question, you should do:
next_page_link = driver.find_element_by_xpath('/html/head/link[6]')
driver.get(next_page_link.get_attribute('href'))
This will find the link element and then navigate to its href.
This works; it paginates to the next page:
driver.find_element_by_class_name("pagination-list").find_element_by_tag_name('a').click()
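For completeness, here is a hedged sketch that keeps clicking that link until it no longer exists, reusing the locators from the question (the 'Nästa' aria-label and the pagination-list class are taken from the original code above):
from selenium.common.exceptions import NoSuchElementException

while True:
    # scrape the current page of results here
    try:
        pagination = driver.find_element_by_class_name("pagination-list")
        pagination.find_element_by_xpath(".//li/a[@aria-label='Nästa']").click()
    except NoSuchElementException:
        break  # no 'Nästa' link means the last page was reached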
