Python with Selenium: pagination issue

I am trying to scrape bigkinds.or.kr with Selenium in Python, looping through the result pages by clicking the numbered pagination buttons.
According to the Chrome Inspector, the pagination links are located in the following HTML:
<div class="newsPage">
<div class="btmDelBtn">
...</div>
<span>
1
2
3
4
5
6
</span>
Clicking through to the next page does not work for me. Please help.
Here is my code:
url = "https://www.bigkinds.or.kr/main.do"
browser.get(url)
...
currentPageElement = browser.find_element_by_xpath("//*[@id='content']/div/div/div[2]/div[7]/span/a[2]")
print(currentPageElement)
currentPageNumber = int(currentPageElement.text)
print(currentPageNumber)
In the XPath, the index in "/span/a[2]" selects the page number. How can I loop over this XPath?

Try the code below:
from selenium.common.exceptions import NoSuchElementException
url = "https://www.bigkinds.or.kr/main.do"
browser.get(url)
page_count = 1
while True:
    # Increase page_count by 1 on each iteration
    page_count += 1
    # Do what you need to do on each page
    # Code goes here
    try:
        # Click "2" in the pagination on the first iteration, "3" on the second...
        browser.find_element_by_link_text(str(page_count)).click()
    except NoSuchElementException:
        # Stop the loop if no more pages are available
        break
Update
If you still want to use search by XPath, you might need to replace line
browser.find_element_by_link_text(str(page_count)).click()
with line
browser.find_element_by_xpath('//a[@onclick="getSearchResultNew(%s)"]' % page_count).click()
...or if you want to use your absolute XPath (not the best idea), you can try
browser.find_element_by_xpath("//*[@id='content']/div/div/div[2]/div[7]/span/a[%s]" % page_count).click()
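The XPath from the update can be kept in one small helper so that the page number is substituted in a single place. This is only a sketch: the function name page_link_xpath is made up for illustration, and it assumes, as the answer does, that each pagination link carries an onclick of the form getSearchResultNew(N).

```python
def page_link_xpath(page_number):
    """Return the XPath of the pagination link for one page.

    Assumption (taken from the answer above): every pagination link has
    onclick="getSearchResultNew(<page number>)".
    """
    # int() rejects values such as "2a" before they end up in the XPath.
    return '//a[@onclick="getSearchResultNew(%d)"]' % int(page_number)


# Build the locators for pages 2 to 4.
for number in range(2, 5):
    print(page_link_xpath(number))
```

Inside the loop, the returned string would be passed to browser.find_element_by_xpath(...) in place of the hard-coded expression.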

Related

Selenium - HTML doesn't always update after a click. Content in the browser changes, but I often get the same HTML from prior to the click

I'm trying to set up a simple webscraping script to pull every hyperlink from the discover cards on Bandcamp.
Here is my code:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
browser = webdriver.Chrome()
all_links = []
page = 1
url = "https://bandcamp.com/?g=all&s=new&p=0&gn=0&f=digital&w=-1"
browser.get(url)
while page < 6:
    page += 1
    # wait until discover cards are loaded
    test = WebDriverWait(browser, 20).until(EC.element_to_be_clickable(
        (By.XPATH, '//*[@id="discover"]/div[9]/div[2]/div/div[1]/div/table/tbody/tr[1]/td[1]/a/div')))
    # scrape hyperlinks for each of the 8 albums shown
    titles = browser.find_elements(By.CLASS_NAME, "item-title")
    links = [title.get_attribute('href') for title in titles[-8:]]
    all_links = all_links + links
    print(links)
    # pagination - click through the page buttons as the links are scraped
    page_nums = browser.find_elements(By.CLASS_NAME, 'item-page')
    for page_num in page_nums:
        if page_num.text.isnumeric():
            if int(page_num.text) == page:
                page_num.click()
                time.sleep(20) # I've tried multiple long wait times as well as WebDriverWaits on different elements to see if the HTML will update, but I haven't seen a positive effect
                break
I'm using print(links) to see where this is going wrong. In the Selenium browser, it clicks through the pages fine. Note that pagination via the URL parameters doesn't seem possible, as the discover cards often won't load unless you click the page buttons towards the bottom of my picture. BeautifulSoup and Requests don't work either, for the same reason. The print function returns the following:
['https://neubauten.bandcamp.com/album/stimmen-reste-musterhaus-7?from=discover-new', 'https://cirka1.bandcamp.com/album/time?from=discover-new', 'https://futuramusicsound.bandcamp.com/album/yoga-meditation?from=discover-new', 'https://deathsoundbatrecordings.bandcamp.com/album/real-mushrooms-dsbep092?from=discover-new', 'https://riacurley.bandcamp.com/album/take-me-album?from=discover-new', 'https://terracuna.bandcamp.com/album/el-origen-del-viento?from=discover-new', 'https://hyper-music.bandcamp.com/album/hypermusic-vol-4?from=discover-new', 'https://defisis1.bandcamp.com/album/priceless?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://finitysounds.bandcamp.com/album/kreme?from=discover-new', 'https://mylittlerobotfriend.bandcamp.com/album/amen-break?from=discover-new', 'https://electrinityband.bandcamp.com/album/rise?from=discover-new', 'https://abyssal-void.bandcamp.com/album/ritualist?from=discover-new', 'https://plataformarecs.bandcamp.com/album/v-a-david-lynch-experience?from=discover-new', 'https://hurricaneturtles.bandcamp.com/album/industrial-synth?from=discover-new', 'https://blackwashband.bandcamp.com/album/2?from=discover-new', 'https://worldwide-bitchin-records.bandcamp.com/album/wack?from=discover-new']
Each time it correctly pulls the first 8 albums on page 1, then for pages 2-4 it repeats the 8 albums on page 2, for pages 5-7 it repeats the 8 albums on page 5, and so on. Even though the page is updating (and the URL changes) in the Selenium browser, for some reason Selenium is not recognizing any changes to the HTML, so it repeats the same titles. Any idea where I've gone wrong?
Your definition of titles, i.e.
titles = browser.find_elements(By.CLASS_NAME, "item-title")
is a bad idea because item-title is the class of many elements on the page. Picking titles[-8:] is another bad idea. It may sound reasonable if you assume that each click appends the new elements at the end, but this is not always the case. Yours is one of the cases where elements are not added sequentially.
So let's start by considering a class exclusive to cards, for example discover-item. Open DevTools, press CTRL+F and enter .discover-item. When the URL is first loaded, it finds 8 results. Click next page and it finds 16 results; click again and it finds 24. To see what is going on, I suggest running the following each time you click the "next" button.
el = driver.find_elements(By.CSS_SELECTOR, '.discover-item')
for i,e in enumerate(el):
    print(i,e.get_attribute('innerText').replace('\n',' - '))
In particular, on arriving at page 3 you will see that the first item shown on page 1 (which in my case is 0 and friends - Jacob Stanley - alternative) is now printed at a different position (in my case 8 and friends - Jacob Stanley - alternative). What happened is that the items of page 3 were added at the beginning of the list, which is why titles[-8:] was a bad choice.
So a better choice is to consider all cards each time you go to the next page, instead of only the last 8 (notice that the HTML of this site contains no more than 24 cards), and to add all current cards to a set (since a set cannot contain duplicates, only new elements will be added).
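The difference between slicing and collecting into a set can be seen without a browser. The simulation below is a sketch with made-up data: the card labels and the exact order of the cards after each click are assumptions chosen to mimic the behavior described above (at most 24 cards, new cards not always appended at the end).

```python
def make_page(prefix):
    """Return eight made-up card labels, e.g. 'a1' ... 'a8'."""
    return [prefix + str(i) for i in range(1, 9)]


page1, page2, page3, page4 = (make_page(p) for p in "abcd")

# Hypothetical card lists as found in the HTML after each click.
# The list never holds more than 24 cards, and new cards are not
# always appended at the end.
dom_states = [
    page1,
    page1 + page2,
    page3 + page1 + page2,  # page 3 was inserted at the front
    page4 + page3 + page2,  # page 1 was dropped to stay at 24 cards
]

from_slice = []   # what titles[-8:] would collect
from_set = set()  # what adding every visible card to a set collects
for cards in dom_states:
    from_slice.extend(cards[-8:])
    from_set.update(cards)

print(len(set(from_slice)), "distinct cards via the slice")
print(len(from_set), "distinct cards via the set")
```

In this simulation the slice only ever yields the cards of pages 1 and 2, while the set ends up with all 32 cards. The answer's Selenium code, which applies the same idea to the real page, follows.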
# scroll to cards
cards = WebDriverWait(driver,20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".discover-item")))
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', cards[0])
time.sleep(1)
items = set()
while 1:
    links = driver.find_elements(By.CSS_SELECTOR, '.discover-item .item-title')
    # extract info and store it
    for idx,card in enumerate(cards):
        tit_art_gen = card.get_attribute('innerText').replace('\n',' - ')
        href = links[idx].get_attribute('href')
        # print(idx, tit_art_gen)
        items.add(tit_art_gen + ' - ' + href)
    # click 'next' button if it is not disabled
    next_button = driver.find_element(By.XPATH, "//a[.='next']")
    if 'disabled' in next_button.get_attribute('class'):
        print('last page reached')
        break
    else:
        next_button.click()
        # wait until new elements are loaded
        cards_new = cards.copy()
        while cards_new == cards:
            cards_new = driver.find_elements(By.CSS_SELECTOR, '.discover-item')
            time.sleep(.5)
        cards = cards_new.copy()
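The inner "wait until new elements are loaded" loop above polls forever if the cards never change. A possible variant, sketched here as a standalone helper, adds a timeout. The name wait_for_change and its parameters are hypothetical and not part of Selenium; the helper only assumes that the fetched values can be compared with !=.

```python
import time


def wait_for_change(fetch, previous, timeout=10.0, poll=0.5):
    """Call fetch() until its result differs from `previous`.

    Returns the new result, or raises TimeoutError once `timeout`
    seconds have passed without a change.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = fetch()
        if current != previous:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("content did not change within %.1f s" % timeout)
        time.sleep(poll)
```

With the code above it might be called as cards = wait_for_change(lambda: driver.find_elements(By.CSS_SELECTOR, '.discover-item'), cards) right after next_button.click().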

Getting a "Stale element reference" when trying to loop through pages with the intention of scraping multiple pages

I'm having an issue with my Python code. The intention is to use Selenium to open the website (craigslist), search for a text (Honda), then scrape three pages of the site. I keep getting the
"StaleElementReferenceException: stale element reference: element is not attached to the page document" exception
when the iteration reaches the second page. I can't tell exactly why it's stopping at the second page and not clicking the "next" button once more to reach the third page, then finally scraping the data and printing it.
This is my code:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
DRIVER_PATH = "/Users/mouradsal/Downloads/DataSets Python/chromedriver"
URL = "https://vancouver.craigslist.org/"
browser = webdriver.Chrome(DRIVER_PATH)
browser.get(URL)
browser.maximize_window()
time.sleep(4)
search = browser.find_element_by_css_selector("#query")
search.send_keys("Honda")
search.send_keys(u'\ue007')
content = browser.find_elements_by_css_selector(".hdrlnk")
button = browser.find_element_by_css_selector(".next")
for i in range(0,3):
    button.click()
    print("Count: "+ str(i))
    time.sleep(10)
print("done loop ")
for e in content:
    start = e.get_attribute("innerHTML")
    soup = BeautifulSoup(start, features=("lxml"))
    print(soup.get_text())
    print("***************************")
Any suggestions would be greatly appreciated!
Thanks
for i in range(0,3):
    button = browser.find_element_by_css_selector(".next")
    button.click()
    print("Count: "+ str(i))
    time.sleep(10)
You need to move the element lookup inside the loop, because web elements are replaced every time a new page loads, so references found on an earlier page go stale.
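The same idea can be written as a small function that takes the driver as an argument. This is a sketch, not the answerer's code: the name click_next is made up, and the locator is given in plain-string form ("css selector" is the value of By.CSS_SELECTOR) so that the function can also be tried with a stand-in object instead of a real browser.

```python
def click_next(driver, times, locator=("css selector", ".next")):
    """Click the 'next' control `times` times.

    `driver` is expected to be a Selenium WebDriver. The default locator
    is the plain-string form of (By.CSS_SELECTOR, ".next").
    """
    clicked = 0
    for _ in range(times):
        # Look the button up again on every pass: the reference obtained
        # on the previous page is stale once the new page has loaded.
        button = driver.find_element(*locator)
        button.click()
        clicked += 1
    return clicked
```

A pause or an explicit wait between clicks, like the time.sleep(10) in the snippet above, is still needed with a real page. The same applies to the list of ".hdrlnk" elements: it has to be collected again on each page rather than once before the loop.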

How to automate pagination using Selenium Webdriver?

I want to extract data from multiple pages of a website. I was able to extract the data from the first page, but could not move to the next page using the next button. I would really appreciate advice on pagination. Below is the part of the code where I need suggestions:
def check_element_exists_by_xpath(xpath):
    try:
        driver.find_element_by_xpath(xpath)
    except NoSuchElementException:
        return False
    return True
count = 0
while check_element_exists_by_xpath("//span[contains(text(), 'next')]"):
    try:
        if count > 0:
            driver.find_element_by_xpath("//span[contains(text(), 'next')]")
        mailcollector()
        count = count + 1
    except(NoSuchElementException, TimeoutException, WebDriverException):
        time.sleep(3)
        driver.refresh()
        driver.back()
The HTML of the Next button, as shown by Inspect Element, is:
<li class="pagination-link next-link">
  <a data-aa-region="srp-pagination" data-aa-name="srp-next-page">
    <span>next</span>
  </a>
</li>

WebScraping Next pages with Selenium

When I navigate to the link below and locate the pagination at the bottom of the page:
https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&sort=Boosted
I am only able to scrape the first 4 or so pages before my script stops.
I have tried XPath, css_selector, and the WebDriverWait options.
pages_remaining = True
page = 2  # starts at page 2 since page one is already scraped by the first loop
while pages_remaining:
    # scrape code
    try:
        wait = WebDriverWait(browser, 20)
        wait.until(EC.element_to_be_clickable((By.LINK_TEXT, str(page)))).click()
        print(browser.current_url)
        page += 1
    except TimeoutException:
        pages_remaining = False
Current Results from console:
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=2&sort=Boosted
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=3&sort=Boosted
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=4&sort=Boosted
This solution uses BeautifulSoup, because I am not too familiar with Selenium.
Try creating a variable that holds your page numbers. As you can see, the URL changes when you go to the next page, so you can simply manipulate the URL. See my code example below.
from requests import get  # assumption: get() in this snippet is requests.get
# Define the page numbers first
pages = [str(i) for i in range(1, 53)]  # the range ends at 53 because there are 52 pages
for page in pages:
    response = get("https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page=" + page + "&sort=Boosted")
    # Rest of your code
This snippet should do the job for the rest of the pages. I hope that helps, although it might not be exactly what you were looking for.
If you have any questions, just post below. ;)
Cheers.
You could loop through the page numbers until no more results are shown, just by changing the URL:
from bs4 import BeautifulSoup
from selenium import webdriver
base_url = "https://m.shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page={}&sort=Boosted"
driver = webdriver.Chrome()
page = 1
soup = BeautifulSoup("", "html.parser")
# Loop until there are no more results
while "Looks like we don’t have exactly what you’re looking for." not in soup.text:
    print(base_url.format(page))
    # Go to the page
    driver.get(base_url.format(page))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    ### your extracting code
    page +=1
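Both answers build the page URL by hand. As an alternative sketch, the query string can be assembled with the standard library so that the escaping (for example %2F in the breadcrumb) is done automatically. The helper name page_url is hypothetical, and the parameter names are simply the ones visible in the question's URL; whether the site still accepts them is not verified here.

```python
from urllib.parse import urlencode

# Path taken from the URL in the question.
BASE = "https://shop.nordstrom.com/c/sale-mens-clothing"


def page_url(page):
    """Return the listing URL for one page number."""
    params = {
        "origin": "topnav",
        "breadcrumb": "Home/Sale/Men/Clothing",  # "/" is escaped to %2F
        "page": page,
        "sort": "Boosted",
    }
    return BASE + "?" + urlencode(params)


for number in range(1, 4):
    print(page_url(number))
```

The result can be passed to driver.get(...) or to an HTTP client in place of the concatenated string.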

python using Selenium to download files

I need to write a script that uses Selenium to go over the pages of a website and download each page to a file.
This is the website I need to go through, and I want to download all 10 pages of reviews.
This is my code:
import urllib2,os,sys,time
from selenium import webdriver
browser=urllib2.build_opener()
browser.addheaders=[('User-agent', 'Mozilla/5.0')]
url='http://www.imdb.com/title/tt2948356/reviews?ref_=tt_urv'
driver = webdriver.Chrome('chromedriver.exe')
driver.get(url)
time.sleep(2)
if not os.path.exists('reviewPages'):os.mkdir('reviewPages')
response=browser.open(url)
myHTML=response.read()
fwriter=open('reviewPages/'+str(1)+'.html','w')
fwriter.write(myHTML)
fwriter.close()
print 'page 1 done'
page=2
while True:
    cssPath='#tn15content > table:nth-child(4) > tbody > tr > td:nth-child(2) > a:nth-child(11) > img'
    try:
        button=driver.find_element_by_css_selector(cssPath)
    except:
        error_type, error_obj, error_info = sys.exc_info()
        print 'STOPPING - COULD NOT FIND THE LINK TO PAGE: ', page
        print error_type, 'Line:', error_info.tb_lineno
        break
    button.click()
    time.sleep(2)
    response=browser.open(url)
    myHTML=response.read()
    fwriter=open('reviewPages/'+str(page)+'.html','w')
    fwriter.write(myHTML)
    fwriter.close()
    time.sleep(2)
    print 'page',page,'done'
    page+=1
But the program stops after downloading the first page. Could someone help? Thanks.
A few things are causing this.
The first thing that I think is causing you issues is:
table:nth-child(4)
When I go to that website, I think you just want:
table >
The second error is the break statement in your except block. It says: when I get an error, stop looping.
So what's happening is that your CSS selector is not quite correct, so the try fails and control goes to the except block, where you are telling it to stop looping.
Instead of that very complex CSS path, try this simpler XPath ('//a[child::img[@alt="[Next]"]]/@href'), which returns the URL associated with the little triangular 'next' button on each page.
Or notice that each page has 10 reviews and that the URLs for pages 2 to 10 just give the starting review number, i.e. http://www.imdb.com/title/tt2948356/reviews?start=10 is the URL for page 2. Simply calculate the URL for the next page and stop when it doesn't fetch anything.
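The URL calculation described above can be sketched as follows. The function name review_page_url is made up for illustration, and treating page 1 as start=0 is an assumption; the answer only states the pattern for pages 2 to 10.

```python
REVIEWS_URL = "http://www.imdb.com/title/tt2948356/reviews"
REVIEWS_PER_PAGE = 10  # from the answer: each page shows 10 reviews


def review_page_url(page):
    """Return the URL of a review page, counting pages from 1."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Page 2 starts at review 10, page 3 at review 20, and so on.
    start = (page - 1) * REVIEWS_PER_PAGE
    return "%s?start=%d" % (REVIEWS_URL, start)


for page in range(1, 11):
    print(review_page_url(page))
```

Each URL can then be fetched directly, which removes the need to locate and click the 'next' button at all.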
