I wanted to extract data from multiple pages of a website. I was able to extract the data from the first page, but could not move to the next page using the next button. I would really appreciate any advice on pagination. Here is the part of the code where I need the suggestion:
def check_element_exists_by_xpath(xpath):
    try:
        driver.find_element_by_xpath(xpath)
    except NoSuchElementException:
        return False
    return True

count = 0
while check_element_exists_by_xpath("//span[contains(text(), 'next')]"):
    try:
        if count > 0:
            driver.find_element_by_xpath("//span[contains(text(), 'next')]")
        mailcollector()
        count = count + 1
    except (NoSuchElementException, TimeoutException, WebDriverException):
        time.sleep(3)
        driver.refresh()
        driver.back()
The Inspect Element HTML code for the Next button is:
<li class="pagination-link next-link">
  <a data-aa-region="srp-pagination" data-aa-name="srp-next-page">
    <span>next</span>
  </a>
</li>
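In case it helps, here is a minimal sketch of one way such a loop could look, assuming the `driver`, `mailcollector()` and imports from the code above and the `next-link` markup shown in the snippet; it clicks the <a> that wraps the next span, scrapes, and stops once the link is gone:

# A rough sketch, not a tested answer: assumes `driver` and `mailcollector()`
# from the question and the <li class="pagination-link next-link"> markup above.
from selenium.common.exceptions import (NoSuchElementException,
                                        TimeoutException, WebDriverException)
import time

def click_next_if_present():
    """Click the 'next' pagination link; return False once it no longer exists."""
    try:
        next_link = driver.find_element_by_xpath("//li[contains(@class, 'next-link')]/a")
        next_link.click()
        return True
    except (NoSuchElementException, TimeoutException, WebDriverException):
        return False

mailcollector()                  # scrape the first page
while click_next_if_present():
    time.sleep(3)                # crude wait for the next page to render
    mailcollector()              # scrape the page that was just opened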
I'm writing a script to do some web scraping on my Firebase for a few select users. After accessing the events page for a user, I first want to check that no events have been logged by that user.
For this, I am using Selenium and Python. Using XPath seems to work fine for locating links and navigation in all other parts of the script, except for accessing elements in a table. At first, I thought I might have been using the wrong XPath expression, so I copied the path directly from Chrome's inspection window, but still no luck.
As an alternative, I have tried to copy the page source and pass it into Beautiful Soup, and then parse it there to check for the element. No luck there either.
Here's some of the code, and some of the HTML I'm trying to parse. Where am I going wrong?
# Using WebDriver - always triggers an exception
def check_if_user_has_any_data():
    try:
        time.sleep(10)
        element = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="event-table"]/div/div/div[2]/mobile-table/md-whiteframe/div[1]/ga-no-data-table/div')))
        print(type(element))
        if element == True:
            print("Found empty state by copying XPath expression directly. It is a bit risky, but it seems to have worked")
        else:
            print("didn't find empty state")
    except:
        print("could not find the empty state element", EC)

# Using Beautiful Soup
def check_if_user_has_any_data_2():
    time.sleep(10)
    html = driver.execute_script("return document.documentElement.outerHTML")
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.text[:500])
    print(len(soup.findAll('div', {"class": "table-row-no-data ng-scope"})))
HTML
<div class="table-row-no-data ng-scope" ng-if="::config" ng-class="{overlay: config.isBuilderOpen()}">
<div class="no-data-content layout-align-center-center layout-row" layout="row" layout-align="center center">
<!-- ... -->
</div>
The first version triggers the exception, even though I expect 'element' to evaluate as True. In actuality, the element is not found.
The second version prints the first 500 characters (correctly, as far as I can tell), but it returns '0'. Based on my inspection of the page source, I expect it to return '1'.
Use the following code:
elements = driver.find_elements_by_xpath("//*[@id='event-table']/div/div/div[2]/mobile-table/md-whiteframe/div[1]/ga-no-data-table/div")
size = len(elements)
if size > 0:
    # Element is present. Do your action.
    pass
else:
    # Element is not present. Do your alternative action.
    pass
Note: find_elements does not throw an exception when nothing matches; it simply returns an empty list.
Here is the method I generally use.
Imports
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
Method
def is_element_present(self, how, what):
    try:
        self.driver.find_element(by=how, value=what)
    except NoSuchElementException as e:
        return False
    return True
Some elements load dynamically, so it is better to use an explicit wait with a timeout rather than checking for the element right away.
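As a rough sketch of that idea (my own illustration, not part of the answer above), an explicit wait can be wrapped the same way; the locator in the usage comment is only a placeholder:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def is_element_present_within(driver, how, what, timeout=10):
    """Wait up to `timeout` seconds for the element to appear, then report presence."""
    try:
        WebDriverWait(driver, timeout).until(EC.presence_of_element_located((how, what)))
        return True
    except TimeoutException:
        return False

# Usage (placeholder locator):
# is_element_present_within(driver, By.XPATH, "//div[@class='some-dynamic-element']")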
If you're using Python and Selenium, you can use this:
try:
    driver.find_element_by_xpath("<Full XPath expression>")  # Test whether the element exists
    # <Other code>
except:
    # <Run this if the element doesn't exist>
    pass
I've solved it. The page had a bunch of different iframe elements, and I didn't know that one had to switch between frames in Selenium to access those elements.
There was nothing wrong with the initial code, or the suggested solutions which also worked fine when I tested them.
Here's the code I used to test it:
# Time for the page to load
time.sleep(20)
# Find all iframes
iframes = driver.find_elements_by_tag_name("iframe")
# From inspecting page source, it looks like the index for the relevant iframe is [0]
x = len(iframes)
print("Found ", x, " iFrames") # Should return 5
driver.switch_to.frame(iframes[0])
print("switched to frame [0]")
if WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@class="no-data-title ng-binding"]'))):
print("Found it in this frame!")
Check the length of the list of elements you are retrieving with an if statement.
Example:
elements = driver.find_elements_by_xpath("//a[@href='https://www.example.com']")
if len(elements) > 0:
    # Do something.
    pass
On the URL below I need to click the message icon links, which contain 'svg' tags inside them.
https://www.sciencedirect.com/science/article/pii/S0898656817301687
For that I am using the code below:
lenoftags = driver.find_elements_by_xpath('//a[@class="author size-m workspace-trigger"]//*[local-name()="svg"]')
tagcount = len(lenoftags)
newcount = range(1, tagcount)
if len(lenoftags) == 0:
    driver.back()
elif len(lenoftags) >= 1:
    for jj in newcount:
        try:
            driver.find_element_by_xpath('//a[@class="author size-m workspace-trigger"][%d]//*[local-name()="svg"]' % jj).click()
        except (NoSuchElementException, TimeoutException, WebDriverException):
            try:
                driver.find_element_by_xpath('//a[@class="author size-m workspace-trigger"]//*[local-name()="svg"]').click()
            except (NoSuchElementException, TimeoutException, WebDriverException):
                continue
        driver.back()
    driver.back()
else:
    driver.back()
The code works when the links are in order, but on the URL above it clicks only the first link.
Can anyone please help resolve this?
You can avoid implementing extra logic. Try below instead:
tags = driver.find_elements_by_xpath('//a[contains(@class, "author")]//*[local-name()="svg"]')
if tags:
    # Links found
    for tag in tags:
        tag.click()
else:
    # Links not found. Do something else
    pass
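One caveat of my own, not from the answer: if clicking an icon opens a panel or navigates away, element references found before the click can go stale. A sketch that re-locates the n-th icon on every pass, assuming the same locator as above:

from selenium.common.exceptions import StaleElementReferenceException, WebDriverException

icon_xpath = '//a[contains(@class, "author")]//*[local-name()="svg"]'

count = len(driver.find_elements_by_xpath(icon_xpath))
for i in range(1, count + 1):
    try:
        # Re-locate the i-th icon each time so the reference is fresh
        driver.find_element_by_xpath('(%s)[%d]' % (icon_xpath, i)).click()
    except (StaleElementReferenceException, WebDriverException):
        continue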
total_link = []
temp = ['a']
total_num = 0
while driver.find_element_by_tag_name('div'):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    Divs = driver.find_element_by_tag_name('div').text
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    my_titles = soup.select(
        'div._6d3hm > div._mck9w'
    )
    for title in my_titles:
        try:
            if title in temp:
                # print('duplicate')
                pass
            else:
                # print('not a duplicate')
                link = str(title.a.get("href"))  # Get the address!
                total_link.append(link)
                # print(link)
        except:
            pass
    print("Number collected so far: " + str(len(total_link)))
    temp = my_titles
    time.sleep(2)
    if 'End of Results' in Divs:
        print('end')
        break
    else:
        continue
Hello, I was scraping Instagram data with tags in Korean.
My code consists of the following steps:
1. Scroll down the page.
2. Using bs4 and requests, get the HTML.
3. Locate the time log, picture src, text, tags, and ID.
4. Select them all and crawl them.
5. After finishing with the HTML currently on the page, scroll down again.
6. Repeat until the end.
By doing this, and using code from people on this site, it seemed to work, but after a few scrolls down, at certain points the scrolling stops with an error message showing '읽어드리지 못합니다', or in English, 'Unable to read'.
Can I know why this error pops up and how to solve the problem?
I am using Python and Selenium.
Thank you for your answer.
Instagram tries to protect itself against malicious activity such as scraping or other automated access. This error often occurs when you access Instagram pages abnormally fast, so you have to call time.sleep() more frequently or with longer delays.
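As a rough illustration of that advice applied to the scroll loop above; the 3-7 second range is an arbitrary guess, not a documented limit:

import random
import time

def polite_pause(min_seconds=3, max_seconds=7):
    """Sleep for a randomized interval so requests are not sent at a fixed, rapid pace."""
    time.sleep(random.uniform(min_seconds, max_seconds))

# Inside the scroll loop:
# driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# polite_pause()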
I am trying to scrape with Selenium in Python, looping through landing pages on bigkinds.or.kr by clicking the increasing page-number buttons.
The next page is located in the following HTML according to the Chrome Inspector:
<div class="newsPage">
<div class="btmDelBtn">
...</div>
<span>
1
2
3
4
5
6
</span>
I am not having any success crawling by clicking through to the next page. Please help me.
Here is my code:
url = "https://www.bigkinds.or.kr/main.do"
browser.get(url)
...
currentPageElement = browser.find_element_by_xpath("//*[@id='content']/div/div/div[2]/div[7]/span/a[2]")
print(currentPageElement)
currentPageNumber = int(currentPageElement.text)
print(currentPageNumber)
In xpath, "/span/a[2]" is a page number. How can I make loop for this xpath.
Try to use below code:
from selenium.common.exceptions import NoSuchElementException
url = "https://www.bigkinds.or.kr/main.do"
browser.get(url)
page_count = 1

while True:
    # Increase page_count by 1 on each iteration
    page_count += 1
    # Do what you need to do on each page
    # Code goes here
    try:
        # Clicking on "2" in the pagination on the first iteration, "3" on the second...
        browser.find_element_by_link_text(str(page_count)).click()
    except NoSuchElementException:
        # Stop the loop if no more pages are available
        break
Update
If you still want to use search by XPath, you might need to replace line
browser.find_element_by_link_text(str(page_count)).click()
with line
browser.find_element_by_xpath('//a[@onclick="getSearchResultNew(%s)"]' % page_count).click()
...or if you want to use your absolute XPath (not the best idea), you can try
browser.find_element_by_xpath("//*[#id='content']/div/div/div[2]/div[7]/span/a[%s]" % page_count).click()
I need to write a script that uses Selenium to go over the pages of a website and download each page to a file.
This is the website I need to go through, and I want to download all 10 pages of reviews.
This is my code:
import urllib2,os,sys,time
from selenium import webdriver
browser=urllib2.build_opener()
browser.addheaders=[('User-agent', 'Mozilla/5.0')]
url='http://www.imdb.com/title/tt2948356/reviews?ref_=tt_urv'
driver = webdriver.Chrome('chromedriver.exe')
driver.get(url)
time.sleep(2)
if not os.path.exists('reviewPages'):os.mkdir('reviewPages')
response=browser.open(url)
myHTML=response.read()
fwriter=open('reviewPages/'+str(1)+'.html','w')
fwriter.write(myHTML)
fwriter.close()
print 'page 1 done'
page = 2
while True:
    cssPath = '#tn15content > table:nth-child(4) > tbody > tr > td:nth-child(2) > a:nth-child(11) > img'
    try:
        button = driver.find_element_by_css_selector(cssPath)
    except:
        error_type, error_obj, error_info = sys.exc_info()
        print 'STOPPING - COULD NOT FIND THE LINK TO PAGE: ', page
        print error_type, 'Line:', error_info.tb_lineno
        break
    button.click()
    time.sleep(2)
    response = browser.open(url)
    myHTML = response.read()
    fwriter = open('reviewPages/' + str(page) + '.html', 'w')
    fwriter.write(myHTML)
    fwriter.close()
    time.sleep(2)
    print 'page', page, 'done'
    page += 1
But the program just stops after downloading the first page. Could someone help? Thanks.
So, a few things are causing this.
The first thing I think is causing you issues is:
table:nth-child(4)
When I go to that website, I think you just want:
table >
The second issue is the break statement in your except block. That says: when I get an error, stop looping.
So what's happening is that your try/except is not working because your CSS selector is not quite correct, so you end up in your exception handler, where you are telling it to stop looping.
Instead of that very complex CSS path, try this simpler XPath ('//a[child::img[@alt="[Next]"]]/@href'), which will return the URL associated with the little triangular 'next' button on each page.
Or notice that each page has 10 reviews and the URLs for pages 2 to 10 just give the starting review number, i.e. http://www.imdb.com/title/tt2948356/reviews?start=10 is the URL for page 2. Simply calculate the URL for the next page and stop when it doesn't fetch anything.
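As a rough sketch of that second approach (my own illustration, reusing the urllib2 opener from the question; the stop condition and file naming are guesses):

import os
import urllib2

browser = urllib2.build_opener()
browser.addheaders = [('User-agent', 'Mozilla/5.0')]

base_url = 'http://www.imdb.com/title/tt2948356/reviews'
if not os.path.exists('reviewPages'):
    os.mkdir('reviewPages')

# Page 1 starts at review 0, page 2 at review 10, and so on.
for page in range(1, 11):
    start = (page - 1) * 10
    url = base_url if start == 0 else base_url + '?start=%d' % start
    html = browser.open(url).read()
    if not html:
        break  # stop if nothing comes back
    with open('reviewPages/%d.html' % page, 'w') as fwriter:
        fwriter.write(html)
    print 'page', page, 'done'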