Selenium only finds a few elements - Python

I want to build a recommendation system for webtoons, so I am collecting webtoon data. I wrote the code below to scrape the URLs of the webtoons on the Kakao Webtoon page.
from selenium import webdriver

def extract_from_page(page_link):
    links = []
    driver = webdriver.Chrome()
    driver.get(page_link)
    # Elements that hold the webtoon links
    elems = driver.find_elements_by_css_selector(".h-full.relative")
    for elem in elems:
        link = elem.get_attribute('href')
        if link:
            links.append({'id': int(link.split('/')[-1]), 'link': link})
    print(len(links))
    return links
This code works on the weekly pages (https://webtoon.kakao.com/original-webtoon, https://webtoon.kakao.com/original-novel).
However, on the page that shows finished webtoons (https://webtoon.kakao.com/original-webtoon?tab=complete), it only returns 13 URLs, for the 13 webtoons at the top of the page.
I found a similar post (web scraping gives only first 4 elements on a page) and added scrolling, but nothing changed.
I would appreciate it if you could tell me the cause and solution.

Try it like below. The page loads more items only as you scroll, so scroll to the last element found so far and wait for new ones to appear.
driver.get("https://webtoon.kakao.com/original-webtoon?tab=complete")
wait = WebDriverWait(driver,30)
j = 1
for i in range(5):
# Wait for the elements to load/appear
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//a[contains(#href,'content')]")))
# Get all the elements which contains href value
links = driver.find_elements(By.XPATH,"//a[contains(#href,'content')]")
# Iterate to print the links
for link in links:
print(f"{j} : {link.get_attribute('href')}")
j += 1
# Scroll to the last element of the list links
driver.execute_script("arguments[0].scrollIntoView(true);",links[len(links)-1])
Output:
1 : https://webtoon.kakao.com/content/%EB%B0%A4%EC%9D%98-%ED%96%A5/1532
2 : https://webtoon.kakao.com/content/%EB%B8%8C%EB%A0%88%EC%9D%B4%EC%BB%A42/596
3 : https://webtoon.kakao.com/content/%ED%86%A0%EC%9D%B4-%EC%BD%A4%ED%94%8C%EB%A0%89%EC%8A%A4/1683
...
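Folding the same scroll-and-wait idea back into the original extract_from_page function could look roughly like the sketch below. The max_scrolls cap, the fixed one-second pause, and the stop-when-nothing-new-appears check are assumptions added here, not part of either snippet above.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def extract_from_page(page_link, max_scrolls=20):
    driver = webdriver.Chrome()
    driver.get(page_link)
    wait = WebDriverWait(driver, 30)
    seen = {}  # id -> link, deduplicated across scroll passes
    for _ in range(max_scrolls):
        wait.until(EC.presence_of_all_elements_located(
            (By.XPATH, "//a[contains(@href,'content')]")))
        elems = driver.find_elements(By.XPATH, "//a[contains(@href,'content')]")
        before = len(seen)
        for elem in elems:
            link = elem.get_attribute('href')
            if link:
                seen[int(link.split('/')[-1])] = link
        if len(seen) == before and before > 0:
            break  # nothing new appeared, assume we reached the bottom
        # Scroll the last card into view to trigger loading of the next batch
        driver.execute_script("arguments[0].scrollIntoView(true);", elems[-1])
        time.sleep(1)  # crude pause for the lazy loader; tune or replace with a smarter wait
    driver.quit()
    print(len(seen))
    return [{'id': i, 'link': link} for i, link in seen.items()]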

Related

How can I move to the next page with Selenium Python?

I need to access only the products on pages 2 to 5 of the link below. The variable is at the end of the link, and it changes according to the page number.
driver.get(url)
classe = driver.find_elements(By.XPATH, "//*[@class='LinksShowcase_UrlContainer__kMj_n']/p")
pages = 1
for x in url:
    driver.get("https://br.ebay.com/b/Portable-Audio/15052/bn_1642614?_pgn=" + str(pages))
    sleep(2)
    for i in classe:
        #pages += 1
        sleep(0.5)
        links.append(i.text)
        print(links)
    sleep(2)
To get pages 2-5, you can iterate using the range() function:
for page in range(2, 6):
    driver.get("https://br.ebay.com/b/Portable-Audio/15052/bn_1642614?_pgn=" + str(page))

Selenium pagination problem: First and last pages are skipped when scraping

I am using the code below to try to scrape product data from 90 pages; however, the data from the first and last pages are missing from the list object when it completes. Due to the nature of the website I cannot use Scrapy or Beautiful Soup, so I am trying to navigate page by page with the Selenium web driver. I have tried adjusting number_of_pages to the actual number of pages + 1, which still skipped the first and last pages. I have also tried to set page_to_start_clicking to 0, which produces a timeout error. Unfortunately I cannot share more about the source because of the authentication. Thank you in advance for the help!
wait = WebDriverWait(driver, 20)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#ResultsPerPageBottom > nav > span.next'))).click()  # next button
number_of_pages = 90  # PROBLEM: 1st & last pages missed
page_to_start_clicking = 1  # error if 0
# range set from 0; skips 1st and last page
for i in range(0, 90):
    time.sleep(2)
    for ele in driver.find_elements(By.CSS_SELECTOR, 'div.srp-item-body'):
        driver.execute_script("arguments[0].scrollIntoView(true);", ele)
        print(ele.text)
    wait.until(EC.element_to_be_clickable((By.LINK_TEXT, f"{page_to_start_clicking}"))).click()
    page_to_start_clicking = page_to_start_clicking + 1
This was the code from the solution described in the comments.
# Scrape & pagination
wait = WebDriverWait(driver, 20)
number_of_pages = 91
listings = []
for i in range(0, 91):
time.sleep(2)
for ele in driver.find_elements(By.CSS_SELECTOR, 'div.srp-item-body'):
driver.execute_script("arguments[0].scrollIntoView(true);", ele)
listings.append(ele.text)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#ResultsPerPageBottom > nav > span.next'))).click()
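If the click on the next button ever times out once the last page is reached, a small guard keeps the loop clean. This is only a sketch reusing the question's selectors and page count, not part of the solution described in the comments; driver is assumed to be the already-authenticated session from the question.
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)
number_of_pages = 91
listings = []
for page in range(number_of_pages):
    time.sleep(2)
    for ele in driver.find_elements(By.CSS_SELECTOR, 'div.srp-item-body'):
        driver.execute_script("arguments[0].scrollIntoView(true);", ele)
        listings.append(ele.text)
    if page < number_of_pages - 1:
        # Only click "next" while there is still a page left to visit
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#ResultsPerPageBottom > nav > span.next'))).click()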

Selenium webdriver loops through all pages, but only scrapes data from the first page

I am attempting to scrape data across multiple pages (36) of a website to gather the document number and the revision number for each available document and save them to two different lists. If I run the code block below for each individual page, it works perfectly. However, when I added the while loop to loop through all 36 pages, it loops, but only the data from the first page is saved.
#sam.gov website
url = 'https://sam.gov/search/?index=sca&page=1&sort=-modifiedDate&pageSize=25&sfm%5Bstatus%5D%5Bis_active%5D=true&sfm%5BwdPreviouslyPerformedWrapper%5D%5BpreviouslyPeformed%5D=prevPerfNo%2F'
#webdriver
driver = webdriver.Chrome(options = options_, executable_path = r'C:/Users/439528/Python Scripts/Spyder/chromedriver.exe' )
driver.get(url)
#get rid of pop up window
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#sds-dialog-0 > button > usa-icon > i-bs > svg'))).click()
#list of revision numbers
revision_num = []
#empty list for all the WD links
WD_num = []
substring = '2015'
current_page = 0
while True:
    current_page += 1
    if current_page == 36:
        #find all elements on page named "field name". For each one, get the text. if the text is 'Revision Date'
        #then, get the 'sibling' element, which is the actual revision number. append the date text to the revision_num list.
        elements = driver.find_elements_by_class_name('sds-field__name')
        wd_links = driver.find_elements_by_class_name('usa-link')
        for i in elements:
            element = i.text
            if element == 'Revision Number':
                revision_numbers = i.find_elements_by_xpath("./following-sibling::div")
                for x in revision_numbers:
                    a = x.text
                    revision_num.append(a)
        #finding all links that have the partial text 2015 and putting the wd text into the WD_num list
        for link in wd_links:
            wd = link.text
            if substring in wd:
                WD_num.append(wd)
        print('Last Page Complete!')
        break
    else:
        #find all elements on page named "field name". For each one, get the text. if the text is 'Revision Date'
        #then, get the 'sibling' element, which is the actual revision number. append the date text to the revision_num list.
        elements = driver.find_elements_by_class_name('sds-field__name')
        wd_links = driver.find_elements_by_class_name('usa-link')
        for i in elements:
            element = i.text
            if element == 'Revision Number':
                revision_numbers = i.find_elements_by_xpath("./following-sibling::div")
                for x in revision_numbers:
                    a = x.text
                    revision_num.append(a)
        #finding all links that have the partial text 2015 and putting the wd text into the WD_num list
        for link in wd_links:
            wd = link.text
            if substring in wd:
                WD_num.append(wd)
        #click on next page
        click_icon = WebDriverWait(driver, 5, 0.25).until(EC.visibility_of_element_located([By.ID,'bottomPagination-nextPage']))
        click_icon.click()
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, 'main-container')))
Things I've tried:
I added WebDriverWait calls to slow the script down so the page can load and/or the elements become clickable/locatable.
I declared the empty lists outside the loop so they are not overwritten on each iteration.
I have edited the while loop multiple times to either count up to 36 (while current_page < 37) or to move the counter to the top or bottom of the loop.
Any ideas? TIA.
EDIT: added screenshot of 'field name'
I have refactored your code and simplified it.
driver = webdriver.Chrome(options = options_, executable_path = r'C:/Users/439528/Python Scripts/Spyder/chromedriver.exe' )
revision_num = []
WD_num = []
for page in range(1, 37):
    url = 'https://sam.gov/search/?index=sca&page={}&sort=-modifiedDate&pageSize=25&sfm%5Bstatus%5D%5Bis_active%5D=true&sfm%5BwdPreviouslyPerformedWrapper%5D%5BpreviouslyPeformed%5D=prevPerfNo%2F'.format(page)
    driver.get(url)
    if page == 1:
        # The pop-up window only appears on the first page
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#sds-dialog-0 > button > usa-icon > i-bs > svg'))).click()
    # Links whose text contains '2015'
    wd_links = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[contains(@class,'usa-link') and contains(.,'2015')]")))
    # The div following each 'Revision Number' field name holds the revision value
    revision_fields = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='sds-field__name' and text()='Revision Number']/following-sibling::div")))
    for wd_link in wd_links:
        WD_num.append(wd_link.text)
    for revision_field in revision_fields:
        revision_num.append(revision_field.text)
print(revision_num)
print(WD_num)
Since you know there are only 36 pages to iterate over, you can pass the page value directly in the URL.
Wait for the elements to be visible using WebDriverWait.
Construct your XPath so that it identifies the elements uniquely, without needing the if/else branches.

A variable is the same after the for loop and is different in another for loop

I have been getting this issue where printing the link after the for loop returns a link (a good one), but when I print the link variable inside a for link in links loop it returns something different. This is my code:
from time import sleep
from selenium import webdriver

links = []

def Start():
    driver = webdriver.Chrome(executable_path="C:/Users/Tera max/.wdm/drivers/chromedriver/win32/87.0.4280.88/chromedriver.exe")
    driver.get('https://instagram.com/')
    sleep(2)
    driver.maximize_window()
    sleep(1)
    driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[3]/button').click()
    sleep(3)
    driver.find_element_by_xpath("//button[contains(text(), 'Not Now')]").click()
    driver.get('https://www.instagram.com/explore/tags/{}/'.format("messi"))
    sleep(2)
    links = driver.find_elements_by_tag_name('a')

    def condition(link):
        return '.com/p/' in link.get_attribute('href')

    valid_links = list(filter(condition, links))
    for i in range(5):
        link = valid_links[i].get_attribute('href')
        if link not in links:
            links.append(link)
    print(link)
    for link in links:
        print(link)
        driver.get(link)
Here is the output of printing the first link and then the second one:
#first
https://www.instagram.com/p/CKMR3lMAD8O/
#second
<selenium.webdriver.remote.webelement.WebElement (session="664522e3bb5f9a9527be40d5e34b79d6", element="4a0d1327-fd66-40ff-a622-55da864e9d14")>
Print links and you'll see that your first print is the last element appended to links, while your second print is the first element of links.
I've reduced your code to demonstrate it more clearly:
links = []
for i in range(5):
    link = i
    links.append(i)
print(link)
for link in links:
    print(link)
Output:
4 # < your first print
0 # The follow-up prints from the loop
1
2
3
4
You are reusing links without emptying it first. Look at this line:
links=driver.find_elements_by_tag_name('a')
and later at this line
links.append(link)
So now links contains BOTH types of items (WebElements and href strings). Either make a new list or call links.clear() before your for i loop.
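A minimal sketch of that second suggestion, keeping the WebElements and the extracted href strings in separately named lists. The names anchors and post_urls and the 5-link cap are illustrative; the locators come from the question, and driver is the session created in Start().
# WebElements found on the page stay in their own list...
anchors = driver.find_elements_by_tag_name('a')
valid_links = [a for a in anchors if '.com/p/' in (a.get_attribute('href') or '')]

# ...and the extracted href strings go into a fresh list, so the two never mix
post_urls = []
for elem in valid_links[:5]:
    href = elem.get_attribute('href')
    if href not in post_urls:
        post_urls.append(href)

for url in post_urls:
    print(url)
    driver.get(url)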

StaleElementReferenceException with Selenium WebDriver in Python

I'm writing a crawler using Selenium, Python and PhantomJS to use Google's reverse image search. So far I've successfully been able to upload an image and crawl the search results on the first page. However, when I try to click through the search results navigation, I get a StaleElementReferenceException. I have read about it in many posts, but I still could not implement the solution. Here is the code that breaks:
ele7 = browser.find_element_by_id("nav")
ele5 = ele7.find_elements_by_class_name("fl")
count = 0
for elem in ele5:
    if count <= 2:
        print str(elem.get_attribute("href"))
        elem.click()
        browser.implicitly_wait(20)
        ele6 = browser.find_elements_by_class_name("rc")
        for result in ele6:
            f = result.find_elements_by_class_name("r")
            for line in f:
                link = line.find_elements_by_tag_name("a")[0].get_attribute("href")
                links.append(link)
                parsed_uri = urlparse(link)
                domains.append('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))
        count += 1
The code breaks at print str(elem.get_attribute("href")). How can I solve this?
Thanks in advance.
Clicking a link causes the browser to go to another page, which makes the references to elements on the old page (ele5, elem) invalid.
Modify the code so it does not reference invalid elements.
For example, you can collect the URLs before you visit the other pages:
ele7 = browser.find_element_by_id("nav")
ele5 = ele7.find_elements_by_class_name("fl")
urls = [elem.get_attribute('href') for elem in ele5]  # <-----
browser.implicitly_wait(20)
for url in urls[:2]:  # <------
    print url
    browser.get(url)  # <------ used `browser.get` instead of `click`;
                      # using `element.click` will cause the error.
    ele6 = browser.find_elements_by_class_name("rc")
    for result in ele6:
        f = result.find_elements_by_class_name("r")
        for line in f:
            link = line.find_elements_by_tag_name("a")[0].get_attribute("href")
            links.append(link)
            parsed_uri = urlparse(link)
            domains.append('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))
