How to visit multiple links with Selenium - Python

I'm trying to visit multiple links from one page, and then go back to the same page.
links = driver.find_elements(By.CSS_SELECTOR, 'a')
for link in links:
    link.click()  # visit page
    # scrape page
    driver.back()  # get back to the previous page, and click the next link in the next iteration
The code says it all.

When you navigate to another page, all the web elements Selenium collected (they are actually references to physical elements on the page) become invalid, since the page is rebuilt when you open it again.
To make your code work, you need to collect the list of links again each time.
This should work:
import time

links = driver.find_elements(By.CSS_SELECTOR, 'a')
for i in range(len(links)):
    links[i].click()  # visit page
    # scrape page
    driver.back()  # get back to the previous page
    time.sleep(1)  # add a delay so the main page can load
    links = driver.find_elements(By.CSS_SELECTOR, 'a')  # collect the links again on the main page
Also make sure all the a elements on that page are actually relevant links, since that may not be the case.
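Alternatively, if re-collecting elements on every iteration feels fragile, you can collect the href values up front and navigate with driver.get(); plain strings never go stale. A rough sketch (it assumes the anchors carry absolute href values):
from selenium.webdriver.common.by import By

# Collect the URLs up front; strings cannot become stale references.
links = driver.find_elements(By.CSS_SELECTOR, 'a')
urls = [link.get_attribute('href') for link in links]

for url in urls:
    if url:  # some anchors have no href
        driver.get(url)  # visit the page
        # scrape page here; no driver.back() needed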

The logic in your code should work; however, you might want to add a sleep between certain actions. It makes a difference when scraping.
import time
and then add time.sleep(seconds) where it matters.
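For example (a sketch of the loop body from the question above; the one-second delays are arbitrary and need tuning per site):
import time

link.click()
time.sleep(1)  # give the target page time to render before scraping
# scrape page
driver.back()
time.sleep(1)  # give the original page time to rebuild before the next click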

Related

How to do a loop on a dynamic href link with Selenium in Python?

I would like to loop over a dynamic href. I download a set of files per page: on each page I download 100 text files, but I have to download 200,000 files in total, so I have to click the next button 2,000 times. To do this, I got the href of the next button, but unfortunately two parts of the link change: the page number (1, 2, 3, etc.) and a string of characters. Please see the attached samples of the next-button URL that changes.
https://search.proquest.com/something/E6981FD6D11F45E8PQ/2?accountid=12543#scrollTo
https://search.proquest.com/something/E6981FD6D11F45E8PQ/3?accountid=12543#scrollTo
https://search.proquest.com/something/61C27022597C4092PQ/4?accountid=12543#scrollTo
https://search.proquest.com/something/E431552DC6554BF7PQ/5?accountid=12543#scrollTo
I'm a new Python user; my level is basic.
# Before this comes the Selenium setup for scraping.
n = 2000
for i in range(1, n + 1):  # pages 1 through 2000
    href = "https://search.proquest.com/something/715376F5A5AF44BBPQ/" + str(i) + "?accountid=12543#scrollTo"
    driver.get(href)
    # Here I add the code that downloads the files on each page.
The sample link is unavailable for me (I cannot sign up).
First: what is the "string of characters"? A book number, or a category number?
If it is just a random string, I think you should find another way.
How about using ActionChains, or driver.execute_script()?
First of all, in my opinion, finding the meaning of that string (from the .js or .html) is more important.
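If the string turns out to be meaningless, one workaround is to stop building URLs by hand and instead read the next button's href fresh on every page. A rough sketch (the XPath is an assumption, since I cannot see the page):
from selenium.common.exceptions import NoSuchElementException

while True:
    # ... download the files on the current page ...
    try:
        next_link = driver.find_element_by_xpath("//a[@title='Page suivante']")
    except NoSuchElementException:
        break  # no next button left, so this was the last page
    driver.get(next_link.get_attribute("href"))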
@나민오 I need help identifying the XPath for my next-page button. My goal is to loop through pages in Python Selenium. Please find below the code of the next-page button after inspecting the URL page in this picture.
[picture: next page button after inspecting]
I tried to write the following code in Python with Selenium to download the files page by page.
import time
from selenium.common.exceptions import NoSuchElementException

while True:
    scraping()  # here I call my function that downloads the files on the current page
    try:
        # Check if there are more pages with links
        next_link = driver.find_element_by_xpath("//*[@title='Page suivante']")
        driver.execute_script("arguments[0].scrollIntoView();", next_link)
        next_link.click()
        time.sleep(20)  # wait for the next page to load
    except NoSuchElementException:
        break  # no more pages

Scrape Glassdoor for multiple pages using Python lxml

I'm using the following script to scrape job listings via Glassdoor. The script below only scrapes the first page. I was wondering how I might extend it so that it scrapes from page 1 up to the last page.
https://www.scrapehero.com/how-to-scrape-job-listings-from-glassdoor-using-python-and-lxml/
I'd greatly appreciate any help
I'll provide a more general answer. When scraping, to reach the next page, simply grab the link on the current page that points to it.
In the case of Glassdoor, your page links all have the page class and the next page is accessed by clicking an li button with class next. Your XPath then becomes:
//li[@class="next"]
You can then access it with:
element = document.xpath("//li[@class='next']")
We are specifically looking for the link, so we can add the a element to our XPath:
//li[@class="next"]//a
And further specify that we just need the href attribute:
//li[@class="next"]//a/@href
And now you can access the link with:
link = document.xpath('//li[@class="next"]//a/@href')
Tested and working on Glassdoor as of 2/9/18.
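Put together, the whole pagination loop looks roughly like this (a sketch only: the starting URL and the request header are assumptions, and Glassdoor's markup may have changed):
import requests
from lxml import html

url = "https://www.glassdoor.com/Job/jobs.htm"  # assumed starting page
while url:
    page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    document = html.fromstring(page.content)
    # ... scrape the job listings on this page ...
    next_links = document.xpath('//li[@class="next"]//a/@href')
    # xpath() returns a list; stop when there is no next link
    url = requests.compat.urljoin(url, next_links[0]) if next_links else None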

How to scrape a dynamic JavaScript website

I've been trying to scrape the website below but am having some problems. I cannot find out how they build the list of empresas (in English: companies) that they show. When I select a category and submit the form, the URL doesn't change. I've tried to look at the requests, but with no success (not a web developer here).
http://www.vitrinedoexportador.gov.br
I first tried to go through all the links on the page. The first approach I tried was brute-forcing all the URLs, which have this syntax:
"http://www.vitrinedoexportador.gov.br/bens/ve/br/detalhes/index/cdEmpresa/" + 6-digit code + "#inicio"
But I think that trying all 999999 possibilities would be the wrong way to approach the problem.
The next approach I'm trying is navigating through the pages using Selenium WebDriver, with the code below:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
browser.get('http://www.vitrinedoexportador.gov.br/bens/ve/br#a')
# navigate to the page
select = Select(browser.find_element_by_id('cdSetor'))
print(select.options)
for opt in select.options:
    print(opt.text)
    opt.click()
    if opt.text != 'Escolha':
        opt.submit()
        time.sleep(5)  # needed so the page can load
        listaEmpresas = browser.find_elements_by_tag_name("h6")
        for link in listaEmpresas:
            print(link)
        print(listaEmpresas)
        listaEmpresas[0].click()
But this seems incredibly slow, and I could still only get one company. Is there a smarter way to do this?
Another approach I've tried is Scrapy: I can already parse an entire company page with all the fields that I want, so if you can help me find a way to get all the IDs, I can parse them in my already-built Scrapy project.
Thank you.
I've done something very similar to this already, and there is no super easy way. There is usually no list with all the companies, because that belongs to the backend. You have to use the frontend to navigate to a page where you can build a loop to scrape what you want.
For example: I clicked the main URL, then I changed the filter 'Valor da empresa', which has only five options. I chose the first, which gave me 3436 companies. Now it depends on whether you want to scrape the details of each company or only the main info, like telephone, CEP, and address, which is already on this page. If you want details, you have to build a loop that clicks each link, scrapes the company page, goes back to the search, and clicks the next link. If you only need the main information, you can already get that on the search page by grabbing class=resultitem with BeautifulSoup and looping through the data on the first page.
In any case, the next step (after all links on the first page are scraped) is pressing the second-page button and doing it again.
After you scrape all 3436 companies of the first filter, do it again for the other four filters, and you will have all the companies.
You can use other filters, but they have many options, and to cover all companies you would have to go through all of them, which is more work.
Hope that helps!
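A rough sketch of the "main info only" path (the resultitem class comes from the search page as described above; the URL and everything else are assumptions):
import requests
from bs4 import BeautifulSoup

# assumed: the search-results URL after applying the 'Valor da empresa' filter
page = requests.get("http://www.vitrinedoexportador.gov.br/bens/ve/br")
soup = BeautifulSoup(page.content, "html.parser")

for result in soup.find_all(class_="resultitem"):
    # each result block holds one company's main info (tel, CEP, address)
    print(result.get_text(" ", strip=True))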

Scraping a paginated website with a fixed URL for every page (Python)

I am not familiar with HTML, but I will do my best to explain what I need. I am scraping a website that is paginated. If one goes to the bottom, one finds a 'Siguiente' ('next' in Spanish) button that leads to the next page. When doing so, the URL remains unchanged. Is there any way to tell Python to open the next page?
I want to do this:
open the website (already done),
do something with the info there,
go to the next page (having trouble because there is no URL for the next page),
repeat...
Thanks for your help.
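A minimal sketch of the usual Selenium approach to this, clicking the button rather than changing the URL (the link text 'Siguiente' comes from the question; the URL, the element-lookup strategy, and the delay are assumptions):
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Firefox()
driver.get("http://example.com")  # assumed: the paginated site

while True:
    # ... do something with the info on the current page ...
    try:
        driver.find_element_by_link_text("Siguiente").click()
    except NoSuchElementException:
        break  # no 'Siguiente' button left: last page reached
    time.sleep(2)  # assumed delay to let the next page load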

Tips on navigating through thousands of web pages and scraping them?

I need to scrape data from an HTML table with about 20,000 rows. The table, however, is separated into 200 pages with 100 rows on each page. The problem is that I need to click on a link in each row to access the necessary data.
I was wondering if anyone had any tips for going about this, because my current method, shown below, is taking far too long.
The first portion is necessary for navigating through Shibboleth. This part is not my concern, as it only takes around 20 seconds and happens once.
from selenium import webdriver
from selenium.webdriver.support.ui import Select # for <SELECT> HTML form
driver = webdriver.PhantomJS()
# Here I had to select my school among others
driver.get("http://onesearch.uoregon.edu/databases/alphabetical")
driver.find_element_by_link_text("Foundation Directory Online Professional").click()
driver.find_element_by_partial_link_text('Login with your').click()
# We are now on the login in page where we shall input the information.
driver.find_element_by_name('j_username').send_keys("blahblah")
driver.find_element_by_name('j_password').send_keys("blahblah")
driver.find_element_by_id('login_box_container').submit()
# Select the Search Grantmakers by I.D.
print(driver.current_url)
driver.implicitly_wait(5)
driver.maximize_window()
driver.find_element_by_xpath("/html/body/header/div/div[2]/nav/ul/li[2]/a").click()
driver.find_element_by_xpath("//input[#id='name']").send_keys("family")
driver.find_element_by_xpath("//input[#id='name']").submit()
This is the part that is taking too long. The scraping part is not included in this code.
# Now I need to get the page source for each link of 20299 pages... :(
list_of_links = driver.find_elements_by_css_selector("a[class='profile-gate-check search-result-link']")
# Hold the link texts in a list instead of the driver references.
list_of_linktext = []
for link in list_of_links:
    list_of_linktext.append(link.text)
# This is the actual loop that clicks on each link on the page.
for linktext in list_of_linktext:
    driver.find_element_by_link_text(linktext).click()
    driver.implicitly_wait(5)
    print(driver.current_url)
    driver.back()
    driver.implicitly_wait(5)  # waits to make sure that the page is reached
Navigating 1 out of the 200 pages takes about 15 minutes. Is there a better way to do this?
I tried using an explicit wait instead of an implicit wait.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for linktext in list_of_linktext:
    # explicit wait
    WebDriverWait(driver, 2).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "a[class='profile-gate-check search-result-link']"))
    )
    driver.find_element_by_link_text(linktext).click()
    print(driver.current_url)
    driver.back()
The problem, however, persists, with an average wait of about 5 seconds before each page.
For screen scraping, I normally steer clear of Selenium altogether. There are faster, more reliable ways to scrape data from a website.
If you're using Python, you might give BeautifulSoup a try. It seems very similar to other site-scraping tools I've used in the past for other languages (most notably JSoup and NSoup).
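As a rough sketch of what that could look like for the loop above (the URL is an assumption, and the real site sits behind a Shibboleth login that requests would also have to handle):
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps login cookies across requests
page = session.get("https://example.com/search?name=family")  # assumed URL
soup = BeautifulSoup(page.content, "html.parser")

# collect all profile links from the page source instead of clicking each one
for a in soup.select("a.profile-gate-check.search-result-link"):
    profile = session.get(requests.compat.urljoin(page.url, a["href"]))
    # ... parse profile.content here, no driver.back() round-trips needed ...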
