Tips on navigating through thousands of web pages and scraping them? - python

I need to scrape data from an HTML table with about 20,000 rows. The table, however, is split across 200 pages with 100 rows per page. The problem is that I need to click on a link in each row to access the necessary data.
I was wondering if anyone had any tips to go about doing this because my current method, shown below, is taking far too long.
The first portion is necessary for navigating through Shibboleth. This part is not my concern, as it only takes around 20 seconds and happens once.
from selenium import webdriver
from selenium.webdriver.support.ui import Select # for <SELECT> HTML form
driver = webdriver.PhantomJS()
# Here I had to select my school among others
driver.get("http://onesearch.uoregon.edu/databases/alphabetical")
driver.find_element_by_link_text("Foundation Directory Online Professional").click()
driver.find_element_by_partial_link_text('Login with your').click()
# We are now on the login in page where we shall input the information.
driver.find_element_by_name('j_username').send_keys("blahblah")
driver.find_element_by_name('j_password').send_keys("blahblah")
driver.find_element_by_id('login_box_container').submit()
# Select the Search Grantmakers by I.D.
print driver.current_url
driver.implicitly_wait(5)
driver.maximize_window()
driver.find_element_by_xpath("/html/body/header/div/div[2]/nav/ul/li[2]/a").click()
driver.find_element_by_xpath("//input[@id='name']").send_keys("family")
driver.find_element_by_xpath("//input[@id='name']").submit()
This is the part that is taking too long. The scraping part is not included in this code.
# Now I need to get the page source for each of the 20299 links... :(
list_of_links = driver.find_elements_by_css_selector("a[class='profile-gate-check search-result-link']")
# Hold the links in a list instead of the driver.
list_of_linktext = []
for link in list_of_links:
    list_of_linktext.append(link.text)
# This is the actual loop that clicks on each link on the page.
for linktext in list_of_linktext:
    driver.find_element_by_link_text(linktext).click()
    driver.implicitly_wait(5)
    print driver.current_url
    driver.back()
    driver.implicitly_wait(5)  # Waits to make sure that the page is reached.
Navigating 1 out of the 200 pages takes about 15 minutes. Is there a better way to do this?
I tried using an explicit wait instead of an implicit wait.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

for linktext in list_of_linktext:
    # explicit wait
    WebDriverWait(driver, 2).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "a[class='profile-gate-check search-result-link']"))
    )
    driver.find_element_by_link_text(linktext).click()
    print driver.current_url
    driver.back()
The problem, however, still persists, with an average of about 5 seconds spent on each page.

For screen scraping, I normally steer clear of Selenium altogether. There are faster, more reliable ways to scrape data from a website.
If you're using Python, you might give BeautifulSoup a try. It seems very similar to other site-scraping tools I've used in the past for other languages (most notably JSoup and NSoup).
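If the row links resolve to ordinary URLs, something along these lines may be worth trying; this is only a sketch, and the results URL, the page parameter, and the idea of reusing the Selenium session's cookies are assumptions about the site, not taken from it.
import requests
from bs4 import BeautifulSoup

# Reuse the cookies from the already-authenticated Selenium session
# (assumption: the site accepts plain cookie-based requests after the Shibboleth login).
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

# Hypothetical results-page URL; the CSS selector mirrors the one used in the question.
response = session.get("https://example.com/search-results?page=1")
soup = BeautifulSoup(response.text, "html.parser")

# Collect the detail-page URLs instead of clicking each link in the browser.
detail_urls = [a["href"] for a in soup.select("a.profile-gate-check.search-result-link")]

for url in detail_urls:
    detail_soup = BeautifulSoup(session.get(url).text, "html.parser")
    # ...parse the fields you need from detail_soup here...
Fetching the pages this way avoids the click/back round trips entirely, which is usually where the time goes.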

Related

SELENIUM (Python): How to retrieve the URL to which an element redirects me (opens a new tab) after clicking? Element has <a> tag but no href

I am trying to scrape a website with product listings that, when clicked, redirect the user to a new tab with further information and contact-the-seller details. I am trying to retrieve said URL without actually having to click on each listing in the catalog and wait for the page to load, as this would take a lot of time.
I have searched in the web inspector for the "href", but the only link available is to the image source of each listing. However, I noticed that after clicking each element a GET request gets sent, and this is the URL (https://api.wallapop.com/api/v3/items/v6g2v4y045ze?language=es). It contains pretty much all the information I need; I'm not sure if it's of any use, but it's the furthest I've gotten.
UPDATE: I tried the code I was suggested (with modifications to specifically find the 'href' attributes in the clickable elements), but I get None returned. I have been looking into finding an 'onclick' attribute or something similar that might have what I'm looking for, but so far it looks like the solution will end up being clicking each element and extracting all the information from there.
elements123 = driver.find_elements(By.XPATH, '//a[contains(@class,"ItemCardList__item")]')
for e in elements123:
    print(e.get_attribute('href'))
I appreciate any insights, thank you in advance.
You need something like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://google.com")
# Get all the elements available with tag name 'a'
elements = driver.find_elements(By.TAG_NAME, 'a')
for e in elements:
    print(e.get_attribute('href'))
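If href keeps coming back as None for these cards, a quick follow-up is to dump a few candidate attributes and see which one actually carries the target; the 'onclick' and 'data-*' names below are guesses for illustration, not attributes confirmed to exist in Wallapop's markup.
from selenium.webdriver.common.by import By

cards = driver.find_elements(By.XPATH, '//a[contains(@class, "ItemCardList__item")]')
for card in cards[:5]:  # inspect a handful first
    # Print whichever of these attributes the element actually has (missing ones return None).
    for attr in ("href", "onclick", "data-href", "data-item-id"):
        print(attr, "=", card.get_attribute(attr))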

How to visit multiple links with selenium

I'm trying to visit multiple links from one page, and then go back to the same page.
links = driver.find_elements(By.CSS_SELECTOR,'a')
for link in links:
    link.click()  # visit page
    # scrape page
    driver.back()  # get back to previous page, and click the next link in next iteration
The code says it all
When you navigate to another page, all the web elements collected by Selenium (they are actually references to physical web elements) become invalid, since the web page is rebuilt when you open it again.
To make your code work you need to collect the links list again each time.
This should work:
import time
links = driver.find_elements(By.CSS_SELECTOR,'a')
for i in range(len(links)):
    links[i].click()  # visit page
    # scrape page
    driver.back()  # get back to the previous page, and click the next link in the next iteration
    time.sleep(1)  # add a delay to make sure the main page has loaded
    links = driver.find_elements(By.CSS_SELECTOR, 'a')  # collect the links again on the main page
Also make sure all the a elements on that page are relevant links, since this may not be the case; see the sketch below for one way to filter them first.
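One way to keep only real links, and to sidestep stale element references at the same time, is to collect the href values up front and navigate to them directly. This is just a sketch of that idea, not a drop-in replacement for the click-based loop above:
from selenium.webdriver.common.by import By

# Collect the href strings first; plain strings never go stale.
hrefs = [a.get_attribute('href') for a in driver.find_elements(By.CSS_SELECTOR, 'a')]
# Keep only anchors that actually point somewhere (drop None, fragments, javascript: links).
hrefs = [h for h in hrefs if h and h.startswith('http')]

for href in hrefs:
    driver.get(href)  # visit the page directly; no back() needed
    # scrape page here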
The logic in your code should work; however, you might want to add a sleep between certain actions. It makes a difference when scraping.
import time
and then add time.sleep(seconds) where it matters.

Click links automatically and scrape

I am trying to extract all the product data from this page:
https://www.shufersal.co.il/online/he/קטגוריות/סופרמרקט/חטיפים%2C-מתוקים-ודגני-בוקר/c/A25
I tried
shufersal = "https://www.shufersal.co.il/online/he/%D7%A7%D7%98%D7%92%D7%95%D7%A8%D7%99%D7%95%D7%AA/%D7%A1%D7%95%D7%A4%D7%A8%D7%9E%D7%A8%D7%A7%D7%98/%D7%97%D7%98%D7%99%D7%A4%D7%99%D7%9D%2C-%D7%9E%D7%AA%D7%95%D7%A7%D7%99%D7%9D-%D7%95%D7%93%D7%92%D7%A0%D7%99-%D7%91%D7%95%D7%A7%D7%A8/c/A25"
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
import time
driver.get(shufersal)
products = driver.find_elements_by_css_selector("li.miglog-prod.miglog-sellingmethod-by_unit")
The problem is that the product details are shown only when I click the product:
Is there any option to click all the links automatically and scrape the windows that open?
What you want can be achieved, but it is a significantly more time-consuming process.
You'll have to:
identify the elements that you need to click in the page (they'll probably have the same class) and then select them all:
buttons_to_click = driver.find_elements_by_css_selector({RELEVANT SELECTOR HERE})
Then you should loop through all clickable elements, click on them, wait for the pop up to load, scrape the data, close the popup:
scraped_list = []
for button_instance in buttons_to_click:
    button_instance.click()
    # scrape the information you need here and append to scraped_list
    # find the close popup button
    driver.find_element_by_xpath({XPATH TO ELEMENT}).click()
For this to work properly it is important to set up Selenium's implicit wait parameter. What that does is: if Selenium doesn't find the required element, it will wait up to X seconds for it to load; if X seconds pass, it will throw an error (you can handle the error in your code if it is expected).
In your case you'll need the wait time because after you click on the product to display the popup, the information might take a few seconds to load; if you don't set an implicit wait, your script will exit with an element-not-found error. More information on Selenium's wait parameters can be found here: https://selenium-python.readthedocs.io/waits.html
# put this line immediately after creating the driver object
driver.implicitly_wait(10) # seconds
Suggestion:
I suggest you always use XPath when looking up elements: its syntax can emulate all the other Selenium selectors, it is faster, and it will make for an easier transition to C-compiled HTML parsers, which you'll need in case you scale your scraper (I recommend lxml, a Python package that uses a compiled parser).
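As a rough sketch of that hand-off, you can feed Selenium's page source to lxml and do the parsing with its compiled XPath engine; the class names in the XPath expressions below are hypothetical placeholders, not taken from the actual site.
from lxml import html

# Parse the page Selenium has already rendered with lxml's compiled parser.
tree = html.fromstring(driver.page_source)

# Hypothetical XPath; replace the class names with the real popup/product container.
for product in tree.xpath('//div[contains(@class, "product-details")]'):
    name = product.xpath('.//h1/text()')
    price = product.xpath('.//span[contains(@class, "price")]/text()')
    print(name, price)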

Python-Selenium page scraping is not working properly

I am simply trying to open a web page through the Selenium webdriver, click a button on it, interact with some elements on the second page, and so on.
I heard that Selenium with Python is best for this purpose, so I wrote my code with it, and at first it worked very well. But gradually, day after day, the code that was working absolutely fine before just stopped working: it stopped interacting with page elements and throws different errors every time. I am sick of this Selenium behavior. Does anyone know why this happens? Or can you suggest any good alternatives?
import random
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(50)
cookie = driver.find_elements_by_xpath("//*[contains(text(), 'Decline')]")
cookie[0].click()
buttons = driver.find_elements_by_xpath("//button[contains(text(), 'Search')]")
buttons[0].click()
driver.implicitly_wait(50)
close = driver.find_elements_by_css_selector("button.close")
close[0].click()
parent = driver.find_elements_by_class_name("job-info")
for link in parent[:19]:
    links = link.find_elements_by_tag_name('a')
    hyperlink = random.choice(links)
    driver.implicitly_wait(150)
    driver.find_element_by_link_text(hyperlink.text).click()
driver.close()

Selenium download entire html

I have been trying to use Selenium to scrape an entire web page. I expect at least a handful of the pages are SPAs built with Angular, React, or Vue, which is why I am using Selenium.
I need to download the entire page (if some content isn't loaded from lazy loading because of not scrolling down, that is fine). I have tried setting a time.sleep() delay, but that has not worked. After I get the page I am looking to hash it and store it in a db to compare later and check whether the content has changed. Currently the hash is different every time, and that is because Selenium is not downloading the entire page; each time a different partial amount is missing. I have confirmed this on several web pages, not just a single one.
I also had to go through probably 1000+ web pages by hand just to get all the links, so I do not have time to find an element on each of them to make sure it is loaded.
How long this process takes is not important. If it takes 1+ hours so be it, speed is not important only accuracy.
If you have an alternative idea please also share.
My driver declaration
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
driverPath = '/usr/lib/chromium-browser/chromedriver'
def create_web_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    # set the window size
    options.add_argument('window-size=1200x600')
    # try to initialize the driver
    try:
        driver = webdriver.Chrome(executable_path=driverPath, chrome_options=options)
    except WebDriverException:
        print("failed to start driver at path: " + driverPath)
    return driver
My URL call, with timeout = 20:
driver.get(url)
time.sleep(timeout)
content = driver.page_source
content = content.encode('utf-8')
hashed_content = hashlib.sha512(content).hexdigest()
^ I get a different hash here every time, since the same URL is not producing the same page source
As the Application Under Test (AUT) is based on Angular, React, or Vue, Selenium seems to be the perfect choice.
Now, the fact that you are fine with some content not being loaded (lazy loading that never triggers because you don't scroll) makes the use case feasible. But "...do not have time to find an element on them to make sure it is loaded..." can't really be compensated for by inducing time.sleep(), as time.sleep() has certain drawbacks. You can find a detailed discussion in How to sleep webdriver in python for milliseconds. It is worth mentioning that the state of the HTML DOM will be different for each of the 1000-odd web pages.
Solution
A couple of viable solutions:
A potential solution is to induce WebDriverWait and ensure that some HTML elements are loaded, as per the discussion How can I make sure if some HTML elements are loaded for Selenium + Python?, validating at least one of the following (see the sketch after this list):
Page Title
Page Heading
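A minimal sketch of that check, waiting for a non-empty document title and for a top-level heading; the assumption that every page renders an <h1> is mine, not something stated in the question:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver.get(url)
wait = WebDriverWait(driver, 30)
# Wait until the page reports a non-empty title...
wait.until(lambda d: d.title.strip() != "")
# ...and until a top-level heading is present (assumes each page has an <h1>).
wait.until(EC.presence_of_element_located((By.TAG_NAME, "h1")))
content = driver.page_source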
Another solution would be to tweak the capability pageLoadStrategy. You can set the pageLoadStrategy for all the 1000-odd web pages to a common point, assigning one of the following values:
normal (full page load)
eager (interactive)
none
You can find a detailed discussion in How to make Selenium not wait till full page load, which has a slow script?
If you implement pageLoadStrategy, the page_source method will be triggered at the same tripping point, and you would possibly see identical hashed_content.
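A sketch of setting that capability through DesiredCapabilities, in the Selenium 3 style used elsewhere in this thread ('eager' is chosen here purely as an example):
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

capabilities = DesiredCapabilities.CHROME.copy()
# Allowed values: 'normal' (full load), 'eager' (DOM ready), 'none'.
capabilities["pageLoadStrategy"] = "eager"

driver = webdriver.Chrome(desired_capabilities=capabilities)
driver.get(url)
content = driver.page_source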
In my experience time.sleep() does not work well with dynamic loading times.
If the page is javascript-heavy you have to use the WebDriverWait clause.
Something like this:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get(url)
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "[my-attribute='my-value']")))
Change 10 to whatever timeout you want, and By.CSS_SELECTOR and its value to whatever type you want to use as a reference for a locator.
You can also wrap the WebDriverWait in a try/except block catching the TimeoutException exception, which you can import from the submodule selenium.common.exceptions, in case you want to set a hard limit.
You can probably put it inside a while loop if you truly want it to keep checking until the page is loaded, because I couldn't find any reference in the docs about waiting "forever", but you'll have to experiment with it.
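A sketch of that pattern, retrying the wait inside a loop and catching TimeoutException; the three-attempt cap and the refresh between attempts are arbitrary choices, not something from the docs:
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

loaded = False
attempts = 0
while not loaded and attempts < 3:  # arbitrary hard limit of three attempts
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "[my-attribute='my-value']"))
        )
        loaded = True
    except TimeoutException:
        attempts += 1
        driver.refresh()  # reload before waiting again
print("page loaded" if loaded else "gave up after three attempts")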
