Can't get all the necessary links from a web page via Selenium - Python

I'm currently trying to use some automation for a patent-searching task. I'd like to get all the links corresponding to the search query results. In particular, I'm interested in Apple patents from the year 2015 onwards. Here is the code:
import selenium
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.by import By
new_driver_path = r"C:/Users/alexe/Desktop/Apple/PatentSearch/geckodriver-v0.30.0-win64/geckodriver.exe"
ops = options()
serv = Service(new_driver_path)
browser1 = selenium.webdriver.Firefox(service=serv, options=ops)
browser1.get("https://patents.google.com/?assignee=apple&after=priority:20150101&sort=new")
elements = browser1.find_elements(By.CLASS_NAME, "search-result-item")
links = []
for elem in elements:
    href = elem.get_attribute('href')
    if href:
        links.append(href)
links = set(links)
for href in links:
print(href)
And the output is the following:
https://patentimages.storage.googleapis.com/ed/06/50/67e30960a7f68d/JP2021152951A.pdf
https://patentimages.storage.googleapis.com/86/30/47/7bc39ddf0e1ea7/KR20210106968A.pdf
https://patentimages.storage.googleapis.com/ca/2a/bc/9380e1657c2767/US20210318798A1.pdf
https://patentimages.storage.googleapis.com/c1/1a/c6/024f785fd5ea10/AU2021204695A1.pdf
https://patentimages.storage.googleapis.com/b3/19/cc/8dc1fae714194f/US20210312694A1.pdf
https://patentimages.storage.googleapis.com/e6/16/c0/292a198e6f1197/AU2021218193A1.pdf
https://patentimages.storage.googleapis.com/3e/77/e0/b59cf47c2b30a1/AU2021212005A1.pdf
https://patentimages.storage.googleapis.com/1b/3d/c2/ad77a8c9724fbc/AU2021204422A1.pdf
https://patentimages.storage.googleapis.com/ad/bc/0f/d1fcc65e53963e/US20210314041A1.pdf
The problem here is that I've got one missing link:
[screenshot: the result item and the missing link]
So I've tried different selectors and still got the same result: one link is missing. I've also tried searching with different parameters, and the pattern is this: all the missing links are the ones without a PDF output. I've spent a lot of time trying to figure out the reason, so I would be really grateful if you could provide me with any clue on the matter. Thanks in advance!

The highlighted option has no a tag with class pdfLink in it. Put the line of code that extracts the link in a try block; if the required element is not found, fall back to whatever a tag is available for that article.
Try like below:
driver.get("https://patents.google.com/?assignee=apple&after=priority:20150101&sort=new")
articles = driver.find_elements_by_tag_name("article")
print(len(articles))
for article in articles:
    try:
        # Use a dot in the XPath to find an element within an element.
        link = article.find_element_by_xpath(".//a[contains(@class,'pdfLink')]").get_attribute("href")
        print(link)
    except:
        print("Exception")
        link = article.find_element_by_xpath(".//a").get_attribute("href")
        print(link)
10
https://patentimages.storage.googleapis.com/86/30/47/7bc39ddf0e1ea7/KR20210106968A.pdf
https://patentimages.storage.googleapis.com/e6/16/c0/292a198e6f1197/AU2021218193A1.pdf
https://patentimages.storage.googleapis.com/3e/77/e0/b59cf47c2b30a1/AU2021212005A1.pdf
https://patentimages.storage.googleapis.com/c1/1a/c6/024f785fd5ea10/AU2021204695A1.pdf
https://patentimages.storage.googleapis.com/1b/3d/c2/ad77a8c9724fbc/AU2021204422A1.pdf
https://patentimages.storage.googleapis.com/ca/2a/bc/9380e1657c2767/US20210318798A1.pdf
Exception
https://patents.google.com/?assignee=apple&after=priority:20150101&sort=new#
https://patentimages.storage.googleapis.com/b3/19/cc/8dc1fae714194f/US20210312694A1.pdf
https://patentimages.storage.googleapis.com/ed/06/50/67e30960a7f68d/JP2021152951A.pdf
https://patentimages.storage.googleapis.com/ad/bc/0f/d1fcc65e53963e/US20210314041A1.pdf

To extract all the href attributes of the PDFs using Selenium and Python you have to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.search-result-item[href]")))])
Using XPATH:
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[contains(@class, 'search-result-item') and @href]")))])
Console Output:
['https://patentimages.storage.googleapis.com/86/30/47/7bc39ddf0e1ea7/KR20210106968A.pdf', 'https://patentimages.storage.googleapis.com/e6/16/c0/292a198e6f1197/AU2021218193A1.pdf', 'https://patentimages.storage.googleapis.com/3e/77/e0/b59cf47c2b30a1/AU2021212005A1.pdf', 'https://patentimages.storage.googleapis.com/c1/1a/c6/024f785fd5ea10/AU2021204695A1.pdf', 'https://patentimages.storage.googleapis.com/1b/3d/c2/ad77a8c9724fbc/AU2021204422A1.pdf', 'https://patentimages.storage.googleapis.com/ca/2a/bc/9380e1657c2767/US20210318798A1.pdf', 'https://patentimages.storage.googleapis.com/b3/19/cc/8dc1fae714194f/US20210312694A1.pdf', 'https://patentimages.storage.googleapis.com/ed/06/50/67e30960a7f68d/JP2021152951A.pdf', 'https://patentimages.storage.googleapis.com/ad/bc/0f/d1fcc65e53963e/US20210314041A1.pdf']
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
PS: You can extract only nine (9) href attributes, as one of the search items is a <span> element and isn't a link, i.e. it doesn't have the href attribute.
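Combining the two answers above, a minimal sketch that also captures the odd result out by falling back to any anchor inside each article (this assumes the article / a.pdfLink markup shown earlier, and reuses browser1 from the question):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

articles = WebDriverWait(browser1, 20).until(EC.visibility_of_all_elements_located((By.TAG_NAME, "article")))
links = []
for article in articles:
    # Prefer the PDF link; fall back to any anchor when there is none.
    anchors = article.find_elements(By.CSS_SELECTOR, "a.pdfLink[href]")
    if not anchors:
        anchors = article.find_elements(By.CSS_SELECTOR, "a[href]")
    if anchors:
        links.append(anchors[0].get_attribute("href"))
print(links)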


How to grab URL in "View Deal" and price for deal from kayak.com using BeautifulSoup

I have a list of Kayak URLs and I'd like to grab the price and the link in "View Deal" for the "Best" and "Cheapest" HTML cards, essentially the first two results, since I've already sorted the results in the URLs (here's an example of a URL).
I can't manage to grab these bits of data using BeautifulSoup and I could use some help! Here's what I've tried for pulling the price info, but I'm getting an empty prices_list variable. Below is a screenshot of what exactly I'd like to pull from the website.
url = 'https://www.kayak.com/flights/AMS-WMI,nearby/2023-02-15/WMI-SOF,nearby/2023-02-18/SOF-BEG,nearby/2023-02-20/BEG-MIL,nearby/2023-02-23/MIL-AMS,nearby/2023-02-25/?sort=bestflight_a'
requests = 0
chrome_options = webdriver.ChromeOptions()
agents = ["Firefox/66.0.3","Chrome/73.0.3683.68","Edge/16.16299"]
print("User agent: " + agents[(requests%len(agents))])
chrome_options.add_argument('--user-agent=' + agents[(requests%len(agents))] + '"')
chrome_options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome('/Users/etc./etc.')
driver.implicitly_wait(10)
driver.get(url)
# getting the prices
sleep(randint(8,10))
xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
prices = driver.find_elements_by_xpath(xp_prices)
prices_list = [price.text.replace('$','') for price in prices if price.text != '']
prices_list = list(map(int, prices_list))
There are two problems here with the XPath locator:
The a element class name is not booking-link but "booking-link ", with a trailing space.
Your locator also matches irrelevant (invisible) duplicate elements.
The following locator works:
"//div[#class='above-button']//a[contains(#class,'booking-link')]/span[#class='price option-text']"
So, the relevant code line could be:
xp_prices = "//div[#class='above-button']//a[contains(#class,'booking-link')]/span[#class='price option-text']"
To extract the prices from View Deal for the Best and Cheapest section within the website you have to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following locator strategies:
From the Best section:
driver.get("https://www.kayak.com/flights/AMS-WMI,nearby/2023-02-15/WMI-SOF,nearby/2023-02-18/SOF-BEG,nearby/2023-02-20/BEG-MIL,nearby/2023-02-23/MIL-AMS,nearby/2023-02-25/?sort=bestflight_a")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Best']//following::div[contains(@class, 'bottom-booking')]//a//div[contains(@class, 'price-text')]"))).text)
Console output:
$807
From the Cheapest section:
driver.get("https://www.kayak.com/flights/AMS-WMI,nearby/2023-02-15/WMI-SOF,nearby/2023-02-18/SOF-BEG,nearby/2023-02-20/BEG-MIL,nearby/2023-02-23/MIL-AMS,nearby/2023-02-25/?sort=bestflight_a")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Cheapest']//following::div[contains(@class, 'bottom-booking')]//a//div[contains(@class, 'price-text')]"))).text)
Console output:
$410
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

How to use XPath to scrape javascript website values

I'm trying to scrape (in Python) the savings interest rate from this website using the value's XPath variable.
I've tried everything: BeautifulSoup, Selenium, etree, etc. I've been able to scrape a few other websites successfully. However, this site and many others are giving me fits. I'd love a solution that can scrape info from several sites regardless of their formatting, using XPath variables.
My current attempt:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
service = Service(executable_path="/chromedriver")
options = Options()
options.add_argument('--incognito')
options.headless = True
driver = webdriver.Chrome(service=service, options=options)
url = 'https://www.americanexpress.com/en-us/banking/online-savings/account/'
driver.get(url)
element = driver.find_element(By.XPATH, '//*[@id="hysa-apy-2"]')
print(element.text)
if element.text == "":
    print("Error: Element text is empty")
driver.quit()
The interest rates are written inside span elements. All span elements which contain interest rates share the same class heading-6. But bear in mind, the result returns two span elements for each interest rate, each element for a different viewport.
The xpath selector:
'//span[@class="heading-6"]'
You can also get elements by containing text APY:
'//span[contains(., "APY")]'
But this selector looks for all span elements in the DOM that contain the word APY.
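A minimal sketch using the class-based selector above (assuming the driver from the question has already loaded the page, before driver.quit()); the hidden viewport duplicates return an empty .text, so filter them out:
from selenium.webdriver.common.by import By

spans = driver.find_elements(By.XPATH, '//span[@class="heading-6"]')
# Visible elements only; hidden duplicates yield empty text.
rates = [span.text for span in spans if span.text]
print(rates)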
If you can find a unique id, it should take priority, like this: find_element(By.ID, 'hysa-apy-2'), as per @John Gordon's comment.
But sometimes, even when the element is found, its text has not loaded yet.
Use XPath and add the condition text()!="":
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[@id="hysa-apy-2" and text()!=""]')))
Required imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
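Putting it together, a minimal end-to-end sketch (assuming chromedriver is available on PATH):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get('https://www.americanexpress.com/en-us/banking/online-savings/account/')
# Wait until the span exists AND its text is non-empty.
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[@id="hysa-apy-2" and text()!=""]')))
print(element.text)
driver.quit()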

How to pull the href information from specific class using Selenium and Python

I'm currently working on some web scraping using Python and Selenium, and I can't seem to pull the link information from the href in an anchor tag for a specific class. For reference, it's from Zillow (specifically, this URL: https://www.zillow.com/homes/for_rent/San-Francisco,-CA_rb/ ).
I've tried a few different options in order to select the anchor tag listed, but can't seem to return the information I need:
links = driver.find_elements(By.CLASS_NAME, "list-card-info")
print(links[0].get_attribute('href'))
-- returns
None
also tried
links = driver.find_elements(By.CLASS_NAME, "list-card-top")
print(links[0].get_attribute('href'))
-- returns
None
links = driver.find_elements(By.CLASS_NAME, "list-card-link list-card-link-top-margin")
print(links[0].get_attribute('href'))
-- returns
None
and lastly
links = driver.find_elements(By.CSS_SELECTOR, "list-card-info.a")
print(links[0].get_attribute('href'))
I know I can pull all the anchor tags, but surely there is a step I'm missing here to get the nested anchor tag value? Or am I pulling the wrong class? I'm not sure where I'm going wrong.
To print the value of the href attribute you have to induce WebDriverWait for the visibility_of_all_elements_located() and using list slicing you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
driver.get('https://www.zillow.com/san-francisco-ca/rentals/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22San%20Francisco%2C%20CA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-122.62421695117187%2C%22east%22%3A-122.24244204882812%2C%22south%22%3A37.70334422496088%2C%22north%22%3A37.84716973355808%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20330%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D')
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[class='list-card-top'] > a[href]")))])
Using XPATH in a single line:
driver.get('https://www.zillow.com/san-francisco-ca/rentals/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22San%20Francisco%2C%20CA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-122.62421695117187%2C%22east%22%3A-122.24244204882812%2C%22south%22%3A37.70334422496088%2C%22north%22%3A37.84716973355808%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20330%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D')
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='list-card-top']/a[@href]")))])
Console Output:
['https://www.zillow.com/homedetails/San-Francisco-CA-94134/15166498_zpid/', 'https://www.zillow.com/b/avery-450-san-francisco-ca-BTfktx/', 'https://www.zillow.com/b/solaire-san-francisco-ca-65g7KK/', 'https://www.zillow.com/homedetails/117-Saint-Charles-Ave-San-Francisco-CA-94132/15195262_zpid/', 'https://www.zillow.com/homedetails/433-40th-Ave-San-Francisco-CA-94121/15092586_zpid/', 'https://www.zillow.com/homedetails/123-Carl-St-San-Francisco-CA-94117/2078490576_zpid/', 'https://www.zillow.com/b/fifteen-fifty-san-francisco-ca-BdnYPc/', 'https://www.zillow.com/b/l-seven-san-francisco-ca-9NJtD7/', 'https://www.zillow.com/homedetails/4642-18th-St-San-Francisco-CA-94114/332858409_zpid/']
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You could use XPATH to find the link (a tag) and use get_attribute('href') to get the link from the tag.
Like this:
href = driver.find_element(By.XPATH, '//div[#class="list-card-top"]/a').get_attribute('href')
print(href)
Another example:
href = driver.find_element(By.XPATH, '//div[#class="list-card-info"]/a').get_attribute('href')
print(href)
If you want to use By.CLASS_NAME, you could do it like this:
link = driver.find_element(By.CLASS_NAME, "list-card-top")
href = link.find_element(By.TAG_NAME, 'a').get_attribute('href')
print(href)
In your case:
links = driver.find_elements(By.CLASS_NAME, "list-card-info")
print(links[0].get_attribute('href'))
You're trying to find an attribute named 'href' in that div element with class list-card-info. We actually want to get the 'href' from the a tag inside that div.
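In other words, a minimal corrected version of your own snippet (a sketch, assuming the same page markup) descends into the nested a tag first:
from selenium.webdriver.common.by import By

# Select the <a> children of the card divs, not the divs themselves.
links = driver.find_elements(By.CSS_SELECTOR, "div.list-card-info > a[href]")
print(links[0].get_attribute('href'))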

How to scrape the href attributes of the top 10 clips from https://www.twitch.tv/directory/game/Overwatch/clips?range=7d using Selenium and Python

I have been having a consistent issue during webscraping of receiving an empty string instead of the expected results (based on inspect page html).
My specific goal is to get the link for the top 10 clips from https://www.twitch.tv/directory/game/Overwatch/clips?range=7d.
Here is my code:
# Gathers links of clips to download later
import bs4
import requests
from selenium import webdriver
from pprint import pprint
import time
from selenium.webdriver.common.keys import Keys
# Get links of multiple clips by webscraping main_url
main_url = 'https://www.twitch.tv/directory/game/Overwatch/clips?range=7d'
driver = webdriver.Firefox()
driver.get(main_url)
time.sleep(10)
elements_found = driver.find_elements_by_class_name("tw-interactive tw-link tw-link--hover-underline-none tw-link--inherit")
print(elements_found)
driver.quit()
This is how I decided on the class name
The page uses JavaScript, which is the reason I am using Selenium over the Requests module (which I tried, with no success).
I added the time.sleep(10) so that I have time to scroll through the webpage to trigger the JavaScript, to no avail.
I've also tried changing the user-agent and using XPaths, neither of which has produced different results.
No matter what I do, it seems that the program only looks at the raw HTML that is found by right-click -> view page source.
Any help and pointers would be greatly appreciated; I feel thoroughly stuck on this problem. I have been having these issues in all the projects of "Chapter 11: Webscraping" from Automate the Boring Stuff, and in my personal projects.
find_elements_by_class_name() receives only one class as its parameter, so elements_found is an empty list. For example:
find_elements_by_class_name('tw-interactive')
You are using four classes. To match all of them, use a CSS selector:
elements_found = find_elements_by_css_selector('.tw-interactive.tw-link.tw-link--hover-underline-none.tw-link--inherit')
Or explicitly
elements_found = find_elements_by_css_selector('[class="tw-interactive tw-link tw-link--hover-underline-none tw-link--inherit"]')
To get the href attributes from the elements use get_attribute()
for element in elements_found:
    print(element.get_attribute('href'))
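Note that the find_elements_by_* helpers were deprecated and later removed in Selenium 4; a sketch of the modern equivalent of the selector above:
from selenium.webdriver.common.by import By

elements_found = driver.find_elements(By.CSS_SELECTOR, '.tw-interactive.tw-link.tw-link--hover-underline-none.tw-link--inherit')
for element in elements_found:
    # href is an attribute of the matched <a> elements themselves.
    print(element.get_attribute('href'))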
As per the documentation of selenium.webdriver.common.by implementation:
class selenium.webdriver.common.by.By
Set of supported locator strategies.
CLASS_NAME = 'class name'
So using find_elements_by_class_name() you won't be able to pass multiple class names, i.e. tw-interactive, tw-link, tw-link--hover-underline-none and tw-link--inherit. Passing multiple classes you will face the error:
Message: invalid selector: Compound class names not permitted
You can find a detailed discussion in Invalid selector: Compound class names not permitted using find_element_by_class_name with Webdriver and Python
Solution
As an alternative you can induce WebDriverWait for the visibility_of_all_elements_located() and you can use either of the following Locator Strategies:
CSS_SELECTOR:
driver.get('https://www.twitch.tv/directory/game/Overwatch/clips?range=7d')
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.tw-interactive.tw-link.tw-link--hover-underline-none.tw-link--inherit")))])
XPATH:
driver.get('https://www.twitch.tv/directory/game/Overwatch/clips?range=7d')
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@class='tw-interactive tw-link tw-link--hover-underline-none tw-link--inherit']")))])
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
['https://www.twitch.tv/playoverwatch/clip/EnticingCoyTriangleM4xHeh', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/mteelul', 'https://www.twitch.tv/chipsa/clip/AgitatedGenerousFlyBleedPurple', 'https://www.twitch.tv/chipsa', 'https://www.twitch.tv/stracciateiia', 'https://www.twitch.tv/playoverwatch/clip/StormyNimbleJamKappaClaus', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/zenofymedia', 'https://www.twitch.tv/sleepy/clip/BombasticCautiousEmuBIRB', 'https://www.twitch.tv/sleepy', 'https://www.twitch.tv/vlday', 'https://www.twitch.tv/playoverwatch/clip/FinePlainApeGrammarKing', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/supdos', 'https://www.twitch.tv/playoverwatch/clip/MotionlessHomelyWrenchNononoCat', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/theefisch', 'https://www.twitch.tv/sonicboom83/clip/WanderingInspiringConsoleM4xHeh', 'https://www.twitch.tv/sonicboom83', 'https://www.twitch.tv/vollg1', 'https://www.twitch.tv/chipsa/clip/PunchyStrongPonyStrawBeary', 'https://www.twitch.tv/chipsa', 'https://www.twitch.tv/stracciateiia', 'https://www.twitch.tv/overwatchcontenders/clip/SavoryArtisticMelonEleGiggle', 'https://www.twitch.tv/overwatchcontenders', 'https://www.twitch.tv/asingledrop', 'https://www.twitch.tv/playoverwatch/clip/TubularLuckyLocustOptimizePrime', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/taipan20', 'https://www.twitch.tv/harbleu/clip/StrongStrongSushiDoggo', 'https://www.twitch.tv/harbleu', 'https://www.twitch.tv/aimmoth', 'https://www.twitch.tv/supertf/clip/GrossSmoothDolphinAMPTropPunch', 'https://www.twitch.tv/supertf', 'https://www.twitch.tv/tajin_ow', 'https://www.twitch.tv/playoverwatch/clip/TransparentCaringPoxVoteNay', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/nepptuneow', 'https://www.twitch.tv/space/clip/CharmingPeppyMetalFunRun', 'https://www.twitch.tv/space', 'https://www.twitch.tv/pantangelicious', 'https://www.twitch.tv/chipsa/clip/MoldyBadBananaRlyTho', 'https://www.twitch.tv/chipsa', 'https://www.twitch.tv/mopedinspector', 'https://www.twitch.tv/kephrii/clip/SoftSullenInternTTours', 'https://www.twitch.tv/kephrii', 'https://www.twitch.tv/kephrii', 'https://www.twitch.tv/valentine_ow/clip/GorgeousSincereMinkBleedPurple', 'https://www.twitch.tv/valentine_ow', 'https://www.twitch.tv/stracciateiia', 'https://www.twitch.tv/playoverwatch/clip/SpotlessTenuousTarsierPraiseIt', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/bluecloud123', 'https://www.twitch.tv/jake_ow/clip/TriumphantOptimisticQuailKAPOW', 'https://www.twitch.tv/jake_ow', 'https://www.twitch.tv/ph33rah', 'https://www.twitch.tv/playoverwatch/clip/DreamyDependableCheeseGOWSkull', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/carrosive']

Python Selenium - get href value

I am trying to copy the href value from a website, and the html code looks like this:
<p class="sc-eYdvao kvdWiq">
<a href="https://www.iproperty.com.my/property/setia-eco-park/sale-
1653165/">Shah Alam Setia Eco Park, Setia Eco Park
</a>
</p>
I've tried driver.find_elements_by_css_selector(".sc-eYdvao.kvdWiq").get_attribute("href"), but it returned 'list' object has no attribute 'get_attribute'. Using driver.find_element_by_css_selector(".sc-eYdvao.kvdWiq").get_attribute("href") returned None. But I can't use XPath because the website has 20+ hrefs, all of which I need to copy; using XPath would only copy one.
If it helps, all 20+ hrefs are categorised under the same class, which is sc-eYdvao kvdWiq.
Ultimately I would want to copy all 20+ hrefs and export them to a CSV file.
Appreciate any help possible.
You want driver.find_elements if there is more than one element; this will return a list. For the CSS selector, you want to ensure you are selecting those classes that have a child with an href:
elems = driver.find_elements_by_css_selector(".sc-eYdvao.kvdWiq [href]")
links = [elem.get_attribute('href') for elem in elems]
You might also need a wait condition for presence of all elements located by css selector.
elems = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".sc-eYdvao.kvdWiq [href]")))
As per the given HTML:
<p class="sc-eYdvao kvdWiq">
Shah Alam Setia Eco Park, Setia Eco Park
</p>
As the href attribute is within the <a> tag, ideally you need to move deeper, till the <a> node. So to extract the value of the href attribute you can use either of the following Locator Strategies:
Using css_selector:
print(driver.find_element_by_css_selector("p.sc-eYdvao.kvdWiq > a").get_attribute('href'))
Using xpath:
print(driver.find_element_by_xpath("//p[@class='sc-eYdvao kvdWiq']/a").get_attribute('href'))
If you want to extract all the values of the href attribute you need to use find_elements* instead:
Using css_selector:
print([my_elem.get_attribute("href") for my_elem in driver.find_elements_by_css_selector("p.sc-eYdvao.kvdWiq > a")])
Using xpath:
print([my_elem.get_attribute("href") for my_elem in driver.find_elements_by_xpath("//p[@class='sc-eYdvao kvdWiq']/a")])
Dynamic elements
However, if you observe the values of the class attributes, i.e. sc-eYdvao and kvdWiq, those look like dynamically generated values. So to extract the href attribute you have to induce WebDriverWait for the visibility_of_element_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "p.sc-eYdvao.kvdWiq > a"))).get_attribute('href'))
Using XPATH:
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//p[@class='sc-eYdvao kvdWiq']/a"))).get_attribute('href'))
If you want to extract all the values of the href attribute you can use visibility_of_all_elements_located() instead:
Using CSS_SELECTOR:
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "p.sc-eYdvao.kvdWiq > a")))])
Using XPATH:
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//p[@class='sc-eYdvao kvdWiq']/a")))])
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
The XPath
//p[@class='sc-eYdvao kvdWiq']/a
returns the elements you are looking for.
Writing the data to CSV file is not related to the scraping challenge. Just try to look at examples and you will be able to do it.
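That said, a minimal sketch for the CSV part using the standard-library csv module (assuming links is the list of hrefs gathered above; the filename is arbitrary):
import csv

with open('links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['href'])  # header row
    writer.writerows([link] for link in links)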
To crawl any hyperlink or href, the Proxycrawl API is ideal, as it uses pre-built functions for fetching the desired information. Just pip install the API and follow the code to get the required output. A second approach to fetch href links using Python Selenium is to run the following code.
Source Code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time
urls = ['https://www.heliosholland.com/Ampullendoos-voor-63-ampullen', 'https://www.heliosholland.com/lege-testdozen']
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 29)
for url in urls:
    driver.get(url)
    image = wait.until(EC.visibility_of_element_located((By.XPATH, '/html/body/div[1]/div[3]/div[2]/div/div[2]/div/div/form/div[1]/div[1]/div/div/div/div[1]/div/img'))).get_attribute('src')
    print(image)
To scrape the link, use .get_attribute('src').
Get the element you want with driver.find_element(By.XPATH, 'path').
To extract the href link, use get_attribute('href').
Which gives:
driver.find_element(By.XPATH, 'path').get_attribute('href')
Try something like:
elems = driver.find_elements_by_xpath("//p[contains(@class, 'sc-eYdvao') and contains(@class, 'kvdWiq')]/a")
for elem in elems:
    print(elem.get_attribute('href'))
