Fetch all href links using Selenium in Python

I am practicing Selenium in Python and I wanted to fetch all the links on a web page using Selenium.
For example, I want all the links in the href= attribute of all the <a> tags on http://psychoticelites.com/
I've written a script and it is working, but it's giving me the object address. I've tried using the id tag to get the value, but it doesn't work.
My current script:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://psychoticelites.com/")
assert "Psychotic" in driver.title
continue_link = driver.find_element_by_tag_name('a')
elem = driver.find_elements_by_xpath("//*[@href]")
#x = str(continue_link)
#print(continue_link)
print(elem)

Well, you have to simply loop through the list:
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
find_elements_by_* returns a list of elements (note the spelling of 'elements'). Loop through the list, take each element, and fetch the attribute value you want from it (in this case, href).
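On Selenium 4 and newer the find_elements_by_* helpers have been removed (see the Selenium 4 answer further down); a minimal sketch of the same loop with the By locator:
from selenium.webdriver.common.by import By

# same idea, Selenium 4 style: collect every <a> that has an href
elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))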

I have checked and tested that there is a function named find_elements_by_tag_name() you can use for this. The example below works fine for me.
elems = driver.find_elements_by_tag_name('a')
for elem in elems:
    href = elem.get_attribute('href')
    if href is not None:
        print(href)

driver.get(URL)
time.sleep(7)
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
driver.close()
Note: adding a delay is very important. First run it in debug mode and make sure the page at your URL actually loads. If the page loads slowly, increase the delay (sleep time) before extracting, or use an explicit wait as sketched below.
If you still face any issues, refer to the link below (explained with an example) or comment:
Extract links from webpage using selenium webdriver
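Alternatively, instead of a fixed sleep, an explicit wait is usually more robust; a minimal sketch (the 10-second timeout is just an example):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(URL)
# wait up to 10 seconds for at least one <a href> to appear
elems = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[@href]"))
)
for elem in elems:
    print(elem.get_attribute("href"))
driver.close()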

You can try something like:
links = driver.find_elements_by_partial_link_text('')
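Note that this returns WebElement objects, not URLs, so you still read the href from each one (an empty partial link text should match every anchor that has link text, though behaviour can vary by driver):
links = driver.find_elements_by_partial_link_text('')
for link in links:
    # each item is a WebElement; the href still has to be read explicitly
    print(link.get_attribute('href'))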

You can parse the HTML DOM using the htmldom library in Python. You can find it here and install it using pip:
https://pypi.python.org/pypi/htmldom/2.0
from htmldom import htmldom
dom = htmldom.HtmlDom("https://www.github.com/")
dom = dom.createDom()
The above code creates an HtmlDom object. HtmlDom takes a default parameter, the URL of the page. Once the dom object is created, you need to call the "createDom" method of HtmlDom. This parses the HTML data and constructs the parse tree, which can then be used for searching and manipulating the HTML data. The only restriction the library imposes is that the data, whether HTML or XML, must have a root element.
You can query the elements using the "find" method of the HtmlDom object:
p_links = dom.find("a")
for link in p_links:
    print("URL: " + link.attr("href"))
The above code will print all the links/URLs present on the web page.

Unfortunately, the original link posted by OP is dead...
If you're looking for a way to scrape links on a page, here's how you can scrape all of the "Hot Network Questions" links on this page with gazpacho:
from gazpacho import Soup
url = "https://stackoverflow.com/q/34759787/3731467"
soup = Soup.get(url)
a_tags = soup.find("div", {"id": "hot-network-questions"}).find("a")
[a.attrs["href"] for a in a_tags]

You can do this using BeautifulSoup in a very easy and efficient way. I have tested the code below and it worked fine for the same purpose.
After this line:
driver.get("http://psychoticelites.com/")
use the code below:
import requests
from bs4 import BeautifulSoup

response = requests.get(driver.current_url)
soup = BeautifulSoup(response.content, 'html.parser')
for link in soup.find_all('a'):
    if link.get('href'):
        print(link.get('href'))
        print('\n')

The answers above using Selenium's driver.find_elements_by_* no longer work with Selenium 4. The current method is to use find_elements() with the By class.
Method 1: For loop
The code below uses two lists, one for By.XPATH and the other for By.TAG_NAME. Either one can be used; both are not needed.
By.XPATH IMO is the easiest as it does not return a seemingly useless None value like By.TAG_NAME does. The code also removes duplicates.
from selenium.webdriver.common.by import By
driver.get("https://www.amazon.com/")
href_links = []
href_links2 = []
elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
elems2 = driver.find_elements(by=By.TAG_NAME, value="a")
for elem in elems:
    l = elem.get_attribute("href")
    if l not in href_links:
        href_links.append(l)
for elem in elems2:
    l = elem.get_attribute("href")
    if (l not in href_links2) and (l is not None):
        href_links2.append(l)
print(len(href_links))  # 360
print(len(href_links2))  # 360
print(href_links == href_links2)  # True
Method 2: List Comprehension
If duplicates are OK, a one-line list comprehension can be used.
from selenium.webdriver.common.by import By
driver.get("https://www.amazon.com/")
elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
href_links = [e.get_attribute("href") for e in elems]
elems2 = driver.find_elements(by=By.TAG_NAME, value="a")
# href_links2 = [e.get_attribute("href") for e in elems2] # Does not remove None values
href_links2 = [e.get_attribute("href") for e in elems2 if e.get_attribute("href") is not None]
print(len(href_links)) # 387
print(len(href_links2)) # 387
print(href_links == href_links2) # True

import requests
from selenium import webdriver
import bs4
driver = webdriver.Chrome(r'C:\chromedrivers\chromedriver') #enter the path
data=requests.request('get','https://google.co.in/') #any website
s=bs4.BeautifulSoup(data.text,'html.parser')
for link in s.findAll('a'):
    print(link)

Update to the existing answer:
For the current version, it needs to be:
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

Related

Selenium webscraper not scraping desired tags

Here are the two tags I am trying to scrape: https://i.stack.imgur.com/a1sVN.png. In case you are wondering, this is the link to that page (the tags I am trying to scrape are not behind the paywall): https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635
Below is the Python code I am using. Does anyone know why the tags are not being stored properly in paragraphs?
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635'
driver = webdriver.Chrome()
driver.get(url)
paragraphs = driver.find_elements(By.CLASS_NAME, 'css-xbvutc-Paragraph e3t0jlg0')
print(len(paragraphs)) # => prints 0
So you have two problems impacting you:
1. You should wait for the page to load after you get() the webpage. You can do this with something like import time and time.sleep(10).
2. The class names of the elements you are trying to scrape change on every page load. However, the data-type='paragraph' attribute stays constant, so you are able to do:
paragraphs = driver.find_elements(By.XPATH, '//*[@data-type="paragraph"]')  # search by XPath to find the elements with that data attribute
print(len(paragraphs))
This prints 2 after the page is loaded.
Just to add on to @Andrew Ryan's answer, you can use an explicit wait for a shorter and more dynamic waiting time.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@data-type="paragraph"]'))
)
print(len(paragraphs))

Get multiple elements by tag with Python and Selenium

My code goes into a website and scrapes rows of information (title and time).
However, there is one tag ('p') that I am not sure how to get using 'get element by'.
On the website, it is the information under each title.
Here is my code so far:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
driver = webdriver.Chrome()
driver.get('https://www.nutritioncare.org/ASPEN21Schedule/#tab03_19')
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
eachRow = driver.find_elements_by_class_name('timeline__item')
time.sleep(1)
for item in eachRow:
    time.sleep(1)
    title = item.find_element_by_class_name('timeline__item-title')
    tim = item.find_element_by_class_name('timeline__item-time')
    tex = item.find_element_by_tag_name('p')  # This is the part I don't know how to scrape
    print(title.text, tim.text, tex.text)
I checked the page and there are several p tags. I suggest using find_elements_by_tag_name instead of find_element_by_tag_name (to get all the p tags, including the one you want), iterating over all the p tag elements, then joining the text content and stripping it.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import requests
driver = webdriver.Chrome()
driver.get('https://www.nutritioncare.org/ASPEN21Schedule/#tab03_19')
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
eachRow = driver.find_elements_by_class_name('timeline__item')
time.sleep(1)
for item in eachRow:
    time.sleep(1)
    title = item.find_element_by_class_name('timeline__item-title')
    tim = item.find_element_by_class_name('timeline__item-time')
    tex = item.find_elements_by_tag_name('p')
    text = " ".join([i.text for i in tex]).strip()
    print(title.text, tim.text, text)
Since the webpage has several p tags, it would be better to use the plural find_elements_by_tag_name() method. Replace the print call in the code with the following:
print(title.text, tim.text)
for t in tex:
    if t.text == '':
        continue
    print(t.text)
Maybe try using different find_elements_by_class... I don't use Python that much, but try this unless you already have.

Getting an empty list from an XPath expression in Python

I have watched a video at this link https://www.youtube.com/watch?v=EELySnTPeyw and this is the code (I have changed the xpath as it seems the website has been changed):
import selenium.webdriver as webdriver

def get_results(search_term):
    url = 'https://www.startpage.com'
    browser = webdriver.Chrome(executable_path="D:\\webdrivers\\chromedriver.exe")
    browser.get(url)
    search_box = browser.find_element_by_id('q')
    search_box.send_keys(search_term)
    results = []
    try:
        links = browser.find_elements_by_xpath("//a[contains(@class, 'w-gl__result-title')]")
    except:
        links = browser.find_elements_by_xpath("//h3//a")
    print(links)
    for link in links:
        href = link.get_attribute('href')
        print(href)
        results.append(href)
    browser.close()

get_results('cat')
The code works well for opening the browser, navigating to the search box, and sending keys, but the links come back as an empty list, although when I manually search for the XPath in the developer tools it returns 10 results.
You need to add Keys.ENTER to your search; you weren't on the results page yet.
search_box.send_keys(search_term+Keys.ENTER)
Import
from selenium.webdriver.common.keys import Keys
Outputs
https://en.wikipedia.org/wiki/Cat
https://www.cat.com/en_US.html
https://www.cat.com/
https://www.youtube.com/watch?v=cbP2N1BQdYc
https://icatcare.org/advice/thinking-of-getting-a-cat/
https://www.caterpillar.com/en/brands/cat.html
https://www.petfinder.com/cats/
https://www.catfootwear.com/US/en/home
https://www.aspca.org/pet-care/cat-care/general-cat-care
https://www.britannica.com/animal/cat
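Putting the fix together, a rough sketch of the corrected function (the selector comes from the question; the explicit wait and its 10-second timeout are assumptions):
import selenium.webdriver as webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_results(search_term):
    browser = webdriver.Chrome(executable_path="D:\\webdrivers\\chromedriver.exe")
    browser.get('https://www.startpage.com')
    search_box = browser.find_element_by_id('q')
    # ENTER submits the search so the results page actually loads
    search_box.send_keys(search_term + Keys.ENTER)
    # wait for the result links instead of reading them immediately
    links = WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located(
            (By.XPATH, "//a[contains(@class, 'w-gl__result-title')]")))
    results = [link.get_attribute('href') for link in links]
    browser.close()
    return results

for href in get_results('cat'):
    print(href)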

About Selenium in Python

I'm trying to use Python Selenium for some web automation.
For now, I can get all of the hrefs from the page.
The question is: I want the href only when the <i class="icon as--c like btn_icon"><span>number</span></i> element contains a number >= 1; if there is no number, it shouldn't get the href.
This is the code with a number >= 1.
This is the code with no number in it.
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get("") # link
for i in range(1, 10):
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(2)
class_name = driver.find_elements_by_class_name("tile__covershot")
links = [elem.get_attribute('href') for elem in class_name]
for link in links:
    print(link)
You may use a relative XPath to get what you want. In your case, something like this:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get("") # link
for i in range(1, 10):
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(2)
elem_with_number_xpath = "(//span[contains(text(),'1')]|//span[contains(text(),'2')]|//span[contains(text(),'3')]|//span[contains(text(),'4')]|//span[contains(text(),'5')]|//span[contains(text(),'6')]|//span[contains(text(),'7')]|//span[contains(text(),'8')]|//span[contains(text(),'9')])/ancestor::div[2]/preceding::a[1]"
elems_with_number = driver.find_elements_by_xpath(elem_with_number_xpath)
links = [elem.get_attribute('href') for elem in elems_with_number]
for link in links:
    print(link)
Grab the a href tags, loop through them, check that the span text isn't empty and is > 1, and if so append the href to a list.
elems = driver.find_elements_by_xpath("//i[@class='icon as--c like btn_icon']/span/ancestor::div[2]/preceding::a[1]")
lst = []
for elem in elems:
    val = elem.find_element_by_xpath('//i/span')
    if len(val.text) != 0:
        if int(val.text) > 1:
            lst.append(elem.get_attribute('href'))

How to check if a web element is visible

I am using Python with BeautifulSoup4 and I need to retrieve visible links on the page. Given this code:
soup = BeautifulSoup(html)
links = soup('a')
I would like to create a method is_visible that checks whether or not a link is displayed on the page.
Solution Using Selenium
Since I am working also with Selenium I know that there exist the following solution:
from selenium.webdriver import Firefox
firefox = Firefox()
firefox.get('https://google.com')
links = firefox.find_elements_by_tag_name('a')
for link in links:
    if link.is_displayed():
        print('{} => Visible'.format(link.text))
    else:
        print('{} => Hidden'.format(link.text))
firefox.quit()
Performance Issue
Unfortunately, the is_displayed method and getting the text attribute each perform an HTTP request to retrieve that information. Therefore things can get really slow when there are many links on a page or when you have to do this multiple times.
On the other hand, BeautifulSoup can perform these parsing operations almost instantly once you get the page source, but I can't figure out how to check visibility with it.
AFAIK, BeautifulSoup will only help you parse the actual markup of the HTML document anyway. If that's all you need, then you can do it in a manner like so (yes, I already know it's not perfect):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)

def is_visible_1(link):
    # do whatever in this function you can to determine your markup is correct
    try:
        style = link.get('style')
        if 'display' in style and 'none' in style:  # or use a regular expression
            return False
    except Exception:
        return False
    return True

def is_visible_2(**kwargs):
    try:
        soup = kwargs.get('soup', None)
        del kwargs['soup']
        # Exception thrown if element can't be found using kwargs
        link = soup.find_all(**kwargs)[0]
        style = link.get('style')
        if 'display' in style and 'none' in style:  # or use a regular expression
            return False
    except Exception:
        return False
    return True

# checks links that already exist, not *if* they exist
for link in soup.find_all('a'):
    print(str(is_visible_1(link)))

# checks if an element exists
print(str(is_visible_2(soup=soup, id='someID')))
BeautifulSoup doesn't take into account the other factors that determine whether an element is visible or not, like CSS, scripts, and dynamic DOM changes. Selenium, on the other hand, does tell you whether an element is actually being rendered, and generally does so through accessibility APIs in the given browser. You must decide whether sacrificing accuracy for speed is worth it. Good luck! :-)
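If the per-element round trips are what hurts, another option (a sketch, not something from the original answers) is to gather visibility for every link in a single execute_script call, using the offsetWidth/offsetHeight/getClientRects heuristic as an approximation of is_displayed:
from selenium.webdriver import Firefox

firefox = Firefox()
firefox.get('https://google.com')

# One script execution returns text, href and a visibility flag for every <a>,
# so there is a single round trip instead of one per element.
link_info = firefox.execute_script("""
    return Array.from(document.querySelectorAll('a')).map(function (a) {
        return {
            text: a.textContent,
            href: a.href,
            visible: !!(a.offsetWidth || a.offsetHeight || a.getClientRects().length)
        };
    });
""")

for info in link_info:
    state = 'Visible' if info['visible'] else 'Hidden'
    print('{} => {}'.format(info['text'], state))

firefox.quit()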
Try find_elements_by_xpath and execute_script:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.google.com/?hl=en")
links = driver.find_elements_by_xpath('//a')
driver.execute_script('''
    var links = document.querySelectorAll('a');
    links.forEach(function(a) {
        a.addEventListener("click", function(event) {
            event.preventDefault();
        });
    });
''')
visible = []
hidden = []
for link in links:
    try:
        link.click()
        visible.append('{} => Visible'.format(link.text))
    except:
        hidden.append('{} => Hidden'.format(link.get_attribute('textContent')))
    # time.sleep(0.1)
print('\n'.join(visible))
print('===============================')
print('\n'.join(hidden))
print('===============================\nTotal links length: %s' % len(links))
driver.execute_script('alert("Finish")')
