How to extract href and label name with Selenium in Python?

I'm trying to pull the href and the data-promoname from this URL:
https://www2.deloitte.com/global/en/pages/about-deloitte/topics/combating-covid-19-with-resilience.html?icid=covid-19_article-nav
With the code below I can only extract the href of the elements with the class "promo-focus". I also want to get the value of data-promoname, for example "COVID-19 Economic cases: Scenarios for business leaders".
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')
url = "https://www2.deloitte.com/global/en/pages/about-deloitte/topics/combating-covid-19-with-resilience.html?icid=covid-19_article-nav"
driver.get(url)
for i in driver.find_elements_by_class_name('promo-focus'):
    print(i.get_attribute('href'))
Can anyone tell me how to do that using Python?

Use the element's .text property to get its visible text.
For example:
from selenium import webdriver
chrome_browser = webdriver.Chrome()
url = "https://www2.deloitte.com/global/en/pages/about-deloitte/topics/combating-covid-19-with-resilience.html?icid=covid-19_article-nav"
chrome_browser.get(url)
for a in chrome_browser.find_elements_by_class_name('promo-focus'):
    print(a.get_attribute('href'))
    print(a.text)

To get the value of data-promoname, use the get_attribute method. It returns the value of any attribute set on the element's tag.
from selenium import webdriver
driver_path = 'C:/chromedriver.exe'  # the path to your Chrome driver
browser = webdriver.Chrome(driver_path)
url_to_open = 'https://www2.deloitte.com/global/en/pages/about-deloitte/topics/combating-covid-19-with-resilience.html?icid=covid-19_article-nav'
browser.get(url_to_open)
for a in browser.find_elements_by_class_name('promo-focus'):
    print(a.get_attribute('href'))
    print(a.get_attribute("data-promoname"))
If you are looking for the text displayed on the page inside the anchor tags, use .text instead:
    print(a.text)
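As a rough sketch of the difference between an attribute value and the visible text, the helper below collects all three values per element. It only relies on get_attribute() and .text, which Selenium's WebElement provides. The names promo_records and FakeElement, and the example.com URLs, are made up for illustration; FakeElement is a stand-in so the sketch runs without a browser.

```python
def promo_records(elements):
    """Return (href, data-promoname, visible text) for each element.

    Works with Selenium WebElement objects, or with anything else that
    exposes get_attribute() and a .text attribute.
    """
    records = []
    for el in elements:
        records.append(
            (
                el.get_attribute("href"),
                el.get_attribute("data-promoname"),
                el.text.strip(),
            )
        )
    return records


class FakeElement:
    """Hypothetical stand-in for a WebElement (assumption, not Selenium code)."""

    def __init__(self, attrs, text):
        self._attrs = attrs
        self.text = text

    def get_attribute(self, name):
        # Like WebElement.get_attribute, return None for a missing attribute.
        return self._attrs.get(name)


sample = [
    FakeElement({"href": "https://example.com/a", "data-promoname": "Promo A"}, " Read more "),
    FakeElement({"href": "https://example.com/b"}, "Plain link"),
]
for record in promo_records(sample):
    print(record)
```

With a real driver you would pass the located elements instead, for example promo_records(driver.find_elements(By.CLASS_NAME, "promo-focus")) after from selenium.webdriver.common.by import By. That locator style is the one to use on current Selenium 4 releases, where the find_elements_by_* helpers are no longer available.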

Related

Selenium webscraper not scraping desired tags

Here are the two tags I am trying to scrape: https://i.stack.imgur.com/a1sVN.png. In case you are wondering, this is the link to the page (the tags I am trying to scrape are not behind the paywall): https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635
Below is the Python code I am using. Does anyone know why the tags are not being stored in paragraphs?
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635'
driver = webdriver.Chrome()
driver.get(url)
paragraphs = driver.find_elements(By.CLASS_NAME, 'css-xbvutc-Paragraph e3t0jlg0')
print(len(paragraphs)) # => prints 0
You have two problems here.
1. You should wait for the page to load after you get() the webpage. A simple way to do this is import time and time.sleep(10).
2. The class names on the elements you are trying to scrape change on every page load. However, the attribute data-type="paragraph" stays constant, so you can do:
paragraphs = driver.find_elements(By.XPATH, '//*[@data-type="paragraph"]')  # search by XPath for elements with that data attribute
print(len(paragraphs))
This prints 2 once the page has loaded.
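In XPath, an attribute test is written with @ (a leading # is CSS syntax for an id and is not valid inside an XPath predicate). The sketch below tries the same attribute predicate on a small, made-up piece of markup using the standard library's xml.etree.ElementTree, which supports a limited XPath subset. The markup and class names are assumptions for illustration, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the article markup: the class names
# differ on every element, but data-type="paragraph" is stable.
markup = """
<article>
  <p class="css-aaa111" data-type="paragraph">First paragraph.</p>
  <p class="css-bbb222" data-type="paragraph">Second paragraph.</p>
  <p class="css-ccc333" data-type="caption">A caption.</p>
</article>
"""

root = ET.fromstring(markup)

# Any element, at any depth, whose data-type attribute equals "paragraph".
paragraphs = root.findall(".//*[@data-type='paragraph']")

print(len(paragraphs))
for p in paragraphs:
    print(p.text)
```

Selecting on a stable data attribute instead of generated class names keeps the locator working when the class names change between page loads.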
Just to add on to @Andrew Ryan's answer, you can use an explicit wait for a shorter, more dynamic waiting time.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@data-type="paragraph"]'))
)
print(len(paragraphs))
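Conceptually, an explicit wait polls a condition until it returns something truthy or a timeout expires, so it returns as soon as the elements exist instead of always sleeping for a fixed time. The following is a minimal standard-library sketch of that idea, not Selenium's implementation; wait_until, find_paragraphs, and the simulated page are hypothetical names for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page: the "elements" only show up on the third poll.
attempts = {"count": 0}


def find_paragraphs():
    attempts["count"] += 1
    return ["p1", "p2"] if attempts["count"] >= 3 else []


found = wait_until(find_paragraphs, timeout=2.0, poll_interval=0.05)
print(found, "after", attempts["count"], "polls")
```

Here the wait ends after roughly two poll intervals rather than the full timeout, which is the advantage over a fixed time.sleep(10).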

How to select particular region and scrape all the Jobs from a website

I am trying to scrape all the jobs from a job portal after selecting a particular country.
I am sorry to attach a picture, but the intent is to show you what the page looks like.
What I tried:
Below is what I tried, but I'm not getting anything. I have just started learning web scraping.
import requests
from bs4 import BeautifulSoup
job_url = 'https://wd3.myworkdayjobs.com/careers/'
out_req = requests.get(job_url)
soup = BeautifulSoup(out_req.text, 'html.parser')
print(soup)
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
Any help will be much appreciated.
Try the Selenium library: select the region by searching on element attributes, then scrape the search results with Beautiful Soup.
from selenium import webdriver
# chromedriver is an executable that Selenium invokes; it in turn launches the actual browser
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
# maximize the browser window
driver.maximize_window()
# launch the URL ("Website" is a placeholder)
driver.get("Website")
# refresh the browser
driver.refresh()
# collect the checkboxes by their type attribute
chk = driver.find_elements_by_xpath("//input[@type='checkbox']")
# number of checkboxes found
print(len(chk))
# click the checkbox whose value attribute matches the country
for i in chk:
    if i.get_attribute("value") == "United states of America":
        i.click()
# close the browser
driver.close()
#############################
# Beautiful Soup code here
#############################

Get an empty list of XPATH expression in python

I have watched the video at https://www.youtube.com/watch?v=EELySnTPeyw and this is the code (I have changed the XPath because the website seems to have changed):
import selenium.webdriver as webdriver
def get_results(search_term):
    url = 'https://www.startpage.com'
    results = []
    browser = webdriver.Chrome(executable_path="D:\\webdrivers\\chromedriver.exe")
    browser.get(url)
    search_box = browser.find_element_by_id('q')
    search_box.send_keys(search_term)
    try:
        links = browser.find_elements_by_xpath("//a[contains(@class, 'w-gl__result-title')]")
    except:
        links = browser.find_elements_by_xpath("//h3//a")
    print(links)
    for link in links:
        href = link.get_attribute('href')
        print(href)
        results.append(href)
    browser.close()
get_results('cat')
The code works for opening the browser, locating the search box, and sending the keys, but links comes back as an empty list, even though the same XPath returns 10 results when I search for it manually in the developer tools.
You need to add Keys.ENTER to your search. Without it the search is never submitted, so you were still on the start page rather than the results page.
search_box.send_keys(search_term + Keys.ENTER)
This requires the import:
from selenium.webdriver.common.keys import Keys
Output:
https://en.wikipedia.org/wiki/Cat
https://www.cat.com/en_US.html
https://www.cat.com/
https://www.youtube.com/watch?v=cbP2N1BQdYc
https://icatcare.org/advice/thinking-of-getting-a-cat/
https://www.caterpillar.com/en/brands/cat.html
https://www.petfinder.com/cats/
https://www.catfootwear.com/US/en/home
https://www.aspca.org/pet-care/cat-care/general-cat-care
https://www.britannica.com/animal/cat
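Putting the pieces together, a version written against the current Selenium 4 locator API might look like the sketch below. It submits the search with Keys.ENTER and then waits explicitly for the result links before reading them. The element id q and the XPath are taken from the question and may no longer match the live site, the timeout parameter is an addition for illustration, and the imports sit inside the function only so that defining it does not require a browser.

```python
def get_results(search_term, timeout=10):
    """Search startpage.com and return the href of each result link."""
    # Imported here so the function can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()
    try:
        browser.get("https://www.startpage.com")
        search_box = browser.find_element(By.ID, "q")
        # ENTER submits the query, which navigates to the results page.
        search_box.send_keys(search_term + Keys.ENTER)
        # Wait until at least one result link is present in the DOM.
        links = WebDriverWait(browser, timeout).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, "//a[contains(@class, 'w-gl__result-title')]")
            )
        )
        return [link.get_attribute("href") for link in links]
    finally:
        # Always close the browser, even if the wait times out.
        browser.quit()
```

Calling get_results('cat') would then return the list of links rather than printing them, assuming the selectors still match.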

How to get iframe source from page_source when the id isn't on the iframe

Hello, I want to ask how to get the link inside the page source when the iframe has no id. I asked before how to get the link when there is an id, and I understand that now, but I tried the same method with another link without success. Here is my code:
from selenium import webdriver
# Create a new instance of the Chrome driver
driver_path = r"C:\Users\666\Desktop\New folder (8)\chromedriver.exe"
driver = webdriver.Chrome(driver_path)
# go to the video page
driver.get("https://www.gledalica.com/sa-prevodom/iron-fist-s02e01-video_02cb355f8.html")
# find the element whose id is "Playerholder"
element = driver.find_element_by_id("Playerholder")
frame = driver.find_element_by_tag_name("iframe")
driver.switch_to.frame("iframe")
link = frame.get_attribute("src")
driver.quit()
There are multiple ways to get it. In this case one of the easiest is to use a CSS selector:
frame = driver.find_element_by_css_selector('#Playerholder iframe')
This looks for the element with id "Playerholder" in the HTML and then for an iframe that is a descendant of it. You can then read its src attribute with frame.get_attribute("src"); there is no need to switch into the frame first.

How to get link from elements with Selenium and Python

Let's say all Author/username elements in one webpage look like this...
How can I get to the href part using python and Selenium?
users = browser.find_elements_by_xpath(?)
<span>
    Author:
    <a href="/account/57608-bob">
        bob
    </a>
</span>
Thanks.
Use find_elements_by_tag_name('a') to find the 'a' tags, and then use get_attribute('href') to get the link string.
Use .//span[contains(text(), "Author")]/a as xpath expression.
For example:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://jsfiddle.net/9pKMU/show/')
for a in driver.find_elements_by_xpath('.//span[contains(text(), "Author")]/a'):
    print(a.get_attribute('href'))
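The same selection logic can be checked without a browser on the markup from the question. The sketch below uses the standard library's xml.etree.ElementTree, which does not support contains(), so the "Author" test is done in Python instead. The wrapping div and the second "Editor" span are made up to show that only the author link is picked.

```python
import xml.etree.ElementTree as ET

# The first span is the markup from the question; the surrounding <div> and
# the "Editor" span are hypothetical additions for illustration.
markup = """
<div>
  <span>
    Author:
    <a href="/account/57608-bob">
      bob
    </a>
  </span>
  <span>
    Editor:
    <a href="/account/1-alice">alice</a>
  </span>
</div>
"""

root = ET.fromstring(markup)

# Keep the <a> children of every <span> whose own text mentions "Author".
hrefs = [
    a.get("href")
    for span in root.iter("span")
    if "Author" in (span.text or "")
    for a in span.findall("a")
]
print(hrefs)
```

Note that this prints the raw attribute value, a relative path. In a browser, Selenium's get_attribute('href') usually reports the resolved absolute URL instead.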
With this code you can get all the links from a webpage:
from selenium import webdriver
driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://your website/")
# identify elements with tagname <a>
lnks=driver.find_elements_by_tag_name("a")
# traverse the list
for lnk in lnks:
    # get_attribute() returns the href of each link
    print(lnk.get_attribute("href"))
driver.quit()
