I'm trying to scrape specific links on a website. I'm using Python and Selenium 4.8.
The HTML code looks like this with multiple lists, each containing a link:
<li>
  <div class="programme programme xxx">
    <div class="programme_body">
      <h4 class="programme titles">
        <a class="br-blocklink__link" href="https://www.example_link1.com">
        </a>
      </h4>
    </div>
  </div>
</li>
<li>...</li>
<li>...</li>
So each <li> contains a link.
Ideally, I would like a Python list of all the hrefs, which I can then iterate through to get additional output.
Thank you for your help!
You can try something like the below (untested, as you didn't confirm the URL):
[...]
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
[...]
wait = WebDriverWait(driver, 25)
[...]
wanted_elements = [
    x.get_attribute('href')
    for x in wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, '//li//h4[@class="programme titles"]/a[@class="br-blocklink__link"]')
    ))
]
Selenium documentation can be found here: https://www.selenium.dev/documentation/
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://www.example.com")
lis = driver.find_elements(By.XPATH, '//li//a[@class="br-blocklink__link"]')
hrefs = []
for li in lis:
    hrefs.append(li.get_attribute('href'))
driver.quit()
This will give you a list, hrefs, with all the hrefs from the website. You can then iterate through this list and use the links for further processing.
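Since the list is then iterated for further processing, it can help to clean it up first. A minimal pure-Python sketch (the filtering rules here are just an example, not part of the original question):

```python
def clean_hrefs(hrefs):
    """Drop None/empty entries and duplicates while preserving order."""
    seen = set()
    result = []
    for href in hrefs:
        if href and href not in seen:
            seen.add(href)
            result.append(href)
    return result

hrefs = [
    "https://www.example_link1.com",
    None,  # elements without an href attribute return None
    "https://www.example_link1.com",  # duplicate
    "https://www.example_link2.com",
]
print(clean_hrefs(hrefs))
# → ['https://www.example_link1.com', 'https://www.example_link2.com']
```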
I'm doing a scraping process using selenium in which my goal is to extract the views, likes, comments and shares of the videos that are made to an audio in TikTok.
In the process I found this path:
<div data-e2e="music-item-list" mode="compact" class="tiktok-yvmafn-DivVideoFeedV2 e5w7ny40">
This contains the different videos of the audio, however it is inside a <div> and not <li>.
How do I convert the divs contained in the path into a list that I can manipulate?
This is what I did:
url = 'https://www.tiktok.com/music/Sweater-Weather-Sped-Up-7086537183875599110'
driver.get(url)
posts = driver.find_element(By.XPATH, '//div[@data-e2e="music-item-list"]')
post1 = posts[0]
A proper way to locate those elements would be to wait for them first, then locate them as a list and index into it:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
[...]
wait = WebDriverWait(driver, 20)
[...]
posts = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@data-e2e="music-item-list"]/div')))
for post in posts:
    print(post.text)
Selenium documentation: https://www.selenium.dev/documentation/
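One note for the views/likes/comments/shares goal: TikTok renders counts in abbreviated form (e.g. "1.2M"), so after reading post.text you will likely need to normalize the numbers. A sketch, assuming that K/M/B suffix convention holds (an assumption about how the counts are displayed):

```python
def parse_count(text):
    """Convert an abbreviated count like '1.2M' or '87.5K' to an int.
    The K/M/B suffix convention is an assumption about TikTok's display format."""
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    text = text.strip()
    if text and text[-1].upper() in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1].upper()])
    return int(float(text))

print(parse_count("1.2M"))   # 1200000
print(parse_count("87.5K"))  # 87500
print(parse_count("342"))    # 342
```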
We are working on extracting the image source address from the page.
<div class="product-row">
<div class="product-item">
<div class="product-picture"><img src="https://t3a.coupangcdn.com/thumbnails/remote/212x212ex/image/vendor_inventory/6ca9/2e097d911efc291473d0c47052cdc8f42d7b7b8f2a3ebbb0ccc974d76fe4.jpg" alt="product"><div><button type="button" class="ant-btn hover-btn btn-open-detail">
</div></div>
<div class="product-item">
<div class="product-picture">
<img src="https://thumbnail11.coupangcdn.com/thumbnails/remote/212x212ex/image/retail/images/239519218793467-6edc7d92-4165-4476-a528-fa238ffeeeb6.jpg" alt="product"><div></div></div>
I tried to get it in the following way:
ele = driver.find_elements_by_xpath("//div[@class='product-picture']/img")
print(ele)
Output:
<selenium.webdriver.remote.webelement.WebElement (session="d9fd08b93bd5dd83fe520826c1f6fd77", element="27ef8c33-624d-4166-9dc7-3a355c4dcc32")>
<selenium.webdriver.remote.webelement.WebElement (session="d9fd08b93bd5dd83fe520826c1f6fd77", element="a6d77107-fecf-4c84-a048-9b4bda39b9df")>
<selenium.webdriver.remote.webelement.WebElement (session="d9fd08b93bd5dd83fe520826c1f6fd77", element="1f62cb8b-df58-4f06-afe6-6c60cb572527")>
I want the image source address string of every <div class="product-picture"> element on the page. Is there a way to extract a string?
from selenium.webdriver.common.by import By

images = driver.find_elements(By.XPATH, "//div[@class='product-picture']/img")
for img in images:
    print(img.get_attribute("src"))
This will give you the expected output:
https://t3a.coupangcdn.com/thumbnails/remote/212x212ex/image/vendor_inventory/6ca9/2e097d911efc291473d0c47052cdc8f42d7b7b8f2a3ebbb0ccc974d76fe4.jpg
https://thumbnail11.coupangcdn.com/thumbnails/remote/212x212ex/image/retail/images/239519218793467-6edc7d92-4165-4476-a528-fa238ffeeeb6.jpg
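If you already have the markup (e.g. from driver.page_source), the same src extraction can be done offline with the stdlib html.parser. A sketch against a shortened version of the snippet above (the example.com URLs are placeholders):

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> nested inside a div.product-picture."""
    def __init__(self):
        super().__init__()
        self.div_stack = []  # True for div.product-picture, False for other divs
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self.div_stack.append(attrs.get("class") == "product-picture")
        elif tag == "img" and any(self.div_stack):
            self.srcs.append(attrs.get("src"))

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()

html = ('<div class="product-picture"><img src="https://example.com/a.jpg" alt="product"></div>'
        '<div class="product-picture"><img src="https://example.com/b.jpg" alt="product"></div>')
parser = ImgSrcCollector()
parser.feed(html)
print(parser.srcs)
# → ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```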
Try using the get_attribute('src') method to grab the src value. Note that find_elements returns a list, so call it on each element:
srcs = [img.get_attribute('src') for img in driver.find_elements_by_xpath("//div[@class='product-picture']/img")]
You are using deprecated syntax. Please see Python Selenium warning "DeprecationWarning: find_element_by_* commands are deprecated"
The optimal way of locating elements which are likely to be lazily loaded would be:
images = WebDriverWait(browser, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='product-picture']/img")))
for i in images:
    print(i.get_attribute('src'))
You will also need the following imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Selenium docs can be found at https://www.selenium.dev/documentation/
I am currently looping through all labels and extracting data from each page; however, I cannot extract the text shown below each category (i.e. Founded, Location, etc.). The text appears as a bare text node (shown in quotes in DevTools) above the <br> tags. Can anyone advise how to extract it?
Website -
https://labelsbase.net/knee-deep-in-sound
<div class="line-title-block">
  <div class="line-title-wrap">
    <span class="line-title-text">Founded</span>
  </div>
</div>
2003
<br>
<div class="line-title-block">
  <div class="line-title-wrap">
    <span class="line-title-text">Location</span>
  </div>
</div>
United Kingdom
<br>
I have tried using driver.find_elements_by_xpath & driver.execute_script but cannot find a solution.
Error Message -
Message: invalid selector: The result of the xpath expression "/html/body/div[3]/div/div[1]/div[2]/div/div[1]/text()[2]" is: [object Text]. It should be an element.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import pandas as pd
import time
import string
PATH = '/Applications/chromedriver'
driver = webdriver.Chrome(PATH)
wait = WebDriverWait(driver, 10)
links = []
url = 'https://labelsbase.net/knee-deep-in-sound'
driver.get(url)
time.sleep(5)
# -- Title
title = driver.find_element_by_class_name('label-name').text
print(title,'\n')
# -- Image
image = driver.find_element_by_tag_name('img')
src = image.get_attribute('src')
print(src,'\n')
# -- Founded
founded = driver.find_element_by_xpath("/html/body/div[3]/div/div[1]/div[2]/div/div[1]/text()[2]").text
print(founded,'\n')
driver.quit()
Can you try this:
founded = driver.find_element_by_xpath("//*[@*='block-content']").get_attribute("innerText")
You can take the XPath of the class="block-content" element.
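Once you have the block's innerText, pairing each label with the value that follows it is plain string work. A sketch, assuming the text interleaves labels and values line by line as in the snippet above:

```python
def pair_labels(inner_text, labels):
    """Map each known label to the non-empty line that follows it.
    Assumes the block's innerText interleaves labels and values line by line."""
    lines = [ln.strip() for ln in inner_text.splitlines() if ln.strip()]
    result = {}
    for i, line in enumerate(lines):
        if line in labels and i + 1 < len(lines):
            result[line] = lines[i + 1]
    return result

text = "Founded\n2003\nLocation\nUnited Kingdom"
print(pair_labels(text, {"Founded", "Location"}))
# → {'Founded': '2003', 'Location': 'United Kingdom'}
```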
I am trying to scrape the following record table in familysearch.org. I am using the Chrome webdriver with Python, using BeautifulSoup and Selenium.
Upon inspecting the page I am interested in, I wanted to scrape the following bit of HTML. Note this is only one element of a familysearch.org table that has 100 names.
<span role="cell" class="td " name="name" aria-label="Name"> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> <span><sr-cell-name name="Jame Junior " url="ZS" relationship="Principal" collection-name="Index"></sr-cell-name></span> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> </span>
Alternatively, the name also shows in this bit of HTML
<a class="name" href="/ark:ZS">Jame Junior </a>
From all of this, I only want to get the name "Jame Junior". I have tried using driver.find_elements_by_class_name("name"), but it prints nothing.
This is the code I used
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
from getpass import getpass
username = input("Enter Username: " )
password = input("Enter Password: ")
chrome_path= r"C:\Users...chromedriver_win32\chromedriver.exe"
driver= webdriver.Chrome(chrome_path)
driver.get("https://www.familysearch.org/search/record/results?q.birthLikeDate.from=1996&q.birthLikeDate.to=1996&f.collectionId=...")
usernamet = driver.find_element_by_id("userName")
usernamet.send_keys(username)
passwordt = driver.find_element_by_id("password")
passwordt.send_keys(password)
login = driver.find_element_by_id("login")
login.submit()
driver.get("https://www.familysearch.org/search/record/results?q.birthLikeDate.from=1996&q.birthLikeDate.to=1996&f.collectionId=.....")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "name")))
#for tag in driver.find_elements_by_class_name("name"):
# print(tag.get_attribute('innerHTML'))
for tag in soup.find_all("sr-cell-name"):
    print(tag["name"])
Try to access the sr-cell-name tag.
Selenium:
for tag in driver.find_elements_by_tag_name("sr-cell-name"):
    print(tag.get_attribute("name"))
BeautifulSoup:
for tag in soup.find_all("sr-cell-name"):
    print(tag["name"])
EDIT: You might need to wait for the element to fully appear on the page before parsing it. You can do this using the presence_of_element_located method:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("...")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "name")))
for tag in driver.find_elements_by_class_name("name"):
    print(tag.get_attribute('innerHTML'))
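If you can get hold of the rendered markup (e.g. via driver.page_source once the content is present), pulling the name attribute out of the sr-cell-name tags needs no browser at all. A sketch with the stdlib html.parser, run against the snippet from the question:

```python
from html.parser import HTMLParser

class NameCollector(HTMLParser):
    """Collect the name attribute from every <sr-cell-name> tag."""
    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "sr-cell-name":
            # The name attribute carries a trailing space in the source, so strip it
            self.names.append(dict(attrs).get("name", "").strip())

html = ('<span><sr-cell-name name="Jame Junior " url="ZS" '
        'relationship="Principal" collection-name="Index"></sr-cell-name></span>')
parser = NameCollector()
parser.feed(html)
print(parser.names)  # ['Jame Junior']
```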
I was looking to do something very similar and have semi-decent Python/Selenium scraping experience. Long story short, FamilySearch (and many other sites, I'm sure) uses some kind of technology (I'm not a JS or web guy) that involves a shadow host (shadow DOM). The tags are essentially invisible to BS or Selenium.
Solution: pyshadow
https://github.com/sukgu/pyshadow
You may also find this link helpful:
How to handle elements inside Shadow DOM from Selenium
I have now been able to successfully find elements I couldn't before, but am still not all the way where I'm trying to get. Good luck!
I want to retrieve from a website every div class='abcd' element using Selenium together with the waiter and XPATH helpers from the explicit package.
The source code is something like this:
<div class='abcd'>
<a> Something </a>
</div>
<div class='abcd'>
<a> Something else </a>
...
When I run the following code (Python) I get only 'Something' as a result. I'd like to iterate over every instance of the div class='abcd' appearing in the source code of the website.
from explicit import waiter, XPATH
from selenium import webdriver
driver = webdriver.Chrome(PATH)
result = waiter.find_element(driver, "//div[@class='abcd']/a", by=XPATH).text
Sorry if the explanation isn't too technical, I'm only starting with webscraping. Thanks
I've used it like this; you can also use this procedure if you like.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(PATH)
css_selector = "div.abcd"
results = WebDriverWait(driver, 10).until(expected_conditions.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector)))
for result in results:
    print(result.text)
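To see why a single-element lookup returned only 'Something', here is the same selection run offline with the stdlib ElementTree, which supports a small XPath subset: findall returns every match, while a single-element lookup stops at the first <a>. The <root> wrapper is added only to make the snippet well-formed XML:

```python
import xml.etree.ElementTree as ET

# The question's snippet, wrapped in a single root so ElementTree can parse it
html = ("<root><div class='abcd'><a> Something </a></div>"
        "<div class='abcd'><a> Something else </a></div></root>")
tree = ET.fromstring(html)

# find() mirrors find_element: only the first match
first = tree.find(".//div[@class='abcd']/a").text.strip()
print(first)  # Something

# findall() mirrors find_elements: every match
texts = [a.text.strip() for a in tree.findall(".//div[@class='abcd']/a")]
print(texts)  # ['Something', 'Something else']
```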