Selenium - How to jump from one sibling to other - python

I am using Selenium with Python to scrape the content at this link:
http://targetstudy.com/school/62292/universal-academy/
The HTML code is like this:
<tr>
<td>
<i class="fa fa-mobile">
::before
</i>
</td>
<td>8349992220, 8349992221</td>
</tr>
I am not sure how to get the mobile numbers using the class="fa fa-mobile".
Could someone please help. Thanks.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.action_chains import ActionChains
import lxml.html
from selenium.common.exceptions import NoSuchElementException
path_to_chromedriver = 'chromedriver.exe'
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
browser.get('http://targetstudy.com/school/62292/universal-academy/')
stuff = browser.page_source.encode('ascii', 'ignore')
tree = lxml.html.fromstring(stuff)
address1 = tree.xpath('//td/i[@class="fa fa-mobile"]/parent/following-sibling/following-sibling::text()')
print(address1)

You don't need lxml.html for this. Selenium is very powerful at locating elements.
Pass the //i[@class="fa fa-mobile"]/../following-sibling::td XPath expression to find_element_by_xpath():
>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('http://targetstudy.com/school/62292/universal-academy/')
>>> browser.find_element_by_xpath('//i[@class="fa fa-mobile"]/../following-sibling::td').text
u'83499*****, 83499*****'
Note: I added * so as not to show the real numbers here.
Here the XPath first locates the i tag with the fa fa-mobile class, then goes up to its parent td and takes the following td sibling element.
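On newer Selenium releases (4.3+), where the find_element_by_* helpers were removed, a minimal equivalent sketch with the same XPath:
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('http://targetstudy.com/school/62292/universal-academy/')
# same XPath: from the <i> icon up to its parent <td>, then the next <td>
phones = browser.find_element(By.XPATH, '//i[@class="fa fa-mobile"]/../following-sibling::td').text
print(phones)
browser.quit()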
Hope that helps.

Related

Find specific link on a website with Selenium (in Python)

I'm trying to scrape specific links on a website. I'm using Python and Selenium 4.8.
The HTML code looks like this with multiple lists, each containing a link:
<li>
<div class="programme programme xxx" >
<div class="programme_body">
<h4 class="programme titles">
<a class="br-blocklink__link" href="https://www.example_link1.com">
</a>
</h4>
</div>
</div>
</li>
<li>...</li>
<li>...</li>
So each <li> contains a link.
Ideally, I would like a python list with all the hrefs which I can then iterate through to get additional output.
Thank you for your help!
You can try something like below (untested, as you didn't confirm the url):
[...]
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
[...]
wait = WebDriverWait(driver, 25)
[...]
wanted_elements = [x.get_attribute('href') for x in wait.until(
    EC.presence_of_all_elements_located((By.XPATH,
        '//li//h4[@class="programme titles"]/a[@class="br-blocklink__link"]')))]
The Selenium documentation covers these explicit waits in more detail.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://www.example.com")
# find_elements_by_* was removed in Selenium 4; use find_elements(By.XPATH, ...)
lis = driver.find_elements(By.XPATH, '//li//a[@class="br-blocklink__link"]')
hrefs = []
for li in lis:
    hrefs.append(li.get_attribute('href'))
driver.quit()
This will give you a list hrefs with all the hrefs from the website. You can then iterate through this list and use the links for further processing.
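As a hypothetical sketch of that further processing (run it before driver.quit() so the session is still open):
for href in hrefs:
    driver.get(href)     # visit each collected link
    print(driver.title)  # e.g. read the page title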

Python Selenium - Extract text within <br>

I am currently looping through all labels and extracting data from each page; however, I cannot extract the text shown below each category (i.e. Founded, Location, etc.). The text appears as a bare text node above each <br> tag; can anyone advise how to extract it, please?
Website -
https://labelsbase.net/knee-deep-in-sound
<div class="line-title-block">
<div class="line-title-wrap">
<span class="line-title-text">Founded</span>
</div>
</div>
2003
<br>
<div class="line-title-block">
<div class="line-title-wrap">
<span class="line-title-text">Location</span>
</div>
</div>
United Kingdom
<br>
I have tried using driver.find_elements_by_xpath & driver.execute_script but cannot find a solution.
Error Message -
Message: invalid selector: The result of the xpath expression "/html/body/div[3]/div/div[1]/div[2]/div/div[1]/text()[2]" is: [object Text]. It should be an element.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import pandas as pd
import time
import string
PATH = '/Applications/chromedriver'
driver = webdriver.Chrome(PATH)
wait = WebDriverWait(driver, 10)
links = []
url = 'https://labelsbase.net/knee-deep-in-sound'
driver.get(url)
time.sleep(5)
# -- Title
title = driver.find_element_by_class_name('label-name').text
print(title,'\n')
# -- Image
image = driver.find_element_by_tag_name('img')
src = image.get_attribute('src')
print(src,'\n')
# -- Founded
founded = driver.find_element_by_xpath("/html/body/div[3]/div/div[1]/div[2]/div/div[1]/text()[2]").text
print(founded,'\n')
driver.quit()
Can you check with this:
founded = driver.find_element_by_xpath("//*[@*='block-content']").get_attribute("innerText")
You can take the XPath of the class="block-content" element.
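As the error message in the question says, an XPath text() step returns a text node, which Selenium cannot return as an element. A hedged workaround sketch (untested against the live page, reusing the question's driver) that pulls the text nodes out with JavaScript instead:
values = driver.execute_script("""
    const blocks = document.querySelectorAll('.line-title-block');
    return Array.from(blocks).map(block => {
        // the label, e.g. "Founded" or "Location"
        const label = block.querySelector('.line-title-text').textContent.trim();
        // walk forward to the first non-empty text node after the block
        let node = block.nextSibling;
        while (node && !(node.nodeType === Node.TEXT_NODE && node.textContent.trim())) {
            node = node.nextSibling;
        }
        return [label, node ? node.textContent.trim() : null];
    });
""")
print(dict(values))  # e.g. {'Founded': '2003', 'Location': 'United Kingdom'}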

How to handle drop down list with div tag and multiple attributes?

I am trying to automate the location selection process; however, I am struggling with it.
So far, I can only open the menu and select the first item.
And my code is:
import bs4
from bs4 import BeautifulSoup
import selenium
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(ChromeDriverManager().install())
url = 'https://www.ebay.com/b/Food-Beverages/14308/bn_1642947?listingOnly=1&rt=nc'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
button = driver.find_element_by_id('gh-shipto-click')  # find the location button
button.click()
button2 = driver.find_element_by_id('gh-shipto-click-body-cnt')  # open the menu
button2.click()
driver.find_element(By.XPATH, "//div[@role='menuitemradio']").click()  # choose the first location
I believe the attribute "data-makeup-index" (shown in the screenshot) would help, but I don't know how to use it.
Since some of you may not be able to find the "Ship to" button, here is the HTML code I copied from the site.
<li id="gh-shipto-click" class="gh-eb-li">
<div class="gh-menu">
<button _sp="m570.l46241" title="Ship to" class="gh-eb-li-a gh-icon" aria-expanded="false" aria-controls="gh-shipto-click-o" aria-label="Ship to Afghanistan"><i class="flgspr gh-flag-bg flaaf"></i><span>Ship to</span></button>
<i class="flgspr gh-flag-bg flaaf"></i>
I have found my answer, as below:
driver.find_element(By.XPATH, "//div[@role='menuitemradio'][{an integer}]").click()
Although the "Ship to" menu was not visible for my location, I inspected it from a different geo location, and I found that this element had style="display: none;".
I removed the none temporarily, and it got displayed.
<li id="gh-shipto-click" class="gh-eb-li gh-modal-active" style="display:none;" aria-hidden="false">
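A hedged sketch of that approach with an explicit wait (the menu ids come from the HTML above; picking the "United States" label is an assumption for illustration):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.ebay.com/b/Food-Beverages/14308/bn_1642947?listingOnly=1&rt=nc')
wait = WebDriverWait(driver, 10)

# open the "Ship to" menu, then wait for the radio items to render
wait.until(EC.element_to_be_clickable((By.ID, 'gh-shipto-click'))).click()
items = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//div[@role='menuitemradio']")))
# picking by visible label is less brittle than a hard-coded index
next(i for i in items if 'United States' in i.text).click()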
You could find the element by these XPaths and handle the dropdown:
# This XPath takes you to the specific div, where you need to provide the element index [1][2][3]; the id in this XPath is dynamic
driver.find_element_by_xpath("(//*[contains(@id,'nid-')])[2]//child::div/div")
driver.find_element_by_xpath("//*[@class='cn']")
I know you got your solution, but this is what I've tried; it might help.

Use Python to Scrape for Data in Family Search Records

I am trying to scrape the following record table in familysearch.org. I am using the Chrome webdriver with Python, using BeautifulSoup and Selenium.
Upon inspecting the page I am interested in, I wanted to scrape the following bit of HTML. Note this is only one element of a familysearch.org table that has 100 names.
<span role="cell" class="td " name="name" aria-label="Name"> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> <span><sr-cell-name name="Jame Junior " url="ZS" relationship="Principal" collection-name="Index"></sr-cell-name></span> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> </span>
Alternatively, the name also shows in this bit of HTML
<a class="name" href="/ark:ZS">Jame Junior </a>
From all of this, I only want to get the name "Jame Junior". I have tried using driver.find_elements_by_class_name("name"), but it prints nothing.
This is the code I used
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
from getpass import getpass
username = input("Enter Username: " )
password = input("Enter Password: ")
chrome_path= r"C:\Users...chromedriver_win32\chromedriver.exe"
driver= webdriver.Chrome(chrome_path)
driver.get("https://www.familysearch.org/search/record/results?q.birthLikeDate.from=1996&q.birthLikeDate.to=1996&f.collectionId=...")
usernamet = driver.find_element_by_id("userName")
usernamet.send_keys(username)
passwordt = driver.find_element_by_id("password")
passwordt.send_keys(password)
login = driver.find_element_by_id("login")
login.submit()
driver.get("https://www.familysearch.org/search/record/results?q.birthLikeDate.from=1996&q.birthLikeDate.to=1996&f.collectionId=.....")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "name")))
#for tag in driver.find_elements_by_class_name("name"):
# print(tag.get_attribute('innerHTML'))
soup = BeautifulSoup(driver.page_source, 'html.parser')
for tag in soup.find_all("sr-cell-name"):
    print(tag["name"])
Try to access the sr-cell-name tag.
Selenium:
for tag in driver.find_elements_by_tag_name("sr-cell-name"):
    print(tag.get_attribute("name"))
BeautifulSoup:
for tag in soup.find_all("sr-cell-name"):
    print(tag["name"])
EDIT: You might need to wait for the element to fully appear on the page before parsing it. You can do this using the presence_of_element_located method:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("...")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "name")))
for tag in driver.find_elements_by_class_name("name"):
    print(tag.get_attribute('innerHTML'))
I was looking to do something very similar and have semi-decent Python/Selenium scraping experience. Long story short, FamilySearch (and many other sites, I'm sure) uses some kind of technology (I'm not a JS or web guy) that involves a shadow DOM. The tags are essentially invisible to BeautifulSoup or Selenium.
Solution: pyshadow
https://github.com/sukgu/pyshadow
You may also find this link helpful:
How to handle elements inside Shadow DOM from Selenium
I have now been able to successfully find elements I couldn't before, but am still not all the way where I'm trying to get. Good luck!
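For reference, a minimal pyshadow sketch, assuming the sr-cell-name elements sit behind a shadow root (the elided URL mirrors the question):
from selenium import webdriver
from pyshadow.main import Shadow

driver = webdriver.Chrome()
driver.get("https://www.familysearch.org/search/record/results?...")
# Shadow(driver) searches across shadow-root boundaries that plain Selenium misses
shadow = Shadow(driver)
for tag in shadow.find_elements("sr-cell-name"):  # takes a CSS selector
    print(tag.get_attribute("name"))
driver.quit()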

Wait for every element in Selenium

I want to retrieve from a website every div class='abcd' element using Selenium together with the waiter and XPATH helpers from the explicit package.
The source code is something like this:
<div class='abcd'>
<a> Something </a>
</div>
<div class='abcd'>
<a> Something else </a>
...
When I run the following code (Python) I get only 'Something' as a result. I'd like to iterate over every instance of the div class='abcd' appearing in the source code of the website.
from explicit import waiter, XPATH
from selenium import webdriver
driver = webdriver.Chrome(PATH)
result = waiter.find_element(driver, "//div[@class='abcd']/a", by=XPATH).text
Sorry if the explanation isn't too technical, I'm only starting with webscraping. Thanks
I've used it like this; you can follow the same procedure if you like:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(PATH)
css_selector = "div.abcd"
results = WebDriverWait(driver, 10).until(
    expected_conditions.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector)))
for result in results:
    print(result.text)
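An equivalent sketch without the wait machinery, once the page has loaded (Selenium 4 syntax; the URL is a placeholder):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com")
# find_elements returns every match, so all div.abcd anchors are iterated
for element in driver.find_elements(By.XPATH, "//div[@class='abcd']/a"):
    print(element.text)
driver.quit()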
