Python Selenium - Extract text within <br>

I am currently looping through all labels and extracting data from each page, but I cannot extract the text shown below each category (i.e. Founded, Location, etc.). The text appears as a bare text node (shown within " " in devtools) above the <br> tags; can anyone advise how to extract it?
Website -
https://labelsbase.net/knee-deep-in-sound
<div class="line-title-block">
<div class="line-title-wrap">
<span class="line-title-text">Founded</span>
</div>
</div>
2003
<br>
<div class="line-title-block">
<div class="line-title-wrap">
<span class="line-title-text">Location</span>
</div>
</div>
United Kingdom
<br>
I have tried using driver.find_elements_by_xpath and driver.execute_script but cannot find a solution.
Error Message -
Message: invalid selector: The result of the xpath expression "/html/body/div[3]/div/div[1]/div[2]/div/div[1]/text()[2]" is: [object Text]. It should be an element.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import pandas as pd
import time
import string
PATH = '/Applications/chromedriver'
driver = webdriver.Chrome(PATH)
wait = WebDriverWait(driver, 10)
links = []
url = 'https://labelsbase.net/knee-deep-in-sound'
driver.get(url)
time.sleep(5)
# -- Title
title = driver.find_element_by_class_name('label-name').text
print(title,'\n')
# -- Image
image = driver.find_element_by_tag_name('img')
src = image.get_attribute('src')
print(src,'\n')
# -- Founded
founded = driver.find_element_by_xpath("/html/body/div[3]/div/div[1]/div[2]/div/div[1]/text()[2]").text
print(founded,'\n')
driver.quit()

Can you check with this:
founded = driver.find_element_by_xpath("//*[@class='block-content']").get_attribute("innerText")
You can take the XPath of the class="block-content" element.
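To split that innerText back into labeled values, one option is to walk the sibling text nodes with BeautifulSoup, which the question's script already imports. A minimal sketch against a copy of the HTML from the question (in the real script you would parse driver.page_source instead of this literal string):

```python
from bs4 import BeautifulSoup

# HTML structure copied from the question; in the real script you would
# build the soup from driver.page_source instead of this literal string.
html = """
<div class="block-content">
<div class="line-title-block">
<div class="line-title-wrap">
<span class="line-title-text">Founded</span>
</div>
</div>
2003
<br>
<div class="line-title-block">
<div class="line-title-wrap">
<span class="line-title-text">Location</span>
</div>
</div>
United Kingdom
<br>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
data = {}
for block in soup.select("div.line-title-block"):
    label = block.select_one("span.line-title-text").get_text(strip=True)
    # The value is the bare text node that follows the block, before the <br>
    value = block.next_sibling
    while value is not None and not str(value).strip():
        value = value.next_sibling
    data[label] = str(value).strip()

print(data)  # {'Founded': '2003', 'Location': 'United Kingdom'}
```

The same idea works on the live page via soup = BeautifulSoup(driver.page_source, "html.parser").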

Related

Get href link of a card which does not have anchor tag for it and link is generated on click

I have this website:
https://kw.com/agent/search/TX/Texas%20City/
There is no href/link present for the cards, but clicking a card opens a page containing that card's details. How can I get the link to the page each card redirects to?
For example, for the first card I expected to get the link:
https://kw.com/agent/UPA-6863980660422574080-3
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"
s = Service(chrome_path)
driver = webdriver.Chrome(service=s)
url = 'https://kw.com/agent/search/TX/Texas%20City/'
driver.get(url)
Card html
<div class="AgentCard">
<div class="AgentCard__main">
<div class="AgentCard__avatar Avatar" data-testid="avatar">
<div class="Avatar__container Avatar__container--highlight Avatar__container--null">
<div class="AvatarImage__bg" style="background-image: url("https://cflare.smarteragent.com/rest/Resizer?url=https%3A%2F%2Fstorage.googleapis.com%2Fattachment-prod-e2ad%2F808407%2Fc6604l887phbane09mh0.png&quality=0.8&webp=true&sig_id=69");"></div>
</div>
<div class="AgentCard__avatarOverlay"></div>
</div>
<div class="AgentCard__content">
<div class="AgentCard__name AgentCard__row">Jessica Agrella</div>
<div class="AgentCard__languages AgentCard__row">English</div>
<div class="AgentCard__location AgentCard__row">Katy, TX</div>
<div>
<div class="AgentCard__text AgentCard__row">License #694496</div>
<div class="AgentCard__text AgentCard__row">Keller Williams Signature</div>
</div>
</div>
</div>
</div>
The URL for each card is being generated on the fly by JavaScript (or at least I couldn't find it in the DOM). There is a way to deobfuscate every script running on the page and look for the logic generating said URL... but I refuse to do that, on principle. Instead I will use Selenium only. What follows is a hack (in the good sense of the word):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url='https://kw.com/agent/search/TX/Texas%20City'
counter = 0
browser.get(url)
t.sleep(3)
WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.CLASS_NAME, 'AgentCard__name.AgentCard__row')))
print(browser.current_url)
agent_cards = browser.find_elements(By.CLASS_NAME, 'AgentCard__name.AgentCard__row')
print(len(agent_cards))
for x in range(len(agent_cards)):
    current_card = agent_cards[counter]
    print(current_card.text)
    # scroll the card into view, then nudge the page back up so the
    # sticky header does not cover it
    browser.execute_script('arguments[0].scrollIntoView();', current_card)
    t.sleep(3)
    browser.execute_script('window.scrollBy(0, -100);')
    current_card.click()
    print(browser.current_url)
    browser.back()
    counter = counter + 1
    # the DOM is rebuilt after navigating back, so re-locate the cards
    WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.CLASS_NAME, 'AgentCard__name.AgentCard__row')))
    agent_cards = browser.find_elements(By.CLASS_NAME, 'AgentCard__name.AgentCard__row')
This returns:
https://kw.com/agent/search/TX/Texas%20City
48
Jessica Agrella
https://kw.com/agent/UPA-6863980660422574080-3
Rick Aguilar
https://kw.com/agent/UPA-6587385217807011841-2
[....]
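As an aside, the visible card fields (name, languages, location) can be read directly from the static markup shown in the question, even though the target URL cannot. A minimal BeautifulSoup sketch on a trimmed copy of the card HTML:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the AgentCard markup from the question; the agent URL
# itself is not present anywhere in this HTML, hence the click-based hack.
card_html = """
<div class="AgentCard">
  <div class="AgentCard__content">
    <div class="AgentCard__name AgentCard__row">Jessica Agrella</div>
    <div class="AgentCard__languages AgentCard__row">English</div>
    <div class="AgentCard__location AgentCard__row">Katy, TX</div>
  </div>
</div>
"""

soup = BeautifulSoup(card_html, "html.parser")
name = soup.select_one(".AgentCard__name").get_text(strip=True)
location = soup.select_one(".AgentCard__location").get_text(strip=True)
print(name, "-", location)  # Jessica Agrella - Katy, TX
```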

Use Python to Scrape for Data in Family Search Records

I am trying to scrape a record table on familysearch.org. I am using the Chrome webdriver with Python, together with BeautifulSoup and Selenium.
Upon inspecting the page I am interested in, I wanted to scrape the following bit of HTML. Note this is only one element of a familysearch.org table that has 100 names.
<span role="cell" class="td " name="name" aria-label="Name"> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> <span><sr-cell-name name="Jame Junior " url="ZS" relationship="Principal" collection-name="Index"></sr-cell-name></span> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> </span>
Alternatively, the name also shows in this bit of HTML
<a class="name" href="/ark:ZS">Jame Junior </a>
From all of this, I only want to get the name "Jame Junior". I have tried using driver.find_elements_by_class_name("name"), but it prints nothing.
This is the code I used
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
from getpass import getpass
username = input("Enter Username: " )
password = input("Enter Password: ")
chrome_path= r"C:\Users...chromedriver_win32\chromedriver.exe"
driver= webdriver.Chrome(chrome_path)
driver.get("https://www.familysearch.org/search/record/results?q.birthLikeDate.from=1996&q.birthLikeDate.to=1996&f.collectionId=...")
usernamet = driver.find_element_by_id("userName")
usernamet.send_keys(username)
passwordt = driver.find_element_by_id("password")
passwordt.send_keys(password)
login = driver.find_element_by_id("login")
login.submit()
driver.get("https://www.familysearch.org/search/record/results?q.birthLikeDate.from=1996&q.birthLikeDate.to=1996&f.collectionId=.....")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "name")))
#for tag in driver.find_elements_by_class_name("name"):
# print(tag.get_attribute('innerHTML'))
for tag in soup.find_all("sr-cell-name"):
    print(tag["name"])
Try to access the sr-cell-name tag.
Selenium:
for tag in driver.find_elements_by_tag_name("sr-cell-name"):
    print(tag.get_attribute("name"))
BeautifulSoup:
for tag in soup.find_all("sr-cell-name"):
    print(tag["name"])
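Applied to the static snippet from the question (trimmed of the dom-if templates), the BeautifulSoup route behaves as expected; on the live page, the soup would be built from driver.page_source after the wait:

```python
from bs4 import BeautifulSoup

# Static snippet from the question, with the <dom-if> templates removed;
# custom elements like <sr-cell-name> parse like any other tag.
html = '''<span role="cell" class="td " name="name" aria-label="Name">
<span><sr-cell-name name="Jame Junior " url="ZS" relationship="Principal"
collection-name="Index"></sr-cell-name></span>
</span>'''

soup = BeautifulSoup(html, "html.parser")
names = [tag["name"].strip() for tag in soup.find_all("sr-cell-name")]
print(names)  # ['Jame Junior']
```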
EDIT: You might need to wait for the element to fully appear on the page before parsing it. You can do this using the presence_of_element_located method:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("...")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "name")))
for tag in driver.find_elements_by_class_name("name"):
    print(tag.get_attribute('innerHTML'))
I was looking to do something very similar and have semi-decent Python/Selenium scraping experience. Long story short, FamilySearch (and many other sites, I'm sure) uses some kind of technology (I'm not a JS or web guy) that involves a shadow host. The tags are essentially invisible to BeautifulSoup or Selenium.
Solution: pyshadow
https://github.com/sukgu/pyshadow
You may also find this link helpful:
How to handle elements inside Shadow DOM from Selenium
I have now been able to successfully find elements I couldn't before, but am still not all the way where I'm trying to get. Good luck!

How to get imdb episode ids using python web crawler with selenium in pycharm

Hi, I am using Python Selenium in PyCharm to scrape IMDb data for episode IDs. With the code below, printing main.text retrieves data such as episode name, description, and rating for the complete season. But when I run the program with print(newData.text) I get the error AttributeError: 'list' object has no attribute 'text', and when I run it without .text, i.e. print(newData), I get [] as the result.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1")
try:
    main = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "episodes_content"))
    )
    links = main.find_elements_by_tag_name("image")
    for image in links:
        newData = image.find_elements_by_tag_name("data_const")
        print(main.text)
finally:
    driver.quit()
Here is the inspected element I am using to check for the tag that has the link:
<div class="image">
<a href="/title/tt1478775/?ref_=ttep_ep3" title="Friday Night Bites" itemprop="url"> <div data-const="tt1478775" class="hover-over-image zero-z-index ">
<img class="zero-z-index" alt="Friday Night Bites" src="https://m.media-amazon.com/images/M/MV5BMTg1OTc3ODEyOV5BMl5BanBnXkFtZTcwMDE2MDY4Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224" height="126">
<div>S1, Ep3</div>
</div>
I have tried all of the available locator types to get the value of data-const, but none of them work.
Here is a list of locators I have tried:
--image.find_elements_by_tag_name("data_const")
--image.find_elements_by_class_name("data_const")
--image.find_elements_by_name("data_const")
--image.find_elements_by_id("data_const")
--image.find_elements_by_link_text("data_const")
--image.find_elements_by_css_selector("data_const")
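For what it's worth, data-const is an attribute on the div, not a tag name or class, so none of the locators above can match it; a CSS attribute selector can. A sketch on a well-formed copy of the snippet (closing tags added):

```python
from bs4 import BeautifulSoup

# Copy of the snippet from the question with the closing tags added.
# data-const is an attribute, so it is matched with [data-const], not a
# tag, class, or id locator.
html = '''<div class="image">
<a href="/title/tt1478775/?ref_=ttep_ep3" title="Friday Night Bites" itemprop="url">
<div data-const="tt1478775" class="hover-over-image zero-z-index ">
<div>S1, Ep3</div>
</div>
</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")
episode_ids = [div["data-const"] for div in soup.select("div[data-const]")]
print(episode_ids)  # ['tt1478775']
```

In Selenium, the equivalent would be image.find_elements(By.CSS_SELECTOR, "div[data-const]") and then .get_attribute("data-const") on each match.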

selenium.common.exceptions.NoSuchElementException: Message: {"errorMessage":"Unable to find element with id 'search-facet-city'"

I am trying to scrape the following website using Python 3, Selenium, and PhantomJS:
https://health.usnews.com/best-hospitals/search
I need to locate a search field and enter text into it, and then press enter to generate the search results. Below is the HTML that corresponds to the search field I am trying to locate:
<div class="search-field-view">
<div class="block-tight">
<label class="" for="search-facet-city">
<input id="search-facet-city" autocomplete="off" name="city"
type="text" data-field-type="text" placeholder="City, State or ZIP"
value="" />
</label>
</div>
</div>
Below is my Python 3 code that attempts to locate this search field using the id "search-facet-city."
def scrape(self):
    url = 'https://health.usnews.com/best-hospitals/search'
    location = 'Massachusetts'
    # Instantiate the driver
    driver = webdriver.PhantomJS()
    driver.get(url)
    driver.maximize_window()
    driver.implicitly_wait(10)
    elem = driver.find_element_by_id("search-facet-city")
    elem.send_keys(self.location)
    driver.close()
I need to scrape some results from the page once the text is entered into the search field. However, I keep getting a NoSuchElementException error; it is not able to locate the search box element despite the fact that it exists. How can I fix this?
I tried this with Chrome:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
url = 'https://health.usnews.com/best-hospitals/search'
location = 'Massachusetts'
# Instantiate the driver
driver = webdriver.Chrome(executable_path=r'/pathTo/chromedriver')
#driver = webdriver.PhantomJS(executable_path=r'/pathTo/phantomjs')
driver.get(url)
driver.maximize_window()
wait = WebDriverWait(driver, 20)
driver.save_screenshot('out.png')
elem = wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='search-field-view']")))
span = elem.find_element_by_xpath("//span[@class='twitter-typeahead']")
input = span.find_element_by_xpath("//input[@class='tt-input' and @name='city']")
input.send_keys(location)
driver.save_screenshot('out2.png')
and it works.
But if I try with PhantomJS, the screenshot saved by driver.save_screenshot('out.png') shows that the page never loads.
As @JonhGordon said in the comments, the website makes some checks. If you want to use PhantomJS, you could try changing the desired_capabilities or the service_args.
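A sketch of what the desired_capabilities route could look like; the capability key is the standard PhantomJS page-settings prefix, and the user-agent string is an arbitrary example, not something taken from the question:

```python
# Hypothetical example: override the user agent PhantomJS reports, since
# headless defaults are a common trigger for site checks. The UA string
# below is an arbitrary desktop Chrome value.
caps = {
    "phantomjs.page.settings.userAgent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/60.0 Safari/537.36"
    )
}
# driver = webdriver.PhantomJS(desired_capabilities=caps,
#                              service_args=["--ignore-ssl-errors=true"])
```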

Selenium - How to jump from one sibling to another

I am using Selenium-Python to Scrape the content at this link.
http://targetstudy.com/school/62292/universal-academy/
HTML code is like this,
<tr>
<td>
<i class="fa fa-mobile">
::before
</i>
</td>
<td>8349992220, 8349992221</td>
</tr>
I am not sure how to get the mobile numbers using the class="fa fa-mobile".
Could someone please help? Thanks.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.action_chains import ActionChains
import lxml.html
from selenium.common.exceptions import NoSuchElementException
path_to_chromedriver = 'chromedriver.exe'
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
browser.get('http://targetstudy.com/school/62292/universal-academy/')
stuff = browser.page_source.encode('ascii', 'ignore')
tree = lxml.html.fromstring(stuff)
address1 = tree.xpath('//td/i[@class="fa fa-mobile"]/parent/following-sibling/following-sibling::text()')
print address1
You don't need lxml.html for this. Selenium is very powerful in Locating Elements.
Pass the //i[@class="fa fa-mobile"]/../following-sibling::td XPath expression to find_element_by_xpath():
>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get('http://targetstudy.com/school/62292/universal-academy/')
>>> browser.find_element_by_xpath('//i[@class="fa fa-mobile"]/../following-sibling::td').text
u'83499*****, 83499*****'
Note, added * for not showing the real numbers here.
Here, the XPath first locates the i tag with the fa fa-mobile class, then goes up to its parent td and gets the following td sibling element.
Hope that helps.
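Since the question already builds an lxml tree, the corrected expression can also be checked offline against the <tr> snippet; a small sketch (the table wrapper is added so the fragment parses):

```python
import lxml.html

# The <tr> snippet from the question, wrapped in a table so it parses;
# the same XPath works on the full page source, or directly in Selenium.
html = """
<table><tr>
<td><i class="fa fa-mobile"></i></td>
<td>8349992220, 8349992221</td>
</tr></table>
"""

tree = lxml.html.fromstring(html)
# From the <i>, step up to its parent <td>, then to the sibling <td>.
numbers = tree.xpath('//i[@class="fa fa-mobile"]/../following-sibling::td/text()')
print(numbers)  # ['8349992220, 8349992221']
```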
