Extract links from an html page using xpath - python

I'm trying to extract the links from this HTML page:
<div class="listbox">
<div class="mainbox" onclick="www.abc.com">
I've tried using:
//div[@class="listbox"]/a/text()
//div/onclick/text()
but they return an empty list.

An XPath like this should work for you:
/div/div/@onclick
or, more precisely:
/div[@class="listbox"]/div[@class="mainbox"]/@onclick
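If you are reading the page with lxml in Python, a minimal sketch applying that XPath could look like this (the HTML string below is just a stand-in for the question's markup, not the real page):
from lxml import html

# Stand-in snippet mirroring the markup shown in the question
snippet = '<div class="listbox"><div class="mainbox" onclick="www.abc.com"></div></div>'
tree = html.fromstring(snippet)

# Selecting the attribute itself returns its value as a string
links = tree.xpath('//div[@class="listbox"]/div[@class="mainbox"]/@onclick')
print(links)  # ['www.abc.com']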

In your case you can obtain the link by using Selenium and the getAttribute method.
First find the element (or elements and then loop) that have the links inside their onclick attributes, then just get them via getAttribute:
Selenium + Java:
String link = driver.findElement(By.className("mainbox")).getAttribute("onclick");
Selenium + Python:
I'm no python guy, but it should work like this:
link = driver.find_element_by_class_name("mainbox").get_attribute("onclick")
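If there are several such boxes and you want all of their links, a looped variant might look like this (a sketch using the same pre-Selenium-4 API as above; driver is assumed to be an existing WebDriver instance):
# Collect every element with class "mainbox" and read its onclick attribute
elements = driver.find_elements_by_class_name("mainbox")
links = [e.get_attribute("onclick") for e in elements]
print(links)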

Related

Find all elements with href tag containing certain text with Selenium and Python

Let's say I have HTML code with three links:
Whatever
Whatever
Whatever
and I want to use Selenium to find all elements whose href attribute contains the string "hey" (in this case the first two links). How would I write Python Selenium code that accomplishes this?
This works:
all_href = driver.find_elements(By.XPATH, "//*[contains(@href, 'hey')]")
print(len(all_href))
This XPath will do the job:
"//a[contains(@href,'hey')]"
To use that with Selenium you will need a find_elements or findElements method, depending on the language binding you use with Selenium.
For Selenium in Python this will give you the list of all such elements:
all_hey_elements = driver.find_elements(By.XPATH, "//a[contains(@href, 'hey')]")
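To then pull the actual URLs out of those elements, you can loop over the list and read the href attribute (a small sketch; driver is assumed to already hold the page):
from selenium.webdriver.common.by import By

all_hey_elements = driver.find_elements(By.XPATH, "//a[contains(@href, 'hey')]")
for element in all_hey_elements:
    # href is read from each matching <a> element
    print(element.get_attribute("href"))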

How to extract the href attribute of an element using Selenium and Python

I want to scrape the URLs within the HTML of the 'Racing-Next to Go' section of www.tab.com.au.
Here is an excerpt of the HTML:
<a ng-href="/racing/2020-07-31/MACKAY/MAC/R/8" href="/racing/2020-07-31/MACKAY/MAC/R/8"><i ng-
All I want to scrape is the last bit of that HTML which is a link, so:
/racing/2020-07-31/MACKAY/MAC/R/8
I have tried to find the element by using xpath, but I can't get the URL I need.
My code:
driver = webdriver.Firefox(executable_path=r"C:\Users\Harrison Pollock\Downloads\Python\geckodriver-v0.27.0-win64\geckodriver.exe")
driver.get('https://www.tab.com.au/')
elements = driver.find_elements_by_xpath('/html/body/ui-view/main/div[1]/ui-view/version[2]/div/section/section/section/race-list/ul/li[1]/a')
for e in elements:
    print(e.text)
Probably you want to use get_attribute instead of .text. Documentation here.
elements = driver.find_elements_by_xpath('/html/body/ui-view/main/div[1]/ui-view/version[2]/div/section/section/section/race-list/ul/li[1]/a')
for e in elements:
    print(e.get_attribute("href"))
Yes, you can use the getAttribute(attributeLocator) function for your requirement:
selenium.getAttribute("//xpath@href")
Specify the XPath of the element whose attribute value you want to read.
The value /racing/2020-07-31/MACKAY/MAC/R/8 within the HTML is the value of the href attribute, not the innerText.
Solution
Instead of using the text attribute, you need to use get_attribute("href"), and the effective lines of code will be:
elements = driver.find_elements_by_xpath('/html/body/ui-view/main/div[1]/ui-view/version[2]/div/section/section/section/race-list/ul/li[1]/a')
for e in elements:
    print(e.get_attribute("href"))
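Note that get_attribute("href") usually returns the fully resolved URL (scheme and host included). If you only want the trailing path shown in the question, such as /racing/2020-07-31/MACKAY/MAC/R/8, one option is to strip it out with urllib.parse (a sketch built on the loop above):
from urllib.parse import urlparse

for e in elements:
    full_url = e.get_attribute("href")
    # Keep only the path component, e.g. /racing/2020-07-31/MACKAY/MAC/R/8
    print(urlparse(full_url).path)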

Having trouble finding certain <div> tags using CSS Selectors

I am trying to scrape information from a website using a CSS Selector in order to get a specific text element but have come across a problem. I try to search for my desired portion of the website but my program is telling me that it does not exist. My program returns an empty list.
I am using the requests and lxml libraries and am using CSS Selectors to do my HTML Scraping. I have Python 3.7. I try searching for the part of the website that I need with a selector and it is not appearing. I have also tried using XPath but that has failed as well. I have tried using the following selector:
div#showtimes
When I use this selector, I get the following result:
[<Element div at 0x3bf6f60>]
I get the expected result, which is the desired element. When I try to go one step further and access the element nested inside of the div#showtimes element (see below), I get an empty list.
div#showtimes div
I get the following result:
[]
Through inspection of the website's HTML, I know that there is a nested element within the div#showtimes element. This problem has occurred on other web pages as well. I am using the code below.
import requests
from lxml import html
from lxml.cssselect import CSSSelector
# Set URL
url = "http://www.fridleytheatres.com/location/7425/Paramount-7-Theatres-
Showtimes"
# Get HTML from page
page = requests.get(url)
data = html.fromstring(page.text)
# Set up CSSSelector
sel = CSSSelector('div#showtimes div')
# Apply Selector
results = sel(data)
print(results)
I expect the output to be a list containing a div element, but it is returning an empty list [].
If I understand the problem correctly, you're attempting to get a div element which is a child of div#showtimes. Try using div#showtimes > div.
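Applied to the code from the question, that could look like this (a sketch; only the selector changes):
import requests
from lxml import html
from lxml.cssselect import CSSSelector

url = "http://www.fridleytheatres.com/location/7425/Paramount-7-Theatres-Showtimes"
page = requests.get(url)
data = html.fromstring(page.text)

# '>' restricts the match to direct children of div#showtimes
sel = CSSSelector('div#showtimes > div')
print(sel(data))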

Selecting inner text when using find element by CSS selector (Python, Selenium) and creating a loop

I'm trying to get all the link text from a p tag with a specific class, then create a loop to find all other similar elements.
The value I want is in:
<div class='some other class'>
<p class='machine-name install-info-entry x-hidden-focus'> text i want
</p>
So far I am using this:
installations = browser.find_elements_by_css_selector('p.machine-name.install-info-entry.x-hidden-focus')
Any help is appreciated. Thanks.
You can just use .text
installations = browser.find_elements_by_css_selector('p.machine-name.install-info-entry.x-hidden-focus')
for installation in installations:
    print(installation.text)
Note that installations is a list of web elements, whereas installation is just a web element from the list.
UPDATE1:
If you want to extract the attribute from a web element, then you can follow this code:
print(installation.get_attribute("attribute name"))
You should pass your desired attribute name in get_attribute method.
You can read the innerHTML attribute to get the source of the element's content, or outerHTML to get the source including the element itself.
installation.get_attribute('innerHTML')
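For example, contrasting the two on the first matched element (a small sketch reusing the installations list from the question):
installation = installations[0]
# innerHTML: the markup inside the <p> element only
print(installation.get_attribute("innerHTML"))
# outerHTML: the markup including the <p> element itself
print(installation.get_attribute("outerHTML"))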
Hope this will be helpful.

Finding an element with a generated ID in Selenium?

HTML:
<g id="OpenLayers.Layer.Vector_101_vroot">
<image id="OpenLayers.Geometry.Point_259_status"..></image>
So the page generates the above, and the number section of the ID is different on each load.
How do I locate them, or even a group of them that match the pattern, using Selenium and Python?
Use Xpaths like below:
//g[contains(@id, 'OpenLayers.Layer.Vector')]
//image[contains(@id, 'OpenLayers.Geometry.Point')]
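With the Python bindings, those XPaths can be plugged straight into find_elements (a sketch, assuming a Selenium driver already has the page loaded):
# Every <g> whose generated id contains the stable prefix
vector_layers = driver.find_elements_by_xpath("//g[contains(@id, 'OpenLayers.Layer.Vector')]")
# Every <image> whose generated id contains the stable prefix
points = driver.find_elements_by_xpath("//image[contains(@id, 'OpenLayers.Geometry.Point')]")
print(len(vector_layers), len(points))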
Hope it helps!
According to this answer, you can use the CSS3 substring-matching attribute selector.
The following code clicks an element whose id attribute contains OpenLayers.Layer.Vector.
Python
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://localhost:1111/')
browser.find_element_by_css_selector('[id*="OpenLayers.Layer.Vector"]').click()
HTML (which is displayed in http://localhost:1111/)
<button id="OpenLayers.Layer.Vector_123" onclick="alert(1);return false">xxx</button>
No need for any pattern matching; you can use the Beautiful Soup module, with some easy documentation here.
For example, to get all tags with id="OpenLayers.Layer.Vector_101_vroot" use:
from bs4 import BeautifulSoup
soup = BeautifulSoup(<your_html_as_a_string>, "html.parser")
soup.find_all(id="OpenLayers.Layer.Vector_101_vroot")
