I'm trying to parse an HTML file. It contains many nested divs. I want to get all the child divs of a given div, but not its grandchildren or deeper descendants.
Here is a pattern:
<div class='main_div'>
<div class='child_1'>
<div class='grandchild_1'></div>
</div>
<div class='child_2'>
...
...
</div>
So the command I'm looking for would return 2 elements: the divs whose classes are 'child_1' and 'child_2'.
Is it possible?
I've tried to use main_div.find_elements_by_tag_name('div') but it returned all nested divs in the div.
Here is a way to find the direct div children of the div with class name "main_div":
driver.find_elements_by_xpath('//div[@class="main_div"]/div')
The key here is the single slash before the last div: it makes the search inside "main_div" non-recursive, so only direct div children are matched.
Or, with a CSS selector:
driver.find_elements_by_css_selector("div.main_div > div")
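The difference between a single step and a recursive search can be checked without a browser. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree (which supports a limited XPath subset) instead of Selenium, and the markup is a hypothetical, well-formed version of the pattern from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the pattern from the question.
doc = ET.fromstring("""
<body>
  <div class="main_div">
    <div class="child_1">
      <div class="grandchild_1"></div>
    </div>
    <div class="child_2"></div>
  </div>
</body>
""")

main_div = doc.find(".//div[@class='main_div']")

# "./div" is a single step, so only direct div children are matched.
children = [d.get("class") for d in main_div.findall("./div")]

# ".//div" searches every level below main_div, so grandchildren are included.
descendants = [d.get("class") for d in main_div.findall(".//div")]

print(children)
print(descendants)
```

The first list holds only the two children, while the second also contains the grandchild, which is the behaviour the question saw with find_elements_by_tag_name.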
The website I am trying to scrape has all of its content laid out under the same div class type: mock-div. Upon inspecting its HTML, the relevant content is only present under those div tags which also contain the figure tag. What should be the correct XPath?
I tried to see if the following would work
response.xpath("//figure~//").getall()
but it returns ValueError: XPath error: Invalid expression in //figure~// and rightly so.
<div class="mock-div">
<h2 class="mock-h2" id="id1"> hello world </h2>
<figure class="mock-fig"><img src="file.jpg" alt="filename">
<figcaption>file caption</figcaption> </figure>
<p>text1</p>
<p>text2</p>
</div>
...
<div class="mock-div">
<h2 class="mock-h2" id="id2"> footer </h2>
<p> end of the webpage </p>
From the HTML above, we want to extract the following information from every matching div tag:
<h2> tag: hello world
<p> tag: text1, text2
src value from img tag: file.jpg
alt value from img tag: filename
figcaption tag: file caption
Use the class as the xpath identifier.
for section in response.xpath('//div[@class="mock-div"]'):
    h2 = section.xpath('./h2/text()').get()
    p_s = section.xpath('./p/text()').getall()
    src = section.xpath('.//img/@src').get()
    alt = section.xpath('.//img/@alt').get()
    fig_caption = section.xpath('.//figcaption/text()').get()
Here is an example that uses the method you described, by grabbing the parent div of the figure elements. You would simply use the .. xpath selector to grab the parent of the figure element.
For example:
import scrapy
html = """
<div class="mock-div">
<h2 class="mock-h2" id="id1"> hello world </h2>
<figure class="mock-fig"><img src="file.jpg" alt="filename">
<figcaption>file caption</figcaption> </figure>
<p>text1</p>
<p>text2</p>
</div>
"""
html = scrapy.Selector(text=html)
for elem in html.xpath("//figure"):
    # Step up from each figure to its parent div.
    section = elem.xpath('./..')
    print({
        'h2': section.xpath('./h2/text()').get(),
        'p_s': section.xpath('./p/text()').getall(),
        'src': section.xpath('.//img/@src').get(),
        'alt': section.xpath('.//img/@alt').get(),
        'fig_caption': section.xpath('.//figcaption/text()').get()
    })
OUTPUT:
{'h2': ' hello world ', 'p_s': ['text1', 'text2'], 'src': 'file.jpg', 'alt': 'filename', 'fig_caption': 'file caption'}
This isn't a strategy that I would usually recommend though.
You could use XPath Axes for this. For example, this XPath will get all div tags that contain a figure tag:
response.xpath("//div[descendant::figure]").getall()
XPath Axes
An axis represents a relationship to the context (current) node, and
is used to locate nodes relative to that node on the tree.
descendant
Selects all descendants (children, grandchildren, etc.) of the current
node
See also:
XPath Axes
XPath Axes and their Shortcuts
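As a rough, browser-free illustration of selecting only the divs that contain a figure, here is a sketch using the standard library's xml.etree.ElementTree rather than Scrapy. ElementTree does not support the descendant axis, so the predicate [figure] (which checks direct children only) stands in for [descendant::figure]. The markup is the question's sample, with the img tag self-closed so that it parses as XML, and the surrounding body element is an assumption.

```python
import xml.etree.ElementTree as ET

# The question's sample, made well-formed (img is self-closed here).
page = ET.fromstring("""
<body>
  <div class="mock-div">
    <h2 class="mock-h2" id="id1"> hello world </h2>
    <figure class="mock-fig"><img src="file.jpg" alt="filename" />
      <figcaption>file caption</figcaption> </figure>
    <p>text1</p>
    <p>text2</p>
  </div>
  <div class="mock-div">
    <h2 class="mock-h2" id="id2"> footer </h2>
    <p> end of the webpage </p>
  </div>
</body>
""")

records = []
# "[figure]" keeps only the div elements that have a figure child.
for section in page.findall(".//div[figure]"):
    img = section.find(".//img")
    records.append({
        "h2": section.findtext("./h2").strip(),
        "p_s": [p.text for p in section.findall("./p")],
        "src": img.get("src"),
        "alt": img.get("alt"),
        "fig_caption": section.findtext(".//figcaption"),
    })

print(records)
```

Only the first div produces a record; the footer div is skipped because it has no figure.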
I have a web page something like:
<div class='product-pod-padding'>
<a class='header product-pod--ie-fix' href='link1'/>
<div> SKU#1</div>
</div>
<div class='product-pod-padding'>
<a class='header product-pod--ie-fix' href='link2'/>
<div> SKU#2</div>
</div>
<div class='product-pod-padding'>
<a class='header product-pod--ie-fix' href='link3'/>
<div> SKU#3</div>
</div>
When I loop through the products with the following code, I get the expected output:
products=driver.find_elements_by_xpath("//div[@class='product-pod-padding']")
for index, product in enumerate(products):
    print(product.text)
SKU#1
SKU#2
SKU#3
However, if I try to locate the href of each product, it will only return the first item's link:
products=driver.find_elements_by_xpath("//div[@class='product-pod-padding']")
for index, product in enumerate(products):
    print(index)
    print(product.text)
    url=product.find_element_by_xpath("//a[@class='header product-pod--ie-fix']").get_attribute('href')
    print(url)
SKU#1
link1
SKU#2
link1
SKU#3
link1
What should I do to get the corrected links?
This should make your code functional:
[...]
products=driver.find_elements_by_xpath("//div[@class='product-pod-padding']")
for index, product in enumerate(products):
    print(index)
    print(product.text)
    url=product.find_element_by_xpath(".//a[@class='header product-pod--ie-fix']").get_attribute('href')
    print(url)
[..]
The crux here is the dot at the start of the XPath, which restricts the search to the current element and its descendants.
You need to use a relative XPath in order to locate a node inside another node.
//a[@class='header product-pod--ie-fix'] always searches from the root of the document, so it returns the first match in the whole DOM regardless of which element you call it on.
You need to put a dot . at the front of the XPath locator:
".//a[@class='header product-pod--ie-fix']"
This will retrieve the desired element inside the parent element:
url=product.find_element_by_xpath(".//a[@class='header product-pod--ie-fix']").get_attribute('href')
So, your entire code could be as following:
products=driver.find_elements_by_xpath("//div[@class='product-pod-padding']")
for index, product in enumerate(products):
    print(index)
    url=product.find_element_by_xpath(".//a[@class='header product-pod--ie-fix']").get_attribute('href')
    print(url)
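The effect of the leading dot can be imitated without a browser. This is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree, which rejects absolute paths on elements, so searching from the document root stands in for what "//a[...]" does in Selenium. The body wrapper and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring("""
<body>
  <div class="product-pod-padding">
    <a class="header product-pod--ie-fix" href="link1" />
    <div>SKU#1</div>
  </div>
  <div class="product-pod-padding">
    <a class="header product-pod--ie-fix" href="link2" />
    <div>SKU#2</div>
  </div>
  <div class="product-pod-padding">
    <a class="header product-pod--ie-fix" href="link3" />
    <div>SKU#3</div>
  </div>
</body>
""")

LINK = ".//a[@class='header product-pod--ie-fix']"
products = page.findall(".//div[@class='product-pod-padding']")

# Searching from the document root on every iteration:
# the first link in the page is returned each time.
from_root = [page.find(LINK).get("href") for _ in products]

# Searching from each product element: the link inside that product.
from_product = [product.find(LINK).get("href") for product in products]

print(from_root)
print(from_product)
```

The first list repeats link1 three times, as in the question's output, while the second contains one link per product.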
My html code looks like:
<li>
<div class="level1">
<div id="li_hw2" class="toggle open"></div>
<ul style="" mw="220">
<li>
<div class ="level2">
...
</li>
</ul>
I am currently on the element with the id = "li_hw2", which was found by
level_1_elem = self.driver.find_element(By.ID, "li_hw2")
Now I want to go from level_1_elem to the element with class = "level2". Is it possible to go to the parent li and then to level2? Maybe with XPath?
Hint: It is necessary to go via the parent li and not directly to the level2 element with
self.driver.find_element(By.CLASS_NAME, "level2")
The best-suited locator for your use case is XPath, since you want to traverse upward as well as downward in the HTML DOM.
level_1_elem = self.driver.find_element(By.XPATH, "//div[@id='li_hw2']")
Then, using the level_1_elem web element, you can do the following
to go directly to the following sibling:
level_1_elem.find_element(By.XPATH, ".//following-sibling::ul/descendant::div[@class='level2']")
Are you sure about the HTML? I think the ul should group all the li elements. If that is the case, then it's easy; if not, I really don't understand that HTML.
//div[@class="level1"]/parent::li/parent::ul/li/div[@class="level2"]
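Because the posted HTML is incomplete, the following is only a sketch under an assumed structure: the li_hw2 div sits inside the level1 div, and the ul holding level2 is a sibling of level1 inside the same li. It uses the standard library's xml.etree.ElementTree instead of Selenium, and ".." plays the role of the parent axis.

```python
import xml.etree.ElementTree as ET

# Assumed, well-formed reconstruction of the menu markup.
menu = ET.fromstring("""
<ul>
  <li>
    <div class="level1">
      <div id="li_hw2" class="toggle open"></div>
    </div>
    <ul>
      <li>
        <div class="level2">target</div>
      </li>
    </ul>
  </li>
</ul>
""")

# Start at li_hw2, climb two levels to the enclosing li ("../.."),
# then descend through that li's ul to the level2 div.
path = ".//div[@id='li_hw2']/../../ul/li/div[@class='level2']"
level_2_elem = menu.find(path)

print(level_2_elem.get("class"), level_2_elem.text)
```

In Selenium, the same upward-then-downward idea could probably be written relative to level_1_elem as "./ancestor::li[1]//div[@class='level2']", again assuming this structure.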
I have the following HTML page. I want to get all the links inside a specific div. Here is my HTML code:
<div class="rec_view">
<a href='www.xyz.com/firstlink.html'>
<img src='imga.png'>
</a>
<a href='www.xyz.com/seclink.html'>
<img src='imgb.png'>
</a>
<a href='www.xyz.com/thrdlink.html'>
<img src='imgc.png'>
</a>
</div>
I want to get all the links that are present on the rec_view div. So those links that I want are,
www.xyz.com/firstlink.html
www.xyz.com/seclink.html
www.xyz.com/thrdlink.html
Here is the Python code which I tried with
from selenium import webdriver
webpage = r"https://www.testurl.com/page/123/"
driver = webdriver.Chrome(r"C:\chromedriver_win32\chromedriver.exe")
driver.get(webpage)
element = driver.find_element_by_css_selector("div[class='rec_view']>a")
link = element.get_attribute("href")
print(link)
How can I get those links using selenium on Python?
Based on the HTML you have shared, you can use the following code block to get the list of all the links present in the rec_view div:
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\chromedriver_win32\chromedriver.exe')
driver.get('https://www.testurl.com/page/123/')
elements = driver.find_elements_by_css_selector("div.rec_view a")
for element in elements:
    print(element.get_attribute("href"))
Note: Because you need to collect the href attributes of all the links in the div, use find_elements_* instead of find_element_*, which returns only the first match. Additionally, > matches only immediate <a> child nodes, whereas the descendant selector div.rec_view a matches <a> nodes at any depth inside the div.
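The first-match versus all-matches distinction can be sketched without a browser. The example below is an illustration only: it uses the standard library's xml.etree.ElementTree, where find and findall loosely mirror find_element_* and find_elements_*. The img tags are self-closed so the sample parses as XML, and the extra link outside rec_view is hypothetical, added to show that it is not collected.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring("""
<body>
  <div class="rec_view">
    <a href="www.xyz.com/firstlink.html"><img src="imga.png" /></a>
    <a href="www.xyz.com/seclink.html"><img src="imgb.png" /></a>
    <a href="www.xyz.com/thrdlink.html"><img src="imgc.png" /></a>
  </div>
  <a href="www.xyz.com/outside.html">not in rec_view</a>
</body>
""")

# find() stops at the first match, like find_element_*.
first = page.find(".//div[@class='rec_view']//a")

# findall() returns every match, like find_elements_*.
links = [a.get("href") for a in page.findall(".//div[@class='rec_view']//a")]

print(first.get("href"))
print(links)
```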
I have to crawl data with Scrapy like this:
<div class="data"
data-name="{"id":"566565", "name":"data1"}"
data-property="{"length":"444", "height":"678"}"
>
data1
</div>
<div class="data"
data-name="{"id":"566566", "name":"data2"}"
data-property="{"length":"555", "height":"777"}"
>
data2
</div>
I need data-name and data-property attributes. My selector is:
selections = Selector(response).xpath('//div[@class="data"]/attribute::data-property').extract()
How can I include data-name attribute in selections?
The following XPath should return data-property and data-name attributes :
//div[@class='data']/attribute::*[name()='data-property' or name()='data-name']
XPath Demo : http://www.xpathtester.com/xpath/e720602b62461f3600989be73eb15aec
If you need to return the two attributes as a pair in a certain format for each parent div, then this can't be done using pure XPath 1.0. Some python would be required, maybe using list comprehension (not tested) :
selections = [div.xpath('concat(@data-property, " ", @data-name)').extract()
              for div in Selector(response).xpath('//div[@class="data"]')]
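If the goal is one pair of values per div, an alternative to concat is to read both attributes from each element in Python. The sketch below is not Scrapy code: it uses the standard library's xml.etree.ElementTree and json, and it assumes that each attribute value is valid JSON and that the markup is well-formed (the attributes are single-quoted here so the embedded double quotes parse).

```python
import json
import xml.etree.ElementTree as ET

page = ET.fromstring("""
<body>
  <div class="data"
       data-name='{"id":"566565", "name":"data1"}'
       data-property='{"length":"444", "height":"678"}'>
    data1
  </div>
  <div class="data"
       data-name='{"id":"566566", "name":"data2"}'
       data-property='{"length":"555", "height":"777"}'>
    data2
  </div>
</body>
""")

# One (data-name, data-property) pair per div, each decoded from JSON.
pairs = [
    (json.loads(div.get("data-name")), json.loads(div.get("data-property")))
    for div in page.findall(".//div[@class='data']")
]

for name, prop in pairs:
    print(name["name"], prop["length"], prop["height"])
```

Keeping the two attributes as a tuple avoids having to split a concatenated string afterwards.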