I'm somewhat inexperienced at scraping websites with lots of sub-elements, and I'm trying to understand the best way to loop through elements that have the data you want buried in further levels of sub-elements.
Here is an example of the HTML:
<div class="s-item__info clearfix">
<h3 class="s-item__title">The Music Tree Activities Book: Part 1 (Music Tree (Summy)) by Clark, Frances, </h3>
</a>
<div class="s-item__subtitle"><span class="SECONDARY_INFO">Pre-Owned</span></div>
<div class="s-item__reviews">
</div>
<div class="s-item__details clearfix">
<div class="s-item__detail s-item__detail--primary"><span class="s-item__price">$3.99</span></div>
<span class="s-item__detail s-item__detail--secondary">
</span>
<div class="s-item__detail s-item__detail--primary"><span class="s-item__purchase-options-with-icon" aria-label="">Buy It Now</span></div>
<div class="s-item__detail s-item__detail--primary"><span class="s-item__shipping s-item__logisticsCost">Free shipping</span></div>
<div class="s-item__detail s-item__detail--primary"><span class="s-item__free-returns s-item__freeReturnsNoFee">Free returns</span></div>
<div class="s-item__detail s-item__detail--primary"></div>
</div>
</div>
There are multiple items, so I started by collecting them all in a list. I can find each title by iterating through the list, but I'm having an issue getting the price. Example code:
for item in driver.find_elements_by_class_name("s-item__info"):
    title = item.find_element_by_xpath('.//h3')
    print(title.text)
    details = item.find_element_by_xpath('.//span[@class="s-item__price"]')
    print(details.text)
This gets the title of the item, but it can't find the price. If I look outside of the "s-item__info" element and just use the driver, I can get all the prices with the code below. I'm wondering why it can't find the price inside the info element; I would think the details would be a sub-element, and .// would search through those.
driver.find_elements_by_class_name("s-item__price")
Have also tried
find_element_by_xpath('.//div[@class="s-item__detail"]//span[@class="s-item__price"]')
I can grab the data I need, but I want to understand why I can't get the price when I try to iterate through each item. Thanks
See if this works:
for item in driver.find_elements_by_class_name("s-item__info"):
    title = item.find_element_by_xpath('.//h3')
    print(title.text)
    details = item.find_element_by_xpath(".//following::div[contains(@class,'s-item__details')]//span[@class='s-item__price']")
    print(details.text)
OK, there are several problems here:
s-item__info is not the only class name on that element, so you should use
//div[contains(@class,'s-item__info')] instead.
The first element matching this class name is not a valid search result.
The simplest approach to make your code work would be:
for item in driver.find_elements_by_xpath("//div[contains(@class,'s-item__info')]"):
    title = item.find_elements_by_xpath('.//h3')
    if title:
        print(title[0].text)
    details = item.find_elements_by_xpath('.//span[@class="s-item__price"]')
    if details:
        print(details[0].text)
This will print the data where it exists and skip any missing fields; using find_elements (plural) avoids a NoSuchElementException on items that have no price.
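To see why the guarded lookup matters, here's a minimal offline sketch using only the standard library. It parses a trimmed, well-formed stand-in for the sample HTML (the snippet and the "template card" item are assumptions for illustration, not the real eBay markup) and shows that a relative .//span[@class='s-item__price'] search from each item finds the price when present and returns nothing otherwise:

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the eBay snippet from the question;
# the first item mimics a hidden template card that has no price.
html = """
<root>
  <div class="s-item__info clearfix">
    <h3 class="s-item__title">Template card (no price)</h3>
  </div>
  <div class="s-item__info clearfix">
    <h3 class="s-item__title">The Music Tree Activities Book</h3>
    <div class="s-item__details clearfix">
      <div class="s-item__detail s-item__detail--primary">
        <span class="s-item__price">$3.99</span>
      </div>
    </div>
  </div>
</root>
"""

root = ET.fromstring(html)
results = []
for item in root.findall(".//div[@class='s-item__info clearfix']"):
    title = item.find(".//h3")
    # Relative search from the item element, like './/span[...]' in Selenium
    price = item.find(".//span[@class='s-item__price']")
    # Guard against items (e.g. the template card) that lack a price
    results.append((title.text, price.text if price is not None else None))

print(results)
```

The same pattern carries over to Selenium: search relative to each item, and treat a missing sub-element as expected rather than fatal.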
Related
Given this html sample:
<div class="measure-tab"> --- I want to select this one
<span class="color_title">someText</span>
</div>
<div class="measure-tab"> --- I don't want to select this one
<span class="color_title">someText</span>
<div>
<span class="qwery">jokerText</span>
</div>
</div>
<div class="measure-tab"> --- I want to select this one
<span class="color_title">someText</span>
</div>
I want to select the div with @class='measure-tab' which has under it a span with a specific class and text = 'someText', but does not contain a nested span with a specific class and text = 'jokerText', all in one XPath.
What I've tried is:
//div[contains(@class, 'measure-tab') and //span[@class="color_title" and (contains(text(),'someText')) and //span[@class="color_title" and not(contains(text(),'jokerText'))]]
But this doesn't seem to work.
I also used This post as inspiration.
EDIT : Corrected bad description of what is the goal for this question
EDIT, made a new solution:
//div[contains(@class, 'measure-tab') and //span[contains(@class, 'color_title') and //span[not(contains(@class, 'qwery'))]]]
But this returns all the divs, instead of excluding the one marked --- I don't want to select this one:
<span class="color_title">someText</span>
<div>
<span class="qwery">jokerText</span>
</div>
I feel so close but yet so far, haha. It doesn't make sense to me why it matches <span class="qwery">jokerText</span> when I wrote not(contains(...)) there.
I believe this is what you are looking for:
MyDivs = driver.find_elements_by_xpath("//div[@class='measure-tab' and not(descendant::*[text() = 'jokerText' and @class = 'qwery'])]")
This will select all the div tags which do not have jokerText anywhere inside them.
You can query with not(following-sibling::div/span...).
Try with the following xpath:
//span[@class='color_title' and not(following-sibling::div/span[@class='qwery' and text()='jokerText'])]/parent::div[@class='measure-tab']
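The not(descendant::...) filter can be sanity-checked offline. Here's a minimal stdlib sketch of the same logic: since xml.etree.ElementTree's XPath subset has no not(), it selects the measure-tab divs first and then applies the "no jokerText descendant" condition in Python (the markup is a compacted copy of the question's sample):

```python
import xml.etree.ElementTree as ET

# Compacted copy of the question's sample, wrapped in a single root
html = """
<root>
  <div class="measure-tab"><span class="color_title">someText</span></div>
  <div class="measure-tab">
    <span class="color_title">someText</span>
    <div><span class="qwery">jokerText</span></div>
  </div>
  <div class="measure-tab"><span class="color_title">someText</span></div>
</root>
"""

root = ET.fromstring(html)
# Python-side equivalent of:
#   //div[@class='measure-tab' and
#         not(descendant::span[@class='qwery' and text()='jokerText'])]
wanted = [
    div for div in root.findall(".//div[@class='measure-tab']")
    if not any(span.text == "jokerText"
               for span in div.findall(".//span[@class='qwery']"))
]
print(len(wanted))  # 2 of the 3 divs survive the filter
```

The key point in both forms is that the negation must be scoped to each candidate div's own descendants; a bare //span[...] inside the predicate searches the whole document, which is why the earlier attempts matched every div.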
I have a list of web elements that I defined as follows:
sellersList = browser.find_elements_by_class_name('gig-card-layout')
Each web element looks like this:
<div class="gig-card-layout">
<div>
<div class="gig-wrapper card" data-gig-id="gig_id" data-impression-collected="true">
...
<div class="seller-info text-body-2">...</div>
<h3 class="text-display-7">...</h3>
<footer>
...
</footer>
</div>
</div>
</div>
I would like to access the price text located in the footer of each web element using a for loop.
How could I do that?
You can get the price elements directly using the below snippet.
# get all price elements
priceElems = driver.find_elements_by_css_selector(".gig-card-layout footer a")
# iterate through all price elements and print the price
for priceElem in priceElems:
    print(priceElem.get_attribute('title'))
If you want to use the sellersList and iterate through the list, then you can do the below:
for seller in sellersList:
    print(seller.find_element_by_xpath(".//footer/a").get_attribute('title'))
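The per-seller relative lookup can be illustrated offline with the standard library. This sketch assumes (as the answer does) that the price lives in the title attribute of a link inside each card's footer; the markup is a simplified stand-in, not the real page:

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the gig-card markup in the question
html = """
<root>
  <div class="gig-card-layout">
    <div><footer><a title="$5">link</a></footer></div>
  </div>
  <div class="gig-card-layout">
    <div><footer><a title="$20">link</a></footer></div>
  </div>
</root>
"""

root = ET.fromstring(html)
prices = []
for seller in root.findall(".//div[@class='gig-card-layout']"):
    link = seller.find(".//footer/a")  # same relative path as './/footer/a'
    if link is not None:
        prices.append(link.get("title"))

print(prices)
```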
I'm trying to extract all of the "jobs" someone has on a LinkedIn profile (for educational purposes only), but I cannot find the right approach using BeautifulSoup.
I notice that the headers for jobs are nested in the following way:
<div class="pv-entity__company-details">
<div class="pv-entity__company-summary-info">
<h3 class="t-16 t-black t-bold">
<span class="visually-hidden">Company Name</span>
<span>Morgan Stanley</span>
</h3>
<h4 class="t-14 t-black t-normal">
<span class="visually-hidden">Total Duration</span>
<span>2 yrs 7 mos</span>
</h4>
</div>
</div>
I'm trying to extract the text "Morgan Stanley" from every t-16 t-black t-bold BUT ONLY if it is under the pv-entity__company-summary-info div.
Trying something like this:
all_jobs = ', '.join(sel.xpath('//*[contains(@class, "t-16 t-black t-bold")]/text()').extract())
This gives too much spurious text, because there are lots of elements with the t-16 t-black t-bold class that are not under the pv-entity__company-summary-info div.
Any thoughts?
You didn't paste too much html, but does this selector work:
all_jobs = ', '.join(sel.xpath('//div[@class="pv-entity__company-summary-info"]/h3/span/text()').extract())
# if you want just the second span in the html above
all_jobs = ', '.join(sel.xpath('//div[@class="pv-entity__company-summary-info"]/h3/span[2]/text()').extract())
xpath cheatsheet
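The idea of anchoring on the wrapper div before descending can be checked offline with the standard library. This sketch uses the question's own snippet (plus one hypothetical spurious h3 added to show what gets filtered out):

```python
import xml.etree.ElementTree as ET

html = """
<root>
  <div class="pv-entity__company-details">
    <div class="pv-entity__company-summary-info">
      <h3 class="t-16 t-black t-bold">
        <span class="visually-hidden">Company Name</span>
        <span>Morgan Stanley</span>
      </h3>
    </div>
  </div>
  <h3 class="t-16 t-black t-bold"><span>spurious</span><span>Not a job</span></h3>
</root>
"""

root = ET.fromstring(html)
# Anchor on the wrapper div first, then take the second span of its h3;
# this skips h3 elements that share the class but sit elsewhere.
names = [
    h3.findall("span")[1].text
    for h3 in root.findall(".//div[@class='pv-entity__company-summary-info']/h3")
]
all_jobs = ", ".join(names)
print(all_jobs)
```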
I am using Scrapy's SitemapSpider to pull all product links from their respective collections. My list of sites are all Shopify stores, and the code that links to the products looks like this:
<div class="grid__item grid-product medium--one-half large--one-third">
<div class="grid-product__wrapper">
<div class="grid-product__image-wrapper">
<a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">
<img src="//cdn.shopify.com/s/files/1/1150/5108/products/product-image_50ce19b1-c700-4a77-9638-e2ac66a3acef_grande.jpg?v=1457310318" alt="Black Double Layer Braided Leather Bracelet" class="grid-product__image">
</a>
</div>
<a href="/collections/accessories/products/black-double-layer-braided-leather-bracelet" class="grid-product__meta">
<span class="grid-product__title">Black Double Layer Braided Leather Bracelet</span>
<span class="grid-product__price-wrap">
<span class="long-dash">—</span>
<span class="grid-product__price">
$ 15
</span>
</span>
</a>
</div>
</div>
Obviously, both hrefs are exactly the same. The problem I'm having is that I scrape both links when using the following code:
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
I'm trying to select the div element that has both a tags as descendants. From that, I only want to pull the href from the first a tag to avoid duplicate links.
Although each site is a Shopify store, their source code for the collections page isn't exactly the same, so the depth of the a tag under the div element is inconsistent and I'm not able to add a predicate like:
//div[@class="grid__item grid-product medium--one-half large--one-third"]
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
print(product_links[0])  # This is your first a tag
Just use the extract_first() command to extract only the first matched element. The benefit of using this is that it avoids an IndexError and returns None when it doesn't find any element matching the selection.
So, it should be :
>>> response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract_first()
u'/collections/accessories/products/black-double-layer-braided-leather-bracelet'
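The dedup logic can also be demonstrated offline with the standard library. Since xml.etree.ElementTree's XPath subset lacks contains(), this sketch filters the hrefs in Python and keeps only the first occurrence of each, which is the Python-side analogue of reaching for extract_first() (the markup is a trimmed, well-formed stand-in for the Shopify grid item):

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the Shopify grid item: two <a> tags
# share the same href, and we only want it once.
html = """
<root>
  <div class="grid-product__wrapper">
    <a class="grid-product__image-link"
       href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">img</a>
    <a class="grid-product__meta"
       href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">meta</a>
  </div>
</root>
"""

root = ET.fromstring(html)
# Filter product hrefs in Python and dedupe while preserving order
seen, product_links = set(), []
for a in root.findall(".//a[@href]"):
    href = a.get("href")
    if "collections" in href and "products" in href and href not in seen:
        seen.add(href)
        product_links.append(href)

print(product_links)
```

Deduplicating the extracted hrefs like this also works across pages with inconsistent nesting, which sidesteps the depth problem the question describes.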
<div>
<div id="ide_1"> </div>
<div id="ide_3"> </div>
<div id="ide_5"> </div>
<div id="ide_7"> </div>
</div>
I want to select all ids of the child divs and insert them into a list, but I didn't find any solution to get into the parent div. I am trying to find all ids that start with ide_ because that prefix is the fixed part which won't change.
You can use a css_selector to search for all ids that contain ide_:
find_elements_by_css_selector('[id*="ide_"]')
You can use find_elements_by_xpath(); this will return a list of elements with the specified path.
Let's say your div is located as:
<html>
<body>
<form>
<table>
<div>
Then you have to specify it as:
driver.find_elements_by_xpath(r'/html/body/form/table/div')
In case you have a class name, text, or anything else on the main div element, you can use any of the find_elements methods. For further reading, see Locating Elements.
Hope it helps. Happy Coding :)
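Either way, the end goal is a plain Python list of ids. Here's a minimal stdlib sketch of that last step, run against the question's own snippet: select the parent div's children and collect every id with the fixed ide_ prefix (mirroring what the [id*="ide_"] css selector matches):

```python
import xml.etree.ElementTree as ET

# The snippet from the question, used directly as the parse root
html = """
<div>
  <div id="ide_1"> </div>
  <div id="ide_3"> </div>
  <div id="ide_5"> </div>
  <div id="ide_7"> </div>
</div>
"""

root = ET.fromstring(html)
# Collect every child div id that starts with the fixed "ide_" prefix
ids = [d.get("id") for d in root.findall("./div")
       if d.get("id", "").startswith("ide_")]
print(ids)
```

With Selenium, the equivalent final step is a list comprehension over the found elements: [el.get_attribute('id') for el in elements].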