I want to extract the URL http:///gb/groceries/easter-essentials--%28approx-205kg%29.
In Scrapy I used this XPath expression:
response.xpath('//div[@class="productNameAndPromotions"]/h3/a/href').extract()
but it didn't work!
<div class="product ">
<div class="productInfo">
<div class="productNameAndPromotions">
<h3>
<a href="http:///gb/groceries/easter-essentials--%28approx-205kg%29">
<img src="http:co.uk/wcsstore7.20.1.145/ExtendedSitesCatalogAssetStore/image/catalog/productImages/08/020000008_L.jpeg" alt="" />
</a>
</h3>
</div>
</div>
</div>
This XPath, //div[@class="productNameAndPromotions"]/h3/a/href, means you want to select an href element that is a child of a, but href is an attribute, not an element.
If you want to extract a node's attribute, e.g. href, you need the @attribute syntax. Try the following:
//div[@class="productNameAndPromotions"]/h3/a/@href
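In a spider callback or the Scrapy shell, that would look roughly like this (a minimal sketch; response is assumed to be the Scrapy response for your product page):
# Minimal sketch: pull every product link out of the response.
links = response.xpath(
    '//div[@class="productNameAndPromotions"]/h3/a/@href'
).extract()
for link in links:
    print(link)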
I am new to Python and BeautifulSoup. I want to get the link from the href. Unfortunately, the anchor also includes other, irrelevant data.
Help is much appreciated.
<a href="/link-i-want/to-get.html">
<li class="cat-list-row1 clearfix">
<img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
<h3>
<p class="subline">Do not need</p> Do not need! </h3>
<span class="tag-body">
<p>Do not need</p>... </span>
<div style="clear:both;"></div>
</li>
</a>
An attribute value can be extracted with [] brackets.
For instance, to extract the alt value of an img tag, use:
image_example = soup.find('img') and then print(image_example['alt'])
Updated code:
from bs4 import BeautifulSoup
data = '''
<a href="/link-i-want/to-get.html">
<li class="cat-list-row1 clearfix">
<img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
<h3>
<p class="subline">Do not need</p> Do not need! </h3>
<span class="tag-body">
<p>Do not need</p>... </span>
<div style="clear:both;"></div>
</li>
</a> <a href="/link-i-want/to-get.html">
<li class="cat-list-row1 clearfix">
<img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
<h3>
<p class="subline">Do not need</p> Do not need! </h3>
<span class="tag-body">
<p>Do not need</p>... </span>
<div style="clear:both;"></div>
</li>
</a>
'''
soup = BeautifulSoup(data, 'html.parser')
url_address = soup.find('a')['href']
print (url_address) # Output: /link-i-want/to-get.html
The format is as follows:
soup.find('<tag>')['<attribute-name>']
You can also use .get(), as mentioned: soup.find('<tag>').get('<attr>')
Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
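If there are several anchors and you want every href, find_all() combined with either syntax works. A small sketch reusing the data string from the updated code above:
soup = BeautifulSoup(data, 'html.parser')
# .get() returns None instead of raising KeyError when the attribute is missing.
all_urls = [a.get('href') for a in soup.find_all('a')]
print(all_urls)  # ['/link-i-want/to-get.html', '/link-i-want/to-get.html']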
I have HTML as shown below:
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >Setting</span>
</span>
</div>
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >Home</span>
</span>
</div>
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >products</span>
</span>
</div>
I want to click the img icon based on the text in the last span tag.
For example, I want to select the first img tag if the last span contains "Setting". Can you please help me write an XPath for this UI element to use with Selenium WebDriver in Python?
I think this XPath will help you. It locates the xtree container whose span contains the text, then selects the img inside it:
//div[@class="xtree"][.//span[contains(text(), "Setting")]]/img[@class="dojoimg"]
Hope this concept helps you.
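To actually click it with Selenium, you would pass that XPath to the driver. A rough sketch, assuming a Chrome driver is available and the page with the tree is loaded (the URL below is only a placeholder):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is available
driver.get("https://example.com/tree-page")  # placeholder URL

# Click the img whose surrounding xtree block contains the "Setting" span.
driver.find_element(
    By.XPATH,
    '//div[@class="xtree"][.//span[contains(text(), "Setting")]]'
    '/img[@class="dojoimg"]'
).click()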
Here is my solution, using find_element_by_link_text:
driver.find_element_by_link_text("Reveal").click()
I am trying to get reviews from the Agoda site for analysis using BeautifulSoup.
I have inspected the page and see that the reviews are in:
<div class="container-agoda">
<div class="a">
<div class="b">
<div class="c">
<div class="d">
<div class="col-xs-9 review-comment" data-selenium="comments-detail">>
<div name="review-title" class="title" data-selenium="comments-title">
HAD 1 HOUR SLEEP
</div>
<div class="review-comment-section">
<div class="comment-detail" data-selenium="reviews-comments">
<span>Great location</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
but this class is nested inside 10+ other classes.
I have tried:
for div in soup.findAll('div', attrs={"class": "comment-detail"}):
    print(div)
but it gets nothing.
Is there a way to match exactly class="comment-detail" data-selenium="reviews-comments", or any other suggestion?
Thank you.
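For what it's worth, attrs accepts several attributes at once, so both class and data-selenium can be required together. A small, self-contained sketch built from the fragment above (note that if Agoda renders the reviews with JavaScript, they will not be present in the raw HTML you download):
from bs4 import BeautifulSoup

html = '''
<div class="comment-detail" data-selenium="reviews-comments">
<span>Great location</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# attrs can filter on several attributes at the same time.
for div in soup.find_all('div', attrs={"class": "comment-detail",
                                       "data-selenium": "reviews-comments"}):
    print(div.get_text(strip=True))  # Great location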
I have the following XPath that I am trying to extract data from:
/html/body/div[2]/div[2]/div/div/div[4]/ul[2]/li/div
I am trying to simply test this through Scrapy Shell, so I do the following:
scrapy shell "https://www.rentler.com/listing/520583"
and then:
hxs.select('/html/body/div[2]/div[2]/div/div/div[4]/ul[2]/li/div').extract()
But this returns [].
Any ideas?
Edit
The whole reason I want to do this is that I need to break up these 5 items into individual variables, rather than one array (which I currently have working):
<ul class="basic-stats">
<li>
<div class="count">4</div>
<div class="label">Bed</div>
</li>
<li>
<div class="count">2</div>
<div class="label">Bath</div>
</li>
<li>
<div class="count">1977</div>
<div class="label">Year</div>
</li>
<li>
<div class="count">1960</div>
<div class="label">SqFt</div>
</li>
<li>
<div class="count">0</div>
<div class="label">Acres</div>
</li>
</ul>
I solved this. To access the individual items above, you simply add li[1], li[2], etc. to the XPath.
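In the Scrapy shell, that looks roughly like the following (a sketch using the modern response.xpath API; the class names come from the HTML above):
# Index into the <li> elements of the basic-stats list one at a time.
beds = response.xpath('//ul[@class="basic-stats"]/li[1]/div[@class="count"]/text()').extract_first()
baths = response.xpath('//ul[@class="basic-stats"]/li[2]/div[@class="count"]/text()').extract_first()
year = response.xpath('//ul[@class="basic-stats"]/li[3]/div[@class="count"]/text()').extract_first()
sqft = response.xpath('//ul[@class="basic-stats"]/li[4]/div[@class="count"]/text()').extract_first()
acres = response.xpath('//ul[@class="basic-stats"]/li[5]/div[@class="count"]/text()').extract_first()
print(beds, baths, year, sqft, acres)  # 4 2 1977 1960 0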
I have the following HTML code:
<div id='div1'>
<div id='d'> </div>
<p></p>
</div>
How do I get everything inside the div with the id div1?
soup.find('div', {'id': "div1"}) returns:
<div id='div1'>
<div id='d'> </div>
<p></p>
</div>
I need to get only:
<div id='d'> </div>
<p></p>
See the documentation, specifically .find() and .contents.
You want the content between the start and end of the tag, including all child tags:
soup.find('div', id="div1").contents
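A minimal, self-contained sketch of what that returns (note that whitespace text nodes show up in .contents alongside the child tags):
from bs4 import BeautifulSoup

html = """
<div id='div1'>
<div id='d'> </div>
<p></p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# .contents is a list of the tag's direct children, both tags and text nodes.
print(soup.find('div', id="div1").contents)
# ['\n', <div id="d"> </div>, '\n', <p></p>, '\n']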