BeautifulSoup and find - python

I have this HTML:
<div id='div1'>
<div id='d'> </div>
<p></p>
</div>
How do I get everything inside the div with the id div1?
soup.find('div',{'id':"div1"}) returns:
<div id='div1'>
<div id='d'> </div>
<p></p>
</div>
I need only the inner part:
<div id='d'> </div>
<p></p>

See the documentation, specifically .find() and .contents.

You want everything between the start and end of the tag, including all child tags. The .contents attribute returns those children as a list:
soup.find('div', id="div1").contents
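
A minimal sketch of what this returns, assuming the markup from the question and the built-in "html.parser" backend (the variable names are only illustrative). .contents gives a list of child nodes, including the newline strings between tags, while decode_contents() renders the same inner markup as one string.

```python
from bs4 import BeautifulSoup, Tag

html = """<div id='div1'>
<div id='d'> </div>
<p></p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="div1")

# .contents is a list of child nodes; the newlines between tags appear as strings
children = outer.contents

# Keep only the child tags, skipping the whitespace-only strings
child_tags = [child for child in children if isinstance(child, Tag)]

# decode_contents() renders everything inside the tag as a single string
inner_html = outer.decode_contents().strip()

print(child_tags)
print(inner_html)
```

Which form is more convenient depends on whether the next step works with tag objects or with raw markup.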

Related

selenium scrape multiple attributes within a block at the same time

I have a webpage that follows this pattern:
<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>
...
I want to scrape the href and datetime attributes in pairs: [abc/def/gh.com, 2020-05-31], [ijk/lmn/op.com, 2020-04-30]
How can I do this?
Thank you.
You can try the following:
from bs4 import BeautifulSoup
t='''<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>'''
soup=BeautifulSoup(t,"lxml")
aTags=soup.select('a')
data=[]
for aTag in aTags:
    # look up the <time> tag inside this particular card
    timeTag=aTag.select_one('time')
    data.append([aTag.get('href'),timeTag['datetime']])
print(data)
Instead of t, you can pass the page source from Selenium (driver.page_source).
Output:
[['abc/def/gh.com', '2020-05-31'], ['ijk/lmn/op.com', '2020-04-30']]
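You can use find_elements_by_xpath() and get_attribute() in Python, as follows. In recent Selenium 4 releases the find_elements_by_* helpers were removed; the equivalent call is driver.find_elements(By.XPATH, ...), with By imported from selenium.webdriver.common.by.
# for the hrefs (match every card, not only cardlisting0)
urls = [a.get_attribute('href') for a in driver.find_elements_by_xpath('//a[contains(@class, "card")]')]
# for the datetimes
dates = [time_element.get_attribute('datetime') for time_element in driver.find_elements_by_xpath('//a//time')]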

XPath to match a specific element based on the text of an inner child tag

I have HTML as shown below:
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >Setting</span>
</span>
</div>
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >Home</span>
</span>
</div>
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >products</span>
</span>
</div>
I want to click the img icon based on the text in the last span tag.
For example, I want to select the first img tag if the last span contains "Setting". Can you please help me write an XPath for this UI element to use with Selenium WebDriver in Python?
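I think this XPath will help. The span is not a descendant of the img, so the expression first selects the xtree div whose inner span contains the text, then takes that div's img:
//div[@class="xtree"][.//span[contains(text(), "Setting")]]/img[@class="dojoimg"]
Hope this helps.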
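
A small sketch of selecting the img by the text of the span in the same xtree block, checked offline with lxml instead of a browser. It assumes a trimmed, well-formed copy of the markup from the question; the id attributes, the root wrapper, and the img_xpath helper are hypothetical additions used only to tell the images apart.

```python
from lxml import etree

# Trimmed, well-formed copy of the question's markup.
# The id attributes and the <root> wrapper are hypothetical, added for the demo.
markup = """<root>
  <div class="xtree">
    <img class="dojoimg" id="img-setting"/>
    <span class="presentation">+</span>
    <span class="treenode">
      <div class="ctreefolder">....</div>
      <div class="presentationfolder">....</div>
      <span>Setting</span>
    </span>
  </div>
  <div class="xtree">
    <img class="dojoimg" id="img-home"/>
    <span class="presentation">+</span>
    <span class="treenode">
      <div class="ctreefolder">....</div>
      <div class="presentationfolder">....</div>
      <span>Home</span>
    </span>
  </div>
</root>"""


def img_xpath(label):
    """Build an XPath for the img in the xtree block whose span holds `label`."""
    return (
        '//div[@class="xtree"]'
        f'[.//span[contains(text(), "{label}")]]'
        '/img[@class="dojoimg"]'
    )


tree = etree.fromstring(markup)
matches = tree.xpath(img_xpath("Home"))
print([img.get("id") for img in matches])
```

With Selenium, the same expression string could be passed to the driver's XPath lookup and the returned element clicked.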
Here is my solution, using find_element_by_link_text:
driver.find_element_by_link_text("Reveal").click()

Extract text with a Python XPath expression

I want to display http:///gb/groceries/easter-essentials--%28approx-205kg%29.
In scrapy I used this XPath expression:
response.xpath('//div[@class="productNameAndPromotions"]/h3/a/href').extract()
but it didn't work!
<div class="product ">
<div class="productInfo">
<div class="productNameAndPromotions">
<h3>
<a href="http:///gb/groceries/easter-essentials--%28approx-205kg%29">
<img src="http:co.uk/wcsstore7.20.1.145/ExtendedSitesCatalogAssetStore/image/catalog/productImages/08/020000008_L.jpeg" alt="" />
</a>
</h3>
</div>
</div>
</div>
The expression //div[@class="productNameAndPromotions"]/h3/a/href selects an element named href that is a child of a, and no such element exists.
To extract a node's attribute, such as href, use the @attribute syntax. Try this:
//div[@class="productNameAndPromotions"]/h3/a/@href
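
To make the difference between an element step and an attribute step concrete, here is a sketch that evaluates both expressions with lxml rather than Scrapy's response.xpath. It assumes a well-formed copy of the markup from the question with the img tag left out for brevity; the XPath strings themselves are unchanged.

```python
from lxml import etree

# Well-formed copy of the question's markup (img omitted for brevity)
markup = """<div class="product ">
  <div class="productInfo">
    <div class="productNameAndPromotions">
      <h3>
        <a href="http:///gb/groceries/easter-essentials--%28approx-205kg%29">
        </a>
      </h3>
    </div>
  </div>
</div>"""

tree = etree.fromstring(markup)

# "/href" looks for a child element called <href>, which does not exist
element_step = tree.xpath('//div[@class="productNameAndPromotions"]/h3/a/href')

# "/@href" selects the href attribute of the <a> element
attribute_step = tree.xpath('//div[@class="productNameAndPromotions"]/h3/a/@href')

print(element_step)
print(attribute_step)
```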

beautifulsoup: finding specific class name in nested div

I am trying to get reviews from the Agoda site for analysis using BeautifulSoup.
I inspected the page and saw that the reviews are in:
<div class="container-agoda">
<div class="a">
<div class="b">
<div class="c">
<div class="d">
<div class="col-xs-9 review-comment" data-selenium="comments-detail">
<div name="review-title" class="title" data-selenium="comments-title">
HAD 1 HOUR SLEEP
</div>
<div class="review-comment-section">
<div class="comment-detail" data-selenium="reviews-comments">
<span>Great location</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
but this div is nested inside more than 10 other divs.
I have tried:
for div in soup.findAll('div', attrs={"class":"comment-detail"}):
    print(div)
but it returns nothing.
Is there a way to match exactly class="comment-detail" data-selenium="reviews-comments", or do you have any other suggestion?
Thank you.
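
A sketch of matching on both the class and the data-selenium attribute, assuming the reviews are actually present in the HTML that was parsed. Nesting depth does not matter to find_all, so if the loop prints nothing, one possibility worth checking is that the page fills in the reviews with JavaScript after loading; that is an assumption, not something the question confirms.

```python
from bs4 import BeautifulSoup

# Inner part of the markup from the question
html = """<div class="col-xs-9 review-comment" data-selenium="comments-detail">
<div name="review-title" class="title" data-selenium="comments-title">
HAD 1 HOUR SLEEP
</div>
<div class="review-comment-section">
<div class="comment-detail" data-selenium="reviews-comments">
<span>Great location</span>
</div>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Match on both the class and the data-selenium attribute
by_attrs = soup.find_all(
    "div",
    attrs={"class": "comment-detail", "data-selenium": "reviews-comments"},
)

# The same condition written as a CSS selector
by_css = soup.select('div.comment-detail[data-selenium="reviews-comments"]')

comments = [div.get_text(strip=True) for div in by_attrs]
print(comments)
```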

Find sibling node by xpath with Python selenium

Here is a fragment of HTML. I need to use Selenium to find the quote ID value 1616968600, but I'm new to XPath and could use some help.
<div class="row">
.....
</div>
<div class="row">
<div class="col-md-2 ng-scope" style="font-weight: bold" translate="Business_Partner_Id">Business partner name: </div>
<div class="col-md-2 ng-binding">Avnet Hall-Mark</div>
<div class="col-md-2 ng-scope" style="font-weight: bold" translate="Quote_Id">Quote ID: </div>
<div class="col-md-3 ng-binding">1616968600</div>
</div>
Locate the div having Quote ID text and get the next sibling:
//div[contains(., "Quote ID")]/following-sibling::div
Usage:
quote_id_elm = driver.find_element_by_xpath('//div[contains(., "Quote ID")]/following-sibling::div')
print(quote_id_elm.text)
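
The sibling lookup can be tried without a browser. The sketch below uses lxml on a well-formed copy of the second row from the question (the container wrapper is a hypothetical addition). It runs the expression from the answer and, as an optional stricter variant that is not part of the original answer, anchors on the translate attribute and takes only the first following sibling.

```python
from lxml import etree

# Well-formed copy of the row from the question; the outer wrapper is hypothetical
markup = """<div class="container">
  <div class="row">
    <div class="col-md-2 ng-scope" translate="Business_Partner_Id">Business partner name: </div>
    <div class="col-md-2 ng-binding">Avnet Hall-Mark</div>
    <div class="col-md-2 ng-scope" translate="Quote_Id">Quote ID: </div>
    <div class="col-md-3 ng-binding">1616968600</div>
  </div>
</div>"""

tree = etree.fromstring(markup)

# Expression from the answer: div containing "Quote ID", then its following sibling divs
siblings = tree.xpath('//div[contains(., "Quote ID")]/following-sibling::div')
quote_ids = [element.text for element in siblings]

# Stricter variant: anchor on the translate attribute, take only the next sibling
strict = tree.xpath('//div[@translate="Quote_Id"]/following-sibling::div[1]/text()')

print(quote_ids)
print(strict)
```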
