CSS selector of parent text - Python

I want to get this figure: $185,000,000. Is there any way to get the text of the parent tag while skipping the text of its child tags?
<div class="txt-block">
<h4 class="inline">Budget:</h4>
$185,000,000
<span class="attribute">(estimated)</span>
</div>

<div class="txt-block">
<h4 class="inline">Budget:</h4>
<span class="value">$185,000,000</span>
<span class="attribute">(estimated)</span>
</div>

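Yes, you can do this. Simply write
response.css('.txt-block::text').getall()
This returns only the text nodes that are direct children of the div, so the text inside the h4 and span children is skipped. For the first snippet the result also contains whitespace-only nodes around the child tags, so strip each string and keep the non-empty one to get $185,000,000. For the second snippet, where the figure sits in its own span, select it directly with .txt-block .value::text. If you put a space between .txt-block and ::text (.txt-block ::text), the selector also matches the text of all descendant elements.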
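A minimal sketch of the same idea outside Scrapy, assuming lxml is installed. The XPath text() step selects only the div's own text nodes, which is the XPath counterpart of ::text applied directly to .txt-block. The whitespace-only nodes around the child tags are filtered out; the variable names are illustrative.

```python
from lxml import html

fragment = '''<div class="txt-block">
<h4 class="inline">Budget:</h4>
$185,000,000
<span class="attribute">(estimated)</span>
</div>'''

doc = html.fromstring(fragment)

# text() matches only text nodes that are direct children of the div,
# so "Budget:" and "(estimated)" are not included.
own_text = doc.xpath('//div[@class="txt-block"]/text()')

# Drop the whitespace-only nodes that sit between the child tags.
cleaned = [t.strip() for t in own_text if t.strip()]
budget = cleaned[0] if cleaned else None
print(budget)
```

The same XPath expression can be passed to response.xpath() in a Scrapy callback.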

Related

How to select multiple children from HTML tag with Python/BeautifulSoup if exists?

I'm currently scraping elements from a webpage. Let's say I'm iterating over an HTML response and a part of that response looks like this:
<div class="col-sm-12 col-md-5">
<div class="material">
<div class="material-parts">
<span class="material-part" title="SLT-4 2435">
<img src="/images/train-material/mat_slt4.png"/> </span>
<span class="material-part" title="SLT-6 2631">
<img src="/images/train-material/mat_slt6.png"/> </span>
</div>
</div>
</div>
I know I can access the title of the first span like so:
row[-1].find('span')['title']
"SLT-4 2435"
But I would like to get the title of the second span (if it exists) in the same string too, like so: "SLT-4 2435, SLT-6 2631"
Any ideas?
You can use the find_all() function to find all the span elements with the class material-part:
titles = []
for material_part in row[-1].find_all('span', class_='material-part'):
    titles.append(material_part['title'])
result = ', '.join(titles)
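Because join() works on a list of any length, the find_all() approach also covers the "if it exists" part of the question. A small self-contained sketch, assuming Beautiful Soup 4 is installed; join_titles and the shortened markup strings are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup


def join_titles(markup):
    """Return the titles of all material-part spans, comma separated."""
    soup = BeautifulSoup(markup, 'html.parser')
    # title=True keeps only spans that actually carry a title attribute.
    spans = soup.find_all('span', class_='material-part', title=True)
    return ', '.join(span['title'] for span in spans)


two_parts = ('<div><span class="material-part" title="SLT-4 2435"></span>'
             '<span class="material-part" title="SLT-6 2631"></span></div>')
one_part = '<div><span class="material-part" title="SLT-4 2435"></span></div>'
no_parts = '<div></div>'

print(join_titles(two_parts))
print(join_titles(one_part))
print(repr(join_titles(no_parts)))
```

With one span the result is just that title, and with no spans it is an empty string, so no separate existence check is needed.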
As an alternative to find() / find_all(), you could use CSS selectors:
soup.select('span.material-part[title]')
Then iterate over the ResultSet with a list comprehension and join() the titles into a single string:
','.join([t.get('title') for t in soup.select('span.material-part[title]')])
Example
from bs4 import BeautifulSoup
html = '''<div class="col-sm-12 col-md-5">
<div class="material">
<div class="material-parts">
<span class="material-part" title="SLT-4 2435">
<img src="/images/train-material/mat_slt4.png"/> </span>
<span class="material-part" title="SLT-6 2631">
<img src="/images/train-material/mat_slt6.png"/> </span>
</div>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
','.join([t.get('title') for t in soup.select('span.material-part[title]')])
Output
SLT-4 2435,SLT-6 2631

How to get all text inside a parent div with XPath

I want to get all the text inside a div with XPath.
Here is the HTML code:
<div class="JobDescriptionsc__DescriptionContainer-sc-1jylha1-2 dGyoDf">
<div class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf">
<div class="DraftEditor-root">
<div class="DraftEditor-editorContainer">
<div class="public-DraftEditor-content" contenteditable="false" spellcheck="false" style="outline:none;user-select:text;-webkit-user-select:text;white-space:pre-wrap;word-wrap:break-word">
<div data-contents="true">
#Here the all text
<div class="" data-block="true" data-editor="d54la" data-offset-key="bhkoa-0-0">
<div data-offset-key="bhkoa-0-0" class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr">
<span data-offset-key="bhkoa-0-0" style="font-weight:bold">
<span data-text="true">Job Description:</span>
</span>
</div>
</div>
<div class="" data-block="true" data-editor="d54la" data-offset-key="51e5u-0-0">
<div data-offset-key="51e5u-0-0" class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr">
<span data-offset-key="51e5u-0-0">
<span data-text="true">· Identify & developed application base on predefined business requirements.</span>
</span>
</div>
</div>
...
#there's more, I'm just showing you a few
</div>
</div>
</div>
</div>
</div>
</div>
This is my XPath code:
dom_job.xpath('//*[@class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf"]//text()')
I need all the text inside the parent div with XPath. Is that possible?
I'm assuming the Python module which provides your XPath interpreter supports XPath version 1. Your XPath expression below returns the set of all text nodes which are descendants of the div element:
//*[@class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf"]//text()
You should be able to iterate over that collection of text nodes, and concatenate them into a single string, in Python.
But it's simpler, if you want the concatenated value of the text nodes within a particular div, to just apply the XPath string() function to the div; e.g.:
string(//*[@class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf"])
See https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string
Note that, in XPath 1, if you apply the string() function to a larger set of nodes (such as the set of text nodes returned by your first query), the function will return the string value of just the first node.
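Both approaches can be compared side by side. The sketch below assumes lxml as the XPath 1.0 engine and uses a shortened, hypothetical class name ("container") and sample text in place of the long generated ones from the question.

```python
from lxml import html

fragment = '''<div class="container">
<div><span><span>Job Description:</span></span></div>
<div><span><span>Identify and develop applications.</span></span></div>
</div>'''

doc = html.fromstring(fragment)

# Approach 1: collect every descendant text node, then join in Python.
text_nodes = doc.xpath('//div[@class="container"]//text()')
joined = ' '.join(t.strip() for t in text_nodes if t.strip())

# Approach 2: let XPath concatenate the text with string().
whole = doc.xpath('string(//div[@class="container"])')
# string() keeps the original whitespace, so normalise it afterwards.
normalised = ' '.join(whole.split())

print(joined)
print(normalised)
```

Joining in Python gives control over the separator between blocks, while string() is shorter but returns the whitespace exactly as it appears in the markup.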

How can I parse html file using python and beautiful soup from html tag under html tag value?

My HTML file contains the same tag (<span class="fna">) multiple times. To tell these tags apart, I need to look at the enclosing tag: I want the <span class="fna"> that sits under <span id="field-value-reporter">.
In Beautiful Soup, I can only apply a condition on the tag itself, like soup.find_all("span", {"class": "fna"}). This extracts the data for every <span class="fna">, but I need only the one contained in <span id="field-value-reporter">.
Example html tags:
<div class="value">
<span id="field-value-reporter">
<div class="vcard vcard_287422" >
<a class="email " href="/user_profile?user_id=287422" >
<span class="fna">Chris Pearce (:cpearce)
</span>
</a>
</div>
</span>
</div>
<div class="value">
<span id="field-value-triage_owner">
<div class="vcard vcard_27780" >
<a class="email " href="/user_profile?user_id=27780">
<span class="fna">Justin Dolske [:Dolske]
</span>
</a>
</div>
</span>
</div>
Use soup.select:
soup.select('#field-value-reporter a > span') # span elements that are direct children of an <a> inside the element whose id is field-value-reporter
>>> [<span class="fna">Chris Pearce (:cpearce)</span>]
soup.select uses CSS selectors, which are, in my opinion, much more capable than the default element search that comes with BeautifulSoup. Note that the results are returned as a list containing every element that matches.
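A runnable sketch of that selector against the sample markup, assuming Beautiful Soup 4 with the built-in html.parser. It uses select_one and a descendant selector (span.fna), which are small variations on the answer's selector, and contrasts the scoped result with an unscoped search.

```python
from bs4 import BeautifulSoup

markup = '''<div class="value">
<span id="field-value-reporter">
<div class="vcard vcard_287422">
<a class="email" href="/user_profile?user_id=287422">
<span class="fna">Chris Pearce (:cpearce)
</span>
</a>
</div>
</span>
</div>
<div class="value">
<span id="field-value-triage_owner">
<div class="vcard vcard_27780">
<a class="email" href="/user_profile?user_id=27780">
<span class="fna">Justin Dolske [:Dolske]
</span>
</a>
</div>
</span>
</div>'''

soup = BeautifulSoup(markup, 'html.parser')

# Scoped: only the span.fna below the element with id field-value-reporter.
reporter = soup.select_one('#field-value-reporter span.fna')
reporter_name = reporter.get_text(strip=True) if reporter else None

# Unscoped, for comparison: every span.fna in the document.
all_names = [s.get_text(strip=True) for s in soup.select('span.fna')]

print(reporter_name)
print(all_names)
```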

Get a link text and element text based on a condition

I have to get the text and link of an element only if it contains 'theme-cell-card Ace'. Following is the sample HTML code:
<div class="theme-grid-cell-frame">
<a href="/t/490">
<div class="theme-cell">
<div class="image"></div>
<div class="theme-cell-overlay deep"></div>
<h1 class="theme-cell-name"> textqwqw</h1>
<div class="theme-cell-card Ace"></div>
</div>
</a>
</div>
<div class="theme-grid-cell-frame">
<a href="/o/434">
<div class="theme-cell">
<div class="image"></div>
<div class="theme-cell-overlay deep"></div>
<h1 class="theme-cell-name"> textegg</h1>
<div class="theme-cell-card Jack"></div>
</div>
</a>
</div>
<div class="theme-grid-cell-frame">
<a href="/t/4665">
<div class="theme-cell">
<div class="image"></div>
<div class="theme-cell-overlay deep"></div>
<h1 class="theme-cell-name"> textdgfh</h1>
<div class="theme-cell-card Ace"></div>
</div>
</a>
</div>
<div class="theme-grid-cell-frame">
<a href="/o/764">
<div class="theme-cell">
<div class="image"></div>
<div class="theme-cell-overlay deep"></div>
<h1 class="theme-cell-name"> textgrth</h1>
</div>
</a>
</div>
I am able to get the text of an element, but I only want it when the cell contains class="theme-cell-card Ace".
${grid}    Set Variable    //div[@class='theme-cell']
@{elements}    Get Webelements    ${grid}
:FOR    ${element}    IN    @{elements}
\    ${text}    Get Text    ${element}
I am a newbie, so please let me know if you need more info. Thank you
I don't know Robot Framework, but this is the XPath locator you want:
//a[.//div[@class='theme-cell-card Ace']]
That will get you the A tags that contain a DIV that has the desired classes. You can get the href from that element along with the contained text.
Since your question is tagged python, you can use something simple like
from selenium.webdriver.common.by import By
aces = driver.find_elements(By.XPATH, "//a[.//div[@class='theme-cell-card Ace']]")
for ace in aces:
    print(ace.get_attribute("href"))
    print(ace.text)
@{elements}    Get Webelements    //a[.//div[@class='theme-cell-card Ace']]
:FOR    ${element}    IN    @{elements}
\    ${text}    Get Text    ${element}
\    ${link}    SeleniumLibrary.Get Element Attribute    ${element}    attribute=href
\    Log To Console    ${text}
\    Log To Console    ${link}
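The locator can also be checked offline, without a browser. The sketch below is an assumption-laden illustration, not part of either answer: it uses lxml's XML parser on a trimmed copy of the sample markup, wrapped in a hypothetical root div so that it forms one well-formed tree.

```python
from lxml import etree

# Trimmed sample markup inside a hypothetical wrapper element.
markup = '''<div id="grid">
<div class="theme-grid-cell-frame"><a href="/t/490"><div class="theme-cell">
<h1 class="theme-cell-name"> textqwqw</h1>
<div class="theme-cell-card Ace"></div>
</div></a></div>
<div class="theme-grid-cell-frame"><a href="/o/434"><div class="theme-cell">
<h1 class="theme-cell-name"> textegg</h1>
<div class="theme-cell-card Jack"></div>
</div></a></div>
<div class="theme-grid-cell-frame"><a href="/t/4665"><div class="theme-cell">
<h1 class="theme-cell-name"> textdgfh</h1>
<div class="theme-cell-card Ace"></div>
</div></a></div>
<div class="theme-grid-cell-frame"><a href="/o/764"><div class="theme-cell">
<h1 class="theme-cell-name"> textgrth</h1>
</div></a></div>
</div>'''

root = etree.fromstring(markup)

# Keep only the <a> elements that contain the Ace card div.
aces = root.xpath("//a[.//div[@class='theme-cell-card Ace']]")

# Pair each link with the text found anywhere inside it.
results = [(a.get('href'), ''.join(a.itertext()).strip()) for a in aces]
print(results)
```

Note that @class='theme-cell-card Ace' is an exact string comparison, so it would not match if the classes appeared in a different order or with extra classes.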

Append element after another element using lxml

I have the following HTML markup
<div id="contents">
<div id="content_nav">
something goes here
</div>
<p>
some contents
</p>
</div>
To fix some CSS issue, I want to append a div tag <div style="clear:both"></div> after the content_nav div like this
<div id="contents">
<div id="content_nav">
something goes here
</div>
<div style="clear:both"></div>
<p>
some contents
</p>
</div>
I am doing it this way:
import lxml.etree
tree = lxml.etree.fromstring(inputString, parser=lxml.etree.HTMLParser())
contentnav = tree.find(".//div[@id='content_nav']")
contentnav.append(lxml.etree.XML("<div style='clear: both'></div>"))
But that appends the new div inside the content_nav div rather than right after it:
<div id="content_nav">
something goes here
<div style="clear:both"></div>
</div>
Is there any way to add a div between the content_nav div and the p inside contents?
Thanks
Instead of appending to contentnav, go up to the parent (contentdiv) and insert the new div at a particular index. To find that index, use contentdiv.index(contentnav), which gives the index of contentnav within contentdiv. Adding one to that gives the desired index.
import lxml.etree as ET
content = '''\
<div id="contents">
<div id="content_nav">
something goes here
</div>
<p>
some contents
</p>
</div>
'''
tree = ET.fromstring(content, parser=ET.HTMLParser())
contentnav = tree.find(".//div[@id='content_nav']")
contentdiv = contentnav.getparent()
contentdiv.insert(contentdiv.index(contentnav)+1,
                  ET.XML("<div style='clear: both'></div>"))
print(ET.tostring(tree))
yields
<html><body><div id="contents">
<div id="content_nav">
something goes here
</div>
<div style="clear: both"/><p>
some contents
</p>
</div></body></html>
Use addprevious and addnext for prepending and appending siblings.
An lxml.etree _Element has two methods: addprevious and addnext for doing exactly what you want.
import lxml.etree as ET
content='''\
<div id="contents">
<div id="content_nav">
something goes here
</div>
<p>
some contents
</p>
</div>
'''
tree = ET.fromstring(content, parser=ET.HTMLParser())
contentnav = tree.find(".//div[@id='content_nav']")
contentnav.addnext(ET.XML("<div style='clear: both'></div>"))
print(ET.tostring(tree))
Output:
<html><body><div id="contents">
<div id="content_nav">
something goes here
</div><div style="clear: both"/>
<p>
some contents
</p>
</div>
</body></html>
I believe that a generic function addressing the question "insert an element after another element" might be useful, even if it's just a reformulation of the accepted answer:
def insert_after(element, new_element):
    parent = element.getparent()
    parent.insert(parent.index(element)+1, new_element)
which allows inserting new_element after an existing element with just
insert_after(element, new_element)
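A short usage sketch of that helper on a compact version of the markup from the question, assuming lxml. The one-line sample document and the variable names are illustrative.

```python
import lxml.etree as ET


def insert_after(element, new_element):
    """Insert new_element as the next sibling of element."""
    parent = element.getparent()
    parent.insert(parent.index(element) + 1, new_element)


root = ET.fromstring(
    '<div id="contents">'
    '<div id="content_nav">something goes here</div>'
    '<p>some contents</p>'
    '</div>'
)

nav = root.find(".//div[@id='content_nav']")
# Keyword arguments to ET.Element become attributes of the new element.
insert_after(nav, ET.Element('div', style='clear:both'))

child_tags = [child.tag for child in root]
print(child_tags)
print(ET.tostring(root).decode())
```

The helper relies on the element having a parent; for a root element getparent() returns None and the call would fail.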
