Append element after another element using lxml - python

I have the following HTML markup
<div id="contents">
<div id="content_nav">
something goes here
</div>
<p>
some contents
</p>
</div>
To fix some CSS issue, I want to append a div tag <div style="clear:both"></div> after the content_nav div like this
<div id="contents">
<div id="content_nav">
something goes here
</div>
<div style="clear:both"></div>
<p>
some contents
</p>
</div>
I am doing it this way:
import lxml.etree
tree = lxml.etree.fromstring(inputString, parser=lxml.etree.HTMLParser())
contentnav = tree.find(".//div[@id='content_nav']")
contentnav.append(lxml.etree.XML("<div style='clear: both'></div>"))
But that appends the new div inside the content_nav div instead of right after it:
<div id="content_nav">
something goes here
<div style="clear:both"></div>
</div>
Is there any way to insert the new div between the content_nav div and the p element, inside contents?
Thanks

Instead of appending to contentnav, go up to the parent (contentdiv) and insert the new div at a particular index. To find that index, use contentdiv.index(contentnav), which gives the index of contentnav within contentdiv. Adding one to that gives the desired index.
import lxml.etree as ET
content = '''\
<div id="contents">
<div id="content_nav">
something goes here
</div>
<p>
some contents
</p>
</div>
'''
tree = ET.fromstring(content, parser=ET.HTMLParser())
contentnav = tree.find(".//div[@id='content_nav']")
contentdiv = contentnav.getparent()
contentdiv.insert(contentdiv.index(contentnav) + 1,
                  ET.XML("<div style='clear: both'></div>"))
print(ET.tostring(tree))
yields
<html><body><div id="contents">
<div id="content_nav">
something goes here
</div>
<div style="clear: both"/><p>
some contents
</p>
</div></body></html>
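If the result is meant to be served as HTML, the self-closing form of the empty div may not be what you want. A minimal sketch, assuming that passing method="html" to tostring fits your case; the shortened markup and the names parent, new_div and html_out are only for illustration:

```python
import lxml.etree as ET

# Shortened version of the markup from the question (no whitespace between tags).
content = '<div id="contents"><div id="content_nav">nav</div><p>some contents</p></div>'

tree = ET.fromstring(content, parser=ET.HTMLParser())
contentnav = tree.find(".//div[@id='content_nav']")
parent = contentnav.getparent()

# Build the new element directly instead of parsing it from a string.
new_div = ET.Element("div", style="clear: both")
parent.insert(parent.index(contentnav) + 1, new_div)

# method="html" serializes the empty div with an explicit closing tag.
html_out = ET.tostring(tree, method="html", encoding="unicode")
print(html_out)
```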

Use addprevious and addnext for prepending and appending siblings.
An lxml.etree _Element has two methods: addprevious and addnext for doing exactly what you want.
import lxml.etree as ET
content='''\
<div id="contents">
<div id="content_nav">
something goes here
</div>
<p>
some contents
</p>
</div>
'''
tree = ET.fromstring(content, parser=ET.HTMLParser())
contentnav = tree.find(".//div[@id='content_nav']")
contentnav.addnext(ET.XML("<div style='clear: both'></div>"))
print(ET.tostring(tree))
Output:
<html><body><div id="contents">
<div id="content_nav">
something goes here
</div><div style="clear: both"/>
<p>
some contents
</p>
</div>
</body></html>
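The answer above only demonstrates addnext. As a small sketch of the other direction, here is addprevious on a made-up fragment (the hr element and the variable names are only for illustration):

```python
import lxml.etree as ET

root = ET.fromstring("<div><p>some contents</p></div>")
p = root.find("p")

# addprevious inserts the new element as the sibling just before <p>.
p.addprevious(ET.Element("hr"))

tags = [child.tag for child in root]
print(tags)
```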

I believe that a generic function addressing the question "insert an element after another element" might be useful, even if it's just a reformulation of the accepted answer:
def insert_after(element, new_element):
    parent = element.getparent()
    parent.insert(parent.index(element) + 1, new_element)
which lets you insert new_element after an existing element with just
insert_after(element, new_element)
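A quick way to try the helper, as a sketch on a made-up three-element tree (the tag names a, b and c are placeholders). Note that getparent() returns None for the root element, so the helper as written only works for elements that have a parent:

```python
import lxml.etree as ET


def insert_after(element, new_element):
    """Insert new_element as the sibling right after element."""
    parent = element.getparent()
    parent.insert(parent.index(element) + 1, new_element)


root = ET.fromstring("<div><a/><c/></div>")
a = root.find("a")
insert_after(a, ET.Element("b"))

tags = [child.tag for child in root]
print(tags)
```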

Related

How to select multiple children from HTML tag with Python/BeautifulSoup if exists?

I'm currently scraping elements from a webpage. Let's say I'm iterating over an HTML response and a part of that response looks like this:
<div class="col-sm-12 col-md-5">
<div class="material">
<div class="material-parts">
<span class="material-part" title="SLT-4 2435">
<img src="/images/train-material/mat_slt4.png"/> </span>
<span class="material-part" title="SLT-6 2631">
<img src="/images/train-material/mat_slt6.png"/> </span>
</div>
</div>
</div>
I know I can access the title of the first span like so:
row[-1].find('span')['title']
"SLT-4 2435"
But I would also like to get the title of the second span (if it exists) and combine both into one string, like so: "SLT-4 2435, SLT-6 2631"
Any ideas?
You can use the find_all() function to find all the span elements with class material-part:
titles = []
for material_part in row[-1].find_all('span', class_='material-part'):
    titles.append(material_part['title'])
result = ', '.join(titles)
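Here is the same idea as a self-contained sketch. It assumes the html.parser backend and skips any span that has no title attribute; the variable names are illustrative. If no matching span exists, the result is simply an empty string:

```python
from bs4 import BeautifulSoup

html = '''<div class="material-parts">
<span class="material-part" title="SLT-4 2435">
<img src="/images/train-material/mat_slt4.png"/> </span>
<span class="material-part" title="SLT-6 2631">
<img src="/images/train-material/mat_slt6.png"/> </span>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# Collect the title of every matching span, ignoring spans without one.
titles = [
    span["title"]
    for span in soup.find_all("span", class_="material-part")
    if span.has_attr("title")
]
result = ", ".join(titles)
print(result)
```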
As an alternative to find() / find_all() you could use CSS selectors:
soup.select('span.material-part[title]')
Then iterate over the ResultSet with a list comprehension and join() the titles into a single string:
','.join([t.get('title') for t in soup.select('span.material-part[title]')])
Example
from bs4 import BeautifulSoup
html = '''<div class="col-sm-12 col-md-5">
<div class="material">
<div class="material-parts">
<span class="material-part" title="SLT-4 2435">
<img src="/images/train-material/mat_slt4.png"/> </span>
<span class="material-part" title="SLT-6 2631">
<img src="/images/train-material/mat_slt6.png"/> </span>
</div>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
print(','.join([t.get('title') for t in soup.select('span.material-part[title]')]))
Output
SLT-4 2435,SLT-6 2631

How to get all text in inside div parent with xpath

I want to get all the text inside a div using XPath.
Here is the HTML code:
<div class="JobDescriptionsc__DescriptionContainer-sc-1jylha1-2 dGyoDf">
<div class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf">
<div class="DraftEditor-root">
<div class="DraftEditor-editorContainer">
<div class="public-DraftEditor-content" contenteditable="false" spellcheck="false" style="outline:none;user-select:text;-webkit-user-select:text;white-space:pre-wrap;word-wrap:break-word">
<div data-contents="true">
#All the text is in here
<div class="" data-block="true" data-editor="d54la" data-offset-key="bhkoa-0-0">
<div data-offset-key="bhkoa-0-0" class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr">
<span data-offset-key="bhkoa-0-0" style="font-weight:bold">
<span data-text="true">Job Description:</span>
</span>
</div>
</div>
<div class="" data-block="true" data-editor="d54la" data-offset-key="51e5u-0-0">
<div data-offset-key="51e5u-0-0" class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr">
<span data-offset-key="51e5u-0-0">
<span data-text="true">· Identify & developed application base on predefined business requirements.</span>
</span>
</div>
</div>
...
#there's more, I'm just showing you a few
</div>
</div>
</div>
</div>
</div>
</div>
This is my XPath code:
dom_job.xpath('//*[@class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf"]//text()')
I need all the text inside the parent div using XPath. Is that possible?
I'm assuming the Python module which provides your XPath interpreter supports XPath version 1. Your XPath expression below returns the set of all text nodes which are descendants of the div element:
//*[@class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf"]//text()
You should be able to iterate over that collection of text nodes in Python and concatenate them into a single string.
But it's simpler, if you want the concatenated value of the text nodes within a particular div, to just apply the XPath string() function to the div; e.g.:
string(//*[@class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf"])
See https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string
Note that, in XPath 1, if you apply the string() function to a larger set of nodes (such as the set of text nodes returned by your first query), the function will return the string value of just the first node.
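The three behaviours described above can be compared side by side. This is a sketch assuming lxml as the XPath 1.0 engine; the tiny document and the class name box are made up for illustration:

```python
import lxml.etree as ET

root = ET.fromstring(
    '<root><div class="box"><span>Job Description:</span>'
    '<span> Identify requirements.</span></div></root>'
)

# 1. All descendant text nodes, joined in Python.
text_nodes = root.xpath('//*[@class="box"]//text()')
joined = "".join(text_nodes)

# 2. string() applied to the div: the concatenated text of the whole subtree.
whole = root.xpath('string(//*[@class="box"])')

# 3. string() applied to a node set: only the first node's string value.
first_only = root.xpath('string(//*[@class="box"]//text())')

print(joined)
print(whole)
print(first_only)
```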

Css selector of parent text

I want to get the figure $185,000,000. Is there any way to get the text from the parent tag while avoiding the text from child tags?
<div class="txt-block">
<h4 class="inline">Budget:</h4>
$185,000,000
<span class="attribute">(estimated)</span>
</div>
<div class="txt-block">
<h4 class="inline">Budget:</h4>
<span class="value">$185,000,000</span>
<span class="attribute">(estimated)</span>
</div>
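Yes, you can do this. Simply write
response.css('.txt-block::text').extract_first()
This selects only the text nodes that sit directly inside .txt-block, so the text of the h4 and span children is skipped and you are left with $185,000,000. Depending on the whitespace in the markup, the first such node can be blank; in that case take the first non-blank item from extract() instead. If you put a space between .txt-block and ::text, the text of the child elements is extracted as well.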

how to access elements by path?

I am trying to parse an awful HTML page with BeautifulSoup to retrieve a few pieces of information. The code below:
import bs4
with open("smartradio.html") as f:
    html = f.read()
soup = bs4.BeautifulSoup(html)
x = soup.find_all("div", class_="ue-alarm-status", playerid="43733")
print(x)
extracts the fragments I would like to analyze further:
[<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 1: </div>
<div>allumé</div>
<div>7:00</div>
</div>
<div>
<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>, <div alarmid="ea510709" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 2: </div>
<div>allumé</div>
<div>7:30</div>
</div>
<div>
<div class="ue-alarm-dow">Sa </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>]
I am interested in retrieving:
the hour (line 5 and 14)
the string (days in French) under <div class="ue-alarm-dow">
I believe that for the days it is enough to repeat a find() or find_all(). I am mentioning that because while it grabs the right information, I am not sure that this is the right way to parse the file with BeautifulSoup (but at least it works):
for y in x:
    z = y.find("div", class_="ue-alarm-dow")
    print(z.text)
# output:
# Lu, Ma, Me, Je, Ve
# Sa
I do not know how to get to the hour, though. Is there a way to navigate the tree by path (in the sense that I know that the hour is under the second <div>, three <div> deep)? Or should I do it differently?
You can also rely on the allumé text and get the next sibling div element:
y.find('div', text=u'allumé').find_next_sibling('div').text
or, in a similar manner, relying on the class of the previous div:
y.find('div', class_='ue-alarm-edit').find_next_siblings('div')[1].text
or, using regular expressions:
y.find('div', text=re.compile(r'\d+:\d+')).text
or, just get the div by index:
y.find_all('div')[3].text
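For something closer to navigating "by path", a CSS selector with the child combinator is another option. This is a sketch, assuming a Beautiful Soup version with select_one() and the html.parser backend; the selector relies on the hour being the third div inside its wrapper, and the alarms list is only for illustration:

```python
from bs4 import BeautifulSoup

html = '''<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 1: </div>
<div>allumé</div>
<div>7:00</div>
</div>
<div>
<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

alarms = []
for alarm in soup.find_all("div", class_="ue-alarm-status"):
    # Third div child of a wrapper div: the hour.
    hour = alarm.select_one("div > div:nth-of-type(3)").get_text(strip=True)
    days = alarm.select_one("div.ue-alarm-dow").get_text(strip=True)
    alarms.append((hour, days))

print(alarms)
```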

Extract outer div using BeautifulSoup

If the HTML code looks like this:
<div class="div1">
<p>hello</p>
<p>hi</p>
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
</div>
How do I extract just
<div class="div1">
<p>hello</p>
<p>hi</p>
</div>
I already tried parser.find('div', 'div1') but I'm getting the whole div including the nested one.
You actually want to extract() the nested div from the document and then get the first div. Here is an example (where html is the HTML you provided in the question):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.div.div.extract()
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
>>> soup.div
<div class="div1">
<p>hello</p>
<p>hi</p>
</div>
Why not just find() the nested div and then remove it from the tree using extract()?
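That suggestion could look like the following sketch, which looks the nested div up by class instead of relying on its position. It assumes the bs4 package and the html.parser backend, and uses a shortened version of the markup. Note that extract() modifies the tree in place:

```python
from bs4 import BeautifulSoup

html = '''<div class="div1">
<p>hello</p>
<p>hi</p>
<div class="nesteddiv">
<p>one</p>
<p>two</p>
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", class_="div1")

# Remove the nested div from the tree if it is present.
nested = outer.find("div", class_="nesteddiv")
if nested is not None:
    nested.extract()

remaining = [p.get_text() for p in outer.find_all("p")]
print(remaining)
```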
