Extract all the text from xml data with python

I'm new to xml data processing. I want to extract the text data in the following xml file:
<data>
<p>12345<strong>45667</strong>abcde</p>
</data>
so that the expected result is:
['12345', '45667', 'abcde']
Currently I have tried:
tree = ET.parse('data.xml')
data = tree.getiterator()
text = [data[i].text for i in range(0, len(data))]
But the result only shows ['12345', '45667']; 'abcde' is missing. Can someone help me? Thanks in advance!

Try doing this with XPath and lxml:
import lxml.etree as etree
string = '''
<data>
<p>12345<strong>45667</strong>abcde</p>
</data>
'''
tree = etree.fromstring(string)
print(tree.xpath('//p//text()'))
The XPath expression means: "select every text node that is a descendant, at any depth, of any p element."
Output:
['12345', '45667', 'abcde']

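getiterator() (or its replacement, iter(); getiterator() was removed in Python 3.9) iterates over elements only. abcde is not the text of any element: it is the tail of the strong element, that is, the text that follows its closing tag.
You can use the itertext() method: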
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
print(list(tree.find('p').itertext()))
Prints:
['12345', '45667', 'abcde']
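
To make the text/tail distinction concrete, here is a minimal sketch using only the standard library. It parses the question's document from a string rather than a file (an assumption made so the snippet is self-contained) and prints each piece separately before collecting them with itertext().

```python
import xml.etree.ElementTree as ET

# Same document as in the question, parsed from a string instead of a file.
root = ET.fromstring("<data><p>12345<strong>45667</strong>abcde</p></data>")

p = root.find("p")
strong = p.find("strong")

# .text is the text before an element's first child;
# .tail is the text that follows an element's closing tag.
print(p.text)       # 12345
print(strong.text)  # 45667
print(strong.tail)  # abcde

# itertext() yields text and tail strings in document order.
all_text = list(p.itertext())
print(all_text)     # ['12345', '45667', 'abcde']
```

Reading only .text for every element, as in the question, skips abcde because it is stored in strong.tail.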

Related

Get children elements of multiple instances of the same name tag using ElementTree

I have an xml file looking like this:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<boundary_conditions>
<rot>
<rot_instance>
<name>BC_1</name>
<rpm>200</rpm>
<parts>
<name>rim_FL</name>
<name>tire_FL</name>
<name>disk_FL</name>
<name>center_FL</name>
</parts>
</rot_instance>
<rot_instance>
<name>BC_2</name>
<rpm>100</rpm>
<parts>
<name>tire_FR</name>
<name>disk_FR</name>
</parts>
</rot_instance>
</rot>
</boundary_conditions>
</data>
I know how to extract the data for each instance. For example, for the name tag:
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
names = tree.findall('.//boundary_conditions/rot/rot_instance/name')
for val in names:
    print(val.text)
which gives me:
BC_1
BC_2
But if I do the same thing for the parts tag:
names = tree.findall('.//boundary_conditions/rot/rot_instance/parts/name')
for val in names:
    print(val.text)
It will give me:
rim_FL
tire_FL
disk_FL
center_FL
tire_FR
disk_FR
This combines all the parts/name values into a single sequence. I want the 'parts' sub-elements of each instance as separate lists, like this:
instance_BC_1 = ['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
instance_BC_2 = ['tire_FR', 'disk_FR']
Any help is appreciated,
Thanks.

First find every parts element, then find the name elements inside each one:
parts = tree.findall('.//boundary_conditions/rot/rot_instance/parts')
for part in parts:
    for val in part.findall("name"):
        print(val.text)
    print()
instance_BC_1 = [val.text for val in parts[0].findall("name")]
instance_BC_2 = [val.text for val in parts[1].findall("name")]
print(instance_BC_1)
print(instance_BC_2)
Output:
rim_FL
tire_FL
disk_FL
center_FL

tire_FR
disk_FR

['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
['tire_FR', 'disk_FR']
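
If the number of instances is not known in advance, indexing parts[0] and parts[1] by hand does not scale. One possible generalisation, sketched here under the assumption that each rot_instance has a direct name child that can serve as a key, is to build a dictionary per instance. The variable name parts_by_instance is made up for this example, and the XML is inlined (with its closing tags completed) so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

xml_text = """<data><boundary_conditions><rot>
<rot_instance><name>BC_1</name><rpm>200</rpm>
<parts><name>rim_FL</name><name>tire_FL</name><name>disk_FL</name><name>center_FL</name></parts>
</rot_instance>
<rot_instance><name>BC_2</name><rpm>100</rpm>
<parts><name>tire_FR</name><name>disk_FR</name></parts>
</rot_instance>
</rot></boundary_conditions></data>"""

root = ET.fromstring(xml_text)

# Map each instance's own name to the list of names under its parts element.
parts_by_instance = {}
for instance in root.findall("./boundary_conditions/rot/rot_instance"):
    key = instance.findtext("name")  # direct child only, e.g. 'BC_1'
    parts_by_instance[key] = [n.text for n in instance.findall("parts/name")]

print(parts_by_instance)
```

Because "name" and "parts/name" are resolved relative to each rot_instance, the instance's own name never gets mixed up with the part names.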

Python - Deep XML file for loop

I am working with an XML file that looks like the code below. The real one has many more spreekbeurt sessions, but I trimmed it for readability. My goal is to get the text of the voorvoegsels and achternaam elements from every spreekbeurt session.
<?xml version="1.0" encoding="utf-8"?>
<officiele-publicatie xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://technische-documentatie.oep.overheid.nl/schema/op-xsd-2012-2">
<metadata>
<meta name="OVERHEIDop.externMetadataRecord" scheme="" content="https://zoek.officielebekendmakingen.nl/h-tk-20122013-4-2/metadata.xml" />
</metadata>
<handelingen>
<agendapunt>
<spreekbeurt nieuw="ja">
<spreker>
<voorvoegsels>De heer</voorvoegsels>
<naam>
<achternaam>Marcouch</achternaam>
</naam> (<politiek>PvdA</politiek>):</spreker>
<tekst status="goed">
<al>Sample Text</al>
</tekst>
</spreekbeurt>
</agendapunt>
</handelingen>
</officiele-publicatie>
I use a for loop to go through all the spreekbeurt elements in my XML file. But how do I print the voorvoegsels and achternaam for every spreekbeurt?
import xml.etree.ElementTree as ET
tree = ET.parse('...\directory')
root = tree.getroot()
for spreekbeurt in root.iter('spreekbeurt'):
    print(spreekbeurt.attrib)
This code prints:
{'nieuw': 'nee'}
{'nieuw': 'ja'}
{'nieuw': 'nee'}
{'nieuw': 'nee'}
But how do I print the children of each spreekbeurt?
Thanks in advance!

You can call find() with a path* to the target element to locate a single element within a parent or ancestor, for example:
>>> for spreekbeurt in root.iter('spreekbeurt'):
...     v = spreekbeurt.find('spreker/voorvoegsels')
...     a = spreekbeurt.find('spreker/naam/achternaam')
...     print(v.text, a.text)
...
De heer Marcouch
*) In fact, find() supports more than simple paths: it accepts a subset of XPath 1.0 expressions.
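
One caveat with find(): it returns None when the path matches nothing, so v.text would raise AttributeError for a spreekbeurt that lacks one of the elements. A minimal sketch of a more forgiving variant uses findtext() with a default. The second spreekbeurt below, which has no voorvoegsels, is a hypothetical entry added only to show the fallback, and the document is trimmed to the relevant part.

```python
import xml.etree.ElementTree as ET

xml_text = """<handelingen>
  <spreekbeurt nieuw="ja">
    <spreker>
      <voorvoegsels>De heer</voorvoegsels>
      <naam><achternaam>Marcouch</achternaam></naam>
    </spreker>
  </spreekbeurt>
  <spreekbeurt nieuw="nee">
    <spreker>
      <naam><achternaam>Voorzitter</achternaam></naam>
    </spreker>
  </spreekbeurt>
</handelingen>"""

root = ET.fromstring(xml_text)

# findtext() returns the element's text, or the default if nothing matches.
speakers = [
    (
        sb.findtext("spreker/voorvoegsels", default=""),
        sb.findtext("spreker/naam/achternaam", default=""),
    )
    for sb in root.iter("spreekbeurt")
]
print(speakers)  # [('De heer', 'Marcouch'), ('', 'Voorzitter')]
```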

Adding subElement at a specific location with xml.dom.minidom (appendChild)

I want to insert a sub-element at a specific location, but I do not know how to do that using appendChild in xml.dom.minidom.
Here is my xml code:
<?xml version='1.0' encoding='UTF-8'?>
<VOD>
<root>
<ab>sdsd
<pp>pras</pp>
<ps>sinha</ps>
</ab>
<ab>prashu</ab>
<ab>sakshi</ab>
<cd>dfdf</cd>
</root>
<root>
<ab>pratik</ab>
</root>
<root>
<ab>Mum</ab>
</root>
</VOD>
I would like to insert another sub element "new" in first "root" element just before the "cd" tag. The result should look like this:
<ab>prashu</ab>
<ab>sakshi</ab>
<new>Anydata</new>
<cd>dfdf</cd>
The code I used for this is:
import xml.dom.minidom as m
doc = m.parse("file_notes.xml")
root=doc.getElementsByTagName("root")
valeurs = doc.getElementsByTagName("root")[0]
element = doc.createElement("new")
element.appendChild(doc.createTextNode("Anydata"))
valeurs.appendChild(element)
with open("newxmlfile.xml", "w") as f:
    doc.writexml(f)
How can I achieve my goal?
Thank you in advance!

Try using insertBefore instead. Something along these lines:
element = doc.createElement("new")
element.appendChild(doc.createTextNode("Anydata"))
cd = doc.getElementsByTagName("cd")[0]
cd.parentNode.insertBefore(element, cd)
To insert before the cd element at a particular index, index into the node list:
cd_list = doc.getElementsByTagName("cd")
cd_list[0].parentNode.insertBefore(element, cd_list[0])
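
For a version that can be run end to end, here is a small sketch that applies insertBefore to a trimmed copy of the first root element. It uses parseString on an inline string instead of parsing file_notes.xml, which is an assumption made only to keep the example self-contained.

```python
import xml.dom.minidom as m

# Trimmed, whitespace-free copy of the first <root> element from the question.
doc = m.parseString(
    "<VOD><root><ab>prashu</ab><ab>sakshi</ab><cd>dfdf</cd></root></VOD>"
)

# Build <new>Anydata</new>.
element = doc.createElement("new")
element.appendChild(doc.createTextNode("Anydata"))

# insertBefore(newChild, refChild) is called on the parent of the reference node.
cd = doc.getElementsByTagName("cd")[0]
cd.parentNode.insertBefore(element, cd)

result = doc.documentElement.toxml()
print(result)
# <VOD><root><ab>prashu</ab><ab>sakshi</ab><new>Anydata</new><cd>dfdf</cd></root></VOD>
```

appendChild always adds at the end of the parent's children, which is why the original attempt placed new after cd.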

Using lxml to add a string as a sub element

I have an lxml element with children built like this:
xml = etree.Element('presentation')
format_xml = etree.SubElement(xml, 'format')
content_xml = etree.SubElement(xml, 'slides')
I then have several strings that I would like to iterate over, adding each one as a child element of slides. Each string will be something like this:
<slide1>
<title>My Presentation</title>
<subtitle>A sample presentation</subtitle>
<phrase>Some sample text
<subphrase>Some more text</subphrase>
</phrase>
</slide1>
How can I append these strings as children to the slides element?

Just append:
import lxml.etree as etree
xml = etree.Element('presentation')
format_xml = etree.SubElement(xml, 'format')
content_xml = etree.SubElement(xml, 'slides')
new = """<slide1>
<title>My Presentation</title>
<subtitle>A sample presentation</subtitle>
<phrase>Some sample text
<subphrase>Some more text</subphrase>
</phrase>
</slide1>"""
content_xml.append(etree.fromstring(new))
print(etree.tostring(xml, pretty_print=True).decode())
Which will give you:
<presentation>
<format/>
<slides>
<slide1>
<title>My Presentation</title>
<subtitle>A sample presentation</subtitle>
<phrase>Some sample text
<subphrase>Some more text</subphrase>
</phrase>
</slide1>
</slides>
</presentation>

The fromstring() function parses an XML string directly into an Element instance, which you can then append:
from lxml import etree as ET
slide = ET.fromstring(xml_string)
content_xml.append(slide)
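
Since the question mentions several strings, the same parse-then-append step can go in a loop. The sketch below swaps in the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra packages; the fromstring/append calls are used the same way as in the answers above. The list slide_strings and its contents are made up for illustration.

```python
import xml.etree.ElementTree as ET

xml = ET.Element("presentation")
format_xml = ET.SubElement(xml, "format")
content_xml = ET.SubElement(xml, "slides")

# Hypothetical input: one XML fragment per slide.
slide_strings = [
    "<slide1><title>My Presentation</title></slide1>",
    "<slide2><title>Second slide</title></slide2>",
]

# Parse each string into an Element and attach it under <slides>.
for slide_string in slide_strings:
    content_xml.append(ET.fromstring(slide_string))

print([child.tag for child in content_xml])  # ['slide1', 'slide2']
```

Each string has to be a well-formed fragment with a single root element, otherwise fromstring() raises a parse error.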

Just returning the text of elements in xpath (python / lxml)

I have an XML structure like this:
mytree = """
<path>
<to>
<nodes>
<info>1</info>
<info>2</info>
<info>3</info>
</nodes>
</to>
</path>
"""
I'm currently using xpath in python lxml to grab the nodes:
>>> from lxml import etree
>>> info = etree.XML(mytree)
>>> print(info.xpath("/path/to/nodes/info"))
[<Element info at 0x15af620>, <Element info at 0x15af940>, <Element info at 0x15af850>]
>>> for x in info.xpath("/path/to/nodes/info"):
...     print(x.text)
...
1
2
3
This is great, but is there a cleaner way to grab just the internal texts as a list, rather than having to write the for-loop afterwards?
Something like:
print(info.xpath("/path/to/nodes/info/text"))
(but that doesn't work)

You can use text(), which selects the text nodes of each info element, so xpath() returns a list of strings:
print(info.xpath("/path/to/nodes/info/text()"))
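
If lxml is not available, note that the standard library's ElementTree only implements a limited XPath subset and, as far as I know, does not accept text() as a path step. A rough stdlib equivalent, offered as a sketch rather than a drop-in replacement, is a list comprehension over findall(). The path is written relative to the root element path, and the document is inlined in compact form.

```python
import xml.etree.ElementTree as ET

mytree = "<path><to><nodes><info>1</info><info>2</info><info>3</info></nodes></to></path>"

info = ET.fromstring(mytree)

# findall() paths are relative to the element they are called on (here <path>).
texts = [node.text for node in info.findall("to/nodes/info")]
print(texts)  # ['1', '2', '3']
```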
