Python XML check next item - python

Here is a little xml example:
<?xml version="1.0" encoding="UTF-8"?>
<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
...
...
</list>
Now I need all Persons with a name and city.
I tried:
#!/usr/bin/python
# coding: utf8
import xml.dom.minidom as dom
tree = dom.parse("test.xml")
for listItems in tree.firstChild.childNodes:
    for personItems in listItems.childNodes:
        if personItems.nodeName == "name" and personItems.nextSibling == "city":
            print personItems.firstChild.data.strip()
But the output is empty. Without the "and" condition I get all names. How can I check that the next tag after "name" is "city"?

You can do this in minidom:
import xml.dom.minidom as minidom
def getChild(n, v):
    # Yield the direct children of n whose local name is v
    for child in n.childNodes:
        if child.localName == v:
            yield child
xmldoc = minidom.parse('test.xml')
person = getChild(xmldoc, 'list')
for p in person:
    for v in getChild(p, 'person'):
        attr = v.getAttributeNode('id')
        if attr:
            print(attr.nodeValue.strip())
This prints id of person nodes:
1
2
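As for why the condition in the question never matched: nextSibling returns a node object (or None), not a string, so comparing it to "city" is always false. In an indented file, the next sibling of a name element is also usually a whitespace text node rather than the city element. A minimal sketch, assuming the sample document from the question and a hypothetical helper named next_element that skips non-element siblings:

```python
import xml.dom.minidom as minidom

XML = """<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
</list>"""


def next_element(node):
    """Return the next sibling that is an element, skipping text nodes (hypothetical helper)."""
    sibling = node.nextSibling
    while sibling is not None and sibling.nodeType != sibling.ELEMENT_NODE:
        sibling = sibling.nextSibling
    return sibling


def names_followed_by_city(doc):
    """Collect the text of every <name> whose next element sibling is <city>."""
    result = []
    for name in doc.getElementsByTagName("name"):
        following = next_element(name)
        if following is not None and following.nodeName == "city":
            result.append(name.firstChild.data.strip())
    return result


doc = minidom.parseString(XML)
print(names_followed_by_city(doc))
```

Comparing nodeName (a string) instead of the node itself is the key change; the loop over siblings only exists to step past whitespace.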

Use ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('a.xml')
root = tree.getroot()
for person in root.findall('person'):
    name = person.find('name')
    city = person.find('city')
    # Skip persons that lack either a <name> or a <city> child
    if name is None or city is None:
        continue
    print(name.text, city.text)
To get the id, use person.get('id').
Output: Smith New York

Using lxml, you can use xpath to get in one step what you need:
from lxml import etree
xmlstr = """
<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
</list>
"""
xml = etree.fromstring(xmlstr)
xp = "//person[city]"
for person in xml.xpath(xp):
    print(etree.tostring(person).decode())
lxml is an external Python package, but it is so useful that I find it always worth installing.
The XPath expression selects any (//) person element that has (as declared by the content of []) a city subelement. To also require a name, use "//person[name and city]".
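If installing lxml is not an option, the same predicate idea is available in the standard library: ElementTree supports a limited XPath subset that includes the [tag] predicate, and predicates can be chained. A sketch, assuming the sample document from the question:

```python
import xml.etree.ElementTree as ET

XML = """<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
</list>"""

root = ET.fromstring(XML)

# [name][city] keeps only <person> elements that have both children
matches = [
    (person.findtext("name"), person.findtext("city"))
    for person in root.findall(".//person[name][city]")
]
print(matches)
```

ElementTree does not support the "and" operator inside a predicate, which is why two chained predicates are used here instead of "[name and city]".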

Related

Export information from child nodes in xml using Python

I have an xml file called persons.xml in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
</persons>
I want to export the list of person names, along with the city names, to a file:
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./persons.xml')
root = tree.getroot()
df_cols = ["person_name", "city_name"]
rows = []
for node in root:
    person_name = node.attrib.get("name")
    rows.append({"person_name": person_name})
out_df = pd.DataFrame(rows, columns = df_cols)
out_df
Obviously this part of the code will only work for obtaining the name as it’s part of the root, but I can’t figure out how to loop through the child nodes too and obtain this info. Do I need to append something to root to iterate over the child nodes?
I can obtain everything using root.getchildren but it doesn’t allow me to return only the child nodes:
children = root.getchildren()
for child in children:
    ET.dump(child)
Is there a good way to get this information?
See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York" />
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles" />
</person>
</persons>'''
root = ET.fromstring(xml)
data = []
for p in root.findall('.//person'):
    data.append({'person': p.attrib['name'], 'city': p.find('city').attrib['name']})
df = pd.DataFrame(data)
print(df)
Output:
   person         city
0    John     New York
1    Mary  Los Angeles
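The question also asks for a file export, and the loop above raises an AttributeError if a person has no city child. A minimal sketch of both points using only the standard library; the third person ("Ann", without a city) and the helper name person_rows are made up for illustration, and an in-memory buffer stands in for the output file:

```python
import csv
import io
import xml.etree.ElementTree as ET

XML = """<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
<person id="3" name="Ann"/>
</persons>"""


def person_rows(root):
    """Build one dict per <person>; a missing <city> yields an empty city name."""
    rows = []
    for person in root.findall("person"):
        city = person.find("city")
        rows.append({
            "person_name": person.get("name"),
            "city_name": city.get("name") if city is not None else "",
        })
    return rows


rows = person_rows(ET.fromstring(XML))

# Write CSV into a buffer; a real file opened with newline="" works the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["person_name", "city_name"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

If pandas is already in use, passing the same rows to pd.DataFrame and calling to_csv with index=False should give an equivalent file.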

Issues indexing data extract from xml

I'm working with some XML data that will ultimately be loaded into a CSV. I am experiencing an issue with properly indexing the data when an element doesn't exist in an entry. Below is a simple XML example of what I am working with:
<root>
<entry>
<LASTNAME>Doe</LASTNAME>
<FIRSTNAME>Jon</FIRSTNAME>
<GENDER>M</GENDER>
</entry>
<entry>
<LASTNAME>Doe</LASTNAME>
<FIRSTNAME>Jane</FIRSTNAME>
<GENDER>F</GENDER>
<HAIRCOLOR>Blonde</HAIRCOLOR>
</entry>
</root>
The output I end up getting is as follows:
LASTNAME  FIRSTNAME  GENDER  HAIRCOLOR
Doe       Jon        M       Blonde
Doe       Jane       F
But the correct output should be:
LASTNAME  FIRSTNAME  GENDER  HAIRCOLOR
Doe       Jon        M
Doe       Jane       F       Blonde
So I seem to have an indexing problem: the first few times HAIRCOLOR is searched for (depending on how many HAIRCOLOR elements are present on the page), the search continues down the XML until it finds one, when it should stop at the end of the current entry.
Here's the code I am working with:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError
from xml.etree import ElementTree
bytes_ = '''
<root>
<entry>
<LASTNAME>Doe</LASTNAME>
<FIRSTNAME>Jon</FIRSTNAME>
<GENDER>M</GENDER>
</entry>
<entry>
<LASTNAME>Doe</LASTNAME>
<FIRSTNAME>Jane</FIRSTNAME>
<GENDER>F</GENDER>
<HAIRCOLOR>Blonde</HAIRCOLOR>
</entry>
</root>
'''
xpaths = [
"./entry/LASTNAME",
"./entry/FIRSTNAME",
"./entry/GENDER",
"./entry/HAIRCOLOR"
]
data = []
_fields = [
{'text' : ''},
{'text' : ''},
{'text' : ''},
{'text' : ''}
]
root = ET.fromstring(bytes_)
for count in range(0,len(root.findall("./entry"))):
    for ele, xpath in enumerate(xpaths):
        try:
            attribs = list(root.findall(xpath)[count].attrib.keys())
            for attrib in attribs:
                for i in _fields[ele].keys():
                    if attrib == i:
                        _fields[ele][i] = root.findall(xpath)[count].attrib[attrib]
            _fields[ele]["text"] =root.findall(xpath)[count].text
        except IndexError:
            _fields[ele]["text"]=''
        data.append(_fields[ele].values())
    data_list = [item for sublist in data for item in sublist]
    data.clear()
    print(data_list)
Any help is appreciated.
Edited for clarity
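Your XPath './entry/HAIRCOLOR' is evaluated against root, so it matches the HAIRCOLOR tag of any entry, and indexing the result with count picks the count-th HAIRCOLOR in the whole document rather than the one belonging to the count-th entry. You need to first find each './entry' and then, for each entry, look for HAIRCOLOR and the other tags within that entry. At the moment you're looking in root for the sub-tags.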
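One way to follow that advice is sketched below, assuming the sample XML from the question. It iterates over the entry elements first and looks each field up relative to the current entry; findtext with a default of '' supplies the empty value for a missing tag. The names FIELDS and rows are illustrative, not from the original code:

```python
import xml.etree.ElementTree as ET

XML = """<root>
<entry>
<LASTNAME>Doe</LASTNAME>
<FIRSTNAME>Jon</FIRSTNAME>
<GENDER>M</GENDER>
</entry>
<entry>
<LASTNAME>Doe</LASTNAME>
<FIRSTNAME>Jane</FIRSTNAME>
<GENDER>F</GENDER>
<HAIRCOLOR>Blonde</HAIRCOLOR>
</entry>
</root>"""

FIELDS = ["LASTNAME", "FIRSTNAME", "GENDER", "HAIRCOLOR"]

root = ET.fromstring(XML)

# Each lookup is relative to one <entry>, so a missing tag cannot
# pick up a value from a later entry.
rows = [
    [entry.findtext(tag, default="") for tag in FIELDS]
    for entry in root.findall("./entry")
]

for row in rows:
    print(row)
```

The resulting lists can be passed to csv.writer as they are, with FIELDS as the header row.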

Parse XML with childs that have different tags in Python

I am trying to parse the following XML data from a file with Python, to print only the entries that have a "zip-code" tag, together with their name attribute:
<response status="success" code="19"><result total-count="1" count="1">
<address>
<entry name="studio">
<zip-code>14407</zip-code>
<description>Nothing</description>
</entry>
<entry name="mailbox">
<zip-code>33896</zip-code>
<description>Nothing</description>
</entry>
<entry name="garage">
<zip-code>33746</zip-code>
<description>Tony garage</description>
</entry>
<entry name="playstore">
<url>playstation.com</url>
<description>game download</description>
</entry>
<entry name="gym">
<zip-code>33746</zip-code>
<description>Getronics NOC subnet 2</description>
</entry>
<entry name="e-cigars">
<url>vape.com/24</url>
<description>vape juices</description>
</entry>
</address>
</result></response>
The python code that I am trying to run is
from xml.etree import ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
items = root.iter('entry')
for item in items:
    zip = item.find('zip-code').text
    names = (item.attrib)
    print(' {} {} '.format(
        names, zip
    ))
However, it fails once it gets to the items without a "zip-code" tag.
How could I make this run?
Thanks in advance
As @AmitaiIrron suggested, XPath can help here.
This code searches the document for elements named zip-code and steps back up to get the parent of each one. From there, you can get the name attribute and pair it with the text of the zip-code element:
for ent in root.findall(".//zip-code/.."):
    print(ent.attrib.get('name'), ent.find('zip-code').text)
studio 14407
mailbox 33896
garage 33746
gym 33746
Or, as a dictionary:
{ent.attrib.get('name'): ent.find('zip-code').text
 for ent in root.findall(".//zip-code/..")}
{'studio': '14407', 'mailbox': '33896', 'garage': '33746', 'gym': '33746'}
Your loop should look like this:
# Find all <entry> tags in the hierarchy
for item in root.findall('.//entry'):
    # Try finding a <zip-code> child
    zipc = item.find('./zip-code')
    # If found a child, print data for it
    if zipc is not None:
        names = (item.attrib)
        print(' {} {} '.format(
            names, zipc.text
        ))
It's all a matter of learning to use xpath properly when searching through the XML tree.
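If you don't mind using regular expressions, the following is a possible shortcut, although it is fragile on this input:
import re
with open('file.xml', 'r') as f:
    file = f.read()
pattern = r'name="(.*?)".*?<zip-code>(.*?)<\/zip-code>'
matches = re.findall(pattern, file, re.S)
for m in matches:
    print("{} {}".format(m[0], m[1]))
It produces the result:
studio 14407
mailbox 33896
garage 33746
playstore 33746
Note that the last line is wrong: because .*? is allowed to span entries, the name "playstore" (which has no zip-code) is paired with the zip-code of the following "gym" entry. An XML parser avoids this.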

How to create a subset of document using lxml?

Suppose you have an lxml.etree element with contents like:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>
I can use the find or xpath methods to get an element that renders as something like:
<element1>
<subelement1>blabla</subelement1>
</element1>
Is there a simple way to get:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
i.e. the element of interest plus all its ancestors up to the document root?
I am not sure there is something built-in for it, but here is a terrible, "don't ever use it in real life" type of a workaround using the iterancestors() parent iterator:
from lxml import etree as ET
data = """<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>"""
root = ET.fromstring(data)
element = root.find(".//subelement1")
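# encoding="unicode" returns str rather than bytes, so format() works on Python 3
result = ET.tostring(element, encoding="unicode")
for node in element.iterancestors():
    result = "<{name}>{text}</{name}>".format(name=node.tag, text=result)
print(ET.tostring(ET.fromstring(result), encoding="unicode", pretty_print=True))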
Prints:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
The following code removes elements that don't have any subelement1 descendants and are not named subelement1.
from lxml import etree
tree = etree.parse("input.xml") # First XML document in question
# Iterate over a snapshot, since elements are removed during the loop
for elem in list(tree.iter()):
    if elem.xpath("not(.//subelement1)") and not(elem.tag == "subelement1"):
        if elem.getparent() is not None:
            elem.getparent().remove(elem)
print(etree.tostring(tree, encoding="unicode"))
Output:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
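A non-destructive alternative is to build a new tree instead of pruning or string-formatting the original. With lxml, getparent() or iterancestors() give the ancestors directly. The sketch below instead uses only the standard library's ElementTree, which has no parent pointers, so it builds a child-to-parent map first. The function name subset_with_ancestors is hypothetical, and the code assumes the sample document from the question:

```python
import copy
import xml.etree.ElementTree as ET

XML = """<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>"""


def subset_with_ancestors(root, target):
    """Return a copy of target wrapped in shallow copies of its ancestors."""
    # ElementTree elements do not know their parent, so map child -> parent
    parent_of = {child: parent for parent in root.iter() for child in parent}
    current = copy.deepcopy(target)
    current.tail = None  # drop whitespace that followed the element in the source
    node = target
    while node in parent_of:
        node = parent_of[node]
        wrapper = ET.Element(node.tag, node.attrib)  # tag and attributes only
        wrapper.append(current)
        current = wrapper
    return current


root = ET.fromstring(XML)
subset = subset_with_ancestors(root, root.find("element1"))
print(ET.tostring(subset, encoding="unicode"))
```

The original tree is left untouched, and sibling branches such as element2 never enter the result because only the ancestor chain is copied.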

How to add xml nodes in python using ElementTree

I have an XML file like:
<data>
<person>
<Name>xyz</Name>
<add>abc</add>
</person>
</data>
I want to add another person node, like:
<data>
<person>
<Name>xyz</Name>
<add>abc</add>
</person>
<person>
<Name>def</Name>
</person>
</data>
My current Python code is:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element
from xml.etree.ElementTree import ElementTree
root = ET.parse("Lexicon.xml").getroot()
creRoot = Element("person")
creDictionary = Element("Name")
creDictionary.text = "def"
creRoot.append(creDictionary)
print(ET.tostring(creRoot))
creTree= ElementTree(creRoot)
creTree.write("Lexicon.xml")
When I run this code, it creates a new XML file rather than adding to the existing one, and the result is:
<person>
<Name>def</Name>
</person>
All the previous data is removed.
Can anyone help solve this? Thanks in advance.
Use SubElement to add nodes to an existing node:
import xml.etree.ElementTree as etree
# input_data is assumed to hold the XML document as a string
data = etree.XML(input_data)
person = etree.SubElement(data, 'person')
name = etree.SubElement(person, 'Name')
name.text = 'def'
print(etree.tostring(data))
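The data loss in the question comes from writing a tree that contains only the new person element. A minimal sketch of the full round trip follows: parse the existing file, add to its root, and write the same tree back. The helper name add_person is hypothetical, and a temporary file stands in for Lexicon.xml so the example is self-contained:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

ORIGINAL = "<data><person><Name>xyz</Name><add>abc</add></person></data>"


def add_person(path, name):
    """Append <person><Name>name</Name></person> to the file's root and save it."""
    tree = ET.parse(path)
    root = tree.getroot()
    person = ET.SubElement(root, "person")
    ET.SubElement(person, "Name").text = name
    # Write the whole tree, which still holds the existing persons
    tree.write(path, encoding="utf-8", xml_declaration=True)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Lexicon.xml")  # stand-in for the real file
    with open(path, "w", encoding="utf-8") as f:
        f.write(ORIGINAL)
    add_person(path, "def")
    names = [n.text for n in ET.parse(path).getroot().iter("Name")]

print(names)
```

The important difference from the question's code is that tree.write is called on the tree that was parsed from the file, not on a new ElementTree built around the new element.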
We need to append the newly created element to its respective parent element.
Demo:
>>> import xml.etree.ElementTree as ET
>>> input_data = """<data>
... <person>
... <Name>xyz</Name>
... <add>abc</add>
... </person>
... </data>"""
#- Create new Element.
>>> person_tag = ET.Element("person")
>>> name_tag = ET.Element("Name")
#- Add text to Element.
>>> name_tag.text = "def"
#- Append Element to Parent Element.
>>> person_tag.append(name_tag)
>>>
#- Just print Parent Element
>>> ET.tostring(person_tag)
'<person><Name>def</Name></person>'
>>>
>>>
#- Create ET object with fromstring.
>>> root = ET.fromstring(input_data)
>>>
#- Append above element to root element
>>> root.append(person_tag)
#- Print root Element.
>>> print ET.tostring(root)
<data>
<person>
<Name>xyz</Name>
<add>abc</add>
</person>
<person><Name>def</Name></person></data>
>>> print ET.tostring(root, method="xml")
<data>
<person>
<Name>xyz</Name>
<add>abc</add>
</person>
<person><Name>def</Name></person></data>
>>>
Note: Best to use lxml b
