Get values from an xml file with python - python

Context:
I want to obtain information from an xml file. when doing print
"<Element 'vocation' at 0x000001C9C7849040>" I do not want that message to appear but the value
Original file XML:
<?xml version="1.0"?>
<spells>
<instant name="spell 1">
<vocation name="men 50"/>
<vocation name="men 100"/>
</instant>
<instant name="spells 2">
<vocation name="woman 50"/>
<vocation name="woman 100"/>
</instant>
</spells>
python code proposition
import xml.etree.ElementTree as ET
tree = ET.parse('D:\\desktop\\Archivos\\compilaciones\\1.5\\800\\data\\spells\\spells.xml')
root = tree.getroot()
for i in root.findall('instant'):
name_spells = i.get('name')
value_name_vocation = i.find('.//vocation[last()]')
print(name_spells, value_name_vocation)
Output Problem(<Element 'vocation' at 0x000001C9C7849040>)
spells1 <Element 'vocation' at 0x000001C9C7AB9810>
spells2 <Element 'vocation' at 0x000001C9C7849040>
desirable output
spells1 men 100
spells2 woman 100
I was searching and some functions don't work.
.text -> AttributeError: 'NoneType' object has no attribute 'text'
->'NoneType' object has no attribute 'iter'
I am open to using other libraries

Since you have the element object already you just need to get its name from within it. Like you already do with your counter variable i.
import xml.etree.ElementTree as ET
tree = ET.parse('D:\\desktop\\Archivos\\compilaciones\\1.5\\800\\data\\spells\\spells.xml')
root = tree.getroot()
for i in root.findall('instant'):
name_spells = i.get('name')
value_name_vocation = i.find('.//vocation[last()]')
print(name_spells, value_name_vocation.get('name'))

Related

how to parse XML with namespace and attribute in Python?

hi I am trying to parse xml with namespace and attribute.
I am almost close by using root.findall() and .get()
However still struggling to get the accurate values from xml file.
How to get the xml attribute values ?
Input:
<?xml version="1.0" encoding="UTF-8"?><message:GenericData
xmlns:message="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message"
xmlns:common="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:generic="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic"
xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message https://sdw-
wsrest.ecb.europa.eu:443/vocabulary/sdmx/2_1/SDMXMessage.xsd
http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common https://sdw-
wsrest.ecb.europa.eu:443/vocabulary/sdmx/2_1/SDMXCommon.xsd
http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic https://sdw-
wsrest.ecb.europa.eu:443/vocabulary/sdmx/2_1/SDMXDataGeneric.xsd">
<generic:Obs>
<generic:ObsDimension value="1999-01"/>
<generic:ObsValue value="0.7029125"/>
</generic:Obs>
<generic:Obs>
<generic:ObsDimension value="1999-02"/>
<generic:ObsValue value="0.688505"/>
</generic:Obs>
Code:
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
for x in root.findall('.//'):
print(x.tag, " ", x.get('value'))
Output:
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}Obs None
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}ObsDimension 1999-01
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}ObsValue 0.7029125
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}Obs None
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}ObsDimension 1999-02
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}ObsValue 0.688505
Expected_Output:
1999-01 0.7029125
1999-02 0.688505
How about this:
for parent in root:
print(' '.join([child.get('value', "") for child in parent]))

how to check if an attribute <Reporting_date> YYYYMMDD </Reporting_Date> in a .xml file is equal to a fixed Date value

I am new to python and wondering how to solve below use-case using python script in the shell script.
I have a shell script which holds variables as file name, ODATE value which is fixed in the format YYYYMMDD. It does some checks on file name.
In the next step I want to run a python script which actually checks the attribute <Reporting_date> YYYYMMDD </Reporting_Date> value for every occurrence in the test.xml file is equal to a fixed value from ODATE.
If all the values for <Reporting_date> YYYYMMDD </Reporting_Date> are matching the ODATE then print message "all the attributes are matched".
If we find one mismatch while scanning multiple records in test.xml file, on the 1st mismatch itself stop scanning the entire test.xml file, and print message "Encountered a mismatch"
Could anyone of you please guide me with this use-case. It would be much helpful.
Many thanks in advance.
How about something like this (I haven't test this code and I don't know structure of your XML file, but you should get an idea).
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
ODATE = '20210215'
allok = True
for child in root.iter('Reporting_date'):
if child.text != ODATE:
allok = False
print ("Encountered a mismatch")
break
if allok:
print("All attributes are matched")
For more information see https://docs.python.org/3/library/xml.etree.elementtree.html
--- EDIT using findall instead of iter (see comments) ---
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
ODATE = '20210215'
allok = True
for directchild in root:
for child in directchild.findall("REPORTING_DATE"):
# this supports multiple reporting_date in one POSITION structure.
# If you know that there always is only one, find would be better.
if child.text != ODATE:
allok = False
print ("Encountered a mismatch")
break
if allok:
print("All attributes are matched")
See below
import xml.etree.ElementTree as ET
XML = '''<?xml version="1.0" encoding="UTF-8"?>
<POSITIONS>
<POSITION>
<ISIN>aaaaaaa</ISIN>
<ACCOUNT>7777777</ACCOUNT>
<POSITION>11111</POSITION>
<SETTLEMENT_DATE>20210202</SETTLEMENT_DATE>
<REPORTING_DATE>20210202</REPORTING_DATE>
</POSITION>
<POSITION>
<ISIN>bbbbbbb</ISIN>
<ACCOUNT>66666666</ACCOUNT>
<POSITION>888888888</POSITION>
<SETTLEMENT_DATE>20210203</SETTLEMENT_DATE>
<REPORTING_DATE>20210215</REPORTING_DATE>
</POSITION>
</POSITIONS>'''
ODATE = '20210215'
root = ET.fromstring(XML)
dates = root.findall('.//REPORTING_DATE')
for date in dates:
if date.text == ODATE:
print(f'ODATE {ODATE} and REPORTING_DATE {date.text} are the same dates')
else:
print(f'ODATE {ODATE} and REPORTING_DATE {date.text} are NOT the same dates')

Iterating through xml file

I am trying to get all surnames from xml file, but if I am trying to use find, It throws an exception
TypeError: 'NoneType' object is not iterable
This is my code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for elem in root:
for subelem in elem:
for subsubelem in subelem.find('surname'):
print(subsubelem.text)
When I remove the find('surname') from code, It returning all texts from subsubelements.
This is xml:
<?xml version="1.0" encoding="UTF-8"?>
<pp:card xmlns:pp="http://xmlns.page.com/path/subpath">
<pp:id>1</pp:id>
<pp:customers>
<pp:customer>
<pp:name>John</pp:name>
<pp:surname>Walker</pp:surname>
<pp:adress>
<pp:street>Walker street</pp:street>
<pp:number>1/1</pp:number>
<pp:state>England</pp:state>
</pp:adress>
<pp:created>2021-03-08Z</pp:created>
</pp:customer>
<pp:customer>
<pp:name>Michael</pp:name>
<pp:surname>Jordan</pp:surname>
<pp:adress>
<pp:street>Jordan street</pp:street>
<pp:number>28</pp:number>
<pp:state>USA</pp:state>
</pp:adress>
<pp:created>2021-03-09Z</pp:created>
</pp:customer>
</pp:customers>
</pp:card>
How should I fix it?
Not really a python person, but should the "find" statement include the "pp:" in its search, such as,
find('pp:surname')
Neither the opening nor closing tags actually match "surname".
Use the namespace when you call findall
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<pp:card xmlns:pp="http://xmlns.page.com/path/subpath">
<pp:id>1</pp:id>
<pp:customers>
<pp:customer>
<pp:name>John</pp:name>
<pp:surname>Walker</pp:surname>
<pp:adress>
<pp:street>Walker street</pp:street>
<pp:number>1/1</pp:number>
<pp:state>England</pp:state>
</pp:adress>
<pp:created>2021-03-08Z</pp:created>
</pp:customer>
<pp:customer>
<pp:name>Michael</pp:name>
<pp:surname>Jordan</pp:surname>
<pp:adress>
<pp:street>Jordan street</pp:street>
<pp:number>28</pp:number>
<pp:state>USA</pp:state>
</pp:adress>
<pp:created>2021-03-09Z</pp:created>
</pp:customer>
</pp:customers>
</pp:card>'''
ns = {'pp': 'http://xmlns.page.com/path/subpath'}
root = ET.fromstring(xml)
names = [sn.text for sn in root.findall('.//pp:surname', ns)]
print(names)
output
['Walker', 'Jordan']

Extracting Child XML using ElementTree ignoring Namespace

I have the following XML that I would like to extract a portion of the child if name matches "Adam"
<data>
<a:config version="1.0" xmlns:a="uri:abc.com/a" xmlns:b="uri:abc.com/b">
<a:xxx config="ABC">
<set>option_on</set>
<location>/123/123</location>
<data>123</data>
</a:xxx>
<a:xxx name="Adam">
<a:yyy value="5555-5555">
<log>true</log>
</a:yyy>
</a:xxx>
<a:xxx name="Lisa">
<a:yyy value="2222-2222">
<log>false</log>
</a:yyy>
</a:xxx>
</a:config>
</data>
I manage to extract the section but it doesn't output the original namespace rather it is showing ns0 and ns1. Below is my code
import xml.etree.ElementTree as ET
tree2 = ET.parse("mycode.xml")
root2= tree2.getroot()
for elem in tree2.iter(tag='{uri:abc.com/a}xxx'):
match = elem.get('name')
if match == "Adam":
bla = ET.dump(elem)
Output as follows: -
<ns0:xxx xmlns:ns0="uri:abc.com/a" name="Adam">
<ns0:yyy value="5555-5555">
<log>true</log>
</ns0:yyy>
</ns0:xxx>
I am hoping to get exactly as what the original document is:-
<a:xxx name="Adam">
<a:yyy value="5555-5555">
<log>true</log>
</a:yyy>
</a:xxx>
Use the register_namespace function.
import xml.etree.ElementTree as ET
tree2 = ET.parse("mycode.xml")
root2 = tree2.getroot()
# Register the 'a' prefix to be used when serializing
ET.register_namespace("a", "uri:abc.com/a")
for elem in tree2.iter(tag='{uri:abc.com/a}xxx'):
match = elem.get('name')
if match == "Adam":
bla = ET.dump(elem)
Output:
<a:xxx xmlns:a="uri:abc.com/a" name="Adam">
<a:yyy value="5555-5555">
<log>true</log>
</a:yyy>
</a:xxx>
This is not the exact output that you asked for. You cannot force ElementTree to omit the namespace declaration (because doing so would make the output ill-formed).

Reading Maven Pom xml in Python

I have a pom file that has the following defined:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.welsh</groupId>
<artifactId>my-site</artifactId>
<version>1.0.0</version>
<packaging>pom</packaging>
<profiles>
<profile>
<build>
<plugins>
<plugin>
<groupId>org.welsh.utils</groupId>
<artifactId>site-tool</artifactId>
<version>1.0</version>
<executions>
<execution>
<configuration>
<mappings>
<property>
<name>homepage</name>
<value>/content/homepage</value>
</property>
<property>
<name>assets</name>
<value>/content/assets</value>
</property>
</mappings>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
</profiles>
</project>
And I am looking to build a dictionary off the name & value elements under property under the mappings element.
So what I'm trying to figure out how to get all possible mappings elements (Incase of multiple build profiles) so I can get all property elements under it and from reading about Supported XPath syntax the following should print out all possible text/value elements:
import xml.etree.ElementTree as xml
pomFile = xml.parse('pom.xml')
root = pomFile.getroot()
for mapping in root.findall('*/mappings'):
for prop in mapping.findall('.//property'):
logging.info(prop.find('name').text + " => " + prop.find('value').text)
Which is returning nothing. I tried just printing out all the mappings elements and get:
>>> print root.findall('*/mappings')
[]
And when I print out the everything from root I get:
>>> print root.findall('*')
[<Element '{http://maven.apache.org/POM/4.0.0}modelVersion' at 0x10b38bd50>, <Element '{http://maven.apache.org/POM/4.0.0}groupId' at 0x10b38bd90>, <Element '{http://maven.apache.org/POM/4.0.0}artifactId' at 0x10b38bf10>, <Element '{http://maven.apache.org/POM/4.0.0}version' at 0x10b3900d0>, <Element '{http://maven.apache.org/POM/4.0.0}packaging' at 0x10b390110>, <Element '{http://maven.apache.org/POM/4.0.0}name' at 0x10b390150>, <Element '{http://maven.apache.org/POM/4.0.0}properties' at 0x10b390190>, <Element '{http://maven.apache.org/POM/4.0.0}build' at 0x10b390310>, <Element '{http://maven.apache.org/POM/4.0.0}profiles' at 0x10b390390>]
Which made me try to print:
>>> print root.findall('*/{http://maven.apache.org/POM/4.0.0}mappings')
[]
But that's not working either.
Any suggestions would be great.
Thanks,
The main issues of the code in the question are
that it doesn't specify namespaces, and
that it uses */ instead of // which only matches direct children.
As you can see at the top of the XML file, Maven uses the namespace http://maven.apache.org/POM/4.0.0. The attribute xmlns in the root node defines the default namespace. The attribute xmlns:xsi defines a namespace that is only used for xsi:schemaLocation.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
To specify tags like profile in methods like find, you have to specify the namespace as well. For example, you could write the following to find all profile-tags.
import xml.etree as xml
pom = xml.parse('pom.xml')
for profile in pom.findall('//{http://maven.apache.org/POM/4.0.0}profile'):
print(repr(profile))
Also note that I'm using //. Using */ would have the same result for your specific xml file above. However, it would not work for other tags like mappings. Since * represents only one level, */child can be expanded to parent/tag or xyz/tag but not to xyz/parent/tag.
Now, you should be able to come up with something like this to find all mappings:
pom = xml.parse('pom.xml')
map = {}
for mapping in pom.findall('//{http://maven.apache.org/POM/4.0.0}mappings'
'/{http://maven.apache.org/POM/4.0.0}property'):
name = mapping.find('{http://maven.apache.org/POM/4.0.0}name').text
value = mapping.find('{http://maven.apache.org/POM/4.0.0}value').text
map[name] = value
Specifying the namespaces like this is quite verbose. To make it easier to read, you can define a namespace map and pass it as second argument to find and findall:
# ...
nsmap = {'m': 'http://maven.apache.org/POM/4.0.0'}
for mapping in pom.findall('//m:mappings/m:property', nsmap):
name = mapping.find('m:name', nsmap).text
value = mapping.find('m:value', nsmap).text
map[name] = value
Ok, found out that when I remove maven stuff from the project element so its just <project> I can do this:
for mapping in root.findall('*//mappings'):
logging.info(mapping)
for prop in mapping.findall('./property'):
logging.info(prop.find('name').text + " => " + prop.find('value').text)
Which would result in:
INFO:root:<Element 'mappings' at 0x10d72d350>
INFO:root:homepage => /content/homepage
INFO:root:assets => /content/assets
However, if I leave the Maven stuff in at the top I can do this:
for mapping in root.findall('*//{http://maven.apache.org/POM/4.0.0}mappings'):
logging.info(mapping)
for prop in mapping.findall('./{http://maven.apache.org/POM/4.0.0}property'):
logging.info(prop.find('{http://maven.apache.org/POM/4.0.0}name').text + " => " + prop.find('{http://maven.apache.org/POM/4.0.0}value').text)
Which results in:
INFO:root:<Element '{http://maven.apache.org/POM/4.0.0}mappings' at 0x10aa7f310>
INFO:root:homepage => /content/homepage
INFO:root:assets => /content/assets
However, I'd love to be able to figure out how to avoid having to account for the maven stuff since it locks me into this one format.
EDIT:
Ok, I managed to get something a bit more verbose:
import xml.etree.ElementTree as xml
def getMappingsNode(node, nodeName):
if node.findall('*'):
for n in node.findall('*'):
if nodeName in n.tag:
return n
else:
return getMappingsNode(n, nodeName)
def getMappings(rootNode):
mappingsNode = getMappingsNode(rootNode, 'mappings')
mapping = {}
for prop in mappingsNode.findall('*'):
key = ''
val = ''
for child in prop.findall('*'):
if 'name' in child.tag:
key = child.text
if 'value' in child.tag:
val = child.text
if val and key:
mapping[key] = val
return mapping
pomFile = xml.parse('pom.xml')
root = pomFile.getroot()
mappings = getMappings(root)
print mappings

Categories

Resources