XML file to Excel, error when opening - python

I have this file here:
<?xml?>
<table name="data">
<row et_kt="215846" et_nafn="" et_kt_maka="" et_kt_fjolsk="215846" et_kyn="X" et_hjusk_stada="1" et_faeddag="190201" et_danrdag="198612" />
<row et_kt="239287" et_nafn="" et_kt_maka="" et_kt_fjolsk="239287" et_kyn="X" et_hjusk_stada="4" et_faeddag="190401" et_danrdag="199106" />
.
.
.
</table>
Excel tells me the file is in a different format than the .xml extension implies. What's wrong with the format?

Try replacing <?xml?> with <?xml version="1.0"?>. The version attribute is mandatory in an XML declaration.
EDIT: Check this answer for some extra information about the issue.

At first glance, I'd say there is no valid XML declaration, i.e.
<?xml version="1.0" encoding="UTF-8" ?>
Or, as it is a Microsoft product, maybe you should try <?xml version="1.0" encoding="Windows-1252" standalone="yes"?>
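Both answers come down to the declaration being malformed, which can be checked with a strict XML parser before the file is handed to Excel. Below is a minimal sketch using Python's standard library. The helper name is_well_formed and the shortened one-row sample are illustrative assumptions, not part of the original post.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical one-row version of the file from the question.
ROW = '<table name="data"><row et_kt="215846" et_kyn="X" /></table>'
broken = '<?xml?>' + ROW                 # declaration without a version
fixed = '<?xml version="1.0"?>' + ROW    # declaration with the required version

print("broken parses:", is_well_formed(broken))
print("fixed parses:", is_well_formed(fixed))
```

If the first check fails and the second passes, the missing version attribute is the likely culprit.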

Related

CDATA sections and comments are lost when parsing XML with ElementTree

I am editing XML files and ran into a problem: when a Python script changes a file, its structure is lost.
Xml file:
<?xml version="1.0" encoding="UTF-8"?>
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path><![CDATA[path]]></path>
<code_main />
</errors>
<reference>3</reference>
</element>
....
</main>
Using:
import xml.etree.ElementTree as ET
ET.parse(xml_file).write("test.xml", encoding='utf-8', xml_declaration=True)
I lose all comments in the file, and when I compare the original file with the modified one using diff (on Linux), the files are shown as completely different.
Is there a way to change the XML file (my task is to add a subelement to <element>) while leaving the overall structure of the file unchanged, including comments and order?
The order and the comments are essential in this file.
UPD:
After running the code above, the source XML comes out in the following form:
<?xml version='1.0' encoding='utf-8'?>
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path>path</path>
<code_main />
</errors>
<reference>3</reference>
</element>
</main>
Note <path>: the CDATA section is gone.
Comments are not preserved either:
Source:
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path><![CDATA[path]]></path>
<!--Stt-->
<code_main />
</errors>
<reference>3</reference>
</element>
</main>
Modified:
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path>path</path>
<code_main />
</errors>
<reference>3</reference>
</element>
</main>
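The thread has no answer in this excerpt, so the following is only a sketch of one possible approach, not a confirmed solution. It assumes Python 3.8 or later, where xml.etree.ElementTree.TreeBuilder accepts insert_comments=True. The function name add_subelement_keep_comments and the tag newValue are hypothetical. The standard library still turns CDATA sections into plain escaped text. If CDATA must survive, a third-party parser such as lxml (its XMLParser has a strip_cdata option) may be worth evaluating.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the source document from the question.
SOURCE = """<main>
<element formatVersion="1.0">
<errors>
<path><![CDATA[path]]></path>
<!--Stt-->
<code_main />
</errors>
<reference>3</reference>
</element>
</main>"""


def add_subelement_keep_comments(xml_text, tag, text):
    """Append <tag>text</tag> to every <element>, keeping comments in the tree."""
    # insert_comments=True makes the builder keep comment nodes (Python 3.8+).
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    for element in root.iter("element"):
        child = ET.SubElement(element, tag)
        child.text = text
    return ET.tostring(root, encoding="unicode")


# "newValue" / "newText" are placeholder names for the subelement to add.
result = add_subelement_keep_comments(SOURCE, "newValue", "newText")
print(result)
```

Element order is kept as parsed, and the comment appears in the output. The CDATA wrapper around path does not.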

Element is not an element of the schema

I want to validate an XML file from my bank against an iso20022 XSD, but it fails claiming the first element (Document) is not an element of the scheme. I can see the 'Document' element defined in the XSD though.
I downloaded the XSD mentioned in the header from here: https://www.iso20022.org/documents/messages/camt/schemas/camt.052.001.06.zip. Then I wrote a little script to validate the XML file:
import xmlschema
schema = xmlschema.XMLSchema('camt.052.001.06.xsd')
schema.validate('minimal_example.xml')
(use 'pip install xmlschema' to install the xmlschema package)
minimal_example.xml is just the first element of my bank's XML file without any children.
<?xml version="1.0" ?>
<Document xmlns:ns2="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</Document>
The above script fails, claiming document was not an element of the XSD:
xmlschema.validators.exceptions.XMLSchemaValidationError: failed validating <Element 'Document' at 0x7fbda11e4138> with XMLSchema10(basename='camt.052.001.06.xsd', namespace='urn:iso:std:iso:20022:tech:xsd:camt.052.001.06'):
Reason: <Element 'Document' at 0x7fbda11e4138> is not an element of the schema
Instance:
<Document>
</Document>
But the document element is defined right at the top of camt.052.001.06.xsd:
<?xml version="1.0" encoding="UTF-8"?>
<!--Generated by Standards Editor (build:R1.6.5.6) on 2016 Feb 12 18:17:13, ISO 20022 version : 2013-->
<xs:schema xmlns="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06" xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
<xs:element name="Document" type="Document"/>
[...]
Why does the validation fail and how can I correct this?
The XSD has
targetNamespace="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06"
on the xs:schema element, indicating that it governs that namespace.
Your XML has a root element,
<Document xmlns:ns2="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</Document>
which places the Document in no namespace. To place it in the namespace governed by the XSD, change it to
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</Document>
or
<ns2:Document xmlns:ns2="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</ns2:Document>
See also xmlns, xmlns:xsi, xsi:schemaLocation, and targetNamespace?
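The difference between the three root elements can be seen by asking a parser for the fully qualified tag name. This is a small sketch using only the standard library. The helper root_tag is an illustrative name, and ElementTree's {namespace}local notation is used just to display the result.

```python
import xml.etree.ElementTree as ET

NS = "urn:iso:std:iso:20022:tech:xsd:camt.052.001.06"


def root_tag(xml_text):
    """Return the root tag in ElementTree's {namespace}local notation."""
    return ET.fromstring(xml_text).tag


# Declares the prefix ns2 but never uses it: Document stays in no namespace.
prefix_only = f'<Document xmlns:ns2="{NS}"></Document>'
# Default namespace: unprefixed elements belong to NS.
default_ns = f'<Document xmlns="{NS}"></Document>'
# Prefix declared and used on the element itself.
prefixed = f'<ns2:Document xmlns:ns2="{NS}"></ns2:Document>'

for label, text in [("prefix only", prefix_only),
                    ("default ns", default_ns),
                    ("prefixed", prefixed)]:
    print(label, "->", root_tag(text))
```

Only the last two variants yield a root element in the schema's target namespace, which is what the validator looks for.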

Beautiful Soup fails to recognize UTF-8 encoding on Python 3, IPython 6 console

I am trying to read an xml document using Beautiful Soup on Python 3.6.2, IPython 6.1.0, Windows 10, and I can't get the encoding right.
Here's my test xml, saved as a file in UTF8-encoding:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<info name="愛よ">ÜÜÜÜÜÜÜ</info>
<items>
<item thing="ÖöÖö">"23Äßßß"</item>
</items>
</root>
First check the XML using ElementTree:
import xml.etree.ElementTree as ET
def printXML(xml, indent=''):
    # Print the tag and its text, then the attributes, then recurse into children.
    print(indent + str(xml.tag) + ': ' + (xml.text if xml.text is not None else '').replace('\n', ''))
    for k, v in xml.attrib.items():
        print(indent + '\t' + k + ' - ' + v)
    # Iterate the element directly; getchildren() was removed in Python 3.9.
    for child in xml:
        printXML(child, indent + '\t')
xml0 = ET.parse("test.xml").getroot()
printXML(xml0)
The output is correct:
root:
info: ÜÜÜÜÜÜÜ
name - 愛よ
items:
item: "23Äßßß"
thing - ÖöÖö
Now read the same file with Beautiful Soup and pretty-print it:
import bs4
with open("test.xml") as ff:
    xml = bs4.BeautifulSoup(ff,"html5lib")
print(xml.prettify())
Output:
<!--?xml version="1.0" encoding="UTF-8"?-->
<html>
<head>
</head>
<body>
<root>
<info name="愛よ">
ÜÜÜÜÜÜÜ
</info>
<items>
<item thing="ÖöÖö">
"23Äßßß"
</item>
</items>
</root>
</body>
</html>
This is just wrong. Calling it with an explicit encoding, bs4.BeautifulSoup(ff,"html5lib",from_encoding="UTF-8"), doesn't change the result.
Doing
print(xml.original_encoding)
outputs
None
So Beautiful Soup is apparently unable to detect the original encoding even though the file is encoded in UTF8 (according to Notepad++) and the header information says UTF-8 as well, and I do have chardet installed as the doc recommends.
Am I making a mistake here? What could be causing this?
EDIT:
When I invoke the code without the html5lib I get this warning:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib").
This usually isn't a problem, but if you run this code on another system, or in a different virtual environment,
it may use a different parser and behave differently.
The code that caused this warning is on line 241 of the file C:\Users\My.Name\AppData\Local\Continuum\Anaconda2\envs\Python3\lib\site-packages\spyder\utils\ipython\start_kernel.py.
To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html5lib")
markup_type=markup_type))
EDIT 2:
As suggested in a comment I tried bs4.BeautifulSoup(ff,"html.parser"), but the problem remains.
Then I installed lxml and tried bs4.BeautifulSoup(ff,"lxml-xml"), still the same output.
What also strikes me as odd is that even when specifying an encoding like bs4.BeautifulSoup(ff,"lxml-xml",from_encoding='UTF-8') the value of xml.original_encoding is None contrary to what is written in the doc.
EDIT 3:
I put my xml contents into a string
xmlstring = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><info name=\"愛よ\">ÜÜÜÜÜÜÜ</info><items><item thing=\"ÖöÖö\">\"23Äßßß\"</item></items></root>"
And used bs4.BeautifulSoup(xmlstring,"lxml-xml"), now I'm getting the correct output:
<?xml version="1.0" encoding="utf-8"?>
<root>
<info name="愛よ">
ÜÜÜÜÜÜÜ
</info>
<items>
<item thing="ÖöÖö">
"23Äßßß"
</item>
</items>
</root>
So it seems something is wrong with the file after all.
Found the error, I have to specify the encoding when opening the file:
with open("test.xml",encoding='UTF-8') as ff:
    xml = bs4.BeautifulSoup(ff,"html5lib")
As I'm on Python 3 I thought the value of encoding was UTF-8 by default, but it turned out it's system-dependent and on my system it's cp1252.
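The effect described above can be reproduced without Beautiful Soup. This is a rough sketch with the standard library only: the helper read_xml_text and the temporary file are illustrative assumptions, and cp1252 stands in for the system default reported in the post.

```python
import locale
import os
import tempfile


def read_xml_text(path):
    """Read a file as UTF-8 regardless of the platform's default encoding."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


sample = '<?xml version="1.0" encoding="UTF-8"?><root><info name="愛よ">ÜÜÜ</info></root>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    text = read_xml_text(path)

# The encoding open() falls back to when none is given (system-dependent).
print("default encoding:", locale.getpreferredencoding(False))
print("round trip intact:", text == sample)

# What happens to one character when UTF-8 bytes are decoded as cp1252.
mojibake = "Ü".encode("utf-8").decode("cp1252")
print(ascii(mojibake))  # ascii() keeps the output printable on any console
```

Another option, not tried in the post, is to open the file in binary mode ("rb") and pass the bytes to the parser, so that decoding is left to the parser rather than to open().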

can't find specific node/element using python elementtree

I have the below XML document I am trying to parse. I just need to grab one node from the document. I need to get the serviceProfile text. I'm banging my head against the desk here... I am new to Python.
<?xml version='1.0' encoding='UTF-8'?>
<soapenv:Envelope
xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
<ns:getUserResponse
xmlns:ns="http://www.cisco.com/AXL/API/11.5">
<return>
<user uuid="{blbhbl-bhblb-kbhb}">
<firstName>fname</firstName>
<displayName>fname lname</displayName>
<middleName/>
<lastName>lname</lastName>
<userid>wooty</userid>
<password/>
<pin/>
<mailid>wooty@woot.com</mailid>
<department/>
<manager/>
<userLocale />
<associatedDevices/>
<primaryExtension/>
<associatedPc/>
<enableCti>false</enableCti>
<digestCredentials/>
<phoneProfiles/>
<defaultProfile/>
<presenceGroupName uuid="{sdsds-sdsds-sdsdsd-sdsdsd-sdsd}">Standard Presence group</presenceGroupName>
<subscribeCallingSearchSpaceName/>
<enableMobility>false</enableMobility>
<enableMobileVoiceAccess>false</enableMobileVoiceAccess>
<maxDeskPickupWaitTime>10000</maxDeskPickupWaitTime>
<remoteDestinationLimit>4</remoteDestinationLimit>
<associatedRemoteDestinationProfiles/>
<associatedTodAccess/>
<status>1</status>
<enableEmcc>false</enableEmcc>
<associatedCapfProfiles/>
<ctiControlledDeviceProfiles/>
<patternPrecedence />
<numericUserId />
<mlppPassword />
<customUserFields/>
<homeCluster>true</homeCluster>
<imAndPresenceEnable>true</imAndPresenceEnable>
<serviceProfile uuid="{dsdsdsd-sdsdsd-sdsd-sdsds-sdsds}">1 IM Presence Only</serviceProfile>
<lineAppearanceAssociationForPresences/>
<directoryUri>blah@wooty.com</directoryUri>
<telephoneNumber>555-555-5555</telephoneNumber>
<title/>
<mobileNumber/>
<homeNumber/>
<pagerNumber/>
<extensionsInfo/>
<selfService />
<userProfile/>
<calendarPresence>false</calendarPresence>
<ldapDirectoryName uuid="{sdsd-sdsdsd-sdsds-sdsds}">someinfo</ldapDirectoryName>
<userIdentity>blah@woot.com</userIdentity>
<nameDialing>blehWoot</nameDialing>
<ipccExtension/>
<convertUserAccount uuid="{sdsd-sdsdsd-sdsds-sdsds}">someinfo</convertUserAccount>
<enableUserToHostConferenceNow>false</enableUserToHostConferenceNow>
<attendeesAccessCode/>
</user>
</return>
</ns:getUserResponse>
</soapenv:Body>
</soapenv:Envelope>
Based on @danielHaley's suggestions, I created the following code to retrieve the node.
import xml.etree.ElementTree as ET
# Read the XML response (response is the HTTP response object) and get the service profile
root = ET.fromstring(response.content)
serviceprofile = root.find(".//serviceProfile").text
Worked great. Thank you so much for your help.
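The lookup works without any namespace handling because return, user and serviceProfile carry no prefix and no default namespace is declared, so they are in no namespace. Here is a self-contained sketch on a trimmed copy of the response. The names RESPONSE, NAMESPACES, get_service_profile and get_user_response are illustrative, and the second helper shows how a prefixed element could be addressed if needed.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the SOAP response from the question.
RESPONSE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
<ns:getUserResponse xmlns:ns="http://www.cisco.com/AXL/API/11.5">
<return>
<user uuid="{blbhbl-bhblb-kbhb}">
<userid>wooty</userid>
<serviceProfile uuid="{dsdsdsd-sdsdsd-sdsd-sdsds-sdsds}">1 IM Presence Only</serviceProfile>
</user>
</return>
</ns:getUserResponse>
</soapenv:Body>
</soapenv:Envelope>"""

# Prefix-to-URI mapping used only for the namespaced lookup below.
NAMESPACES = {
    "soapenv": "http://schemas.xmlsoap.org/soap/envelope/",
    "ns": "http://www.cisco.com/AXL/API/11.5",
}


def get_service_profile(xml_text):
    """Return the serviceProfile text, or None if the element is missing."""
    root = ET.fromstring(xml_text)
    node = root.find(".//serviceProfile")
    return None if node is None else node.text


def get_user_response(xml_text):
    """Return the namespaced getUserResponse element, or None if absent."""
    root = ET.fromstring(xml_text)
    return root.find("soapenv:Body/ns:getUserResponse", NAMESPACES)


print(get_service_profile(RESPONSE))
print(get_user_response(RESPONSE).tag)
```

Checking for None before reading .text avoids an AttributeError when a response does not contain the element.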

How to merge only selected lines of two different xml files into single xml file using python?

I have an XML file, "A.xml":
<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://example">
<metadata>
<id>example</id>
</metadata>
<files>
<file src="lib/Debug/exampled.lib" target="lib/Debug/exampled.lib" />
</files>
</package>
Another xml file "B.xml" as:
<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://example">
<metadata>
<id>example</id>
</metadata>
<files>
<file src="lib/Release/example.lib" target="lib/Release/example.lib" />
</files>
</package>
I want to merge these two files in the following way:
<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://example">
<metadata>
<id>example</id>
</metadata>
<files>
<file src="lib/Debug/exampled.lib" target="lib/Debug/exampled.lib" />
<file src="lib/Release/example.lib" target="lib/Release/example.lib" />
</files>
</package>
So, kindly suggest how to merge only the files tag of the two files using python scripting.
This is a Stack Overflow answer that mentions many ways you can parse XML:
How do I parse XML in Python?
I suggest going with ElementTree, as it's very simple to use and can solve your case with minimal code.
Seeing how the community is hostile toward questions that have little research behind them, I'm not going to post the full answer here and suggest that you take time to read up on ElementTree.
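For orientation, here is a minimal sketch of how such a merge might look with ElementTree. It is one possible reading of the task, not the answerer's code. The function name merge_files is hypothetical, the inputs are inline strings instead of A.xml and B.xml, and the XML declaration is left out of the samples for brevity.

```python
import xml.etree.ElementTree as ET

NS = "http://example"  # default namespace used by both files

A_XML = """<package xmlns="http://example">
<metadata>
<id>example</id>
</metadata>
<files>
<file src="lib/Debug/exampled.lib" target="lib/Debug/exampled.lib" />
</files>
</package>"""

B_XML = """<package xmlns="http://example">
<metadata>
<id>example</id>
</metadata>
<files>
<file src="lib/Release/example.lib" target="lib/Release/example.lib" />
</files>
</package>"""


def merge_files(a_text, b_text):
    """Append the <file> entries of b_text to the <files> element of a_text."""
    # Serialize NS as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", NS)
    root_a = ET.fromstring(a_text)
    root_b = ET.fromstring(b_text)
    files_a = root_a.find(f"{{{NS}}}files")
    files_b = root_b.find(f"{{{NS}}}files")
    files_a.extend(list(files_b))
    return ET.tostring(root_a, encoding="unicode")


merged = merge_files(A_XML, B_XML)
print(merged)
```

With real files, ET.parse("A.xml") and tree.write(..., xml_declaration=True) would replace the string handling. Duplicate entries are not filtered in this sketch.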
