How to avoid double escape using XML - python

I'm using python to make a program which will have to write data in a XML tag of a specific file.
The line of data I'm willing to write is the following.
<Stream>XXXX-XXXX-XXXX-XXXX?p=0</Stream><URL>rtmp://a.rtmp.youtube.com/live2</URL>
But what I get in my XML file after writing is pretty different.
&lt;Stream&gt;XXXX-XXXX-XXXX-XXXX?p=0&lt;/Stream&gt;&lt;URL&gt;rtmp://a.rtmp.youtube.com/live2&lt;/URL&gt;
The &lt and &gt are here for purpose, and are NOT < and >. I need to keep this formatting but when I use the export as xml file, it replaces all the & by &
I use this code to write data in the xml file:
from lxml import etree as ET
Name_with_single_quote= """IF [Calculation_1] = 'Day-1' THEN [begintime] + 1
ELSEIF[Calculation_1] < 'Day-2' THEN [begintime] + 2
ELSEIF [Calculation_1] > "Day-3" THEN [begintime] + 3
ELSE [begintime]
END"""
Name_with_single_quote = Name_with_single_quote.replace("\n", "
").replace("<", "<").replace("'", "&apos;").replace(">",">").replace("\"", """)
Name_with_single_quote = str(Name_with_single_quote)
xml = """<?xml version="1.0"?>
<column role="dimension" type="nominal" name="[Calculation_1]" datatype="boolean" caption="">
<calculation formula=""/>
</column>"""
tree = ET.fromstring(xml)
formula = tree.find('.//calculation')
formula.set('formula', Name_with_single_quote)
from xml.dom import minidom
xmlstr = minidom.parseString(ET.tostring(tree)).toprettyxml()
xmlstr = '\n'.join(list(filter(lambda x: len(x.strip()), xmlstr.split('\n'))))
with open('test_for_esc_result.xml', "w") as f:
f.write(xmlstr)

Related

How to indent my xml data which in xml file python [duplicate]

After reading from an existing file with 'ugly' XML and doing some modifications, pretty printing doesn't work. I've tried etree.write(FILE_NAME, pretty_print=True).
I have the following XML:
<testsuites tests="14" failures="0" disabled="0" errors="0" time="0.306" name="AllTests">
<testsuite name="AIR" tests="14" failures="0" disabled="0" errors="0" time="0.306">
....
And I use it like this:
tree = etree.parse('original.xml')
root = tree.getroot()
...
# modifications
...
with open(FILE_NAME, "w") as f:
tree.write(f, pretty_print=True)
For me, this issue was not solved until I noticed this little tidbit here:
http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
Short version:
Read in the file with this command:
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> tree = etree.parse(filename, parser)
That will "reset" the already existing indentation, allowing the output to generate it's own indentation correctly. Then pretty_print as normal:
>>> tree.write(<output_file_name>, pretty_print=True)
Well, according to the API docs, there is no method "write" in the lxml etree module. You've got a couple of options in regards to getting a pretty printed xml string into a file. You can use the tostring method like so:
f = open('doc.xml', 'w')
f.write(etree.tostring(root, pretty_print=True))
f.close()
Or, if your input source is less than perfect and/or you want more knobs and buttons to configure your out put you could use one of the python wrappers for the tidy lib.
http://utidylib.berlios.de/
import tidy
f.write(tidy.parseString(your_xml_str, **{'output_xml':1, 'indent':1, 'input_xml':1}))
http://countergram.com/open-source/pytidylib
from tidylib import tidy_document
document, errors = tidy_document(your_xml_str, options={'output_xml':1, 'indent':1, 'input_xml':1})
f.write(document)
fp = file('out.txt', 'w')
print(e.tree.tostring(...), file=fp)
fp.close()
Here is an answer that is fixed to work with Python 3:
from lxml import etree
from sys import stdout
from io import BytesIO
parser = etree.XMLParser(remove_blank_text = True)
file_obj = BytesIO(text)
tree = etree.parse(file_obj, parser)
tree.write(stdout.buffer, pretty_print = True)
where text is the xml code as a sequence of bytes.
I am not sure why other answers did not mention this. If you want to obtain the root of the xml there is a method called getroot(). I hope I answered your question (though a little late).
tree = et.parse(xmlFile)
root = tree.getroot()
Of course - pretty print of lxml.etree is possible.
In my case, the old trick with remove_blank_text=True and pretty_print=True was not working as I expected (was too delicate), so I decided to write it by myself.
Here is it - a modern, forcible, native pythonic way to correct lxml.etee.Element tree indentation.
This gives a nicely prettified XML string:
from typing import Optional
import lxml.etree
def indent_lxml(element: lxml.etree.Element, level: int = 0, is_last_child: bool = True) -> None:
space = " "
indent_str = "\n" + level * space
element.text = strip_or_null(element.text)
if element.text:
element.text = f"{indent_str}{space}{element.text}"
num_children = len(element)
if num_children:
element.text = f"{element.text or ''}{indent_str}{space}"
for index, child in enumerate(element.iterchildren()):
is_last = index == num_children - 1
indent_lxml(child, level + 1, is_last)
elif element.text:
element.text += indent_str
tail_level = max(0, level - 1) if is_last_child else level
tail_indent = "\n" + tail_level * space
tail = strip_or_null(element.tail)
element.tail = f"{indent_str}{tail}{tail_indent}" if tail else tail_indent
def strip_or_null(text: Optional[str]) -> Optional[str]:
if text is not None:
return text.strip() or None
It's decent fast, because it doesn't allocate any additional structures in memory and also traversing the tree - it visits each node only once, giving the best possible - O x N computational complexity.
It rearranges all the existing indentation "in place" in the tree (the DOM) by correcting contents of Element.text and Element.tail attributes (affects white-spaces only).
Naturally, it also can be used with HTML parsed by lxml.
In order to use it, do something like that:
root = lxml.etree.parse("path/to/the_file.xml").getroot()
# or
root = lxml.etree.fromstring("<xml><body><leaf1/><leaf2/></body></xml>")
indent_lxml(root) # corrects indentation "in place"
result = lxml.etree.tostring(root, encoding="unicode")
print(result)
Which prints:
<xml>
<body>
<leaf1/>
<leaf2/>
</body>
</xml>

How to properly format xml file using lxml? [duplicate]

After reading from an existing file with 'ugly' XML and doing some modifications, pretty printing doesn't work. I've tried etree.write(FILE_NAME, pretty_print=True).
I have the following XML:
<testsuites tests="14" failures="0" disabled="0" errors="0" time="0.306" name="AllTests">
<testsuite name="AIR" tests="14" failures="0" disabled="0" errors="0" time="0.306">
....
And I use it like this:
tree = etree.parse('original.xml')
root = tree.getroot()
...
# modifications
...
with open(FILE_NAME, "w") as f:
tree.write(f, pretty_print=True)
For me, this issue was not solved until I noticed this little tidbit here:
http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
Short version:
Read in the file with this command:
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> tree = etree.parse(filename, parser)
That will "reset" the already existing indentation, allowing the output to generate it's own indentation correctly. Then pretty_print as normal:
>>> tree.write(<output_file_name>, pretty_print=True)
Well, according to the API docs, there is no method "write" in the lxml etree module. You've got a couple of options in regards to getting a pretty printed xml string into a file. You can use the tostring method like so:
f = open('doc.xml', 'w')
f.write(etree.tostring(root, pretty_print=True))
f.close()
Or, if your input source is less than perfect and/or you want more knobs and buttons to configure your out put you could use one of the python wrappers for the tidy lib.
http://utidylib.berlios.de/
import tidy
f.write(tidy.parseString(your_xml_str, **{'output_xml':1, 'indent':1, 'input_xml':1}))
http://countergram.com/open-source/pytidylib
from tidylib import tidy_document
document, errors = tidy_document(your_xml_str, options={'output_xml':1, 'indent':1, 'input_xml':1})
f.write(document)
fp = file('out.txt', 'w')
print(e.tree.tostring(...), file=fp)
fp.close()
Here is an answer that is fixed to work with Python 3:
from lxml import etree
from sys import stdout
from io import BytesIO
parser = etree.XMLParser(remove_blank_text = True)
file_obj = BytesIO(text)
tree = etree.parse(file_obj, parser)
tree.write(stdout.buffer, pretty_print = True)
where text is the xml code as a sequence of bytes.
I am not sure why other answers did not mention this. If you want to obtain the root of the xml there is a method called getroot(). I hope I answered your question (though a little late).
tree = et.parse(xmlFile)
root = tree.getroot()
Of course - pretty print of lxml.etree is possible.
In my case, the old trick with remove_blank_text=True and pretty_print=True was not working as I expected (was too delicate), so I decided to write it by myself.
Here is it - a modern, forcible, native pythonic way to correct lxml.etee.Element tree indentation.
This gives a nicely prettified XML string:
from typing import Optional
import lxml.etree
def indent_lxml(element: lxml.etree.Element, level: int = 0, is_last_child: bool = True) -> None:
space = " "
indent_str = "\n" + level * space
element.text = strip_or_null(element.text)
if element.text:
element.text = f"{indent_str}{space}{element.text}"
num_children = len(element)
if num_children:
element.text = f"{element.text or ''}{indent_str}{space}"
for index, child in enumerate(element.iterchildren()):
is_last = index == num_children - 1
indent_lxml(child, level + 1, is_last)
elif element.text:
element.text += indent_str
tail_level = max(0, level - 1) if is_last_child else level
tail_indent = "\n" + tail_level * space
tail = strip_or_null(element.tail)
element.tail = f"{indent_str}{tail}{tail_indent}" if tail else tail_indent
def strip_or_null(text: Optional[str]) -> Optional[str]:
if text is not None:
return text.strip() or None
It's decent fast, because it doesn't allocate any additional structures in memory and also traversing the tree - it visits each node only once, giving the best possible - O x N computational complexity.
It rearranges all the existing indentation "in place" in the tree (the DOM) by correcting contents of Element.text and Element.tail attributes (affects white-spaces only).
Naturally, it also can be used with HTML parsed by lxml.
In order to use it, do something like that:
root = lxml.etree.parse("path/to/the_file.xml").getroot()
# or
root = lxml.etree.fromstring("<xml><body><leaf1/><leaf2/></body></xml>")
indent_lxml(root) # corrects indentation "in place"
result = lxml.etree.tostring(root, encoding="unicode")
print(result)
Which prints:
<xml>
<body>
<leaf1/>
<leaf2/>
</body>
</xml>

xml file to csv file python script

I need a python script for extract data from xml file
I have a xml file as shoen below:
<software>
<name>Update Image</name>
<Build>22.02</Build>
<description>Firmware for Delta-M Series </description>
<CommonImages> </CommonImages>
<ModelBasedImages>
<ULT>
<CNTRL_0>
<file type="UI_APP" ver="2.35" crc="1234"/>
<file type="MainFW" ver="5.01" crc="5678"/>
<SIZE300>
<file type="ParamTableDB" ver="1.1.4" crc="9101"/>
</SIZE300>
</CNTRL_0>
<CNTRL_2>
<file type="UI_APP" ver="2.35" crc="1234"/>
<file type="MainFW" ver="5.01" crc="9158"/>
</CNTRL_2>
</ULT>
</ModelBasedImages>
</software>
I want the data in table format like:
type ver crc
UI_APP 2.35 1234
MainFW 5.01 5678
ParamTableDB 1.1.4 9101
UI_APP 2.35 1234
MainFW 5.01 9158
Extract into any type of file csv/doc....
I tried this code:
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("Build_40.01 (copy).xml")
root = tree.getroot()
# open a file for writing
Resident_data = open('ResidentData.csv', 'w')
# create the csv writer object
csvwriter = csv.writer(Resident_data)
resident_head = []
count = 0
for member in root.findall('file'):
resident = []
address_list = []
if count == 0:
name = member.find('type').tag
resident_head.append(name)
ver = member.find('ver').tag
resident_head.append(ver)
crc = member.find('crc').tag
resident_head.append(crc)
csvwriter.writerow(resident_head)
count = count + 1
name = member.find('type').text
resident.append(name)
ver = member.find('ver').text
resident.append(ver)
crc = member.find('crc').text
resident.append(crc)
csvwriter.writerow(resident)
Resident_data.close()
Thanks in advance
edited:xml code updated.
Use the xpath expression .//file to find all <file> elements in the XML document, and then use each element's attributes to populate the CSV file through a csv.DictWriter:
import csv
import xml.etree.ElementTree as ET
tree = ET.parse("Build_40.01 (copy).xml")
root = tree.getroot()
with open('ResidentData.csv', 'w') as f:
w = csv.DictWriter(f, fieldnames=('type', 'ver', 'crc'))
w.writerheader()
w.writerows(e.attrib for e in root.findall('.//file'))
For your sample input the output CSV file will look like this:
type,ver,crc
UI_APP,2.35,1234
MainFW,5.01,5678
ParamTableDB,1.1.4,9101
UI_APP,2.35,1234
MainFW,5.01,9158
which uses the default delimiter (comma) for a CSV file. You can change the delimiter using the delimiter=' ' option to DictWriter(), however, you will not be able to obtain the same formatting as your sample output, which appears to use fixed width fields (but you might get away with using tab as the delimiter).

Convert CSV document to XML

I know the question is redundant but I tried all the Python code that I found and modified for my file but they did not work. I need to find a way to convert my file myData.csv in to a XML format file which can be read by a navigator.
I just started to learn Python this month so I'm a beginner. This is my code:
#! usr/bin/python
# -*- coding: utf-8 -*-
import csv, sys, os
from lxml import etree
csvFile = 'myData.csv' # création de la variable pour le fichier csv
reader= csv.reader(open(csvFile), delimiter=';', quoting=csv.QUOTE_NONE) # création d'une variable reader à qui on renvoie le tableau csv
print "<data>"
for record in reader:
if reader.line_num == 1:
header = record
else:
innerXml = ""
dontShow = False
type = ""
for i, field in enumerate(record):
innerXml += "<%s>" % header[i].lower() + field + "</%s>" % header[i].lower()
if i == 1 and field == "0":
type = "Next"
elif type == "" and i == 3 and field == "0":
type = "Next"
elif type == "" and i == 3 and field != "0":
type = "film"
if i == 1 and field == "X":
dontShow = True
if dontShow == False:
xml = "<%s>" % type
xml += innerXml
xml += "</%s>" % type
print xml
print "</data>"
Consider building your XML with dedicated DOM objects and not a concatenation of strings which you can do with the lxml module. Using methods such as Element(), SubElement(), etc. you can iteratively build XML tree from reading CSV data:
import csv
import lxml.etree as ET
headers = ['Titre', 'Realisateur', 'Date_Debut_Evenement', 'Date_Fin_Evenement', 'Cadre',
'Lieu', 'Adresse', 'Arrondissement', 'Adresse_complète', 'Geo_Coordinates']
# INITIALIZING XML FILE
root = ET.Element('root')
# READING CSV FILE AND BUILD TREE
with open('myData.csv') as f:
next(f) # SKIP HEADER
csvreader = csv.reader(f)
for row in csvreader:
data = ET.SubElement(root, "data")
for col in range(len(headers)):
node = ET.SubElement(data, headers[col]).text = str(row[col])
# SAVE XML TO FILE
tree_out = (ET.tostring(root, pretty_print=True, xml_declaration=True, encoding="UTF-8"))
# OUTPUTTING XML CONTENT TO FILE
with open('Output.xml', 'wb') as f:
f.write(tree_out)
Output
<?xml version='1.0' encoding='UTF-8'?>
<root>
<data>
<Titre>1</Titre>
<Realisateur>BUS PALLADIUM</Realisateur>
<Date_Debut_Evenement>CHRISTOPHER THOMPSON</Date_Debut_Evenement>
<Date_Fin_Evenement>21 mai 2009</Date_Fin_Evenement>
<Cadre>21 mai 2009</Cadre>
<Lieu>EXTERIEUR</Lieu>
<Adresse>PLACE</Adresse>
<Arrondissement>PIGALLE</Arrondissement>
<Adresse_complète>75018</Adresse_complète>
<Geo_Coordinates>PLACE PIGALLE 75018 Paris France</Geo_Coordinates>
</data>
<data>
<Titre>2</Titre>
<Realisateur>LES INVITES DE MON PERE</Realisateur>
<Date_Debut_Evenement>ANNE LE NY</Date_Debut_Evenement>
<Date_Fin_Evenement>20 mai 2009</Date_Fin_Evenement>
<Cadre>20 mai 2009</Cadre>
<Lieu>DOMAINE PUBLIC</Lieu>
<Adresse>SQUARE</Adresse>
<Arrondissement>DU CLIGNANCOURT</Arrondissement>
<Adresse_complète>75018</Adresse_complète>
<Geo_Coordinates>SQUARE DU CLIGNANCOURT 75018 Paris France</Geo_Coordinates>
</data>
<data>
<Titre>3</Titre>
<Realisateur>DEMAIN, A L'AUBE</Realisateur>
<Date_Debut_Evenement>GAEL CABOUAT</Date_Debut_Evenement>
<Date_Fin_Evenement>17 avril 2009</Date_Fin_Evenement>
<Cadre>17 avril 2009</Cadre>
<Lieu>EXTERIEUR</Lieu>
<Adresse>RUE</Adresse>
<Arrondissement>QUINCAMPOIX</Arrondissement>
<Adresse_complète>75004</Adresse_complète>
<Geo_Coordinates>RUE QUINCAMPOIX 75004 Paris France</Geo_Coordinates>
</data>
...
(posted as an answer so I can show a code block)
There are a lot of picky details when writing XML. In Python, you should probably use some version of ElementTree to help with that. One good tutorial is Creating XML Documents. Quoting from there:
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
top = Element('top')
comment = Comment('Generated for PyMOTW')
top.append(comment)
child = SubElement(top, 'child')
child.text = 'This child contains text.'
child_with_tail = SubElement(top, 'child_with_tail')
child_with_tail.text = 'This child has regular text.'
child_with_tail.tail = 'And "tail" text.'
child_with_entity_ref = SubElement(top, 'child_with_entity_ref')
child_with_entity_ref.text = 'This & that'
print(tostring(top))
If you use this as an example of how to create a tree of XML elements, you should be able to translate your code into the XML structure you need.
Importing pandas and saving file name:
import pandas as pd
csvFile = 'myData.csv'
The following will read CSV into a pandas data frame, then convert to XML.
df = pd.read_csv(path)
df_xml = df.to_xml()
The below code will create a new file and then save the XML data to a file named "csv2xml"
f = open("csv2xml.xml", "w")
f.write(df_xml)
f.close()

Normalize XML text node in Python minidom

I want to insert this string:
No, on the 5<Font Script="super">th</Font>
as a Text Node in XML by xml.dom.minidom createTextNode(), however, after I writexml() to a file, the signs:
< > "
turns to:
No, on the 5<Font Script="super">th</Font>
How can I avoid this? Thanks.
A part of my code:
impl = minidom.getDOMImplementation()
dom = impl.createDocument(None, None, None)
TextTextNode = dom.createTextNode(text.decode("utf-8"))
Text = dom.createElement("Text")
Text.appendChild(TextTextNode)
fileToWrite = codecs.open(output, 'w', encoding='utf-8')
dom.writexml(fileToWrite, indent=" ", addindent=" ", newl="\n", encoding='utf-8')
fileToWrite.close()
There is a sample for this by the cinecanvase specification:
<Text HAlign=”left” HPosition=”10.2” VAlign=”bottom” VPosition=”10.0”> This <Font Script=”super”>word </Font>is superscript </Text >
I need insert the <Font>..</Font> into another element, the .
I'm not familiar with that format, but that thing looks like an XML node. Try this:
from xml.dom import minidom
import codecs
output = "test.xml"
text="No, on the 5"
impl = minidom.getDOMImplementation()
dom = impl.createDocument(None, None, None)
FontNode = dom.createElement("Font")
FontNode.setAttribute('Script', 'super')
FontNode.appendChild(dom.createTextNode('th'))
Text = dom.createElement("Text")
TextTextNode = dom.createTextNode(text.decode("utf-8"))
Text.appendChild(TextTextNode)
Text.appendChild(FontNode)
fileToWrite = codecs.open(output, 'w', encoding='utf-8')
Text.writexml(fileToWrite, indent=" ", addindent=" ", newl="\n")
fileToWrite.close()
That outputs:
<Text>
No, on the 5
<Font Script="super">th</Font>
</Text>
Be aware that what you want to write a tree in a file (when you call writexml) you need to call the writexml method with your XML's tree root (you were calling it with dom, not with your root node)

Categories

Resources