Problem with newlines when I use toprettyxml() - python

I'm currently using the toprettyxml() function of the xml.dom module in a Python script and I'm having some trouble with the newlines.
If don't use the newl parameter or if I use toprettyxml(newl='\n') it displays several newlines instead of only one.
For instance
f = open(filename, 'w')
f.write(dom1.toprettyxml(encoding='UTF-8'))
f.close()
displayed:
<params>
<param name="Level" value="#LEVEL#"/>
<param name="Code" value="281"/>
</params>
Does anyone know where the problem comes from and how I can use it?
FYI I'm using Python 2.6.1

I found another great solution :
f = open(filename, 'w')
dom_string = dom1.toprettyxml(encoding='UTF-8')
dom_string = os.linesep.join([s for s in dom_string.splitlines() if s.strip()])
f.write(dom_string)
f.close()
Above solution basically removes the unwanted newlines from the dom_string which are generated by toprettyxml().
Inputs taken from -> What's a quick one-liner to remove empty lines from a python string?

toprettyxml() is quite awful. It is not a matter of Windows and '\r\n'. Trying any string as the newlparameter shows that too many lines are being added. Not only that, but other blanks (that may cause you problems when a machine reads the xml) are also added.
Some workarounds available at
http://ronrothman.com/public/leftbraned/xml-dom-minidom-toprettyxml-and-silly-whitespace

toprettyxml(newl='') works for me on Windows.

This is a pretty old question but I guess I know what the problem is:
Minidoms pretty print has a pretty straight forward method. It just adds the characters that you specified as arguments. That means, it will duplicate the characters if they already exist.
E.g. if you parse an XML file that looks like this:
<parent>
<child>
Some text
</child>
</parent>
there are already newline characters and indentions within the dom. Those are taken as text nodes by minidom and are still there when you parse it it into a dom object.
If you now proceed to convert the dom object into an XML string, those text nodes will still be there. Meaning new line characters and indent tabs are still remaining. Using pretty print now, will just add more new lines and more tabs. That's why in this case not using pretty print at all or specifying newl='' will result in the wanted output.
However, you generate the dom in your script, the text nodes will not be there, therefore pretty printing with newl='\r\n' and/or addindent='\t' will turn out quite pretty.
TL;DR Indents and newlines remain from parsing and pretty print just adds more

If you don't mind installing new packages, try beautifulsoup. I had very good experiences with its xml prettyfier.

Following function worked for my problem.
I had to use python 2.7 and i was not allowed to install any 3rd party additional package.
The crux of implementation is as follows:
Use dom.toprettyxml()
Remove all white spaces
Add new lines and tabs as per your requirement.
~
import os
import re
import xml.dom.minidom
import sys
class XmlTag:
opening = 0
closing = 1
self_closing = 2
closing_tag = "</"
self_closing_tag = "/>"
opening_tag = "<"
def to_pretty_xml(xml_file_path):
pretty_xml = ""
space_or_tab_count = " " # Add spaces or use \t
tab_count = 0
last_tag = -1
dom = xml.dom.minidom.parse(xml_file_path)
# get pretty-printed version of input file
string_xml = dom.toprettyxml(' ', os.linesep)
# remove version tag
string_xml = string_xml.replace("<?xml version=\"1.0\" ?>", '')
# remove empty lines and spaces
string_xml = "".join(string_xml.split())
# move each tag to new line
string_xml = string_xml.replace('>', '>\n')
for line in string_xml.split('\n'):
if line.__contains__(XmlTag.closing_tag):
# For consecutive closing tags decrease the indentation
if last_tag == XmlTag.closing:
tab_count = tab_count - 1
# Move closing element to next line
if last_tag == XmlTag.closing or last_tag == XmlTag.self_closing:
pretty_xml = pretty_xml + '\n' + (space_or_tab_count * tab_count)
pretty_xml = pretty_xml + line
last_tag = XmlTag.closing
elif line.__contains__(XmlTag.self_closing_tag):
# Print self closing on next line with one indentation from parent node
pretty_xml = pretty_xml + '\n' + (space_or_tab_count * (tab_count+1)) + line
last_tag = XmlTag.self_closing
elif line.__contains__(XmlTag.opening_tag):
# For consecutive opening tags increase the indentation
if last_tag == XmlTag.opening:
tab_count = tab_count + 1
# Move opening element to next line
if last_tag == XmlTag.opening or last_tag == XmlTag.closing:
pretty_xml = pretty_xml + '\n' + (space_or_tab_count * tab_count)
pretty_xml = pretty_xml + line
last_tag = XmlTag.opening
return pretty_xml
pretty_xml = to_pretty_xml("simple.xml")
with open("pretty.xml", 'w') as f:
f.write(pretty_xml)

This gives me nice XML on Python 3.6, haven't tried on Windows:
dom = xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = dom.toprettyxml(newl='').replace("\n\n", "\n")

Are you viewing the resulting file on Windows? If so, try using toprettyxml(newl='\r\n').

Related

Python: Modifying text file, deleting substring

I have a text file which has a line like this -
time time B2CAT_INLET_T\CAN-Monitoring:1 B1CAT_MIDBED_T\CAN-Monitoring:1 B1CAT_INLET_T\CAN-Monitoring:1 B1CAT_OUTLET_T\CAN-Monitoring:1 time APEPFRPP\CCP:1 KDFILRAW\CCP:1
When I read it using
lines = txtfile.readlines()
I get lines =
'time\ttime\tB2CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_MIDBED_T\\CAN-Monitoring:1\tB1CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_OUTLET_T\\CAN-Monitoring:1\ttime\tAPEPFRPP\\CCP:1\tKDFILRAW\\CCP:1\t\t'
So the '\' show as 'double \' and the tab shows as '\t'
From this I want to delete all instances of '\CAN-Monitoring:1' and '\CCP:1' and preserve the tabs as they are.
I have a code that walks through each element of 'lines' and gets index of each 'double \' and '\t'
Then I tried to use lines.replace(index of 'double \':index of '\t','')
But this does not seem to work as I want.
Following is my code so far:
# Reading from text file
txtfile = open('filename.txt', 'r')
lines = txtfile.readlines()
textToModify = lines
# This gives indices of all '\\' and '\t'
doubleslash = []
tab = []
for i, item in enumerate(textToModify):
if textToModify[i] == '\\':
doubleslash.append(i)
for i, item in enumerate(textToModify):
if textToModify[i] == '\t':
tab.append(i)
# Should find text beginning with '\\' until '\t' only
itemSlashBegin = []
itemTabBegin = []
for itemSlash in doubleslash:
for itemTab in tab:
if itemSlash < itemTab:
break
itemSlashBegin.append(itemSlash)
itemTabBegin.append(itemTab)
# Trying to replace '\\'text'\t' in the original text
for i,item in enumerate(itemSlashBegin):
ModifiedTxt = textToModify.replace([item:itemTabBegin[i]],"")
I am sure there is a more elegant way too; but I cannot find it.
Please give me some solution.
Thank you
R
If you don't want to import anything then use this
f = 'time\ttime\tB2CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_MIDBED_T\\CAN-Monitoring:1\tB1CAT_INLET_T\\CAN-Monitoring:1\tB1CAT_OUTLET_T\\CAN-Monitoring:1\ttime\tAPEPFRPP\\CCP:1\tKDFILRAW\\CCP:1\t\t'
s= ('\CAN-Monitoring:1','\CCP:1')
for i in s:
f=f.replace(i, '')
print(f)
time time B2CAT_INLET_T B1CAT_MIDBED_T B1CAT_INLET_T B1CAT_OUTLET_T time APEPFRPP KDFILRAW
Just use re.sub here:
out = re.sub(r'\\CAN-Monitoring:1|\\CCP:1', '', inp)
print(out)
This prints:
time time B2CAT_INLET_T B1CAT_MIDBED_T B1CAT_INLET_T B1CAT_OUTLET_T time APEPFRPP KDFILRAW
Note that double backslash and \t are simply how a literal backslash and tab character are represented in a Python string.

REGEX - Finding an specific XML tag and parsing through it

My xml looks like the following :
<example>
<Test_example>Author%5773637864827/Testing-75873874hdueu47.jpg</Test_example>
<Test_example>Auth0r%5773637864827/Testing245-75873874hdu6543u47.ts</Test_example>
<newtag>
This XML has 100 lines and i am interested in the tag "<Test_example>". In this tag I want to remove everything until it sees a / and when it sees a - remove everything until it sees the full stop.
End result should be
<Test_example>Testing.jpg</Test_example>
<Test_example>Testing245.ts</Test_example>
I am a beginner and would love some help on this. I need a regex soloution. The code i have running before this is a find and replace like follows.
new = open('test.xml')
with open('test.xml', 'r') as f:
onw = f.read().replace('new:', 'ext:')
Based on your sample data I came up with the following regex and this is how I tested it.
import re
example_string = """<example>
<Test_example>Author%5773637864827/Testing-75873874hdueu47.jpg</Test_example>
<Test_example>Auth0r%5773637864827/Testing245-75873874hdu6543u47.ts</Test_example>
<newtag>"""
my_list = example_string.split('\n')
my_regex = re.compile('(<Test_example>)\S+%\d+/(\S+)-\S+(\.\S+)(</Test_example>)')
for line in my_list:
if re.search(my_regex, line):
match = re.search(my_regex, line)
print(match.group(1) + match.group(2) + match.group(3) + match.group(4))
Output:
<Test_example>Testing.jpg</Test_example>
<Test_example>Testing245.ts</Test_example>

Change tab spacing in python lxml prettyprint

I have a small script that creates an xml document and using the prettyprint=true it makes a correctly formatted xml document. However, the tab indents are 2 spaces, and I am wondering if there is a way to change this to 4 spaces (I think it looks better with 4 spaces). Is there a simple way to implement this?
Code snippet:
doc = lxml.etree.SubElement(root, 'dependencies')
for depen in dependency_list:
dependency = lxml.etree.SubElement(doc, 'dependency')
lxml.etree.SubElement(dependency, 'groupId').text = depen.group_id
lxml.etree.SubElement(dependency, 'artifactId').text = depen.artifact_id
lxml.etree.SubElement(dependency, 'version').text = depen.version
if depen.scope == 'provided' or depen.scope == 'test':
lxml.etree.SubElement(dependency, 'scope').text = depen.scope
exclusions = lxml.etree.SubElement(dependency, 'exclusions')
exclusion = lxml.etree.SubElement(exclusions, 'exclusion')
lxml.etree.SubElement(exclusion, 'groupId').text = '*'
lxml.etree.SubElement(exclusion, 'artifactId').text = '*'
tree.write('explicit-pom.xml' , pretty_print=True)
If someone is still trying to achieve this, it can be done using etree.indent() method in lxml 4.5 -
>>> etree.indent(root, space=" ")
>>> print(etree.tostring(root))
<root>
<a>
<b/>
</a>
</root>
https://lxml.de/tutorial.html#serialisation
This doesn't seem to be possible by the python lxml API.
A possible solution for tab spacing would be:
def prettyPrint(someRootNode):
lines = lxml.etree.tostring(someRootNode, encoding="utf-8", pretty_print=True).decode("utf-8").split("\n")
for i in range(len(lines)):
line = lines[i]
outLine = ""
for j in range(0, len(line), 2):
if line[j:j + 2] == " ":
outLine += "\t"
else:
outLine += line[j:]
break
lines[i] = outLine
return "\n".join(lines)
Please note that this is not very efficient. High efficiency can only be achieved if this functionality is natively implemented within the lxml C code.

python, string.replace() and \n

(Edit: the script seems to work for others here trying to help. Is it because I'm running python 2.7? I'm really at a loss...)
I have a raw text file of a book I am trying to tag with pages.
Say the text file is:
some words on this line,
1
DOCUMENT TITLE some more words here too.
2
DOCUMENT TITLE and finally still more words.
I am trying to use python to modify the example text to read:
some words on this line,
</pg>
<pg n=2>some more words here too,
</pg>
<pg n=3>and finally still more words.
My strategy is to load the text file as a string. Build search-for and a replace-with strings corresponding to a list of numbers. Replace all instances in string, and write to a new file.
Here is the code I've written:
from sys import argv
script, input, output = argv
textin = open(input,'r')
bookstring = textin.read()
textin.close()
pages = []
x = 1
while x<400:
pages.append(x)
x = x + 1
pagedel = "DOCUMENT TITLE"
for i in pages:
pgdel = "%d\n%s" % (i, pagedel)
nplus = i + 1
htmlpg = "</p>\n<p n=%d>" % nplus
bookstring = bookstring.replace(pgdel, htmlpg)
textout = open(output, 'w')
textout.write(bookstring)
textout.close()
print "Updates to %s printed to %s" % (input, output)
The script runs without error, but it also makes no changes whatsoever to the input text. It simply reprints it character for character.
Does my mistake have to do with the hard return? \n? Any help greatly appreciated.
In python, strings are immutable, and thus replace returns the replaced output instead of replacing the string in place.
You must do:
bookstring = bookstring.replace(pgdel, htmlpg)
You've also forgot to call the function close(). See how you have textin.close? You have to call it with parentheses, like open:
textin.close()
Your code works for me, but I might just add some more tips:
Input is a built-in function, so perhaps try renaming that. Although it works normally, it might not for you.
When running the script, don't forget to put the .txt ending:
$ python myscript.py file1.txt file2.txt
Make sure when testing your script to clear the contents of file2.
I hope these help!
Here's an entirely different approach that uses re(import the re module for this to work):
doctitle = False
newstr = ''
page = 1
for line in bookstring.splitlines():
res = re.match('^\\d+', line)
if doctitle:
newstr += '<pg n=' + str(page) + '>' + re.sub('^DOCUMENT TITLE ', '', line)
doctitle = False
elif res:
doctitle = True
page += 1
newstr += '\n</pg>\n'
else:
newstr += line
print newstr
Since no one knows what's going on, it's worth a try.

Converting HTML list (<li>) to tabs (i.e. indentation)

Have worked in dozens of languages but new to Python.
My first (maybe second) question here, so be gentle...
Trying to efficiently convert HTML-like markdown text to wiki format (specifically, Linux Tomboy/GNote notes to Zim) and have gotten stuck on converting lists.
For a 2-level unordered list like this...
First level
Second level
Tomboy/GNote uses something like...
<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>
However, the Zim personal wiki wants that to be...
* First level
* Second level
... with leading tabs.
I've explored the regex module functions re.sub(), re.match(), re.search(), etc. and found the cool Python ability to code repeating text as...
count * "text"
Thus, it looks like there should be a way to do something like...
newnote = re.sub("<list>", LEVEL * "\t", oldnote)
Where LEVEL is the ordinal (occurrance) of <list> in the note. It would thus be 0 for the first <list> incountered, 1 for the second, etc.
LEVEL would then be decremented each time </list> was encountered.
<list-item> tags are converted to the asterisk for the bullet (preceded by newline as appropriate) and </list-item> tags dropped.
Finally... the question...
How do I get the value of LEVEL and use it as a tabs multiplier?
You should really use an xml parser to do this, but to answer your question:
import re
def next_tag(s, tag):
i = -1
while True:
try:
i = s.index(tag, i+1)
except ValueError:
return
yield i
a = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
a = a.replace("<list-item>", "* ")
for LEVEL, ind in enumerate(next_tag(a, "<list>")):
a = re.sub("<list>", "\n" + LEVEL * "\t", a, 1)
a = a.replace("</list-item>", "")
a = a.replace("</list>", "")
print a
This will work for your example, and your example ONLY. Use an XML parser. You can use xml.dom.minidom (it's included in Python (2.7 at least), no need to download anything):
import xml.dom.minidom
def parseList(el, lvl=0):
txt = ""
indent = "\t" * (lvl)
for item in el.childNodes:
# These are the <list-item>s: They can have text and nested <list> tag
for subitem in item.childNodes:
if subitem.nodeType is xml.dom.minidom.Element.TEXT_NODE:
# This is the text before the next <list> tag
txt += "\n" + indent + "* " + subitem.nodeValue
else:
# This is the next list tag, its indent level is incremented
txt += parseList(subitem, lvl=lvl+1)
return txt
def parseXML(s):
doc = xml.dom.minidom.parseString(s)
return parseList(doc.firstChild)
a = "<list><list-item>First level<list><list-item>Second level</list-item><list-item>Second level 2<list><list-item>Third level</list-item></list></list-item></list></list-item></list>"
print parseXML(a)
Output:
* First level
* Second level
* Second level 2
* Third level
Use Beautiful soup , it allows you to iterate in the tags even if they are customs. Very pratical for doing this type of operation
from BeautifulSoup import BeautifulSoup
tags = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
soup = BeautifulSoup(tags)
print [[ item.text for item in list_tag('list-item')] for list_tag in soup('list')]
Output : [[u'First level'], [u'Second level']]
I used a nested list comprehension but you can use a nested for loop
for list_tag in soup('list'):
for item in list_tag('list-item'):
print item.text
I hope that helps you.
In my example I used BeautifulSoup 3 but the example should work with BeautifulSoup4 but only the import change.
from bs4 import BeautifulSoup

Categories

Resources