I have a small script that creates an XML document, and with pretty_print=True it produces a correctly formatted document. However, the indents are 2 spaces, and I am wondering if there is a way to change this to 4 spaces (I think it looks better that way). Is there a simple way to implement this?
Code snippet:
doc = lxml.etree.SubElement(root, 'dependencies')
for depen in dependency_list:
    dependency = lxml.etree.SubElement(doc, 'dependency')
    lxml.etree.SubElement(dependency, 'groupId').text = depen.group_id
    lxml.etree.SubElement(dependency, 'artifactId').text = depen.artifact_id
    lxml.etree.SubElement(dependency, 'version').text = depen.version
    if depen.scope == 'provided' or depen.scope == 'test':
        lxml.etree.SubElement(dependency, 'scope').text = depen.scope
    exclusions = lxml.etree.SubElement(dependency, 'exclusions')
    exclusion = lxml.etree.SubElement(exclusions, 'exclusion')
    lxml.etree.SubElement(exclusion, 'groupId').text = '*'
    lxml.etree.SubElement(exclusion, 'artifactId').text = '*'
tree.write('explicit-pom.xml', pretty_print=True)
If someone is still trying to achieve this: it can be done with the etree.indent() method added in lxml 4.5 -
>>> etree.indent(root, space="    ")
>>> print(etree.tostring(root))
<root>
    <a>
        <b/>
    </a>
</root>
https://lxml.de/tutorial.html#serialisation
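For the original question's 4-space preference, a minimal sketch (assuming lxml >= 4.5 is installed):

```python
from lxml import etree

# Build a small tree, then re-indent it with four spaces per level.
# etree.indent() rewrites the whitespace text nodes in place.
root = etree.XML("<root><a><b/></a></root>")
etree.indent(root, space="    ")  # default is two spaces
print(etree.tostring(root).decode())
```

The same indent(space=...) API also exists in the standard library's xml.etree.ElementTree since Python 3.9.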
This doesn't seem to be possible with the Python lxml API.
A possible solution for tab spacing would be:
def prettyPrint(someRootNode):
    lines = lxml.etree.tostring(someRootNode, encoding="utf-8", pretty_print=True).decode("utf-8").split("\n")
    for i in range(len(lines)):
        line = lines[i]
        outLine = ""
        for j in range(0, len(line), 2):
            if line[j:j + 2] == "  ":
                outLine += "\t"
            else:
                outLine += line[j:]
                break
        lines[i] = outLine
    return "\n".join(lines)
Please note that this is not very efficient. High efficiency can only be achieved if this functionality is natively implemented within the lxml C code.
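A shorter post-processing variant along the same lines, assuming the default two-space indent produced by pretty_print, swaps each leading pair of spaces for a tab with one regex (reindent_with_tabs is a hypothetical helper name, not part of lxml):

```python
import re

def reindent_with_tabs(xml_text):
    # Replace each leading pair of spaces with one tab, line by line.
    return re.sub(
        r"^(?:  )+",
        lambda m: "\t" * (len(m.group(0)) // 2),
        xml_text,
        flags=re.MULTILINE,
    )

print(reindent_with_tabs("<root>\n  <a>\n    <b/>\n  </a>\n</root>"))
```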
Hi, I could really use some help on a Python project that I'm working on. Basically, I have a list of banned words, and I must go through a .txt file, search for these specific words, and change each one from its original form to ***.
text_file = open('filename.txt','r')
text_file_read = text_file.readlines()
banned_words = ['is','activity', 'one']
words = []
i = 0
while i < len(text_file_read):
    words.append(text_file_read[i].strip().lower().split())
    i += 1
i = 0
while i < len(words):
    if words[i] in banned_words:
        words[i] = '*'*len(words[i])
    i += 1
i = 0
text_file_write = open('filename.txt', 'w')
while i < len(text_file_read):
    print(' '.join(words[i]), file = text_file_write)
    i += 1
The expected output would be:
This **
********
***?
However, it's:
This is
activity
one?
Any help is greatly appreciated! I'm also trying to avoid external libraries.
I cannot solve this for you (haven't touched Python in a while), but the best debugging tip I can offer is: print everything. Take the first loop and print every iteration, or print what "words" is afterwards. It will give you insight into what's going wrong, and once you know what is working in an unexpected way, you can search for how to fix it.
Also, if you're just starting, avoid concatenating methods. It ends up a bit unreadable, and you can't see what each method is doing. In my opinion at least, it's better to have 30 lines of readable and easy-to-understand code, than 5 that take some brain power to understand.
Good luck!
A simpler way, if you just need to print it:
banned_words = ['is','activity', 'one']
output = ""
f = open('filename.txt','r')
for line in f:
    for word in line.rsplit():
        if not word in banned_words:
            output += word + " "
        else:
            output += "*"*len(word) + " "
    output += "\n"
print(output)
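For what it's worth, the root cause in the question's code is that words.append(...split()) stores a whole list per line, so words[i] is a list and never equals a banned word. A minimal sketch that compares individual words instead (censor is a hypothetical helper name; punctuation handling is left out):

```python
banned_words = ['is', 'activity', 'one']

def censor(text):
    # Compare each individual word (lowercased) against the banned list,
    # replacing matches with asterisks of the same length.
    out_lines = []
    for line in text.splitlines():
        out_words = []
        for word in line.split():
            if word.lower() in banned_words:
                out_words.append('*' * len(word))
            else:
                out_words.append(word)
        out_lines.append(' '.join(out_words))
    return '\n'.join(out_lines)

print(censor("This is\nactivity\none"))  # This ** / ******** / ***
```

Note that a word glued to punctuation ("one?") won't match the banned list as written; that needs extra stripping.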
from docx import Document

alphaDic = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','!','?','.','~',',','(',')','$','-',':',';',"'",'/']
doc = Document('realexample.docx')
docIndex = 0

def delete_paragraph(paragraph):
    p = paragraph._element
    p.getparent().remove(p)
    p._p = p._element = None

while docIndex < len(doc.paragraphs):
    firstSen = doc.paragraphs[docIndex].text
    rep_dic = {ord(k):None for k in alphaDic + [x.upper() for x in alphaDic]}
    translation = (firstSen.translate(rep_dic))
    removeExcessSpaces = " ".join(translation.split())
    if removeExcessSpaces != '':
        doc.paragraphs[docIndex].text = removeExcessSpaces
    else:
        delete_paragraph(doc.paragraphs[docIndex])
        docIndex -= 1  # go one step back in the loop because of the deleted index
    docIndex += 1
So the test document looks like this
Hello
你好
Good afternoon
朋友们
Good evening
晚上好
And I'm trying to achieve this result below.
你好
朋友们
晚上好
Right now the code removes all empty paragraphs and excessive spaces and does this, so I'm kinda stuck here. I only want to erase the line breaks that were caused by the English words.
你好
朋友们
晚上好
What you can do is look for English words: once you find an English word WORD, append "\n" to it, and then remove the new result "WORD\n" from the document. The way you append strings in Python is with the + sign: just do "WORD" + "\n".
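One way to sketch the detection step (is_english_only is a hypothetical helper, not part of python-docx): classify a paragraph by whether it contains Latin letters but no CJK characters, and hand English-only paragraphs to the question's delete_paragraph(). Shown here on plain strings:

```python
import re

def is_english_only(text):
    # A paragraph counts as "English" when it has Latin letters
    # and no CJK characters (basic CJK Unified Ideographs block).
    has_latin = re.search(r'[A-Za-z]', text) is not None
    has_cjk = re.search(r'[\u4e00-\u9fff]', text) is not None
    return has_latin and not has_cjk

paragraphs = ["Hello", "你好", "Good afternoon", "朋友们", "Good evening", "晚上好"]
kept = [p for p in paragraphs if not is_english_only(p)]
print(kept)  # ['你好', '朋友们', '晚上好']
```

In the docx loop, that would mean calling delete_paragraph() whenever is_english_only(paragraph.text) is true, instead of stripping characters one by one.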
(Edit: the script seems to work for others here trying to help. Is it because I'm running python 2.7? I'm really at a loss...)
I have a raw text file of a book I am trying to tag with pages.
Say the text file is:
some words on this line,
1
DOCUMENT TITLE some more words here too.
2
DOCUMENT TITLE and finally still more words.
I am trying to use python to modify the example text to read:
some words on this line,
</pg>
<pg n=2>some more words here too,
</pg>
<pg n=3>and finally still more words.
My strategy is to load the text file as a string. Build search-for and a replace-with strings corresponding to a list of numbers. Replace all instances in string, and write to a new file.
Here is the code I've written:
from sys import argv
script, input, output = argv

textin = open(input,'r')
bookstring = textin.read()
textin.close()

pages = []
x = 1
while x < 400:
    pages.append(x)
    x = x + 1

pagedel = "DOCUMENT TITLE"
for i in pages:
    pgdel = "%d\n%s" % (i, pagedel)
    nplus = i + 1
    htmlpg = "</p>\n<p n=%d>" % nplus
    bookstring = bookstring.replace(pgdel, htmlpg)

textout = open(output, 'w')
textout.write(bookstring)
textout.close()
print "Updates to %s printed to %s" % (input, output)
The script runs without error, but it also makes no changes whatsoever to the input text. It simply reprints it character for character.
Does my mistake have to do with the hard return? \n? Any help greatly appreciated.
In Python, strings are immutable, and thus replace() returns the replaced output instead of replacing the string in place.
You must do:
bookstring = bookstring.replace(pgdel, htmlpg)
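A quick illustration of the immutability point, using a stand-in string:

```python
s = "1\nDOCUMENT TITLE"

s.replace("DOCUMENT TITLE", "</p>\n<p n=2>")  # return value thrown away
print(s)   # still "1\nDOCUMENT TITLE": s was not modified

s = s.replace("DOCUMENT TITLE", "</p>\n<p n=2>")  # reassign to keep the result
print(s)   # now contains the substitution
```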
You've also forgotten to call the function close(). See how you have textin.close? You have to call it with parentheses, like open:
textin.close()
Your code works for me, but I might just add some more tips:
input is the name of a built-in function, so perhaps try renaming that variable. Although it works normally, shadowing the built-in might cause problems for you.
When running the script, don't forget to put the .txt ending:
$ python myscript.py file1.txt file2.txt
Make sure when testing your script to clear the contents of file2.
I hope these help!
Here's an entirely different approach that uses re (import the re module for this to work):
doctitle = False
newstr = ''
page = 1
for line in bookstring.splitlines():
    res = re.match('^\\d+', line)
    if doctitle:
        newstr += '<pg n=' + str(page) + '>' + re.sub('^DOCUMENT TITLE ', '', line)
        doctitle = False
    elif res:
        doctitle = True
        page += 1
        newstr += '\n</pg>\n'
    else:
        newstr += line
print newstr
Since no one knows what's going on, it's worth a try.
Have worked in dozens of languages but new to Python.
My first (maybe second) question here, so be gentle...
Trying to efficiently convert HTML-like markdown text to wiki format (specifically, Linux Tomboy/GNote notes to Zim) and have gotten stuck on converting lists.
For a 2-level unordered list like this...
First level
    Second level
Tomboy/GNote uses something like...
<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>
However, the Zim personal wiki wants that to be...
* First level
	* Second level
... with leading tabs.
I've explored the regex module functions re.sub(), re.match(), re.search(), etc. and found the cool Python ability to code repeating text as...
count * "text"
Thus, it looks like there should be a way to do something like...
newnote = re.sub("<list>", LEVEL * "\t", oldnote)
Where LEVEL is the ordinal (occurrence) of <list> in the note. It would thus be 0 for the first <list> encountered, 1 for the second, etc.
LEVEL would then be decremented each time </list> was encountered.
<list-item> tags are converted to the asterisk for the bullet (preceded by newline as appropriate) and </list-item> tags dropped.
Finally... the question...
How do I get the value of LEVEL and use it as a tabs multiplier?
You should really use an XML parser to do this, but to answer your question:
import re

def next_tag(s, tag):
    i = -1
    while True:
        try:
            i = s.index(tag, i+1)
        except ValueError:
            return
        yield i

a = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
a = a.replace("<list-item>", "* ")
for LEVEL, ind in enumerate(next_tag(a, "<list>")):
    a = re.sub("<list>", "\n" + LEVEL * "\t", a, 1)
a = a.replace("</list-item>", "")
a = a.replace("</list>", "")
print a
This will work for your example, and your example ONLY. Use an XML parser. You can use xml.dom.minidom (it's included in Python (2.7 at least), no need to download anything):
import xml.dom.minidom

def parseList(el, lvl=0):
    txt = ""
    indent = "\t" * (lvl)
    for item in el.childNodes:
        # These are the <list-item>s: they can have text and a nested <list> tag
        for subitem in item.childNodes:
            if subitem.nodeType is xml.dom.minidom.Element.TEXT_NODE:
                # This is the text before the next <list> tag
                txt += "\n" + indent + "* " + subitem.nodeValue
            else:
                # This is the next list tag, its indent level is incremented
                txt += parseList(subitem, lvl=lvl+1)
    return txt

def parseXML(s):
    doc = xml.dom.minidom.parseString(s)
    return parseList(doc.firstChild)

a = "<list><list-item>First level<list><list-item>Second level</list-item><list-item>Second level 2<list><list-item>Third level</list-item></list></list-item></list></list-item></list>"
print parseXML(a)
Output:
* First level
	* Second level
	* Second level 2
		* Third level
Use Beautiful Soup: it allows you to iterate over the tags even if they are custom. Very practical for this type of operation.
from BeautifulSoup import BeautifulSoup
tags = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
soup = BeautifulSoup(tags)
print [[ item.text for item in list_tag('list-item')] for list_tag in soup('list')]
Output : [[u'First level'], [u'Second level']]
I used a nested list comprehension but you can use a nested for loop
for list_tag in soup('list'):
    for item in list_tag('list-item'):
        print item.text
I hope that helps you.
In my example I used BeautifulSoup 3, but it should also work with BeautifulSoup 4; only the import changes:
from bs4 import BeautifulSoup
I'm currently using the toprettyxml() function of the xml.dom module in a Python script, and I'm having some trouble with the newlines.
If I don't use the newl parameter, or if I use toprettyxml(newl='\n'), it displays several newlines instead of only one.
For instance
f = open(filename, 'w')
f.write(dom1.toprettyxml(encoding='UTF-8'))
f.close()
displayed:
<params>

    <param name="Level" value="#LEVEL#"/>

    <param name="Code" value="281"/>

</params>
Does anyone know where the problem comes from and how I can use it?
FYI I'm using Python 2.6.1
I found another great solution:
f = open(filename, 'w')
dom_string = dom1.toprettyxml(encoding='UTF-8')
dom_string = os.linesep.join([s for s in dom_string.splitlines() if s.strip()])
f.write(dom_string)
f.close()
Above solution basically removes the unwanted newlines from the dom_string which are generated by toprettyxml().
Inputs taken from -> What's a quick one-liner to remove empty lines from a python string?
toprettyxml() is quite awful. It is not a matter of Windows and '\r\n'. Trying any string as the newl parameter shows that too many lines are being added. Not only that, but other blanks (that may cause you problems when a machine reads the XML) are also added.
Some workarounds available at
http://ronrothman.com/public/leftbraned/xml-dom-minidom-toprettyxml-and-silly-whitespace
toprettyxml(newl='') works for me on Windows.
This is a pretty old question, but I guess I know what the problem is:
Minidom's pretty print uses a pretty straightforward method: it just adds the characters that you specified as arguments. That means it will duplicate characters that already exist.
E.g. if you parse an XML file that looks like this:
<parent>
    <child>
        Some text
    </child>
</parent>
there are already newline characters and indents within the DOM. Those are taken as text nodes by minidom and are still there when you parse it into a DOM object.
If you now convert the DOM object back into an XML string, those text nodes will still be there, meaning the newline characters and indent tabs remain. Using pretty print now will just add more newlines and more tabs. That's why, in this case, not using pretty print at all or specifying newl='' will give the wanted output.
However, if you generate the DOM in your script, the text nodes will not be there, so pretty printing with newl='\r\n' and/or addindent='\t' will turn out quite pretty.
TL;DR Indents and newlines remain from parsing and pretty print just adds more
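This is easy to verify: pretty-printing a document parsed with its indentation intact yields extra newlines, while the same document parsed from a compact string prints cleanly. A minimal demonstration:

```python
import xml.dom.minidom

# Parsing an already-indented document keeps its whitespace as text nodes,
# so toprettyxml() stacks its own newlines on top of the existing ones.
indented = xml.dom.minidom.parseString("<parent>\n  <child>Some text</child>\n</parent>")
compact = xml.dom.minidom.parseString("<parent><child>Some text</child></parent>")

print(indented.toprettyxml(indent="  "))  # extra blank-looking lines appear
print(compact.toprettyxml(indent="  "))   # clean output
```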
If you don't mind installing new packages, try beautifulsoup. I had very good experiences with its XML prettifier.
The following function worked for my problem.
I had to use Python 2.7, and I was not allowed to install any 3rd-party packages.
The crux of the implementation is as follows:
Use dom.toprettyxml()
Remove all whitespace
Add newlines and tabs as per your requirement.
import os
import re
import xml.dom.minidom
import sys

class XmlTag:
    opening = 0
    closing = 1
    self_closing = 2
    closing_tag = "</"
    self_closing_tag = "/>"
    opening_tag = "<"

def to_pretty_xml(xml_file_path):
    pretty_xml = ""
    space_or_tab_count = "    "  # Add spaces or use \t
    tab_count = 0
    last_tag = -1
    dom = xml.dom.minidom.parse(xml_file_path)
    # get pretty-printed version of input file
    string_xml = dom.toprettyxml(' ', os.linesep)
    # remove version tag
    string_xml = string_xml.replace("<?xml version=\"1.0\" ?>", '')
    # remove empty lines and spaces
    string_xml = "".join(string_xml.split())
    # move each tag to new line
    string_xml = string_xml.replace('>', '>\n')
    for line in string_xml.split('\n'):
        if line.__contains__(XmlTag.closing_tag):
            # For consecutive closing tags decrease the indentation
            if last_tag == XmlTag.closing:
                tab_count = tab_count - 1
            # Move closing element to next line
            if last_tag == XmlTag.closing or last_tag == XmlTag.self_closing:
                pretty_xml = pretty_xml + '\n' + (space_or_tab_count * tab_count)
            pretty_xml = pretty_xml + line
            last_tag = XmlTag.closing
        elif line.__contains__(XmlTag.self_closing_tag):
            # Print self closing on next line with one indentation from parent node
            pretty_xml = pretty_xml + '\n' + (space_or_tab_count * (tab_count+1)) + line
            last_tag = XmlTag.self_closing
        elif line.__contains__(XmlTag.opening_tag):
            # For consecutive opening tags increase the indentation
            if last_tag == XmlTag.opening:
                tab_count = tab_count + 1
            # Move opening element to next line
            if last_tag == XmlTag.opening or last_tag == XmlTag.closing:
                pretty_xml = pretty_xml + '\n' + (space_or_tab_count * tab_count)
            pretty_xml = pretty_xml + line
            last_tag = XmlTag.opening
    return pretty_xml

pretty_xml = to_pretty_xml("simple.xml")
with open("pretty.xml", 'w') as f:
    f.write(pretty_xml)
This gives me nice XML on Python 3.6 (haven't tried on Windows):
dom = xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = dom.toprettyxml(newl='').replace("\n\n", "\n")
Are you viewing the resulting file on Windows? If so, try using toprettyxml(newl='\r\n').