I am trying to write a Python script to standardise generic XML files that are used to configure websites and website forms. To do this I would like either to maintain the original attribute ordering of the elements or, even better, to rearrange the attributes in a pre-defined way. Most XML parsers I have tried rewrite the attributes in alphabetical order. As these XML files are read, written and maintained by humans, that isn't very useful.
For example, a generic element may look like this in the XML:
<Question QuestionRef="XXXXX" DataType="Integer" Text="Question Text" Availability="Shown" DefaultAnswer="X">
However once passed through elementtree and re-written to a new file this is changed to:
<Question Availability="Shown" DataType="Integer" DefaultAnswer="X" PartType="X" QuestionRef="XXXXX" Text="Question Text">
The aim of the script is to standardise a large number of XML files in order to increase readability between colleagues, and the information contained in an element's attributes has varying levels of significance (e.g. QuestionRef is highly important). This dictates that attributes need to be sensibly ordered.
I understand that Python dicts (which attributes are stored in) are naturally unordered and that the XML specification states attribute ordering is insignificant, but human readability is the driving force behind the script.
In other questions (on Stack Overflow) similar to this one I have seen it remarked that pxdom can do this (question link: link), but I cannot find any mention of how it might do this in the pxdom documentation or through a Google search. So is there some way to maintain the order of attributes, or to define it, with current XML parsers? Preferably without resorting to hotpatching :)!
Any help anyone can provide would be greatly appreciated :).
Apply a monkey patch as described below.
In the ElementTree.py file there is a function named _serialize_xml.
In this function, apply the following patch:
    ## for k, v in sorted(items):  # remove the sorted() here
    for k, v in items:
        if isinstance(k, QName):
            k = k.text
        if isinstance(v, QName):
            v = qnames[v.text]
        else:
            v = _escape_attrib(v, encoding)
        write(" %s=\"%s\"" % (qnames[k], v))
That is, remove sorted(items) and make it just items, as I have done above.
With the patch above, sorting still happens when an XML attribute has a namespace; it only works as-is for attributes without one. To disable the namespace-based sorting as well, replace every {} in ElementTree.py with collections.OrderedDict().
Now all attributes are kept in the order in which you added them to the XML element.
Before doing all of above; read the copyright message by Fredrik Lundh that is present in ElementTree.py
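As an alternative to patching the library, here is a minimal sketch of the "pre-defined order" idea from the question. It assumes Python 3.8 or later, where xml.etree.ElementTree writes attributes in insertion order instead of sorting them. The names PREFERRED_ORDER and reorder_attributes are hypothetical and not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical preferred order; attributes not listed here are placed
# afterwards in alphabetical order.
PREFERRED_ORDER = ["QuestionRef", "DataType", "Text", "Availability", "DefaultAnswer"]


def reorder_attributes(root, preferred=PREFERRED_ORDER):
    """Rewrite every element's attributes in the preferred order, in place."""
    rank = {name: position for position, name in enumerate(preferred)}
    for elem in root.iter():
        # Sort by position in the preferred list, then by name for the rest.
        items = sorted(
            elem.attrib.items(),
            key=lambda kv: (rank.get(kv[0], len(rank)), kv[0]),
        )
        # Re-inserting the items fixes the order used when serialising
        # (this relies on the Python 3.8+ behaviour assumed above).
        elem.attrib.clear()
        elem.attrib.update(items)
    return root


root = ET.fromstring(
    '<Question Availability="Shown" DataType="Integer" DefaultAnswer="X" '
    'QuestionRef="XXXXX" Text="Question Text" />'
)
reorder_attributes(root)
result = ET.tostring(root, encoding="unicode")
print(result)
```

On older interpreters the serialiser still sorts attributes, so this sketch would have no visible effect there.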
I am running topic modeling using Gensim. Before creating the document-term matrix, one needs to create a dictionary of tokens.
dictionary = corpora.Dictionary(tokenized_reviews)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in tokenized_reviews]
But, I don't understand what kind of object "dictionary" is.
So, when I type:
type(dictionary)
I get
gensim.corpora.dictionary.Dictionary
Is this a dictionary (a kind of data structure)? If so, why can't I see the content (I am just curious)?
When I type
dictionary
I get:
<gensim.corpora.dictionary.Dictionary at 0x1bac985ebe0>
The same issue exists with some of the objects in NLTK.
If this is a dictionary (as a data structure), why am I not able to see the keys and values as with any other Python dictionary?
Thanks,
Navid
This is a specific Dictionary class implemented by the Gensim project.
It will be very similar in interface to the standard Python dict (and other various Dictionary/HashMap/etc types you may have used elsewhere).
However, to see exactly what it can do, you should consult the class-specific documentation:
https://radimrehurek.com/gensim/corpora/dictionary.html
Like a dict, you can do typical operations:
len(dictionary) # gets number of entries
dictionary[key] # gets the value (word) stored under a certain key (an integer id)
dictionary.keys() # gets all stored keys
The reason you see a generic <gensim.corpora.dictionary.Dictionary at 0x1bac985ebe0> when you try to display the value of the dictionary itself is that it hasn't defined any convenience display-string with more info, so you're seeing the default for any random Python object. (Such dictionaries are usually far too large to usefully dump their full contents whenever asked, generically, to "show yourself".)
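The point about the default display string can be illustrated without Gensim at all. The sketch below uses two made-up classes (PlainVocab and DescribedVocab are hypothetical stand-ins, not Gensim classes): one relies on the default representation, the other defines __repr__ to show a short summary.

```python
class PlainVocab:
    """Toy mapping-like class that does not define __repr__."""

    def __init__(self, token2id):
        self.token2id = dict(token2id)

    def __len__(self):
        return len(self.token2id)


class DescribedVocab(PlainVocab):
    """Same data, but with a short, size-bounded display string."""

    def __repr__(self):
        return "DescribedVocab(%d tokens)" % len(self)


plain = PlainVocab({"good": 0, "bad": 1})
described = DescribedVocab({"good": 0, "bad": 1})

plain_repr = repr(plain)          # default form: <...PlainVocab object at 0x...>
described_repr = repr(described)  # custom form defined above

print(plain_repr)
print(described_repr)
```

The data is present in both objects; only the display differs, which is why inspecting the object through its documented methods and attributes still works.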
I have a python 3.5 tuple where a typical structure of a data item is something like this.
item = (PosixPath('/mnt/dson/Music/iTunes/iTunes Music/funtoons.mp3'), tagtypes(txt=False, word=False, ebook=False, image=False, exe=False, iso=False, zip=False, raw=False, audio=True, music=True, photoshop=False, video=False, src=False, geek=False, pdf=False, appledouble=False, dot=False), fileinfo(size=13229145, datetime=1333848240.0))
This describes a common file on my Linux filesystem. If I want to know the size of the given file, I can access it with something like item[2].size. Similarly, logic to grab the tags describing the file's contents would use code like item[1].music, etc.
It seems on the face of it, with each object being unique in the tuple
that if you wanted to access one of the members, you should be able to
drill down in the tuple and do something like item.fileinfo.size. All of
the information to select the correct item from the tuple is deducible
to the interpreter. However, if you do attempt something like
item.fileinfo.size you will get (almost expectedly) an attribute error.
I could create a namedtuple of namedtuples but that has a bit of a code smell to it.
I'm wondering if there is a more pythonic way to access the members of the tuple other than by indexing or unpacking. Is there some kind of
shorthand notation such that you convey to the interpreter which one of
the tuple's elements you must be referencing (because none of the other
options will fit the pattern)?
This is kind of a hard thing to explain and I'm famous for leaving out
critical parts. So if more info is needed by the first responders, please
let me know and I'll try and describe the idea more fully.
You really think doing this:
from collections import namedtuple
Item = namedtuple('Item', 'PosixPath tagtypes fileinfo')
item = Item(PosixPath('/mnt/dson/Music/iTunes/iTunes Music/funtoons.mp3'), tagtypes(txt=False, word=False, ebook=False, image=False, exe=False, iso=False, zip=False, raw=False, audio=True, music=True, photoshop=False, video=False, src=False, geek=False, pdf=False, appledouble=False, dot=False), fileinfo(size=13229145, datetime=1333848240.0))
is not worth it if it lets you do item.fileinfo.size AND item[2].size ? That's pretty clean. It avoids creating classes by hand and gives you all the functionality in a clear and concise manner. Seems like pretty good python to me.
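A self-contained sketch of the nested-namedtuple approach follows. The field lists are abbreviated, and the type names TagTypes and FileInfo stand in for the question's tagtypes and fileinfo, whose definitions are not shown, so treat them as assumptions.

```python
from collections import namedtuple
from pathlib import PurePosixPath

# Abbreviated, assumed definitions of the inner record types.
TagTypes = namedtuple("TagTypes", "audio music video")
FileInfo = namedtuple("FileInfo", "size datetime")

# The outer record names each slot, so no positional index is needed.
Item = namedtuple("Item", "path tagtypes fileinfo")

item = Item(
    PurePosixPath("/mnt/dson/Music/iTunes/iTunes Music/funtoons.mp3"),
    TagTypes(audio=True, music=True, video=False),
    FileInfo(size=13229145, datetime=1333848240.0),
)

# Attribute access and index access refer to the same data.
print(item.fileinfo.size, item[2].size)
print(item.tagtypes.music, item[1].music)

# Unpacking still works, because a namedtuple is a tuple.
path, tags, info = item
```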
I have to parse an XML file with a large number of string values. For example:
<value>Foo</value>
<value>Bar</value>
<value>Baz</value>
<value>Foo</value>
Some of them are equal. There are multiple recurring strings, not just one as in the example above. Hence I would like to detect such values, and link them with XLink: to create a reference at one of the instances of a recurring string (doesn't have to be at the first one), and to link the rest (I can use UUIDs), like here:
<value id="D5494447-A010-4F81-9DDA-E5DFFBD616FF">Foo</value>
<value>Bar</value>
<value>Baz</value>
<value href="#D5494447-A010-4F81-9DDA-E5DFFBD616FF"/>
I am starting with XLinks so perhaps the above doesn't make sense. If that is not possible, another possibility is that I can create a dictionary containing such values:
{'D5494447-A010-4F81-9DDA-E5DFFBD616FF' : 'Foo'}
And then somehow put them in the XML. What is the simplest way to achieve these? I don't care much about the most efficient way as long as the method is correct and simple to implement, since I am a Python beginner and not a computer scientist, and computational complexity is not an issue. Parsing and writing XMLs is not an issue (I figured it out with lxml), so the question here is only about the detection of recurring strings and their linking.
One way is to maintain a dict (a mapping from arbitrary keys to values) of all strings that you have seen before. So, let's assume you are at the point where you have the value in a variable val, and that there is a dict valdict that's initially empty. The code you would need is something like this:
import uuid
if val in valdict:  # We have seen this value before, so emit a reference to it
    print('<value href="#%s"/>' % valdict[val])
else:  # First occurrence: generate an id and remember it
    valdict[val] = str(uuid.uuid4()).upper()
    print('<value id="%s">%s</value>' % (valdict[val], val))
I wouldn't really recommend this simplistic method for forming the XML itself, but it sounds like you are already well-prepared to handle that side of things.
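Here is a sketch of the same dict-based idea applied to a parsed tree, so that only strings which actually recur receive an id, as in the question's desired output. It uses the standard library's xml.etree.ElementTree rather than lxml; the wrapper element <values> and the function name link_repeated_values are assumptions made for illustration.

```python
import uuid
import xml.etree.ElementTree as ET


def link_repeated_values(root, tag="value"):
    """Give the first instance of a recurring string an id and turn later
    instances into empty elements with an href pointing at it."""
    first_seen = {}  # text -> first element carrying that text
    ids = {}         # text -> generated id, only for texts that recur
    for elem in root.iter(tag):
        text = elem.text
        if text is None:
            continue
        if text not in first_seen:
            first_seen[text] = elem
            continue
        if text not in ids:
            # Second occurrence: generate the id and tag the first instance.
            ids[text] = str(uuid.uuid4()).upper()
            first_seen[text].set("id", ids[text])
        elem.text = None
        elem.set("href", "#" + ids[text])
    return root


root = ET.fromstring(
    "<values><value>Foo</value><value>Bar</value>"
    "<value>Baz</value><value>Foo</value></values>"
)
link_repeated_values(root)
values = list(root)
print(ET.tostring(root, encoding="unicode"))
```

Strings that occur once (Bar and Baz here) are left untouched.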
I've just started using yaml and I love it. However, the other day I came across a case that seemed really odd and I am not sure what is causing it. I have a list of file path locations and another list of file path destinations. I create a dictionary out of them and then use yaml to dump it out to read later (I work with artists and use yaml so that it is human readable as well).
sorry for the long lists:
source = ['/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr']
dest = ['/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_diff_diffuse_v0006.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskBurnt_diffuse_v0006.1031.exr']
dictionary = dict(zip(source, dest))
print yaml.dump(dictionary)
this is the output that I get:
{/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_diff_diffuse_v0006.exr,
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskBurnt_diffuse_v0006.1031.exr,
? /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr
: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr}
It comes back in fine with a yaml.load, but this is not useful for artists to be able to edit if need be.
This is the first question in the FAQ.
By default, PyYAML chooses the style of a collection depending on whether it has nested collections. If a collection has nested collections, it will be assigned the block style. Otherwise it will have the flow style.
If you want collections to be always serialized in the block style, set the parameter default_flow_style of dump() to False.
So:
>>> print yaml.dump(dictionary, default_flow_style=False)
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_diff_diffuse_v0006.exr
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskBurnt_diffuse_v0006.1031.exr
? /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr
: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr
Still not exactly beautiful, but when you have strings longer than 80 characters as keys, it's about as good as you can reasonably expect.
If you model (part of) the filesystem hierarchy in your object hierarchy, or create aliases (or dynamic aliasers) for parts of the tree, etc., the YAML will look a lot nicer. But that's something you have to actually do at the object-model level; as far as YAML is concerned, those long paths full of repeated prefixes are just strings.
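One way to do that at the object-model level is to factor the shared directories out of the mapping before dumping it. The sketch below uses only the standard library; the function name factor_paths, the key names, and the shortened paths are assumptions chosen for illustration.

```python
import posixpath


def factor_paths(source, dest):
    """Split each list into one shared directory plus short relative names."""
    source_dir = posixpath.commonpath(source)
    dest_dir = posixpath.commonpath(dest)
    return {
        "source_dir": source_dir,
        "dest_dir": dest_dir,
        "files": {
            posixpath.relpath(s, source_dir): posixpath.relpath(d, dest_dir)
            for s, d in zip(source, dest)
        },
    }


# Shortened stand-ins for the long paths in the question.
source = [
    "/data/job/model/v026_03/blackhawk_diff.exr",
    "/data/job/model/v026_03/blackhawk_maskBurnt.1031.exr",
]
dest = [
    "/data/job/texture/v0006/blackhawk_diff_diffuse_v0006.exr",
    "/data/job/texture/v0006/blackhawk_maskBurnt_diffuse_v0006.1031.exr",
]

layout = factor_paths(source, dest)
print(layout)
```

Passing a structure like layout to yaml.dump with default_flow_style=False should then give short keys and a nested block layout. The code that reads the file back has to join the directories and file names again.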
I've noticed with my source control that the content of the output files generated with ConfigParser is never in the same order. Sometimes sections will change place or options inside sections even without any modifications to the values.
Is there a way to keep things sorted in the configuration file so that I don't have to commit trivial changes every time I launch my application?
Looks like this was fixed in Python 3.1 and 2.7 with the introduction of ordered dictionaries:
The standard library now supports use
of ordered dictionaries in several
modules. The configparser module uses
them by default. This lets
configuration files be read, modified,
and then written back in their
original order.
If you want to take it a step further than Alexander Ljungberg's answer and also sort the sections and the contents of the sections you can use the following:
import collections
import sys
import ConfigParser  # Python 2 module name; it is configparser in Python 3
config = ConfigParser.ConfigParser({}, collections.OrderedDict)
config.read('testfile.ini')
# Order the content of each section alphabetically
for section in config._sections:
    config._sections[section] = collections.OrderedDict(sorted(config._sections[section].items(), key=lambda t: t[0]))
# Order all sections alphabetically
config._sections = collections.OrderedDict(sorted(config._sections.items(), key=lambda t: t[0]))
# Write ini file to standard output
config.write(sys.stdout)
This uses OrderedDict dictionaries (to keep ordering) and sorts the read ini file from outside ConfigParser by overwriting the internal _sections dictionary.
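For Python 3, a similar result can probably be had without touching the private _sections attribute, by copying sections and options into a fresh parser in sorted order. This is only a sketch: the function name sorted_config is hypothetical, and it assumes the file has no [DEFAULT] section, since items() would otherwise copy the defaults into every section.

```python
import configparser
import io


def sorted_config(parser):
    """Return a new ConfigParser with sections and options in sorted order."""
    result = configparser.ConfigParser()
    for section in sorted(parser.sections()):
        result.add_section(section)
        # raw=True copies values verbatim, without interpolating them.
        for key, value in sorted(parser.items(section, raw=True)):
            result.set(section, key, value)
    return result


original = configparser.ConfigParser()
original.read_string("[zeta]\nb = 2\na = 1\n\n[alpha]\nx = 9\n")

ordered = sorted_config(original)

buffer = io.StringIO()
ordered.write(buffer)
text = buffer.getvalue()
print(text)
```

Writing the sorted copy on every save keeps the file order stable, so source control only sees real changes.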
No. The ConfigParser library writes things out in dictionary hash order. (You can see this if you look at the source code.) There are replacements for this module that do a better job.
I will see if I can find one and add it here.
http://www.voidspace.org.uk/python/configobj.html#introduction is the one I was thinking of. It's not a drop-in replacement, but it is very easy to use.
ConfigParser is based on the ini file format, which by design is NOT supposed to be sensitive to order. If your config file format is sensitive to order, you can't use ConfigParser. It may also confuse people if you have an ini-type format that is sensitive to the order of the statements...