How can I convert to a pickle object to a xml document?
For example, I have a pickle like that:
cpyplusplus_test
Coordinate
p0
(I23
I-11
tp1
Rp2
.
I want to get something like:
<Coordinate>
<x>23</x>
<y>-11</y>
</Coordinate>
The Coordinate class has x and y attributes of course. I can supply a xml schema for conversion.
I tried gnosis.xml module. It can objectify xml documents to python object. But it cannot serialize objects to xml documents like above.
Any suggestion?
Thanks.
gnosis.xml does support pickling to XML:
import gnosis.xml.pickle
xml_str = gnosis.xml.pickle.dumps(obj)
To deserialize the XML, use loads:
o2 = gnosis.xml.pickle.loads(xml_str)
Of course, this will not directly convert existing pickles to XML — you have to first deserialize them into live object, and then dump them to XML.
Having said that, I must warn you that gnosis.xml is quite slow, somewhat fragile, and most likely unmaintained (last release was over six years ago). It is also very bloated, containing a huge number of subpackages with lots and lots of features that not only you won't need, but are untested and buggy. We tried to use for our development and, after a lot of effort wasted on trying to debug and improve it, ended up writing a simple XML pickler running at ~500 lines of code, and never looked back.
First you need to unpickle the data by pickle.load or pickle.loads. Then generate xml snippet. If you have a pickle in tmpStr variable, simply do this:
c = pickle.loads(tmpStr)
print '<Coordinate>\n<x>%d</x>\n<y>%d</y>\n</Coordinate>' % (c.x, c.y)
Writing to file is left as an exercise to the reader.
Related
This question already has an answer here:
How to use xml sax parser to read and write a large xml?
(1 answer)
Closed 3 years ago.
I have huge XML datasets (2-40GB). Some of the data is confidential, so I am trying to edit the dataset to mask all of the confidential information. I have a long list of each value that needs to be masked, so for example if I have ID 'GYT-1064' I need to find and replace every instance of it. These values can be in different fields/levels/subclasses, so in one object it might have 'Order-ID = GYT-1064' whereas another might say 'PO-Name = GYT-1064'. I have looked into iterparse but cannot figure out how to in-place edit the xml file instead of building the entire new tree in memory, because I have to loop through it multiple times to find each instance of each ID.
Ideal functionality:
For each element, if a given string is in element, replace the text and change the line in the XML file.
I have a solution that works if the dataset is small enough to load into memory, but I can't figure out how to correctly leverage iterparse. I've also looked into every answer that talks about lxml iterparse, but since I need to iterate through the entire file multiple times, I need to be able to edit it in place
Simple version that works, but has to load the whole xml into memory (and isn't in-place)
values_to_mask = ['val1', 'GMX-103', 'etc-555'] #imported list of vals to mask
with open(dataset_name, encoding='utf8') as f:
tree = ET.parse(f)
root = tree.getroot()
for old in values_to_mask:
new = mu.generateNew(old, randomnumber) #utility to generate new amt
for elem in root.iter():
try:
elem.text = elem.text.replace(old, new)
except AttributeError:
pass
tree.write(output_name, encoding='utf8')
What I attempted with iterparse:
with open(output_name, mode='rb+') as f:
context = etree.iterparse( f )
for old in values_to_mask:
new = mu.generateNew(old, randomnumber)
mu.fast_iter(context, mu.replace_if_exists, old, new, f)
def replace_if_exists(elem, old, new, xf):
try:
if(old in elem.text):
elem.text = elem.text.replace(old, new)
xf.write(elem)
except AttributeError:
pass
It runs but doesn't replace any text, and I get print(context.root) = 'Null'. Additionally, it doesn't seem like it would correctly write back to the file in place.
Basically how the XML data looks (hierarchical objects with subclasses)
It looks generally like this:
<Master_Data_Object>
<Package>
<PackageNr>1000</PackageNr>
<Quantity>900</Quantity>
<ID>FAKE_CONFIDENTIALGYO421</ID>
<Item_subclass>
<ItemType>C</ItemType>
<MasterPackageID>FAKE_CONFIDENTIALGYO421</MasterPackageID>
<Package>
<Other_Types>
Since Lack of Dataset , I would like to suggest you to
1) use readlines() in loop to read substantial amount of data at a time
2) use a regular expression for identifying confidential information (if Possible) then replace it.
Let me know if it works
You can pretty much use SAX parser for big xml files.
Here is your answer -
Editing big xml files using sax parser
I'm currently working on a simple Python 3.4.3 and Tkinter game.
I struggle with saving/reading data now, because I'm a beginner at coding.
What I do now is use .txt files to store my data, but I find this extremely counter-intuitive, as saving/reading more than one line of data requires of me to have additional code to catch any newlines.
Skipping a line would be terrible too.
I've googled it, but I either find .txt save/file options or way too complex ones for saving large-scale data.
I only need to save some strings right now and be able to access them (if possible) by key like in a dictionary key:value .
Do you know of any file format/method to help me accomplish that?
Also: If possible, should work on Win/iOS/Linux.
It sounds like using json would be best for this, which comes as part of the Python Standard library in Python-2.6+
import json
data = {'username':'John', 'health':98, 'weapon':'warhammer'}
# serialize the data to user-data.txt
with open('user-data.txt', 'w') as fobj:
json.dump(data, fobj)
# read the data back in
with open('user-data.txt', 'r') as fobj:
data = json.load(fobj)
print(data)
# outputs:
# {u'username': u'John', u'weapon': u'warhammer', u'health': 98}
A popular alternative is yaml, which is actually a superset of json and produces slightly more human readable results.
You might want to try Redis.
http://redis.io/
I'm not totally sure it'll meet all your needs, but it would probably be better than a flat file.
I was fairly confused about this for some time but I finally learned how to parse a large N-Triples RDF store (.nt) using Raptor and the Redland Python Extensions.
A common example is to do the following:
import RDF
parser=RDF.Parser(name="ntriples")
model=RDF.Model()
stream=parser.parse_into_model(model,"file:./mybigfile.nt")
for triple in model:
print triple.subject, triple.predicate, triple.object
Parse_into_model() by default loads the object into memory, so if you are parsing a big file you could consider using a HashStorage as your model and serializing it that way.
But what if you want to just read the file and say, add it to MongoDB without loading it into a Model or anything complicated like that?
import RDF
parser=RDF.NTriplesParser()
for triple in parser.parse_as_stream("file:./mybigNTfile.nt"):
print triple.subject, triple.predicate, triple.object
I am very new to Python and am not very familiar with the data structures in Python.
I am writing an automatic JSON parser in Python, the JSON message is read into a dictionary using Ultra-JSON:
jsonObjs = ujson.loads(data)
Now, if I try something like:
jsonObjs[param1][0][param2] it works fine
However, I need to get the path from an external source (I read it from the DB), we initially thought we'll just write in the DB:
myPath = [param1][0][param2]
and then try to access:
jsonObjs[myPath]
But after a couple of failures I realized I'm trying to access:
jsonObjs[[param1][0][param2]]
Is there a way to fix this without parsing myPath?
Many thanks for your help and advice
Store the keys in a format that preserves type information, e.g. JSON, and then use reduce() to perform recursive accesses on the structure.
Im using Python's built in XML parser to load a 1.5 gig XML file and it takes all day.
from xml.dom import minidom
xmldoc = minidom.parse('events.xml')
I need to know how to get inside that and measure its progress so I can show a progress bar.
any ideas?
minidom has another method called parseString() that returns a DOM tree assuming the string you pass it is valid XML, If I were to split up the file myself into chunks and pass them to parseString one at a time, could I possibly merge all the DOM trees back together at the end?
you usecase requires that you use sax parser instead of dom, dom loads everything in memory , sax instead will do line by line parsing and you write handlers for events as you need
so could be effective and you would be able to write progress indicator also
I also recommend trying expat parser sometime it is very useful
http://docs.python.org/library/pyexpat.html
for progress using sax:
as sax reads file incrementally you can wrap the file object you pass with your own and keep track how much have been read.
edit:
I also don't like idea of splitting file yourselves and joining DOM at end, that way you are better writing your own xml parser, i recommend instead using sax parser
I also wonder what your purpose of reading 1.5 gig file in DOM tree?
look like sax would be better here
Did you consider to use other means of parsing XML? Building a tree of such big XML files will always be slow and memory intensive. If you don't need the whole tree in memory, stream based parsing will be much faster. It can be a little daunting if you're used to tree based XML manipulation, but it will pay of in form of a huge speed increase (minutes instead of hours).
http://docs.python.org/library/xml.sax.html
I have something very similar for PyGTK, not PyQt, using the pulldom api. It gets called a little bit at a time using Gtk idle events (so the GUI doesn't lock up) and Python generators (to save the parsing state).
def idle_handler (fn):
fh = open (fn) # file handle
doc = xml.dom.pulldom.parse (fh)
fsize = os.stat (fn)[stat.ST_SIZE]
position = 0
for event, node in doc:
if position != fh.tell ():
position = fh.tell ()
# update status: position * 100 / fsize
if event == ....
yield True # idle handler stays until False is returned
yield False
def main:
add_idle_handler (idle_handler, filename)
Merging the tree at the end would be pretty easy. You could just create a new DOM, and basically append the individual trees to it one by one. This would give you pretty finely tuned control over the progress of the parsing too. You could even parallelize it if you wanted by spawning different processes to parse each section. You just have to make sure you split it intelligently (not splitting in the middle of a tag, etc.).