I'm trying to dump some Python objects out into YAML.
Currently, regardless of YAML library (pyyaml, oyaml, or ruamel) I'm having an issue where calling .dump(MyObject) gives me correct YAML, but seems to add a lot of metadata about the Python objects that I don't want, in a form that looks like:
!!python/object:MyObject and other similar strings.
I do not need to be able to rebuild the objects from the YAML, so I am fine for this metadata to be removed completely
Other questions on SO indicate that the common solution to this is to use safe_dump instead of dump.
However, safe_dump does not seem to work for nested objects (or objects at all), as it throws this error:
yaml.representer.RepresenterError: ('cannot represent an object', MyObject)
I see that the common workaround here is to manually specify Representers for the objects that I am trying to dump. My issue here is that my Objects are generated code that I don't have control over. I will also be dumping a variety of different objects.
Bottom line: Is there a way to dump nested objects using .dump, but where the metadata isn't added?
Although the words "correct YAML" are not really accurate, and would be better phrased as
"YAML output looking like you want it, except for the tag information", this fortunately gives some
information on how you want your YAML to look, as there are an infinite number of ways to dump objects.
If you dump an object using ruamel.yaml:
import sys
import ruamel.yaml
class MyObject:
def __init__(self, a, b):
self.a = a
self.b = b
self.c = [a, b]
data = dict(x=MyObject(42, -1))
yaml = ruamel.yaml.YAML(typ='unsafe')
yaml.dump(data, sys.stdout)
this gives:
x: !!python/object:__main__.MyObject
a: 42
b: -1
c: [42, -1]
You have a tag !!python/object:__main__.MyObject (yours might differ depending on where the
class is defined, etc.) and each of the attributes of the class are dumped as keys of a mapping.
There are multiple ways on how to get rid of the tag in that dump:
Registering classes
Add a classmethod named to_yaml(), to each of your classes and
register those classes. You have to do this for each of your classes,
but doing so allows you to use the safe-dumper. An example on how to
do this can be found in the
documentation
Post-process
It is fairly easy to postprocess the output and remove the tags, which for objects always occur on the line
before the mapping, and you can delete from !!python until the end-of-line
def strip_python_tags(s):
result = []
for line in s.splitlines():
idx = line.find("!!python/")
if idx > -1:
line = line[:idx]
result.append(line)
return '\n'.join(result)
yaml.encoding = None
yaml.dump(data, sys.stdout, transform=strip_python_tags)
and that gives:
x:
a: 42
b: -1
c: [42, -1]
As achors are dumped before the tag, this "stripping from !!python
until end-of-the line", also works when you dump object that have
multiple references.
Change the dumper
You can also change the unsafe dumper routine for mappings to
recognise the tag used for objects and change the tag to the "normal"
one for dict/mapping (for which normally a tag is not output )
yaml.representer.org_represent_mapping = yaml.representer.represent_mapping
def my_represent_mapping(tag, mapping, flow_style=None):
if tag.startswith("tag:yaml.org,2002:python/object"):
tag = u'tag:yaml.org,2002:map'
return yaml.representer.org_represent_mapping(tag, mapping, flow_style=flow_style)
yaml.representer.represent_mapping = my_represent_mapping
yaml.dump(data, sys.stdout)
and that gives once more:
x:
a: 42
b: -1
c: [42, -1]
These last two methods work for all instances of all Python classes that you define without extra work.
Fast and hacky:
"\n".join([re.sub(r" ?!!python/.*$", "", l) for l in yaml.dump(obj).splitlines()]
"\n".join(...) – concat list to string agin
yaml.dump(obj).splitlines() – create list of lines of yaml
re.sub(r" ?!!python/.*$", "", l) – replace all yaml python tags with empty string
I have a very large feed file that is sent as an XML document (5GB). What would be the fastest way to parse the structure of the main item node without previously knowing its structure? Is there a means in Python to do so 'on-the-fly' without having the complete xml loaded in memory? For example, what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
Update: I've included an example XML fragment here: https://hastebin.com/uyalicihow.xml. I'm looking to extract something like a dataframe (or list or whatever other data structure you want to use) similar to the following:
Items/Item/Main/Platform Items/Item/Info/Name
iTunes Chuck Versus First Class
iTunes Chuck Versus Bo
How could this be done? I've added a bounty to encourage answers here.
Several people have misinterpreted this question, and re-reading it, it's really not at all clear. In fact there are several questions.
How to detect an XML schema
Some people have interpreted this as saying you think there might be a schema within the file, or referenced from the file. I interpreted it as meaning that you wanted to infer a schema from the content of the instance.
What would be the fastest way to parse the structure of the main item node without previously knowing its structure?
Just put it through a parser, e.g. a SAX parser. A parser doesn't need to know the structure of an XML file in order to split it up into elements and attributes. But I don't think you actually want the fastest parse possible (in fact, I don't think performance is that high on your requirements list at all). I think you want to do something useful with the information (you haven't told us what): that is, you want to process the information, rather than just parsing the XML.
Is there a python utility that can do so 'on-the-fly' without having
the complete xml loaded in memory?
Yes, according to this page which mentions 3 event-based XML parsers in the Python world: https://wiki.python.org/moin/PythonXml (I can't vouch for any of them)
what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
I'm not sure you know what the verb "to parse" actually means. Your phrase certainly suggests that you expect the file to contain a schema, which you want to extract. But I'm not at all sure you really mean that. And in any case, if it did contain a schema in the first 5Mb, you could find it just be reading the file sequentially, there would be no need to "save" the first part of the file first.
Question: way to parse the structure of the main item node without previously knowing its structure
This class TopSequenceElement parse a XML File to find all Sequence Elements.
The default is, to break at the FIRST closing </...> of the topmost Element.
Therefore, it is independend of the file size or even by truncated files.
from lxml import etree
from collections import OrderedDict
class TopSequenceElement(etree.iterparse):
"""
Read XML File
results: .seq == OrderedDict of Sequence Element
.element == topmost closed </..> Element
.xpath == XPath to top_element
"""
class Element:
"""
Classify a Element
"""
SEQUENCE = (1, 'SEQUENCE')
VALUE = (2, 'VALUE')
def __init__(self, elem, event):
if len(elem):
self._type = self.SEQUENCE
else:
self._type = self.VALUE
self._state = [event]
self.count = 0
self.parent = None
self.element = None
#property
def state(self):
return self._state
#state.setter
def state(self, event):
self._state.append(event)
#property
def is_seq(self):
return self._type == self.SEQUENCE
def __str__(self):
return "Type:{}, Count:{}, Parent:{:10} Events:{}"\
.format(self._type[1], self.count, str(self.parent), self.state)
def __init__(self, fh, break_early=True):
"""
Initialize 'iterparse' only to callback at 'start'|'end' Events
:param fh: File Handle of the XML File
:param break_early: If True, break at FIRST closing </..> of the topmost Element
If False, run until EOF
"""
super().__init__(fh, events=('start', 'end'))
self.seq = OrderedDict()
self.xpath = []
self.element = None
self.parse(break_early)
def parse(self, break_early):
"""
Parse the XML Tree, doing
classify the Element, process only SEQUENCE Elements
record, count of end </...> Events,
parent from this Element
element Tree of this Element
:param break_early: If True, break at FIRST closing </..> of the topmost Element
:return: None
"""
parent = []
try:
for event, elem in self:
tag = elem.tag
_elem = self.Element(elem, event)
if _elem.is_seq:
if event == 'start':
parent.append(tag)
if tag in self.seq:
self.seq[tag].state = event
else:
self.seq[tag] = _elem
elif event == 'end':
parent.pop()
if parent:
self.seq[tag].parent = parent[-1]
self.seq[tag].count += 1
self.seq[tag].state = event
if self.seq[tag].count == 1:
self.seq[tag].element = elem
if break_early and len(parent) == 1:
break
except etree.XMLSyntaxError:
pass
finally:
"""
Find the topmost completed '<tag>...</tag>' Element
Build .seq.xpath
"""
for key in list(self.seq):
self.xpath.append(key)
if self.seq[key].count > 0:
self.element = self.seq[key].element
break
self.xpath = '/'.join(self.xpath)
def __str__(self):
"""
String Representation of the Result
:return: .xpath and list of .seq
"""
return "Top Sequence Element:{}\n{}"\
.format( self.xpath,
'\n'.join(["{:10}:{}"
.format(key, elem) for key, elem in self.seq.items()
])
)
if __name__ == "__main__":
with open('../test/uyalicihow.xml', 'rb') as xml_file:
tse = TopSequenceElement(xml_file)
print(tse)
Output:
Top Sequence Element:Items/Item
Items :Type:SEQUENCE, Count:0, Parent:None Events:['start']
Item :Type:SEQUENCE, Count:1, Parent:Items Events:['start', 'end', 'start']
Main :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Info :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Genres :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Products :Type:SEQUENCE, Count:1, Parent:Item Events:['start', 'end']
... (omitted for brevity)
Step 2: Now, you know there is a <Main> Tag, you can do:
print(etree.tostring(tse.element.find('Main'), pretty_print=True).decode())
<Main>
<Platform>iTunes</Platform>
<PlatformID>353736518</PlatformID>
<Type>TVEpisode</Type>
<TVSeriesID>262603760</TVSeriesID>
</Main>
Step 3: Now, you know there is a <Platform> Tag, you can do:
print(etree.tostring(tse.element.find('Main/Platform'), pretty_print=True).decode())
<Platform>iTunes</Platform>
Tested with Python:3.5.3 - lxml.etree:3.7.1
For very big files, reading is always a problem. I would suggest a simple algorithmic behavior for the reading of the file itself. The key point is always the xml tags inside the files. I would suggest you read the xml tags and sort them inside a heap and then validate the content of the heap accordingly.
Reading the file should also happen in chunks:
import xml.etree.ElementTree as etree
for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
store_in_heap(event, element)
This will parse the XML file in chunks at a time and give it to you at every step of the way. start will trigger when a tag is first encountered. At this point elem will be empty except for elem.attrib that contains the properties of the tag. end will trigger when the closing tag is encountered, and everything in-between has been read.
you can also benefit from the namespaces that are in start-ns and end-ns. ElementTree has provided this call to gather all the namespaces in the file.
Refer to this link for more information about namespaces
My interpretation of your needs is that you want to be able to parse the partial file and build up the structure of the document as you go. I've taken some assumptions from the file you uploaded:
Fundamentally you want to be parsing collections of things which have similar properties - I'm inferring this from the way you presented your desired output as a table with rows containing the values.
You expect these collections of things to have the same number of values.
You need to be able to parse partial files.
You don't worry about the properties of elements, just their contents.
I'm using xml.sax as this deals with arbitrarily large files and doesn't need to read the whole file into memory. Note that the strategy I'm following now doesn't actually scale that well as I'm storing all the elements in memory to build the dataframe, but you could just as well output the paths and contents.
In the sample file there is a problem with having one row per Item since there are multiples of the Genre tag and there are also multiple Product tags. I've handled the repeated Genre tags by appending them. This relies on the Genre tags appearing consecutively. It is not at all clear how the Product relationships can be handled in a single table.
import xml.sax
from collections import defaultdict
class StructureParser(xml.sax.handler.ContentHandler):
def __init__(self):
self.text = ''
self.path = []
self.datalist = defaultdict(list)
self.previouspath = ''
def startElement(self, name, attrs):
self.path.append(name)
def endElement(self, name):
strippedtext = self.text.strip()
path = '/'.join(self.path)
if strippedtext != '':
if path == self.previouspath:
# This handles the "Genre" tags in the sample file
self.datalist[path][-1] += f',{strippedtext}'
else:
self.datalist[path].append(strippedtext)
self.path.pop()
self.text = ''
self.previouspath = path
def characters(self, content):
self.text += content
You'd use this like this:
parser = StructureParser()
try:
xml.sax.parse('uyalicihow.xml', parser)
except xml.sax.SAXParseException:
print('File probably ended too soon')
This will read the example file just fine.
Once this has read and probably printed "File probably ended to soon", you have the parsed contents in parser.datalist.
You obviously want to have just the parts which read successfully, so you can figure out the shortest list and build a DataFrame with just those paths:
import pandas as pd
smallest_items = min(len(e) for e in parser.datalist.values())
df = pd.DataFrame({key: value for key, value in parser.datalist.items() if len(value) == smallest_items})
This gives something similar to your desired output:
Items/Item/Main/Platform Items/Item/Main/PlatformID Items/Item/Main/Type
0 iTunes 353736518 TVEpisode
1 iTunes 495275084 TVEpisode
The columns for the test file which are matched here are
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL'],
dtype='object')
Based on your comments, it appears as though it is more important to you to have all the elements represented, but perhaps just showing a preview, in which case you can perhaps use only the first elements from the data. Note that in this case the Products entries won't match the Item entries.
df = pd.DataFrame({key: value[:smallest_items] for key, value in parser.datalist.items()})
Now we get all the paths:
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL',
'Items/Item/Products/Product/Offers/Offer/Price',
'Items/Item/Products/Product/Offers/Offer/Currency'],
dtype='object')
There are a number of tools around that will generate a schema from a supplied instance document. I don't know how many of them will work on a 5Gb input file, and I don't know how many of them can be invoked from Python.
Many years ago I wrote a Java-based, fully streamable tool to generate a DTD from an instance document. It hasn't been touched in years but it should still run: https://sourceforge.net/projects/saxon/files/DTDGenerator/7.0/dtdgen7-0.zip/download?use_mirror=vorboss
There are other tools listed here: Any tools to generate an XSD schema from an XML instance document?
As I see it your question is very clear. I give it a plus one up vote for clearness. You are wanting to parse text.
Write a little text parser, we can call that EditorB, that reads in chunks of the file or at least line by line. Then edit or change it as you like and re-save that chunk or line.
It can be easy in Windows from 98SE on. It should be easy in other operating systems.
The process is (1) Adjust (manually or via program), as you currently do, we can call this EditorA, that is editing your XML document, and save it; (2) stop EditorA; (3) Run your parser or editor, EditorB, on the saved XML document either manually or automatically (started via detecting that the XML document has changed via date or time or size, etc.); (4) Using EditorB, save manually or automatically the edits from step 3; (5) Have your EditorA reload the XML document and go on from there; (6) do this as often as is necessary, making edits with EditorA and automatically adjusting them outside of EditorA by using EditorB.
Edit this way before you send the file.
It is a lot of typing to explain, but XML is just a glorified text document. It can be easily parsed and edited and saved, either character by character or by larger amounts line by line or in chunks.
As a further note, this can be applied via entire directory contained documents or system wide documents as I have done in the past.
Make certain that EditorA is stopped before EditorB is allowed to start it's changing. Then stop EditorB before restarting EditorA. If you set this up as I described, then EditorB can be run continually in the background, but put in it an automatic notifier (maybe a message box with options, or a little button that is set formost on the screen when activated) that allows you to turn off (on continue with) EditorA before using EditorB. Or, as I would do it, put in a detector to keep EditorB from executing its own edits as long as EditorA is running.
B Lean
Given a document containing a paragraph
d = docx.Document()
p = d.add_paragraph()
I expected the following technique to work every time:
if len(p._element) == 0:
# p is empty
OR
if len(p._p) == 0:
# p is empty
(Side question, what's the difference there? It seems that p._p is p._element in every case I've seen in the wild.)
If I add a style to my paragraph, the check no longer works:
>>> p2 = d.add_paragraph(style="Normal")
>>> print(len(p2._element))
1
Explicitly setting text=None doesn't help either, not that I would expect it to.
So how to I check if a paragraph is empty of content (specifically text and images, although more generic is better)?
Update
I messed around a little and found that setting the style apparently adds a single pPr element:
>>> p2._element.getchildren()
[<CT_PPr '<w:pPr>' at 0x7fc9a2b64548>]
The element itself it empty:
>>> len(p2._element.getchildren()[0])
0
But more importantly, it is not a run.
So my test now looks like this:
def isempty(par):
return sum(len(run) for run in par._element.xpath('w:r')) == 0
I don't know enough about the underlying system to have any idea if this is a reasonable solution or not, and what the caveats are.
More Update
Seems like I need to be able to handle a few different cases here:
def isempty(par):
p = par._p
runs = p.xpath('./w:r[./*[not(self::w:rPr)]]')
others = p.xpath('./*[not(self::w:pPr) and not(self::w:r)] and '
'not(contains(local-name(), "bookmark"))')
return len(runs) + len(others) == 0
This skips all w:pPr elements and runs with nothing but w:rPr elements. Any other element, except bookmarks, whether in the paragraph directly or in a run, will make the result non-empty.
The <w:p> element can have any of a large number of children, as you can see from the XML Schema excerpt here: http://python-docx.readthedocs.io/en/latest/dev/analysis/schema/ct_p.html (see the CT_P and EG_PContent definitions).
In particular, it often has a w:pPr child, which is where the style setting goes.
So your test isn't very reliable against false positives (if being empty is considered positive).
I'd be inclined to use paragraph.text == '', which parses through the runs.
A run can be empty (of text), so the mere presence of a run is not proof enough. The actual text is held in a a:t (text) element, which can also be empty. So the .text approach avoids all those low-level complications for you and has the benefit of being part of the API so much, much less likely to change in a future release.
I am trying to generate specific xml strings from my module, based on the type of string requested by some code using my module. Each of these xml strings contains some data that gets generated dynamically. For example, many of these xml strings have a cookie field that is generated using another function.
I started out with a python dictionary that initializes all the xml strings with the dynamic field (ie cookie) pre-populated (ie, not exactly dynamic). And then I call the dictionary to get the relevant xml strings.
The problem with this approach is that the cookies expire every hour and therefore the strings being returned by the module after an hour have those expired values. What I would ideally like to have is some form of a generator function (not sure if that's even possible in this case) that returns the correctly formed strings as and when they are requested based on the msg_type requested (as in the example below). Each of the xml strings saved in this dict is in a unique format, so I can't exactly have some sort of a common template xml generator.
As an example, the dict that I have defined looks similar to get_msg dictionary here:
get_msg["msg_value_1"] = """<ABC cookie=""" + getCookie() + """ >
<XYZ """ + foo_name +""">
</XYZ>
</ABC>"""
get_msg["msg_value_2"] = """<ABC cookie=""" + getCookie() + """ >
<some text """ + bar_name + """>
</XYZ>
</ABC>"""
What would be a good approach to be able to generate these xml strings on the fly with getCookie() getting invoked for every fresh msg request. Any input would be appreciated.
In Python, functions are first-class objects. This means you can pass a function as an argument to another function.
def get_msg(function_to_call_to_get_injection_bit, tag_name_function,
cookie_function):
tagname = tag_name_function()
injection = function_to_call_to_get_injection_bit()
cookie = cookie_function()
return '<ABC cookie="%s">' % (cookie) +
'<%s %s></%s></ABC>' % (tagname, injection, tagname)
def get_injection():
return foo_name
def get_tag_name_1():
return "XYZ"
def get_tag_name_2():
return "SomeText"
get_msg(get_injection, get_tag_name_1, getCookie)
get_msg(get_injection, get_tag_name_2, getCookie)
You just call get_msg every time you want a message, and pass it the function(s) that generate the portion of the message not related to the cookie.
From your question it's not entirely clear what the issue is. But this isn't really a "generator function": I don't think you want to return a function (which is possible), you want to return the XML string and just customize how it's built.
As BenDundee commented above, it'd probably also be better to use an XML library to build XML, instead of constructing the strings by hand. Python has several options built in, and more available outside (like the wonderful lxml library).