I am facing the following problem:
When I am parsing an HTML document with Microdata markup in Python with rdflib, the ordering of elements is lost (which is a natural consequence of RDF not having an order for multiple elements).
E.g., the value method often returns the element that was the first value in the original markup, but not reliably.
Sometimes it would be very handy to preserve the original order. Is there a way to tell rdflib to return an ordered list for multiple values?
Or is there a simple Microdata-to-JSON (or JSON-LD) library for Python?
Thanks!
I actually found a very efficient way: instead of parsing the Microdata into RDF with rdflib, I used Ed Summers' microdata library at
https://github.com/edsu/microdata
This preserves the original order and is by far the simplest solution I found.
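A minimal sketch of that approach (pip install microdata); get_items() and item.json() are taken from the project's README, so double-check the exact API against the version you install:

    import microdata

    # parse the page and pull out the top-level Microdata items
    with open("page.html") as fh:
        items = microdata.get_items(fh)

    for item in items:
        # repeated properties come back as JSON arrays, in document order
        print(item.json())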
I have successfully generated an AST using ANTLR in Python, but I cannot figure out for the life of me how to save it for later use. The only option I have found is the tree.toStringTree() method, but its output is messy and not particularly convenient or easy to work with.
How do I save it, and what format would be best/easiest to work with, visualise, and load back in later?
EDIT: I can see in the Java documentation that there is a DotGenerator() to generate a DOT file of the tree, but I can't find a way to do anything like this in Python.
What you are looking for is a serializer/deserializer of the parse tree. Serialization was previously addressed on StackOverflow here. It isn't supported in the runtime (AFAIK) because it is not useful: one can reconstruct the tree very quickly by re-parsing the input. Even if you want to change the tree using a transformation, you can replace nodes in the tree with sub-trees whose node types don't even exist in your parser, print out the tree, then re-parse to reconstruct it with the node types of your grammar. Serializing only makes sense if parsing with semantic analysis is very slow. So, you should consider the problem carefully.
However, it's not difficult to write a crude serializer/deserializer that does not consider "off-channel" content like spacing or comments. This C# program (which you could adapt to Python) is an example that reconstructs the tree, using the grammars-v4/sexpression.g4 grammar, for a target grammar arithmetic.g4. Using toStringTree(rule-names), the tree is first serialized into a string. (Note: toStringTree() without the parser rule names is difficult to read; that is why I asked.) Then the s-expression is parsed and a bottom-up reconstruction is performed using an ANTLR visitor. Since toStringTree() does not mark the parse tree leaves with the type of the token (e.g., to distinguish a number from a symbol), the string is lexed to reconstruct the value. It also uses reflection to create the correct parse tree node type.
Outputting a Dot graph of the parse tree, which I also included in the program, is easy with a top-down recursive visitor. The recursive function outputs an edge from each node to each of its children. Since each node name has to be unique (it's a tree), I added the node's pre-order tree number to its name.
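There is no DotGenerator in the Python runtime as far as I know, but the top-down visitor is only a few lines. Here is a rough Python sketch (MyLexer, MyParser and the start rule are placeholders for whatever ANTLR generated from your grammar):

    from antlr4.tree.Trees import Trees

    def to_dot(tree, rule_names):
        # Emit a Graphviz DOT string; node names are made unique
        # with a pre-order tree number.
        lines = ["digraph parse_tree {"]
        counter = [0]

        def walk(node):
            my_id = counter[0]
            counter[0] += 1
            label = Trees.getNodeText(node, rule_names).replace('"', '\\"')
            lines.append('  n%d [label="%s"];' % (my_id, label))
            for i in range(node.getChildCount()):
                child_id = counter[0]  # the id walk() will assign to this child
                walk(node.getChild(i))
                lines.append("  n%d -> n%d;" % (my_id, child_id))

        walk(tree)
        lines.append("}")
        return "\n".join(lines)

    # usage, roughly:
    #   lexer  = MyLexer(InputStream(text))
    #   parser = MyParser(CommonTokenStream(lexer))
    #   tree   = parser.startRule()
    #   open("tree.dot", "w").write(to_dot(tree, parser.ruleNames))

Feed the .dot file to Graphviz (dot -Tpng tree.dot -o tree.png) to visualise it.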
--Ken
Assume that I have some text (for example, given as a string). Later I am going to "edit" this text, which means that I want to add something somewhere or remove something. In this way I will get another version of the text. However, I do not want to keep two strings representing the two versions, since there are a lot of "repetitions" (similarities) between subsequent versions. In other words, the differences between the strings are small, so it makes more sense to save just the differences between them. For example, the first version:
This is my first version of the text.
The second version:
This is the first version of the text, that I want to use as an example.
I would like to save these two versions as one object (it should not necessarily be XML, I use it just as an example):
This is <removed>my</removed> <added>the</added> first version of the text<added>, that I want to use as an example</added>.
Now I want to go further. I want to save all subsequent edits as one object. In other words, I am going to have more than two versions of the text, but I would like to save them as one object such that it is easy to get a given version of the text and easy to find out what are the difference between two subsequent (or any two given) versions.
So, to summarize, my question is: what is the standard way to represent changes in a text and to work with this representation using Python?
I would probably go with difflib: https://docs.python.org/2/library/difflib.html
You can use it to represent changes between versions of a string and create your own class to store consecutive diffs.
EDIT: I just realised this doesn't really make sense in your use case, as the diffs from difflib essentially store both strings, so you would be better off just storing them all. However, I believe this is the standard (library-wise) way of working with changes in text, so I won't delete this answer.
EDIT2: Although if you find a way to apply a unified_diff to a string, that may be your answer. It seems there is no way to do this with difflib yet: https://bugs.python.org/issue2057
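For what it is worth, here is a minimal sketch of the ndiff/restore round trip; it also illustrates the point from the first edit, since the delta effectively carries both versions:

    import difflib

    v1 = "This is my first version of the text.\n"
    v2 = "This is the first version of the text, that I want to use as an example.\n"

    # ndiff works line by line, so split the strings into lines first
    delta = list(difflib.ndiff(v1.splitlines(True), v2.splitlines(True)))

    # restore(delta, 1) rebuilds the first sequence, restore(delta, 2) the second
    old = "".join(difflib.restore(delta, 1))
    new = "".join(difflib.restore(delta, 2))
    assert old == v1 and new == v2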
I am trying to create an interface between structured data and NLTK. NLP libraries generally work with bags of words, hence I need to turn my structured data into bags of words.
I need to associate the offset of a word with its metadata. Therefore my best bet is to have some sort of container that holds ranges as keys (allowing nested ranges) and can retrieve all the metadata (multiple entries if the word offset is part of a nested range).
What code can I pick up that would do this efficiently (i.e., a sparse representation of the data)? Efficiency matters because my global corpus will be at least a few hundred megabytes.
Note:
I am serialising structured forum posts, which will include posts containing sections of quotes. I want to know which topic a word belonged to, and whether it is a quote or user text. There will probably be additional metadata as my work progresses. A word belonging to a quote is what I meant by nested metadata: the word is part of a quote, which belongs to a post made by a user.
I know that one can tag words in NLTK, but I haven't looked into it; if it is possible to do what I want that way, please comment. I am still looking for the container approach described above, though.
There is probably something in numpy that can solve my problem, looking at that now
edit
The input data is far too complex to rip out and post. I have found what I was looking for, though: http://packages.python.org/PyICL/. I needed to talk about intervals, not ranges :D I have used Boost extensively, but making it a dependency makes me a bit uneasy (sadly, I am having compiler errors with PyICL :( ).
The question now is: does anyone know of an interval container library or data structure that can index nested intervals in a sparse fashion, or, put differently, provides similar semantics to boost.icl?
If you don't want to use PyICL or boost.icl, then instead of relying on a specialized library you could just use sqlite3 to do the job. If you use an in-memory database it will still be a few orders of magnitude slower than boost.icl (from experience coding other data structures vs. sqlite3), but it should be more effective than a C++ std::vector-style approach built on top of Python containers.
You can store each interval as two integer columns and use a low <= offset AND offset <= high predicate in your WHERE clause. Depending on your table structure, this will return the nested/overlapping ranges.
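A minimal in-memory sketch of that idea (table and column names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE spans (low INTEGER, high INTEGER, meta TEXT)")
    conn.execute("CREATE INDEX idx_spans ON spans (low, high)")

    # nested intervals: a post span and a quote span inside it
    conn.executemany("INSERT INTO spans VALUES (?, ?, ?)",
                     [(0, 500, "post:42 user:alice"),
                      (100, 180, "quote-of:bob")])

    # all metadata attached to the word at offset 150 (returns both rows)
    offset = 150
    rows = conn.execute("SELECT meta FROM spans WHERE low <= ? AND ? <= high",
                        (offset, offset)).fetchall()
    print(rows)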
I am building a web app with MongoDB as the backend. Some of the documents need to store a collection of items in some sort of list, and the system will then need to frequently check whether a specified item is present in that list. Using Python's 'in' operator on a list takes O(n) time, n being the size of the list. Since these lists can get quite large, I want something faster than that. Python's 'set' type does this operation in constant time (and enforces uniqueness, which is good in my case), but a set is not a valid data type to put in MongoDB.
So what's the best way to do this? Is there some way to just use a regular list and exploit Mongo's indexing features? Again, I want to know: for a given document in a collection, does a list inside that document contain a particular element?
You can represent a set using a dictionary. Your elements become the keys, and all the values can be set to a constant such as 1. The in operator checks for the existence of a key.
EDIT. MongoDB stores a dict as a BSON document, where the keys must be strings (with some additional restrictions), so the above advice is of limited use.
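If your elements happen to be strings anyway, here is a rough pymongo sketch (collection and field names are made up; keys must not contain "." or start with "$"):

    from pymongo import MongoClient

    coll = MongoClient().mydb.things

    # dict-as-set: membership becomes a key lookup instead of a list scan
    coll.insert_one({"_id": 1, "items": {"apple": 1, "banana": 1}})

    # client side: constant average-time membership test on the fetched document
    doc = coll.find_one({"_id": 1})
    print("apple" in doc["items"])

    # server side: ask MongoDB whether the key exists, without fetching the list
    print(coll.find_one({"_id": 1, "items.apple": {"$exists": True}}) is not None)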
For documents with lists that need pagination, is it better to embed or to use references? I'm reading about the custom type "SONManipulator" and it appears to transform everything on retrieval, even the sub-documents.
I want to keep the list in the document sorted; should this impact anything?
I don't fully understand your question, but it is generally better to embed documents for performance reasons. That is one of the major advantages of MongoDB's approach: data locality. The pymongo library uses SON, an ordered dict implementation, which will maintain the ordering of your document keys.
If your document contains a list/array of elements and you are concerned about their ordering, fear not: the array order is maintained as well.
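If it helps, here is a rough pymongo sketch of keeping an embedded list sorted on write and paginating it on read; collection, field, and key names are made up, and the $push modifiers require a reasonably recent MongoDB:

    from pymongo import MongoClient

    posts = MongoClient().mydb.posts

    # $push with $each/$sort keeps the embedded array sorted as you insert
    posts.update_one(
        {"_id": 1},
        {"$push": {"comments": {"$each": [{"score": 7, "text": "hi"}],
                                "$sort": {"score": -1}}}},
        upsert=True)

    # $slice in the projection returns one "page" of the embedded array:
    # skip the first 20 elements, return the next 10
    page = posts.find_one({"_id": 1}, {"comments": {"$slice": [20, 10]}})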