I'm working on a Python program that allows the user to categorise files by attaching 'tags' to them. These tags can stand in hierarchical relationships to one another. For example, the 'cat' tag can be categorised as a "descendant" of the 'mammal' tag. As a consequence, once a file is tagged as 'cat', it can be accessed via the 'mammal' tag as well.
These tags and their relationships to each other and to files will obviously need to be stored in a database, and I'm most familiar with relational databases.
I very much like the Modified Pre-order Tree Traversal method for storing trees in a relational database because it removes the need for recursion and requires fewer database queries.
However, I also want to facilitate tags with multiple parents. For example, 'dog' could be a child of 'mammal' and also of 'four-legged-thing' where not all four legged things are mammals or even animals (e.g. tables), and the 'mammal' and 'four-legged-thing' tags have no "common ancestor".
Does anyone know of a method of representing such relationships in a database while maintaining some of the advantages of the MPTT method?
Thanks for any help.
What you are describing is a directed acyclic graph (DAG), not a tree, so you can't use any of the SQL "tree-storage" methods like MPTT. Here is an article that demonstrates an adjacency-list approach to this problem.
I highly recommend that you do not go down this path, however, not because of the difficulty of implementation, but because you will end up confusing and frustrating your users. In my experience users make poor use of complex ontological systems and are easily confused by them. Either use a flat "tag" namespace with no parent-child relationships, or use a tree arrangement with at most one parent per node.
But if you want to have a graph, the most straightforward way is to have a table like this:
CREATE TABLE tag_relationships (
tag_child_id INTEGER NOT NULL REFERENCES tags (id) ON UPDATE CASCADE ON DELETE CASCADE,
tag_parent_id INTEGER NOT NULL REFERENCES tags (id) ON UPDATE CASCADE ON DELETE CASCADE,
PRIMARY KEY (tag_child_id, tag_parent_id)
);
You will probably not be able to avoid recursive queries. When you run a matching search, use the tags you have as search criteria and recursively add child tags until you have the complete tag list.
You will also have to be careful about creating cycles. When you add a relationship, you need to recursively visit parents and make sure you don't end up at the same node twice.
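As a rough sketch of both points, assuming the tag_relationships table above and SQLite (the function names are illustrative, not from any library):

import sqlite3

def descendant_tags(conn, tag_ids):
    """Expand a set of tag ids to all descendants by walking
    tag_relationships one level at a time (breadth-first)."""
    seen = set(tag_ids)
    frontier = set(tag_ids)
    while frontier:
        placeholders = ",".join("?" * len(frontier))
        rows = conn.execute(
            "SELECT tag_child_id FROM tag_relationships"
            " WHERE tag_parent_id IN (%s)" % placeholders,
            tuple(frontier)).fetchall()
        frontier = {child for (child,) in rows} - seen
        seen |= frontier
    return seen

def would_create_cycle(conn, child_id, parent_id):
    """The new edge child -> parent closes a cycle iff the parent is
    already at or below the child (descendant_tags includes the
    child itself, so child == parent is caught too)."""
    return parent_id in descendant_tags(conn, {child_id})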
Something you can do to avoid recursive queries and help detect cycles is to denormalize your data a bit by making all relationships explicit for every node. What I mean is, suppose A is a child of B and C, and C is a child of D.
Instead of the minimum number of edges necessary to represent this fact:
tag_child_id tag_parent_id
A B
A C
C D
You would make all implicit relationships (ones you would have had to find via recursion) explicit:
A B
A C
A D
C D
Notice that I added (A, D).
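Maintaining that closure on insert might look like the following sketch (again SQLite, names illustrative). Note the payoff: ancestor and descendant lookups, including the cycle check, each become a single non-recursive query:

def add_relationship(conn, child_id, parent_id):
    """Insert child -> parent plus every implied pair, keeping the
    table a full transitive closure."""
    # With the closure maintained, every ancestor/descendant is a direct row.
    ancestors = {parent_id} | {p for (p,) in conn.execute(
        "SELECT tag_parent_id FROM tag_relationships WHERE tag_child_id = ?",
        (parent_id,))}
    descendants = {child_id} | {c for (c,) in conn.execute(
        "SELECT tag_child_id FROM tag_relationships WHERE tag_parent_id = ?",
        (child_id,))}
    # Cycle check is now a single membership test.
    if parent_id in descendants:
        raise ValueError("relationship would create a cycle")
    # Every descendant of the child gains every ancestor of the parent.
    conn.executemany(
        "INSERT OR IGNORE INTO tag_relationships VALUES (?, ?)",
        [(d, a) for d in descendants for a in ancestors])
    conn.commit()

One caveat with this scheme: once direct and implied edges are mixed in one table, deletions get tricky. Real closure-table designs usually add a column distinguishing direct edges (or a path count) for exactly that reason.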
I'm setting up my database and sometimes I'll need to use an ID. At first, I added an ID as a property to my nodes of interest, but then realized I could also just use Neo4j's internal id. Then I stumbled upon CREATE INDEX ON :label(something) and was wondering what exactly this does? I thought an index and the internal id would be the same thing?
This might be a stupid question, but since I'm kind of a beginner in databases, I may be missing some of these concepts.
Also, I've been reading about which kind of database to use (MySQL, MongoDB or Neo4j) and decided on Neo4j, since my data pretty much follows a graph structure (it will be used to build metabolic models: connections genes -> proteins -> reactions -> compounds).
In SQL the syntax just seemed too complex, as I had to go through several tables to make simple connections that Neo4j accomplishes quite easily...
From what I understand, MongoDB stores data independently, and, since my data is connected, it doesn't really seem to fit the data structure.
But again, since my knowledge of this subject is limited, perhaps I'm not making the right choice?
Thanks in advance.
Graph dbs are ideal for connected data like this; they're a more natural fit for both storing and querying than relational dbs or document stores.
As far as indexes and ids, here's the index section of the docs, but the gist of it is that this has to do with how Neo4j can look up starting nodes. Neo4j only uses indexes for finding these starting nodes (though in 3.5 when we do index lookup like this, if you have ORDER BY on the indexed property, it will use the index to augment the performance of the ordering).
Here is what Neo4j will attempt to use, depending on availability, from fastest to slowest:
Lookup by internal ID - This is always quick, however we don't recommend preserving these internal ids outside the context of a query. The reason for that is that when graph elements are deleted, their ids become eligible for reuse. If you preserve the internal ids outside of Neo4j, and perform a lookup with them later, there is a chance that whatever you expected it to reference could have been deleted, and may point at nothing, or may point at some new node with completely different data.
Lookup by index - This is where you would want to use CREATE INDEX ON (or add a unique constraint, if that makes sense for your model). When you use a MATCH or MERGE using the label and property (or properties) associated with the index, then this is a fast and direct lookup of the node(s) you want.
Lookup by label scan - If you perform a MATCH with a label present in the pattern, but no means to use an index (either no index present for the label/property combination, or only a label is present but no property), then a label scan will be performed, and every node of the given label will be matched to and filtered. This becomes more expensive as more nodes with those labels are added.
All nodes scan - If you do not supply any label in your MATCH pattern, then every node in your db will be scanned and filtered. This is very expensive as your db grows.
You can EXPLAIN or PROFILE a query to see its query plan, which will show you which means of lookup are used to find the starting nodes, and the rest of the operations for executing the query.
Once a starting node or nodes are found, then Neo4j uses relationship traversal and filtering to expand and find all paths matching your desired pattern.
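Since your data is Python-driven, here is a hedged sketch using the official Neo4j Python driver. The Gene/Protein labels, property names, and connection details are invented for your metabolic-model example, and the CREATE INDEX syntax is the 3.x form from your question:

from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # One-time: index the property you use to find starting nodes.
    session.run("CREATE INDEX ON :Gene(name)")

    # Indexed lookup: MATCH on the indexed label + property finds the
    # starting node directly rather than scanning every :Gene node.
    record = session.run(
        "MATCH (g:Gene {name: $name})-[:ENCODES]->(p:Protein) RETURN p",
        name="ADH1").single()

    # PROFILE shows which starting-point strategy the planner actually used.
    session.run("PROFILE MATCH (g:Gene {name: $name}) RETURN g", name="ADH1")

driver.close()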
I was wondering how I could implement a many-to-many relationship data-structure. Or if something like this already exists in the wild.
What I would need is two groups of objects, where members from one group are relating to multiple members of the other group. And vice versa.
I also need the structure to have some sort of consistency, meaning members without any connections are dropped, or basically cannot exist.
I have seen this answer (it involves an SQLite database), but I am not working with such huge volumes of objects, so it's not an appropriate answer for this context: Many-to-many data structure in Python
Depending on how big your dataset is, you could simply build all possible sets and then assign booleans to see whether the relationship exists.
itertools.combinations can be of help to generate all possible combinations.
Consistency can then be added by checking if any connections are True for each value.
I do not claim this is the prettiest approach, but it should work on smaller datasets.
https://docs.python.org/2/library/itertools.html#itertools.combinations
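A minimal sketch of that idea, with made-up group members. (Note that for pairs drawn from two distinct groups, itertools.product is the analogue of itertools.combinations, which pairs members within a single group.)

import itertools

group_a = ["a1", "a2", "a3"]
group_b = ["b1", "b2"]

# Build every possible pair up front and mark whether the relation holds.
relation = {pair: False for pair in itertools.product(group_a, group_b)}
relation[("a1", "b2")] = True
relation[("a2", "b2")] = True

def drop_unconnected(group, position):
    """Consistency pass: keep only members with at least one True link."""
    return [m for m in group
            if any(relation[pair] for pair in relation if pair[position] == m)]

group_a = drop_unconnected(group_a, 0)  # ['a1', 'a2']
group_b = drop_unconnected(group_b, 1)  # ['b2']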
I am creating a python module that creates and operates on data structures to store lots of semantically tagged data and metadata from real experiments. So in an experiment you have:
subjects
treatments
replicates
Enclosing these 3 categories is the experiment, and combinations of the three categories are what I am calling "units". Now there is no inherently correct hierarchy among the three (they are table-like), but for certain analyses it is useful to think of a particular permutation of the three as a hierarchy,
e.g. (subjects-->(treatments-->(replicates)))
or
(replicates-->(treatments-->(subjects)))
Moreover, when collecting data, files will be copy-pasted into a folder on a desktop, so data is at least coming in as a tree. I have thought a lot about which hierarchy is "better", but I keep coming up with use cases for most of the 6 possible permutations. I want my module to be flexible, so that the user can think about the experiment or collect the data using whatever hierarchy, table, or hierarchy-table hybrid makes sense to them.
Also the "units" or (table entries) are containers for arbitrary amounts of data (bytes to Gigabytes, whatever ideally) of any organizational complexity. This is why I didn't think a relational database approach was really the way to go and a NoSQL type solution makes more sense. But then i have the problem of how to order the three categories if none is "correct".
So my question is what is this multifaceted data structure?
Does some sort of fluid data structure or set of algorithms exist to easily inter-convert or produce structured views?
The short answer is that HDF5 addresses these fairly common concerns and I would suggest it. http://www.hdfgroup.org/HDF5/
In Python, http://docs.h5py.org/en/latest/high/group.html and http://odo.pydata.org/en/latest/hdf5.html will help.
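A minimal h5py sketch of one permutation, with invented group and dataset names. Since each "unit" carries its coordinates as attributes, the other permutations can be produced as views by walking the file and regrouping:

import h5py
import numpy as np

# One permutation of the hierarchy: subjects -> treatments -> replicates.
with h5py.File("experiment.h5", "w") as f:
    unit = f.create_group("subjects/S01/treatments/drugA/replicates/r1")
    # A unit can hold arbitrary datasets plus metadata attributes.
    unit.create_dataset("measurements", data=np.random.rand(1000))
    unit.attrs["subject"] = "S01"
    unit.attrs["treatment"] = "drugA"
    unit.attrs["replicate"] = "r1"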
I would like to hear your opinion about the effective implementation of a one-to-many relationship with Python NDB (e.g. Person (one) to Tasks (many)).
In my understanding, there are three ways to implement it.
Use 'parent' argument
Use 'repeated' Structured property
Use 'repeated' Key property
I usually choose an approach based on the logic below (sketched in code after the lists), but does it make sense to you?
If you have better logic, please teach me.
Use 'parent' argument
Transactional operation is required between these entities
Bidirectional reference is required between these entities
Strongly intend 'Parent-Child' relationship
Use 'repeated' Structured property
Don't need to use 'many' entity individually (Always, used with 'one' entity)
'many' entity is only referred by 'one' entity
Number of 'repeated' is less than 100
Use 'repeated' Key property
Need to use 'many' entity individually
'many' entity can be referred by other entities
Number of 'repeated' is more than 100
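In code, the three options look roughly like this (the Person/Task models are sketches, not a working schema):

from google.appengine.ext import ndb

class Task(ndb.Model):
    title = ndb.StringProperty()

# 1. 'parent' argument: each Task lives in its Person's entity group.
class Person(ndb.Model):
    name = ndb.StringProperty()

person_key = Person(name="Alice").put()
Task(parent=person_key, title="write report").put()
tasks = Task.query(ancestor=person_key).fetch()

# 2. repeated StructuredProperty: Tasks are embedded inside the Person entity.
class PersonEmbedded(ndb.Model):
    tasks = ndb.StructuredProperty(Task, repeated=True)

# 3. repeated KeyProperty: the Person stores references to standalone Tasks.
class PersonWithKeys(ndb.Model):
    task_keys = ndb.KeyProperty(kind=Task, repeated=True)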
No. 2 increases the size of the entity, but we can save datastore operations (we need to use a projection query to reduce CPU time for the deserialization, though). Therefore, I use that approach as much as I can.
I really appreciate your opinion.
A key thing you are missing: How are you reading the data?
If you are displaying all the tasks for a given person on a request, 2 makes sense: you can query the person and show all his tasks.
However, if you need to query for, say, a list of all tasks due at a certain time, querying repeated structured properties is terrible. You will want individual entities for your Tasks.
There's a fourth option, which is to use a KeyProperty in your Task that points to your Person. When you need a list of Tasks for a person you can issue a query.
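A hedged sketch of that fourth option (model and field names invented):

from google.appengine.ext import ndb
import datetime

class Person(ndb.Model):
    name = ndb.StringProperty()

class Task(ndb.Model):
    # The child points at its owner, instead of the owner listing children.
    person = ndb.KeyProperty(kind=Person)
    due = ndb.DateTimeProperty()

person_key = Person(name="Bob").put()
Task(person=person_key, due=datetime.datetime(2020, 1, 1)).put()

# All tasks for one person:
tasks = Task.query(Task.person == person_key).fetch()

# Because Tasks are standalone entities, cross-person queries work too:
due_soon = Task.query(Task.due <= datetime.datetime(2020, 6, 1)).fetch()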
If you need to search for individual Tasks, then you probably want to go with #4. You can use it in combination with #3 as well.
Also, the number of repeated properties has nothing to do with 100. It has everything to do with the size of your Person and Task entities, and how much will fit into 1MB. This is potentially dangerous, because if your Task entity can potentially be large, you might run out of space in your Person entity faster than you expect.
One thing that most GAE users will come to realize (sooner or later) is that the datastore does not encourage design according to the formal normalization principles that would be considered a good idea in relational databases. Instead it often seems to encourage design that is unintuitive and anathema to established norms. Although relational database design principles have their place, they just don't work here.
I think the basis for the datastore design instead falls into two questions:
How am I going to read this data and how do I read it with the minimum number of read operations?
Is storing it that way going to lead to an explosion in the number of write and indexing operations?
If you answer these two questions with as much foresight and actual tests as you can, I think you're doing pretty well. You could formalize other rules and specific cases, but these questions will work most of the time.
I have these entity kinds:
Molecule
Atom
MoleculeAtom
Given a list(molecule_ids) whose length is in the hundreds, I need to get a dict of the form {molecule_id: list(atom_ids)}. Likewise, given a list(atom_ids) whose length is in the hundreds, I need to get a dict of the form {atom_id: list(molecule_ids)}.
Both of these bulk lookups need to happen really fast. Right now I'm doing something like:
atom_ids_by_molecule_id = {}
for molecule_id in molecule_ids:
    # One filtered query per molecule -- this loop is the slow part.
    moleculeatoms = MoleculeAtom.all().filter(
        'molecule =', db.Key.from_path('Molecule', molecule_id)).fetch(1000)
    atom_ids_by_molecule_id[molecule_id] = [
        MoleculeAtom.atom.get_value_for_datastore(ma).id()
        for ma in moleculeatoms
    ]
Like I said, len(molecule_ids) is in the hundreds. I need to do this kind of bulk index lookup on almost every single request, and I need it to be FAST, and right now it's too slow.
Ideas:
Will using a Molecule.atoms ListProperty do what I need? Consider that I am storing additional data on the MoleculeAtom node, and remember it's equally important for me to do the lookup in the molecule->atom and atom->molecule directions.
Caching? I tried memcaching lists of atom IDs keyed by molecule ID, but I have tons of atoms and molecules, and the cache can't fit it.
How about denormalizing the data by creating a new entity kind whose key name is a molecule ID and whose value is a list of atom IDs? The idea is, calling db.get on 500 keys is probably faster than looping through 500 fetches with filters, right?
Your third approach (denormalizing the data) is, generally speaking, the right one. In particular, db.get by keys is indeed about as fast as the datastore gets.
Of course, you'll need to denormalize the other way around too (an entity with key name atom ID, value a list of molecule IDs) and will need to update everything carefully when atoms or molecules are altered, added, or deleted. If you need that to be transactional (multiple such modifications potentially in play at the same time), you need to arrange ancestor relationships, but I don't see how to do that for both molecules and atoms at the same time, so that could be a problem. Maybe, if modifications are rare enough (and depending on other aspects of your application), you could serialize the modifications in queued tasks.
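A hedged sketch of that denormalized pair, using the old db API to match your code (the entity and property names are invented):

from google.appengine.ext import db

class MoleculeAtoms(db.Model):
    # key_name is the molecule ID as a string; the value is the atom ID list.
    atom_ids = db.ListProperty(int)

class AtomMolecules(db.Model):
    # key_name is the atom ID; the reverse direction.
    molecule_ids = db.ListProperty(int)

def atoms_by_molecule(molecule_ids):
    """One batch db.get instead of one filtered query per molecule."""
    keys = [db.Key.from_path('MoleculeAtoms', str(mid)) for mid in molecule_ids]
    entities = db.get(keys)  # a single round trip for all keys
    return {mid: (e.atom_ids if e else [])
            for mid, e in zip(molecule_ids, entities)}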