Many to Many data-structure in python

Many to Many data-structure in python - python

I was wondering how I could implement a many-to-many relationship data-structure. Or if something like this already exists in the wild.
What I would need is two groups of objects, where members from one group are relating to multiple members of the other group. And vice versa.
I also need the structure to have some sort of consistency, meaning members without any connections are dropped, or basically cannot exist.
I have seen this answer (it involves SQL-lite database), but I am not working with such huge volumes of objects, so it's not an appropriate answer for this context Many-to-many data structure in Python

Depending on how big your dataset is, you could simply build all possible sets and then assign booleans to see whether the relationship exists.
itertools.combinations
can be of help to generate all possible combinations.
Consistency can then be added by checking if any connections are True for each value.
I do not claim this is the prettiest approach, but it should work on smaller datasets.
https://docs.python.org/2/library/itertools.html#itertools.combinations

Related

What's the difference between index and internal ID in neo4j?

I'm setting up my database and sometimes I'll need to use an ID. At first, I added an ID as a property to my nodes of interest but realized I could also just use neo4j's internal id "". Then I stumbled upon the CREATE INDEX ON :label(something) and was wondering exactly what this would do? I thought an index and the would be the same thing?
This might be a stupid question, but since I'm kind of a beginner in databases, I may be missing some of these concepts.
Also, I've been reading about which kind of database to use (mySQL, MongoDB or neo4j) and decided on neo4j since my data pretty much follows a graph structure. (it will be used to build metabolic models: connections genes->proteins->reactions->compounds)
In SQL the syntax just seemed too complex as I had to go around several tables to make simple connections that neo4j accomplishes quite easily...
From what I understand MongoDb stores data independently, and, since my data is connected, it doesnt really seem to fit the data structure.
But again, since my knowledge on this subject is limited, perhaps I'm not doing the right choice?
Thanks in advance.

Graph dbs are ideal for connected data like this, it's a more natural fit for both storing and querying than relational dbs or document stores.
As far as indexes and ids, here's the index section of the docs, but the gist of it is that this has to do with how Neo4j can look up starting nodes. Neo4j only uses indexes for finding these starting nodes (though in 3.5 when we do index lookup like this, if you have ORDER BY on the indexed property, it will use the index to augment the performance of the ordering).
Here is what Neo4j will attempt to use, depending on availability, from fastest to slowest:
Lookup by internal ID - This is always quick, however we don't recommend preserving these internal ids outside the context of a query. The reason for that is that when graph elements are deleted, their ids become eligible for reuse. If you preserve the internal ids outside of Neo4j, and perform a lookup with them later, there is a chance that whatever you expected it to reference could have been deleted, and may point at nothing, or may point at some new node with completely different data.
Lookup by index - This where you would want to use CREATE INDEX ON (or add a unique constraint, if that makes sense for your model). When you use a MATCH or MERGE using the label and property (or properties) associated with the index, then this is a fast and direct lookup of the node(s) you want.
Lookup by label scan - If you perform a MATCH with a label present in the pattern, but no means to use an index (either no index present for the label/property combination, or only a label is present but no property), then a label scan will be performed, and every node of the given label will be matched to and filtered. This becomes more expensive as more nodes with those labels are added.
All nodes scan - If you do not supply any label in your MATCH pattern, then every node in your db will be scanned and filtered. This is very expensive as your db grows.
You can EXPLAIN or PROFILE a query to see its query plan, which will show you which means of lookup are used to find the starting nodes, and the rest of the operations for executing the query.
Once a starting node or nodes are found, then Neo4j uses relationship traversal and filtering to expand and find all paths matching your desired pattern.

Tree of trees? Table of trees? What kind of data structure have I created?

I am creating a python module that creates and operates on data structures to store lots of semantically tagged data and metadata from real experiments. So in an experiment you have:
subjects
treatments
replicates
Enclosing these 3 categories is the experiment, and combinations of the three categories are what I am calling "units". Now there is no inherently correct hierarchy between the 3 (table-like) but for certain analyses it is useful to think of a certain permutation of the 3 as a hierarchy,
e.g. (subjects-->(treatments-->(replicates)))
or
(replicates-->(treatments-->(subjects)))
Moreover, when collecting data, files will be copy-pasted into a folder on a desktop, so data is at least coming in as a tree. I have thought a lot about which hierarchy is "better" but I keep coming up with use cases for most of the 6 possible permutations. I want my module to be flexible in that the user can think of the experiment or collect the data using whatever hierarchy, table, hierarchy-table hybrid makes sense to them.
Also the "units" or (table entries) are containers for arbitrary amounts of data (bytes to Gigabytes, whatever ideally) of any organizational complexity. This is why I didn't think a relational database approach was really the way to go and a NoSQL type solution makes more sense. But then i have the problem of how to order the three categories if none is "correct".
So my question is what is this multifaceted data structure?
Does some sort of fluid data structure or set of algorithms exist to easily inter-convert or produce structured views?

The short answer is that HDF5 addresses these fairly common concerns and I would suggest it. http://www.hdfgroup.org/HDF5/
In python: http://docs.h5py.org/en/latest/high/group.html
http://odo.pydata.org/en/latest/hdf5.html
will help.

Effective implementation of one-to-many relationship with Python NDB

I would like to hear your opinion about the effective implementation of one-to-many relationship with Python NDB. (e.g. Person(one)-to-Tasks(many))
In my understanding, there are three ways to implement it.
Use 'parent' argument
Use 'repeated' Structured property
Use 'repeated' Key property
I choose a way based on the logic below usually, but does it make sense to you?
If you have better logic, please teach me.
Use 'parent' argument
Transactional operation is required between these entities
Bidirectional reference is required between these entities
Strongly intend 'Parent-Child' relationship
Use 'repeated' Structured property
Don't need to use 'many' entity individually (Always, used with 'one' entity)
'many' entity is only referred by 'one' entity
Number of 'repeated' is less than 100
Use 'repeated' Key property
Need to use 'many' entity individually
'many' entity can be referred by other entities
Number of 'repeated' is more than 100
No.2 increases the size of entity, but we can save the datastore operations. (We need to use projection query to reduce CPU time for the deserialization though). Therefore, I use this way as much as I can.
I really appreciate your opinion.

A key thing you are missing: How are you reading the data?
If you are displaying all the tasks for a given person on a request, 2 makes sense: you can query the person and show all his tasks.
However, if you need to query say a list of all tasks say due at a certain time, querying for repeated structured properties is terrible. You will want individual entities for your Tasks.
There's a fourth option, which is to use a KeyProperty in your Task that points to your Person. When you need a list of Tasks for a person you can issue a query.
If you need to search for individual Tasks, then you probably want to go with #4. You can use it in combination with #3 as well.
Also, the number of repeated properties has nothing to do with 100. It has everything to do with the size of your Person and Task entities, and how much will fit into 1MB. This is potentially dangerous, because if your Task entity can potentially be large, you might run out of space in your Person entity faster than you expect.

One thing that most GAE users will come to realize (sooner or later) is that the datastore does not encourage design according to the formal normalization principles that would be considered a good idea in relational databases. Instead it often seems to encourage design that is unintuitive and anathema to established norms. Although relational database design principles have their place, they just don't work here.
I think the basis for the datastore design instead falls into two questions:
How am I going to read this data and how do I read it with the minimum number of read operations?
Is storing it that way going to lead to an explosion in the number of write and indexing operations?
If you answer these two questions with as much foresight and actual tests as you can, I think you're doing pretty well. You could formalize other rules and specific cases, but these questions will work most of the time.

Do python references use memory?

I am building a flexible, lightweight, in-memory database in Python, and discovered a performance problem with the way I was looking up values and using indexes. In an effort to improve this I've tried a few options, trying to balance speed with memory usage. My current implementation uses a dict of dicts to store data by record (object reference) and field (also an object reference). So for example, if I have three records with three fields, where some of the data is missing (i.e. NULL values)::
{<Record1>: {<Field1>: 4, <Field2>: 'value', <Field3>: <Other Record>},
{<Record2>: {<Field1>: 4, <Field2>: 'value'},
{<Record3>: {<Field1>: 5}}
I considered a numpy array, but I would still need two dictionaries to map object instances to array indexes, so I can't see that it will perform be any better.
Indexes are implemented using a pair of bisected lists, essentially acting as a map from value to record instance. For example, and index on the above Field1>:
[[4, 4, 5], [<Record1>, <Record2>, <Record3>]]
I was previously using a simple dict of bins, but this didn't allow range lookups (e.g. all values > 5) (see Python hash table for fuzzy matching).
My question is this. I am concerned that I have several object references, and multiple copies of the same values in the indexes. Do all these duplicate references actually use more memory, or are references cheap in python? My alternative is to try to associate a numerical key to each object, which might improve things at least up to 256, but I don't know enough about how python handles references to know if this would really be any better.
Does anyone have any suggestions of a better way to manage this?
Reimplementing the critical parts in C is an option I want to keep as a last resort.
For anyone interested, my code is here.
Edit 1:
The question, simple put, is which of the following is more efficient in terms of memory usage, where a is an object instance and i is an integer:
[a] * 1000
Or
[i] * 1000, {a: i}
Edit 2:
Because of the large number of comments suggesting I use an existing system, here are my requirements. If anyone can suggest a system which fulfills all of these, that would be great, but so far I have not found anything which does. Otherwise, my original question still relates to memory usage of references in python.:
Must be light-weight and in-memory. Definitely not a client/server model.
Need to be able to easily alter tables, change fields, change rules, etc, on the fly.
Need to easily apply very complex validation rules. SQL doesn't meet this requirement. Although it is sometimes possible to build up very complicated statements, it is far from easy.
Need to support joins and associations between tables. Many NoSQL databases don't support joins at all, or at most only simple joins.
Need to support a method of loading and storing data to any file format. I am currently implementing this by providing a framework which makes it easy to add new formats as needed.
It does not need persistence (beyond storing data as in the previous point), and does not need to handle massive amounts of data, i.e. not more than a couple of million records. Typically, I am dealing with a few thousand.

Each reference is in effect a pointer, each pointer requires a small amount of memory.
You can use memory profiler to view memory use on a line by line basis. In this way you can see what happens when you make a reference.

Python does not specify a particular implementation for dynamic memory management, but from the semantics of the language one can assume that a reference uses memory similar to a C pointer.

FWIW, I ran some tests on a 100x100 structure, testing a sparsely populated dictionary structure, a fully populated dictionary structure, a list, and a numpy array. The latter two had a dictionary mapping object references to indexes. I timed getting every item in the structure by index (returning a sentinel for missing data in the sparse dict), and also reported the total size. My results were somewhat surprising:
Structure Time Size
============= ======== =====
full dict 0.0236s 6284
list 0.0426s 13028
sparse dict 0.1079s 1676
array 0.2262s 12608
So the fastest and second smallest was a full dict, presumable because there was no need to run a key in dict check on it.

How do you efficiently bulk index lookups?

I have these entity kinds:
Molecule
Atom
MoleculeAtom
Given a list(molecule_ids) whose lengths is in the hundreds, I need to get a dict of the form {molecule_id: list(atom_ids)}. Likewise, given a list(atom_ids) whose length is in the hunreds, I need to get a dict of the form {atom_id: list(molecule_ids)}.
Both of these bulk lookups need to happen really fast. Right now I'm doing something like:
atom_ids_by_molecule_id = {}
for molecule_id in molecule_ids:
moleculeatoms = MoleculeAtom.all().filter('molecule =', db.Key.from_path('molecule', molecule_id)).fetch(1000)
atom_ids_by_molecule_id[molecule_id] = [
MoleculeAtom.atom.get_value_for_datastore(ma).id() for ma in moleculeatoms
]
Like I said, len(molecule_ids) is in the hundreds. I need to do this kind of bulk index lookup on almost every single request, and I need it to be FAST, and right now it's too slow.
Ideas:
Will using a Molecule.atoms ListProperty do what I need? Consider that I am storing additional data on the MoleculeAtom node, and remember it's equally important for me to do the lookup in the molecule->atom and atom->molecule directions.
Caching? I tried memcaching lists of atom IDs keyed by molecule ID, but I have tons of atoms and molecules, and the cache can't fit it.
How about denormalizing the data by creating a new entity kind whose key name is a molecule ID and whose value is a list of atom IDs? The idea is, calling db.get on 500 keys is probably faster than looping through 500 fetches with filters, right?

Your third approach (denormalizing the data) is, generally speaking, the right one. In particular, db.get by keys is indeed about as fast as the datastore gets.
Of course, you'll need to denormalize the other way around too (entity with key name atom ID, value a list of molecule IDs) and will need to update everything carefully when atoms or molecules are altered, added, or deleted -- if you need that to be transactional (multiple such modifications being potentially in play at the same time) you need to arrange ancestor relationships.. but I don't see how to do it for both molecules and atoms at the same time, so maybe that could be a problem. Maybe, if modifications are rare enough (and depending on other aspects of your application), you could serialize the modifications in queued tasks.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.