I would like to hear your opinion about the effective implementation of a one-to-many relationship with Python NDB (e.g. Person (one) to Tasks (many)).
In my understanding, there are three ways to implement it (each is sketched in code below the list):
Use 'parent' argument
Use 'repeated' Structured property
Use 'repeated' Key property
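For reference, a minimal sketch of the three variants (model and field names are illustrative):

from google.appengine.ext import ndb

class Task(ndb.Model):
    title = ndb.StringProperty()

# 1. 'parent' argument: each Task lives in its Person's entity group
class Person(ndb.Model):
    name = ndb.StringProperty()

person_key = Person(name='Alice').put()
Task(parent=person_key, title='file report').put()

# 2. repeated StructuredProperty: Tasks are serialized inside the Person entity
class PersonWithTasks(ndb.Model):
    name = ndb.StringProperty()
    tasks = ndb.StructuredProperty(Task, repeated=True)

# 3. repeated KeyProperty: the Person stores keys of independent Task entities
class PersonWithTaskKeys(ndb.Model):
    name = ndb.StringProperty()
    task_keys = ndb.KeyProperty(kind=Task, repeated=True)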
I usually choose an approach based on the logic below, but does it make sense to you?
If you have better logic, please share it.
Use 'parent' argument
Transactional operation is required between these entities
Bidirectional reference is required between these entities
Strongly intend 'Parent-Child' relationship
Use 'repeated' Structured property
The 'many' entities don't need to be used individually (they are always used together with the 'one' entity)
The 'many' entities are referred to only by the 'one' entity
The number of repeated items is less than 100
Use 'repeated' Key property
The 'many' entities need to be used individually
The 'many' entities can be referred to by other entities
The number of repeated items is more than 100
Option 2 increases the size of the entity, but it saves datastore operations. (We need to use a projection query to reduce the CPU time spent on deserialization, though.) Therefore, I use this approach whenever I can.
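For example, a projection query that pulls only one sub-property of the repeated structured property (names follow the sketch above) might look like:

# deserialize only tasks.title instead of whole PersonWithTasks entities
titles = PersonWithTasks.query().fetch(projection=['tasks.title'])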
I really appreciate your opinion.
A key thing you are missing: How are you reading the data?
If you are displaying all the tasks for a given person on a request, 2 makes sense: you can query the person and show all his tasks.
However, if you need to query for, say, a list of all tasks due at a certain time, querying repeated structured properties is terrible. You will want individual entities for your Tasks.
There's a fourth option, which is to use a KeyProperty in your Task that points to your Person. When you need a list of Tasks for a person you can issue a query.
If you need to search for individual Tasks, then you probably want to go with #4. You can use it in combination with #3 as well.
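A minimal sketch of that fourth option (field names are illustrative):

from google.appengine.ext import ndb

# 4. KeyProperty on the child side, pointing back at the Person
class Task(ndb.Model):
    person_key = ndb.KeyProperty(kind='Person')
    title = ndb.StringProperty()
    due = ndb.DateTimeProperty()

person_key = ndb.Key('Person', 'alice')  # some existing Person key
# all tasks for one person:
tasks = Task.query(Task.person_key == person_key).fetch()
# all tasks due before some deadline, across all people:
from datetime import datetime
due_tasks = Task.query(Task.due < datetime(2020, 1, 1)).fetch()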
Also, the number of repeated properties has nothing to do with 100. It has everything to do with the size of your Person and Task entities, and how much will fit into 1MB. This is potentially dangerous, because if your Task entity can potentially be large, you might run out of space in your Person entity faster than you expect.
One thing that most GAE users will come to realize (sooner or later) is that the datastore does not encourage design according to the formal normalization principles that would be considered a good idea in relational databases. Instead it often seems to encourage design that is unintuitive and anathema to established norms. Although relational database design principles have their place, they just don't work here.
I think the basis for the datastore design instead falls into two questions:
How am I going to read this data and how do I read it with the minimum number of read operations?
Is storing it that way going to lead to an explosion in the number of write and indexing operations?
If you answer these two questions with as much foresight and actual tests as you can, I think you're doing pretty well. You could formalize other rules and specific cases, but these questions will work most of the time.
I'm setting up my database and sometimes I'll need to use an ID. At first, I added an ID as a property to my nodes of interest, but realized I could also just use neo4j's internal id. Then I stumbled upon CREATE INDEX ON :label(something) and was wondering exactly what this would do? I thought an index and the internal id would be the same thing?
This might be a stupid question, but since I'm kind of a beginner in databases, I may be missing some of these concepts.
Also, I've been reading about which kind of database to use (MySQL, MongoDB or Neo4j) and decided on Neo4j since my data pretty much follows a graph structure (it will be used to build metabolic models: connections genes -> proteins -> reactions -> compounds).
In SQL the syntax just seemed too complex, as I had to go through several tables to make simple connections that Neo4j accomplishes quite easily...
From what I understand, MongoDB stores data independently, and since my data is connected, it doesn't really seem to fit the data structure.
But again, since my knowledge on this subject is limited, perhaps I'm not making the right choice?
Thanks in advance.
Graph dbs are ideal for connected data like this; they're a more natural fit for both storing and querying than relational dbs or document stores.
As far as indexes and ids, here's the index section of the docs, but the gist of it is that this has to do with how Neo4j can look up starting nodes. Neo4j only uses indexes for finding these starting nodes (though in 3.5 when we do index lookup like this, if you have ORDER BY on the indexed property, it will use the index to augment the performance of the ordering).
Here is what Neo4j will attempt to use, depending on availability, from fastest to slowest:
Lookup by internal ID - This is always quick, however we don't recommend preserving these internal ids outside the context of a query. The reason for that is that when graph elements are deleted, their ids become eligible for reuse. If you preserve the internal ids outside of Neo4j, and perform a lookup with them later, there is a chance that whatever you expected it to reference could have been deleted, and may point at nothing, or may point at some new node with completely different data.
Lookup by index - This is where you would want to use CREATE INDEX ON (or add a unique constraint, if that makes sense for your model). When you use a MATCH or MERGE with the label and property (or properties) associated with the index, this is a fast and direct lookup of the node(s) you want.
Lookup by label scan - If you perform a MATCH with a label present in the pattern, but no means to use an index (either no index present for the label/property combination, or only a label is present but no property), then a label scan will be performed, and every node of the given label will be matched to and filtered. This becomes more expensive as more nodes with those labels are added.
All nodes scan - If you do not supply any label in your MATCH pattern, then every node in your db will be scanned and filtered. This is very expensive as your db grows.
You can EXPLAIN or PROFILE a query to see its query plan, which will show you which means of lookup are used to find the starting nodes, and the rest of the operations for executing the query.
Once a starting node or nodes are found, then Neo4j uses relationship traversal and filtering to expand and find all paths matching your desired pattern.
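To make the lookup tiers concrete, here's a small sketch using the official Neo4j Python driver (connection details and labels are illustrative; the CREATE INDEX syntax shown is the 3.x form mentioned in the question):

from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))
with driver.session() as session:
    # an index on :Gene(symbol) turns MATCHes on that property into direct lookups
    session.run('CREATE INDEX ON :Gene(symbol)')
    # lookup by index (fast): label + indexed property in the pattern
    session.run('MATCH (g:Gene {symbol: $s}) RETURN g', s='TP53')
    # label scan (slower): label present, but no indexed property
    session.run('MATCH (g:Gene) RETURN count(g)')
    # prefix with PROFILE to see which lookup the planner actually chose
    session.run('PROFILE MATCH (g:Gene {symbol: $s}) RETURN g', s='TP53')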
I was wondering how I could implement a many-to-many relationship data-structure. Or if something like this already exists in the wild.
What I would need is two groups of objects, where members of one group relate to multiple members of the other group, and vice versa.
I also need the structure to have some sort of consistency, meaning members without any connections are dropped, or basically cannot exist.
I have seen this answer (it involves an SQLite database), but I am not working with such huge volumes of objects, so it's not an appropriate answer for this context: Many-to-many data structure in Python
Depending on how big your dataset is, you could simply build all possible pairs and then assign booleans to mark whether the relationship exists. itertools.combinations can help generate all the possible combinations.
Consistency can then be added by checking if any connections are True for each value.
I do not claim this is the prettiest approach, but it should work on smaller datasets.
https://docs.python.org/2/library/itertools.html#itertools.combinations
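A minimal sketch of this idea (note: itertools.combinations pairs members within a single pool; for pairs across two distinct groups, itertools.product is the matching generator):

from itertools import product

people = ['alice', 'bob', 'carol']
projects = ['apollo', 'borealis']

# one boolean per possible pair; True means the relationship exists
links = {pair: False for pair in product(people, projects)}
links[('alice', 'apollo')] = True
links[('bob', 'borealis')] = True

# consistency: drop members that have no True connection at all
people = [p for p in people if any(links[(p, q)] for q in projects)]
projects = [q for q in projects if any(links[(p, q)] for p in people)]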
I'm trying to figure out the most efficient way to store time-value pairs in PyTables. I'm using PyTables since I'm dealing with huge amounts of data. I will need to perform calculations on the data (average, interpolate, etc.). I don't know the number of rows ahead of time.
I know that an EArray can be appended to, much like a Table. Is there a reason to choose one over the other?
Given my simple data structure (homogeneous time-value pairs) I figured an EArray would be faster/more efficient, but the following quote from the PyTables creator himself threw me off:
"...PyTables is specially tuned for, well, tables.
And these entities wear special I/O buffers and query engines that are
fined tuned for maximum speed. *Array objects do not wear the same
machinery."quote location
If the columns have some particular meaning or name, then you should definitely use a Table.
The efficiency largely depends on what kinds of operations you are doing on the data. Most of the time there won't be much of a difference. EArray might be faster for row-access, Tables are probably slightly better at column access, and they should be very similar for whole Table/EArray access.
Of course, the moment you want to do something more than simply access element and instead want to query or transform the data, you should use a Table. Tables are really built up around this idea of querying, via where() methods, and indexing, which makes such operations very fast. EArrays lack this infrastructure and are therefore slower.
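A small sketch contrasting the two (file and column names are illustrative):

import tables as tb

class TimeValue(tb.IsDescription):
    time = tb.Float64Col()
    value = tb.Float64Col()

with tb.open_file('data.h5', 'w') as f:
    # Table: appendable, named columns, fast in-kernel queries via where()
    table = f.create_table('/', 'series', TimeValue)
    table.append([(0.0, 1.5), (1.0, 2.5)])
    late = [row['value'] for row in table.where('time >= 1.0')]

    # EArray: also appendable, but no column names and no query engine
    arr = f.create_earray('/', 'pairs', tb.Float64Atom(), (0, 2))
    arr.append([[0.0, 1.5], [1.0, 2.5]])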
Which of these two strategies would be better for calculating upvotes/downvotes for a post:
These are the model fields:
ups = models.IntegerField()
downs = models.IntegerField()
total = models.IntegerField()

def save(self, *args, **kwargs):
    # grab the total value on every save
    self.total = self.ups - self.downs
    super(YourModel, self).save(*args, **kwargs)
Versus:
ups = models.IntegerField()
downs = models.IntegerField()

def total(self):
    # the total is just computed when needed, not saved as a column
    return self.ups - self.downs
Is there really any difference? Speed? Style?
Thanks
I would do the latter. Generally, I wouldn't store any data that can be derived from other data in the database unless the calculation is time-consuming. In this case, it is a trivial calculation. The reason being that if you store derived data, you introduce the possibility for consistency errors.
Note that I would do the same with class instances. No need to store the total if you can make it a property. Fewer variables means less room for error.
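A minimal sketch of that property version (field types are assumed):

from django.db import models

class Post(models.Model):
    ups = models.IntegerField(default=0)
    downs = models.IntegerField(default=0)

    @property
    def total(self):
        # derived on access, so it can never drift out of sync
        return self.ups - self.downs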
I totally agree with @Caludiu. I would go with the second approach, but as always there are pros and cons:
The first approach seems harmless, but it can give you some headaches in the future. Think about your application's evolution. What if you want to make some more calculations derived from values in your models? If you want to be consistent, you will have to save those too in your database, and then you will be saving a lot of "duplicated" information. The tables derived from your models won't be normalized, and not only can they grow unnecessarily, but the possibility of consistency errors increases.
On the other hand, if you take the second approach, you won't have any problems with database design, but you could run into some tough Django queries, because you need to do a lot of computation to retrieve the information you want. These kinds of calculations are ridiculously easy as an object method (or message, as you prefer), but when you want to do a query like this Django-style you will see how some things get complicated.
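That said, the derived value can still be computed on the database side with F expressions, e.g. (using the Post model sketched above):

from django.db.models import F

# order posts by the derived score without storing it as a column
posts = Post.objects.annotate(score=F('ups') - F('downs')).order_by('-score')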
Again, in my opinion, you should take the second approach. But it's up to you to make the decision you think best fits your needs...
Hope it helps!
I'm developing an app that handles sets of financial series data (input as CSV or open document); one set could be, say, tens by thousands of double-precision numbers (simplifying, but that's what matters).
I plan to do operations on that data (e.g. sum, difference, averages, etc.), as well as generating, say, another column based on computations on the input. This will be between columns (row-level operations) on one set, and also between columns on many (potentially all) sets at the row level. I plan to write it in Python, and it will eventually need an intranet-facing interface to display the results/graphs etc.; for now, CSV output based on some input parameters will suffice.
What is the best way to store and manipulate the data? So far I see my choices as being either (1) write CSV files to disk and trawl through them to do the math, or (2) put them into a database and rely on the database to handle the math. My main concern is speed/performance as the number of datasets grows, since there will be inter-dataset row-level math that needs to be done.
-Has anyone had experience going down either path and what are the pitfalls/gotchas that I should be aware of?
-What are the reasons why one should be chosen over another?
-Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?
-Is there any project or framework out there to help with this type of task?
-Edit-
More info:
The rows will all be read in order, BUT I may need to do some resampling/interpolation to match the differing input lengths as well as differing timestamps for each row. Since each dataset will always have a differing length that is not fixed, I'll have some scratch table/memory somewhere to hold the interpolated/resampled versions. I'm not sure if it makes more sense to try to store this (and try to upsample/interpolate to a common higher length) or just regenerate it each time it's needed.
"I plan to do operations on that data (eg. sum, difference, averages etc.) as well including generation of say another column based on computations on the input."
This is the standard use case for a data warehouse star-schema design. Buy Kimball's The Data Warehouse Toolkit. Read (and understand) the star schema before doing anything else.
"What is the best way to store the data and manipulate?"
A Star Schema.
You can implement this as flat files (CSV is fine) or RDBMS. If you use flat files, you write simple loops to do the math. If you use an RDBMS you write simple SQL and simple loops.
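As a sketch of the flat-file variant (file and column names are illustrative), a fact file joined to a dimension file with plain loops:

import csv
from collections import defaultdict

# dimension table: instrument_id -> descriptive attributes
with open('instrument_dim.csv') as f:
    instruments = {row['instrument_id']: row for row in csv.DictReader(f)}

# fact table: one row per (date, instrument_id, value)
totals = defaultdict(float)
with open('price_fact.csv') as f:
    for fact in csv.DictReader(f):
        sector = instruments[fact['instrument_id']]['sector']
        totals[sector] += float(fact['value'])  # SUM(value) GROUP BY sector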
"My main concern is speed/performance as the number of datasets grows"
Nothing is as fast as a flat file. Period. RDBMS is slower.
The RDBMS value proposition stems from SQL being a relatively simple way to specify SELECT SUM(), COUNT() FROM fact JOIN dimension WHERE filter GROUP BY dimension attribute. Python isn't as terse as SQL, but it's just as fast and just as flexible. Python competes against SQL.
"pitfalls/gotchas that I should be aware of?"
DB design. If you don't get the star schema and how to separate facts from dimensions, all approaches are doomed. Once you separate facts from dimensions, all approaches are approximately equal.
"What are the reasons why one should be chosen over another?"
RDBMS slow and flexible. Flat files fast and (sometimes) less flexible. Python levels the playing field.
"Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?"
Star Schema: central fact table surrounded by dimension tables. Nothing beats it.
"Is there any project or framework out there to help with this type of task?"
Not really.
For speed optimization, I would suggest two other avenues of investigation beyond changing your underlying storage mechanism:
1) Use an intermediate data structure.
If maximizing speed is more important than minimizing memory usage, you may get good results out of using a different data structure as the basis of your calculations, rather than focusing on the underlying storage mechanism. This is a strategy that, in practice, has reduced runtime in projects I've worked on dramatically, regardless of whether the data was stored in a database or text (in my case, XML).
While sums and averages require runtime of only O(n), more complex calculations could easily push that into O(n^2) without applying this strategy. O(n^2) would be a performance hit likely to have far more perceived speed impact than whether you're reading from CSV or a database. An example case would be if your data rows reference other data rows, and there's a need to aggregate data based on those references.
So if you find yourself doing calculations more complex than a sum or an average, you might explore data structures that can be created in O(n) and would keep your calculation operations in O(n) or better. As Martin pointed out, it sounds like your whole data sets can be held in memory comfortably, so this may yield some big wins. What kind of data structure you'd create would be dependent on the nature of the calculation you're doing.
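For instance, a sketch of that row-reference case (hypothetical row layout): a dict index built in O(n) keeps each lookup O(1), so the aggregation stays O(n):

rows = [
    {'id': 1, 'parent': None, 'value': 10.0},
    {'id': 2, 'parent': 1, 'value': 2.5},
    {'id': 3, 'parent': 1, 'value': 4.0},
]
by_id = {r['id']: r for r in rows}  # intermediate structure, built once in O(n)

child_sums = {}
for r in rows:
    if r['parent'] is not None:
        parent = by_id[r['parent']]  # O(1) lookup instead of an O(n) scan
        key = parent['id']
        child_sums[key] = child_sums.get(key, 0.0) + r['value']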
2) Pre-cache.
Depending on how the data is to be used, you could store the calculated values ahead of time. As soon as the data is produced/loaded, perform your sums, averages, etc., and store those aggregations alongside your original data, or hold them in memory as long as your program runs. If this strategy is applicable to your project (i.e. if the users aren't coming up with unforeseen calculation requests on the fly), reading the data shouldn't be prohibitively long-running, whether the data comes from text or a database.
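A sketch of that pre-caching idea (data layout is assumed):

# compute aggregates at load time and keep them alongside the data
data = {'rows': [(0.0, 1.5), (1.0, 2.5), (2.0, 3.0)]}  # as loaded from disk
values = [v for _, v in data['rows']]
data['sum'] = sum(values)
data['avg'] = data['sum'] / len(values)
# later requests read data['avg'] instead of recomputing per request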
What matters most is whether all the data will fit simultaneously into memory. From the size that you give, it seems that this is easily the case (a few megabytes at worst).
If so, I would discourage using a relational database and instead do all operations directly in Python. Depending on what other processing you need, I would probably rather use binary pickles than CSV.
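A sketch of the pickle approach (file names are illustrative):

import csv
import pickle

# one-time conversion: parse the CSV and persist the parsed rows
with open('series.csv') as f:
    rows = [(float(t), float(v)) for t, v in csv.reader(f)]
with open('series.pkl', 'wb') as f:
    pickle.dump(rows, f)

# later runs skip the parsing and load the binary pickle directly
with open('series.pkl', 'rb') as f:
    rows = pickle.load(f)
mean = sum(v for _, v in rows) / len(rows)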
Are you likely to need all rows in order or will you want only specific known rows?
If you need to read all the data there isn't much advantage to having it in a database.
edit: If the data fits in memory then a simple CSV file is fine. Plain text data formats are always easier to deal with than opaque ones if you can use them.