Apply an operation to almost all fields of a document in MongoDB?

There are a few (< 10) collections, and for each collection I need to apply a fixed operation to certain fields of a document (which fields will vary from collection to collection). The operation needs to be applied to almost all fields, except a few.
An approach I could think of:
Keep a list, for each collection, of the fields to which the operation should not be applied; read all fields minus the ones in that list and apply the operation.
Is there a better way to tackle this problem?

Yes, if the number of excluded fields is much smaller than the number of included fields, your approach seems right.
As far as I know, there is no way of sending a complete document to a mongo[s|d] and telling it to skip only certain fields.
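
The exclusion-list approach from the question can be sketched in plain Python, applied to a document after reading it and before writing it back. The names EXCLUDED_FIELDS and apply_to_fields, the collection names, and the placeholder operation (str.upper) are all hypothetical, and only top-level fields are handled:

```python
# Hypothetical per-collection exclusion lists; "_id" is listed so the key is never altered.
EXCLUDED_FIELDS = {
    "users": {"_id", "email"},
    "orders": {"_id", "created_at"},
}


def apply_to_fields(document, excluded, operation):
    """Return a copy of `document` with `operation` applied to every
    top-level field whose name is not in `excluded`."""
    return {
        key: value if key in excluded else operation(value)
        for key, value in document.items()
    }


doc = {"_id": 1, "email": "a@example.com", "name": "ann", "city": "oslo"}
result = apply_to_fields(doc, EXCLUDED_FIELDS["users"], str.upper)
print(result)
```

Because the function returns a new dict, the original document stays untouched until you decide to write the result back.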

Related

In python django, would it be possible to extract data from database table and store it in an array?

I tried to extract the data using 'tablename.objects.Fname()', but I am still confused about how to store all the first names from the database in an array.
If it is possible, could anyone provide an example? Any sort of help would be appreciated.

You can obtain the values stored in a column by using .values(…), or .values_list(…). For example:
tablename.objects.values_list('Fname', flat=True)
This QuerySet is an iterable that for each record will contain one element with the cleaned value of that record. So if it is an ArrayField, it will contain a collection of lists.
But using an ArrayField or other composite field is often not a good idea. It makes the items in the array harder to process, filter, JOIN, etc. Therefore it is often better to make an extra table, and define a many-to-one relation, for example with a ForeignKey.

Many to Many data-structure in python

I was wondering how I could implement a many-to-many relationship data-structure. Or if something like this already exists in the wild.
What I would need is two groups of objects, where members from one group are relating to multiple members of the other group. And vice versa.
I also need the structure to have some sort of consistency, meaning members without any connections are dropped, or basically cannot exist.
I have seen this answer (it involves an SQLite database), but I am not working with such huge volumes of objects, so it is not an appropriate answer for this context: Many-to-many data structure in Python

Depending on how big your dataset is, you could simply build all possible sets and then assign booleans to see whether the relationship exists.
itertools.combinations can help to generate all possible combinations.
Consistency can then be added by checking if any connections are True for each value.
I do not claim this is the prettiest approach, but it should work on smaller datasets.
https://docs.python.org/2/library/itertools.html#itertools.combinations
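
A minimal sketch of this idea for two separate groups follows. It uses itertools.product rather than itertools.combinations, because product pairs each member of one group with each member of the other, whereas combinations pairs members within a single collection. The function names and the sample data are hypothetical.

```python
from itertools import product


def build_relation(left, right, linked_pairs):
    """Map every possible (left, right) pair to True or False."""
    linked = set(linked_pairs)
    return {pair: pair in linked for pair in product(left, right)}


def prune_unconnected(relation):
    """Drop every pair that involves a member with no connection at all."""
    left_ok = {a for (a, _), is_linked in relation.items() if is_linked}
    right_ok = {b for (_, b), is_linked in relation.items() if is_linked}
    return {
        (a, b): is_linked
        for (a, b), is_linked in relation.items()
        if a in left_ok and b in right_ok
    }


people = ["ann", "bob", "cid"]
clubs = ["chess", "golf"]
relation = build_relation(
    people, clubs, [("ann", "chess"), ("bob", "chess"), ("bob", "golf")]
)
pruned = prune_unconnected(relation)
print(sorted(pruned))
```

The table grows with the product of the two group sizes, so this only stays practical for small datasets, as noted above.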

SqlAlchemy mapped bulk update - make safer and faster?

I'm using Postgres 9.2 and SqlAlchemy. Currently, this is my code to update the rankings of my Things in my database:
lock_things = session.query(Thing).\
    filter(Thing.group_id == 4).\
    with_for_update().all()
tups = RankThings(lock_things) # return sorted tuple (<numeric>, <primary key Thing id>)
rank = 1
for prediction, id in tups:
    thing = session.query(Thing).\
        filter(Thing.group_id == 4).\
        filter(Thing.id == id).one()
    thing.rank = rank
    rank += 1
session.commit()
However, this seems slow. It's also something I want to be atomic, which is why I use the with_for_update() syntax.
I feel like there must be a way to "zip" up the query and do an update that way.
How can I make this faster and do it all in one query?
EDIT: I think I need to create a temp table to join and make a fast update, see:
https://stackoverflow.com/a/20224370/712997
http://tapoueh.org/blog/2013/03/15-batch-update
Any ideas how to do this in SqlAlchemy?

Generally speaking with such operations you aim for two things:
1) Do not execute a query inside a loop
2) Reduce the number of queries required by performing computations on the SQL side
Additionally, you might want to merge some of the queries you have, if possible.
Let's start with 2), because this is very specific and often not easily possible. Generally, the fastest operation here would be to write a single query that returns the rank. There are two options with this:
The query is quick to run so you just execute it whenever you need the ranking. This would be the very simple case of something like this:
SELECT
    thing.*,
    (POINTS_QUERY) AS score
FROM thing
ORDER BY score DESC
In this case, this will give you an ordered list of things by some artificial score (e.g. if you build some kind of competition). The POINTS_QUERY would be something that uses a specific thing in a subquery to determine its score, e.g. aggregate the points of all the tasks it has solved.
In SQLAlchemy, this would look like this:
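from sqlalchemy import desc, func
# Task is an assumed mapped class with points and thing_id columns
score = session.query(func.sum(Task.points)).filter(Task.thing_id == Thing.id).correlate(Thing).label("score")
thing_ranking = session.query(Thing, score).order_by(desc("score"))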
This is a slightly more advanced use of SQLAlchemy: we construct a subquery that returns a scalar value, which we label score. With correlate we tell it that Thing will come from the outer query (this is important).
So that was the case where you run a single query that gives you a ranking (the ranks are determined by the index in the list and depend on your ranking strategy). If you can achieve this, it is the best case.
The query itself is expensive, so you want the values cached. This means you can either use the solution above and cache the values outside of the database (e.g. in a dict or using a caching library), or you compute them like above but update a database field (like Thing.rank). Again, the query from above gives us the ranking. Additionally, I assume the simplest kind of ranking, where the index denotes the rank:
for rank, (thing, score) in enumerate(thing_ranking):
    thing.rank = rank
Notice how I base the rank on the index using enumerate. Additionally, I take advantage of the fact that since I just queried thing, I already have it in the session, so there is no need for an extra query. So this might be your solution right here, but read on for some additional info.
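
As a small illustration of the enumerate idea outside of any database: enumerate counts from 0 by default, while the loop in the question starts at rank 1, which start=1 reproduces. The SimpleNamespace objects and the scores are hypothetical stand-ins for mapped Thing instances and their computed scores.

```python
from types import SimpleNamespace

# Stand-ins for (thing, score) rows already sorted by score, best first.
thing_ranking = [
    (SimpleNamespace(id=7, rank=None), 90),
    (SimpleNamespace(id=3, rank=None), 75),
    (SimpleNamespace(id=9, rank=None), 40),
]

# start=1 makes the best entry rank 1 instead of rank 0.
for rank, (thing, score) in enumerate(thing_ranking, start=1):
    thing.rank = rank

ranks = {thing.id: thing.rank for thing, _ in thing_ranking}
print(ranks)
```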
Using the last idea from above, we can now tackle 1): Get the query outside the loop. In general I noticed that you pass a list of things to a sorting function that only seems to return IDs. Why? If you can change it, make it so that it returns the things as a whole.
However, it might be possible that you cannot change this function so let's consider what we do if we can't change it. We already have a list of all relevant things. And we get a sorted list of their IDs. So why not build a dict as a lookup for ID -> Thing?
things_dict = {thing.id: thing for thing in lock_things}
We can use this dict instead of querying inside the loop:
for prediction, id in tups:
    thing = things_dict[id]
However, it may be possible (for some reason I missed in your example) that not all IDs were returned previously. In that case (or in general) you can take advantage of a similar mapping SQLAlchemy keeps itself: You can ask it for a primary key and it will not query the database if it already has it:
for prediction, id in tups:
    thing = session.query(Thing).get(id)
So that way we have reduced the problem and only execute queries for objects we don't already have.
One last thing: What if we don't have most of the things? Then I didn't solve your problem, I just replaced the query. In that case, you will have to create a new query that fetches all the elements you need. In general this depends on the source of the IDs and how they are determined, but you could always go the least efficient way (which is still way faster than inside-loop queries): Using SQL's IN:
all_things = session.query(Thing).filter(Thing.group_id == 4).filter(Thing.id.in_([id for _, id in tups])).all()
This would construct a query that filters with the IN keyword. However, with a large list of things this is terribly inefficient and thus if you are in this case, it is most likely better you construct some more efficient way in SQL that determines if this is an ID you want.
Summary
That was a long text. To sum up:
Perform queries in SQL as much as possible if you can write it efficiently there
Use SQLAlchemy's awesomeness to your advantage, e.g. create subqueries
Try to never execute queries inside a loop
Create some mappings for yourself (or use that of SQLAlchemy to your advantage)
Do it the pythonic way: Keep it simple, keep it explicit.
One final thought: If your queries get really complex and you fear you will lose control over the queries executed by the ORM, drop it and use the Core instead. It is almost as awesome as the ORM and gives you huge amounts of control over the queries, as you build them yourself. With this you can construct almost any SQL query you can think of, and I am certain that the batch updates you mentioned are also possible there (if you see that my queries above lead to many UPDATE statements, you might want to use the Core).
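
To make the batching idea concrete without a database server, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. It is not SQLAlchemy code and it does not reproduce Postgres row locking (with_for_update); the table layout and the contents of tups are assumptions for illustration. The point is that all rank updates are handed over as one parameter list inside one transaction, with no SELECT per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE thing (id INTEGER PRIMARY KEY, group_id INTEGER, rank INTEGER)"
)
conn.executemany(
    "INSERT INTO thing (id, group_id) VALUES (?, ?)",
    [(10, 4), (11, 4), (12, 4), (13, 5)],
)
conn.commit()

# Hypothetical output of the ranking function: (score, primary key), best first.
tups = [(0.9, 12), (0.7, 10), (0.2, 11)]

# One (rank, id) parameter row per thing; ranks start at 1.
params = [(rank, thing_id) for rank, (_, thing_id) in enumerate(tups, start=1)]

# `with conn` commits the whole batch on success and rolls it back on error.
with conn:
    conn.executemany(
        "UPDATE thing SET rank = ? WHERE id = ? AND group_id = 4", params
    )

rows = conn.execute("SELECT id, rank FROM thing ORDER BY id").fetchall()
print(rows)
```

The thing in group 5 keeps a NULL rank because no parameter row targets it.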

Effective implementation of one-to-many relationship with Python NDB

I would like to hear your opinion about the effective implementation of one-to-many relationship with Python NDB. (e.g. Person(one)-to-Tasks(many))
In my understanding, there are three ways to implement it.
1. Use 'parent' argument
2. Use 'repeated' Structured property
3. Use 'repeated' Key property
I usually choose a way based on the logic below; does it make sense to you?
If you have better logic, please share it.
1. Use 'parent' argument
   - Transactional operation is required between these entities
   - Bidirectional reference is required between these entities
   - Strongly intend 'Parent-Child' relationship
2. Use 'repeated' Structured property
   - Don't need to use 'many' entity individually (Always, used with 'one' entity)
   - 'many' entity is only referred by 'one' entity
   - Number of 'repeated' is less than 100
3. Use 'repeated' Key property
   - Need to use 'many' entity individually
   - 'many' entity can be referred by other entities
   - Number of 'repeated' is more than 100
Option 2 increases the size of the entity, but it saves datastore operations. (We need to use a projection query to reduce CPU time for deserialization, though.) Therefore, I use this way as much as I can.
I really appreciate your opinion.

A key thing you are missing: How are you reading the data?
If you are displaying all the tasks for a given person on a request, 2 makes sense: you can query the person and show all his tasks.
However, if you need to query, say, a list of all tasks due at a certain time, querying for repeated structured properties is terrible. You will want individual entities for your Tasks.
There's a fourth option, which is to use a KeyProperty in your Task that points to your Person. When you need a list of Tasks for a person you can issue a query.
If you need to search for individual Tasks, then you probably want to go with #4. You can use it in combination with #3 as well.
Also, the number of repeated properties has nothing to do with 100. It has everything to do with the size of your Person and Task entities, and how much will fit into 1MB. This is potentially dangerous, because if your Task entity can potentially be large, you might run out of space in your Person entity faster than you expect.

One thing that most GAE users will come to realize (sooner or later) is that the datastore does not encourage design according to the formal normalization principles that would be considered a good idea in relational databases. Instead it often seems to encourage design that is unintuitive and anathema to established norms. Although relational database design principles have their place, they just don't work here.
I think the basis for the datastore design instead falls into two questions:
How am I going to read this data and how do I read it with the minimum number of read operations?
Is storing it that way going to lead to an explosion in the number of write and indexing operations?
If you answer these two questions with as much foresight and actual tests as you can, I think you're doing pretty well. You could formalize other rules and specific cases, but these questions will work most of the time.

How to store data pairs in django without an extra model?

I want to create an app that stores bills for me. Since I don't have fixed prices for anything, there is no need to store my services in an extra model. I just want to store data pairs of "action" and "price" to print them out nicely in a table.
Is there something in Django that can help me with that task, or should I just put all data pairs together in a TextField and split it every time I want to use it?
The number of data pairs per bill is not fixed. Data pairs are used only in one bill, so I don't want an extra table.

Instead of a plain TextField you should look at field types that are better suited for storing structured data: In addition to Acorn's suggestion (django-hstore) a JsonField or a PickleField might be suitable (and more portable) solutions for your use case.
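
To show roughly what a JSON-based field does with such pairs, here is a minimal sketch using only the standard library. The helper names dump_items and load_items and the sample bill are hypothetical; a real JSON field would handle the serialization for you. Prices are stored as strings so that no decimal precision is lost.

```python
import json
from decimal import Decimal


def dump_items(items):
    """Serialize (action, price) pairs to a JSON string."""
    return json.dumps([{"action": a, "price": str(p)} for a, p in items])


def load_items(text):
    """Inverse of dump_items: return a list of (action, Decimal) pairs."""
    return [(row["action"], Decimal(row["price"])) for row in json.loads(text)]


items = [("Consulting", Decimal("120.00")), ("Travel", Decimal("35.50"))]
stored = dump_items(items)      # this string is what would sit in the column
restored = load_items(stored)
total = sum(price for _, price in restored)
print(stored, total)
```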

You might be interested in Postgres's hstore: http://www.postgresql.org/docs/9.1/static/hstore.html
https://github.com/jordanm/django-hstore/
