Override SQLAlchemy ORM filter condition - python

How can I undo a filter condition on an SQLAlchemy query? E.g.
q = Model.query.filter(Model.age > 25)
q = q.remove_filter(Model.age) # what is this called?

In addition to the other suggestions, the simplest solution (whether it is actually simple depends on your exact situation) can be to just not apply the filters you don't need.
Most probably you have some complex code that conditionally applies filters, and then you hit a case where you want to undo some of them.
In that case, instead of applying all the filters to the query, collect the filters you need into some collection, for example a dictionary or a list.
Then remove the filters you don't need from that collection, and only afterwards apply the remaining filters to the query object.
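A minimal sketch of that idea, assuming the Model from the question plus a hypothetical second condition; the names are illustrative only:

filters = {
    'age': Model.age > 25,
    'active': Model.active.is_(True),   # hypothetical second condition
}

# Decide you no longer need the age filter: simply drop it from the collection.
filters.pop('age', None)

# Apply whatever is left in one go.
q = Model.query
if filters:
    q = q.filter(*filters.values())

This keeps the query construction in a single place, so "undoing" a filter is just a dictionary operation instead of surgery on the query object.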

I don't believe that feature currently exists. That being said, you can view (and delete) the where clauses (filters) by directly accessing the q.whereclause.clauses list of BinaryExpression objects.
If that's your only filter, something like this would work:
q.whereclause.clauses.pop()
But it gets a little trickier (and hackier) if you want to remove a specific filter out of many.

Related

In Python Django, would it be possible to extract data from a database table and store it in an array?

I tried to extract the data using 'tablename.objects.Fname()', but I am still confused about how to store all the first names from the database in an array.
If that is possible, could anyone provide an example? Any sort of help would be appreciated.
You can obtain the values stored in a column by using .values(…), or .values_list(…). For example:
tablename.objects.values_list('Fname', flat=True)
This QuerySet is an iterable that contains one element per record: the cleaned value of that record's field. So if the field were an ArrayField, it would contain a collection of lists.
But using an ArrayField [Django-doc] or other composite field is often not a good idea. It makes the items in the array harder to process, filter, JOIN, etc. Therefore it is often better to make an extra table, and define a many-to-one relation, for example with a ForeignKey [Django-doc].
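A minimal sketch, reusing the model and field names from the question (the max_length is an assumption):

# models.py
from django.db import models

class tablename(models.Model):
    Fname = models.CharField(max_length=100)   # length is an assumption

# e.g. in a view or the Django shell
first_names = list(tablename.objects.values_list('Fname', flat=True))
# first_names is now a plain Python list such as ['Alice', 'Bob', ...]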

Does Dask/Pandas support removing rows in a group based on complex conditions that rely on other rows?

I'm processing a bunch of text-based records in CSV format using Dask, which I am learning in order to work around too-large-to-fit-in-memory problems, and I'm trying to filter records within groups that best match a complicated set of criteria.
The best approach I've identified so far is to use Dask to group records into bite-sized chunks and then write the applicable logic in Python:
def reduce_frame(partition):
    records = partition.to_dict('records')
    shortlisted_records = []
    # Use Python to locate promising looking records.
    # Some of the criteria can be cythonized; one criterion
    # revolves around whether record is a parent or child
    # of records in shortlisted_records.
    for record in records:
        for other in shortlisted_records:
            if other['path'].startswith(record['path']) \
                    or record['path'].startswith(other['path']):
                ...  # keep one, possibly both
        ...
    return pd.DataFrame.from_dict(shortlisted_records)

df = df.groupby('key').apply(reduce_frame, meta={...})
In case it matters, the complicated criteria revolve around weeding out promising-looking links on a web page based on link URL, link text, and CSS selectors across the entire group. Think: with A and B already in the shortlist and C a new record, keep all three if each is very promising, else prefer C over A and/or B if it is more promising than either or both, else drop C. The resulting Pandas partition objects above are tiny. (The dataset in its entirety is not, hence my use of Dask.)
Since Pandas exposes inherently row- and column-based functionality, I'm struggling to imagine any vectorized approach to solving this, so I'm exploring writing the logic in plain Python.
Is the above the correct way to proceed, or are there more idiomatic Dask/Pandas ways - or simply better ways - to approach this type of problem? Ideally one that allows parallelizing the computations across a cluster, for instance by using Dask.bag or Dask.delayed and/or cytoolz or something else I might have missed while learning Python?
I know nothing about Dask, but can tell a little about passing / blocking
some rows using Pandas.
It is possible to use groupby(...).apply(...) to "filter" the
source DataFrame.
Example: df.groupby('key').apply(lambda grp: grp.head(2)) returns
the first 2 rows from each group.
In your case, write a function to be applied to each group, which:
contains some logic processing the current group,
generates the output DataFrame based on this logic, e.g. returning only some of the input rows.
The returned rows are then concatenated, forming the result of apply.
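A minimal sketch of that pattern, using a small hypothetical DataFrame (the column names and the keep-above-group-mean rule are assumptions, just to show the mechanics):

import pandas as pd

df = pd.DataFrame({
    'key':   ['a', 'a', 'a', 'b', 'b'],
    'value': [1, 5, 9, 2, 8],
})

def keep_above_mean(grp):
    # Any per-group logic can go here; return only the rows to keep.
    return grp[grp['value'] > grp['value'].mean()]

result = df.groupby('key', group_keys=False).apply(keep_above_mean)
# keeps the row with value 9 from group 'a' and the row with value 8 from group 'b'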
Another possibility is to use groupby(...).filter(...), but in this
case the underlying function returns a decision to "pass" or "block"
each whole group of rows.
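For comparison, a sketch of groupby(...).filter(...) on the same hypothetical DataFrame; it keeps or drops whole groups rather than individual rows (the sum threshold is an assumption):

kept = df.groupby('key').filter(lambda grp: grp['value'].sum() > 10)
# group 'a' sums to 15 and is kept in full; group 'b' sums to 10 and is dropped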
Yet another possibility is to define a "filtering function",
say filtFun, which returns True (pass the row) or False (block the row).
Then:
Run: msk = df.apply(filtFun, axis=1) to generate a mask (which rows
passed the filter).
In further processing use df[msk], i.e. only those rows which passed
the filter.
But in this case the underlying function has access only to the current row,
not to the whole group of rows.
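A sketch of the row-wise variant on the same hypothetical DataFrame; note that filtFun only ever sees a single row (the per-row condition is an assumption):

def filtFun(row):
    return row['value'] >= 5     # illustrative per-row condition

msk = df.apply(filtFun, axis=1)  # boolean mask: which rows passed
filtered = df[msk]               # only the rows with values 5, 9 and 8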

Apply an operation to almost all fields of a document in MongoDB?

There are a few (< 10) collections; for each collection I need to apply a fixed operation to certain fields of a document (which vary from collection to collection). The operation needs to be applied to almost all fields, except a few.
An approach I could think of:
Keep, for each collection, a list of fields to which I needn't apply that operation; read all fields minus the ones present in the list, and apply the operation.
Is there a better way to tackle this problem?
Yes, if the number of excluded fields is much smaller than the number of included fields, your approach seems right.
But as far as I know there is no way of sending a complete document to a mongo[s|d] and telling it to only skip certain fields.
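One way to do the skipping on the client side is a sketch like the following, assuming pymongo; the database, collection and excluded field names, and the operation itself, are placeholders:

from pymongo import MongoClient

client = MongoClient()
coll = client['mydb']['mycollection']      # hypothetical names

EXCLUDED = {'_id', 'created_at'}           # fields the operation must not touch

def operation(value):
    return value                           # placeholder for the fixed operation

for doc in coll.find():
    updates = {field: operation(val)
               for field, val in doc.items()
               if field not in EXCLUDED}
    if updates:
        coll.update_one({'_id': doc['_id']}, {'$set': updates})

Repeat this per collection with its own exclusion list.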

Many-to-many data structure in Python

I was wondering how I could implement a many-to-many relationship data-structure. Or if something like this already exists in the wild.
What I would need is two groups of objects, where members from one group are relating to multiple members of the other group. And vice versa.
I also need the structure to have some sort of consistency, meaning members without any connections are dropped, or basically cannot exist.
I have seen this answer (it involves an SQLite database), but I am not working with such huge volumes of objects, so it's not an appropriate answer for this context: Many-to-many data structure in Python
Depending on how big your dataset is, you could simply build all possible pairs and then assign booleans indicating whether each relationship exists.
itertools.combinations
can be of help to generate all possible combinations.
Consistency can then be added by checking if any connections are True for each value.
I do not claim this is the prettiest approach, but it should work on smaller datasets.
https://docs.python.org/2/library/itertools.html#itertools.combinations
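A minimal sketch of that idea with two small illustrative groups. Note it uses itertools.product rather than itertools.combinations, since the pairs span two distinct groups (combinations is the within-one-group counterpart):

from itertools import product

group_a = ['a1', 'a2', 'a3']
group_b = ['b1', 'b2']

# One boolean per possible (a, b) pair; start with no connections.
relations = {pair: False for pair in product(group_a, group_b)}

# Connect some members.
relations[('a1', 'b1')] = True
relations[('a2', 'b1')] = True

# Consistency pass: drop members that have no True connection at all.
group_a = [a for a in group_a if any(relations[(a, b)] for b in group_b)]
group_b = [b for b in group_b if any(relations[(a, b)] for a in group_a)]
# group_a -> ['a1', 'a2'], group_b -> ['b1']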

SqlAlchemy mapped bulk update - make safer and faster?

I'm using Postgres 9.2 and SqlAlchemy. Currently, this is my code to update the rankings of my Things in my database:
lock_things = session.query(Thing).\
    filter(Thing.group_id == 4).\
    with_for_update().all()

tups = RankThings(lock_things)  # return sorted tuples (<numeric>, <primary key Thing id>)

rank = 1
for prediction, id in tups:
    thing = session.query(Thing).\
        filter(Thing.group_id == 4).\
        filter(Thing.id == id).one()
    thing.rank = rank
    rank += 1

session.commit()
However, this seems slow. It's also something I want to be atomic, which is why I use the with_for_update() syntax.
I feel like there must be a way to "zip" up the query and do the update in one pass.
How can I make this faster and do it all in one query?
EDIT: I think I need to create a temp table to join and make a fast update, see:
https://stackoverflow.com/a/20224370/712997
http://tapoueh.org/blog/2013/03/15-batch-update
Any ideas how to do this in SqlAlchemy?
Generally speaking, with such operations you aim for two things:
Do not execute a query inside a loop
Reduce the number of queries required by performing computations on the SQL side
Additionally, you might want to merge some of the queries you have, if possible.
Let's start with 2), because this is very specific and often not easily possible. Generally, the fastest operation here would be to write a single query that returns the rank. There are two options with this:
The query is quick to run, so you just execute it whenever you need the ranking. This would be the very simple case of something like this:
SELECT
    thing.*,
    (POINTS_QUERY) AS score
FROM thing
ORDER BY score DESC
In this case, this will give you an ordered list of things by some artificial score (e.g. if you build some kind of competition). The POINTS_QUERY would be something that uses a specific thing in a subquery to determine its score, e.g. aggregate the points of all the tasks it has solved.
In SQLAlchemy, this would look like this:
score = session.query(func.sum(task.points)).filter(task.thing_id == Thing.id).correlate(Thing).label("score")
thing_ranking = session.query(thing, score).order_by(desc("score"))
This is a somewhat more advanced usage of SQLAlchemy: we construct a subquery that returns a scalar value we labeled score. With correlate we tell it that thing will come from an outer query (this is important).
So that was the case where you run a single query that gives you a ranking (the ranks are determined based on the index in the list and depend on your ranking strategy). If you can achieve this, it is the best case.
The query itself is expensive, so you want the values cached. This means you can either use the solution above and cache the values outside of the database (e.g. in a dict or using a caching library), or you compute them like above but update a database field (like Thing.rank). Again, the query from above gives us the ranking. Additionally, I assume the simplest kind of ranking: the index denotes the rank:
for rank, (thing, score) in enumerate(thing_ranking):
    thing.rank = rank
Notice how I base my rank on the index using enumerate. Additionally, I take advantage of the fact that since I just queried thing, I already have it in the session, so no need for an extra query. So this might be your solution right here, but read on for some additional info.
Using the last idea from above, we can now tackle 1): Get the query outside the loop. In general I noticed that you pass a list of things to a sorting function that only seems to return IDs. Why? If you can change it, make it so that it returns the things as a whole.
However, it might be possible that you cannot change this function so let's consider what we do if we can't change it. We already have a list of all relevant things. And we get a sorted list of their IDs. So why not build a dict as a lookup for ID -> Thing?
things_dict = {thing.id: thing for thing in lock_things}
We can use this dict instead of querying inside the loop:
for prediction, id in tups:
    thing = things_dict[id]
However, it may be possible (for some reason I missed in your example) that not all IDs were returned previously. In that case (or in general) you can take advantage of a similar mapping SQLAlchemy keeps itself: You can ask it for a primary key and it will not query the database if it already has it:
for prediction, id in tups:
    thing = session.query(Thing).get(id)
So that way we have reduced the problem and only execute queries for objects we don't already have.
One last thing: What if we don't have most of the things? Then I didn't solve your problem, I just replaced the query. In that case, you will have to create a new query that fetches all the elements you need. In general this depends on the source of the IDs and how they are determined, but you could always go the least efficient way (which is still way faster than inside-loop queries): Using SQL's IN:
all_things = session.query(Thing).filter(Thing.group_id == 4).filter(Thing.id.in_([id for _, id in tups])).all()
This would construct a query that filters with the IN keyword. However, with a large list of IDs this is terribly inefficient, so if you are in this case, it is most likely better to construct some more efficient way in SQL to determine whether an ID is one you want.
Summary
This was a long text, so to sum up:
Perform queries in SQL as much as possible if you can write it efficiently there
Use SQLAlchemy's awesomeness to your advantage, e.g. create subqueries
Try to never execute queries inside a loop
Create some mappings for yourself (or use that of SQLAlchemy to your advantage)
Do it the pythonic way: Keep it simple, keep it explicit.
One final thought: if your queries get really complex and you fear you lose control over the queries executed by the ORM, drop it and use the Core instead. It is almost as awesome as the ORM and gives you a huge amount of control over the queries, since you build them yourself. With it you can construct almost any SQL query you can think of, and I am certain that the batch updates you mentioned are also possible (if you see that my queries above lead to many UPDATE statements, you might want to use the Core).
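As a closing illustration, here is one possible Core-style batch update sketch; it assumes tups is the sorted list of (prediction, thing_id) pairs from the question and that an engine object is available, and it is only one way to phrase such an update:

from sqlalchemy import bindparam, update

stmt = (
    update(Thing.__table__)
    .where(Thing.__table__.c.id == bindparam('thing_id'))
    .values(rank=bindparam('new_rank'))
)

params = [{'thing_id': id, 'new_rank': rank}
          for rank, (prediction, id) in enumerate(tups, start=1)]

with engine.begin() as conn:
    conn.execute(stmt, params)   # executemany: all ranks updated in one batch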
