Order by sometimes not work in my query

I defined two classes:
class OrderEntryVacancyRenew(OrderEntry):
    ...
    vacancy_id = db.Column(db.Integer, db.ForeignKey('vacancy.id'), nullable=False)
    vacancy = db.relationship('Vacancy')
    remaining = db.Column(db.SmallInteger)
class Vacancy(db.Model):
    id = db.Column(db.Integer, autoincrement=True, primary_key=True)
    renew_at = db.Column(TZDateTime, index=True)
Then I defined the method to refresh OrderEntryVacancyRenew.remaining and Vacancy.renew_at.
def renew_vacancy():
    filters = [
        OrderEntryVacancyRenew.remaining,
        Vacancy.status == 0,
        or_(
            Vacancy.renew_at <= (utcnow() - INTERVAL),
            Vacancy.renew_at.is_(None))
    ]
    renew_vacancies = OrderEntryVacancyRenew.query.options(
        load_only('remaining', 'vacancy_id')
    ).order_by(
        OrderEntryVacancyRenew.id
    ).from_self().group_by(
        OrderEntryVacancyRenew.vacancy_id
    ).join(
        OrderEntryVacancyRenew.vacancy
    ).options(
        contains_eager(OrderEntryVacancyRenew.vacancy).load_only('renew_at')
    ).filter(*filters)
    for entry in renew_vacancies:
        entry.vacancy.renew_at = utcnow()
        entry.remaining -= 1
    db.session.commit()
I wrote the unit test to check renew_vacancy
vacancy1 = Vacancy(id=10000)
vacancy2 = Vacancy(id=10001)
db.session.add_all([vacancy1, vacancy2])
vacancy_renew1 = OrderEntryVacancyRenew(
    vacancy_id=vacancy1.id,
    remaining=24)
# make sure vacancy_renew1.id < vacancy_renew2.id
db.session.add(vacancy_renew1)
db.session.commit()
vacancy_renew2 = OrderEntryVacancyRenew(
    vacancy_id=vacancy1.id,
    remaining=8)
vacancy_renew3 = OrderEntryVacancyRenew(
    vacancy_id=vacancy2.id,
    remaining=42)
db.session.add_all((vacancy_renew2, vacancy_renew3))
db.session.commit()
renew_vacancy()
self.assertEqual(
    (vacancy_renew1.remaining, vacancy_renew2.remaining), (23, 8))
renew_vacancies is ordered by OrderEntryVacancyRenew.id and grouped by the vacancy id, so I expect it to select vacancy_renew1 and vacancy_renew3.
I used the following command to run the unit test 100 times:
for i in `seq 1 100`; do python test.py; done
In some rare cases, it selects vacancy_renew2 instead of vacancy_renew1.
Why does it happen that sometimes order by does not work as expected?
I tried printing vacancy_renew1.id and vacancy_renew2.id after renew_vacancy:
...
db.session.commit()
renew_vacancy()
print vacancy_renew1.id
print vacancy_renew2.id
self.assertEqual(
    (vacancy_renew1.remaining, vacancy_renew2.remaining), (23, 8))
...

Why does it happen that sometimes ORDER BY does not work as expected?
Given standard SQL, the results of your query are indeterminate, so it is not very valuable to know why it works most of the time and fails rarely. There are two things that make the results vary:
1) Generally you should not rely on the order of rows of a subquery in enclosing queries, even if you apply an ordering. Some database implementations may offer additional guarantees, but on others the optimizer may deem the ORDER BY unnecessary. MySQL 5.7 and up does so: it removes the subquery entirely.
2) Databases such as SQLite and MySQL [1] that allow selecting non-aggregate items that are neither named in the GROUP BY clause nor functionally dependent on it usually leave it unspecified from which row in the group the values are taken:
SQLite:
... Otherwise, it is evaluated against a single arbitrarily chosen row from within the group. If there is more than one non-aggregate expression in the result-set, then all such expressions are evaluated for the same row.
MySQL:
... In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic, which is probably not what you want. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause.
Trying out the query on SQLite failed the assertion on this machine, while on MySQL it passed. This is probably due to implementation of selecting the row from within the group, but you should not rely on such details.
What you seem to be after is a greatest-n-per-group query, or top-1. Not knowing which database you are using, here's a somewhat generic way to do just that using an EXISTS subquery expression:
renew_alias = db.aliased(OrderEntryVacancyRenew)
renew_vacancies = db.session.query(OrderEntryVacancyRenew).\
    join(OrderEntryVacancyRenew.vacancy).\
    options(
        load_only('remaining'),
        contains_eager(OrderEntryVacancyRenew.vacancy).load_only('renew_at')).\
    filter(db.not_(db.exists().where(
        db.and_(renew_alias.vacancy_id == OrderEntryVacancyRenew.vacancy_id,
                renew_alias.id < OrderEntryVacancyRenew.id)))).\
    filter(*filters)
This query passes the assertion on both SQLite and MySQL. Alternatively you could replace the EXISTS subquery expression with a LEFT JOIN and IS NULL predicate.
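As a rough illustration of both formulations, here is a minimal sketch in plain SQL, run through Python's standard sqlite3 module rather than SQLAlchemy. The table name renew and its three rows are assumptions that mirror the unit test in the question; the real models and filters are left out.

```python
import sqlite3

# In-memory database with hypothetical data mirroring the question's unit test.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE renew (
        id INTEGER PRIMARY KEY,
        vacancy_id INTEGER NOT NULL,
        remaining INTEGER
    );
    INSERT INTO renew (id, vacancy_id, remaining) VALUES
        (1, 10000, 24),
        (2, 10000, 8),
        (3, 10001, 42);
""")

# NOT EXISTS: keep a row only if no row of the same vacancy has a smaller id.
not_exists_sql = """
    SELECT r.id, r.vacancy_id, r.remaining
    FROM renew AS r
    WHERE NOT EXISTS (
        SELECT 1 FROM renew AS other
        WHERE other.vacancy_id = r.vacancy_id AND other.id < r.id
    )
    ORDER BY r.id
"""

# LEFT JOIN / IS NULL: the same condition expressed as an anti-join.
left_join_sql = """
    SELECT r.id, r.vacancy_id, r.remaining
    FROM renew AS r
    LEFT JOIN renew AS other
        ON other.vacancy_id = r.vacancy_id AND other.id < r.id
    WHERE other.id IS NULL
    ORDER BY r.id
"""

first_per_vacancy = conn.execute(not_exists_sql).fetchall()
first_per_vacancy_join = conn.execute(left_join_sql).fetchall()
print(first_per_vacancy)
print(first_per_vacancy_join)
```

Both statements pick the row with the lowest id per vacancy_id without relying on GROUP BY choosing a particular row.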
P.S. I suppose you're using some flavour of MySQL and following advice such as this. You should read the commentary on that one as well, since there are many people rightly pointing out the pitfalls. It does not work, at least for MySQL 5.7 and up.
[1]: Controllable with the ONLY_FULL_GROUP_BY SQL mode setting, enabled by default in MySQL 5.7.5 and up.

Related

How does set_group_by work in Django?

I was writing the following query:
claim_query = ClaimBillSum.objects.filter(claim__lob__in=lobObj)\
    .annotate(claim_count=Count("claim__claim_id", distinct=True))\
    .annotate(claim_bill_sum=Sum("bill_sum"))\
    .values("claim__body_part", "claim_count", "claim_bill_sum")\
    .order_by("claim__body_part")
When I checked the query property, it was grouped by all properties of the tables related in this query, not only the ones selected in the values() function, when I only wanted to group by claim__body_part.
As I searched for a way to change the group by instruction, I found the query.set_group_by() function, that when applied, fixed the query in the way I wanted:
claim_query.query.set_group_by()
SELECT
    "CLAIM"."body_part",
    COUNT(DISTINCT "claim_bill_sum"."claim_id") AS "claim_count",
    SUM("claim_bill_sum"."bill_sum") AS "claim_bill_sum"
FROM
    "claim_bill_sum"
INNER JOIN "CLAIM" ON
    ("claim_bill_sum"."claim_id" = "CLAIM"."claim_id")
WHERE
    "CLAIM"."lob_id" IN (SELECT U0."lob_id" FROM "LOB" U0 WHERE U0."client_id" = 1)
GROUP BY
    "CLAIM"."body_part"
ORDER BY
    "CLAIM"."body_part" ASC
But I couldn't find any information in the Django documentation or anywhere else that describes how this function works. Why does the default GROUP BY select all properties, and how does .set_group_by() work so that it selects exactly the property I wanted?

Read optimisation cassandra using python

I have a table with the following model:
CREATE TABLE IF NOT EXISTS {} (
    user_id bigint,
    pseudo text,
    importance float,
    is_friend_following bigint,
    is_friend boolean,
    is_following boolean,
    PRIMARY KEY ((user_id), is_friend_following)
);
I also have a table containing my seeds. Those (20) users are the starting point of my graph. So I select their ID and search in the table above to get their Followers and friends, and from there I build my graph (networkX).
def build_seed_graph(cls, name):
    obj = cls()
    obj.name = name
    query = "SELECT twitter_id FROM {0};"
    seeds = obj.session.execute(query.format(obj.seed_data_table))
    obj.graph.add_nodes_from(obj.seeds)
    for seed in seeds:
        query = "SELECT friend_follower_id, is_friend, is_follower FROM {0} WHERE user_id={1}"
        statement = SimpleStatement(query.format(obj.network_table, seed), fetch_size=1000)
        friend_ids = []
        follower_ids = []
        for row in obj.session.execute(statement):
            if row.friend_follower_id in obj.seeds:
                if row.is_friend:
                    friend_ids.append(row.friend_follower_id)
                if row.is_follower:
                    follower_ids.append(row.friend_follower_id)
        if friend_ids:
            for friend_id in friend_ids:
                obj.graph.add_edge(seed, friend_id)
        if follower_ids:
            for follower_id in follower_ids:
                obj.graph.add_edge(follower_id, seed)
    return obj
The problem is that the time it takes to build the graph is too long and I would like to optimize it.
I've got approximately 5 million rows in my table 'network_table'.
I'm wondering whether it would be faster to run a single query on the whole table instead of one query with a WHERE clause per seed. Will it fit in memory? Is that a good idea? Is there a better way?

I suspect the real issue may not be the queries but rather the processing time.
I'm wondering whether it would be faster to run a single query on the whole table instead of one query with a WHERE clause per seed. Will it fit in memory? Is that a good idea? Is there a better way?
There should not be any problem with doing a single query on the whole table if you enable paging (https://datastax.github.io/python-driver/query_paging.html - using fetch_size). Cassandra will return up to the fetch_size and will fetch additional results as you read them from the result_set.
Please note that if the table has many rows that are not related to a seed, a full scan may be slower, as you will receive rows that do not involve any seed.
Disclaimer - I am part of the team building ScyllaDB - a Cassandra compatible database.
ScyllaDB recently published a blog post on how to efficiently do a full scan in parallel, http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/, which applies to Cassandra as well. If a full scan is relevant and you can build the graph in parallel, then this may help you.

It seems like you can get rid of the last 2 if statements, since you're going through data that you already have looped through once:
def build_seed_graph(cls, name):
    obj = cls()
    obj.name = name
    query = "SELECT twitter_id FROM {0};"
    seeds = obj.session.execute(query.format(obj.seed_data_table))
    obj.graph.add_nodes_from(obj.seeds)
    for seed in seeds:
        query = "SELECT friend_follower_id, is_friend, is_follower FROM {0} WHERE user_id={1}"
        statement = SimpleStatement(query.format(obj.network_table, seed), fetch_size=1000)
        for row in obj.session.execute(statement):
            if row.friend_follower_id in obj.seeds:
                # Add each edge as soon as the row is read; no intermediate lists.
                if row.is_friend:
                    obj.graph.add_edge(seed, row.friend_follower_id)
                # Plain "if" rather than "elif": a row can be both friend and follower.
                if row.is_follower:
                    obj.graph.add_edge(row.friend_follower_id, seed)
    return obj
This also gets rid of many append operations on lists that you're not using, and should speed up this function.
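The single-pass idea can be tried without a Cassandra cluster. The sketch below is only an illustration: Row is a hypothetical stand-in for a driver result row, edges_for_seed is a made-up helper name, and the data is invented. It uses two independent if checks so that a row flagged as both friend and follower yields both edges, as the list-based version in the question does.

```python
from collections import namedtuple

# Hypothetical stand-in for a Cassandra result row.
Row = namedtuple("Row", ["friend_follower_id", "is_friend", "is_follower"])


def edges_for_seed(seed, rows, seed_ids):
    """Yield directed edges between `seed` and other seeds found in `rows`."""
    for row in rows:
        if row.friend_follower_id not in seed_ids:
            continue  # only edges between seeds are of interest
        if row.is_friend:
            yield (seed, row.friend_follower_id)
        if row.is_follower:
            yield (row.friend_follower_id, seed)


seed_ids = {1, 2, 3}  # a set keeps the membership test cheap
rows = [
    Row(2, True, False),
    Row(3, True, True),
    Row(99, True, True),  # not a seed, so it is skipped
]
edges = list(edges_for_seed(1, rows, seed_ids))
print(edges)
```

The resulting pairs could be passed to networkx in one call with graph.add_edges_from(edges).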

Querying objects using attribute of member of many-to-many

I have the following models:
class Member(models.Model):
    ref = models.CharField(max_length=200)
    # some other stuff
    def __str__(self):
        return self.ref
class Feature(models.Model):
    feature_id = models.BigIntegerField(default=0)
    members = models.ManyToManyField(Member)
    # some other stuff
A Member is basically just a pointer to a Feature. So let's say I have Features:
feature_id = 2, members = 1, 2
feature_id = 4
feature_id = 3
Then the members would be:
id = 1, ref = 4
id = 2, ref = 3
I want to find all of the Features which contain one or more Members from a list of "ok members." Currently my query looks like this:
# ndtmp is a query set of member-less Features which Members can point to
sids = [str(i) for i in list(ndtmp.values('feature_id'))]
# now make a query set that contains all rels and ways with at least one member with an id in sids
okmems = Member.objects.filter(ref__in=sids)
relsways = Feature.geoobjects.filter(members__in=okmems)
# now combine with nodes
op = relsways | ndtmp
This is enormously slow, and I'm not even sure if it's working. I've tried using print statements to debug, just to make sure anything is actually being parsed, and I get the following:
print(ndtmp.count())
>>> 12747
print(len(sids))
>>> 12747
print(okmems.count())
... and then the code just hangs for minutes, and eventually I quit it. I think that I just overcomplicated the query, but I'm not sure how best to simplify it. Should I:
Migrate Feature to use a CharField instead of a BigIntegerField? There is no real reason for me to use a BigIntegerField, I just did so because I was following a tutorial when I began this project. I tried a simple migration by just changing it in models.py and I got a "numeric" value in the column in PostgreSQL with format 'Decimal:( the id )', but there's probably some way around that that would force it to just shove the id into a string.
Use some feature of many-to-many fields which I don't know about to check for matches more efficiently
Calculate the bounding box of each Feature and store it in another column so that I don't have to do this calculation every time I query the database (so just the single fixed cost of calculation upon Migration + the cost of calculating whenever I add a new Feature or modify an existing one)?
Or something else? In case it helps, this is for a server-side script for an ongoing OpenStreetMap related project of mine, and you can see the work in progress here.
EDIT - I think a much faster way to get ndids is like this:
ndids = ndtmp.values_list('feature_id', flat=True)
This works, producing a non-empty set of ids.
Unfortunately, I am still at a loss as to how to get okmems. I tried:
okmems = Member.objects.filter(ref__in=str(ndids))
But it returns an empty query set. And I can confirm that the ref points are correct, via the following test:
Member.objects.values('ref')[:1]
>>> [{'ref': '2286047272'}]
Feature.objects.filter(feature_id='2286047272').values('feature_id')[:1]
>>> [{'feature_id': '2286047272'}]

You should take a look at annotate:
okmems = Member.objects.annotate(
    feat_count=models.Count('feature')).filter(feat_count__gte=1)
relsways = Feature.geoobjects.filter(members__in=okmems)

Ultimately, I was wrong to set up the database using a numeric id in one table and a text-type id in the other. I am not very familiar with migrations yet, but at some point I'll have to take a deep dive into that world and figure out how to migrate my database to use numerics on both. For now, this works:
# ndtmp is a query set of member-less Features which Members can point to
# get the unique ids from ndtmp as strings
strids = ndtmp.extra({'feature_id_str':"CAST( \
feature_id AS VARCHAR)"}).order_by( \
'-feature_id_str').values_list('feature_id_str',flat=True).distinct()
# find all members whose ref values can be found in strids
okmems = Member.objects.filter(ref__in=strids)
# find all features containing one or more members in the accepted members list
relsways = Feature.geoobjects.filter(members__in=okmems)
# combine that with my existing list of allowed member-less features
op = relsways | ndtmp
# prove that this set is not empty
op.count()
# takes about 10 seconds
>>> 8997148 # looks like it worked!
Basically, I am making a query set of feature_ids (numerics) and casting it to be a query set of text-type (varchar) field values. I am then using values_list to make it only contain these string id values, and then I am finding all of the members whose ref ids are in that list of allowed Features. Now I know which members are allowed, so I can filter out all the Features which contain one or more members in that allowed list. Finally, I combine this query set of allowed Features which contain members with ndtmp, my original query set of allowed Features which do not contain members.
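To see why the earlier attempts from the question matched nothing, it may help to look at what str() produces in each case. This is a plain-Python sketch with made-up data: the lists stand in for the query sets, no Django is involved, and the variable names are hypothetical.

```python
# Hypothetical stand-ins for the query-set results.
values_rows = [{"feature_id": 2286047272}, {"feature_id": 2286047273}]
flat_ids = [2286047272, 2286047273]
member_refs = ["2286047272", "555"]

# str() on each dict returned by values() gives the dict's repr, not the id.
from_dicts = [str(row) for row in values_rows]

# str() on the whole list gives one long string; iterating over it
# yields single characters rather than ids.
whole_list = str(flat_ids)
first_chars = list(whole_list)[:3]

# Converting each id on its own gives strings that can match Member.ref.
per_item = [str(i) for i in flat_ids]
matches = [ref for ref in member_refs if ref in per_item]

print(from_dicts[0])
print(whole_list, first_chars)
print(matches)
```

Only the per-item conversion yields values comparable to the text refs, which is what the CAST in the query above achieves on the database side.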

Filter query by linked object key in SQLAlchemy

Judging by the title this would be the exact same question, but I can't see how any of the answers are applicable to my use case:
I have two classes and a relationship between them:
treatment_association = Table('tr_association', Base.metadata,
    Column('chronic_treatments_id', Integer, ForeignKey('chronic_treatments.code')),
    Column('animals_id', Integer, ForeignKey('animals.id'))
)
class ChronicTreatment(Base):
    __tablename__ = "chronic_treatments"
    code = Column(String, primary_key=True)
class Animal(Base):
    __tablename__ = "animals"
    treatment = relationship("ChronicTreatment", secondary=treatment_association, backref="animals")
I would like to be able to select only the animals which have undergone a treatment which has the code "X". I tried quite a few approaches.
This one fails with an AttributeError:
sql_query = session.query(Animal.treatment).filter(Animal.treatment.code == "chrFlu")
for item in sql_query:
    pass
mystring = str(session.query(Animal))
And this one happily returns a list of unfiltered animals:
sql_query = session.query(Animal.treatment).filter(ChronicTreatment.code == "chrFlu")
for item in sql_query:
    pass
mystring = str(session.query(Animal))
The closest thing to the example from the aforementioned thread I could put together:
subq = session.query(Animal.id).subquery()
sql_query = session.query(ChronicTreatment).join((subq, subq.c.treatment_id=="chrFlu"))
for item in sql_query:
    pass
mystring = str(session.query(Animal))
mydf = pd.read_sql_query(mystring,engine)
Also fails with an AttributeError.
Can you help me sort this list?

First, there are two issues with table definitions:
1) In the treatment_association you have Integer column pointing to chronic_treatments.code while the code is String column.
I think it's just better to have an integer id in the chronic_treatments, so you don't duplicate the string code in another table and also have a chance to add more fields to chronic_treatments later.
Update: not exactly correct, you still can add more fields, but it will be more complex to change your 'code' if you decide to rename it.
2) In the Animal model you have a relation named treatment. This is confusing because you have many-to-many relation, it should be plural - treatments.
After fixing the above two, it should be clearer why your queries did not work.
This one (I replaced treatment with treatments):
sql_query = session.query(Animal.treatments).filter(
Animal.treatments.code == "chrFlu")
Animal.treatments represents a many-to-many relationship; it is not a SQLAlchemy model, so you can't pass it to the query nor use it in a filter like this.
The next one can't work for the same reason (you pass Animal.treatments into the query).
The last one is closer, you actually need join to get your results.
I think it is easier to understand the query as SQL (and you anyway need to know SQL to be able to use sqlalchemy):
animals = session.query(Animal).from_statement(text(
    """
    select distinct animals.* from animals
    left join tr_association assoc on assoc.animals_id = animals.id
    left join chronic_treatments on chronic_treatments.id = assoc.chronic_treatments_id
    where chronic_treatments.code = :code
    """)
).params(code='chrFlu')
It will select animals and join chronic_treatments through the tr_association and filter the result by code.
Having this it is easy to rewrite it using SQL-less syntax:
sql_query = session.query(Animal).join(Animal.treatments).filter(
ChronicTreatment.code == "chrFlu")
That will return what you want - a list of animals who have related chronic treatment with given code.
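To check the shape of that join without setting up the ORM, here is a small sketch using Python's standard sqlite3 module. It is an assumption-laden illustration: the rows are invented, and unlike the SQL above it keeps the string code as the key of chronic_treatments (as in the question) instead of adding an integer id.

```python
import sqlite3

# In-memory schema using the table names from the question; data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chronic_treatments (code TEXT PRIMARY KEY);
    CREATE TABLE animals (id INTEGER PRIMARY KEY);
    CREATE TABLE tr_association (
        chronic_treatments_id TEXT REFERENCES chronic_treatments(code),
        animals_id INTEGER REFERENCES animals(id)
    );
    INSERT INTO chronic_treatments (code) VALUES ('chrFlu'), ('chrOther');
    INSERT INTO animals (id) VALUES (1), (2), (3);
    INSERT INTO tr_association (chronic_treatments_id, animals_id) VALUES
        ('chrFlu', 1), ('chrOther', 2), ('chrFlu', 3), ('chrOther', 3);
""")

# Join animals to treatments through the association table, then filter by code.
sql = """
    SELECT DISTINCT animals.id
    FROM animals
    JOIN tr_association AS assoc ON assoc.animals_id = animals.id
    JOIN chronic_treatments
        ON chronic_treatments.code = assoc.chronic_treatments_id
    WHERE chronic_treatments.code = :code
    ORDER BY animals.id
"""
animal_ids = [row[0] for row in conn.execute(sql, {"code": "chrFlu"})]
print(animal_ids)
```

Animal 2 is excluded because it only has the other treatment, while animal 3 appears once even though it has two treatments.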

'Don't care' for a column in SQLite queries?

I've got a SQLite query, which depends on 2 variables, gender and hand. Each of these can have 3 values, 2 which actually mean something (so male/female and left/right) and the third is 'all'. If a variable has a value of 'all' then I don't care what the particular value of that column is.
Is it possible to achieve this functionality with a single query, and just changing the variable? I've had a look for a wildcard or don't care operator but haven't been able to find any except for % which doesn't work in this situation.
Obviously I could make a bunch of if statements and have different queries to use for each case but that's not very elegant.
Code:
select_sql = """ SELECT * FROM table
WHERE (gender = ? AND hand = ?)
"""
cursor.execute(select_sql, (gender_var, hand_var))
I.e. this query works if gender_var = 'male' and hand_var = 'left', but not if gender_var or hand_var is 'all'.

You can indeed do this with a single query. Simply compare each variable to 'all' in your query.
select_sql = """ SELECT * FROM table
WHERE ((? = 'all' OR gender = ?) AND (? = 'all' OR hand = ?))
"""
cursor.execute(select_sql, (gender_var, gender_var, hand_var, hand_var))
Basically, when gender_var or hand_var is 'all', the first part of each OR expression is always true, so that branch of the AND is always true and matches all records, i.e., it is a no-op in the query.
It might be better to build a query dynamically in Python, however, that has just the fields you actually need to test. It might be noticeably faster, but you'd have to benchmark that to be sure.
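A minimal sketch of that dynamic approach follows, assuming a hypothetical people table and a made-up helper named build_select. Column names come from a fixed tuple in the code, and the values are still passed as bound parameters, so nothing user-supplied is interpolated into the SQL text.

```python
import sqlite3


def build_select(filters):
    """Return (sql, params) that test only the columns whose value is not 'all'."""
    allowed = ("gender", "hand")  # fixed whitelist of filterable columns
    clauses, params = [], []
    for column in allowed:
        value = filters.get(column, "all")
        if value != "all":
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT name FROM people"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY name", params


# Hypothetical data to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, gender TEXT, hand TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("ann", "female", "left"), ("bob", "male", "left"), ("cy", "male", "right")],
)

sql, params = build_select({"gender": "male", "hand": "all"})
males = [row[0] for row in conn.execute(sql, params)]

sql_all, params_all = build_select({"gender": "all", "hand": "all"})
everyone = [row[0] for row in conn.execute(sql_all, params_all)]

print(sql, params, males)
print(sql_all, params_all, everyone)
```

Whether this is measurably faster than the single query with the 'all' comparisons would still need benchmarking on real data.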
