I've been implementing Superset at work, and I like it so far. However, I have a table like this:
name,age,gender
John,42,M
Sally,38,F
Patricia,27,F
Steven,29,M
Amanda,51,F
I want to define a new metric against each name, counting the number of people who are younger. My data is in a MySQL database, and I suppose that for one person, I could write the query thus:
SELECT COUNT(DISTINCT name) from users where users.age <= 42;
for, say, John's row. So, how do I do this continuously for the entire table?
Your query could look something like
select *,
(select count(distinct all_users.name) from users all_users where all_users.age <= users.age)
FROM users
To shadow's point - this would get quite expensive to run on a large dataset.
If that were the case, you'd probably want to try putting an index on age, or denormalize that count altogether - the tradeoff being that inserts would become slower.
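As a rough illustration, the correlated subquery can be tried against the sample rows from the question. This is only a sketch: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, and the index name idx_users_age and the column alias n_not_older are made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        ("John", 42, "M"),
        ("Sally", 38, "F"),
        ("Patricia", 27, "F"),
        ("Steven", 29, "M"),
        ("Amanda", 51, "F"),
    ],
)
# Hypothetical index name; this is the index on age suggested for larger tables.
conn.execute("CREATE INDEX idx_users_age ON users (age)")

# One count per row: how many distinct names have an age <= this row's age.
rows = conn.execute(
    """
    SELECT u.name,
           (SELECT COUNT(DISTINCT a.name)
            FROM users AS a
            WHERE a.age <= u.age) AS n_not_older
    FROM users AS u
    ORDER BY u.age
    """
).fetchall()

counts = dict(rows)
print(counts)
```

Note that with <= each person is counted in their own total; switching the comparison to < would count only strictly younger people.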
I am writing a Python script that will be run regularly in a production environment where efficiency is key.
Below is an anonymized query that I have which pulls sales data for 3,000 different items.
I think I am getting slower results querying for all of them at once. When I try querying for different sizes, the amount of time it takes varies inconsistently (likely due to my internet connection). For example, sometimes querying for 1000 items 3 times is faster than all 3000 at once. However, running the same test 5 minutes later gets me different results. It is a production database where performance may be dependent on current traffic. I am not a database administrator but work in data science, using mostly similar select queries (I do the rest in Python).
Is there a best practice here? Some sort of logic that determines how many items to put in the WHERE IN clause?
date_min = pd.to_datetime('2021-11-01')
date_max = pd.to_datetime('2022-01-31')
sql = f"""
SELECT
product_code,
sales_date,
n_sold,
revenue
FROM
sales_daily
WHERE
product_code IN {tuple(item_list)}
and sales_date >= DATE('{date_min}')
and sales_date <= DATE('{date_max}')
ORDER BY
sales_date DESC, revenue
"""
df_act = pd.read_sql(sql, di.db_engine)
df_act
If your sales_date column is indexed in the database, I think using a function in the where clause (DATE) might cause the plan to not use that index. I believe you will have better luck if you concatenate date_min and date_max as strings (YYYY-MM-DD) into the SQL string and get rid of the function. Also, use BETWEEN...AND rather than >= ... AND ... <=.
As for IN with 1000 items, I strongly recommend you don't do that. Create a single-column temp table of those values, index the item column, then join it to product_code.
Generally, something like this:
DROP TABLE IF EXISTS _item_list;
CREATE TEMP TABLE _item_list
AS
SELECT item
FROM VALUES (etc) t(item);
CREATE INDEX idx_items ON _item_list (item);
SELECT
product_code,
sales_date,
n_sold,
revenue
FROM
sales_daily x
INNER JOIN _item_list y ON x.product_code = y.item
WHERE
sales_date BETWEEN '{date_min}' AND '{date_max}'
ORDER BY
sales_date DESC, revenue
As an addendum, try to have the items in the item list in the same order as the index on the product_code.
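From the Python side, the temp-table approach might look roughly like the sketch below. It assumes an in-memory sqlite3 database in place of the production engine, invented product codes and sales rows, and bound parameters for the dates instead of string formatting; the exact temp-table syntax will differ between database engines.

```python
import sqlite3

item_list = ["A1", "B2"]  # hypothetical product codes
date_min, date_max = "2021-11-01", "2022-01-31"

# In-memory SQLite stands in for the production database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_daily "
    "(product_code TEXT, sales_date TEXT, n_sold INTEGER, revenue REAL)"
)
conn.executemany(
    "INSERT INTO sales_daily VALUES (?, ?, ?, ?)",
    [
        ("A1", "2021-11-05", 3, 30.0),
        ("B2", "2021-12-01", 1, 12.5),
        ("C3", "2021-12-02", 9, 90.0),  # not in item_list
        ("A1", "2022-03-01", 2, 20.0),  # outside the date range
    ],
)

# Single-column temp table; the PRIMARY KEY doubles as the index on item.
conn.execute("CREATE TEMP TABLE _item_list (item TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO _item_list VALUES (?)", [(code,) for code in item_list]
)

# Join against the temp table instead of building a long IN (...) list.
result = conn.execute(
    """
    SELECT x.product_code, x.sales_date, x.n_sold, x.revenue
    FROM sales_daily AS x
    INNER JOIN _item_list AS y ON x.product_code = y.item
    WHERE x.sales_date BETWEEN ? AND ?
    ORDER BY x.sales_date DESC, x.revenue
    """,
    (date_min, date_max),
).fetchall()
print(result)
```

Binding the dates as parameters also avoids formatting values into the SQL string by hand.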
I've got a weekly process which does a full replace operation on a few tables. The process is weekly since there are large amounts of data as a whole. However, we also want to do daily/hourly delta updates, so the system would be more in sync with production.
When we update data, we create duplicate rows (updated versions of existing rows), which I want to get rid of. To achieve this, I've written a Python script which runs the following query on a table, inserting the results back into it:
QUERY = """#standardSQL
select {fields}
from (
select *
, max(record_insert_time) over (partition by id) as max_record_insert_time
from {client_name}_{environment}.{table} as a
)
where 1=1
and record_insert_time = max_record_insert_time"""
The {fields} variable is replaced with a list of all the table columns; I can't use * here because that would only work for 1 run (the next will already have a field called max_record_insert_time and that would cause an ambiguity issue).
Everything is working as expected, with one exception - some of the columns in the table are of RECORD datatype; despite not using aliases for them, and selecting their fully qualified name (e.g. record_name.child_name), when the output is written back into the table, the results are flattened. I've added the flattenResults: False config to my code, but this has not changed the outcome.
I would love to hear thoughts about how to resolve this issue using my existing plan, other methods of deduping, or other methods of handling delta updates altogether.
Perhaps you can use in the outer statement
SELECT * EXCEPT (max_record_insert_time)
This should keep the exact record structure. (for more detailed documentation see https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select-except)
An alternative approach would be to include only top-level columns in {fields}, even if they are not leaves, i.e. just record_name and not record_name.*
The answer below is definitely not better than the straightforward SELECT * EXCEPT modifier, but I wanted to present an alternative version:
SELECT t.*
FROM (
SELECT
id, MAX(record_insert_time) AS max_record_insert_time,
ARRAY_AGG(t) AS all_records_for_id
FROM yourTable AS t GROUP BY id
), UNNEST(all_records_for_id) AS t
WHERE t.record_insert_time = max_record_insert_time
ORDER BY id
The query above first groups all records for each id into an array of the respective rows, along with the max value of record_insert_time. Then, for each id, it flattens all the (previously aggregated) rows and keeps only those whose record_insert_time matches the max time. The result is as expected. No analytic function is involved, just simple aggregation, at the cost of an extra UNNEST ...
Still - at least different option :o)
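To make the group-then-flatten logic easier to follow, here is a small in-memory Python imitation of it. It is not BigQuery code: the rows, the nested address field, and the integer timestamps are all invented for illustration, and dictionaries stand in for RECORD columns.

```python
from collections import defaultdict

# Hypothetical rows; "address" plays the role of a nested RECORD column.
rows = [
    {"id": 1, "record_insert_time": 10, "address": {"city": "Oslo"}},
    {"id": 1, "record_insert_time": 20, "address": {"city": "Bergen"}},
    {"id": 2, "record_insert_time": 15, "address": {"city": "Lima"}},
]

# Step 1: collect all rows per id (the ARRAY_AGG ... GROUP BY id step).
grouped = defaultdict(list)
for row in rows:
    grouped[row["id"]].append(row)

# Step 2: flatten each group again and keep only the rows whose insert
# time equals the group's maximum (the UNNEST + WHERE step).
deduped = []
for record_id in sorted(grouped):
    latest = max(r["record_insert_time"] for r in grouped[record_id])
    deduped.extend(
        r for r in grouped[record_id] if r["record_insert_time"] == latest
    )

print(deduped)
```

Because whole rows are carried through unchanged, the nested structure survives, which is the property the question is after.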
I have a cronjob (J1) which calculates the product category preference of ~1M customers every night. Most customers' preferences are stable, but there are exceptions, and there are new customers every day. I want to track these changes by setting a "diff" bit to 1. Then another cronjob (J2) can act on such customers (e.g. send them a notification) and set the bit back to 0.
The table looks like:
CREATE TABLE customers (
customer_id VARCHAR(255),
preference VARCHAR(255),
diff TINYINT(1),
PRIMARY KEY (customer_id),
KEY (diff)
);
AFAIK, INSERT .. ON DUPLICATE KEY doesn't know whether a non-key value is different. So you can't use something similar to the following, right?
INSERT customers AS ("sql for J1") ON DUPLICATE KEY
_AND_PREFERENCE_DIFFERS_ SET diff=1;
So what's the best way to do it?
a) Rename table customers to customer_yesterday. Create a new table customers by running J1. LEFT JOIN the two tables and set the diff bit of customers. (Pros: faster? Cons: need to handle all diffs correctly, e.g. cases when a customer isn't present in today's output)
b) Loop through output of J1 (using python mysql connector), query customer by customer_id, and insert only when value is different or it's a new customer. (Pros: easy to understand logic; Cons: slow?)
Any better solutions?
Update:
As @Barmar asked, let's say sql for J1 is a transaction grouping sql, e.g.
SELECT
customer_id,
GROUP_CONCAT(DISTINCT product_category SEPARATOR ',')
FROM transaction
WHERE date between _30_days_ago_ and _today_;
Make SQL for J1 a query that uses a LEFT JOIN to filter out customers whose preference hasn't changed.
INSERT INTO customers (customer_id, preference)
SELECT t1.*
FROM (
SELECT customer_id,
GROUP_CONCAT(DISTINCT product_category ORDER BY product_category SEPARATOR ',') AS preference
FROM transaction
WHERE date BETWEEN _30_days_ago_ AND _today_
-- one concatenated preference row per customer
GROUP BY customer_id) AS t1
LEFT JOIN customers AS c ON t1.customer_id = c.customer_id AND t1.preference = c.preference
-- keep only rows with no identical match in customers (new or changed)
WHERE c.customer_id IS NULL
ON DUPLICATE KEY UPDATE preference = VALUES(preference), diff = 1
I've added an ORDER BY option to GROUP_CONCAT so that it will always return the categories in a consistent order. Otherwise, it may result in false positives when the order changes.
I feel obliged to point out that storing comma-separated values in a table column is generally poor database design. You should use a many-to-many relationship table instead.
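The anti-join idea can be tried out on a toy dataset. The sketch below uses sqlite3 in memory rather than MySQL, so the upsert is done with INSERT OR REPLACE instead of ON DUPLICATE KEY UPDATE. The todays_preferences table is a hypothetical stand-in for the output of J1, and unlike the statement above it also sets diff = 1 for brand-new customers.

```python
import sqlite3

# In-memory SQLite stands in for MySQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id TEXT PRIMARY KEY,
        preference  TEXT,
        diff        INTEGER DEFAULT 0
    );
    -- Hypothetical table holding tonight's J1 output.
    CREATE TABLE todays_preferences (
        customer_id TEXT PRIMARY KEY,
        preference  TEXT
    );
    INSERT INTO customers VALUES ('c1', 'books', 0), ('c2', 'games', 0);
    INSERT INTO todays_preferences VALUES
        ('c1', 'books'),        -- unchanged
        ('c2', 'games,toys'),   -- changed
        ('c3', 'music');        -- new customer
    """
)

# Anti-join: rows from today's output with no identical row in customers.
changed = conn.execute(
    """
    SELECT t.customer_id, t.preference
    FROM todays_preferences AS t
    LEFT JOIN customers AS c
           ON t.customer_id = c.customer_id
          AND t.preference = c.preference
    WHERE c.customer_id IS NULL
    ORDER BY t.customer_id
    """
).fetchall()

# Write only the new or changed rows back, flagging them with diff = 1.
conn.executemany(
    "INSERT OR REPLACE INTO customers (customer_id, preference, diff) "
    "VALUES (?, ?, 1)",
    changed,
)

flagged = conn.execute(
    "SELECT customer_id, preference FROM customers "
    "WHERE diff = 1 ORDER BY customer_id"
).fetchall()
print(flagged)
```

Only the changed and new customers are touched, so the unchanged majority never generates a write.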
I have a general ledger table in my DB with the columns: member_id, is_credit and amount. I want to get the current balance of the member.
Ideally that can be done with two queries, where the first query has is_credit == True and the second has is_credit == False, something close to:
credit_amount = session.query(func.sum(Funds.amount).label('Credit_Amount')).filter(Funds.member_id==member_id, Funds.is_credit==True).scalar()
debit_amount = session.query(func.sum(Funds.amount).label('Debit_Amount')).filter(Funds.member_id==member_id, Funds.is_credit==False).scalar()
balance = credit_amount - debit_amount
and then subtract the result. Is there a way to have the above run in one query to give the balance?
From the comments you state that hybrids are too advanced right now, so I will propose an easier but less efficient solution (it is still okay):
(session.query(Funds.is_credit, func.sum(Funds.amount).label('Debit_Amount')).
filter(Funds.member_id==member_id).group_by(Funds.is_credit))
What will this do? You will receive a two-row result: one row has the credit total, the other the debit total, distinguished by the is_credit value of each row. The second column (Debit_Amount) holds the summed value. You then subtract one from the other to get the balance, using only one query that fetches both values.
If you are unsure what group_by does, I recommend you read up on SQL before doing it in SQLAlchemy. SQLAlchemy offers very easy usage of SQL but it requires that you understand SQL as well. Thus, I recommend: First build a query in SQL and see that it does what you want - then translate it to SQLAlchemy and see that it does the same. Otherwise SQLAlchemy will often generate highly inefficient queries, because you asked for the wrong thing.
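In the spirit of trying the SQL first, the grouped query and the final subtraction can be sketched with plain sqlite3 rather than SQLAlchemy. The funds table name and the sample ledger rows are assumptions made for this example.

```python
import sqlite3

# In-memory SQLite with a hypothetical funds table and invented rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funds (member_id INTEGER, is_credit INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO funds VALUES (?, ?, ?)",
    [
        (1, 1, 100.0),  # credit
        (1, 0, 30.0),   # debit
        (1, 1, 50.0),   # credit
        (2, 1, 999.0),  # another member, must be ignored
    ],
)

member_id = 1
# One query, at most two rows: (is_credit, total) for debits and credits.
totals = dict(
    conn.execute(
        "SELECT is_credit, SUM(amount) FROM funds "
        "WHERE member_id = ? GROUP BY is_credit",
        (member_id,),
    ).fetchall()
)

# A member may have no credits or no debits, so default each side to 0.
balance = totals.get(1, 0) - totals.get(0, 0)
print(balance)
```

The .get defaults matter in practice: a member with only credits returns a single row, and indexing the missing side directly would fail.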
Using SQLAlchemy, I have a one to many relation with two tables - users and scores. I am trying to query the top 10 users sorted by their aggregate score over the past X amount of days.
users:
id
user_name
score
scores:
user
score_amount
created
My current query is:
top_users = DBSession.query(User).options(eagerload('scores')).filter_by(User.scores.created > somedate).order_by(func.sum(User.scores).desc()).all()
I know this is clearly not correct, it's just my best guess. However, after looking at the documentation and googling I cannot find an answer.
EDIT:
Perhaps it would help if I sketched what the MySQL query would look like:
SELECT user.*, SUM(scores.amount) as score_increase
FROM user LEFT JOIN scores ON scores.user_id = user.user_id
WITH scores.created_at > someday
ORDER BY score_increase DESC
The single-joined-row way, with a group_by added in for all user columns (although MySQL will let you group on just the "id" column if you choose):
sess.query(User, func.sum(Score.amount).label('score_increase')).\
join(User.scores).\
filter(Score.created_at > someday).\
group_by(User).\
order_by(func.sum(Score.amount).desc())
Or if you just want the users in the result:
sess.query(User).\
join(User.scores).\
filter(Score.created_at > someday).\
group_by(User).\
order_by(func.sum(Score.amount).desc())
The above two have an inefficiency in that you're grouping on all columns of "user" (or you're using MySQL's "group on only a few columns" thing, which is MySQL only). To minimize that, the subquery approach:
subq = sess.query(Score.user_id, func.sum(Score.amount).label('score_increase')).\
filter(Score.created_at > someday).\
group_by(Score.user_id).subquery()
sess.query(User).join((subq, subq.c.user_id==User.user_id)).order_by(subq.c.score_increase.desc())
An example of the identical scenario is in the ORM tutorial at: http://docs.sqlalchemy.org/en/latest/orm/tutorial.html#selecting-entities-from-subqueries
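For reference, the SQL that the subquery approach is aiming for can be checked on toy data. This sketch uses sqlite3 in memory instead of MySQL and SQLAlchemy; the table contents and dates are invented, and the column names follow the SQL sketch in the question (user_id, amount, created_at).

```python
import sqlite3

# In-memory SQLite with invented users and scores.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE scores (user_id INTEGER, amount INTEGER, created_at TEXT);
    INSERT INTO user VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO scores VALUES
        (1, 5,  '2024-01-10'),
        (1, 7,  '2024-01-12'),
        (2, 20, '2024-01-11'),
        (2, 50, '2023-12-01'),  -- before the cutoff, must be ignored
        (3, 1,  '2024-01-15');
    """
)

someday = "2024-01-01"
# Aggregate per user in a subquery, then join the totals to the user table.
top_users = conn.execute(
    """
    SELECT u.user_name, s.score_increase
    FROM user AS u
    JOIN (SELECT user_id, SUM(amount) AS score_increase
          FROM scores
          WHERE created_at > ?
          GROUP BY user_id) AS s
      ON s.user_id = u.user_id
    ORDER BY s.score_increase DESC
    LIMIT 10
    """,
    (someday,),
).fetchall()
print(top_users)
```

Grouping happens only on scores.user_id inside the subquery, so the outer query never needs to group on every user column.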
You will need to use a subquery in order to compute the aggregate score for each user. Subqueries are described here: http://www.sqlalchemy.org/docs/05/ormtutorial.html?highlight=subquery#using-subqueries
I am assuming the column (not the relation) you're using for the join is called Score.user_id, so change it if this is not the case.
You will need to do something like this:
DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
However this will result in tuples of (user_id, total_score). I'm not sure if the computed score is actually important to you, but if it is, you will probably want to do something like this:
users_scores = []
q = DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
for user_id, total_score in q:
user = DBSession.query(User).get(user_id)
users_scores.append((user, total_score))
This will result in 11 queries being executed, however. It is possible to do it all in a single query, but due to various limitations in SQLAlchemy, it will likely create a very ugly multi-join query or subquery (dependent on engine) and it won't be very performant.
If you plan on doing something like this often and you have a large amount of scores, consider denormalizing the current score onto the user table. It's more work to upkeep, but will result in a single non-join query like:
DBSession.query(User).order_by(User.computed_score.desc())
Hope that helps.