I've got a weekly process that does a full replace operation on a few tables. The process runs weekly because the overall data volume is large. However, we also want daily/hourly delta updates, so the system stays more closely in sync with production.
When we update data, we create duplicate rows (new versions of existing rows), which I want to get rid of. To achieve this, I've written a Python script that runs the following query on a table and inserts the results back into it:
QUERY = """#standardSQL
select {fields}
from (
select *
, max(record_insert_time) over (partition by id) as max_record_insert_time
from {client_name}_{environment}.{table} as a
)
where 1=1
and record_insert_time = max_record_insert_time"""
The {fields} variable is replaced with a list of all the table's columns; I can't use * here because that would only work for the first run (on the next run the table would already contain a field called max_record_insert_time, causing an ambiguity error).
Everything works as expected, with one exception: some of the columns in the table are of the RECORD data type. Despite not aliasing them and selecting them by fully qualified name (e.g. record_name.child_name), the results are flattened when the output is written back into the table. I've added the flattenResults: False config to my code, but this has not changed the outcome.
I would love to hear thoughts about how to resolve this issue using my existing plan, other methods of deduping, or other methods of handling delta updates altogether.
Perhaps, in the outer statement, you can use
SELECT * EXCEPT (max_record_insert_time)
This should keep the exact record structure. (For more detailed documentation, see https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select-except)
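A quick sketch of how the full dedup query could look with SELECT * EXCEPT, so the {fields} list is no longer needed (the dataset and table names below are stand-ins for the question's placeholders, not real resources):

```python
# Build the dedup query using SELECT * EXCEPT instead of an explicit
# column list; the window-function alias is dropped before the insert.
def build_dedup_query(client_name, environment, table):
    return f"""#standardSQL
select * except (max_record_insert_time)
from (
  select *
    , max(record_insert_time) over (partition by id) as max_record_insert_time
  from {client_name}_{environment}.{table}
)
where record_insert_time = max_record_insert_time"""

query = build_dedup_query("acme", "prod", "events")
print(query)
```

Since no alias survives the outer select, the query stays safe to re-run on its own output.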
An alternative approach would be to include only top-level columns in {fields}, even when they are not leaves, i.e. just record_name rather than record_name.*
The answer below is definitely not better than the straightforward SELECT * EXCEPT modifier, but I wanted to present an alternative version:
SELECT t.*
FROM (
SELECT
id, MAX(record_insert_time) AS max_record_insert_time,
ARRAY_AGG(t) AS all_records_for_id
FROM yourTable AS t GROUP BY id
), UNNEST(all_records_for_id) AS t
WHERE t.record_insert_time = max_record_insert_time
ORDER BY id
What the query above does: first, it groups all records for each id into an array of the respective rows, along with the max value of insert_time. Then, for each id, it flattens the (previously aggregated) rows and keeps only those whose insert_time matches the max. The result is as expected. No analytic function is involved, just simple aggregation, though with the extra use of UNNEST ...
Still, at least it's a different option :o)
Related
I am writing a Python script that will be run regularly in a production environment where efficiency is key.
Below is an anonymized query that I have which pulls sales data for 3,000 different items.
I think I am getting slower results querying for all of them at once. When I test different batch sizes, the time it takes varies inconsistently (likely due to my internet connection). For example, sometimes querying for 1,000 items three times is faster than all 3,000 at once, but running the same test five minutes later gives different results. It is a production database where performance may depend on current traffic. I am not a database administrator; I work in data science, mostly using similar SELECT queries (I do the rest in Python).
Is there a best practice here? Some sort of logic that determines how many items to put in the WHERE IN clause?
date_min = pd.to_datetime('2021-11-01')
date_max = pd.to_datetime('2022-01-31')
sql = f"""
SELECT
product_code,
sales_date,
n_sold,
revenue
FROM
sales_daily
WHERE
product_code IN {tuple(item_list)}
and sales_date >= DATE('{date_min}')
and sales_date <= DATE('{date_max}')
ORDER BY
sales_date DESC, revenue
"""
df_act = pd.read_sql(sql, di.db_engine)
df_act
If your sales_date column is indexed in the database, using a function (DATE) in the WHERE clause might prevent the planner from using that index. I believe you will have better luck if you interpolate date_min and date_max as plain YYYY-MM-DD strings into the SQL and drop the function. Also, use BETWEEN ... AND rather than >= ... AND ... <=.
As for IN with 1,000 items, I strongly recommend against it. Create a single-column temp table of those values, index the column, then join it to product_code.
Generally, something like this:
DROP TABLE IF EXISTS _item_list;
CREATE TEMP TABLE _item_list
AS
SELECT item
FROM VALUES (etc) t(item);
CREATE INDEX idx_items ON _item_list (item);
SELECT
product_code,
sales_date,
n_sold,
revenue
FROM
sales_daily x
INNER JOIN _item_list y ON x.product_code = y.item
WHERE
sales_date BETWEEN '{date_min}' AND '{date_max}'
ORDER BY
sales_date DESC, revenue
As an addendum, try to have the items in the item list in the same order as the index on product_code.
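A minimal sketch of the temp-table-plus-join pattern, using sqlite3 as a stand-in for the production database (table and column names are taken from the question; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_daily (product_code TEXT, sales_date TEXT, n_sold INT, revenue REAL)")
conn.executemany("INSERT INTO sales_daily VALUES (?,?,?,?)", [
    ("A1", "2021-12-01", 3, 30.0),
    ("B2", "2021-12-02", 5, 50.0),
    ("C3", "2021-12-03", 1, 10.0),  # not in the item list, should be excluded
])

item_list = ["A1", "B2"]

# Load the wanted items into an indexed temp table, then join to it
# instead of building a huge IN (...) clause.
conn.execute("DROP TABLE IF EXISTS _item_list")
conn.execute("CREATE TEMP TABLE _item_list (item TEXT)")
conn.executemany("INSERT INTO _item_list VALUES (?)", [(i,) for i in item_list])
conn.execute("CREATE INDEX idx_items ON _item_list (item)")

rows = conn.execute("""
    SELECT product_code, sales_date, n_sold, revenue
    FROM sales_daily x
    INNER JOIN _item_list y ON x.product_code = y.item
    WHERE sales_date BETWEEN ? AND ?
    ORDER BY sales_date DESC, revenue
""", ("2021-11-01", "2022-01-31")).fetchall()
print(rows)  # only the A1 and B2 rows
```

Note the date bounds are bound as plain strings, matching the advice above about avoiding a function around the indexed column.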
Hello StackEx community.
I am implementing a relational database using SQLite interfaced with Python. My table consists of 5 attributes with around a million tuples.
To avoid a large number of database queries, I wish to execute a single query that updates 2 attributes of multiple tuples. These updated values depend on each tuple's primary key value, and so are different for each tuple.
I am trying something like the following in Python 2.7:
stmt= 'UPDATE Users SET Userid (?,?), Neighbours (?,?) WHERE Username IN (?,?)'
cursor.execute(stmt, [(_id1, _Ngbr1, _name1), (_id2, _Ngbr2, _name2)])
In other words, I am trying to update the rows with primary keys _name1 and _name2, substituting the corresponding values into the Neighbours and Userid columns. Executing the statement raises the following error:
OperationalError: near "(": syntax error
I am reluctant to use executemany() because I want to reduce the number of trips across the database.
I have been struggling with this for a couple of hours now, but couldn't figure out the error or find an alternative on the web. Please help.
Thanks in advance.
If the column used to look up the rows to update is properly indexed, then executing multiple UPDATE statements is likely to be more efficient than a single statement, because with a single statement the database would probably need to scan all rows.
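The point above can be sketched with the DB-API's executemany() against sqlite3 (data is made up): one prepared UPDATE run for many rows still executes in a single transaction, so the round-trip cost is essentially one commit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Username TEXT PRIMARY KEY, Userid INT, Neighbours TEXT)")
conn.executemany("INSERT INTO Users VALUES (?,?,?)",
                 [("alice", 0, ""), ("bob", 0, "")])

# One parameterized UPDATE executed for many rows; the PRIMARY KEY index
# on Username makes each lookup cheap, and the `with` block wraps
# everything in a single transaction.
updates = [(10, "bob", "alice"), (20, "alice", "bob")]
with conn:
    conn.executemany(
        "UPDATE Users SET Userid = ?, Neighbours = ? WHERE Username = ?",
        updates)

print(conn.execute("SELECT Username, Userid FROM Users ORDER BY Username").fetchall())
# [('alice', 10), ('bob', 20)]
```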
Anyway, if you really want to do this, you can use CASE expressions (and explicitly numbered parameters, to avoid duplicates):
UPDATE Users
SET Userid = CASE Username
WHEN ?5 THEN ?1
WHEN ?6 THEN ?2
END,
Neighbours = CASE Username
WHEN ?5 THEN ?3
WHEN ?6 THEN ?4
END
WHERE Username IN (?5, ?6);
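For completeness, here is that statement driven from Python's sqlite3 module, which passes a flat parameter tuple through to SQLite's numbered ?N placeholders (the values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Username TEXT PRIMARY KEY, Userid INT, Neighbours TEXT)")
conn.executemany("INSERT INTO Users VALUES (?,?,?)",
                 [("alice", 0, "old"), ("bob", 0, "old")])

# Parameter order: (_id1, _id2, _Ngbr1, _Ngbr2, _name1, _name2);
# each ?N placeholder binds to the Nth element of the tuple.
stmt = """UPDATE Users
          SET Userid = CASE Username WHEN ?5 THEN ?1 WHEN ?6 THEN ?2 END,
              Neighbours = CASE Username WHEN ?5 THEN ?3 WHEN ?6 THEN ?4 END
          WHERE Username IN (?5, ?6)"""
conn.execute(stmt, (10, 20, "n1", "n2", "alice", "bob"))

print(conn.execute(
    "SELECT Username, Userid, Neighbours FROM Users ORDER BY Username").fetchall())
# [('alice', 10, 'n1'), ('bob', 20, 'n2')]
```

Note there is no comma between the second END and WHERE; that stray comma is exactly what produces a syntax error.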
In our system, we have 1,000+ tables, each of which has a 'date' column containing a DateTime. I want to get a list of every date that exists across all of the tables. I'm sure there should be an easy way to do this, but I have very limited knowledge of both PostgreSQL and SQLAlchemy.
In PostgreSQL, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in Python with SQLAlchemy. For each table, I created a SELECT DISTINCT on the 'date' column, then set that list of selects as the selects property of a CompoundSelect object and executed it. As one might expect from an ugly brute-force query, it has been running for an hour or so now, and I am unsure whether it has silently broken somewhere and will never return.
Is there a clean and better way to do this?
You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
tbl name;
BEGIN
FOR tbl IN SELECT 'public.' || tablename FROM pg_tables WHERE schemaname = 'public' LOOP
RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
END LOOP;
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables live in multiple schemas, add logic for where the tables are stored, or make the schema name a parameter of the function, call the function once per schema, and UNION the results.
Note that you may get duplicate dates from multiple tables. You can weed out these duplicates in the statement calling the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function creates a result set in memory, but if the number of distinct dates in the rows in the 1,000+ tables is very large, the results will be written to disk. If you expect this to happen, then you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.
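The same catalog-driven idea can be sketched portably. Below is a sqlite3 analog that uses sqlite_master in place of pg_tables, just to illustrate the loop the plpgsql function performs (table names and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name, dates in [("t1", ["2020-01-01", "2020-01-02"]),
                    ("t2", ["2020-01-02", "2020-01-03"])]:
    conn.execute(f"CREATE TABLE {name} (date TEXT)")
    conn.executemany(f"INSERT INTO {name} VALUES (?)", [(d,) for d in dates])

# Read the table names from the system catalog, then query each table --
# the sqlite analog of looping over pg_tables inside the function.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
all_dates = set()
for tbl in tables:
    all_dates.update(d for (d,) in conn.execute(f"SELECT DISTINCT date FROM {tbl}"))

print(sorted(all_dates))  # ['2020-01-01', '2020-01-02', '2020-01-03']
```

Doing this loop in PostgreSQL itself (as in the function above) avoids one round trip per table.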
Ended up reverting back to a previous solution of using SqlAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things about the dataset that helped with this query: I only wanted the distinct dates from each table, and the dates were the PK in my tables. I ended up using the approach from this wiki page. The code sent in the query looked like the following:
WITH RECURSIVE t AS (
  (SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
  UNION ALL
  SELECT (SELECT date FROM schema.tablename WHERE date > t.date ORDER BY date LIMIT 1)
  FROM t WHERE t.date IS NOT NULL)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of that query into a list, adding dates that weren't already in it, then saved that for use later. It possibly takes just as long as running it all in the psql console, but it was easier for me to save the results locally than to query the temp table in the db.
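The loose-index-scan idea behind that recursive CTE can also be driven from application code: repeatedly ask for the smallest date greater than the last one seen. A sqlite3 sketch with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (date TEXT)")
conn.execute("CREATE INDEX idx_tbl_date ON tbl (date)")
conn.executemany("INSERT INTO tbl VALUES (?)",
                 [("2020-01-02",), ("2020-01-01",), ("2020-01-02",), ("2020-01-03",)])

# Each iteration is one cheap indexed probe per *distinct* value,
# instead of a full scan over every row.
dates, last = [], None
while True:
    if last is None:
        (nxt,) = conn.execute("SELECT min(date) FROM tbl").fetchone()
    else:
        (nxt,) = conn.execute(
            "SELECT min(date) FROM tbl WHERE date > ?", (last,)).fetchone()
    if nxt is None:
        break
    dates.append(nxt)
    last = nxt

print(dates)  # ['2020-01-01', '2020-01-02', '2020-01-03']
```

The recursive CTE does the same thing server-side, saving one round trip per distinct date.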
I have the following query:
self.cursor.execute("SELECT platform_id_episode, title FROM table WHERE asset_type='movie'")
Is there a way to get the number of results returned directly? Currently I am doing the inefficient:
r = self.cursor.fetchall()
num_results = len(r)
If you don't actually need the results,* don't ask MySQL for them; just use COUNT:**
self.cursor.execute("SELECT COUNT(*) FROM table WHERE asset_type='movie'")
Now you'll get back one row, with one column, whose value is the number of rows your other query would have returned.
Notice that I ignored your specific columns and just did COUNT(*). A COUNT(platform_id_episode) would also be legal, but it means the number of found rows with non-NULL platform_id_episode values; COUNT(*) is the number of found rows full stop.***
* If you do need the results… well, you have to call fetchall() or equivalent to get them, so I don't see the problem.
** If you've never used aggregate functions in SQL before, make sure to look over some of the examples on that page; you've probably never realized you can do things like that so simply (and efficiently).
*** If someone taught you "never use * in a SELECT", that's good advice, but it's not relevant here. The problem with SELECT * is that it pulls all of the columns, in an order you don't control, into your result set, instead of just the columns you actually need in the order you need them. SELECT COUNT(*) doesn't do that.
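A small demonstration with sqlite3 standing in for MySQL (the table name assets and the data are invented, since the question's table name was anonymized):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assets (platform_id_episode TEXT, title TEXT, asset_type TEXT)")
conn.executemany("INSERT INTO assets VALUES (?,?,?)", [
    ("e1", "A", "movie"), ("e2", "B", "movie"), ("e3", "C", "series"),
])

# Ask the server for the count instead of shipping every row to Python
# just to call len() on it.
(num_results,) = conn.execute(
    "SELECT COUNT(*) FROM assets WHERE asset_type = 'movie'").fetchone()
print(num_results)  # 2
```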
I am querying two tables with SQLAlchemy, and I want to use the distinct feature in my query to get a unique set of customer IDs.
I have the following query:
orders[n] = DBSession.query(Order).\
join(Customer).\
filter(Order.oh_reqdate == date_q).\
filter(Order.vehicle_id == vehicle.id).\
order_by(Customer.id).\
distinct(Customer.id).\
order_by(asc(Order.position)).all()
As you can see, I am querying the Order table for all orders out for a specific date and a specific vehicle; this works fine. However, some customers may have more than one order on a single date, so I am trying to filter the results to list each customer only once. This works fine too, but in order to do it I must first order the results by the column that carries the distinct() function. I can add a second order_by on the column I actually want the results ordered by without causing a syntax error, but it gets ignored and the results are simply ordered by Customer.id.
I need to perform my query on the Order table and join to the customer (not the other way round) due to the way the foreign keys have been setup.
Is what I want to-do possible within one query? Or will I need to re-loop over my results to get the data I want in the right order?
You never need to "re-loop", if by that you mean loading the rows into Python. You probably want to produce a subquery and select from that, which you can achieve using query.from_self().order_by(asc(Order.position)). For more specific scenarios you can use subquery().
In this case I can't really tell what you're going for. If a customer has more than one Order with the requested vehicle id and date, you'll get two rows, one for each Order, and each Order row will refer to the Customer. What exactly do you want instead? Just the first Order row within each customer group? I'd do that like this:
highest_order = s.query(Order.customer_id, func.max(Order.position).label('position')).\
filter(Order.oh_reqdate == date_q).\
filter(Order.vehicle_id == vehicle.id).\
group_by(Order.customer_id).\
subquery()
s.query(Order).\
join(Customer).\
join(highest_order, highest_order.c.customer_id == Customer.id).\
filter(Order.oh_reqdate == date_q).\
filter(Order.vehicle_id == vehicle.id).\
filter(Order.position == highest_order.c.position)
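The SQLAlchemy code above corresponds to a plain-SQL greatest-per-group query: a subquery of each customer's max position, joined back to the order rows. A sqlite3 sketch with invented schema and data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT,
                     oh_reqdate TEXT, vehicle_id INT, position INT);
INSERT INTO customer VALUES (1), (2);
INSERT INTO orders VALUES
  (10, 1, '2023-05-01', 7, 1),
  (11, 1, '2023-05-01', 7, 3),
  (12, 2, '2023-05-01', 7, 2);
""")

# One row per customer: join each order back to its group's MAX(position),
# mirroring the highest_order subquery in the SQLAlchemy version.
rows = conn.execute("""
    SELECT o.id, o.customer_id, o.position
    FROM orders o
    JOIN (SELECT customer_id, MAX(position) AS position
          FROM orders
          WHERE oh_reqdate = ? AND vehicle_id = ?
          GROUP BY customer_id) h
      ON o.customer_id = h.customer_id AND o.position = h.position
    WHERE o.oh_reqdate = ? AND o.vehicle_id = ?
    ORDER BY o.position
""", ('2023-05-01', 7, '2023-05-01', 7)).fetchall()
print(rows)  # [(12, 2, 2), (11, 1, 3)] -- one order per customer
```

Customer 1's lower-position order (id 10) is filtered out, which is the "each customer once" behavior the question asks for.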