I am using an Aurora MySQL database, and my AWS Lambda function needs to do the following.
Assume a table with two columns, ID and Translated ID.
I have access to a Lambda function that takes an ID as input and outputs the Translated ID. It can also take a list of IDs and return a list of translated IDs.
Right now I am doing it row by row, with this workflow:
1. get top 100 Rows from table, where translated ID is null,
2. for each row, retrieve the ID, use the API to get the translated ID.
3. Update the row with the translated id.
4. rinse and repeat for all 100 rows.
The problem is that, because of the latency of the API call in between, the row-by-row operation causes the Lambda function to time out. Is there any way to do a batch operation while still keeping the translated IDs aligned with their corresponding IDs? Something like:
Get the top 100 IDs from the table where translated ID is null.
Use the API to take the list of all 100 IDs and get a corresponding list of 100 translated IDs.
Perform an update (preferably in one single UPDATE command) of all 100 rows, setting each row's Translated ID column to its corresponding value.
4 queries:
(0). ensure the environment is clean (you can omit this one if you are never reusing a database connection).
DROP TEMPORARY TABLE IF EXISTS my_updates;
(1). Create a temp table to hold the new values.
CREATE TEMPORARY TABLE my_updates (
id INT NOT NULL,
translated_id INT NOT NULL,
PRIMARY KEY(id)
);
(2). Insert all the new values in a bulk insert.
INSERT INTO my_updates (id, translated_id)
VALUES (?,?), (?,?), (?,?), ...
Repeat (?,?) × 100 and pass an array of 200 elements to this query. Some MySQL libraries have shortcuts for multi-row inserts; with others you need to build the sets of row placeholders yourself.
(3). You now have all 100 new tuples on the server, so you can ask it to update... join.
UPDATE base_table b
JOIN my_updates m ON m.id = b.id
SET b.translated_id = m.translated_id;
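As a rough Python sketch of queries (0) to (3): this assumes a DB-API cursor from a MySQL driver that uses %s as its placeholder (PyMySQL and MySQLdb do; the ? above is the generic form). The helper names build_bulk_insert and apply_translations are made up for illustration.

```python
from typing import List, Sequence, Tuple


def build_bulk_insert(pairs: Sequence[Tuple[int, int]],
                      placeholder: str = "%s") -> Tuple[str, List[int]]:
    """Return (sql, params) for a multi-row INSERT into my_updates."""
    if not pairs:
        raise ValueError("need at least one (id, translated_id) pair")
    row = f"({placeholder},{placeholder})"
    sql = ("INSERT INTO my_updates (id, translated_id) VALUES "
           + ", ".join([row] * len(pairs)))
    # Flatten [(id1, t1), (id2, t2)] into [id1, t1, id2, t2].
    params = [value for pair in pairs for value in pair]
    return sql, params


def apply_translations(cursor, pairs):
    """Run queries (0) to (3) on an already open DB-API cursor."""
    cursor.execute("DROP TEMPORARY TABLE IF EXISTS my_updates")
    cursor.execute(
        "CREATE TEMPORARY TABLE my_updates ("
        "id INT NOT NULL, translated_id INT NOT NULL, PRIMARY KEY(id))"
    )
    sql, params = build_bulk_insert(pairs)
    cursor.execute(sql, params)
    cursor.execute(
        "UPDATE base_table b JOIN my_updates m ON m.id = b.id "
        "SET b.translated_id = m.translated_id"
    )
```

The caller would zip the 100 IDs with the 100 translated IDs returned by the API, pass the pairs to apply_translations, and commit on the connection afterwards.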
You could also do this in one query, though it is a little more convoluted:
UPDATE base_table
SET translated_id = CASE id
WHEN #i1 THEN #ti1
WHEN #i2 THEN #ti2
...
WHEN #i100 THEN #ti100
ELSE translated_id END
WHERE id IN (#i1,#i2,...#i100);
I've used #value here as placeholders to explain what goes where, since it would be less intuitive than the example above, but this query should actually be done with ? placeholders as well. The argument passed would be an array of 300 members, with 100 sets of (id,translated_id) and then all of the (id) values again for the WHERE. The ELSE is a safety precaution... it should never actually be reached, but no data will be overwritten if it is.
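Building that single-query form by hand is error-prone, so here is a minimal Python sketch of how the statement and its 300-element argument list could be generated. It assumes a driver with %s placeholders, and build_case_update is a hypothetical helper name.

```python
def build_case_update(pairs, placeholder="%s"):
    """Return (sql, params) for the single-statement CASE update.

    pairs is a sequence of (id, translated_id) tuples.
    """
    if not pairs:
        raise ValueError("need at least one (id, translated_id) pair")
    count = len(pairs)
    whens = " ".join([f"WHEN {placeholder} THEN {placeholder}"] * count)
    in_list = ",".join([placeholder] * count)
    sql = (
        "UPDATE base_table SET translated_id = CASE id "
        f"{whens} ELSE translated_id END "
        f"WHERE id IN ({in_list})"
    )
    # First all (id, translated_id) sets, then every id again for the WHERE.
    params = [value for pair in pairs for value in pair]
    params += [pair[0] for pair in pairs]
    return sql, params
```

The result would be passed to the driver as cursor.execute(sql, params).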
I have a set of data that gets updated periodically by a client. Once a month or so we will download a new set of this data. The dataset is about 50k records with a couple hundred columns of data.
I am trying to create a database that houses all of this data so we can run our own analysis on it. I'm using PostgreSQL and Python (psycopg2).
Occasionally, the client will add columns to the dataset, so there are a number of steps I want to take:
Add new records to the database table
Compare the old set of data with the new set of data and update the table where necessary
Keep the old records, and either add an "expired" flag, or an "db_expire_date" to keep track of whether a record is active or expired
Add any new columns of data to the database for all records
I know how to add new records to the database (1) using INSERT INTO, and how to add new columns of data to the database (4) using ALTER TABLE. But I am having issues with (2) and (3). I figured out how to update a record, using the following code:
rows = zip(*[update_records[col] for col in update_records])
cursor = conn.cursor()
cursor.execute("""CREATE TEMP TABLE temptable (""" + schema_list + """) ON COMMIT DROP""")
cursor.executemany("""INSERT INTO temptable (""" + var +""") VALUES ("""+ perc_s + """)""", rows)
cursor.execute("""
UPDATE tracking.test_table
SET mfg = temptable.mfg, db_updt_dt = CURRENT_TIMESTAMP
FROM temptable
WHERE temptable.app_id = tracking.test_table.app_id;
""");
cursor.rowcount
conn.commit()
cursor.close()
conn.close()
However, this just updated the record based on the app_id as the primary key.
What I'd like to figure out is how to keep the original record, set it as "expired", and then create a new, updated record. It seems that "app_id" shouldn't be my primary key, so I've created a new primary key as '"primary_key" INT GENERATED ALWAYS AS IDENTITY not null,'.
I'm just not sure where to go from here. I think that I could probably just use INSERT INTO to send the new records to the database, but I'm not sure how to "expire" the old records that way. Possibly I could use UPDATE to set the older records to "expired", but I am wondering if there is a more straightforward way to do this.
I hope my question is clear. I'm hoping someone can point me in the right direction. Thanks
A pretty standard data warehousing technique is to define two additional date fields, a from-effective-date and a to-effective-date. You only append rows, never update. You add the candidate record if the source primary key does not exist in your table OR if any column value is different from the most recently added prior record in your table with the same primary key. (Each record supersedes the last).
As you add your record to the table you do 3 things:
The New record's from-effective-date gets the transaction file's date
The New record's to-effective-date gets a date WAY in the future, like 9999-12-31. The important thing here is that it will not expire until you say so.
The most recent prior record (the one you compared values for changes) has its to-effective-date Updated to the transaction file's date minus one day. This has the effect of expiring the old record.
This creates a chain of records with the same source primary key with each one covering a non-overlapping time period. This format is surprisingly easy to select from:
If you want to reproduce the most current transaction file you select Where to-effective-date > Current Date
If you want to reproduce the transaction file at any date for a report, you select Where myreportdate Between from-effective-date And to-effective-date.
If you want the entire update history for a key you select * Where the key = mykeyvalue Order By from-effective-date.
The only thing that is ugly about this scheme is when columns are added, the comparison test also must be altered to include those new columns in case something changes. If you want that to be dynamic, you're going to have to loop through the reflection meta data for each column in the table, but Python will need to know how comparing a text field might be different from comparing a BLOB, for example.
If you actually care about having a primary key (many data warehouses do not have primary keys) you can define a compound key on the source primary key + one of those effective dates, it doesn't really matter which one.
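Here is a small runnable sketch of that append-and-expire scheme. It uses Python's built-in sqlite3 purely as a stand-in for PostgreSQL, and the table name history, the columns eff_from and eff_to, and the helper load_file are assumptions made for the example (app_id and mfg come from the question). Only one data column is compared here; a real table would compare every column.

```python
import sqlite3
from datetime import date, timedelta

FAR_FUTURE = "9999-12-31"


def load_file(conn, file_date, records):
    """Append new or changed records from one transaction file.

    records maps app_id -> mfg; file_date is a datetime.date.
    """
    expire_on = (file_date - timedelta(days=1)).isoformat()
    for app_id, mfg in records.items():
        current = conn.execute(
            "SELECT rowid, mfg FROM history "
            "WHERE app_id = ? AND eff_to = ?",
            (app_id, FAR_FUTURE),
        ).fetchone()
        if current is not None and current[1] == mfg:
            continue  # unchanged: nothing to append
        if current is not None:
            # Expire the superseded row the day before the new file.
            conn.execute("UPDATE history SET eff_to = ? WHERE rowid = ?",
                         (expire_on, current[0]))
        conn.execute(
            "INSERT INTO history (app_id, mfg, eff_from, eff_to) "
            "VALUES (?, ?, ?, ?)",
            (app_id, mfg, file_date.isoformat(), FAR_FUTURE),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (app_id INTEGER, mfg TEXT, "
             "eff_from TEXT, eff_to TEXT)")
load_file(conn, date(2024, 1, 1), {1: "acme", 2: "globex"})
load_file(conn, date(2024, 2, 1), {1: "acme", 2: "initech"})

# Current snapshot: rows whose to-effective-date is still in the future.
current_rows = conn.execute(
    "SELECT app_id, mfg FROM history WHERE eff_to > ? ORDER BY app_id",
    (date(2024, 2, 15).isoformat(),),
).fetchall()
```

Dates are stored as ISO strings so that plain string comparison orders them correctly in SQLite; in PostgreSQL you would use real DATE columns.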
You're looking for the concept of a "natural key", which is how you would identify a unique row, regardless of what the explicit logical constraints on the table are.
This means that you're spot on that you need to change your primary key to be more inclusive. Your new primary key doesn't actually help you decipher which row you are looking for once you have both in there unless you already know which row you are looking for (that "identity" field).
I can think of two likely candidates to add to your natural key: date, or batch.
Either way, you would look for "App = X, [Date|batch] = Y" in the data to find that one. Batch would be upload 1, upload 2, etc. You just make it up, or derive it from the date, or something along those lines.
If you aren't sure which to add, and you aren't ever going to upload multiple times in one day, I would go with Date. That will give you more visibility over time, as you can see when and how often things change.
Once you have a natural key, you want to make it explicit in your data. You can either keep your identity column (see: Surrogate Key) or you can have a compound primary key. With no other input or constraints, I would go with a compound primary key for your situation.
I'm a MySQL DBA, so I'm cribbing a bit from the docs here: https://www.postgresqltutorial.com/postgresql-primary-key/
You do NOT want this:
CREATE TABLE test_table (
app_id INTEGER PRIMARY KEY,
date DATE,
active BOOLEAN
);
Instead, you want this:
CREATE TABLE test_table (
app_id INTEGER,
date DATE,
active BOOLEAN,
PRIMARY KEY (app_id, date)
);
I've added an active column here as well, since you wanted to deactivate rows. This isn't explicitly necessary from what you've described though - you can always assume the most recent upload is active. Or you can expand the columns to have a "active_start" date and an "active_end" date, which will enable another set of queries. But for what you've stated here so far, just the date column should suffice. :)
For step 2)
First, you have to identify the records that have the same data. For this, you can run a SELECT query with a WHERE clause before inserting any record and count the number of rows you receive as output. If the count is more than 0, don't insert the record; otherwise, you can insert it.
For step 3)
For this, you can insert a column as you mention above with the name 'db_expire_date' and insert the expiration value at the time of record insertion only.
You can also use a column like 'is_expire' but for that, you need to add a cron job that can update the DB periodically for the value of this column.
I have a table called products
which has the following columns:
id, product_id, data, activity_id
What I am essentially trying to do is copy a batch of existing products, update their activity_id, and create new entries in the products table.
Example:
I already have 70 existing entries in products with activity_id 2
Now I want to create another 70 entries with same data except for updated activity_id
I could have thousands of existing entries that I'd like to make a copy of and update the copied entries activity_id to be a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through products, then simply clone existing products and just update activity_id field.
for product in products:
    product.activity_id = new_id
self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best/fastest way to do these bulk inserts so that it saves time? As of now the performance is OK, but is there a better solution?
You don't need to load these objects locally, all you really want to do is have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO products (product_id, data, activity_id)
SELECT product_id, data, 2 -- the new activity_id value
FROM products
WHERE activity_id = old_id
The above query would run entirely on the database server; this is far preferable over loading your query into Python objects, then sending all the Python data back to the server to populate INSERT statements for each new row.
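To see the effect of such a server-side copy without any ORM involved, here is a minimal sketch using Python's built-in sqlite3 as a stand-in database. The table layout follows the question; copy_products is a hypothetical helper and the sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "product_id INTEGER, data TEXT, activity_id INTEGER)"
)
# 70 existing entries with activity_id 2, as in the question's example.
conn.executemany(
    "INSERT INTO products (product_id, data, activity_id) VALUES (?, ?, ?)",
    [(n, f"data-{n}", 2) for n in range(70)],
)


def copy_products(conn, old_id, new_id):
    """Clone every row with activity_id == old_id under new_id."""
    conn.execute(
        "INSERT INTO products (product_id, data, activity_id) "
        "SELECT product_id, data, ? FROM products WHERE activity_id = ?",
        (new_id, old_id),
    )
    conn.commit()


copy_products(conn, old_id=2, new_id=3)
copied = conn.execute(
    "SELECT COUNT(*) FROM products WHERE activity_id = 3"
).fetchone()[0]
```

No row data travels to Python at any point; only the two activity_id values are sent as parameters.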
Queries like that are something you could do with SQLAlchemy core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to:
1. Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method.
You could also get the same object from the models.Product query, more on that later.
2. Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
3. Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
4. Apply that updated statement to the Insert object for the table via Insert.from_select().
5. Execute the generated INSERT INTO ... FROM ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement where we swapped out the column. You can't easily remove a column from a select statement but we can, however, use a loop over the existing columns in the query to 'copy' them across to the new SELECT, and at the same time make our replacement.
Step 4 is then straightforward, Insert.from_select() needs to have the columns that are inserted and the SELECT query. We have both as the SELECT object we have gives us its columns too.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement

def insert_from_query(model, query, **replace):
    # The SQLAlchemy core definition of the table
    table = inspect(model).local_table
    # and the underlying core select statement to source new rows from
    select = query.statement
    # validate assumptions: make sure the query produces rows from the above table
    assert table in select.froms, f"{query!r} must produce rows from {model!r}"
    assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"
    # updated select, replacing the indicated columns
    as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
    replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
    from_select = select.with_only_columns([
        replacements.get(c.name, c)
        for c in table.columns
        if not c.primary_key
    ])
    return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
The above function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)
I am writing Python scripts to synchronize tables from an MSSQL database to a PostgreSQL DB. The original author tends to use super wide tables with a lot of regional consecutive NULL holes in them.
For insertion speed, I serialized the records in bulk to string in the following form before execute()
INSERT INTO A( {col_list} )
SELECT * FROM ( VALUES (row_1), (row_2),...) B( {col_list} )
During the row serialization, it's not possible to determine the data type of NULL or None in Python. This makes the job complicated. All NULL values in timestamp columns, integer columns etc. need an explicit type cast to the proper type, or Pg complains about it.
Currently I am checking the DB API connection.description property, comparing the column type_code for every column, and adding type casts like ::timestamp as needed.
But this feels cumbersome, with the extra work: the driver already converted the data from text to the proper Python data type, and now I have to redo it for every column with those many Nones.
Is there any better way to work around this with elegance and simplicity?
If you don't need the SELECT, go with @Nick's answer.
If you need it (like with a CTE to use the input rows multiple times), there are workarounds depending on the details of your use case.
Example, when working with complete rows:
INSERT INTO A -- complete rows
SELECT * FROM (
VALUES ((NULL::A).*), (row_1), (row_2), ...
) B
OFFSET 1;
{col_list} is optional noise in this particular case, since we need to provide complete rows anyway.
Detailed explanation:
Casting NULL type when updating multiple rows
Instead of inserting from a SELECT, you can attach a VALUES clause directly to the INSERT, i.e.:
INSERT INTO A ({col_list})
VALUES (row_1), (row_2), ...
When you insert from a query, Postgres examines the query in isolation when trying to infer the column types, and then tries to coerce them to match the target table (only to find out that it can't).
When you insert directly from a VALUES list, it knows about the target table when performing the type inference, and can then assume that any untyped NULL matches the corresponding column.
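A minimal sketch of building such a statement from Python, assuming a driver with %s placeholders such as psycopg2, where a Python None parameter is sent as NULL. The helper name build_values_insert is made up, and the table and column names are interpolated directly into the SQL, so they must come from trusted code rather than user input.

```python
def build_values_insert(table, columns, rows):
    """Return (sql, params) for INSERT INTO table (cols) VALUES (...), (...)."""
    if not rows:
        raise ValueError("need at least one row")
    width = len(columns)
    if any(len(row) != width for row in rows):
        raise ValueError("every row must have one value per column")
    row_placeholders = "(" + ", ".join(["%s"] * width) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([row_placeholders] * len(rows))
    )
    # None values stay as None here; the driver renders them as NULL,
    # and the target column supplies the type.
    params = [value for row in rows for value in row]
    return sql, params
```

Usage would be cursor.execute(*build_values_insert("A", ["id", "dat"], rows)). psycopg2 also ships psycopg2.extras.execute_values, which handles the row expansion for you.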
You could try to create json from data and then rowset from json using json_populate_record(..).
postgres=# create table js_test (id int4, dat timestamp, val text);
CREATE TABLE
postgres=# insert into js_test
postgres-# select (json_populate_record(null::js_test,
postgres(# json_object(array['id', 'dat', 'val'], array['5', null, 'test']))).*;
INSERT 0 1
postgres=# select * from js_test;
id | dat | val
----+-----+------
5 | | test
You can use json_populate_recordset(..) to do the same with multiple rows in one go. You pass a single json value that contains a JSON array of objects. Make sure it isn't a PostgreSQL array of json values.
So this is OK: '[{"id":1,"dat":null,"val":6},{"id":3,"val":"tst"}]'::json
This is not: array['{"id":1,"dat":null,"val":6}'::json,'{"id":3,"val":"tst"}'::json]
select *
from json_populate_recordset(null::js_test,
'[{"id":1,"dat":null,"val":6},{"id":3,"val":"tst"}]')
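From Python, the JSON array can be produced with the standard json module and passed as one parameter. This is only a sketch: rows_to_json_param is a hypothetical helper, the %s placeholder assumes a psycopg2-style driver, and default=str is a simple way to serialize datetimes that may need adjusting for other types.

```python
import json


def rows_to_json_param(columns, rows):
    """Serialize rows into one JSON array of objects; None becomes null."""
    return json.dumps([dict(zip(columns, row)) for row in rows], default=str)


# A single parameter carries every row; the record type supplies the
# column types, so no per-column casts are needed for the nulls.
INSERT_FROM_JSON = (
    "INSERT INTO js_test "
    "SELECT * FROM json_populate_recordset(null::js_test, %s::json)"
)

payload = rows_to_json_param(
    ["id", "dat", "val"],
    [(1, None, "6"), (3, None, "tst")],
)
```

It would then be executed as cursor.execute(INSERT_FROM_JSON, (payload,)).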
I've got a weekly process which does a full replace operation on a few tables. The process is weekly since there are large amounts of data as a whole. However, we also want to do daily/hourly delta updates, so the system would be more in sync with production.
When we update data, we are creating duplications of rows (updates of an existing row), which I want to get rid of. To achieve this, I've written a python script which runs the following query on a table, inserting the results back into it:
QUERY = """#standardSQL
select {fields}
from (
select *
, max(record_insert_time) over (partition by id) as max_record_insert_time
from {client_name}_{environment}.{table} as a
)
where 1=1
and record_insert_time = max_record_insert_time"""
The {fields} variable is replaced with a list of all the table columns; I can't use * here because that would only work for 1 run (the next will already have a field called max_record_insert_time and that would cause an ambiguity issue).
Everything is working as expected, with one exception - some of the columns in the table are of RECORD datatype; despite not using aliases for them, and selecting their fully qualified name (e.g. record_name.child_name), when the output is written back into the table, the results are flattened. I've added the flattenResults: False config to my code, but this has not changed the outcome.
I would love to hear thoughts about how to resolve this issue using my existing plan, other methods of deduping, or other methods of handling delta updates altogether.
Perhaps you can use in the outer statement
SELECT * EXCEPT (max_record_insert_time)
This should keep the exact record structure. (for more detailed documentation see https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select-except)
An alternative approach would be to include in {fields} only top-level columns, even if they are not leaves, i.e. just record_name and not record_name.*
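Applied to the script in the question, the first suggestion removes the need for the {fields} list altogether. A sketch of the adjusted template, keeping the question's placeholder names (build_dedup_query is a hypothetical helper):

```python
DEDUP_QUERY = """#standardSQL
select * except (max_record_insert_time)
from (
  select *
    , max(record_insert_time) over (partition by id) as max_record_insert_time
  from {client_name}_{environment}.{table}
)
where record_insert_time = max_record_insert_time"""


def build_dedup_query(client_name, environment, table):
    """Fill in the dataset and table names for one dedup run."""
    return DEDUP_QUERY.format(
        client_name=client_name, environment=environment, table=table
    )
```

Because the helper column is excluded from the output, it is never written back to the table, so the same query can be run repeatedly.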
The answer below is definitely not better than the straightforward SELECT * EXCEPT modifier, but I wanted to present an alternative version.
SELECT t.*
FROM (
SELECT
id, MAX(record_insert_time) AS max_record_insert_time,
ARRAY_AGG(t) AS all_records_for_id
FROM yourTable AS t GROUP BY id
), UNNEST(all_records_for_id) AS t
WHERE t.record_insert_time = max_record_insert_time
ORDER BY id
What the above query does: first it groups all records for each id into an array of the respective rows, along with the max value of record_insert_time. Then, for each id, it simply flattens all the (previously aggregated) rows and picks only the rows whose record_insert_time matches the max time. The result is as expected. No analytic function is involved, just simple aggregation, at the cost of an extra UNNEST ...
Still - at least different option :o)
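For intuition only, the group-then-filter logic of that query can be mirrored in plain Python. This is not BigQuery code; rows are modelled as dicts, and latest_per_id is a made-up name.

```python
from collections import defaultdict


def latest_per_id(rows):
    """Keep, for each id, only the rows with the newest record_insert_time."""
    groups = defaultdict(list)
    for row in rows:
        # Equivalent of ARRAY_AGG(t) ... GROUP BY id
        groups[row["id"]].append(row)
    result = []
    for key in sorted(groups):  # ORDER BY id
        newest = max(r["record_insert_time"] for r in groups[key])
        # Equivalent of UNNEST plus the WHERE filter on the max time
        result.extend(
            r for r in groups[key] if r["record_insert_time"] == newest
        )
    return result
```

Nested values inside each row are carried through untouched, which is the property the question cares about for RECORD columns.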
I am a beginner in MySQL, so this may be my own mistake somewhere, but I am not able to understand how this can be resolved.
This is structure of my table:-
CREATE TABLE `nearest_product_type` (
`id` integer AUTO_INCREMENT NOT NULL PRIMARY KEY,
`created` datetime NOT NULL,
`modified` datetime NOT NULL,
`name` varchar(15) NOT NULL UNIQUE
)
;
And this is the code i am trying:-
base = MySQLdb.connect (host="localhost", user = "root", passwd = "sheeshmohsin", db="points")
basecursor = base.cursor()
queryone = """INSERT INTO nearest_product_type (name,created,modified) VALUES (%s,%s,%s) ON DUPLICATE KEY UPDATE name=name """
category = "Indica"
valueone = (category,datetime.datetime.now(),datetime.datetime.now())
basecursor.execute(queryone, valueone)
product_id = basecursor.lastrowid
basecursor.close()
base.commit()
base.close()
print product_id
When I run this Python script the first time, with a category that is not yet in the table, it works fine. But when I run it again with the same category, lastrowid returns 0, and I need the id of the row that was updated.
When I checked the rows in the table, I also saw that the auto-increment counter advances on every run. Suppose I run the script four times: the first time the category is new and gets id 1, and the fourth time another new category comes in. The id assigned to that row is 4, but it should be 2, because it is the second row. How can I solve this?
The ON DUPLICATE KEY UPDATE part here will not work as the key is the auto-increment column, which will never get duplicates.
It is almost certainly this clause that is causing the unexpected counts, particularly given the UNIQUE setting on name.
You can try using something like SELECT MAX(id) FROM nearest_product_type to get the last id added.
Something is wrong in the way you access the database. When you try to insert a new row in your database with a name that already exists, as the column name is declared to be unique, the insert will fail.
If you want to modify an existing row, you must use an UPDATE statement, not an INSERT one. And there's nothing in SQL to do an insert or update.
And nothing in autoincrement guarantees that ids are consecutive. All you know is that the database will assign a different id to each inserted row, but insertion failures can (and do in your case) result in holes in the id sequence.
Furthermore, some drivers may allow for pre-reservation of ids, particularly with network connections, to let a client connection get a bunch of ids in case it inserts more than one row. In that case, if another client asks for ids and both clients insert rows alternately, the ids will not follow the insertion time.
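For the lastrowid part of the question, it may be worth trying an idiom that is not from the answers above but is described in the MySQL manual for INSERT ... ON DUPLICATE KEY UPDATE: assigning id = LAST_INSERT_ID(id) in the update clause, so that the existing row's id is reported when the name already exists. The sketch below assumes a MySQLdb or PyMySQL style cursor; upsert_product_type is a hypothetical helper.

```python
import datetime

UPSERT_SQL = (
    "INSERT INTO nearest_product_type (name, created, modified) "
    "VALUES (%s, %s, %s) "
    # On a duplicate name, keep the row as it is, but make the existing
    # id the value that LAST_INSERT_ID() and cursor.lastrowid report.
    "ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id)"
)


def upsert_product_type(cursor, name, now=None):
    """Insert name if it is new; return the row id in either case."""
    now = now or datetime.datetime.now()
    cursor.execute(UPSERT_SQL, (name, now, now))
    return cursor.lastrowid
```

This does not make the ids consecutive; gaps in an auto-increment sequence remain possible, as described above.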