I'm using Dynamodb. I have a simple Employee table with fields like id, name, salary, doj, etc. What is the equivalent query of select max(salary) from employee in dynamodb?
You can model your schema something like:
employee_id, as partition-key
salary, as sort-key of the table or local secondary index.
Then, Query your table for given employee_id, with ScanIndexForward as false, and pick the first returned entry. Since all rows one employee_id are stored in sorted fashion, the first entry in desc order will be the one with highest salary.
You can also keep Limit as 1, in which case DynamoDB will return only one record.
Relevant documentation here.
Not sure about boto3 but inboto can be run this way
from boto.dynamodb2.table import Table
table = Table("employee")
values = list(table.query_2(reverse=True, limit=1))
MAXVALUE = values[0][salary]
There is no cheap way to achieve this in Dynamodb. There is no inbuild function to determine the max value of an attribute without retrieving all items and calculate programmatically.
Related
Say I have an id column that is saved as ids JSON NOT NULL using SQLAlchemy, and now I want to delete an id from this column. I'd like to do several things at once:
query only the rows who have this specific ID
delete this ID from all rows it appears in
a bonus, if possible - delete the row if the ID list is now empty.
For the query, something like this:
db.query(models.X).filter(id in list(models.X.ids)) should work.
now, I'd rather avoid iterating over each query and then send an update request as it can be multiple rows. Is there any elegant way to do this?
Thanks!
For the search and remove remove part you can use json_remove function (from SQLLite built-in functions)
from sqlalchemy import func
db.query(models.X).update({'ids': func.json_remove(models.X.ids,f'$[{TARGET_ID}]') })
Here replace TARGET_ID by the targeted id.
Now this will update the row 'silently' (wether or not this id is present in the array).
If you want to first check if target id is in the column: you can query first all rows containing the target id with json_extract query (calling .all() method and then remove those ids with an .update() call.
But this will cost you double amount of queries (less performant).
For the delete part, you can use the json_array_length built-in function
from sqlalchemy import func
db.query(models.X).filter(func.json_array_length(models.X.ids) == 0).delete()
FYI : Not sure that you can do both in one query, and even if possible, I would not do it for clean syntax, logging and monitoring reasons.
I have a set of data that gets updated periodically by a client. Once a month or so we will download a new set of this data. The dataset is about 50k records with a couple hundred columns of data.
I am trying to create a database that houses all of this data so we can run our own analysis on it. I'm using PostgreSQL and Python (psycopg2).
Occasionally, the client will add columns to the dataset, so there are a number of steps I want to take:
Add new records to the database table
Compare the old set of data with the new set of data and update the table where necessary
Keep the old records, and either add an "expired" flag, or an "db_expire_date" to keep track of whether a record is active or expired
Add any new columns of data to the database for all records
I know how to add new records to the database (1) using INSERT INTO, and how to add new columns of data to the database (4) using ALTER TABLE. But having issues with (2) and (3). I figured out how to update a record, using the following code:
rows = zip(*[update_records[col] for col in update_records])
cursor = conn.cursor()
cursor.execute("""CREATE TEMP TABLE temptable (""" + schema_list + """) ON COMMIT DROP""")
cursor.executemany("""INSERT INTO temptable (""" + var +""") VALUES ("""+ perc_s + """)""", rows)
cursor.execute("""
UPDATE tracking.test_table
SET mfg = temptable.mfg, db_updt_dt = CURRENT_TIMESTAMP
FROM temptable
WHERE temptable.app_id = tracking.test_table.app_id;
""");
cursor.rowcount
conn.commit()
cursor.close()
conn.close()
However, this just updated the record based on the app_id as the primary key.
What I'd like to figure out is how to keep the original record and set it as "expired" and then create a new, updated record. It seems that "app_id" shouldn't be my primary key, so i've created a new primary key as '"primary_key" INT GENERATED ALWAYS AS IDENTITY not null,'.
I'm just not sure where to go from here. I think that I could probably just use INSERT INTO to send the new records to the database. But i'm not sure how to "expire" the old records that way. Possibly I could use UPDATE table to set the older values to "expired". But I am wondering if there is a more straightforward way to do this.
I hope my question is clear. I'm hoping someone can point me in the right direction. Thanks
A pretty standard data warehousing technique is to define two additional date fields, a from-effective-date and a to-effective-date. You only append rows, never update. You add the candidate record if the source primary key does not exist in your table OR if any column value is different from the most recently added prior record in your table with the same primary key. (Each record supersedes the last).
As you add your record to the table you do 3 things:
The New record's from-effective-date gets the transaction file's date
The New record's to-effective-date gets a date WAY in the future, like 9999-12-31. The important thing here is that it will not expire until you say so.
The most recent prior record (the one you compared values for changes) has its to-effective-date Updated to the transaction file's date minus one day. This has the effect of expiring the old record.
This creates a chain of records with the same source primary key with each one covering a non-overlapping time period. This format is surprisingly easy to select from:
If you want to reproduce the most current transaction file you select Where to-effective-date > Current Date
If you want to reproduce the transaction file at any date for a report, you select Where myreportdate Between from-effective-date And to-effective-date.
If you want the entire update history for a key you select * Where the key = mykeyvalue Order By from-effective-date.
The only thing that is ugly about this scheme is when columns are added, the comparison test also must be altered to include those new columns in case something changes. If you want that to be dynamic, you're going to have to loop through the reflection meta data for each column in the table, but Python will need to know how comparing a text field might be different from comparing a BLOB, for example.
If you actually care about having a primary key (many data warehouses do not have primary keys) you can define a compound key on the source primary key + one of those effective dates, it doesn't really matter which one.
You're looking for the concept of a "natural key", which is how you would identify a unique row, regardless of what the explicit logical constraints on the table are.
This means that you're spot on that you need to change your primary key to be more inclusive. Your new primary key doesn't actually help you decipher which row you are looking for once you have both in there unless you already know which row you are looking for (that "identity" field).
I can think of two likely candidates to add to your natural key: date, or batch.
Either way, you would look for "App = X, [Date|batch] = Y" in the data to find that one. Batch would be upload 1, upload 2, etc. You just make it up, or derive it from the date, or something along those lines.
If you aren't sure which to add, and you aren't ever going to upload multiple times in one day, I would go with Date. That will give you more visibility over time, as you can see when and how often things change.
Once you have a natural key, you want to make it explicit in your data. You can either keep your identity column (see: Surrogate Key) or you can have a compound primary key. With no other input or constraints, I would go with a compound primary key for your situation.
I'm a MySQL DBA, so I'm cribbing a bit from the docs here: https://www.postgresqltutorial.com/postgresql-primary-key/
You do NOT want this:
CREATE TABLE test_table (
app_id INTEGER PRIMARY KEY,
date DATE,
active BOOLEAN
);
Instead, you want this:
CREATE TABLE test_table (
app_id INTEGER,
date DATE,
active BOOLEAN,
PRIMARY KEY (app_id, date)
);
I've added an active column here as well, since you wanted to deactivate rows. This isn't explicitly necessary from what you've described though - you can always assume the most recent upload is active. Or you can expand the columns to have a "active_start" date and an "active_end" date, which will enable another set of queries. But for what you've stated here so far, just the date column should suffice. :)
For step 2)
First, you have to identify the records that have the same data for this you can run a select query with where clause before inserting any recode and count the number of records you receive as output. If the count is more than 0 don't insert the recode otherwise you can insert the recode.
For step 3)
For this, you can insert a column as you mention above with the name 'db_expire_date' and insert the expiration value at the time of record insertion only.
You can also use a column like 'is_expire' but for that, you need to add a cron job that can update the DB periodically for the value of this column.
I have a tabled called products
which has following columns
id, product_id, data, activity_id
What I am essentially trying to do is copy bulk of existing products and update it's activity_id and create new entry in the products table.
Example:
I already have 70 existing entries in products with activity_id 2
Now I want to create another 70 entries with same data except for updated activity_id
I could have thousands of existing entries that I'd like to make a copy of and update the copied entries activity_id to be a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through products, then simply clone existing products and just update activity_id field.
for product in products:
product.activity_id = new_id
self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best/ fastest way to do these bulk entries so it kills time, as of now it's OK performance, is there a better solution?
You don't need to load these objects locally, all you really want to do is have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO product (product_id, data, activity_id)
SELECT product_id, data, 2 -- the new activity_id value
FROM product
WHERE activity_id = old_id
The above query would run entirely on the database server; this is far preferable over loading your query into Python objects, then sending all the Python data back to the server to populate INSERT statements for each new row.
Queries like that are something you could do with SQLAlchemy core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to
Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method.
You could also get the same object from models.Product query, more on that later.
Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
Apply that updated statement to the Insert object for the table via Insert.from_select().
Execute the generated INSERT INTO ... FROM ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement where we swapped out the column. You can't easily remove a column from a select statement but we can, however, use a loop over the existing columns in the query to 'copy' them across to the new SELECT, and at the same time make our replacement.
Step 4 is then straightforward, Insert.from_select() needs to have the columns that are inserted and the SELECT query. We have both as the SELECT object we have gives us its columns too.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement
def insert_from_query(model, query, **replace):
# The SQLAlchemy core definition of the table
table = inspect(model).local_table
# and the underlying core select statement to source new rows from
select = query.statement
# validate asssumptions: make sure the query produces rows from the above table
assert table in select.froms, f"{query!r} must produce rows from {model!r}"
assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"
# updated select, replacing the indicated columns
as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
from_select = select.with_only_columns([
replacements.get(c.name, c)
for c in table.columns
if not c.primary_key
])
return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
The above function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)
I have a cronjob (J1) which calculate ~1M customers' product category preference every night. Most customers' preference are stable. But there are exceptions and there are new customers every day. I want to know these changes by setting a "diff" bit to 1. Then another cronjob (J2) can do something (e.g. send notification to them) on such customers and set them back to 0.
The table looks like:
CREATE TABLE customers (
customer_id VARCHAR(255),
preference VARCHAR(255),
diff TINYINT(1),
PRIMARY KEY (customer_id),
KEY (diff)
);
AFAIK, INSERT .. ON DUPLICATE KEY doesn't know about whether a non-key value is different. So you can't use something similar to the following, right?
INSERT customers AS ("sql for J1") ON DUPLICATE KEY
_AND_PREFERENCE_DIFFERS_ SET diff=1;
So what's the best way to do it?
a) Rename table customers to customer_yesterday. Create a new table customers by running J1. LEFT JOIN two tables and set diff bit of customers. (Pros: faster? Cons: need to handle all diffs correctly, e.g. cases when a customer doesn't present in today's output)
b) Loop through output of J1 (using python mysql connector), query customer by customer_id, and insert only when value is different or it's a new customer. (Pros: easy to understand logic; Cons: slow?)
Any better solutions?
Update:
As #Barmar asked, let's say sql for J1 is a transaction grouping sql, e.g.
SELECT
customer_id,
GROUP_CONCAT(DISTINCT product_category SEPARATOR ',')
FROM transaction
WHERE date between _30_days_ago_ and _today_;
Make SQL for J1 a query that uses a LEFT JOIN to filter out customers whose preference hasn't changed.
INSERT INTO customers (customer_id, preference)
SELECT t1.*
FROM (
SELECT customer_id,
GROUP_CONCAT(DISTINCT product_category ORDER BY product_category SEPARATOR ',') AS preference
FROM transaction
WHERE date BETWEEN _30_days_ago_ AND _today_) AS t1
LEFT JOIN customers AS c ON t1.customer_id = c.customer_id AND t1.preference = c.preference
WHERE t1.customer_id IS NULL
ON DUPLICATE KEY UPDATE preference = VALUES(preference), diff = 1
I've added an ORDER BY option to GROUP_CONCAT so that it will always return the categoris in a consistent order. Otherwise, it may result in false positives when the order changes.
I feel obliged to point out that storing comma-separated values in a table column is generally poor database design. You should use a many-to-many relationship table instead.
I have an items table that is related to an item_tiers table. The second table consists of inventory receipts for an item in the items table. There can be 0 or more records in the item_tiers table related to a single record in the items table. How can I, using query, get only records that have 1 or more records in item tiers....
results = session.query(Item).filter(???).join(ItemTier)
Where the filter piece, in pseudo code, would be something like ...
if the item_tiers table has one or more records related to item.
If there is a foreign key defined between tables, SA will figure the join condition for you, no need for additional filters.
There is, and i was really over thinking this. Thanks for the fast response. – Ominus
results = session.query(Item).join(ItemTier).filter(Item.foreign_key=ItemTier.column_with_keys).all()