Effective batch "update-or-insert" in SQLAlchemy - Python

There is a table Users, and in my code I have a big list of User objects. To insert them I can use:
session.add_all(user_list)
session.commit()
The problem is that there can be several duplicates, which I want to update, but the database won't allow duplicate entries to be inserted. Of course, I can iterate over user_list, try to insert each user into the database, and update it if the insert fails:
for u in users:
    q = session.query(T).filter(T.fullname == u.fullname).first()
    if q:
        session.query(T).filter_by(index=q.index).update(
            {column: getattr(u, column)
             for column in Users.__table__.columns.keys()
             if column != 'id'})
        session.commit()
    else:
        session.add(u)
        session.commit()
but I find this solution quite inefficient: first, I am making several round trips to retrieve the object q, and instead of batch-inserting the new items I insert them one by one. I wonder if there is a better solution for this task.
UPD: a better version:
for u in users:
    q = session.query(T).filter(Users.fullname == u.fullname).first()
    if q:
        for column in Users.__table__.columns.keys():
            if column != 'index':
                setattr(q, column, getattr(u, column))
        session.add(q)
    else:
        session.add(u)
session.commit()

A better solution would be to use the bulk MySQL construct
INSERT ... ON DUPLICATE KEY UPDATE ...
(I assume you're using MySQL because your post is tagged with 'mysql'). This way you're both inserting new entries and updating existing ones in one statement / transaction; see http://dev.mysql.com/doc/refman/5.6/en/insert-on-duplicate.html
It's not ideal if you have multiple unique indexes and, depending on your schema, you may have to fill in all NOT NULL values (hence issuing one bulk SELECT before calling it), but it's definitely the most efficient option and we use it a lot. The bulk version will look something like this (let's assume name is a unique key):
INSERT INTO User (name, phone, ...) VALUES
('ksmith', '111-11-11', ...),
('jford', '222-22-22', ...),
...
ON DUPLICATE KEY UPDATE
phone = VALUES(phone),
... ;
Unfortunately, INSERT ... ON DUPLICATE KEY UPDATE ... was long unsupported natively by SQLAlchemy, so you had to implement a little helper function to build the query for you; since SQLAlchemy 1.2 the MySQL dialect exposes it directly.
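A minimal sketch of that dialect-level construct, assuming SQLAlchemy 1.2+ and the question's Users model with an auto-increment id primary key (adjust the names to your schema):

from sqlalchemy.dialects.mysql import insert

def upsert_users(session, user_dicts):
    # Multi-row INSERT ... ON DUPLICATE KEY UPDATE built by the MySQL dialect.
    stmt = insert(Users.__table__).values(user_dicts)
    # stmt.inserted refers to each row's VALUES(...); on a duplicate key,
    # overwrite every column except the assumed 'id' primary key.
    stmt = stmt.on_duplicate_key_update(
        {c.name: stmt.inserted[c.name]
         for c in Users.__table__.columns
         if c.name != 'id'}
    )
    session.execute(stmt)
    session.commit()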

Related

sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "product_pkey" [duplicate]

I have a table called products
which has the following columns:
id, product_id, data, activity_id
What I am essentially trying to do is copy a bulk of existing products, update their activity_id, and create new entries in the products table.
Example:
I already have 70 existing entries in products with activity_id 2.
Now I want to create another 70 entries with the same data except for an updated activity_id.
I could have thousands of existing entries that I'd like to copy, updating the copied entries' activity_id to a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through products, simply cloning the existing products and just updating the activity_id field:
for product in products:
    product.activity_id = new_id
self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best / fastest way to do these bulk entries so that it saves time? The performance is OK as of now, but is there a better solution?
You don't need to load these objects locally, all you really want to do is have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO product (product_id, data, activity_id)
SELECT product_id, data, 2 -- the new activity_id value
FROM product
WHERE activity_id = old_id
The above query would run entirely on the database server; this is far preferable over loading your query into Python objects, then sending all the Python data back to the server to populate INSERT statements for each new row.
Queries like that are something you could do with SQLAlchemy Core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to:
1. Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method. (You could also get the same object from the models.Product query; more on that later.)
2. Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
3. Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
4. Apply that updated statement to the Insert object for the table via Insert.from_select().
5. Execute the generated INSERT INTO ... SELECT ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement where we swapped out the column. You can't easily remove a column from a select statement, but we can use a loop over the existing columns in the query to 'copy' them across to the new SELECT, making our replacement at the same time.
Step 4 is then straightforward: Insert.from_select() needs the columns that are inserted and the SELECT query, and we have both, as the SELECT object also gives us its columns.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement

def insert_from_query(model, query, **replace):
    # The SQLAlchemy core definition of the table
    table = inspect(model).local_table
    # and the underlying core select statement to source new rows from
    select = query.statement
    # validate assumptions: make sure the query produces rows from the above table
    assert table in select.froms, f"{query!r} must produce rows from {model!r}"
    assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"
    # updated select, replacing the indicated columns
    as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
    replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
    from_select = select.with_only_columns([
        replacements.get(c.name, c)
        for c in table.columns
        if not c.primary_key
    ])
    return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
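As an aside, that func.max() replacement could look like the following sketch (1.x-style select(), matching the helper above; in SQLAlchemy 1.4+ use scalar_subquery() instead of as_scalar()):

from sqlalchemy import func, select

# Hypothetical replacement clause: the current maximum activity_id plus one,
# wrapped as a scalar subselect so it can stand in for a column value.
next_activity_id = select([func.max(models.Product.activity_id) + 1]).as_scalar()
insert_stmt = insert_from_query(models.Product, products, activity_id=next_activity_id)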
The insert_from_query() function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)

How to do ...ON DUPLICATE KEY UPDATE... in django

Is there a way to do the following in Django's ORM?
INSERT INTO mytable
VALUES (1,2,3)
ON DUPLICATE KEY
UPDATE field=4
I'm familiar with get_or_create, which takes default values, but that doesn't update the record if there are differences in the defaults. Usually I use the following approach, but it takes two queries instead of one:
item = Item(id=1)
item.update(**fields)
item.save()
Is there another way to do this?
I'm familiar with get_or_create, which takes default values, but that doesn't update the record if there are differences in the defaults.
update_or_create should provide the behavior you're looking for.
Item.objects.update_or_create(
    id=1,
    defaults=fields,
)
It returns the same (object, created) tuple as get_or_create.
Note that this will still perform two queries, but only in the event the record does not already exist (as is the case with get_or_create). If that is for some reason unacceptable, you will likely be stuck writing raw SQL to handle this, which would be unfortunate in terms of readability and maintainability.
I think get_or_create() is still the answer, but only specify the pk field(s).
item, _ = Item.objects.get_or_create(id=1)
item.update(**fields)
item.save()
Django 4.1 added support for the INSERT ... ON DUPLICATE KEY UPDATE query through bulk_create(). It updates the given fields when the unique validation would otherwise fail.
Example of the above in a single query:
# Let's say we have an Item model with a unique constraint on key
items = [
    Item(key='foobar', value=10),
    Item(key='foobaz', value=20),
]
# this call will create 2 rows in a single SQL query
Item.objects.bulk_create(items)

# this time it will update the value for foobar
# and create a new row for barbaz,
# all in a single SQL query
items = [
    Item(key='foobar', value=30),
    Item(key='barbaz', value=50),
]
Item.objects.bulk_create(
    items,
    update_conflicts=True,
    update_fields=['value'],
)
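One caveat the answer doesn't mention (taken from the Django 4.1+ docs for bulk_create, assuming the same Item model): on PostgreSQL and SQLite you must also name the conflict target via unique_fields, while MySQL infers it from the table's unique indexes:

Item.objects.bulk_create(
    items,
    update_conflicts=True,
    unique_fields=['key'],  # required on PostgreSQL/SQLite; omit on MySQL
    update_fields=['value'],
)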

How to update multiple records using peewee

I'm using Peewee with a Postgres database. I want to know how to update multiple records in a table at once.
I can perform this update with plain SQL commands, and I'm looking for the equivalent Peewee approach.
Yes, you can use the insert_many() function:
Insert multiple rows at once. The rows parameter must be an iterable
that yields dictionaries. As with insert(), fields that are not
specified in the dictionary will use their default value, if one
exists.
Example:
usernames = ['charlie', 'huey', 'peewee', 'mickey']
row_dicts = ({'username': username} for username in usernames)
# Insert 4 new rows.
User.insert_many(row_dicts).execute()
More details at: http://docs.peewee-orm.com/en/latest/peewee/api.html#Model.insert_many
ORMs usually do not support bulk updates and you have to use custom SQL; you can see samples in this link (db.execute_sql).
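That said, Peewee 3.x does ship a bulk-update helper; a minimal sketch, assuming a hypothetical User model with a unique username and an email column:

from peewee import Model, CharField, PostgresqlDatabase

db = PostgresqlDatabase('mydb')  # assumed connection details

class User(Model):
    username = CharField(unique=True)
    email = CharField()

    class Meta:
        database = db

# Modify the instances in Python, then write all changes back at once;
# Peewee builds one UPDATE ... CASE statement per batch.
users = list(User.select())
for user in users:
    user.email = user.email.lower()
User.bulk_update(users, fields=[User.email], batch_size=50)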

Python, SQL: How to update multiple rows and columns in a single trip around the database?

I am implementing a relational database using SQLite interfaced with Python. My table consists of 5 attributes with around a million tuples.
To avoid a large number of database queries, I wish to execute a single query that updates 2 attributes of multiple tuples. These updated values depend on each tuple's primary key value and so are different for each tuple.
I am trying something like the following in Python 2.7:
stmt = 'UPDATE Users SET Userid (?,?), Neighbours (?,?) WHERE Username IN (?,?)'
cursor.execute(stmt, [(_id1, _Ngbr1, _name1), (_id2, _Ngbr2, _name2)])
In other words, I am trying to update the rows that have primary keys _name1 and _name2 by substituting the Neighbours and Userid columns with the corresponding values. Executing these statements returns the following error:
OperationalError: near "(": syntax error
I am reluctant to use executemany() because I want to reduce the number of trips across the database.
I have been struggling with this issue for a couple of hours now but couldn't figure out either the error or an alternative on the web. Please help.
If the column that is used to look up the row to update is properly indexed, then executing multiple UPDATE statements is likely to be more efficient than a single statement, because in the latter case the database would probably need to scan all rows.
Anyway, if you really want to do this, you can use CASE expressions (and explicitly numbered parameters, to avoid duplicates):
UPDATE Users
SET Userid = CASE Username
                 WHEN ?5 THEN ?1
                 WHEN ?6 THEN ?2
             END,
    Neighbours = CASE Username
                     WHEN ?5 THEN ?3
                     WHEN ?6 THEN ?4
                 END
WHERE Username IN (?5, ?6);
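For comparison, the indexed per-row approach collapses into a single executemany() call on the Python side; a sketch reusing the question's placeholder variables:

import sqlite3

conn = sqlite3.connect('mydb.sqlite')  # assumed database file
# One UPDATE per row, all submitted in a single executemany() call;
# with an index on Username this usually beats the big CASE expression.
conn.executemany(
    'UPDATE Users SET Userid = ?, Neighbours = ? WHERE Username = ?',
    [(_id1, _Ngbr1, _name1), (_id2, _Ngbr2, _name2)],
)
conn.commit()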

MySQLdb Python prevent duplicate and optimize muliple inserts

I wrote this Python script to import a specific xls file into MySQL. It works fine, but if it's run twice on the same data it will create duplicate entries. I'm pretty sure I need to use a MySQL JOIN but I'm not clear on how to do that. Also, is executemany() going to have the same overhead as doing inserts in a loop? I'm obviously trying to avoid that.
Here's the code in question...
for row in range(sheet.nrows):
    # name is in the 0th col; email is in the 4th col
    name = sheet.cell(row, 0).value
    email = sheet.cell(row, 4).value
    if name and email:
        mailing_list[name.lstrip()] = email.strip()

for n, e in sorted(mailing_list.iteritems()):
    rows.append((n, e))

db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()
cursor.executemany("""
    INSERT IGNORE INTO mailing_list (name, email) VALUES (%s, %s)""", rows)
CLARIFICATION...
I read here that...
To be sure, executemany() is effectively the same as simple iteration.
However, it is typically faster. It provides an optimized means of
affecting INSERT and REPLACE across multiple rows.
Also, I took Unode's suggestion and used the UNIQUE constraint. But the IGNORE keyword is better than ON DUPLICATE KEY UPDATE because I want it to fail silently.
TL;DR
1. What's the best way to prevent duplicate inserts?
ANSWER 1: a UNIQUE constraint on the column, with INSERT IGNORE to fail silently, or ON DUPLICATE KEY UPDATE to update the existing row instead.
2. Is executemany() as expensive as INSERT in a loop?
@Unode says it's not, but my research tells me otherwise. I would like a definitive answer.
3. Is this the best way, or is it going to be really slow with bigger tables, and how would I test to be sure?
1 - What's the best way to prevent duplicate inserts?
Depending on what "preventing" means in your case, you have two strategies and one requirement.
The requirement is that you add a UNIQUE constraint on the column/columns that you want to be unique. This alone will cause an error if insertion of a duplicate entry is attempted. However, given that you are using executemany, the outcome may not be what you would expect.
Then as strategies you can do:
An initial filter step, by running a SELECT statement beforehand. This means running one SELECT statement per item in your rows to check whether it already exists. This strategy works but is inefficient.
Using ON DUPLICATE KEY UPDATE. This automatically triggers an update if the data already exists. For more information refer to the official documentation.
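Applied to the script above, the second strategy is a small change to the query string; a sketch keeping the question's mailing_list table and rows list:

cursor.executemany("""
    INSERT INTO mailing_list (name, email) VALUES (%s, %s)
    ON DUPLICATE KEY UPDATE email = VALUES(email)""", rows)
db.commit()  # MySQLdb does not autocommit by default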
2 - Is executemany() as expensive as INSERT in a loop?
No, executemany builds one bulk query that inserts all rows at once, while a for loop will create as many queries as there are elements in your rows.
