SQLAlchemy rollback two transactions - python

I'm writing a python script to process some csv data and put it into a sqlite db which I'm accessing through sqlalchemy.
The calculations are currently implemented in two parts. The second part depends on results of part one already existing in the database. Rewriting the script from scratch to resolve this dependency would be a pain and I'd like to avoid it.
def part_one():
    # does stuff
    session.commit()

def part_two():
    # does stuff, including querying part_one's results
    # sometimes this function fails and rolls back
    session.commit()
If part_two fails, I want to rollback part_two AND part_one.
Since part_two depends on data existing in the db, I think I'm forced to commit in part_one. Otherwise I could obviously just reuse the same session and roll everything back together.
I tried messing about with session.begin_nested but didn't get anywhere with that. Is there a way to achieve what I'm trying to do? I need to either be able to session.query against uncommitted changes (that doesn't seem possible) or roll back a previously successfully committed transaction.

Ok, I made this much more complicated than it needed to be. What I was looking for was apparently session.flush, which issues all the inserts/updates/deletes of part_one without committing anything.
def part_one():
    # does stuff
    session.flush()

def part_two():
    # does stuff, including querying part_one's results
    # sometimes this function fails and rolls back
    session.commit()
Works like a charm
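For context, a minimal sketch of how the flush-then-commit pattern lets a failure in part_two undo part_one's work as well. The engine/session setup below is assumed, not part of the original post, and part_one/part_two are expected to use this same session object:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("sqlite:///data.db")  # assumed database URL
session = sessionmaker(bind=engine)()        # the session shared by part_one/part_two

try:
    part_one()   # issues its INSERTs/UPDATEs via session.flush(); nothing is committed yet
    part_two()   # can query part_one's flushed rows; its commit() makes both parts permanent
except Exception:
    session.rollback()  # discards part_one's flushed changes along with part_two's
    raise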

Related

Multiple write single read SQLite application with Peewee

I'm using an SQLite database with peewee on multiple machines, and I'm encountering various OperationalError and DatabaseError exceptions. It's obviously a problem of multithreading, but I'm not at all an expert with this nor with SQL. Here's my setup and what I've tried.
Settings
I'm using peewee to log machine learning experiments. Basically, I have multiple nodes (like, different computers) which run a python file, and all write to the same base.db file in a shared location. On top of that, I need a single read access from my laptop, to see what's going on. There are at most ~50 different nodes which instantiate the database and write things on it.
What I've tried
At first, I used the SqliteDatabase object:
db = pw.SqliteDatabase(None)

# ... Define tables Experiment and Epoch

def init_db(file_name: str):
    db.init(file_name)
    db.create_tables([Experiment, Epoch], safe=True)
    db.close()

def train():
    xp = Experiment.create(...)
    # Do stuff
    with db.atomic():
        Epoch.bulk_create(...)
        xp.save()
This worked fine, but I sometimes had jobs which crashed because of the database being locked. Then I learnt that SQLite only handles one write operation at a time, which caused the problem.
So I turned to SqliteQueueDatabase as, according to the documentation, it's useful "if you want simple read and write access to a SQLite database from multiple threads." I also added some keywords I found on another thread which were said to be useful.
The code then looked like this:
db = SqliteQueueDatabase(None, autostart=False,
                         pragmas=[('journal_mode', 'wal')],
                         use_gevent=False)

def init_db(file_name: str):
    db.init(file_name)
    db.start()
    db.create_tables([Experiment, Epoch], safe=True)
    db.connect()
and the same for saving, except without the db.atomic part. However, not only do write queries seem to encounter errors, but I also practically no longer have read access to the database: it is almost always busy.
My question
What is the right object to use in this case? I thought SqliteQueueDatabase was the perfect fit. Are pooled databases a better fit? I'm also asking this because I don't know if I have a good grasp of the threading part: the fact that multiple database objects are initialized from multiple machines is different from having a single object on a single machine with multiple threads (like this situation). Right? Is there a good way to handle things then?
Sorry if this question is already answered in another place, and thanks for any help! Happy to provide more code if needed of course.
SQLite only supports a single writer at a time, but multiple readers can have the db open (even while a writer is connected) when using WAL mode. For peewee you can enable WAL mode:
db = SqliteDatabase('/path/to/db', pragmas={'journal_mode': 'wal'})
The other crucial thing, when using multiple writers, is to keep your write transactions as short as possible. Some suggestions can be found here: https://charlesleifer.com/blog/going-fast-with-sqlite-and-python/ under the "Transactions, Concurrency and Autocommit" heading.
Also note that SqliteQueueDatabase works well for a single process with multiple threads, but will not help you at all if you have multiple processes.
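As a rough illustration of keeping write transactions short, do the computation outside the atomic block and hold the write lock only around the actual inserts. The model names are borrowed from the question; the rest (timeout value, batch size, helper function) is assumed for the sketch:

import peewee as pw

db = pw.SqliteDatabase('/path/to/db', pragmas={'journal_mode': 'wal'},
                       timeout=10)  # wait up to 10 seconds for a competing writer

def save_results(xp, epoch_rows):
    # Prepare all data outside the transaction...
    with db.atomic():  # ...and hold the write lock only for the inserts themselves
        Epoch.bulk_create(epoch_rows, batch_size=100)
        xp.save()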
Indeed, after @BoarGules' comment, I realized that I had confused two very different things:
Having multiple threads on a single machine: here, SqliteQueueDatabase is a very good fit
Having multiple machines, with one or more threads: that's basically how the internet works.
So I ended up installing Postgres. A few links, in case it is useful to people coming after me, for Linux:
Install Postgres. You can build it from source if you don't have root privileges, following Chapter 17 of the official documentation, then Chapter 19.
You can export an SQLite database with pgloader. But again, if you don't have the right libraries and don't want to build everything, you can do it by hand. I did the following; I'm not sure whether a more straightforward solution exists.
Export your tables as CSV (following @coleifer's comment):
import csv

models = [Experiment, Epoch]
for model in models:
    outfile = '%s.csv' % model._meta.table_name
    with open(outfile, 'w', newline='') as f:
        writer = csv.writer(f)
        row_iter = model.select().tuples().iterator()
        writer.writerows(row_iter)
Create the tables in the new Postgres database:
db = pw.PostgresqlDatabase('mydb', host='localhost')
db.create_tables([Experiment, Epoch], safe=True)
Copy the CSV tables into the Postgres db with the following command:
COPY epoch("col1", "col2", ...) FROM '/absolute/path/to/epoch.csv' DELIMITER ',' CSV;
and likewise for the other tables.
It worked fine for me, as I had only two tables. It can be annoying if you have more than that; pgloader seems a very good solution in that case, if you can install it easily.
Update
I could not create objects from peewee at first. I got an integrity error: the id returned by Postgres (via the RETURNING 'epoch'.'id' clause) was an already existing id. From my understanding, this was because the sequence had not been incremented when using the COPY command. Thus it returned id 1, then 2, and so on until it reached a non-existing id. To avoid going through all these failed creations, you can directly reset the sequence behind the RETURNING clause with:
ALTER SEQUENCE epoch_id_seq RESTART WITH 10000
replacing 10000 with the value of SELECT MAX("id") FROM epoch, plus 1.
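If you prefer to do this from peewee rather than psql, here is a hedged sketch. It assumes the sequence name epoch_id_seq from above, the Epoch model from the question, and the db handle created for Postgres:

import peewee as pw

# Highest existing id; 0 if the table is empty.
max_id = Epoch.select(pw.fn.MAX(Epoch.id)).scalar() or 0

# ALTER SEQUENCE cannot take a bound parameter, so the value is interpolated directly.
db.execute_sql("ALTER SEQUENCE epoch_id_seq RESTART WITH %d" % (max_id + 1))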
I think you can just increase the timeout for SQLite and fix your problem.
The issue here is that the default SQLite timeout for writing is low, and when there are even small amounts of concurrent writes, SQLite will start throwing exceptions. This is common and well known.
The default should be something like 5-10 seconds. If you exceed this timeout then either increase it or chunk up your writes to the db.
Here is an example:
I return a DatabaseProxy here because this proxy allows sqlite to be swapped out for postgres without changing client code.
import atexit
from peewee import DatabaseProxy  # type: ignore
from playhouse.db_url import connect  # type: ignore
from playhouse.sqlite_ext import SqliteExtDatabase  # type: ignore

DB_TIMEOUT = 5

def create_db(db_path: str) -> DatabaseProxy:
    pragmas = (
        # Negative size is per api spec.
        ("cache_size", -1024 * 64),
        # wal speeds up writes.
        ("journal_mode", "wal"),
        ("foreign_keys", 1),
    )
    sqlite_db = SqliteExtDatabase(
        db_path,
        timeout=DB_TIMEOUT,
        pragmas=pragmas)
    sqlite_db.connect()
    atexit.register(sqlite_db.close)
    db_proxy: DatabaseProxy = DatabaseProxy()
    db_proxy.initialize(sqlite_db)
    return db_proxy
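Hypothetical usage of the factory above: initialize the proxy once at startup and point your models at it (the field shown here is made up for illustration, not taken from the question):

import peewee as pw

database = create_db("base.db")  # returns the initialized DatabaseProxy

class Experiment(pw.Model):
    name = pw.CharField()  # illustrative field only

    class Meta:
        database = database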

Is there any order for add versus delete when committing in sqlalchemy

I'm adding a bunch of entries to a table in sqlalchemy. If an entry already exists, based on some key, I delete the row in the table and then add the "updated" entry. After finishing deleting and adding all entries I commit the session. However during testing the commit fails due to a unique constraint failure. My understanding from this error is that I'm trying to add the updated entry before deleting the old entry. If I delete the old entry, then commit, then add, everything works ok.
So my question is, does sqlalchemy have a defined order of operations for deleting and adding? Is it possible to change this order? Looking through my code, I noticed the object I'm adding is instantiated twice but only added once (see below) - maybe that's a problem (but not sure why it would be).
I also don't want the commit() because I only want to commit if I get through adding/updating all of the entries.
# Inside a loop
# -----------------------
temp_doc = Document(doc)
dirty_doc = session.query(Document).filter(Document.local_id == temp_doc.local_id).first()

# other non-relevant code here ...
session.delete(dirty_doc)

# This seems to be needed but I wouldn't expect it to be
session.commit()

# Later on in the code ...
if add_new_doc:
    temp_doc = Document(doc)
    session.add(temp_doc)

# Outside the loop
# -----------------------------
session.commit()
session.close()
A similar yet different question was asked here regarding whether order was maintained within objects that were added:
Does SQLAlchemy save order when adding objects to session?
This prompted looking at the session code, since I haven't seen any documentation on flush behavior.
Session code link:
https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/orm/session.py
Search for def _flush(self
In the code it looks like adds and updates are done before deletes, which would explain why I'm running into my problem.
As a fix, I'm now flushing instead of committing inside the loop, which seems to solve the problem. I'm assuming flush order is maintained, i.e. commands that are flushed first get committed/saved first.
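A hedged sketch of that flush-based fix (docs and add_new_doc are assumed inputs, and the duplicate instantiation from the question is dropped): the DELETE is sent to the database before the replacement row is added, but nothing is committed until the loop finishes:

for doc in docs:
    temp_doc = Document(doc)
    dirty_doc = (session.query(Document)
                        .filter(Document.local_id == temp_doc.local_id)
                        .first())
    if dirty_doc is not None:
        session.delete(dirty_doc)
        session.flush()  # emit the DELETE now, inside the still-open transaction
    session.add(temp_doc)

session.commit()  # either every document is saved or, on error, none are
session.close()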

Is a Rollback of my database possible in py2neo after execution of program?

I am trying to run a cypher query in py2neo and overcome some restrictions.
I actually want to add some weights to the edges of my graph for a specific run, but after the program finishes I don't want the changes to remain in my Neo4j DB (rollback).
I want this so that I can run the program/query with the edge weights as parameters every time.
Thanks in advance!
It depends on timing. I believe py2neo uses the transactional Cypher HTTP endpoint. Rollback is possible before a transaction has been committed or finished, not after.
So let's say you're running your cypher query and doing other things at the same time.
tx = graph.cypher.begin()
statement = "some nifty mutating cypher in here"
tx.append(statement)
tx.commit()
By the time you hit commit, you're done. The database doesn't work like git, where you can undo any past change or revert the database to its state at a certain time. Generally you're creating transactions, then you're committing them, or rolling them back.
So, you can roll back a transaction if you have not yet finished/committed it. This is useful because your transaction might include 8-9 different queries, all making changes. What if the first 4 succeed, then #5 fails? That's really what transaction rollback is meant to address: to turn a big multi-query change into something atomic, that either all works, or all doesn't.
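If your program can do all of its reads inside the same open transaction, one option is to roll back instead of committing once you're done. A hedged sketch in the py2neo 2.x-style API used in the snippet above (the query, label and parameter name are placeholders):

tx = graph.cypher.begin()
tx.append("MATCH (a:Node) SET a.weight = {w} RETURN a", {"w": 42})
results = tx.process()  # executes the statements inside the still-open transaction
# ... use `results` for this run ...
tx.rollback()           # discard the weight changes instead of committing them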
If this isn't what you want, then you should probably formulate a second cypher query which undoes whatever changes you're making to your weights. An example might be a pair of queries like this:
MATCH (a:Node)
SET a.old_weight=a.weight
WITH a
SET a.weight={myNewValue}
RETURN a;
Then undo it with:
MATCH (a:Node)
SET a.weight=a.old_weight
WITH a
REMOVE a.old_weight
RETURN a;
Here's further documentation on transactions from the java API that describes a bit more how they work.

Can anyone tell me what's the point of connection.commit() in Python pyodbc?

I used to be able to run and execute Python using simply an execute statement. This will insert the values 1, 2 into columns A, B accordingly. But starting last week, I got no error, yet nothing happened in my database. No flag, nothing... 1, 2 didn't get inserted or replaced into my table.
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
I finally found an article saying that I need commit() if I have lost the connection to the server. So I added:
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
connect.commit()
Now it works, but I just want to understand it a little bit: why do I need this, if I know my connection did not get lost?
New to python - Thanks.
This isn't a Python or ODBC issue, it's a relational database issue.
Relational databases generally work in terms of transactions: any time you change something, a transaction is started and is not ended until you either commit or rollback. This allows you to make several changes serially that appear in the database simultaneously (when the commit is issued). It also allows you to abort the entire transaction as a unit if something goes awry (via rollback), rather than having to explicitly undo each of the changes you've made.
You can make this functionality transparent by turning auto-commit on, in which case a commit will be issued after each statement, but this is generally considered a poor practice.
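A minimal sketch of the usual pattern (the connection string is a placeholder, and the statement is taken from the question): commit when the whole unit of work succeeds, roll back otherwise:

import pyodbc

conn = pyodbc.connect("DSN=mydb")  # placeholder connection string; autocommit is off by default
cursor = conn.cursor()
try:
    cursor.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
    conn.commit()    # makes the change permanent and visible to other connections
except Exception:
    conn.rollback()  # undoes everything since the last commit
    raise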
Not committing puts all your queries into one transaction, which is safer (and possibly better performance-wise) when the queries are related to each other. What if the power goes out between two queries that don't make sense independently - for instance, transferring money from one account to another using two update queries?
You can set autocommit to true if you don't want this behaviour, but there are not many reasons to do that.

sqlite3: can I safely wait with commit()

I insert 10,000,000+ records into an SQLite database. Is it safe to wait with connection.commit() until I finish all the insertions? Or is there some out-of-memory/overflow risk? Is it possible that the uncommitted entries will use all the available memory and cause page swapping? Committing after each INSERT is a performance killer, so I want to avoid it.
I use the sqlite3 module with Python.
Well, in the first place, SQLite is not really meant to handle that much data. It is generally for smaller data sets. However, committing that much data might take longer when committing all at once than if you were to commit, say, after every 2 or 3 inserts.
As far as your memory/overflow risk goes, SQLite dumps everything into a file cache, then on commit puts all of the file cache into the SQLite db file.
However, committing all at once shouldn't give you any issues.
Committing all the insertions at once is a perfectly acceptable optimization. If an error should occur (which is unlikely) then it will occur within the client library, and at that point you can look into solving that issue. But worrying about that prematurely is unnecessary.
SQLite's rollback journal stores the old contents of all pages that were changed in the database file.
If you are only inserting data, you are not actually changing much data that already is in the database (only some management structures), so there will not accumulate much overhead data for the transaction.
(The situation would be different if you had enabled WAL mode.)
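A minimal sketch of the single-commit approach with the sqlite3 module (the table name and generated data are made up for illustration); the connection's context manager commits on success and rolls back if an exception is raised:

import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, value TEXT)")

with conn:  # one transaction: commits at the end, rolls back on error
    conn.executemany(
        "INSERT INTO records (value) VALUES (?)",
        (("row %d" % i,) for i in range(10_000_000)),
    )

conn.close()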
