We're experiencing some slowdown and frustrating database lockups with our current solution, which essentially consists of calling stored procedures on an MSSQL server to manipulate data. If two or more users try to hit the same table simultaneously, one is locked out and their request fails.
The proposed solution to this problem was to bring the data into python using sqlalchemy and perform any manipulations/calculations on it in dataframes. This worked, but was incredibly slow because of the network calls to the DB.
Is there a better solution which can support multiple concurrent users, without causing too much of a slowdown?
You can use the NOLOCK table hint in your stored procedure to work around this problem. In the stored procedure, write WITH (NOLOCK) after each table name you query. I hope it will work for you.
e.g.
select * from tablename1 t1 with (nolock)
join tablename2 t2 with (nolock) on t2.id = t1.id
You can alter the webapp to check whether the proc is already running, and either abort the run event (on click) or proactively prevent it by disabling the button altogether (with a timer to re-enable it?).
SELECT *
FROM (SELECT * FROM sys.dm_exec_requests WHERE sql_handle IS NOT NULL) A
CROSS APPLY sys.dm_exec_sql_text(A.sql_handle) T
WHERE T.text LIKE 'dbo.naughty_naughty_proc_name%'
Then perhaps alter the proc as a safeguard to prevent multiple instances, using sp_getapplock.
I would not blindly change to READ UNCOMMITTED as your isolation level. In my opinion that is very bad advice when we don't have any context about how important this system/data is, and you clearly state the data is "being manipulated". You really need to understand the data, and the system you're impacting, before doing this!
Some reading:
https://www.mssqltips.com/sqlservertip/3202/prevent-multiple-users-from-running-the-same-sql-server-stored-procedure-at-the-same-time/
Why use a READ UNCOMMITTED isolation level?
I'm using an SQLite database with peewee on multiple machines, and I'm encountering various OperationalError and DatabaseError exceptions. It's obviously a multithreading problem, but I'm not at all an expert in this, nor in SQL. Here's my setup and what I've tried.
Settings
I'm using peewee to log machine learning experiments. Basically, I have multiple nodes (as in, different computers) which run a python file and all write to the same base.db file in a shared location. On top of that, I need single read access from my laptop to see what's going on. There are at most ~50 different nodes which instantiate the database and write things to it.
What I've tried
At first, I used the SQLite object:
db = pw.SqliteDatabase(None)

# ... Define tables Experiment and Epoch

def init_db(file_name: str):
    db.init(file_name)
    db.create_tables([Experiment, Epoch], safe=True)
    db.close()

def train():
    xp = Experiment.create(...)
    # Do stuff
    with db.atomic():
        Epoch.bulk_create(...)
    xp.save()
This worked fine, but I sometimes had jobs crash because the database was locked. Then I learnt that SQLite only handles one writer at a time, which caused the problem.
So I turned to SqliteQueueDatabase since, according to the documentation, it's useful "if you want simple read and write access to a SQLite database from multiple threads." I also added the keywords I found in another thread, which were said to be useful.
The code then looked like this:
db = SqliteQueueDatabase(None, autostart=False,
                         pragmas=[('journal_mode', 'wal')],
                         use_gevent=False)

def init_db(file_name: str):
    db.init(file_name)
    db.start()
    db.create_tables([Experiment, Epoch], safe=True)
    db.connect()
and the same for saving stuff, except for the db.atomic part. However, not only do write queries seem to encounter errors, but I practically no longer have read access to the database: it is almost always busy.
My question
What is the right object to use in this case? I thought SqliteQueueDatabase was the perfect fit. Are pooled databases a better fit? I'm also asking this because I don't know if I have a good grasp of the threading part: having multiple database objects initialized from multiple machines is different from having a single object on a single machine with multiple threads (like this situation), right? Is there a good way to handle things then?
Sorry if this question is already answered in another place, and thanks for any help! Happy to provide more code if needed of course.
SQLite only supports a single writer at a time, but multiple readers can have the db open (even while a writer is connected) when using WAL mode. For peewee you can enable WAL mode:
db = SqliteDatabase('/path/to/db', pragmas={'journal_mode': 'wal'})
The other crucial thing, when using multiple writers, is to keep your write transactions as short as possible. Some suggestions can be found here: https://charlesleifer.com/blog/going-fast-with-sqlite-and-python/ under the "Transactions, Concurrency and Autocommit" heading.
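As a minimal sketch of both points, here is the raw-sqlite3 equivalent (file name, table, and columns are made up for illustration): WAL mode is enabled once, and each write transaction is opened and committed immediately, so a second connection can keep reading.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "base.db")  # hypothetical shared file

# Writer connection: enable WAL once, set a busy timeout for contended writes.
writer = sqlite3.connect(path, timeout=5)
writer.execute("PRAGMA journal_mode=wal")
writer.execute("CREATE TABLE epoch (id INTEGER PRIMARY KEY, loss REAL)")
writer.commit()

# Keep each write transaction short: begin, write, commit immediately.
with writer:  # the connection context manager commits on exit
    writer.execute("INSERT INTO epoch (loss) VALUES (?)", (0.5,))

# A second connection can read concurrently while WAL mode is active.
reader = sqlite3.connect(path, timeout=5)
print(reader.execute("SELECT COUNT(*) FROM epoch").fetchone()[0])  # 1
```

The same shape applies under peewee; the key point is that no write transaction stays open longer than the statements it wraps.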
Also note that SqliteQueueDatabase works well for a single process with multiple threads, but will not help you at all if you have multiple processes.
Indeed, after @BoarGules' comment, I realized that I had confused two very different things:
Having multiple threads on a single machine: here, SqliteQueueDatabase is a very good fit
Having multiple machines, with one or more threads each: that's basically how the internet works.
So I ended up installing Postgres. A few links, if they can be useful to people coming after me, for Linux:
Install Postgres. You can build it from source if you don't have root privileges, following chapter 17 of the official documentation, then chapter 19.
You can import an SQLite database with pgloader. But again, if you don't have the right libraries and don't want to build everything, you can do it by hand. I did the following; I'm not sure whether a more straightforward solution exists.
Export your tables as csv (following @coleifer's comment):
models = [Experiment, Epoch]
for model in models:
    outfile = '%s.csv' % model._meta.table_name
    with open(outfile, 'w', newline='') as f:
        writer = csv.writer(f)
        row_iter = model.select().tuples().iterator()
        writer.writerows(row_iter)
Create the tables in the new Postgres database:
db = pw.PostgresqlDatabase('mydb', host='localhost')
db.create_tables([Experiment, Epoch], safe=True)
Copy the CSV tables into the Postgres db with the following command:
COPY epoch("col1", "col2", ...) FROM '/absolute/path/to/epoch.csv' DELIMITER ',' CSV;
and likewise for the other tables.
It worked fine for me, as I had only two tables. It can be annoying if you have more than that; pgloader seems a very good solution in that case, if you can install it easily.
Update
I could not create objects from peewee at first: I got an integrity error. It seemed that the id returned by Postgres (with the RETURNING 'epoch'.'id' clause) was an already existing id. From my understanding, this was because the sequence had not been incremented when using the COPY command, so it returned id 1, then 2, and so on, until it reached a non-existing id. To avoid going through all these failed creations, you can directly edit the sequence behind the RETURNING clause, with:
ALTER SEQUENCE epoch_id_seq RESTART WITH 10000
and replace 10000 with the value of SELECT MAX("id") FROM epoch, plus 1.
I think you can just increase the timeout for sqlite and fix your problem.
The issue here is that the default sqlite timeout for writing is low, and when there is even a small amount of concurrent writing, sqlite will start throwing exceptions. This is common and well known.
The default should be something like 5-10 seconds. If you exceed this timeout then either increase it or chunk up your writes to the db.
Here is an example:
I return a DatabaseProxy here because this proxy allows sqlite to be swapped out for postgres without changing client code.
import atexit

from peewee import DatabaseProxy  # type: ignore
from playhouse.db_url import connect  # type: ignore
from playhouse.sqlite_ext import SqliteExtDatabase  # type: ignore

DB_TIMEOUT = 5

def create_db(db_path: str) -> DatabaseProxy:
    pragmas = (
        # Negative size is per api spec.
        ("cache_size", -1024 * 64),
        # wal speeds up writes.
        ("journal_mode", "wal"),
        ("foreign_keys", 1),
    )
    sqlite_db = SqliteExtDatabase(
        db_path,
        timeout=DB_TIMEOUT,
        pragmas=pragmas)
    sqlite_db.connect()
    atexit.register(sqlite_db.close)
    db_proxy: DatabaseProxy = DatabaseProxy()
    db_proxy.initialize(sqlite_db)
    return db_proxy
I have three programs running, one of which iterates over a table in my database non-stop (over and over again in a loop), just reading from it, using a SELECT statement.
The other programs have a line where they insert a row into the table and a line where they delete it. The problem is that I often get the error sqlite3.OperationalError: database is locked.
I'm trying to find a solution, but I don't understand the exact source of the problem (is reading and writing at the same time what makes this error occur? Or the writing and deleting? Maybe both aren't supposed to work).
Either way, I'm looking for a solution. If it were a single program, I could coordinate the database I/O with mutexes and other multithreading tools, but it's not. How can I wait until the database is unlocked for reading/writing/deleting without using too much CPU?
You need to switch databases. I would use the following:
postgresql as the database
psycopg2 as the driver
The syntax is fairly similar to SQLite's, and the migration shouldn't be too hard for you.
I'm trying to do a large-scale bulk insert into a sqlite database with peewee. I'm using atomic but the performance is still terrible. I'm inserting rows in blocks of ~2500, and due to the SQLITE_MAX_VARIABLE_NUMBER limit I'm inserting about 200 of them at a time. Here is the code:
with helper.db.atomic():
    for i in range(0, len(expression_samples), step):
        gtd.GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()
And the list expression_samples is a list of dictionaries with the appropriate fields for the GeneExpressionRead model. I've timed this loop, and it takes anywhere from 2 to 8 seconds to execute. I have millions of rows to insert, and the way my code is written now it will likely take two days to complete. As per this post, there are several pragmas that I have set in order to improve performance. That also didn't really change anything for me performance-wise. Lastly, as per this test on the peewee github page, it should be possible to insert many rows very fast (~50,000 in 0.3364 seconds), but it also seems that the author used raw sql code to get this performance. Has anyone been able to do such a high-performance insert using peewee methods?
Edit: Did not realize that the test on peewee's github page was for MySQL inserts. May or may not apply to this situation.
Mobius was trying to be helpful in the comments but there's a lot of misinformation in there.
Peewee creates indexes for foreign keys when you create the table. This happens for all database engines currently supported.
Turning on the foreign-key PRAGMA is going to slow things down; why would it be otherwise?
For best performance, do not create any indexes on the table you are bulk-loading into. Load the data, then create the indexes. This is much, much less work for the database.
As you noted, disabling auto increment for the bulk-load speeds things up.
Other information:
Use PRAGMA journal_mode=wal;
Use PRAGMA synchronous=0;
Use PRAGMA locking_mode=EXCLUSIVE;
Those are some good settings for loading in a bunch of data. Check the sqlite docs for more info:
http://sqlite.org/pragma.html
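A sketch of those settings applied through the stdlib sqlite3 module (table, columns, and row counts are made up; the journal-mode pragma only matters for a file-backed database and is shown here for completeness): load the data first, then build the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Settings suggested above for bulk loading.
conn.execute("PRAGMA journal_mode=wal")
conn.execute("PRAGMA synchronous=0")
conn.execute("PRAGMA locking_mode=EXCLUSIVE")

# Load the data with no secondary indexes in place ...
conn.execute("CREATE TABLE reads (id INTEGER PRIMARY KEY, sample TEXT, value REAL)")
rows = [("s%d" % i, i * 0.1) for i in range(10000)]
with conn:  # one transaction around the whole batch
    conn.executemany("INSERT INTO reads (sample, value) VALUES (?, ?)", rows)

# ... and create the indexes afterwards.
conn.execute("CREATE INDEX idx_reads_sample ON reads (sample)")
print(conn.execute("SELECT COUNT(*) FROM reads").fetchone()[0])  # 10000
```

Under peewee the same pragmas can be passed via the database's pragmas argument, as in the timeout example elsewhere on this page.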
In all of the documentation where atomic appears as a context manager, it's used as a function call (atomic()). Since it sounds like your code never exits the with block, you're probably not seeing the error you would otherwise get about atomic not having an __exit__ method.
Can you try with helper.db.atomic():?
atomic() is starting a transaction. Without an open transaction, inserts are much slower because some expensive book keeping has to be done for every write, as opposed to only at the beginning and end.
EDIT
Since the code in the question was changed, can I have some more information about the table you're inserting into? How large is it, and how many indices are there?
Since this is SQLite, you're just writing to a file, but do you know if that file is on a local disk or on a network-mounted drive? I've had issues just like this because I was trying to insert into a database on an NFS.
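The effect of the open transaction that atomic() provides can be sketched with the stdlib sqlite3 module (table name and row counts are made up): in autocommit mode every insert is its own transaction with its own bookkeeping, while one explicit transaction pays that cost once. On a file-backed database the gap is far larger, since each autocommit insert also forces a sync.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE t (v INTEGER)")

# One implicit transaction per statement.
start = time.perf_counter()
for i in range(1000):
    conn.execute("INSERT INTO t (v) VALUES (?)", (i,))
per_row = time.perf_counter() - start

# One explicit transaction around all the inserts.
start = time.perf_counter()
conn.execute("BEGIN")
for i in range(1000):
    conn.execute("INSERT INTO t (v) VALUES (?)", (i,))
conn.execute("COMMIT")
one_txn = time.perf_counter() - start

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 2000
```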
I used to be able to run and execute python using simply an execute statement. This would insert the values 1, 2 into columns A, B respectively. But starting last week, I get no error, yet nothing happens in my database. No flag, nothing... 1, 2 don't get inserted or replaced into my table.
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
I finally found an article saying that I need commit() if I have lost the connection to the server. So I have added:
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
connect.commit()
Now it works, but I just want to understand it a little bit: why do I need this if I know my connection did not get lost?
New to python - Thanks.
This isn't a Python or ODBC issue, it's a relational database issue.
Relational databases generally work in terms of transactions: any time you change something, a transaction is started and is not ended until you either commit or rollback. This allows you to make several changes serially that appear in the database simultaneously (when the commit is issued). It also allows you to abort the entire transaction as a unit if something goes awry (via rollback), rather than having to explicitly undo each of the changes you've made.
You can make this functionality transparent by turning auto-commit on, in which case a commit will be issued after each statement, but this is generally considered a poor practice.
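A minimal sketch of this with the stdlib sqlite3 module (file and table names are made up): a change made on one connection stays invisible to a second connection until commit is issued.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

a = sqlite3.connect(path)
a.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
a.commit()

# The REPLACE opens a transaction on connection `a`, not yet committed.
a.execute("REPLACE INTO t (a, b) VALUES (1, 2)")

b = sqlite3.connect(path)
print(b.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 0: not yet visible

a.commit()
print(b.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 1: visible after commit
```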
Not committing puts all your queries into one transaction, which is safer (and possibly better performance-wise) when the queries are related to each other. What if the power goes out between two queries that don't make sense independently, for instance when transferring money from one account to another using two UPDATE queries?
You can set autocommit to true if you don't want this behaviour, but there aren't many reasons to do that.
I'm trying to write a loader for sqlite that will load simple rows into the DB as fast as possible.
Input data looks like rows retrieved from a postgres DB. The approximate number of rows that will go into sqlite: from 20 to 100 million.
I cannot use any DB other than sqlite due to project restrictions.
My question is:
what is the proper logic for writing such a loader?
At first I tried to write a set of encapsulated generators that would take one row from Postgres, slightly amend it, and put it into sqlite. I ended up creating a separate sqlite connection and cursor for each row, and that looked awful.
At the second try, I moved the sqlite connection and cursor out of the generator, into the body of the script, and it became clear that I was not committing data to sqlite until I had fetched and processed all 20 million records. That could possibly crash all my hardware.
At the third try, I started to consider keeping the sqlite connection away from the loops, but creating/closing the cursor each time I process and push one row to sqlite. This is better, but I think it still has some overhead.
I also considered playing with transactions: one connection, one cursor, one transaction, and commit called in the generator each time a row is pushed to sqlite. Is this the right way to go?
Is there some widely used pattern for writing such a component in python? Because I feel as if I am reinventing the wheel.
SQLite can handle huge transactions with ease, so why not commit at the end? Have you tried this at all?
If you do feel one transaction is a problem, why not commit every n insertions? Process rows one by one, insert as needed, but every n executed insertions add a connection.commit() to spread the load.
See my previous answer about bulk inserts and SQLite. Possibly my answer here as well.
A question: do you control the SQLite database? There are compile-time options related to cache sizes, etc. that you can tweak for your purposes as well.
In general, the steps in #1 are going to get you the biggest bang for your buck.
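The commit-every-n suggestion above can be sketched like this (table name, column, and chunk size are made up):

```python
import sqlite3

def load_rows(conn, rows, chunk=1000):
    """Insert rows one by one, committing every `chunk` insertions."""
    cur = conn.cursor()
    for i, row in enumerate(rows, 1):
        cur.execute("INSERT INTO samples (v) VALUES (?)", row)
        if i % chunk == 0:
            conn.commit()  # spread the load across several transactions
    conn.commit()          # commit whatever is left of the final chunk

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (v INTEGER)")
load_rows(conn, [(i,) for i in range(2500)], chunk=1000)
print(conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0])  # 2500
```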
Finally I managed to resolve my problem. The main issue was the excessive number of insertions into sqlite. After I started loading all the data from postgres into memory and aggregating it properly to reduce the number of rows, I was able to decrease processing time from 60 hrs to 16 hrs.
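That aggregate-first approach can be sketched with made-up data: collapse the source rows in memory, then write the much smaller result set in one transaction.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw rows streamed from the source DB: (key, value) pairs.
raw_rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Aggregate in memory first, so far fewer INSERTs hit sqlite.
totals = defaultdict(int)
for key, value in raw_rows:
    totals[key] += value

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agg (key TEXT PRIMARY KEY, total INTEGER)")
with conn:  # one transaction for the whole (reduced) result set
    conn.executemany("INSERT INTO agg (key, total) VALUES (?, ?)",
                     sorted(totals.items()))
print(conn.execute("SELECT total FROM agg WHERE key = 'a'").fetchone()[0])  # 9
```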