I'm using an SQLite database with peewee on multiple machines, and I'm encountering various OperationalError and DatabaseError exceptions. It's obviously a concurrency problem, but I'm not at all an expert with this, nor with SQL. Here's my setup and what I've tried.
Settings
I'm using peewee to log machine learning experiments. Basically, I have multiple nodes (like, different computers) which run a python file, and all write to the same base.db file in a shared location. On top of that, I need a single read access from my laptop, to see what's going on. There are at most ~50 different nodes which instantiate the database and write things on it.
What I've tried
At first, I used the SqliteDatabase object:
db = pw.SqliteDatabase(None)

# ... Define tables Experiment and Epoch

def init_db(file_name: str):
    db.init(file_name)
    db.create_tables([Experiment, Epoch], safe=True)
    db.close()

def train():
    xp = Experiment.create(...)
    # Do stuff
    with db.atomic():
        Epoch.bulk_create(...)
        xp.save()
This worked fine, but I sometimes had jobs which crashed because the database was locked. Then, I learnt that SQLite only handles one write operation at a time, which caused the problem.
So I turned to SqliteQueueDatabase since, according to the documentation, it's useful "if you want simple read and write access to a SQLite database from multiple threads." I also added the keyword arguments I found in another thread, which were said to be useful.
The code then looked like this:
db = SqliteQueueDatabase(None, autostart=False, pragmas=[('journal_mode', 'wal')],
                         use_gevent=False)

def init_db(file_name: str):
    db.init(file_name)
    db.start()
    db.create_tables([Experiment, Epoch], safe=True)
    db.connect()
and the same for saving stuff, except for the db.atomic part. However, not only do write queries still encounter errors, I also practically no longer have read access to the database: it is almost always busy.
My question
What is the right object to use in this case? I thought SqliteQueueDatabase was the perfect fit. Are pooled databases a better fit? I'm also asking this question because I'm not sure I have a good grasp on the threading part: having multiple database objects initialized from multiple machines is different from having a single object on a single machine with multiple threads (like this situation), right? Is there a good way to handle things then?
Sorry if this question is already answered in another place, and thanks for any help! Happy to provide more code if needed of course.
SQLite only supports a single writer at a time, but multiple readers can have the db open (even while a writer is connected) when using WAL mode. For peewee you can enable WAL mode:
db = SqliteDatabase('/path/to/db', pragmas={'journal_mode': 'wal'})
The other crucial thing, when using multiple writers, is to keep your write transactions as short as possible. Some suggestions can be found here: https://charlesleifer.com/blog/going-fast-with-sqlite-and-python/ under the "Transactions, Concurrency and Autocommit" heading.
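For illustration, here is a minimal sketch of that advice, reusing the question's Experiment/Epoch models (run_training is a hypothetical stand-in for the slow ML work, and the busy_timeout value is just an example): enable WAL plus a busy timeout, and keep only the writes inside the transaction.

db = pw.SqliteDatabase('/path/to/base.db',
                       pragmas={'journal_mode': 'wal', 'busy_timeout': 10000})

def train(xp: Experiment):
    epochs = run_training(xp)   # slow work happens outside any transaction
    with db.atomic():           # short-lived write transaction holds the lock briefly
        Epoch.bulk_create(epochs)
        xp.save()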
Also note that SqliteQueueDatabase works well for a single process with multiple threads, but will not help you at all if you have multiple processes.
Indeed, after @BoarGules' comment, I realized that I confused two very different things:
Having multiple threads on a single machine: here, SqliteQueueDatabase is a very good fit
Having multiple machines, each with one or more threads: that's basically how the internet works.
So I ended up installing PostgreSQL. A few links, in case they are useful to people coming after me, for Linux:
Install PostgreSQL. You can build it from source if you don't have root privileges, following Chapter 17 of the official documentation, then Chapter 19.
You can export an SQLite database with pgloader. But again, if you don't have the right libraries and don't want to build everything, you can do it by hand. I did the following; I'm not sure whether a more straightforward solution exists.
Export your tables as CSV (following @coleifer's comment):
models = [Experiment, Epoch]
for model in models:
    outfile = '%s.csv' % model._meta.table_name
    with open(outfile, 'w', newline='') as f:
        writer = csv.writer(f)
        row_iter = model.select().tuples().iterator()
        writer.writerows(row_iter)
Create the tables in the new PostgreSQL database:
db = pw.PostgresqlDatabase('mydb', host='localhost')
db.create_tables([Experiment, Epoch], safe=True)
Copy the CSV files into the PostgreSQL database with the following command:
COPY epoch("col1", "col2", ...) FROM '/absolute/path/to/epoch.csv' DELIMITER ',' CSV;
and likewise for the other tables.
It worked fine for me, as I had only two tables. It can be annoying if you have more than that; pgloader seems a very good solution in that case, if you can install it easily.
Update
I could not create objects from peewee at first. I had integrity errors: the id returned by PostgreSQL (with the RETURNING "epoch"."id" clause) was an already existing id. From my understanding, this was because the sequence behind the id column is not advanced by the COPY command. Thus, it returned id 1, then 2, and so on, until it reached a non-existing id. To avoid going through all these failed creations, you can directly reset the sequence that feeds the RETURNING clause, with:
ALTER SEQUENCE epoch_id_seq RESTART WITH 10000
and replace 10000 with the value of SELECT MAX("id") FROM epoch, plus one.
I think you can just increase the timeout for SQLite and fix your problem.
The issue here is that the default SQLite timeout for writing is low, and when there are even small amounts of concurrent writes, SQLite will start throwing exceptions. This is common and well known.
The default should be something like 5-10 seconds. If you exceed this timeout then either increase it or chunk up your writes to the db.
Here is an example:
I return a DatabaseProxy here because this proxy allows sqlite to be swapped out for postgres without changing client code.
import atexit

from peewee import DatabaseProxy  # type: ignore
from playhouse.db_url import connect  # type: ignore
from playhouse.sqlite_ext import SqliteExtDatabase  # type: ignore

DB_TIMEOUT = 5

def create_db(db_path: str) -> DatabaseProxy:
    pragmas = (
        # Negative size is per api spec.
        ("cache_size", -1024 * 64),
        # wal speeds up writes.
        ("journal_mode", "wal"),
        ("foreign_keys", 1),
    )
    sqlite_db = SqliteExtDatabase(
        db_path,
        timeout=DB_TIMEOUT,
        pragmas=pragmas)
    sqlite_db.connect()
    atexit.register(sqlite_db.close)
    db_proxy: DatabaseProxy = DatabaseProxy()
    db_proxy.initialize(sqlite_db)
    return db_proxy
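A hypothetical usage sketch for the create_db helper above, binding a peewee model to the returned proxy (the model, field, and path are made up):

from peewee import Model, CharField

db_proxy = create_db("/path/to/base.db")

class Experiment(Model):
    name = CharField()

    class Meta:
        database = db_proxy  # models talk to the proxy, not a concrete database

Experiment.create_table(safe=True)
Experiment.create(name="run-1")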
Related
I'm after a way of querying Impala through Python which enables you to keep a connection open and pass queries to it.
I can connect quite happily to Impala using this sort of code:
import subprocess
sql = 'some sort of sql statement;'
cmds = ['impala-shell','-k','-B','-i','impala.company.corp','-q', sql]
out,err = subprocess.Popen(cmds, stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()
print(out.decode())
print(err.decode())
I can also switch out the -q and sql for -f and a file with sql statements as per the documentation here.
When I'm running this for multiple SQL statements, the name node it uses is the same for all the queries, and it will stop if there is a failure in the code (unless I use the option to continue); this is all expected.
What I'm trying to get to is where I can run a query or two, check the results using some python logic and then continue if it meets my criteria.
I have tried splitting up my code into individual queries using sqlparse and running them one by one. This works well in isolation, but if one statement is a drop table if exists x; and the next one is create table x (blah string);, then, if x did actually exist, the second statement may run on a different node that the metadata change from the drop hasn't reached yet, and it fails with a "table x already exists" or similar error.
I'd think that, as well as getting round this metadata issue, it would just make more sense to keep a connection open to Impala whilst I run all the statements, but I'm struggling to work this out.
Does anyone have any code that has this functionality?
You may wanna look at impyla, the Impala/Hive python client, if you haven't done so already.
As far as the second part of your question, using Impala's SYNC_DDL option will guarantee that DDL changes are propagated across impalads before next DDL is executed.
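For illustration, a hedged sketch with impyla (the host, port, auth_mechanism, and table names here are assumptions that depend on your cluster): keep one connection open, set SYNC_DDL so DDL changes propagate before the next statement runs, and interleave Python logic between queries.

from impala.dbapi import connect

conn = connect(host='impala.company.corp', port=21050, auth_mechanism='GSSAPI')
cur = conn.cursor()
cur.execute('SET SYNC_DDL=1')                 # wait for DDL to reach all impalads
cur.execute('DROP TABLE IF EXISTS x')
cur.execute('CREATE TABLE x (blah STRING)')   # no "already exists" race now

cur.execute('SELECT COUNT(*) FROM source_table')
(count,) = cur.fetchone()
if count > 0:                                 # Python logic between statements
    cur.execute('INSERT INTO x SELECT blah FROM source_table')

cur.close()
conn.close()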
We're experiencing some slowdown, and frustrating database lockups with our current solution, which essentially consists of calling stored procedures on an MSSQL server to manipulate data. If two or more users try to hit the same table simultaneously, one is locked out and their request fails.
The proposed solution to this problem was to bring the data into python using sqlalchemy, and perform any manipulations / calculations on it in dataframes. This worked but was incredibly slow because of the network calls to the DB.
Is there a better solution which can support multiple concurrent users, without causing too much of a slowdown?
You can use the NOLOCK table hint in your stored procedure to work around this problem.
In your stored procedure, write WITH (NOLOCK) after each table name; I hope it will work for you.
eg.
select * from tablename1 t1 with (nolock)
join tablename2 t2 with (nolock) on t2.id = t1.id
You can alter the webapp to check if the proc is running already, and either abort the run event (on click), or proactively prevent it by disabling the button altogether (timer to re-enable?).
SELECT *
FROM (SELECT * FROM sys.dm_exec_requests WHERE sql_handle IS NOT NULL) A
CROSS APPLY sys.dm_exec_sql_text(A.sql_handle) T
WHERE T.text LIKE 'dbo.naughty_naughty_proc_name%'
Then perhaps alter the proc as a safeguard to prevent multiple instances using sp_getapplock.
I would not blindly change to read uncommitted as your isolation level. In my opinion that is very bad advice when we don't have any context surrounding how important this system/data is and you clearly state the data is "being manipulated". You really need to understand the data, and the system you're impacting before doing this!
Some reading:
https://www.mssqltips.com/sqlservertip/3202/prevent-multiple-users-from-running-the-same-sql-server-stored-procedure-at-the-same-time/
Why use a READ UNCOMMITTED isolation level?
I'm trying to do a large-scale bulk insert into a SQLite database with peewee. I'm using atomic but the performance is still terrible. I'm inserting the rows in blocks of ~2500 rows, and due to SQLITE_MAX_VARIABLE_NUMBER I'm inserting about 200 of them at a time. Here is the code:
with helper.db.atomic():
    for i in range(0, len(expression_samples), step):
        gtd.GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()
And the list expression_samples is a list of dictionaries with the appropriate fields for the GeneExpressionRead model. I've timed this loop, and it takes anywhere from 2-8 seconds to execute. I have millions of rows to insert, and the way I have my code written now it will likely take 2 days to complete. As per this post, there are several pragmas that I have set in order to improve performance. This also didn't really change anything for me performance-wise. Lastly, as per this test on the peewee github page, it should be possible to insert many rows very fast (~50,000 in 0.3364 seconds), but it also seems that the author used raw SQL code to get this performance. Has anyone been able to do such a high-performance insert using peewee methods?
Edit: Did not realize that the test on peewee's github page was for MySQL inserts. May or may not apply to this situation.
Mobius was trying to be helpful in the comments but there's a lot of misinformation in there.
Peewee creates indexes for foreign keys when you create the table. This happens for all database engines currently supported.
Turning on the foreign key PRAGMA is going to slow things down, why would it be otherwise?
For best performance, do not create any indexes on the table you are bulk-loading into. Load the data, then create the indexes. This is much much less work for the database.
As you noted, disabling auto increment for the bulk-load speeds things up.
Other information:
Use PRAGMA journal_mode=wal;
Use PRAGMA synchronous=0;
Use PRAGMA locking_mode=EXCLUSIVE;
Those are some good settings for loading in a bunch of data. Check the sqlite docs for more info:
http://sqlite.org/pragma.html
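For illustration, a minimal sketch combining those pragmas with chunked inserts in peewee (it assumes the GeneExpressionRead model and expression_samples list from the question; the database path is made up):

from peewee import SqliteDatabase, chunked

db = SqliteDatabase('genes.db', pragmas={
    'journal_mode': 'wal',        # write-ahead logging
    'synchronous': 0,             # skip fsync on every write
    'locking_mode': 'EXCLUSIVE',  # hold the file lock for the whole load
    'cache_size': -1024 * 64,     # 64 MB page cache
})

with db.atomic():
    for batch in chunked(expression_samples, 200):
        GeneExpressionRead.insert_many(batch).execute()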
In all of the documentation where atomic appears as a context manager, it is used as a function call. Since it sounds like your code never exits the with block, you're probably just not seeing the error about the missing __exit__ method.
Can you try with helper.db.atomic():?
atomic() starts a transaction. Without an open transaction, inserts are much slower, because some expensive bookkeeping has to be done for every write, as opposed to only at the beginning and the end.
EDIT
Since the code in the question was changed, can I have some more information about the table you're inserting into? Is it large, and how many indices are there?
Since this is SQLite, you're just writing to a file, but do you know if that file is on a local disk or on a network-mounted drive? I've had issues just like this because I was trying to insert into a database on an NFS.
I have 400 million lines of unique key-value info that I would like to be available for quick lookups in a script. I am wondering what would be a slick way of doing this. I did consider the following, but I'm not sure whether there is a way to disk-map the dictionary without using a lot of memory except during dictionary creation.
Pickled dictionary object: not sure whether this is an optimal solution for my problem.
NoSQL-type databases: ideally I want something with minimum dependency on third-party stuff, plus the keys and values are simply numbers. If you feel this is still the best option, I would like to hear that too. Maybe it will convince me.
Please let me know if anything is not clear.
Thanks!
-Abhi
If you want to persist a large dictionary, you are basically looking at a database.
Python comes with built-in support for sqlite3, which gives you an easy database solution backed by a file on disk.
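For illustration, a minimal sketch with the stdlib sqlite3 module (the file and table names are made up): store the numeric pairs in a table with a primary key and look them up by key.

import sqlite3

conn = sqlite3.connect('kv.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)')

# Bulk-load pairs, committing once at the end.
pairs = [(1, 100), (2, 200), (3, 300)]
conn.executemany('INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)', pairs)
conn.commit()

# Quick lookups by key (served by the primary-key index).
row = conn.execute('SELECT v FROM kv WHERE k = ?', (2,)).fetchone()
print(row[0] if row else None)  # -> 200
conn.close()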
No one has mentioned dbm. It is opened like a file, behaves like a dictionary and is in the standard distribution.
From the docs https://docs.python.org/3/library/dbm.html
import dbm

# Open database, creating it if necessary.
with dbm.open('cache', 'c') as db:

    # Record some values
    db[b'hello'] = b'there'
    db['www.python.org'] = 'Python Website'
    db['www.cnn.com'] = 'Cable News Network'

    # Note that the keys are considered bytes now.
    assert db[b'www.python.org'] == b'Python Website'
    # Notice how the value is now in bytes.
    assert db['www.cnn.com'] == b'Cable News Network'

    # Often-used methods of the dict interface work too.
    print(db.get('python.org', b'not present'))

    # Storing a non-string key or value will raise an exception (most
    # likely a TypeError).
    db['www.yahoo.com'] = 4

# db is automatically closed when leaving the with statement.
I would try this before any of the more exotic forms, as using shelve/pickle will pull everything into memory on loading.
Cheers
Tim
In principle the shelve module does exactly what you want. It provides a persistent dictionary backed by a database file. Keys must be strings, but shelve will take care of pickling/unpickling values. The type of db file can vary, but it can be a Berkeley DB hash, which is an excellent lightweight key-value database.
Your data size sounds huge so you must do some testing, but shelve/BDB is probably up to it.
Note: The bsddb module has been deprecated. Possibly shelve will not support BDB hashes in future.
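For illustration, a minimal shelve sketch (the file name is made up); keys must be strings, and values are pickled transparently.

import shelve

with shelve.open('kv_store') as db:
    db['12345'] = 67890        # any picklable value works
    print(db.get('12345'))     # -> 67890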
Without a doubt (in my opinion), if you want this to persist, then Redis is a great option.
Install redis-server
Start redis server
Install redis python package (pip install redis)
Profit.
import redis

ds = redis.Redis(host="localhost", port=6379)

with open("your_text_file.txt") as fh:
    for line in fh:
        line = line.strip()
        k, _, v = line.partition("=")
        ds.set(k, v)
The above assumes a file of values like:
key1=value1
key2=value2
etc=etc
Modify insertion script to your needs.
import redis

ds = redis.Redis(host="localhost", port=6379)

# Do your code that needs to do look-ups of keys:
for mykey in special_key_list:
    val = ds.get(mykey)
Why I like Redis.
Configurable persistence options
Blazingly fast
Offers more than just key / value pairs (other data types)
@antirez
I don't think you should try the pickled dict. I'm pretty sure that Python will slurp the whole thing in every time, which means your program will wait for I/O longer than perhaps necessary.
This is the sort of problem for which databases were invented. You are thinking "NoSQL" but an SQL database would work also. You should be able to use SQLite for this; I've never made an SQLite database that large, but according to this discussion of SQLite limits, 400 million entries should be okay.
What are the performance characteristics of sqlite with very large database files?
I personally use LMDB and its python binding for a DB with a few million records.
It is extremely fast even for a database larger than the RAM.
It's embedded in the process so no server is needed.
Dependencies are managed using pip.
The only downside is that you have to specify the maximum size of the DB. LMDB is going to mmap a file of this size. If it is too small, inserting new data will raise an error. If it is too large, you create a sparse file.
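For illustration, a minimal sketch with the lmdb binding (the path and map_size are made up); keys and values must be bytes.

import lmdb

env = lmdb.open('kv.lmdb', map_size=10 * 1024**3)  # reserve up to 10 GB

# Write a batch of pairs in a single transaction.
with env.begin(write=True) as txn:
    for k, v in [(b'1', b'100'), (b'2', b'200')]:
        txn.put(k, v)

# Read-only lookups, fast even when the DB is larger than RAM.
with env.begin() as txn:
    print(txn.get(b'2'))  # -> b'200'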
Please bear with me as I explain the problem and how I tried to solve
it; my question on how to improve it is at the end.
I have a 100,000 line csv file from an offline batch job and I needed to
insert it into the database as its proper models. Ordinarily, if this is a fairly straightforward load, it can be loaded trivially by just munging the CSV file to fit a schema; but I had to do some external processing that requires querying, and it's just much more convenient to use SQLAlchemy to generate the data I want.
The data I want here is 3 models that represent 3 pre-existing tables
in the database and each subsequent model depends on the previous model.
For example:
Model C --> Foreign Key --> Model B --> Foreign Key --> Model A
So, the models must be inserted in the order A, B, and C. I came up
with a producer/consumer approach:
- instantiate a multiprocessing.Process which contains a
threadpool of 50 persister threads that have a threadlocal
connection to a database
- read a line from the file using the csv DictReader
- enqueue the dictionary to the process, where each thread creates
the appropriate models by querying the right values and each
thread persists the models in the appropriate order
This was faster than a non-threaded read/persist but it is way slower than
bulk-loading a file into the database. The job finished persisting
after about 45 minutes. For fun, I decided to write it in SQL
statements, it took 5 minutes.
Writing the SQL statements took me a couple of hours, though. So my
question is, could I have used a faster method to insert rows using
SQLAlchemy? As I understand it, SQLAlchemy is not designed for bulk
insert operations, so this is less than ideal.
This leads to my question: is there a way to generate the SQL statements using SQLAlchemy, throw
them in a file, and then just use a bulk-load into the database? I
know about str(model_object) but it does not show the interpolated
values.
I would appreciate any guidance for how to do this faster.
Thanks!
Ordinarily, no, there's no way to get the query with the values included.
What database are you using though? Because a lot of databases have some bulk-load feature for CSV available.
Postgres: http://www.postgresql.org/docs/8.4/static/sql-copy.html
MySQL: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Oracle: http://www.orafaq.com/wiki/SQL*Loader_FAQ
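For illustration, a hedged sketch of the PostgreSQL route with psycopg2 (the connection parameters, table, and file names are made up): COPY streams the CSV straight into the table instead of issuing per-row INSERTs.

import psycopg2

conn = psycopg2.connect(dbname='mydb', host='localhost')
# The connection context manager wraps a transaction; copy_expert streams the file.
with conn, conn.cursor() as cur, open('model_a.csv') as f:
    cur.copy_expert("COPY model_a (col1, col2) FROM STDIN WITH (FORMAT csv)", f)
conn.close()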
If you're willing to accept that certain values might not be escaped correctly, then you can use this hack I wrote for debugging purposes:
def interpolate_params(query, compiler):
    """Replace the parameter placeholders with values (debugging only)."""
    # Sort longest keys first so e.g. :foo1 is replaced before :foo.
    params = sorted(compiler.params.items(), key=lambda kv: len(str(kv[0])), reverse=True)
    for k, v in params:
        # Some types don't need escaping.
        if isinstance(v, (int, float, bool)):
            v = str(v)
        else:
            v = "'%s'" % v
        # Replace the placeholders with values.
        # Works both with :1 and %(foo)s type placeholders.
        query = query.replace(':%s' % k, v)
        query = query.replace('%%(%s)s' % k, v)
    return query
First, unless you actually have a machine with 50 CPU cores, using 50 threads/processes won't help performance -- it will actually make things slower.
Second, I've a feeling that if you used SQLAlchemy's way of inserting multiple values at once, it would be much faster than creating ORM objects and persisting them one-by-one.
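For illustration, a hedged sketch of that multi-row approach with SQLAlchemy Core (the table, columns, and connection URL are made up): passing a list of dicts to execute() lets the driver use executemany instead of persisting ORM objects one at a time.

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine('sqlite:///example.db')
metadata = MetaData()
model_a = Table('model_a', metadata,
                Column('id', Integer, primary_key=True),
                Column('name', String(50)))
metadata.create_all(engine)

rows = [{'name': 'row-%d' % i} for i in range(100000)]
with engine.begin() as conn:          # one transaction for the whole batch
    conn.execute(model_a.insert(), rows)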
I would venture to say the time spent in the python script is in the per-record upload portion. To determine this, you could write to CSV or discard the results instead of uploading new records. This will determine where the bottleneck is, at least from a lookup-vs-insert standpoint. If, as I suspect, that is indeed where the time goes, you can take advantage of the bulk-import feature most DBs have. There is no reason, and indeed some arguments against, inserting record-by-record in this kind of circumstance.
Bulk imports tend to do some interesting optimizations, such as doing it as one transaction without commits for each record (even just doing this could see an appreciable drop in run time); whenever feasible I recommend the bulk insert for large record counts. You could still use the producer/consumer approach, but have the consumer instead store the values in memory or in a file, and then call the bulk-import statement specific to the DB you are using. This might be the route to go if you need to do processing for each record in the CSV file. If so, I would also consider how much of that can be cached and shared between records.
It is also possible that the bottleneck is using SQLAlchemy. Not that I know of any inherent issues, but given what you are doing, it might be requiring a lot more processing than is necessary - as evidenced by the 8x difference in run times.
For fun, since you already know the SQL, try using a direct DBAPI module in Python to do it and compare run times.