I have an existing sqlite3 db file against which I need to run some extensive calculations. Doing the calculations from the file is painfully slow, and since the file is not large (~10 MB), there should be no problem loading it into memory.
Is there a Pythonic way to load the existing file into memory in order to speed up the calculations?
Here is the snippet that I wrote for my flask application:
import sqlite3
from io import StringIO

def init_sqlite_db(app):
    # Read database to tempfile
    con = sqlite3.connect(app.config['SQLITE_DATABASE'])
    tempfile = StringIO()
    for line in con.iterdump():
        tempfile.write('%s\n' % line)
    con.close()
    tempfile.seek(0)

    # Create a database in memory and import from tempfile
    app.sqlite = sqlite3.connect(":memory:")
    app.sqlite.cursor().executescript(tempfile.read())
    app.sqlite.commit()
    app.sqlite.row_factory = sqlite3.Row
What about sqlite3.Connection.backup(...)? "This method makes a backup of a SQLite database even while it’s being accessed by other clients, or concurrently by the same connection." Availability: SQLite 3.6.11 or higher. New in Python 3.7.
import sqlite3

# Copy the existing on-disk database into a fresh in-memory database.
source = sqlite3.connect('existing_db.db')
dest = sqlite3.connect(':memory:')
source.backup(dest)
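When the calculations are finished, the same API can copy the in-memory database back to a file; a short sketch continuing the example above (results.db is an assumed output name):

# Persist the in-memory database back to disk when done
# ('results.db' is an assumed output file name).
disk = sqlite3.connect('results.db')
dest.backup(disk)
disk.close()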
sqlite3.Connection.iterdump "[r]eturns an iterator to dump the database in an SQL text format. Useful when saving an in-memory database for later restoration. This function provides the same capabilities as the .dump command in the sqlite3 shell."
Get such an iterator and dump the disk-based database into a memory-based one, and you're ready to compute. When the computation is done, just dump the other way around back to disk.
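A minimal sketch of that round trip with iterdump, reusing the existing_db.db name from above and an assumed results.db output file:

import sqlite3

# Dump the on-disk database into an in-memory one.
disk = sqlite3.connect('existing_db.db')
mem = sqlite3.connect(':memory:')
mem.executescript("\n".join(disk.iterdump()))
disk.close()

# ... run the calculations against `mem` ...

# Dump the other way around into a fresh file when done.
out = sqlite3.connect('results.db')
out.executescript("\n".join(mem.iterdump()))
out.close()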
First you should try and find out what is causing the slowness you are observing. Are you writing to tables? Are your writes within large enough transactions so that you don't save needless temporary results to disk? Can you change writes to go to temporary tables (with pragma temp_store=memory)? Can you live with pragma synchronous=off?
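If those pragmas turn out to help, they can be set directly from Python; a minimal sketch, assuming the same existing_db.db file name as elsewhere in this thread:

import sqlite3

con = sqlite3.connect('existing_db.db')    # assumed file name
con.execute("PRAGMA temp_store = MEMORY")  # keep temporary tables/indices in RAM
con.execute("PRAGMA synchronous = OFF")    # trade durability for write speed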
I don't think this functionality is exposed in the Python module, but sqlite has a backup API that sounds like exactly what you are asking for: a way to copy from one database to another (either one of which may be an in-memory database) that works pretty much automatically without any user-visible enumeration of tables. (Maybe APSW exposes this?)
Another option is to create a ram disk (if you have sufficient control of the environment) and copy the file there.
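For instance, on Linux a tmpfs mount such as /dev/shm can serve as the ram disk; a minimal sketch, assuming that mount point exists and the same file name as above:

import shutil
import sqlite3

# Copy the database onto a tmpfs mount and work on the copy.
ram_path = shutil.copy('existing_db.db', '/dev/shm/existing_db.db')
con = sqlite3.connect(ram_path)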
If we must use a Python wrapper, then there is no better solution than the two-pass read-and-write approach.
But beginning with version 3.7.17, SQLite has the option of accessing disk content directly using memory-mapped I/O (see the SQLite mmap documentation).
If you want to use mmap, you have to use the C interface, since no wrapper provides it.
There is also a hardware solution, the memory disk: then you have the convenience of file I/O and the speed of memory.
This has already been answered before, including code examples at In python, how can I load a sqlite db completely to memory before connecting to it?
You do not mention the operating system, but one gotcha of Windows XP is that it defaults to a 10 MB file cache, no matter how much memory you have. (This made sense in the days when systems came with 64 MB of RAM, etc.) This message has several links:
http://marc.info/?l=sqlite-users&m=116743785223905&w=2
Here is a relatively simple way to read a SQLite db into memory. Depending upon your preferences with regard to manipulating data, you either use a Pandas DataFrame or write your table to an in-memory sqlite3 database. Similarly, after manipulating your data you use the same df.to_sql approach to store your results back into a db table.
import sqlite3 as lite
from pandas.io.sql import read_sql
from sqlalchemy import create_engine

# In-memory SQLite database managed through SQLAlchemy
engine = create_engine('sqlite://')
c = engine.connect()
conmem = c.connection

# Connection to <ait.sqlite> residing on disk
con = lite.connect('ait.sqlite', isolation_level=None)
cur = con.cursor()

# Read the SQLite table into a Pandas DataFrame
sqlx = 'SELECT * FROM Table'
df = read_sql(sqlx, con, coerce_float=True, params=None)

# Write the DataFrame into the in-memory database
df.to_sql(con=conmem, name='Table', if_exists='replace')
With the solution of Cenk Alti, I always got a MemoryError with Python 3.7 when the process reached 500 MB. Only by using the backup functionality of sqlite3 (mentioned by thinwybk) was I able to load and save bigger SQLite databases. You can also do the same with just 3 lines of code, both ways.
The answers of #thinwybk and Crooner are both excellent.
When you have multiple connections to :memory: SQLite databases, for instance when using SQLAlchemy together with the source.backup(dest) function, you may end up not placing the backup into the "right" in-memory DB.
This can be fixed using a proper connection string, as shown in https://stackoverflow.com/a/65429612/1617295; it does not involve any hack nor the use of undocumented features.
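For reference, a minimal sketch of such a connection string, using a named shared-cache in-memory database (the name memdb1 is an arbitrary choice); as long as at least one connection to that URI stays open, SQLAlchemy's connections and the backup target refer to the same in-memory database:

import sqlite3
from sqlalchemy import create_engine

# Named shared in-memory database; all connections using this URI see the same data.
engine = create_engine("sqlite:///file:memdb1?mode=memory&cache=shared&uri=true")

# Back the on-disk file up into that same shared in-memory database.
source = sqlite3.connect('existing_db.db')
dest = sqlite3.connect('file:memdb1?mode=memory&cache=shared', uri=True)
source.backup(dest)
source.close()
# Keep `dest` (or any connection to the URI) open, or the database is discarded.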
SQLite supports in-memory databases.
In Python, you would use :memory: as the database name for that.
Perhaps you could open two databases (one from the file, an empty one in-memory), migrate everything from the file database into memory, then use the in-memory database further to do calculations.
Related
I have a Python script to import data from raw csv/xlsx files. For these I use Pandas to load the files, do some light transformation, and save to an sqlite3 database. This is fast (as fast as any method). After this, I run some queries against these to make some intermediate datasets. These I run through a function (see below).
More information: I am using Anaconda/Python3 (3.9) on Windows 10 Enterprise.
UPDATE:
Just as information for anybody reading this, I ended up going back to just using standalone Python (still using JupyterLab though)... I no longer have this issue. So not sure if it is a problem with something Anaconda does or just the versions of various libraries being used for that particular Anaconda distribution (latest available). My script runs more or less in the time that I would expect using Python 3.11 and the versions pulled in by pip for Pandas and sqlite (1.5.3 and 3.38.4).
Python function for running sqlite3 queries:
def runSqliteScript(destConnString, queryString):
    '''Runs an sqlite script given a connection string and a query string
    '''
    try:
        print('Trying to execute sql script: ')
        print(queryString)
        cursorTmp = destConnString.cursor()
        cursorTmp.executescript(queryString)
    except Exception as e:
        print('Error caught: {}'.format(e))
Because somebody asked, here is the function that creates the "destConnString" (it is called something else at the actual call site, but it is the same type).
def createSqliteDb(db_file):
    '''Creates an sqlite database at the directory/file name specified
    '''
    conSqlite = None
    try:
        conSqlite = sqlite3.connect(db_file)
        return conSqlite
    except sqlite3.Error as e:
        print('Error {} when trying to create {}'.format(e, db_file))
Example of one of the queries (I commented out journal mode/synchronous pragmas after it didn't seem to help at all):
-- PRAGMA journal_mode = WAL;
-- PRAGMA synchronous = NORMAL;
BEGIN;
drop table if exists tbl_1110_cop_omd_fmd;
COMMIT;
BEGIN;
create table tbl_1110_cop_omd_fmd as
select
siteId,
orderNumber,
familyGroup01, familyGroup02,
count(*) as countOfLines
from tbl_0000_ob_trx_for_frazelle
where 1 = 1
-- and dateCreated between datetime('now', '-365 days') and datetime('now', 'localtime') -- temporarily commented due to no date in file
group by siteId,
orderNumber,
familyGroup01, familyGroup02
order by dateCreated asc
;
COMMIT
;
Here is a complete list of the things that I have tried. Unfortunately, no matter what combination of them I tried, I ended up with one bottleneck or another. It seems there is some kind of write bottleneck from Python to sqlite3, yet the pandas to_sql method doesn't seem to be affected by it.
I tried wrapping all my queries in begin/commit statements. I put these in-line with the query, though I'd be interested in knowing if this is the correct way to do this. This seemed to have no effect.
I tried setting the journal mode to WAL and synchronous to normal, again to no effect.
I tried running the queries in an in-memory database.
Firstly, I tried creating everything from scratch in the in-memory database. The tables didn't create any faster. Saving this in-memory database seems to be a bottleneck (backup method).
Next, I tried creating views instead of tables (again, creating everything from scratch in the in-memory database). This created really quickly. Weirdly, querying these views was very fast. Saving this in-memory database seems to be a bottleneck (backup method).
I tried just writing views to the database file (not in-memory). Unfortunately, the views take as long to create as the tables do when running from Python/sqlite.
I don't really want to do anything strictly in-memory for the database creation, as this Python script is used for different sets of data, some of which could have too many rows for an in-memory setup. The only thing I have left to try is to take the in-memory from-scratch setup, make views instead of tables, read ALL of the in-memory db tables with pandas (read_sql), then write ALL of the tables to a file db with pandas (to_sql)... Hoping there is something easy to try to resolve this problem.
connOBData = sqlite3.connect('file:cachedb?mode=memory&cache=shared', uri=True)
These take approximately 1,000 times longer (or more) than if I run the same queries directly in DB Browser (an SQLite front end). These queries aren't that complex and run fine (in ~2-4 seconds) in DB Browser. All told, if I ran all the queries in a row in DB Browser, they'd finish in 1-2 minutes. If I let them run through the Python script, it literally takes close to 20 hours. I'd expect the queries to finish in approximately the same time that they run in DB Browser.
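For what it's worth, here is a rough sketch of that last fallback (creating everything in an in-memory database, then copying every table/view out to a file database with pandas); the output file name is an assumption:

import sqlite3
import pandas as pd

mem = sqlite3.connect(':memory:')
# ... create the tables/views in `mem` from scratch as described above ...

disk = sqlite3.connect('output.db')  # assumed output file name
names = [row[0] for row in mem.execute(
    "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')")]
for name in names:
    df = pd.read_sql(f'SELECT * FROM "{name}"', mem)
    df.to_sql(name, disk, if_exists='replace', index=False)
disk.close()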
I am going to work with very large spatio-temporal datasets stored in a PostgreSQL server with read-only access. The goal is to load these datasets into a Python environment, do all the processing, and create some kind of learning model.
It wouldn't be a nice idea to have these datasets loaded into Python from Postgres every time I want to work.
I am wondering if there is a way to have Postgres dump these datasets to disk in a format (e.g. .pkl or .hdf) that could support faster loading in the future. I really don't think \copy to CSV would be an option.
If you're going from PostgreSQL, then COPY or \copy is your only option. You haven't said why that is not viable.
I'm assuming the database is changing and you want to pull in recent data into your Python program. If the database is static, then you could possibly
use COPY or \copy once
read the result into Python
save the data into another format (see the sketch after this list):
hdf using h5py
numpy format using numpy.save
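A minimal sketch of that static-data path, assuming \copy produced a two-column CSV named video.csv (all file names here are made up):

import numpy as np
import h5py

# Load the CSV produced by \copy into a structured array (two 32-bit int columns).
data = np.loadtxt('video.csv', delimiter=',', dtype='i4,i4')

# Save as a .npy file for fast reloading ...
np.save('video.npy', data)

# ... or store it in an HDF5 file with h5py.
with h5py.File('video.h5', 'w') as f:
    f.create_dataset('video', data=data)

# Later runs can reload either file far faster than re-querying the database.
reloaded = np.load('video.npy')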
Another way to work the problem is to pull the data into Python using a database connector. The original solution was specific to MySQL, but the same approach works with a PostgreSQL connector. Here is that code adapted for PostgreSQL:
import psycopg2
import numpy

conn = psycopg2.connect(host='localhost', user='bob', password='mypasswd', dbname='bigdb')
curs = conn.cursor()
curs.execute("SELECT id, rating FROM video")
# psycopg2's cursor.execute() returns None, so get the row count from the cursor
numrows = curs.rowcount

# dtype='i4,i4' means two columns, both 4 byte (32 bit) integers
# you will have to adapt to your data type and data structure
A = numpy.fromiter(curs.fetchall(), count=numrows, dtype='i4,i4')
I have an SQLite database (110 KB) in an S3 bucket. I want to connect to that database every time I run my Python application.
An option is to download the database every time I run the Python application and connect to it. But I want to know whether there is a way to connect to that SQLite database in memory, using S3FileSystem and open.
I'm using the sqlite3 library in Python 3.6.
As other answers indicate, you probably don't want to use SQLite as a primary database in the cloud.
However, as part of a fun side project I wrote an Amazon Athena data source connector that allows you to query SQLite databases in S3 from Athena. In order to do that, I wrote a read-only SQLite interface for S3.
SQLite has a concept of an OS Interface or VFS. Using a Python SQLite wrapper called APSW, you can write a VFS implementation for arbitrary filesystems. This is what I did in my project and I've included the implementation below.
In order to use this, you would first register the VFS and then create a new SQLite connection with this implementation as the driver.
I should note this isn't optimized at all, so it will likely still require reading full databases from S3 depending on your queries. But that doesn't sound like an issue in this specific case.
S3FS = S3VFS()  # S3VFS defined below

# This odd format is used due to SQLite requirements
sqlite_uri = "file:/{}/{}.sqlite?bucket={}&immutable=1".format(
    S3_PREFIX,
    DATABASE_NAME,
    S3_BUCKET
)

connection = apsw.Connection(sqlite_uri,
                             flags=apsw.SQLITE_OPEN_READONLY | apsw.SQLITE_OPEN_URI,
                             vfs=S3FS.vfsname)

cursor = connection.cursor()
Once you have the cursor, you can execute standard SQL statements like so:
for x, y, z in cursor.execute("select x,y,z from foo"):
    print(cursor.getdescription())  # shows column names and declared types
    print(x, y, z)
VFS Implementation (requires APSW library and boto3 for S3 connectivity)
import apsw
import sys
import boto3

VFS_S3_CLIENT = boto3.client('s3')

class S3VFS(apsw.VFS):
    def __init__(self, vfsname="s3", basevfs=""):
        self.vfsname = vfsname
        self.basevfs = basevfs
        apsw.VFS.__init__(self, self.vfsname, self.basevfs)

    def xOpen(self, name, flags):
        return S3VFSFile(self.basevfs, name, flags)

class S3VFSFile():
    def __init__(self, inheritfromvfsname, filename, flags):
        self.bucket = filename.uri_parameter("bucket")
        self.key = filename.filename().lstrip("/")
        print("Initiated S3 VFS for file: {}".format(self._get_s3_url()))

    def xRead(self, amount, offset):
        response = VFS_S3_CLIENT.get_object(Bucket=self.bucket, Key=self.key, Range='bytes={}-{}'.format(offset, offset + amount))
        response_data = response['Body'].read()
        return response_data

    def xFileSize(self):
        client = boto3.client('s3')
        response = client.head_object(Bucket=self.bucket, Key=self.key)
        return response['ContentLength']

    def xClose(self):
        pass

    def xFileControl(self, op, ptr):
        return False

    def _get_s3_url(self):
        return "s3://{}/{}".format(self.bucket, self.key)
Yes, it's possible with EFS:
https://www.lambrospetrou.com/articles/aws-lambda-and-sqlite-over-efs/
AWS recently released integration between AWS Lambda and Amazon EFS.
It supports NFSv4 lock upgrading/downgrading, which is needed by SQLite.
This means the SQLite engine can have read/write access to files stored on an EFS filesystem.
Other answers explain how the sqlite engine can register an alternative "file system" that can work with database files in various cloud storage scenarios. This includes both custom solutions and updated cloud platform solutions. If these systems satisfy your need, by all means go for it.
These possibilities forced me to rethink my previous definite "No" answer, but they ultimately do not change the underlying nature of the sqlite engine. It is still true that the engine has no built-in network/web api and is not designed for that. That is not my opinion, rather it says as much on the official website. It is still true that--even with an alternative file system--the engine works with database "files", so to persist and load in-memory databases still necessitates a "file system". So for most applications and most developers who want an optimal and/or plug-and-play solution that matches the design goals of sqlite, my original answer stands.
Original answer:
No, it is not possible to connect directly to an sqlite database stored in the cloud. Even if you want to host the database in memory, it must still be downloaded completely before being loaded into memory. To do that still requires that the database first be loaded from a disc-based file or using DDL commands to create it directly in memory. See Loading and Saving In-Memory Databases.
In such a scenario, once the database was disconnected, it would then need to be re-uploaded to the cloud storage. S3FileSystem.open just returns a data stream. All the stream will allow you to do is download the file to a local store so that it can be opened / manipulated locally.
If you really need a cloud database, you need to look into another sort of hosted database.
(Inspired by dacort's answer)
If the database is just used for reads, there is https://github.com/michalc/sqlite-s3-query (full disclosure: written by me)
Modifying an example from its README, say you have my-db.sqlite in a bucket my-bucket in eu-west-2 (and credentials in environment variables):
from sqlite_s3_query import sqlite_s3_query

with \
        sqlite_s3_query(url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite') as query, \
        query('SELECT * FROM my_table WHERE my_column = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)
If all your actions are limited to reading from SQLite, I guess it is possible. But I have no idea whether writing is also possible.
In my case, I'm using GDAL (requires libgdal), and GDAL's /vsis3 and /vsis3_streaming handlers (based on /vsicurl) give you the ability to read SQLite and many other data sources from the cloud. If you want to use raw SQLite rather than going through GDAL's data source layer, you can just write the data to your local database via GDAL's API; however, if so, why not just download the file and read it?
For me, since I'm working with spatial data and GDAL's DataSource provides a lot of APIs to manipulate spatial data, this method works fine. I'm still looking for a good way to write to a cloud-based SQLite.
FYI, here is the documentation for GDAL virtual file systems:
https://gdal.org/user/virtual_file_systems.html
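For illustration, a rough sketch of reading such a database over /vsis3 with GDAL's Python bindings; the bucket and file names are made up, and AWS credentials are assumed to be available through the usual environment variables:

from osgeo import ogr

# Open an SQLite database stored in S3 read-only through GDAL's /vsis3 handler.
ds = ogr.Open('/vsis3/my-bucket/my-db.sqlite')
layer = ds.GetLayer(0)
for feature in layer:
    print(feature.GetField(0))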
I'm a user of a Python application that has poorly indexed tables, and I was wondering whether it's possible to improve performance by converting the SQLite database into an in-memory database upon application startup. My thinking is that it would minimize the cost of full table scans, especially since SQLite might be creating automatic indexes, as the documentation says that is enabled by default. How can this be accomplished using the SQLAlchemy ORM (which is what the application uses)?
At the start of the program, move the database file to a ramdisk, point SQLAlchemy to it and do your processing, and then move the SQLite file back to non-volatile storage.
It's not a great solution, but it'll help you determine whether caching your database in memory is worthwhile.
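If a ramdisk is not an option, a minimal sketch of another route: copy the file into a single in-memory connection with sqlite3.Connection.backup() and hand that connection to SQLAlchemy via creator and StaticPool (the file name app.db is an assumption, not the application's actual database):

import sqlite3
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool

# Copy the on-disk database into one in-memory connection.
disk = sqlite3.connect('app.db')  # assumed file name
mem = sqlite3.connect(':memory:', check_same_thread=False)
disk.backup(mem)
disk.close()

# Reuse that single connection for every SQLAlchemy checkout.
engine = create_engine('sqlite://', creator=lambda: mem, poolclass=StaticPool)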
Whenever you set a variable in Python you are instantiating an object. This means you are allocating memory for it.
When you query sqlite you are simply reading information off the disk into memory.
sqlalchemy is simply an abstraction. You read the data from disk into memory in the same way, by querying the database and setting the returned data to a variable.
We're currently working on a Python project that basically reads and writes M2M data into/from a SQLite database. This database consists of multiple tables, one of them storing current values coming from the cloud. This last table is worrying me a bit since it's being written very often and the application runs on a flash drive.
I've read that virtual tables could be the solution. I've thought about converting the critical table into a virtual one and then linking its contents to a real file (XML or JSON) stored in RAM (/tmp, for example, on Debian). I've been reading this article:
http://drdobbs.com/database/202802959?pgno=1
that explains more or less how to do what I want. It's quite complex and I think that this is not very doable using Python. Maybe we need to develop our own sqlite extension, I don't know...
Any idea about how to "place" our conflicting table in RAM whilst the rest of the database stays in flash? Any better/simpler approach to taking the virtual table route from Python?
A very simple, SQL-only solution to create an in-memory table is using SQLite's ATTACH command with the special ":memory:" pseudo-filename:
ATTACH DATABASE ":memory:" AS memdb;
CREATE TABLE memdb.my_table (...);
Since the whole database "memdb" is kept in RAM, the data will be lost once you close the database connection, so you will have to take care of persistence by yourself.
One way to do it could be (a Python sketch follows the list):
Open your main SQLite database file
Attach a in-memory secondary database
Duplicate your performance-critical table in the in-memory database
Run all queries on the duplicate table
Once done, write the in-memory table back to the original table (BEGIN; DELETE FROM real_table; INSERT INTO real_table SELECT * FROM memory_table;)
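A rough Python sketch of these steps; the table and file names are made up:

import sqlite3

con = sqlite3.connect('main.db')
con.execute('ATTACH DATABASE ":memory:" AS memdb')

# Duplicate the performance-critical table into RAM.
con.execute('CREATE TABLE memdb.current_values AS SELECT * FROM current_values')

# ... run the frequent reads/writes against memdb.current_values ...

# Write the in-memory table back before closing.
con.execute('BEGIN')
con.execute('DELETE FROM current_values')
con.execute('INSERT INTO current_values SELECT * FROM memdb.current_values')
con.commit()
con.close()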
But the best advice I can give you: Make sure that you really have a performance problem, the simple solution could just as well be fast enough!
Use an in-memory data structure server. Redis is a sexy option, and you can easily implement a table using lists. Also, it comes with a decent python driver.