Connect to a database in cloud - python

I have an SQLite database (110kb) in an S3 bucket. I want to connect to that database every time I run my Python application.
One option is to download the database every time I run the Python application and connect to it. But I want to know whether there is a way to connect to that SQLite database through memory, using S3FileSystem and open.
I'm using the sqlite3 library in Python 3.6.

As other answers indicate, you probably don't want to use SQLite as a primary database in the cloud.
However, as part of a fun side project I wrote an Amazon Athena data source connector that allows you to query SQLite databases in S3 from Athena. In order to do that, I wrote a read-only SQLite interface for S3.
SQLite has a concept of an OS Interface or VFS. Using a Python SQLite wrapper called APSW, you can write a VFS implementation for arbitrary filesystems. This is what I did in my project and I've included the implementation below.
In order to use this, you would first register the VFS and then create a new SQLite connection with this implementation as the driver.
I should note this isn't optimized at all, so depending on your queries it will likely still end up reading the full database from S3. That doesn't sound like an issue in this specific case, though.
S3FS = S3VFS()  # S3VFS defined below
# This odd format is used due to SQLite requirements
sqlite_uri = "file:/{}/{}.sqlite?bucket={}&immutable=1".format(
    S3_PREFIX,
    DATABASE_NAME,
    S3_BUCKET
)
connection = apsw.Connection(sqlite_uri,
    flags=apsw.SQLITE_OPEN_READONLY | apsw.SQLITE_OPEN_URI,
    vfs=S3FS.vfsname
)
cursor = connection.cursor()
Once you have the cursor, you can execute standard SQL statements like so:
for x, y, z in cursor.execute("select x,y,z from foo"):
    print(cursor.getdescription())  # shows column names and declared types
    print(x, y, z)
VFS Implementation (requires APSW library and boto3 for S3 connectivity)
import apsw
import boto3

VFS_S3_CLIENT = boto3.client('s3')

class S3VFS(apsw.VFS):
    def __init__(self, vfsname="s3", basevfs=""):
        self.vfsname = vfsname
        self.basevfs = basevfs
        apsw.VFS.__init__(self, self.vfsname, self.basevfs)

    def xOpen(self, name, flags):
        return S3VFSFile(self.basevfs, name, flags)

class S3VFSFile():
    def __init__(self, inheritfromvfsname, filename, flags):
        self.bucket = filename.uri_parameter("bucket")
        self.key = filename.filename().lstrip("/")
        print("Initiated S3 VFS for file: {}".format(self._get_s3_url()))

    def xRead(self, amount, offset):
        # HTTP byte ranges are inclusive, so the last byte is offset + amount - 1
        byte_range = 'bytes={}-{}'.format(offset, offset + amount - 1)
        response = VFS_S3_CLIENT.get_object(Bucket=self.bucket, Key=self.key, Range=byte_range)
        response_data = response['Body'].read()
        return response_data

    def xFileSize(self):
        # Reuse the module-level client instead of creating one per call
        response = VFS_S3_CLIENT.head_object(Bucket=self.bucket, Key=self.key)
        return response['ContentLength']

    def xClose(self):
        pass

    def xFileControl(self, op, ptr):
        return False

    def _get_s3_url(self):
        return "s3://{}/{}".format(self.bucket, self.key)

Yes, it's possible with EFS:
https://www.lambrospetrou.com/articles/aws-lambda-and-sqlite-over-efs/
AWS recently released an integration between AWS Lambda and Amazon EFS.
It supports NFSv4 lock upgrading/downgrading, which SQLite needs.
This means the SQLite engine can have read/write access to files stored on an EFS filesystem.

Other answers explain how the sqlite engine can register an alternative "file system" that can work with database files in various cloud storage scenarios. This includes both custom solutions and updated cloud platform solutions. If these systems satisfy your need, by all means go for it.
These possibilities forced me to rethink my previous definite "No" answer, but they ultimately do not change the underlying nature of the sqlite engine. It is still true that the engine has no built-in network/web api and is not designed for that. That is not my opinion, rather it says as much on the official website. It is still true that--even with an alternative file system--the engine works with database "files", so to persist and load in-memory databases still necessitates a "file system". So for most applications and most developers who want an optimal and/or plug-and-play solution that matches the design goals of sqlite, my original answer stands.
Original answer:
No, it is not possible to connect directly to an sqlite database stored in the cloud. Even if you want to host the database in memory, it must still be downloaded completely before being loaded into memory. To do that still requires that the database first be loaded from a disc-based file or using DDL commands to create it directly in memory. See Loading and Saving In-Memory Databases.
In such a scenario, once the database was disconnected, it would then need to be re-uploaded to the cloud storage. S3FileSystem.open just returns a data stream. All the stream will allow you to do is download the file to a local store so that it can be opened / manipulated locally.
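As a rough illustration of the download-then-load step described here, the sketch below takes the raw bytes of a database file (however they were fetched), writes them to a temporary file, and copies them into an in-memory connection with sqlite3.Connection.backup (Python 3.7+). The helper names load_sqlite_bytes and make_sample_bytes and the sample table are assumptions made for the example; no S3 call is made, the bytes are generated locally as a stand-in for a download.

```python
import os
import sqlite3
import tempfile


def load_sqlite_bytes(data: bytes) -> sqlite3.Connection:
    """Copy a SQLite database given as raw file bytes into a :memory: connection."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "downloaded.sqlite")
        with open(path, "wb") as fh:
            fh.write(data)
        disk = sqlite3.connect(path)
        try:
            memory = sqlite3.connect(":memory:")
            disk.backup(memory)  # page-by-page copy into RAM
        finally:
            disk.close()  # close before the temporary directory is removed
    return memory


def make_sample_bytes() -> bytes:
    """Stand-in for the bytes an S3 download would return (hypothetical data)."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sample.sqlite")
        con = sqlite3.connect(path)
        con.execute("CREATE TABLE foo (x INTEGER, y TEXT)")
        con.execute("INSERT INTO foo VALUES (1, 'a')")
        con.commit()
        con.close()
        with open(path, "rb") as fh:
            return fh.read()


connection = load_sqlite_bytes(make_sample_bytes())
print(connection.execute("SELECT x, y FROM foo").fetchall())
```

Any changes made through this connection stay in memory only; persisting them would still mean writing a file and uploading it again.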
If you really need a cloud database, you need to look into another sort of hosted database.

(Inspired by dacort's answer)
If the database is just used for reads, there is https://github.com/michalc/sqlite-s3-query (full disclosure: written by me)
Modifying an example from its README, say you have my-db.sqlite in a bucket my-bucket in eu-west-2 (and credentials in environment variables):
from sqlite_s3_query import sqlite_s3_query

with \
        sqlite_s3_query(url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite') as query, \
        query('SELECT * FROM my_table WHERE my_column = ?', params=('my-value',)) as (columns, rows):
    for row in rows:
        print(row)

If all your actions are limited to reading from SQLite, I guess it is possible. But I have no idea whether writing is also possible.
In my case, I'm using GDAL (requires libgdal), and GDAL's /vsis3 and /vsis3-streaming (based on /vsicurl) give you the ability to read SQLite and many other data sources from the cloud. If you want to use raw SQLite rather than GDAL's data source layer, you can just write them to your local database via GDAL's API; however, if so, why not just download it and read it?
For me, since I'm working on spatial data and GDAL's DataSource provides a lot of APIs to manipulate spatial data, this method works fine. I'm still looking for a good way to write to a cloud-based SQLite.
FYI, here's the documentation for GDAL's virtual file systems:
https://gdal.org/user/virtual_file_systems.html

Related

After AWS Glue is finished, how to execute a SQL script or stored procedure?

I'm learning AWS Glue. With traditional ETL, a common pattern is to look up the primary key in the destination table to decide whether you need to do an update or an insert (aka the upsert design pattern). With Glue there doesn't seem to be that same control. Plainly writing out the dynamic frame is just an insert process. There are two design patterns I can think of to solve this:
Load the destination as data frame and in spark, left outer join to only insert new rows (how would you update rows if you needed to? delete then insert??? Since I'm new to spark this is most foreign to me)
Load the data into a stage table and then use SQL to perform the final merge
It's this second method that I'm exploring first. How can I in the AWS world execute a SQL script or stored procedure once the AWS Glue job is complete? Do you do a python-shell job, lambda, directly part of glue, some other way?
I have used the pymysql library, uploaded to AWS S3 as a zip file and configured in the AWS Glue job parameters. For upserts, I have used INSERT INTO ... ON DUPLICATE KEY UPDATE.
Based on the primary key, the statement either updates the record if it already exists or inserts a new one. Hope this helps. Please refer to this:
import pymysql

rds_host = "rds.url.aaa.us-west-2.rds.amazonaws.com"
name = "username"
password = "userpwd"
db_name = "dbname"
conn = pymysql.connect(host=rds_host, user=name, password=password,
                       database=db_name, connect_timeout=5)
# Example row only; in a real job the values come from the data being loaded.
row = ("98101", "T1", "Seattle", "WA")
with conn.cursor() as cur:
    # Parentheses join the adjacent string literals into one statement;
    # the %s placeholders let the driver quote the values.
    insertQry = ("INSERT INTO ZIP_TERR(zip_code, territory_code, "
                 "territory_name, state) "
                 "VALUES(%s, %s, %s, %s) "
                 "ON DUPLICATE KEY UPDATE territory_name = "
                 "VALUES(territory_name), state = VALUES(state);")
    cur.execute(insertQry, row)
    conn.commit()
conn.close()
In the above code sample, territory_code and zip_code are the primary keys. Please refer here as well: More on looping inserts using a for loop
As always, AWS's changing feature list resolves many of these problems (which arise from user demand and common work patterns).
AWS have published documentation on Updating and Inserting new data, using staging tables (which you mentioned in your second strategy).
Generally speaking the most rigorous approach for ETL is to truncate and reload source data, but this depends on your source data. If your source data is a time series dataset spanning billions of records, you may need to use a delta/incremental load pattern.
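To make the staging-table merge concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the real target, purely so it can be run anywhere. The table names zip_terr and zip_terr_stage and the sample rows are assumptions. The upsert syntax shown is SQLite's (3.24 or newer); MySQL would use ON DUPLICATE KEY UPDATE instead, and other engines have their own MERGE or upsert forms.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE zip_terr (
        zip_code TEXT, territory_code TEXT, territory_name TEXT, state TEXT,
        PRIMARY KEY (zip_code, territory_code)
    );
    CREATE TABLE zip_terr_stage (
        zip_code TEXT, territory_code TEXT, territory_name TEXT, state TEXT
    );
    -- One existing row in the target, two freshly loaded rows in staging.
    INSERT INTO zip_terr VALUES ('98101', 'T1', 'Old name', 'WA');
    INSERT INTO zip_terr_stage VALUES ('98101', 'T1', 'New name', 'WA');
    INSERT INTO zip_terr_stage VALUES ('97201', 'T2', 'Portland', 'OR');
""")

# Merge step: run once after the ETL job has filled the staging table.
# "WHERE true" is needed by SQLite's parser when an upsert follows a SELECT.
con.execute("""
    INSERT INTO zip_terr (zip_code, territory_code, territory_name, state)
    SELECT zip_code, territory_code, territory_name, state
    FROM zip_terr_stage WHERE true
    ON CONFLICT (zip_code, territory_code) DO UPDATE SET
        territory_name = excluded.territory_name,
        state = excluded.state
""")
con.execute("DELETE FROM zip_terr_stage")  # empty the stage for the next run
con.commit()

rows = con.execute("SELECT * FROM zip_terr ORDER BY zip_code").fetchall()
print(rows)
```

The existing key is updated in place and the new key is inserted, all in a single statement that could be issued from whatever runs after the Glue job.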

A good blobstore / memcache solution

We are setting up a data warehousing and mining project on a Linux cloud server. The primary language is Python.
We would like to use this pattern for querying and storing data:
SQL Database - The SQL database is used to query the data. However, it stores only the fields that need to be searched on; it does NOT store the "blob" of data itself. Instead it stores a key that references the full "blob" of data in a key-value Blobstore.
Blobstore - A key-value Blobstore is used to store actual "documents" or "blobs" of data.
The issue that we are having is that we would like more frequently accessed blobs of data to be automatically stored in RAM. We were planning to use Redis for this. However, we would like a solution that automatically tries to get the data out of RAM first, if it can't find it there, then it goes to the blobstore.
Is there a good library or ready-made solution for this that we can use without rolling our own? Also, any comments and criticisms about the proposed architecture would also be appreciated.
Thanks so much!
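The lookup order described in the question (RAM first, then the blobstore) is often called a read-through cache. The sketch below shows the idea with plain Python objects: an OrderedDict stands in for Redis or Memcached and a dict stands in for the blobstore. The class name ReadThroughCache and the sample keys are made up for illustration.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Keep the most recently used blobs in RAM; fall back to a slower store."""

    def __init__(self, backing_store, max_items=1000):
        self._store = backing_store      # any mapping-like blobstore client
        self._cache = OrderedDict()      # stands in for Redis/Memcached
        self._max_items = max_items

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as recently used
            return self._cache[key]
        blob = self._store[key]              # cache miss: read the blobstore
        self._cache[key] = blob
        if len(self._cache) > self._max_items:
            self._cache.popitem(last=False)  # evict the least recently used
        return blob


blobstore = {"doc-1": b"first document", "doc-2": b"second document"}
cache = ReadThroughCache(blobstore, max_items=1)
first = cache.get("doc-1")    # miss, loaded from the blobstore
again = cache.get("doc-1")    # hit, served from RAM
cache.get("doc-2")            # miss, evicts doc-1 because max_items is 1
```

A real deployment would also need to decide how cached entries are invalidated when a blob changes, which this sketch leaves out.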
Rather than using Redis or Memcached for caching, plus a "blobstore" package to store things on disk, I would suggest to have a look at Couchbase Server which does exactly what you want (i.e. serving hot blobs from memory, but still storing them to disk).
In the company I work for, we commonly use the pattern you described (i.e. indexing in a relational database, plus blob storage) for our archiving servers (terabytes of data). It works well when the I/O done to write the blobs are kept sequential. The blobs are never rewritten, but simply appended at the end of a file (it is fine for an archiving application).
The same approach has been also used by others. For instance:
Bitcask (used in Riak): http://downloads.basho.com/papers/bitcask-intro.pdf
Eblob (used in Elliptics project): http://doc.ioremap.net/eblob:eblob
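A minimal sketch of the append-only pattern described above, assuming a single data file plus a SQLite index that records where each blob starts and how long it is. The class name, file names, and column names are invented for the example, and it ignores concurrency, compaction, and crash recovery, which systems like Bitcask handle.

```python
import os
import sqlite3
import tempfile


class AppendOnlyBlobStore:
    """Blobs are appended to one file; SQLite stores key -> (start, length)."""

    def __init__(self, directory):
        self._data_path = os.path.join(directory, "blobs.dat")
        self._index = sqlite3.connect(os.path.join(directory, "index.sqlite"))
        self._index.execute(
            "CREATE TABLE IF NOT EXISTS blobs "
            "(blob_key TEXT PRIMARY KEY, start INTEGER, length INTEGER)"
        )

    def put(self, key, blob):
        with open(self._data_path, "ab") as fh:
            fh.seek(0, os.SEEK_END)   # make sure tell() reports the end of file
            start = fh.tell()
            fh.write(blob)            # sequential write, nothing is rewritten
        with self._index:             # commits the index row
            self._index.execute(
                "INSERT INTO blobs VALUES (?, ?, ?)", (key, start, len(blob))
            )

    def get(self, key):
        row = self._index.execute(
            "SELECT start, length FROM blobs WHERE blob_key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        start, length = row
        with open(self._data_path, "rb") as fh:
            fh.seek(start)
            return fh.read(length)

    def close(self):
        self._index.close()


with tempfile.TemporaryDirectory() as tmpdir:
    store = AppendOnlyBlobStore(tmpdir)
    store.put("a", b"hello")
    store.put("b", b"world!")
    result_a = store.get("a")
    result_b = store.get("b")
    store.close()
```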
Any SQL database will work for the first part. The Blobstore could also be obtained, essentially, "off the shelf" by using cbfs. This is a new project, built on top of couchbase 2.0, but it seems to be in pretty active development.
CouchBase already tries to serve results out of RAM cache before checking disk, and is fully distributed to support large data sets.
CBFS puts a filesystem on top of that, and already there is a FUSE module written for it.
Since filesystems are effectively the lowest common denominator, it should be really easy for you to access it from Python, and it would reduce the amount of custom code you need to write.
Blog post:
http://dustin.github.com/2012/09/27/cbfs.html
Project Repository:
https://github.com/couchbaselabs/cbfs

How to load existing db file to memory in Python sqlite3?

I have an existing sqlite3 db file on which I need to make some extensive calculations. Doing the calculations from the file is painfully slow, and as the file is not large (~10 MB), there should be no problem loading it into memory.
Is there a Pythonic way to load the existing file into memory in order to speed up the calculations?
Here is the snippet that I wrote for my flask application:
import sqlite3
from io import StringIO

def init_sqlite_db(app):
    # Read database to tempfile
    con = sqlite3.connect(app.config['SQLITE_DATABASE'])
    tempfile = StringIO()
    for line in con.iterdump():
        tempfile.write('%s\n' % line)
    con.close()
    tempfile.seek(0)
    # Create a database in memory and import from tempfile
    app.sqlite = sqlite3.connect(":memory:")
    app.sqlite.cursor().executescript(tempfile.read())
    app.sqlite.commit()
    app.sqlite.row_factory = sqlite3.Row
What about sqlite3.Connection.backup(...)? "This method makes a backup of a SQLite database even while it’s being accessed by other clients, or concurrently by the same connection." Availability: SQLite 3.6.11 or higher. New in version 3.7.
import sqlite3
source = sqlite3.connect('existing_db.db')
dest = sqlite3.connect(':memory:')
source.backup(dest)
sqlite3.Connection.iterdump "[r]eturns an iterator to dump the database in an SQL text format. Useful when saving an in-memory database for later restoration. This function provides the same capabilities as the .dump command in the sqlite3 shell."
Get such an iterator and dump the disk-based database into a memory-based one, and you're ready to compute. When the computation is done, just dump the other way around back to disk.
First you should try and find out what is causing the slowness you are observing. Are you writing to tables? Are your writes within large enough transactions so that you don't save needless temporary results to disk? Can you change writes to go to temporary tables (with pragma temp_store=memory)? Can you live with pragma synchronous=off?
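The tuning questions above can be tried from Python's sqlite3 module before moving anything into memory. The sketch below sets the two pragmas mentioned and wraps all writes in one transaction. The file name work.db and the results table are placeholders, and synchronous=OFF trades durability for speed, so it only suits data you can regenerate.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    con = sqlite3.connect(os.path.join(tmpdir, "work.db"))
    con.execute("PRAGMA temp_store = MEMORY")   # temp tables and indices live in RAM
    con.execute("PRAGMA synchronous = OFF")     # skip fsync; unsafe on power loss
    con.execute("CREATE TABLE results (n INTEGER, square INTEGER)")

    # One transaction around all the writes instead of one per statement.
    with con:
        con.executemany(
            "INSERT INTO results VALUES (?, ?)",
            ((n, n * n) for n in range(10000)),
        )

    # Read the settings back to confirm they were applied.
    temp_store = con.execute("PRAGMA temp_store").fetchone()[0]
    synchronous = con.execute("PRAGMA synchronous").fetchone()[0]
    row_count = con.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    con.close()
```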
I don't think this functionality is exposed in the Python module, but sqlite has a backup API that sounds like exactly what you are asking for: a way to copy from one database to another (either one of which may be an in-memory database) that works pretty much automatically without any user-visible enumeration of tables. (Maybe APSW exposes this?)
Another option is to create a ram disk (if you have sufficient control of the environment) and copy the file there.
If we must use a Python wrapper, there is no better solution than the two-pass read-and-write approach.
However, beginning with version 3.7.17, SQLite has the option of accessing disk content directly using memory-mapped I/O (see sqlite mmap).
If you want to use mmap, you have to use the C interface, since no wrapper provides it.
There is also a hardware solution, the memory disk, which gives you convenient file I/O at the speed of memory.
This has already been answered before, including code examples at In python, how can I load a sqlite db completely to memory before connecting to it?
You do not mention operating system, but one gotcha of Windows XP is that it defaults to a 10MB file cache, no matter how much memory you have. (This made sense in the days when systems came with 64MB etc). This message has several links:
http://marc.info/?l=sqlite-users&m=116743785223905&w=2
Here is a relatively simple way to read a SQLite db into memory. Depending on how you prefer to manipulate data, you either work with a Pandas dataframe or write your table to an in-memory sqlite3 database. Similarly, after manipulating your data you use the same df.to_sql approach to store your results back into a db table.
import sqlite3 as lite
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')  # in-memory SQLite database
con = lite.connect('ait.sqlite', isolation_level=None)  # Here is the connection to <ait.sqlite> residing on disk
# TABLE is an SQL keyword, so the placeholder table name has to be quoted
sqlx = 'SELECT * FROM "Table"'
# Read SQLite table into a pandas dataframe
df = pd.read_sql(sqlx, con, coerce_float=True, params=None)
# Write the dataframe to the in-memory database
# (the flavor argument of the original snippet was removed from pandas)
df.to_sql(name='Table', con=engine, if_exists='replace')
With the solution of Cenk Alti, I always got a MemoryError with Python 3.7 when the process reached 500MB. Only with the backup functionality of sqlite3 (mentioned by thinwybk) was I able to load and save bigger SQLite databases. You can also do the same with just 3 lines of code, both ways.
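A minimal sketch of the round trip with sqlite3.Connection.backup (Python 3.7+): load from disk into memory, work there, then save back. The temporary file and the table t are set up only so the example runs on its own; in practice the path would point at your existing database.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "existing_db.db")
    # Set up a small on-disk database to stand in for the real file.
    disk = sqlite3.connect(path)
    disk.execute("CREATE TABLE t (x INTEGER)")
    disk.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
    disk.commit()

    # Load: disk -> memory.
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)

    # Work on the in-memory copy.
    mem.execute("UPDATE t SET x = x * 10")
    mem.commit()

    # Save: memory -> disk (overwrites the contents of the disk database).
    mem.backup(disk)
    saved = disk.execute("SELECT x FROM t ORDER BY x").fetchall()
    disk.close()
    mem.close()
```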
The answers of thinwybk and Crooner are both excellent.
When you have multiple connections to :memory: SQLite databases, for instance when using SQLAlchemy together with the source.backup(dest) function, you may end up placing the backup into the wrong in-memory DB.
This can be fixed using a proper connection string: https://stackoverflow.com/a/65429612/1617295
and does not involve any hack nor use of undocumented features.
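One documented way to make several connections refer to the same in-memory database is SQLite's URI filename with mode=memory and cache=shared. Whether that is exactly the connection string the linked answer uses is an assumption here, and the database name calc_db is arbitrary. The sketch shows that two connections opened with the same URI see the same data, while a plain :memory: connection gets its own private database.

```python
import sqlite3

URI = "file:calc_db?mode=memory&cache=shared"  # "calc_db" is an arbitrary name

first = sqlite3.connect(URI, uri=True)
second = sqlite3.connect(URI, uri=True)
private = sqlite3.connect(":memory:")  # a separate, unrelated in-memory DB

first.execute("CREATE TABLE t (x INTEGER)")
first.execute("INSERT INTO t VALUES (42)")
first.commit()

# The second connection sees what the first one committed.
shared_rows = second.execute("SELECT x FROM t").fetchall()
# The plain :memory: connection has no tables at all.
private_tables = private.execute("SELECT name FROM sqlite_master").fetchall()
```

The shared database lives only as long as at least one connection to it stays open.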
sqlite supports in-memory databases.
In python, you would use a :memory: database name for that.
Perhaps you could open two databases (one from the file, an empty one in-memory), migrate everything from the file database into memory, then use the in-memory database further to do calculations.

how to generate various database dumps

I have a CSV file and want to generate dumps of the data for sqlite, mysql, postgres, oracle, and mssql.
Is there a common API (ideally Python based) to do this?
I could use an ORM to insert the data into each database and then export dumps, however that would require installing each database. It also seems a waste of resources - these CSV files are BIG.
I am wary of trying to craft the SQL myself because of the variations with each database. Ideally someone has already done this hard work, but I haven't found it yet.
SQLAlchemy is a database library that (as well as ORM functionality) supports SQL generation in the dialects of all the different databases you mention (and more).
In normal use, you could create a SQL expression / instruction (using a schema.Table object), create a database engine, and then bind the instruction to the engine, to generate the SQL.
However, the engine is not strictly necessary; the dialects each have a compiler that can generate the SQL without a connection; the only caveat being that you need to stop it from generating bind parameters as it does by default:
from sqlalchemy.sql import expression, compiler
from sqlalchemy import schema, types
import csv

# example for mssql
from sqlalchemy.dialects.mssql import base
dialect = base.dialect()
compiler_cls = dialect.statement_compiler

class NonBindingSQLCompiler(compiler_cls):
    # Note: overrides a private compiler hook, so this depends on the SQLAlchemy version
    def _create_crud_bind_param(self, col, value, required=False):
        # Don't do what we're called; return a literal value rather than binding
        return self.render_literal_value(value, col.type)

recipe_table = schema.Table(
    "recipe", schema.MetaData(),
    schema.Column("name", types.String(50), primary_key=True),
    schema.Column("culture", types.String(50)))

for row in [{"name": "fudge", "culture": "america"}]:  # csv.DictReader(open("x.csv", "r")):
    insert = expression.insert(recipe_table, row, inline=True)
    c = NonBindingSQLCompiler(dialect, insert)
    c.compile()
    sql = str(c)
    print(sql)
The above example actually works; it assumes you know the target database table schema; it should be easily adaptable to import from a CSV and generate for multiple target database dialects.
I am no database wizard, but AFAIK in Python there's not a common API that would do out-of-the-box what you ask for. There is PEP 249 that defines an API that should be used by modules accessing DB's and that AFAIK is used at least by the MySQL and Postgre python modules (here and here) and that perhaps could be a starting point.
The road I would attempt to follow myself - however - would be another one:
Import the CSV into MySQL (this is just because MySQL is the one I know best and there are tons of material on the net, as for example this very easy recipe, but you could do the same procedure starting from another database).
Generate the MySQL dump.
Process the MySQL dump file in order to modify it to meet SQLite (and others) syntax.
The scripts for processing the dump file could be very compact, although they might somehow be tricky if you use regex for parsing the lines. Here's an example script MySQL → SQLite that I simply pasted from this page:
#!/bin/sh
mysqldump --compact --compatible=ansi --default-character-set=binary mydbname |
grep -v ' KEY "' |
grep -v ' UNIQUE KEY "' |
perl -e 'local $/;$_=<>;s/,\n\)/\n\)/gs;print "begin;\n";print;print "commit;\n"' |
perl -pe '
if (/^(INSERT.+?)\(/) {
$a=$1;
s/\\'\''/'\'\''/g;
s/\\n/\n/g;
s/\),\(/\);\n$a\(/g;
}
' |
sqlite3 output.db
You could write your script in Python (in which case you should have a look at re.compile for performance).
The rationale behind my choice would be:
I get the heavy-lifting [importing and therefore data consistency checks + generating starting SQL file] done for me by mysql
I only have to have one database installed.
I have full control on what is happening and the possibility to fine-tune the process.
I can structure my script in such a way that it will be very easy to extend it for other databases (basically I would structure it like a parser that recognises individual fields + a set of grammars - one for each database - that I can select via command-line option)
There is much more documentation on the differences between SQL flavours than on single DB import/export libraries.
EDIT: A template-based approach
If for any reason you don't feel confident enough to write the SQL yourself, you could use a sort of template-based script. Here's how I would do it:
Import and generate a dump of the table in all the 4 DB you are planning to use.
For each DB save the initial part of the dump (with the schema declaration and all the rest) and a single insert instruction.
Write a Python script that - for each DB export - will output the "header" of the dump plus the same "saved line", into which you will programmatically substitute the values for each line in your CSV file.
The obvious drawback of this approach is that your "template" will only work for one table. The strongest point of it is that writing such script would be extremely easy and quick.
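The template idea can be sketched with the standard library alone. The version below only generates INSERT lines (no schema header) and treats every CSV value as a string. The dialect table, the helper names, and the choice to turn empty fields into NULL are assumptions for illustration; a real dump would also need per-database handling of types, dates, and (for MySQL) backslashes.

```python
import csv
import io

# Identifier quoting differs per database; values use standard SQL string
# literals (single quotes, doubled to escape an embedded quote).
IDENTIFIER_QUOTES = {
    "sqlite": ('"', '"'),
    "postgres": ('"', '"'),
    "oracle": ('"', '"'),
    "mysql": ("`", "`"),
    "mssql": ("[", "]"),
}


def sql_literal(value):
    """Render one CSV field as an SQL string literal (empty field -> NULL)."""
    if value is None or value == "":
        return "NULL"
    return "'" + value.replace("'", "''") + "'"


def insert_statements(csv_file, table, dialect):
    """Yield one INSERT statement per CSV row for the given dialect."""
    left, right = IDENTIFIER_QUOTES[dialect]
    for row in csv.DictReader(csv_file):
        columns = ", ".join(f"{left}{name}{right}" for name in row)
        values = ", ".join(sql_literal(v) for v in row.values())
        yield f"INSERT INTO {left}{table}{right} ({columns}) VALUES ({values});"


sample = io.StringIO("name,culture\nfudge,america\nO'Brien stew,ireland\n")
mysql_dump = list(insert_statements(sample, "recipe", "mysql"))
sample.seek(0)
mssql_dump = list(insert_statements(sample, "recipe", "mssql"))
for line in mysql_dump + mssql_dump:
    print(line)
```

Because the function is a generator, a large CSV file can be streamed to the output file row by row without being held in memory.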
HTH at least a bit!
You could do this - Create SQL tables from CSV files
or Generate Insert Statements from CSV file
or try this Generate .sql from .csv python
Of course you might need to tweak the scripts mentioned to suit your needs.

cx_Oracle and the data source paradigm

There is a Java paradigm for database access implemented in the Java DataSource. This object creates a useful abstraction around the creation of database connections. The DataSource object keeps database configuration, but will only create database connections on request. This allows you to keep all database configuration and initialization code in one place, and makes it easy to change the database implementation or use a mock database for testing.
I am currently working on a Python project which uses cx_Oracle. In cx_Oracle, one gets a connection directly from the module:
import cx_Oracle as dbapi
connection = dbapi.connect(connection_string)
# At this point I am assuming that a real connection has been made to the database.
# Is this true?
I am trying to find a parallel to the DataSource in cx_Oracle. I can easily create this by creating a new class and wrapping cx_Oracle, but I was wondering if this is the right way to do it in Python.
You'll find relevant information of how to access databases in Python by looking at PEP-249: Python Database API Specification v2.0. cx_Oracle conforms to this specification, as do many database drivers for Python.
In this specification a Connection object represents a database connection, but there is no built-in pooling. Tools such as SQLAlchemy do provide pooling facilities, and although SQLAlchemy is often billed as an ORM, it does not have to be used as such and offers nice abstractions for use on top of SQL engines.
If you do want to do object-relational-mapping, then SQLAlchemy does the business, and you can consider either its own declarative syntax or another layer such as Elixir which sits on top of SQLAlchemy and provides increased ease of use for more common use cases.
I don't think there is a "right" way to do this in Python, except maybe to go one step further and use another layer between yourself and the database.
Depending on the reason for wanting to use the DataSource concept (which I've only ever come across in Java), SQLAlchemy (or something similar) might solve the problems for you, without you having to write something from scratch.
If that doesn't fit the bill, writing your own wrapper sounds like a reasonable solution.
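If you do write your own wrapper, it can be quite small, because every PEP 249 driver exposes a module-level connect function. The sketch below is one possible shape, not an established Python idiom: the class name DataSource and the method get_connection are borrowed from the Java naming, and sqlite3 stands in for cx_Oracle so the example runs without a database server.

```python
import sqlite3


class DataSource:
    """Hold connection settings; open a DB-API connection only on request."""

    def __init__(self, connect, *args, **kwargs):
        self._connect = connect      # e.g. cx_Oracle.connect or sqlite3.connect
        self._args = args
        self._kwargs = kwargs

    def get_connection(self):
        return self._connect(*self._args, **self._kwargs)


# Production code might build DataSource(cx_Oracle.connect, connection_string);
# a test can swap in SQLite without touching the callers.
data_source = DataSource(sqlite3.connect, ":memory:")

connection = data_source.get_connection()   # the connection is created here
cursor = connection.cursor()
cursor.execute("SELECT 1 + 1")
result = cursor.fetchone()[0]
connection.close()
```

Callers depend only on the DataSource object, so configuration stays in one place and the driver can be replaced for tests.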
Yes, Python has a similar abstraction.
This is from our local build regression test, where we assure that we can talk to all of our databases whenever we build a new python.
if database == SYBASE:
    import Sybase
    conn = Sybase.connect('sybasetestdb','mh','secret')
elif database == POSTRESQL:
    import pgdb
    conn = pgdb.connect('pgtestdb:mh:secret')
elif database == ORACLE:
    import cx_Oracle
    conn = cx_Oracle.connect("mh/secret@oracletestdb")
curs = conn.cursor()
curs.execute('select a,b from testtable')
for row in curs.fetchall():
    print(row)
(note, this is the simple version, in our multidb-aware code we have a dbconnection class that has this logic inside.)
I just sucked it up and wrote my own. It allowed me to add things like abstracting the database (Oracle/MySQL/Access/etc), adding logging, error handling with transaction rollbacks, etc.
