I've been given the task of writing an ETL (Extract, Transform, Load) process from a PostgreSQL 9.1 database hosted on Heroku (call it the Master) to another, application-purposed copy of the data that will live in a second Heroku (Cedar stack) hosted PostgreSQL database. Our primary development stack is Python 2.7.2, Django 1.3.3 and PostgreSQL 9.1. As many of you may know, the filesystem on Heroku is limited in what you can do, and I'm not sure I completely understand the rules of the ephemeral filesystem.
So, I'm trying to figure out what my options are here. The obvious one is that I can just write a Django management command with two separate database connections (and a source and destination set of models) and pump the data over that way, handling the ETL in the process. While effective, my initial tests show this is a very slow approach. Obviously, a faster approach would be to use PostgreSQL's COPY functionality. But normally, if I were doing this, I would write the data out to a file and then use psql to pull it in. Has anyone done anything like this between two dedicated PostgreSQL databases on Heroku? Any advice or tips would be appreciated.
One solution may be to do the whole ETL process in Postgres land. That is, use the dblink extension to pull data from the source database into the target database. This may or may not be sufficient, but it's worth investigating.
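As a rough sketch of that approach (the table, column types, and connection details below are all made up), you could drive the dblink pull from Python with psycopg2 against the target database:

import psycopg2

# Hypothetical connection details; on Heroku these would come from config vars.
TARGET_DSN = "dbname=target_db user=target_user host=target-host password=secret"
SOURCE_DSN = "dbname=source_db user=source_user host=source-host password=secret"

conn = psycopg2.connect(TARGET_DSN)
cur = conn.cursor()
# dblink must be enabled on the target database.
cur.execute("CREATE EXTENSION IF NOT EXISTS dblink")
# Pull rows from the source and load them into a target table in one statement,
# transforming as needed in the SELECT.
cur.execute(
    """
    INSERT INTO reporting_orders (id, total, created_at)
    SELECT id, total, created_at
    FROM dblink(%s, 'SELECT id, total, created_at FROM orders')
         AS t(id integer, total numeric, created_at timestamptz)
    """,
    [SOURCE_DSN],
)
conn.commit()
cur.close()
conn.close()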
You are free to use the filesystem on a Heroku dyno, but I don't think this is a bulletproof solution. The way it works is that you can write to the filesystem just fine, but as soon as that process exits, the data goes away with it. The size of that filesystem is not guaranteed, but it is quite large, so it should be enough unless you need multiple hundreds of GBs' worth of storage.
Finally, you can speed up parts of the process by tuning some session-level Postgres settings. Instead of listing them all here, read up on them in the excellent Postgres docs.
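For example (the values are illustrative, and which settings are safe depends on your plan and durability requirements), you might set a few of these per session before a bulk load:

import psycopg2

conn = psycopg2.connect("dbname=target_db user=target_user host=target-host password=secret")  # hypothetical
cur = conn.cursor()
# Session-level settings that commonly help bulk loads; they reset when the session ends.
cur.execute("SET synchronous_commit TO off")        # don't wait for WAL flush on every commit
cur.execute("SET work_mem TO '64MB'")               # more memory for sorts/hashes in this session
cur.execute("SET maintenance_work_mem TO '256MB'")  # faster index rebuilds after the load
# ... run the load here ...
conn.commit()
cur.close()
conn.close()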
EDIT: We now support the Postgres FDW, a better alternative to dblink: http://www.postgresql.org/docs/current/static/postgres-fdw.html
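For reference, the FDW route looks roughly like this (the server, credential, and table names are invented); once the foreign table exists, the load is a plain INSERT ... SELECT:

import psycopg2

conn = psycopg2.connect("dbname=target_db user=target_user host=target-host password=secret")  # hypothetical
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw")
cur.execute("""
    CREATE SERVER source_server FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'source-host', dbname 'source_db', port '5432')
""")
cur.execute("""
    CREATE USER MAPPING FOR CURRENT_USER SERVER source_server
        OPTIONS (user 'source_user', password 'secret')
""")
# Expose a source table locally; after this it can be queried like any other table.
cur.execute("""
    CREATE FOREIGN TABLE source_orders (id integer, total numeric, created_at timestamptz)
        SERVER source_server OPTIONS (schema_name 'public', table_name 'orders')
""")
cur.execute("INSERT INTO reporting_orders SELECT id, total, created_at FROM source_orders")
conn.commit()
cur.close()
conn.close()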
By locking, I don't mean the Object Lock S3 makes available. I'm talking about the following situation:
I have multiple (Python) processes that read and write to a single file hosted on S3; maybe the file is an index of sorts that needs to be updated periodically.
The processes run in parallel, so I want to make sure only a single process can ever write to the file at a given time (to avoid concurrent writes clobbering the data).
If I were writing this to a shared filesystem, I could just use flock as a way to synchronize access to the file, but I can't do that on S3, as far as I can tell.
What is the easiest way to lock files on AWS S3?
Unfortunately, AWS S3 does not offer a native way of locking objects - there's no flock analogue, as you pointed out. Instead, you have a few options:
Use a database
For example, Postgres offers advisory locks. When setting this up, you will need to do the following:
Make sure all processes can access the database.
Make sure the database can handle the incoming connections (if you're running some kind of large processing grid, you may want to put your Postgres instance behind PgBouncer).
Be careful that you do not close the session from the client before you're done with the lock.
There are a few other caveats you need to consider when using advisory locks - from the Postgres documentation:
Both advisory locks and regular locks are stored in a shared memory pool whose size is defined by the configuration variables max_locks_per_transaction and max_connections. Care must be taken not to exhaust this memory or the server will be unable to grant any locks at all. This imposes an upper limit on the number of advisory locks grantable by the server, typically in the tens to hundreds of thousands depending on how the server is configured.
In certain cases using advisory locking methods, especially in queries involving explicit ordering and LIMIT clauses, care must be taken to control the locks acquired because of the order in which SQL expressions are evaluated.
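A minimal sketch of the advisory-lock approach with psycopg2 (the connection string and lock key are placeholders; pg_advisory_lock blocks until the lock is free, and the lock is tied to the open session):

import contextlib
import psycopg2

DSN = "dbname=locks user=app host=db-host password=secret"  # hypothetical

@contextlib.contextmanager
def s3_object_lock(lock_key):
    # Keep this connection open for as long as you hold the lock.
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("SELECT pg_advisory_lock(%s)", (lock_key,))
    try:
        yield
    finally:
        cur.execute("SELECT pg_advisory_unlock(%s)", (lock_key,))
        cur.close()
        conn.close()

# Only one process at a time gets past this line for the same key.
with s3_object_lock(42):
    pass  # read-modify-write the S3 object here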
Use an external service
I've seen people use something like lockable to solve this issue. From their docs, they seem to have a Python library:
$ pip install lockable-dev
from lockable import Lock
with Lock('my-lock-name'):
    # do stuff
If you're not using Python, you can still use their service by hitting some HTTP endpoints:
curl https://api.lockable.dev/v1/acquire/my-lock-name
curl https://api.lockable.dev/v1/release/my-lock-name
I am running two Python files in parallel on one CPU, both of which use the same SQLite database. I am handling the SQLite database using SQLAlchemy, and my understanding is that SQLAlchemy handles all the threading issues within one app. My question is how to handle access from the two different apps?
One of my two programs is a Flask application and the other is a cron job which updates the database from time to time.
It seems that even read-only tasks on the SQLite database lock the database, meaning that if both apps want to read or write at the same time, I get an error:
OperationalError: (sqlite3.OperationalError) database is locked
Let's assume that my cron job runs every 5 minutes. How can I make sure there are no collisions between my two apps? I could write a read flag to a file and check it before accessing the database, but it seems to me there should be a standard way to do this?
Furthermore, I am running my app with Gunicorn, so in principle it is possible to have multiple workers running... I eventually want more than two parallel jobs for my Flask app...
thanks
carl
It's true. SQLite isn't built for this kind of application. SQLite is really for lightweight, single-threaded, single-instance applications.
SQLite connections are one per instance, and while you could get it working with some kind of threaded multiplexer (see https://www.sqlite.org/threadsafe.html), it's more trouble than it's worth. There are other solutions that provide that functionality: take a look at PostgreSQL or MySQL. Those databases are open source, well documented, well supported, and support the kind of concurrency you need.
I'm not sure how SQLAlchemy handles connections, but if you were using Peewee ORM then the solution is quite simple.
When your Flask app receives a request, open a connection to the DB; then, when Flask sends the response, close the connection.
Similarly, in your cron script, open a connection when you start to use the DB, then close it when the process is finished.
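A minimal sketch of the request-scoped half of that pattern with Flask and Peewee (the database filename is made up):

from flask import Flask
from peewee import SqliteDatabase

app = Flask(__name__)
db = SqliteDatabase("app.db")  # hypothetical path

@app.before_request
def open_db_connection():
    db.connect(reuse_if_open=True)   # one connection per request

@app.teardown_request
def close_db_connection(exc):
    if not db.is_closed():
        db.close()                   # release the connection (and any SQLite locks)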
Another thing you might consider is using SQLite in WAL mode. This can improve concurrency. You set the journaling mode with a PRAGMA query when you open your connection.
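If you stick with SQLAlchemy, one way to apply the PRAGMA on every new connection is an event listener (a sketch; the filename and timeout value are made up):

from sqlalchemy import create_engine, event

engine = create_engine("sqlite:///app.db")  # hypothetical path

@event.listens_for(engine, "connect")
def set_sqlite_pragmas(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA journal_mode=WAL")   # readers no longer block the writer
    cursor.execute("PRAGMA busy_timeout=5000")  # wait up to 5s instead of raising "database is locked"
    cursor.close()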
For more info, see http://charlesleifer.com/blog/sqlite-small-fast-reliable-choose-any-three-/
I'm working on a distributed system where one process controls a piece of hardware, and I want it to run as a service. My app is Django + Twisted based, so Twisted maintains the main loop and I access the database (SQLite) through Django, the entry point being a Django management command.
On the other hand, for the user interface, I am writing a web application in the same Django project on the same database (also using Crossbar as the WebSocket and WAMP server). This is a second Django process accessing the same database.
I'm looking for some validation here. Is anything fundamentally wrong with this approach? I'm particularly scared of issues with the database (two different processes accessing it via the Django ORM).
Consider that a Django site, like any WSGI-deployed application, almost always has multiple processes accessing the database. Because a single WSGI worker handles only one request at a time, it's normal for servers to run multiple processes in parallel when they get any significant amount of traffic.
That doesn't mean there's no cause for concern. You have to treat the database as if data might change between any two calls to it. Familiarize yourself with how Django uses transactions (the default is autocommit mode, not atomic requests), and …
and oh, you said SQLite. Yeah, SQLite is probably not the best database to use when you need to write to it from multiple processes. I can imagine that might work for a single-user interface to a piece of hardware, but if you run into any problems when adding the web app, you'll want to trade up to a database server like PostgreSQL.
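To illustrate the transaction point, a sketch with a made-up model (select_for_update needs a database with real row locks, which is another reason to move off SQLite):

from django.db import transaction
from myapp.models import Device  # hypothetical model

def claim_device(device_id):
    # Re-read and update inside one transaction so the other process
    # can't change the row between the read and the write.
    with transaction.atomic():
        device = Device.objects.select_for_update().get(pk=device_id)
        if device.available:
            device.available = False
            device.save()
            return True
        return False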
No, there is nothing inherently wrong with that approach. We currently use a similar approach for a lot of our work.
I'd like people's views on the current design I'm considering for a Tornado app. Although I'm using MongoDB to store permanent information, I currently keep the session information in a Python data structure that I've simply attached to the Application object at initialisation.
I will need to perform some iteration and manipulation of the sessions while the server is running. I keep debating whether to move these into MongoDB as well or just keep them as a Python structure.
Is there anything wrong with keeping session information this way?
If you store session data in Python, your application will:
lose it if you stop the Python process;
likely consume more memory, as Python isn't very efficient in memory management (and you will have to store all the sessions in memory, not just the ones you need right now).
If these are not problems for you, you can go with Python structures. But usually these are serious concerns, and most projects use some external storage for sessions.
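Since you already have MongoDB in the project, a minimal sketch of session storage there with pymongo (the URI, database, collection, and field names are made up):

from pymongo import MongoClient

sessions = MongoClient("mongodb://localhost:27017")["myapp"]["sessions"]  # hypothetical

def save_session(session_id, data):
    # Upsert so new and existing sessions are handled the same way.
    sessions.update_one({"_id": session_id}, {"$set": {"data": data}}, upsert=True)

def load_session(session_id):
    doc = sessions.find_one({"_id": session_id})
    return doc["data"] if doc else None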
First, the server setup:
nginx frontend to the world
gunicorn running a Flask app with gevent workers
Postgres database, connection pooled in the app, running from Amazon RDS, connected with psycopg2 patched to work with gevent
The problem I'm encountering is inexplicably slow queries that are sometimes running on the order of 100ms or so (ideal), but which often spike to 10s or more. While time is a parameter in the query, the difference between the fast and slow query happens much more frequently than a change in the result set. This doesn't seem to be tied to any meaningful spike in CPU usage, memory usage, read/write I/O, request frequency, etc. It seems to be arbitrary.
I've tried:
Optimizing the query - definitely valid, but it runs quite well locally, as well as any time I've tried it directly on the server through psql.
Running on a larger/better RDS instance - I'm currently working on an m3.medium instance with PIOPS and not coming close to that read rate, so I don't think that's the issue.
Tweaking the number of gunicorn workers - I thought this could be an issue, if the psycopg2 driver is having to context switch excessively, but this had no effect.
More - I've been working for a decent amount of time at this, so these were just a couple of the things I've tried.
Does anyone have ideas about how to debug this problem?
This is what shared tenancy gets you: unpredictable results.
What is the size of the data set the queries run on? Although Craig says it sounds like bursty checkpoint activity, that doesn't make sense because this is RDS. It sounds more like cache fallout, e.g., your relations are falling out of cache.
You say you are running PIOPS, but m3.medium is not an EBS-optimized instance.
You need at least:
A higher instance class. Make sure your memory is larger than the active data set.
EBS optimized instances, see here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
Lots of memory.
PIOPS
By the time you have all of that, you will realize you would save a ton of money pushing PostgreSQL (or any database) onto bare metal and leaving AWS to what it is good at: memory and CPU (not I/O).
You could try this from within psql to get the query plan plus actual execution timing:
EXPLAIN (ANALYZE, BUFFERS) sql_statement
Also turn on more database logging. MySQL has a slow query log; PostgreSQL's equivalent is the log_min_duration_statement setting, which on RDS you can change in the DB parameter group.