How can I lock files on AWS S3?

How can I lock files on AWS S3? - python

By locking, I don't mean the Object Lock S3 makes available. I'm talking about the following situation:
I have multiple (Python) processes that read and write to a single file hosted on S3; maybe the file is an index of sorts that needs to be updated periodically.
The processes run in parallel, so I want to make sure only a single process can ever write to the file at a given time (to avoid concomitant write clobbering data).
If I was writing this to a shared filesystem, I could just ask use flock and use that as a way to sync access to the file, but I can't do that on S3 afaict.
What is the easiest way to lock files on AWS S3?

Unfortunately, AWS S3 does not offer a native way of locking objects - there's no flock analogue, as you pointed out. Instead you have a few options:
Use a database
For example, Postgres offers advisory locks. When setting this up, you will need to do the following:
Make sure all processes can access the database.
Make sure the database can handle the incoming connections (if you're running some type of large processing grid, then you may want to put your Postgres instance behind PGBouncer)
Be careful that you do not close the session from the client before you're done with the lock.
There are a few other caveats you need to consider when using advisory locks - from the Postgres documentation:
Both advisory locks and regular locks are stored in a shared memory pool whose size is defined by the configuration variables max_locks_per_transaction and max_connections. Care must be taken not to exhaust this memory or the server will be unable to grant any locks at all. This imposes an upper limit on the number of advisory locks grantable by the server, typically in the tens to hundreds of thousands depending on how the server is configured.
In certain cases using advisory locking methods, especially in queries involving explicit ordering and LIMIT clauses, care must be taken to control the locks acquired because of the order in which SQL expressions are evaluated
Use an external service
I've seen people use something like lockable to solve this issue. From their docs, they seem to have a Python library:
$ pip install lockable-dev
from lockable import Lock
with Lock('my-lock-name'):
#do stuff
If you're not using Python, you can still use their service by hitting some HTTP endpoints:
curl https://api.lockable.dev/v1/acquire/my-lock-name
curl https://api.lockable.dev/v1/release/my-lock-name

Related

Minimal example of uWSGI's shared memory?

Does anyone have a minimal working example of how to use uWSGI to share memory across requests in say Django?
I have a large file in proprietary format (not database-compatible) that I need to load for each request.
An instagram post got me thinking which states:
For the application server, we use uWSGI with pre-fork mode to leverage memory sharing between master and worker processes.
How would you set something like that up?

There are multiple ways to handle this:
Share by "abusing" copy-on-write for read-only data
If your data is read-only, you can leverage the fact that uWSGI is executing your python code to get the application before forking into multiple processes. This means all the data that is already loaded before the fork happens will be shared with all your processes.
This can be a great tool because you don't have to do anything dealing with multi-processing to enjoy this mechanism. But be careful, as soon as any process writes to this data, it will copy it first to get its own local version.
Django doesn't make it easy because all the views are lazy. This means django won't try to run code related to your view when application is created. Therefore to enjoy the pre-fork sharing you need to load the data in a code outside your views. For instance you can consider loading the data right before or after you built your application object (like in the gist linked by #john-strood).
Use uWSGI cache framework
If you need to write to this data, a first solution is to use uWSGI cache framework. It's fairly easy to use. You need to configure in advance how much memory you need, and then all your processes can read and write to it. You don't have to deal with locking or other multi-processing related issues.
The drawback is that you still incur IO latency between your processes and uWSGI cache's process. This is insignificant for tiny chunks of data, but would be prohibitive for gigabytes.
Use shared memory manually
As a last resort, if your data is not read-only and you need to load large chunks at all request, so large that even sending through a unix socket would take too long, then you need to load your data directly in the shared memory space. Here uWSGI won't help, and you will have to deal with locking and multi-processing issues yourself.
You can refer to multiprocessing's shared memory documentation.

Tornado Application design

I'd like people's views on current design I'm considering for a tornado app. Although I'm using mongoDB to store permanent information I currently have the session information as a python data structure that I've simply added within the Application object at initialisation.
I will need to perform some iteration and manipulation of the sessions while the server is running. I keep debating whether to move these to another mongoDB or just keep it as a python structure.
Is there anything wrong with keeping session information this way?

If you store session data in Python your apllication will:
loose it if you stop the Python process;
likely consume more memory as Python isn't very efficient in memory management (and you will have to store all the sessions in memory, not the ones you need right now).
If these are not problems for you you can go with Python structures. But usually these are serious concerns and most of the projects use some external storage for sessions.

Python: What to worry about when mutliple processes are concurrently writing to "shelve"?

For my python application I am thinking of using shelve, part of the standard library. There will be hundreds of processes, each writing something to the same shelve object. The writing will always be to add a new key,value pair to the shelve. The keys are unique, so no two processes will update the same entry.
What could go wrong in such a scenario?

The shelve documentation is explicit about this.
The shelve module does not support concurrent read/write access to
shelved objects. (Multiple simultaneous read accesses are safe.) When
a program has a shelf open for writing, no other program should have
it open for reading or writing. Unix file locking can be used to solve
this, but this differs across Unix versions and requires knowledge
about the database implementation used.
So, without process synchronisation, I wouldn't do it.
How are the processes started? If they are created by a master process then you can look at the multiprocessing module. Use a Queue to which the child processes write back their results, and have the master remove items from the queue and write them to the shelf. Example of this sort of this is at https://stackoverflow.com/a/24501437/21945.
If you have no process hierarchy then you'll need to use locking to control read and write access to the shelf file. If you are using Linux or similar you might use posix_ipc named semaphore.
The other obvious option is to use a database server - Postgresql or similar.

In your case you'd probably have better luck using a more robust kvp store, such as redis. It's pretty easy to setup a local redis service or a remote redis service (such as on AWS's ElastiCache service)

A good persistent synchronous queue in python

I don't immediately care about fifo or filo options, but it might be nice in the future..
What I'm looking for a is a nice fast simple way to store (at most a gig of data or tens of millions of entries) on disk that can be get and put by multiple processes. The entries are just simple 40 byte strings, not python objects. Don't really need all the functionality of shelve.
I've seen this http://code.activestate.com/lists/python-list/310105/
It looks simple. It needs to be upgraded to the new Queue version.
Wondering if there's something better? I'm concerned that in the event of a power interruption, the entire pickled file becomes corrupt instead of just one record.

Try using Celery. It's not pure python, as it uses RabbitMQ as a backend, but it's reliable, persistent and distributed, and, all in all, far better then using files or database in the long run.

I think that PyBSDDB is what you want. You can choose a queue as the access type. PyBSDDB is a Python module based on Oracle Berkeley DB.
It has synchronous access and can be accessed from different processes although I don't know if that is possible from the Python bindings. About multiple processes writing to the db I found this thread.

This is a very old question, but persist-queue seems to be a nice tool for this kind of task.
persist-queue implements a file-based queue and a serial of
sqlite3-based queues. The goals is to achieve following requirements:
Disk-based: each queued item should be stored in disk in case of any crash.
Thread-safe: can be used by multi-threaded producers and multi-threaded consumers.
Recoverable: Items can be read after process restart.
Green-compatible: can be used in greenlet or eventlet environment.
By default, persist-queue use pickle object serialization module to
support object instances. Most built-in type, like int, dict, list are
able to be persisted by persist-queue directly, to support customized
objects, please refer to Pickling and unpickling extension
types(Python2) and Pickling Class Instances(Python3)

Using files is not working?...
Use a journaling file system to recover from power interruptions. That's their purpose.

SQLite3 and Multiprocessing

I noticed that sqlite3 isn´t really capable nor reliable when i use it inside a multiprocessing enviroment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: Sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together or is there a reliable way to handle multi-connections with sqlite?

First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than
one thread.
On some operating systems, a database connection should
always be used in the same thread in which it was originally created.
See here for a more detailed information: Is SQLite thread-safe?

I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information
from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each
process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)" - perhaps you could try replacing that data fetching with some dummy data to ensure that it is really the sqlite3 connection causing you problems. At least in my tested case (running right now in another window) I found that multiple processes were able to all add through their own connection without issues but your description exactly matches the problem I'm having when two processes step on each other while going for the web API (very odd error actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.

sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.

If I had to build a system like the one you describe, using SQLITE, then I would start by writing an async server (using the asynchat module) to handle all of the SQLITE database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each others toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions, in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if then capability so that you could SELECT a record, check that a field is equal to X, and only then, UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.