I am creating an application (app A) in Python that listens on a port, receives NetFlow records, encapsulates them and securely sends them to another application (app B). App A also checks whether each record was successfully sent. If not, the record has to be saved; app A then waits a few seconds, tries to send it again, and so on. This is the important part: if sending fails, records must be stored, but meanwhile many more records can arrive and they need to be stored too. The ideal structure for that is a queue, but I need this queue to live in a file on disk. I found, for example, this code http://code.activestate.com/recipes/576642/ but it loads the full file into memory on open, which is exactly what I want to avoid. I have to assume that this file of records could grow to a couple of GB.
So my question is: what would you recommend for storing these records? It needs to handle a lot of data, but on the other hand it shouldn't be too slow, because during normal activity only one record is saved at a time and it is read and removed immediately, so the usual state is an empty queue. It also needs to be thread-safe.
Should I use a database (dbm, sqlite3, ...), something like pickle or shelve, or something else?
I am a little confused by this... thank you.
You can use Redis as the store for this. It is very fast, does queuing extremely well, and it can save its state to disk in several ways, depending on the fault-tolerance level you want. Because it is an external process, you may not even need a very strict saving policy: if your program crashes, everything is still held externally.
See the documentation here: http://redis.io/documentation. If you want more detailed info on how to do this in Redis, I'd be glad to elaborate.
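For instance, a Redis list already behaves like the disk-backed queue you describe. Here is a rough sketch using the redis-py client, assuming a local Redis instance; the key name netflow:pending is made up for illustration:

import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue_record(raw_record):
    # RPUSH appends to the tail of the list; Redis persists it to disk
    # according to whatever RDB/AOF policy you configure.
    r.rpush("netflow:pending", raw_record)

def dequeue_record(timeout=5):
    # BLPOP blocks up to `timeout` seconds waiting for the next record,
    # so an empty queue just means an idle wait, not busy polling.
    item = r.blpop("netflow:pending", timeout=timeout)
    return item[1] if item else None

RPUSH/BLPOP preserve ordering, and the RDB/AOF settings control how aggressively the list is flushed to disk.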
Related
I want to route some logs from my application to a database. Now, I know that this isn't exactly the ideal way to store logs, but my use case requires it.
I have also seen how one can write a custom database logger, as explained here:
python logging to database
This looks great, but given that an application generates a large number of logs, I worry that sending that many individual requests could overwhelm the database, and that it may not be the most efficient solution.
Assuming that concern is valid, what are some efficient ways of achieving this?
Some ideas that come to mind are:
Write the logs to a log file at run time and develop a script that parses the file and makes bulk inserts into a database.
Build some kind of queue architecture that the logs are routed to, where each record is inserted into the database in sequence.
Develop a type of reactive program that runs in the background and routes logs to the database.
etc.
What are some other possibilities that can be explored? Are there any best practices?
The rule of thumb is that DB throughput will be greater if you can batch N row inserts into a single commit, rather than doing N separate commits.
Have your app append to a structured log file, such as a CSV or another easily parsed logfile format. Be sure to .flush() before sleeping for a while, so recent output will be visible to other processes. Consider making a call to .fsync() every now and again if durability following a power failure matters to the app.
Now you have timestamped structured logs that are safely stored in the filesystem. Clearly there are other ways, such as 0mq or Kafka, but the FS is simplest and plays nicely with unit tests. During interactive debugging you can tail -f the file.
Now write a daemon that tail -f's the file and copies new records to the database. Upon reboot it will .seek() to the end, after perhaps copying any trailing lines that are missing from the DB. Use kqueue-style events, or poll every K seconds and then sleep; you can .stat() the file to learn its current length. Beware of partial lines, where the last character in the file is not a newline. Consume all unseen lines, BEGIN a transaction, INSERT each line, COMMIT the DB transaction, and resume the loop.
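A rough sketch of such a daemon, assuming the CSV layout above and SQLite with an events table as the target (both are my assumptions, not part of the question). It polls rather than using kqueue, and it keeps any partial trailing line in a buffer until the rest of the line arrives:

import csv
import io
import sqlite3
import time

def run(path="app_events.csv", db_path="events.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, level TEXT, msg TEXT)")
    buf = ""
    with open(path, "r", newline="") as fh:
        fh.seek(0, io.SEEK_END)               # on restart, start at the current end of file
        while True:
            buf += fh.read()                  # whatever has been appended since last pass
            *complete, buf = buf.split("\n")  # keep a partial trailing line for the next pass
            rows = [r for r in csv.reader(complete) if r]
            if rows:
                with conn:                    # one BEGIN/COMMIT per batch of new lines
                    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
            time.sleep(5)                     # or wake on kqueue/inotify events instead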
When you do log rolling, avoid renaming logs. Prefer log filenames that contain ISO 8601 timestamps; perhaps you settle on daily logs. The writer won't append lines past midnight and will move on to the next filename. The daemon will notice the newly created file and will .close() the old one, with optional deletion of ancient logs more than a week old.
Log writers might choose to prepend a hashed checksum to each message, so the reader can verify it received the whole message intact.
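A small sketch of that framing, using a truncated SHA-256 digest; the exact scheme is only an illustration:

import hashlib

def frame(message):
    digest = hashlib.sha256(message.encode()).hexdigest()[:8]
    return "{0} {1}\n".format(digest, message)

def unframe(line):
    digest, _, message = line.rstrip("\n").partition(" ")
    ok = hashlib.sha256(message.encode()).hexdigest()[:8] == digest
    return message if ok else None   # None signals a truncated or corrupted line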
A durable queue like Kafka certainly holds some attraction, but has more moving pieces. Maybe implement FS logging, with unit tests, and then use what you've already learned about the application when you refactor to employ a more sophisticated message queueing API.
I am writing an embedded application that reads data from a set of sensors and uploads it to a central server. The application is written in Python and runs on a Raspberry Pi unit.
The data needs to be collected every minute; however, the Internet connection is unstable, so I need to buffer the data to non-volatile storage (an SD card, etc.) whenever there is no connection. The buffered data should be uploaded as and when the connection comes back.
Presently, I'm thinking about storing the buffered data in an SQLite database and writing a cron job that continuously reads the data from this database and uploads it.
Is there a Python module that can be used for such a feature?
Is there a Python module that can be used for such a feature?
I'm not aware of any readily available module; however, it should be quite straightforward to build one. Given your requirement:
the Internet connection is unstable, so I need to buffer the data to non-volatile storage (an SD card, etc.) whenever there is no connection. The buffered data should be uploaded as and when the connection comes back.
The algorithm looks something like this (pseudocode):
# buffering module
data = read(sensors)
db.insert(data)

# upload module
# e.g. scheduled every 5 minutes via cron
data = db.read(created > last_successful_upload)
success = upload(data)
if success:
    last_successful_upload = max(data.created)
The key is to separate the buffering and uploading concerns, i.e. when reading data from the sensor, don't attempt to upload immediately; always upload from the scheduled module. This keeps the two modules simple and stable.
There are a few edge cases, however, that you need to handle to make this work reliably:
inserting data while an upload is in progress
SQLite not coping well with being accessed from multiple processes
To solve this, you might want to consider another database, or create multiple SQLite databases or even flat files for each batch of uploads.
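As a hedged sketch of the single-database variant, here is one way the two modules could share a readings table with an uploaded flag; the table layout and the upload() callable are my assumptions, not something from the question:

import sqlite3
import time

DB = "buffer.db"

def store(value):
    # called once a minute by the collector
    with sqlite3.connect(DB, timeout=30) as conn:   # timeout waits out the uploader's lock
        conn.execute("CREATE TABLE IF NOT EXISTS readings"
                     " (id INTEGER PRIMARY KEY, ts REAL, value REAL, uploaded INTEGER DEFAULT 0)")
        conn.execute("INSERT INTO readings (ts, value) VALUES (?, ?)", (time.time(), value))

def upload_pending(upload):
    # run this from cron every few minutes; upload() returns True on success
    with sqlite3.connect(DB, timeout=30) as conn:
        rows = conn.execute("SELECT id, ts, value FROM readings WHERE uploaded = 0").fetchall()
        if rows and upload(rows):
            conn.executemany("UPDATE readings SET uploaded = 1 WHERE id = ?",
                             [(row[0],) for row in rows])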
If you mean a module to work with an SQLite database, check out SQLAlchemy.
If you mean a module which can do what cron does, check out sched, a Python event scheduler.
However, this looks like a perfect place to implement a task queue, using either a dedicated task broker (RabbitMQ, Redis, ZeroMQ, ...) or Python's threads and queues. In general, you want to submit an upload task and have a worker thread pick it up and execute it, while the task broker handles retries and failures. All of this happens asynchronously, without blocking your main app.
UPD: Just to clarify, you don't need the database if you use a task broker, because the task broker stores the tasks for you.
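If you stay inside the process, a worker thread plus a queue.Queue is enough for a first version. A rough sketch; the upload() function is a placeholder you would replace with the real network call:

import queue
import threading
import time

uploads = queue.Queue()

def upload(data):
    # placeholder for the real network call; assumed to raise on failure
    raise NotImplementedError

def worker():
    while True:
        data = uploads.get()
        try:
            upload(data)
        except Exception:
            time.sleep(30)        # back off, then requeue for another attempt
            uploads.put(data)
        finally:
            uploads.task_done()

threading.Thread(target=worker, daemon=True).start()
# the sensor loop just enqueues and never blocks on the network:
# uploads.put(read(sensors))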
This is purely database work. You can create master and slave databases in different locations, and if one is not on the network, it will run with the last synced info.
When the connection comes back, it merges all the data.
Take a look at this answer and search for master and slave databases.
I have a very long list of objects that I would like to load from the DB into memory only once (meaning not once per session). This list WILL change its values and grow over time from user inputs. The reason I need it in memory is that I am doing some complex searches on it and want to give a quick answer back.
My question is: how do I load the list when the server starts and keep it alive across sessions, letting them all READ/WRITE to it?
Would it be better to do a heavy SQL search instead of keeping the list alive in my server?
The answer is that this is a bad idea; you are opening a Pandora's box, especially since you need write access as well. However, all is not lost: you can quite easily use Redis for this task.
Redis is a persistent data store, but at the same time everything is held in memory. If the Redis server runs on the same device as the web server, access is almost instantaneous.
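A tiny sketch of the idea with the redis-py client, assuming the objects can be serialized with json; the key name is made up for illustration:

import json
import redis

r = redis.Redis()

def add_item(item):
    # every session sees the new item immediately
    r.rpush("shared:items", json.dumps(item))

def all_items():
    # pull the whole list back for the complex in-Python searches
    return [json.loads(raw) for raw in r.lrange("shared:items", 0, -1)]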
I'm planning to use Celery to handle sending push notifications and emails triggered by events from my primary server.
These tasks require opening a connection to an external server (GCM, APS, email server, etc). They can be processed one at a time, or handled in bulk with a single connection for much better performance.
Often there will be several instances of these tasks triggered separately in a short period of time. For example, in the space of a minute, there might be several dozen push notifications that need to go out to different users with different messages.
What's the best way of handling this in Celery? It seems like the naïve way is to simply have a different task for each message, but that requires opening a connection for each instance.
I was hoping there would be some sort of task aggregator allowing me to process e.g. 'all outstanding push notification tasks'.
Does such a thing exist? Is there a better way to go about it, for example like appending to an active task group?
Am I missing something?
Robert
I recently discovered and have implemented the celery.contrib.batches module in my project. In my opinion it is a nicer solution than Tommaso's answer, because you don't need an extra layer of storage.
Here is an example straight from the docs:
A click counter that flushes the buffer every 100 messages, or every 10 seconds. Does not do anything with the data, but can easily be modified to store it in a database.
# Flush after 100 messages, or 10 seconds.
from celery.contrib.batches import Batches

@app.task(base=Batches, flush_every=100, flush_interval=10)
def count_click(requests):
    from collections import Counter
    count = Counter(request.kwargs['url'] for request in requests)
    for url, count in count.items():
        print('>>> Clicks: {0} -> {1}'.format(url, count))
Be wary though: it works fine for my usage, but the documentation says it is an "Experimental task class". This might deter some from using a feature with such a volatile description :)
An easy way to accomplish this is to write all the actions a task should take to persistent storage (e.g. a database) and let a periodic job do the actual processing in one batch (with a single connection).
Note: make sure you have some locking in place to prevent the queue from being processed twice!
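One simple way to get that locking, sketched here under the assumption of a POSIX system: take an exclusive file lock for the duration of the periodic run, so overlapping invocations exit early. The lock file path and the process_pending callable are illustrative:

import fcntl

def run_once(process_pending):
    with open("/tmp/batch-job.lock", "w") as lock:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return                # a previous run is still going; skip this one
        process_pending()         # the actual batch insert over a single connection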
There is a nice example of how to do something similar at the kombu level (http://ask.github.com/celery/tutorials/clickcounter.html).
Personally, I like the way Sentry does something like this to batch increments at the DB level (the sentry.buffers module).
I noticed that sqlite3 isn't really capable or reliable when I use it in a multiprocessing environment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together, or is there a reliable way to handle multiple connections to SQLite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.
On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
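In practice, the safer pattern is to give every thread its own connection rather than sharing one with check_same_thread=False. A small sketch of that pattern; the file and table names are illustrative:

import sqlite3
import threading

def worker(rows):
    conn = sqlite3.connect("data.db", timeout=30)   # created and used only in this thread
    with conn:
        conn.executemany("INSERT INTO items (value) VALUES (?)", [(r,) for r in rows])
    conn.close()

with sqlite3.connect("data.db") as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS items (value INTEGER)")

threads = [threading.Thread(target=worker, args=([i, i + 10],)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()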
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, but then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there, I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination, and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)". Perhaps you could try replacing that data fetching with some dummy data to make sure it is really the sqlite3 connection causing your problems. At least in my tested case (running right now in another window) I found that multiple processes were all able to add through their own connections without issues, but your description exactly matches the problem I'm having when two processes step on each other while going for the web API (a very odd error, actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (it could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case; without code it's hard to know what you're facing, but the description makes me wonder whether you might widen your considerations.
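For reference, a stripped-down sketch of the setup described above, with the web call replaced by dummy data so only sqlite3 is being exercised; the file and table names are illustrative:

import multiprocessing
import sqlite3

DB = "results.db"

def work(task_id):
    payload = "dummy data %d" % task_id           # stand-in for the web/API call
    conn = sqlite3.connect(DB, timeout=30)        # each worker process opens its own connection
    with conn:
        conn.execute("INSERT INTO results (task_id, payload) VALUES (?, ?)",
                     (task_id, payload))
    conn.close()

if __name__ == "__main__":
    with sqlite3.connect(DB) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS results (task_id INTEGER, payload TEXT)")
    with multiprocessing.Pool(8) as pool:
        pool.map(work, range(100))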
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
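A tiny usage sketch, assuming the sqlitedict package is installed; the file and key names are illustrative:

from sqlitedict import SqliteDict

with SqliteDict("state.sqlite", autocommit=True) as store:
    store["last_seen"] = {"sensor": 7, "value": 3.2}   # values are pickled transparently
    print(store.get("last_seen"))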
If I had to build a system like the one you describe using SQLite, then I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then I would write the other processes to use that server. When only one process accesses the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At a minimum this means allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if-then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app closes the database after every transaction, then you don't need to worry about sessions.
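Not asynchat, but the same single-writer idea can be sketched with a multiprocessing queue: one process owns the SQLite file and each submitted item is a list of SQL statements executed as one transaction. The names and schema here are illustrative:

import multiprocessing
import sqlite3

def db_writer(q, db_path="app.db"):
    conn = sqlite3.connect(db_path)
    while True:
        statements = q.get()        # e.g. [("INSERT INTO t VALUES (?)", (1,)), ...]
        if statements is None:
            break                   # sentinel: shut down cleanly
        with conn:                  # one transaction per submitted list of statements
            for sql, params in statements:
                conn.execute(sql, params)
    conn.close()

# every other process just does something like:
# q.put([("UPDATE t SET x = ? WHERE id = ?", (5, 1))])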
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/