Data buffering/storage - Python

I am writing an embedded application that reads data from a set of sensors and uploads it to a central server. The application is written in Python and runs on a Raspberry Pi.
The data needs to be collected every minute. However, the Internet connection is unstable, so I need to buffer the data to non-volatile storage (e.g. the SD card) whenever there is no connection. The buffered data should be uploaded as soon as the connection comes back.
Presently, I'm thinking about storing the buffered data in a SQLite database and writing a cron job that repeatedly reads the data from this database and uploads it.
Is there a Python module that can be used for such a feature?

Is there a python module that can be used for such feature?
I'm not aware of any readily available module; however, it should be quite straightforward to build one. Given your requirement:
the Internet connection is unstable and I need to buffer the data to a non volatile storage (SD-card) etc. whenever there is no connection. The buffered data should be uploaded as and when the connection comes back.
The algorithm looks something like this (pseudo code):
# buffering module
data = read(sensors)
db.insert(data)
# upload module
# e.g. scheduled every 5 minutes via cron
data = db.read(created > last_successful_upload)
success = upload(data)
if success:
    last_successful_upload = max(data.created)
The key is to separate the buffering and uploading concerns. That is, when reading data from the sensors, don't attempt to upload immediately; always upload from the scheduled module. This keeps the two modules simple and stable.
There are a few edge cases however that you need to concern yourself with to make this work reliably:
insert data while uploading is in progress
SQLite doesn't support being accessed from multiple processes well
To solve this, you might want to consider another database, or create multiple SQLite databases or even flat files for each batch of uploads.
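As a rough illustration of the pseudocode above, here is a minimal sketch using the standard-library sqlite3 module. The table layout, the function names and the upload callable are all assumptions made for the example. Note one deliberate difference from the pseudocode: instead of tracking last_successful_upload, it deletes rows up to the highest id that was uploaded, so rows inserted while an upload is in progress are left alone for the next run.

```python
import sqlite3
import time


def open_buffer(path):
    """Open (or create) the on-disk buffer. The table layout is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "created REAL NOT NULL, "
        "payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def buffer_reading(conn, payload):
    """Buffering module: always write locally, never upload from here."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO readings (created, payload) VALUES (?, ?)",
            (time.time(), payload),
        )


def upload_pending(conn, upload):
    """Upload module: send buffered rows and delete them only after success.

    `upload` is a hypothetical callable that takes the list of rows and
    returns True when the server accepted them.
    """
    rows = conn.execute(
        "SELECT id, created, payload FROM readings ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    if not upload(rows):
        return 0  # keep everything for the next scheduled run
    with conn:
        # Only rows that were actually sent are removed.
        conn.execute("DELETE FROM readings WHERE id <= ?", (rows[-1][0],))
    return len(rows)
```

The sensor loop would call buffer_reading once a minute, and the cron job would call upload_pending with whatever function talks to your server.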

If you mean a module to work with SQLite database, check out SQLAlchemy.
If you mean a module which can do what cron does, check out sched, a python event scheduler.
However, this looks like a perfect place to implement a task queue, using either a dedicated task broker (RabbitMQ, Redis, ZeroMQ, ...) or Python's threads and queues. In general, you submit an upload task and a worker thread picks it up and executes it, while the task broker handles retries and failures. All this happens asynchronously, without blocking your main app.
UPD: Just to clarify, you don't need the database if you use a task broker, because a task broker stores the tasks for you.
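For the "threads and queues" variant, a minimal sketch with only the standard library could look like the following. The upload callable, the function name and the retry count are assumptions for illustration; a real broker would do the retry bookkeeping for you.

```python
import queue
import threading


def start_upload_worker(upload, max_attempts=3):
    """Start a daemon thread that uploads queued items, retrying on failure.

    `upload` is a hypothetical callable that raises an exception on failure.
    """
    tasks = queue.Queue()
    failed = []  # items that exhausted all their attempts

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: stop the worker
                tasks.task_done()
                break
            for _attempt in range(max_attempts):
                try:
                    upload(item)
                    break  # uploaded, stop retrying
                except Exception:
                    continue  # try again
            else:
                failed.append(item)  # give up after max_attempts
            tasks.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return tasks, failed, thread
```

The main app just calls tasks.put(reading) and carries on. Keep in mind that queue.Queue lives in memory, so unlike a broker with persistence it loses pending items if the process or the device goes down.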

This is purely database work. You can set up master and slave databases in different locations; if one of them is off the network, it keeps running with the last synced data.
When the connection comes back, merge all the data.
Take a look at this answer and search for master and slave databases.

Related

Update single database value on a website with many users

For this question, I'm particularly struggling with how to structure this:
User accesses website
User clicks button
Value x in database increments
My issue is that multiple people could potentially be on the website at the same time and click the button - I want to make sure each user is able to click the button, and update the value and read the incremented value too, but I don't know how to circumvent any synchronisation/concurrency issues.
I'm using flask to run my website backend, and I'm thinking of using MongoDB or Redis to store my single value that needs to be updated.
Please comment if there is any lack of clarity in my question, but this is a problem I've really been struggling with how to solve.
Thanks :)

Redis: I think you can use the Redis HINCRBY command, or create a distributed lock so that there is only one writer at a time and only the writer holding the lock can make the update from your Flask app. Make sure the lock is released after a certain period of time or once the writer is done with it.
MySQL: you can start a transaction, make the update, and commit the change to make sure the data is right.
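To show what the transactional variant looks like, here is a small sketch that uses the standard-library sqlite3 module as a stand-in for MySQL (an assumption made so the example runs without a server). The counters table and the function names are made up. The important part is that the increment is done by the database in a single UPDATE inside a transaction, not by reading the value into Python and writing it back.

```python
import sqlite3


def init_counter(path, name="clicks"):
    """Create the (hypothetical) counters table with one row set to 0."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS counters ("
            "name TEXT PRIMARY KEY, value INTEGER NOT NULL)"
        )
        conn.execute(
            "INSERT OR IGNORE INTO counters (name, value) VALUES (?, 0)",
            (name,),
        )
    conn.close()


def increment(path, name="clicks"):
    """Atomically add 1 to the counter and return the new value."""
    # timeout makes a writer wait for a competing writer instead of failing
    conn = sqlite3.connect(path, timeout=30)
    try:
        with conn:  # one transaction: UPDATE, read back, commit
            conn.execute(
                "UPDATE counters SET value = value + 1 WHERE name = ?",
                (name,),
            )
            row = conn.execute(
                "SELECT value FROM counters WHERE name = ?", (name,)
            ).fetchone()
        return row[0]
    finally:
        conn.close()
```

Each request opens its own connection here, which mirrors several web workers hitting the same database at once.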

To solve this problem I would suggest you follow a micro service architecture.
A service called worker would handle the flask route that's called when the user clicks on the link/button on the website. It would generate a message to be sent to another service called queue manager that maintains a queue of increment/decrement messages from the worker service.
There can be multiple worker service instances running concurrently but the queue manager is a singleton service that takes the messages from each service and adds them to the queue. If the queue manager is busy the worker service will either timeout and retry or return a failure message to the user. If the queue is full a response is sent back to the worker to retry n number of times, and you can count down that n.
A third service called storage manager is run every time the queue is not empty, this service sends the messages to the storage solution (whatever mongo, redis, good ol' sql) and it will ensure the increment/decrement messages are handled in the order they were received in the queue. You could also include a time stamp from the worker service in the message if you wanted to use that to sort the queue.
Generally whatever hosting environment for flask will use gunicorn as the production web server and support multiple concurrent worker instances to handle the http requests, and this would naturally be your worker service.
How you build and coordinate the queue manager and storage manager is down to implementation preference, for instance you could use something like Google Cloud pub/sub system to send messages between different deployed services but that's just off the top of my head. There's a load of different ways to do it, and you're in the best position to decide that.
Without knowing more details about what you're trying to achieve and what's the requirements for concurrent traffic I can't go into greater detail, but that's roughly how I've approached this type of problem in the past. If you need to handle more concurrent users at the website, you can pick a hosting solution with more concurrent workers. If you need the queue to be longer, you can pick a host with more memory, or else write the queue to an intermediate storage. This will slow it down but will make recovering from a crash easier.
You also need to consider handling when messages fail between different services, how to recover from a service crashing or the queue filling up.
EDIT: Been thinking about this over the weekend and a much simpler solution is to just create a new record in a table directly from the flask route that handles user clicks. Then to get your total you just get a count from this table. Your bottlenecks are going to be how many concurrent workers your flask hosting environment supports and how many concurrent connections your storage supports. Both of these can be solved by throwing more resources at them.
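The simpler approach from the EDIT could be sketched as below, with sqlite3 standing in for whatever storage you pick. The clicks table, its columns and the function names are assumptions for illustration; in a Flask app, record_click would be called from the route that handles the button.

```python
import sqlite3


def open_clicks(path):
    """Open the database and create the (hypothetical) clicks table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clicks ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "user_id TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def record_click(conn, user_id):
    """Insert one row per click; nothing is ever updated in place."""
    with conn:
        conn.execute("INSERT INTO clicks (user_id) VALUES (?)", (user_id,))


def total_clicks(conn):
    """The 'value x' is simply the number of recorded clicks."""
    return conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
```

Because every click is an independent insert, there is no shared value for two users to overwrite.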

Best Way to Handle user triggered task (like import data) in Django

I need your opinion on a challenge that I'm facing. I'm building a website that uses Django as the backend, PostgreSQL as the database, GraphQL as the API layer and React as the frontend framework. The website is hosted on Heroku. I wrote a Python script that logs in to my Gmail account, parses a few emails based on predefined conditions, and stores the parsed data in a Google Sheet. Now I want the script to be part of my website, where the user specifies what exactly needs to be parsed (i.e. filters) and the parsed data is then displayed in a table so the accuracy of the parsing can be reviewed.
The part that I need some help with is how to architect such a workflow. Below are a few ideas that I managed to come up with after some googling:
Generate a GraphQL mutation that stores a 'task' in a task model. Once a new task entry is stored, a Django signal will trigger the script. I'm not sure yet whether a signal can run custom Python functions, but from what I have read so far, it seems doable.
Use Celery to run this task asynchronously. But I'm not sure if asynchronous tasks are what I'm after here, as I need this task to run immediately after the user triggers the feature from the frontend. I might be wrong here, though. I'm also not sure if I need Redis to store the task details or if I can do that in PostgreSQL.
What is the best practice for implementing this feature? The task can be anything, not necessarily parsing emails; it could also be importing data from Excel. Any task that is user-generated rather than a scheduled or repeated task.
I'm sorry in advance if this question seems trivial to some of you. I'm not a professional developer and the above project is a way for me to sharpen my technical skills and learn new techniques.
Looking forward to learning from your experiences.

You can dissect your problem into the following steps:
User specifies task parameters
System executes task
System displays result to the User
You can either do all of these:
Sequentially and synchronously in one swoop; or
Step by step asynchronously.
Synchronously
You can run your script when generating a response, but it will come with the following downsides:
The process in the server processing your request will block until the script is finished. This may or may not affect the processing of other requests by that same server (this will depend on the number of simultaneous requests being processed, workload of the script, etc.)
The client (e.g. your browser) and even the server might time out if the script takes too long. You can fix this to some extent by configuring your server appropriately.
The beauty of this approach, however, is its simplicity. To do this, you just pass the parameters through the request; the server parses them, runs the script, and returns the result.
No setting up of a message queue, task scheduler, or whatever needed.
Asynchronously
Ideally though, for long-running tasks, it is best to have this executed outside of the usual request-response loop for the following advantages:
The server responding to the requests can actually serve other requests.
Some scripts can take a while, some you don't even know if it's going to finish
Script is no longer dependent on the reliability of the network (imagine running an expensive task, then your internet connection skips or is just plain intermittent; you won't be able to do anything)
The downside of this is now you have to set more things up, which increases the project's complexity and points of failure.
Producer-Consumer
Whatever you choose, it's usually best to follow the producer-consumer pattern:
Producer creates tasks and puts them in a queue
Consumer takes a task from the queue and executes it
The producer is basically you, the user. You specify the task and the parameters involved in that task.
This queue could be any datastore: an in-memory datastore like Redis; a message queue like RabbitMQ; or a relational database management system like PostgreSQL.
The consumer is your script executing these tasks. There are multiple ways of running the consumer/script: via Celery, as you mentioned, which runs multiple workers to execute the tasks passed through the queue; via a simple time-based job scheduler like crontab; or even by manually triggering the script.
The question is actually not trivial, as the solution depends on what task you are actually trying to do. It is best to evaluate the constraints, parameters, and actual tasks to decide which approach you will choose.
But just to give you a more relevant guideline:
Keep it simple: unless you have a compelling reason to do otherwise (e.g. the server is being bogged down, or the internet connection is not reliable in practice), there's really no reason to be fancy.
The more blocking the task is, the longer it takes, or the more it depends on third-party APIs over the network, the more it makes sense to push it to a background process to add reliability and resiliency.
In your email import script, I'll most likely push that to the background:
Have a page where you can add a task to the database
In the task details page, display the task details, and the result below if it exists or "Processing..." otherwise
Have a script that executes tasks (import emails from gmail given the task parameters) and save the results to the database
Schedule this script to run every few minutes via crontab
Yes, the above has side effects, such as crontab running several instances of the script at the same time, but I won't go into detail without knowing more about the specifics of the task.
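The four steps above could be sketched roughly as follows. This uses the standard-library sqlite3 module instead of Django models and PostgreSQL so that it runs on its own; the tasks table, the status values and all function names are assumptions for illustration. The "claim" step (flipping a row from pending to running and checking that exactly one row changed) is one way to reduce the overlapping-crontab problem.

```python
import json
import sqlite3


def open_tasks(path):
    """Open the database and create the (hypothetical) tasks table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, "
        "params TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending', "
        "result TEXT)"
    )
    conn.commit()
    return conn


def add_task(conn, params):
    """Producer: the 'add a task' page stores the parameters."""
    with conn:
        cur = conn.execute(
            "INSERT INTO tasks (params) VALUES (?)", (json.dumps(params),)
        )
    return cur.lastrowid


def run_pending(conn, handler):
    """Consumer: the script that crontab runs every few minutes."""
    done = 0
    pending = conn.execute(
        "SELECT id, params FROM tasks WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for task_id, params in pending:
        with conn:
            # Claim the task; a second runner sees rowcount 0 and skips it.
            claimed = conn.execute(
                "UPDATE tasks SET status = 'running' "
                "WHERE id = ? AND status = 'pending'",
                (task_id,),
            ).rowcount
        if not claimed:
            continue
        result = handler(json.loads(params))  # e.g. the email import
        with conn:
            conn.execute(
                "UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
                (json.dumps(result), task_id),
            )
        done += 1
    return done


def task_view(conn, task_id):
    """Task details page: show the result, or 'Processing...' until done."""
    status, result = conn.execute(
        "SELECT status, result FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    return json.loads(result) if status == "done" else "Processing..."
```

A task that crashes mid-run stays in the running state in this sketch, so a real version would also need a way to detect and reset stale tasks.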

Persistent MySQL connection in Python for social media harvesting

I am using Python to stream large amounts of Twitter data into a MySQL database. I anticipate my job running over a period of several weeks. I have code that interacts with the twitter API and gives me an iterator that yields lists, each list corresponding to a database row. What I need is a means of maintaining a persistent database connection for several weeks. Right now I find myself having to restart my script repeatedly when my connection is lost, sometimes as a result of MySQL being restarted.
Does it make the most sense to use the mysqldb library, catch exceptions and reconnect when necessary? Or is there an already made solution as part of sqlalchemy or another package? Any ideas appreciated!

I think the right answer is to try to handle the connection errors; it sounds like you'd only be pulling in a much larger library just for this feature, while trying and catching is probably how it's done anyway, at whatever level of the stack. If necessary, you could multithread these things, since they're probably IO-bound (i.e. suitable for Python GIL threading as opposed to multiprocessing), and decouple production from consumption with a queue, too, which might take some of the load off the database connection.
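A minimal sketch of the catch-and-reconnect idea is below. It only relies on the generic DB-API calls (cursor(), execute(), commit()); the class name, the connect callable and the retry settings are assumptions for illustration. With MySQLdb you would probably pass something like a function wrapping MySQLdb.connect(...) and MySQLdb.OperationalError as the retriable exception, but check which errors your driver actually raises when the server goes away.

```python
import time


class ReconnectingWriter:
    """Run statements on a DB-API connection, reconnecting when it drops."""

    def __init__(self, connect, retriable=(Exception,), max_attempts=5, delay=1.0):
        self._connect = connect          # hypothetical zero-argument factory
        self._retriable = retriable      # exception types that mean "reconnect"
        self._max_attempts = max_attempts
        self._delay = delay              # seconds to wait between attempts
        self._conn = None

    def execute(self, sql, params=()):
        for attempt in range(1, self._max_attempts + 1):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                cursor = self._conn.cursor()
                cursor.execute(sql, params)
                self._conn.commit()
                return
            except self._retriable:
                self._conn = None        # drop the broken connection
                if attempt == self._max_attempts:
                    raise                # out of attempts, let the caller decide
                time.sleep(self._delay)
```

The harvesting loop then calls writer.execute(insert_sql, row) for every row the iterator yields, and a MySQL restart turns into a short pause instead of a crashed script.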

Python: file-based thread-safe queue

I am creating an application (app A) in Python that listens on a port, receives NetFlow records, encapsulates them and securely sends them to another application (app B). App A also checks whether each record was sent successfully. If not, the record has to be saved; app A waits a few seconds and then tries to send it again, and so on. This is the important part: if sending fails, records must be stored, but meanwhile many more records can arrive and they need to be stored too. The ideal way to do that is a queue. However, I need this queue to be in a file (on disk). I found, for example, this code http://code.activestate.com/recipes/576642/ but it "On open, loads full file into memory", and that's exactly what I want to avoid. I must assume that this file of records can grow to a couple of GB.
So my question is, what would you recommend to store these records in? It needs to handle a lot of data, on the other hand it would be great if it wasn't too slow because during normal activity only one record is saved at a time and it's read and removed immediately. So the basic state is an empty queue. And it should be thread safe.
Should I use a database (dbm, sqlite3..) or something like pickle, shelve or something else?
I am a little confused about this... thank you.

You can use Redis as a database for this. It is very, very fast, does queuing amazingly well, and can save its state to disk in a few ways, depending on the level of fault tolerance you want. Since it is an external process, you might not need a very strict saving policy: if your program crashes, everything is still held externally.
See http://redis.io/documentation , and if you want more detailed info on how to do this in Redis, I'd be glad to elaborate.

SQLite3 and Multiprocessing

I noticed that sqlite3 isn't really capable or reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so a connection is used by multiple threads. I tried the check_same_thread=False option, but the number of insertions is pretty random: sometimes it includes everything, sometimes not. Should I parallelize only parts of the function (fetching data from the web), collect their outputs in a list, and insert them into the table all together, or is there a reliable way to handle multiple connections with SQLite?

First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.
On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
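One way to follow both caveats is to give every thread its own connection, for example via threading.local. The sketch below assumes a table named items and uses made-up function names; the timeout value is also just a guess at something generous enough for short writes.

```python
import sqlite3
import threading

_local = threading.local()  # holds a separate set of connections per thread


def get_conn(path):
    """Return this thread's own connection to `path`, creating it on first use."""
    if not hasattr(_local, "conns"):
        _local.conns = {}
    conn = _local.conns.get(path)
    if conn is None:
        # timeout: wait for a competing writer instead of failing right away
        conn = sqlite3.connect(path, timeout=30)
        _local.conns[path] = conn
    return conn


def insert_value(path, value):
    """Insert one row using the calling thread's connection."""
    conn = get_conn(path)
    with conn:  # commit on success, roll back on error
        conn.execute("INSERT INTO items (value) VALUES (?)", (value,))
```

No connection object ever crosses a thread boundary here, so check_same_thread=False is not needed.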

I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)" - perhaps you could try replacing that data fetching with some dummy data to ensure that it is really the sqlite3 connection causing you problems. At least in my tested case (running right now in another window) I found that multiple processes were able to all add through their own connection without issues but your description exactly matches the problem I'm having when two processes step on each other while going for the web API (very odd error actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.

sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.

If I had to build a system like the one you describe using SQLite, I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if-then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/
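The single-owner idea can be illustrated inside one process with a thread and a queue. This is not the asynchat server described above (asynchat was removed from the standard library in Python 3.12), just a sketch of the same principle: only one thread ever touches the connection, and callers submit whole transactions as lists of statements. The class and method names are made up.

```python
import queue
import sqlite3
import threading


class DbServer:
    """One thread owns the SQLite connection; callers submit statement batches."""

    def __init__(self, path):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, args=(path,), daemon=True)
        self._thread.start()

    def _run(self, path):
        conn = sqlite3.connect(path)  # created and used in this thread only
        while True:
            request = self._requests.get()
            if request is None:  # sentinel sent by close()
                break
            statements, reply = request
            try:
                with conn:  # the whole batch commits or rolls back together
                    rows = []
                    for sql, params in statements:
                        rows = conn.execute(sql, params).fetchall()
                reply.put(("ok", rows))
            except sqlite3.Error as exc:
                reply.put(("error", exc))
        conn.close()

    def transaction(self, statements):
        """Run a list of (sql, params) pairs; return the last statement's rows."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((list(statements), reply))
        status, value = reply.get()
        if status == "error":
            raise value
        return value

    def close(self):
        self._requests.put(None)
        self._thread.join()
```

The if-then case can often be folded into the statement itself, e.g. UPDATE t SET field = ? WHERE id = ? AND field = ?, so the check and the write happen in the same transaction.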
