Buffer for continuously generated CSV files to upload to MongoDB - Python

I'm trying to figure out a way, in my Flask application, to store the CSVs that are continuously produced by each thread in a buffer before uploading them to a MongoDB database. The reason I would like to use a buffer is to guarantee some level of persistence and proper error handling (in case of a network failure, I want to retry uploading the CSV to Mongo).
I thought about using a task queue such as Celery with a message broker (RabbitMQ), but wasn't sure if that was the right way to go. Sorry if this isn't a question suitable for SO -- I just wanted clarification on how to go about doing this. Thank you in advance.

Sounds like you want something like the Linux tail command. tail prints each line of a file as soon as it is appended. I'm assuming this CSV file is generated by a separate program that is running at the same time. See How can I tail a log file in Python? for how to implement tail in Python.
Note: you might be better off dumping the CSVs in batches. It won't be real-time, but if that's not important it will be more efficient.
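To make that concrete, here is a minimal sketch that follows the CSV file tail-style and inserts the buffered rows into MongoDB with pymongo in batches. The file path, connection string, database/collection names, and batch size are all assumptions for illustration, and the error handling is deliberately simplistic.

```python
import csv
import time

from pymongo import MongoClient
from pymongo.errors import PyMongoError

CSV_PATH = "incoming.csv"        # assumed path of the continuously written CSV
BATCH_SIZE = 100                 # flush to Mongo once this many rows are buffered

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
collection = client["mydb"]["csv_rows"]            # assumed db/collection names


def follow(path):
    """Yield new lines appended to the file, tail -f style."""
    with open(path, "r", newline="") as f:
        f.seek(0, 2)                 # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)      # nothing new yet
                continue
            yield line


buffer = []                          # in-memory buffer of parsed rows
for row in csv.reader(follow(CSV_PATH)):
    buffer.append({"fields": row})
    if len(buffer) >= BATCH_SIZE:
        try:
            collection.insert_many(buffer)
            buffer.clear()           # only drop rows once Mongo confirmed the write
        except PyMongoError:
            time.sleep(5)            # network hiccup: keep the buffer and retry later
                                     # (a partial insert could cause duplicates on retry)
```

Keeping the buffer until insert_many succeeds is what gives you the retry-on-network-failure behaviour you describe; Celery with RabbitMQ would give a similar guarantee with more machinery.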

Related

Simplest way to retrieve a Job's results?

I have a Python program that launches a batch job. The job outputs a JSON file, and I'd like to know the easiest way to get this result back to the Python program that launched it.
So far I thought of these solutions:
Upload the json file to S3 (pretty heavy)
Display it in the pod logs then read the logs from the python program (pretty hacky/dirty)
Mount a PVC, launch a second pod with the same PVC, and create a shared disk between this pod and the job (pretty overkill)
The JSON file is pretty lightweight. Isn't there a solution to do something like adding some metadata to the pod when the job completes? The Python program could then just poll that metadata.
An easy way, not involving any other databases or pods, is to run the first pod as an init container, mount a volume that is shared by both containers, and use the JSON file in the next Python program. (Also, this approach does not need a persistent volume, just a shared one.) See this example:
https://kubernetes.io/docs/tasks/access-application-cluster/communicate-containers-same-pod-shared-volume/
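On the Python side of that shared-volume handoff, the main container's code can be as simple as the sketch below; /shared/result.json is a hypothetical mount path chosen for illustration, not anything mandated by Kubernetes.

```python
import json
import pathlib

# Hypothetical path of the shared (non-persistent) volume mounted in both containers.
RESULT_PATH = pathlib.Path("/shared/result.json")


def load_job_result():
    """Read the JSON file the init container (the batch job) left on the shared volume."""
    with RESULT_PATH.open() as f:
        return json.load(f)


if __name__ == "__main__":
    result = load_job_result()
    print("job produced:", result)
```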
Also, depending on the complexity of these jobs, I would recommend taking a look at Argo Workflows or any DAG-oriented job scheduler.

Routing Python Logs to Databases Efficiently

I want to route some logs from my application to a database. Now, I know that this isn't exactly the ideal way to store logs, but my use case requires it.
I have also seen how one can write their own database logger as explained here,
python logging to database
This looks great, but given that an application generates a large number of logs, I feel like sending that many individual requests to the database could overwhelm it, and it may not be the most efficient solution.
Assuming that concern is valid, what are some efficient methods for achieving this?
Some ideas that come to mind are,
Write the logs out to a log file during application run time and develop a script that will parse the file and make bulk inserts to a database.
Build some kind of queue architecture that the logs will be routed to, where each record will be inserted to the database in sequence.
Develop a type of reactive program, that will run in the background and route logs to the database.
etc.
What are some other possibilities that can be explored? Are there any best practices?
The rule of thumb is that DB throughput will be greater if you can batch N row inserts into a single commit, rather than doing N separate commits.

Have your app append to a structured log file, such as a CSV or another easily parsed logfile format. Be sure to .flush() before sleeping for a while, so recent output is visible to other processes, and consider calling .fsync() every now and again if durability following a power failure matters to the app.
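As a rough illustration of that writer side (the log file name, field layout, and fsync cadence are assumptions):

```python
import csv
import os
from datetime import datetime, timezone

LOG_PATH = "app_events.csv"   # assumed log file name


def log_rows(rows, fsync_every=100):
    """Append structured rows to the CSV log, flushing so other
    processes (e.g. the DB-copying daemon) can see them promptly."""
    with open(LOG_PATH, "a", newline="") as f:
        writer = csv.writer(f)
        for i, row in enumerate(rows, 1):
            writer.writerow([datetime.now(timezone.utc).isoformat(), *row])
            f.flush()                     # make the line visible to readers
            if i % fsync_every == 0:
                os.fsync(f.fileno())      # occasionally force it to disk for durability
```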
Now you have timestamped structured logs that are safely stored in the filesystem. Clearly there are other ways, such as 0mq or Kafka, but the filesystem is simplest and plays nicely with unit tests. During interactive debugging you can tail -f the file.

Now write a daemon that tail -f's the file and copies new records to the database. Upon reboot it will .seek() to the end, after perhaps copying any trailing lines that are missing from the DB. Use kqueue-style events, or poll every K seconds and then sleep; you can .stat() the file to learn its current length. Beware of partial lines, where the last character in the file is not a newline. Consume all unseen lines, BEGIN a transaction, INSERT each line, COMMIT the transaction, and resume the loop.
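A minimal sketch of such a daemon, using sqlite3 purely as a stand-in target database; the table name, schema, and polling interval are made up for the example, and a real daemon would persist its file offset rather than starting from the end:

```python
import csv
import sqlite3
import time

LOG_PATH = "app_events.csv"           # same file the writer appends to
POLL_SECONDS = 5                      # assumed polling interval

db = sqlite3.connect("logs.db")
db.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT, message TEXT)")

with open(LOG_PATH, "r", newline="") as f:
    f.seek(0, 2)                      # start at end; a real daemon would resume from a stored offset
    buffer = ""
    while True:
        chunk = f.read()
        if not chunk:
            time.sleep(POLL_SECONDS)
            continue
        buffer += chunk
        lines, sep, buffer = buffer.rpartition("\n")   # keep any partial trailing line buffered
        if not sep:
            continue
        rows = list(csv.reader(lines.splitlines()))
        with db:                      # one transaction per batch: BEGIN ... COMMIT
            db.executemany("INSERT INTO logs (ts, message) VALUES (?, ?)",
                           [(r[0], ",".join(r[1:])) for r in rows if r])
```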
When you do log rolling, avoid renaming logs; prefer log filenames that contain ISO 8601 timestamps. Perhaps you settle on daily logs: the writer won't append lines past midnight and will move on to the next filename, while the daemon notices the newly created file and .close()s the old one, optionally deleting ancient logs more than a week old.

Log writers might choose to prepend a hashed checksum to each message, so the reader can verify it received the whole message intact.
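The checksum idea could look something like this (the framing format is purely illustrative):

```python
import hashlib


def frame(message: str) -> str:
    """Prefix a message with a short SHA-256 digest of its contents."""
    digest = hashlib.sha256(message.encode("utf-8")).hexdigest()[:12]
    return f"{digest} {message}"


def unframe(line: str) -> str:
    """Verify the checksum written by frame(); raise if the line was truncated or mangled."""
    digest, _, message = line.partition(" ")
    if hashlib.sha256(message.encode("utf-8")).hexdigest()[:12] != digest:
        raise ValueError("corrupt or partial log line")
    return message
```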
A durable queue like Kafka certainly holds some attraction, but it has more moving pieces. Maybe implement FS logging, with unit tests, and then use what you've already learned about the application when you refactor to employ a more sophisticated message queueing API.

AWS S3 continuous byte stream upload?

I am doing some batch processing and occasionally end up with a corrupt line of (string) data. I would like to upload these to an S3 file.
Now, I would really want to add all the lines to a single file and upload it in one go after my script finishes executing, but my client asked me to use a socket connection instead and add each line one by one as they come up, simulating a single slow upload.
It sounds like he's done this before, but I couldn't find any reference for anything like it (not talking about multi-part uploads). Has anyone done something like this before?

Is the logging module's thread safety thread safe over the network as well?

We are dumping logs into a single file, with timestamps, from multiple computers (500), and each entry is less than 4 KB. Python's logging module apparently handles the locking and guarantees thread safety: https://docs.python.org/2/library/logging.html#thread-safety
Thanks #user2357112 for your feedback; here is some more information:
Are you using some sort of network-attached storage?
The storage is a network disk, which shares a writable logfile.txt that can be read/written.
Are the computers logging things locally and synchronizing their log files somehow?
The computers are not logging locally; once they are finished, they use the Logger to write to the end of the shared logfile.txt.
How are log records from different computers ending up in the same file?
All of the computers are appending to logfile.txt
So what can go wrong when writing to a single file? Or is it safe to use?
At best, the driver handling file access will lock the file exclusively to one process/user. At worst, just because you're appending doesn't mean the writes will be sequential; for example, you could end up with lines from different computers mangled together.
Perhaps a better/safer approach is just something like /myNas/logs/MMDDYYHH/workerPID.log, and then have a daily cleaner script merge all of these into a master.log file. In an intermediate processing step, you could read each log, load it into a :memory: sqlite database, sort entries by date and time, and dump the result into a consolidated master.log; a sketch of that merge step follows this answer.
Alternatively, if real-time log monitoring is necessary, I believe Windows has equivalent tools, like watch, which can follow each file as it is written to disk.
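A rough sketch of that daily merge step, with the directory layout and the assumption that each entry begins with a sortable timestamp:

```python
import glob
import sqlite3

LOG_GLOB = "/myNas/logs/*/worker*.log"   # per-worker logs, as suggested above
MASTER = "/myNas/logs/master.log"        # consolidated output

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (ts TEXT, line TEXT)")

for path in glob.glob(LOG_GLOB):
    with open(path) as f:
        for line in f:
            # assume each entry starts with a sortable (ISO-style) timestamp, then a space
            ts, _, rest = line.rstrip("\n").partition(" ")
            db.execute("INSERT INTO entries VALUES (?, ?)", (ts, rest))

with open(MASTER, "a") as out:
    for ts, rest in db.execute("SELECT ts, line FROM entries ORDER BY ts"):
        out.write(f"{ts} {rest}\n")
```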

Python: file-based thread-safe queue

I am creating an application (app A) in Python that listens on a port, receives NetFlow records, encapsulates them, and securely sends them to another application (app B). App A also checks whether the record was successfully sent. If not, it has to be saved. App A waits a few seconds and then tries to send it again, and so on. This is the important part: if the sending was unsuccessful, records must be stored, but meanwhile many more records can arrive and they need to be stored too. The ideal way to do that is a queue. However, I need this queue to live in a file (on disk). I found, for example, this code http://code.activestate.com/recipes/576642/, but it loads the full file into memory on open, and that's exactly what I want to avoid. I must assume that this file with records can grow to a couple of GBs.
So my question is, what would you recommend to store these records in? It needs to handle a lot of data, on the other hand it would be great if it wasn't too slow because during normal activity only one record is saved at a time and it's read and removed immediately. So the basic state is an empty queue. And it should be thread safe.
Should I use a database (dbm, sqlite3..) or something like pickle, shelve or something else?
I am a little confused about this... thank you.
You can use Redis as a database for this. It is very, very fast, does queuing amazingly well, and it can save its state to disk in a few different ways, depending on the fault-tolerance level you want. Being an external process, it might not need a very strict saving policy, since if your program crashes everything is still stored externally.
See http://redis.io/documentation, and if you want more detailed info on how to do this in Redis, I'd be glad to elaborate. A minimal queue sketch follows.
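For instance, a sketch of a disk-backed queue using the redis-py client; the key name, connection settings, and retry delay are assumptions:

```python
import time

import redis

r = redis.Redis(host="localhost", port=6379)   # assumed local Redis instance
QUEUE_KEY = "netflow:records"                   # hypothetical queue name


def enqueue(record: bytes) -> None:
    """Push a pending/failed record onto the tail of the queue (persisted by Redis)."""
    r.rpush(QUEUE_KEY, record)


def drain(send):
    """Pop records one at a time and try to send them; re-queue on failure."""
    while True:
        item = r.blpop(QUEUE_KEY, timeout=1)    # blocks up to 1s; returns (key, value) or None
        if item is None:
            continue
        _, record = item
        try:
            send(record)                         # your "app B" sender goes here
        except Exception:
            r.lpush(QUEUE_KEY, record)           # put it back at the head and wait a bit
            time.sleep(5)
```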
