System Design For Asynchronously Processing Large XML Files - python

We are running a web application on GCP using Django. One of the features involves processing an XML file provided by the user and saving its data into the PostgreSQL database. We are currently facing two issues with this setup:
The task is NOT asynchronous, which means the user has to wait until the app has finished processing and saving the data.
The application times out when processing and saving large XML files.
What is a good system architecture we can implement here to solve these issues?
NOTE: We already have Redis + Celery set up, but we are NOT using them for this particular task.
Theoretically, the answer might be as simple as: use a queue. But the key questions I have are:
How do I pass the file from the user’s local computer 🖥 to the queue? Should I first upload it to Google Cloud Storage? (If I were to upload it to the storage, I guess I would end up facing the same two issues while uploading the file)
Should I further break down the file into smaller chunks and then put it in the queue for faster processing? If so, then how?
Is Redis + Celery good enough for the task? Are there better tools available?
How do I ensure that the entire file has been processed and saved to the database? What if an error occurs while the workers are processing the file?
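For reference, a minimal sketch of the queue-based approach hinted at above, assuming the upload is first saved to Google Cloud Storage (e.g. via django-storages) and the existing Redis + Celery setup does the processing; the bucket name, the Record model, the field names and the "item" tag are hypothetical placeholders, and lxml's streaming iterparse keeps memory usage flat for large files:

# Hypothetical Celery task: stream-parse an XML file from GCS and save rows.
from celery import shared_task
from google.cloud import storage
from lxml import etree

from myapp.models import Record  # hypothetical Django model


@shared_task(bind=True, max_retries=3)
def import_xml(self, bucket_name: str, blob_name: str) -> int:
    try:
        blob = storage.Client().bucket(bucket_name).blob(blob_name)
        count = 0
        # Stream the blob so a large XML file never has to fit in memory.
        with blob.open("rb") as fh:
            for _, elem in etree.iterparse(fh, tag="item"):  # "item" is a placeholder tag
                Record.objects.create(value=elem.findtext("value"))
                count += 1
                elem.clear()  # free already-parsed nodes as we go
        return count
    except Exception as exc:
        # Re-queue the task instead of losing the work on transient errors.
        raise self.retry(exc=exc, countdown=60)

# In the Django view: save the upload to GCS, enqueue, and return immediately:
#   import_xml.delay("my-bucket", uploaded_blob_name)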

Related

How to store files on Airflow and then write to GCS?

I want to build a pipeline where I need to call an API and write the responses to a JSON file as NEWLINE_DELIMITED_JSON.
Then I want to move that file to GCS.
When I run this from my shell it uses my local disk, but I want to schedule this process using Airflow. As the file size is around 1 GB for a week's period, I don't think it's a good idea to keep files on Airflow. Can anyone suggest a good approach?
I am expecting the JSON response to be written to a file and that file to be moved to GCS.
I don't think it's a good idea to keep files on Airflow
Airflow is a workflow manager, not a storage layer; files are stored only for compute purposes while tasks are executed. Tasks should remove their files when finished (ideally by using NamedTemporaryFile, so this happens automatically).
You have several options.
In general it is not recommended to do disk/compute-intensive jobs on Airflow workers, but that recommendation is based on the assumption that Airflow workers are not designed for this kind of task. Should you want to, you can configure disk/compute-intensive workers and route your tasks to them. I can't say that I recommend this as a rule, but it is possible (and sometimes preferred due to simplicity).
As an improvement to (1), you can utilize an EC2/GCE machine, or the equivalent service of another cloud provider, to do the heavy lifting. In that case your pipeline would be:
start GCE machine -> download data to GCE -> load from GCE to GCS -> terminate GCE machine
As an improvement to (2), assuming you have enough memory in the machine, you can load the data directly from memory to GCS (without storing it on disk).
That can be done by using GcsHook.download_as_byte_array.
Then your pipeline can be:
start GCE machine -> get data to GCE and load to GCS -> terminate GCE machine
You can decide to write a YourApiToGcs operator to be used in several pipelines, or write a Python callable to be used with PythonOperator / TaskFlow.
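As a rough illustration of option (3), here is a minimal sketch of a TaskFlow DAG that calls an API, builds the NEWLINE_DELIMITED_JSON in memory and uploads it with GCSHook, assuming Airflow >= 2.4 with the Google provider installed; the API URL, bucket name, object path and connection id are placeholders:

import json

import requests
from airflow.decorators import dag, task
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from pendulum import datetime


@dag(schedule="@weekly", start_date=datetime(2023, 1, 1), catchup=False)
def api_to_gcs():

    @task
    def fetch_and_upload() -> str:
        # Call the (placeholder) API and collect the responses.
        records = requests.get("https://example.com/api/data", timeout=60).json()

        # Serialize as NEWLINE_DELIMITED_JSON entirely in memory.
        ndjson = "\n".join(json.dumps(r) for r in records)

        # Upload straight from memory; nothing is written to the worker's disk.
        GCSHook(gcp_conn_id="google_cloud_default").upload(
            bucket_name="my-bucket",           # placeholder bucket
            object_name="exports/data.json",   # placeholder object path
            data=ndjson,
            mime_type="application/json",
        )
        return "gs://my-bucket/exports/data.json"

    fetch_and_upload()


api_to_gcs()

The same callable could instead be wrapped in a custom YourApiToGcs operator if it needs to be reused across several pipelines.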

Is Apache Airflow or Luigi a good tool for this use case?

I'm working at an org that has an embarrassingly large amount of manual tasks that touch multiple databases (internal and external), multiple file types/formats, and a huge amount of datasets. The "general" workflow is to take data from one place, extract/create metadata for it, change the file format into something more standardised, and publish the dataset.
I'm trying to improve automation here, and I've been looking at Luigi and Apache Airflow to try to standardise some of the common blocks that get used, but I'm not sure if these are the appropriate tools. Before I sink too much time into figuring out these tools, I thought I'd ask here.
A dummy example:
Check a REST API end point to see if a dataset has changed (www.some_server.org/api/datasets/My_dataset/last_update_time)
If it's changed download the zip file (My_dataset.zip)
Unzip the file (My_dataset.zip >> my_file1.csv, my_file2.csv ... my_fileN.csv)
Do something with each CSV: filter, delete, pivot, whatever
Combine the CSVs and transform them into "My_filtered_dataset.json"
For each step, create/append to a "my_dataset_metadata.csv" file to record things like the processing date, inputs, authors, pipeline version, etc.
Upload json and metadata files somewhere else
My end goal would be to quickly swap out blocks, like swapping the "csv_to_json" function for a "csv_to_xlsx" function, for different processing tasks. I'd also like things like alerting on failure, job visualisation, worker management, etc.
One problem I'm seeing is that Luigi isn't so good at handling dynamic filenames and would struggle to create N branches when I don't know the number of files coming out of the zip file. It's also very basic and doesn't seem to have much community support.
Also, from the Airflow docs: "This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator" (although there does seem to be some support for this ability with XComs). In my dummy case I would probably need to share, at least, the filenames and the metadata between each step. Combining all steps into a single operator would kind of defeat the point of Airflow...
Am I misunderstanding things here? Are these tools good for this kind of application? Or is this task too simple/complex and should it just be stuck into a single script?
With Airflow you can achieve all your goals:
there are sensor operators to wait for a condition: check an API, check if a file exists, run a query on a database and check the result, ...
create a DAG to define the dependencies between your tasks, and decide which tasks can run in parallel and which should run sequentially
a lot of existing operators developed by the community: SSH operators, operators to interact with cloud provider services, ...
a built-in mechanism to send emails on run failure and to retry
it's based on Python scripts, so you can write a method that creates DAGs dynamically (a DAG factory); if you have several DAGs that share part of the same logic, you can generate them from one parameterised method
a built-in messaging system (XCom), to send small data between tasks
a secure way to store your credentials and secrets (Airflow variables and connections)
a modern UI to manage your dag runs and read the logs of your tasks, with Access Control.
you can develop your own plugins and add them to Airflow (ex: UI plugin using FlaskAppBuilder)
you can process each file in a separate task in parallel, on a cluster of nodes (Celery or K8S), using the new Dynamic Task Mapping feature (Airflow >= 2.3); see the sketch after this list
To pass files between tasks, which is a basic need for everyone, you can use an external storage service (Google GCS, AWS S3, ...) to store the output of each task, use XCom to pass the file path, then read the file in the next task. You can also use a custom XCom backend (S3, for example) instead of the Airflow metastore DB; in that case all the variables and files passed by XCom will be stored automatically on S3, and there will no longer be a limit on message size.
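A minimal sketch of the Dynamic Task Mapping point above (Airflow >= 2.3, TaskFlow API); the archive path and the transform body are hypothetical placeholders, and the file paths travel between tasks as XCom return values:

import zipfile
from pathlib import Path
from typing import List

from airflow.decorators import dag, task
from pendulum import datetime


@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def process_dataset():

    @task
    def unzip(archive: str = "/tmp/My_dataset.zip") -> List[str]:
        # Extract the archive; the number of CSVs is only known at run time.
        out_dir = Path("/tmp/my_dataset")
        out_dir.mkdir(exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out_dir)
        return [str(p) for p in out_dir.glob("*.csv")]

    @task
    def transform(csv_path: str) -> str:
        # Placeholder transform: filter, pivot, etc., then write a new file.
        out_path = csv_path.replace(".csv", "_filtered.csv")
        # ... real processing here ...
        return out_path

    @task
    def combine(paths: List[str]) -> None:
        # Merge the transformed CSVs into "My_filtered_dataset.json".
        print(f"combining {len(paths)} files")

    # expand() creates one transform task instance per file found by unzip().
    combine(transform.expand(csv_path=unzip()))


process_dataset()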

Buffer for continuously generated CSV files to upload to MongoDB

I'm trying to figure out a way, in my Flask application, to store the multiple CSVs that are continuously produced by each thread in a buffer before uploading them to a MongoDB database. The reason I would like to use a buffer is to guarantee some level of persistence and proper handling of errors (in case of network failure, I want to retry uploading the CSV to Mongo).
I thought about using a task queue such as Celery with a message broker (RabbitMQ), but wasn't sure if that was the right way to go. Sorry if this isn't a question suitable for SO; I just wanted clarification on how to go about doing this. Thank you in advance.
Sounds like you want something like the Linux tail command. Tail prints each line of a file as soon as it is updated. I'm assuming this CSV file is generated by a separate program that is running at the same time. See How can I tail a log file in Python? for how to implement tail in Python.
Note: You might be better off dumping the CSVs in batches. It won't be real-time, but if that's not important, it'll be more efficient.
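A minimal sketch of that batched tail-and-upload idea, assuming pymongo is available; the file path, connection string, database and collection names are placeholders:

import time

from pymongo import MongoClient


def follow(path, batch_size=100):
    """Yield batches of lines appended to `path`, tail -f style."""
    batch = []
    with open(path) as f:
        f.seek(0, 2)  # start at the current end of the file
        while True:
            line = f.readline()
            if not line:
                if batch:  # flush a partial batch while the file is idle
                    yield batch
                    batch = []
                time.sleep(1.0)
                continue
            batch.append(line.rstrip("\n"))
            if len(batch) >= batch_size:
                yield batch
                batch = []


coll = MongoClient("mongodb://localhost:27017")["mydb"]["rows"]
for rows in follow("/tmp/output.csv"):
    # On network failure insert_many raises; wrap this in retry logic or hand
    # the batch to a Celery task if stronger persistence guarantees are needed.
    coll.insert_many([{"raw": r} for r in rows])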

What is the use of Celery in python?

I am confused about Celery. For example, I want to load a data file and it takes 10 seconds to load without Celery. With Celery, how will the user benefit? Will it take the same time to load the data?
Celery, and similar systems like Huey, are made to help us distribute (offload) work that normally can't execute concurrently on a single machine, or that would cause significant performance degradation if it did. The key word here is DISTRIBUTED.
You mentioned downloading a file. If it is a single file you need to download, and that is all, then you do not need Celery. How about a more complex scenario: you need to download 100,000 files? How about an even more complex one: these 100,000 files need to be parsed, and the parsing process is CPU intensive?
Moreover, Celery will help you with retrying failed tasks, logging, monitoring, etc.
Normally, the user has to wait until loading the data file is done on the server. But with the help of Celery, the operation will be performed on the server in the background and the user will not be blocked. Even if the app crashes, that task will remain queued.
Celery will keep track of the work you send to it in a database back-end such as Redis or RabbitMQ. This keeps the state out of your app server's process, which means even if your app server crashes your job queue will still remain. Celery also allows you to track tasks that fail.
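As a concrete (hypothetical) illustration: offload the 10-second load to a Celery worker so the web request returns immediately; the broker URL and the task body are placeholders:

import time

from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")


@app.task(bind=True, max_retries=3)
def load_data_file(self, path):
    try:
        time.sleep(10)  # stand-in for the slow 10-second load
        return 42       # e.g. number of rows loaded
    except Exception as exc:
        # The broker keeps the task, so it can be retried instead of being lost.
        raise self.retry(exc=exc, countdown=30)

# In the web view: enqueue and return at once; the user does not wait.
#   result = load_data_file.delay("/data/input.csv")
#   ... later: result.get(timeout=60), or poll result.state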

Data buffering/storage - Python

I am writing an embedded application that reads data from a set of sensors and uploads it to a central server. The application is written in Python and runs on a Raspberry Pi unit.
The data needs to be collected every minute; however, the Internet connection is unstable and I need to buffer the data to non-volatile storage (an SD card, etc.) whenever there is no connection. The buffered data should be uploaded as and when the connection comes back.
Presently, I'm thinking about storing the buffered data in an SQLite database and writing a cron job that continuously reads the data from this database and uploads it.
Is there a Python module that can be used for such a feature?
Is there a Python module that can be used for such a feature?
I'm not aware of any readily available module; however, it should be quite straightforward to build one, given your requirement:
the Internet connection is unstable and I need to buffer the data to non-volatile storage (an SD card, etc.) whenever there is no connection. The buffered data should be uploaded as and when the connection comes back.
The algorithm looks something like this (pseudo code):
# buffering module
data = read(sensors)
db.insert(data)
# upload module
# e.g. scheduled every 5 minutes via cron
data = db.read(created > last_successful_upload)
success = upload(data)
if success:
    last_successful_upload = max(data.created)
The key is to separate the buffering and uploading concerns, i.e. when reading data from the sensor, don't attempt to upload it immediately; always upload from the scheduled module. This keeps the two modules simple and stable.
There are, however, a few edge cases that you need to take care of to make this work reliably:
inserting data while an upload is in progress
SQLite doesn't handle access from multiple processes well
To solve this, you might want to consider another database, or create multiple SQLite databases or even flat files for each batch of uploads.
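A runnable sketch of the buffer-then-upload split described above, using the standard library sqlite3 module and requests; the sensor read, the upload endpoint and the table layout are placeholder assumptions:

import sqlite3
import time

import requests

DB = "buffer.db"


def init_db():
    with sqlite3.connect(DB) as con:
        con.execute("""CREATE TABLE IF NOT EXISTS readings (
            id INTEGER PRIMARY KEY,
            created REAL,
            value REAL,
            uploaded INTEGER DEFAULT 0)""")


# buffering module: run every minute (e.g. via cron)
def buffer_reading(value):
    with sqlite3.connect(DB) as con:
        con.execute("INSERT INTO readings (created, value) VALUES (?, ?)",
                    (time.time(), value))


# upload module: run every 5 minutes (e.g. via cron)
def upload_pending():
    with sqlite3.connect(DB) as con:
        rows = con.execute(
            "SELECT id, created, value FROM readings WHERE uploaded = 0").fetchall()
        if not rows:
            return
        payload = [{"created": c, "value": v} for _, c, v in rows]
        resp = requests.post("https://example.com/ingest", json=payload, timeout=30)
        if resp.ok:
            # Mark rows as uploaded only after the server accepted them.
            con.executemany("UPDATE readings SET uploaded = 1 WHERE id = ?",
                            [(r[0],) for r in rows])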
If you mean a module to work with SQLite database, check out SQLAlchemy.
If you mean a module which can do what cron does, check out sched, a python event scheduler.
However, this looks like a perfect place to implement a task queue, using a dedicated task broker (RabbitMQ, Redis, ZeroMQ, ...) or Python's threads and queues. In general, you want to submit an upload task, and a worker thread will pick it up and execute it, while the task broker handles retries and failures. All this happens asynchronously, without blocking your main app.
UPD: Just to clarify, you don't need the database if you use a task broker, because the task broker stores the tasks for you.
This is purely database work. You can create master and slave databases in different locations, and if one is not on the network, it will run with the last synced info.
When the connection comes back, it merges all the data.
Take a look at this answer and search for master and slave databases.
