How to store files on Airflow and then write to GCS? - python

I want to build a pipeline where I call an API and write the responses to a JSON file as NEWLINE_DELIMITED_JSON.
Then I want to move that file to GCS.
Working from my shell this uses my local disk, but I want to schedule the process with Airflow. Since the file size is around 1 GB for a one-week period, I don't think it's a good idea to keep files on Airflow. Can anyone suggest a good approach?
I am expecting the JSON response to be written to a file and that file to be moved to GCS.

I don't think it's a good idea to keep files on Airflow.
Airflow is a workflow manager, not a storage layer, so files are stored only for compute purposes while tasks are executed. Tasks should remove their files when finished (ideally by using NamedTemporaryFile, so this happens automatically).
You have several options.
In general it is not recommended to run disk/compute-intensive jobs on Airflow workers, but that recommendation is based on the assumption that Airflow workers are not designed for this kind of task. Should you want to, you can configure disk/compute-heavy workers and route your tasks to them. I can't say that I recommend this as a rule, but it is possible (and sometimes preferred for its simplicity).
As an improvement to (1), you can utilize an EC2/GCE machine (or the equivalent service of another cloud provider) to do the heavy lifting. In that case your pipeline would be:
start GCE machine -> download data to GCE -> load from GCE to GCS -> terminate GCE machine
As an improvement to (2), assuming you have enough memory in the machine, you can load the data into memory and write it to GCS directly (without storing data on disk).
That can be done with GCSHook, whose upload method accepts in-memory data (download_as_byte_array covers the reverse direction).
then your pipeline can be:
start GCE machine -> get data to GCE and load to GCS -> terminate GCE machine
You can decide to write a YourApiToGcs operator to be reused in several pipelines, or write a Python callable to be used with PythonOperator/TaskFlow.
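As a minimal TaskFlow sketch of option (3), assuming a recent Airflow 2.x with the Google provider installed: the endpoint URL, bucket name and connection id below are placeholders. The response is built as newline-delimited JSON in memory and handed to GCSHook.upload via its data argument, so nothing is written to the worker's disk.

```python
# Sketch only: the API endpoint, bucket and connection id below are placeholders.
import json

import requests
from airflow.decorators import dag, task
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from pendulum import datetime


@dag(schedule="@weekly", start_date=datetime(2023, 1, 1), catchup=False)
def api_to_gcs():

    @task
    def fetch_and_upload(data_interval_start=None):
        # Call the API and build NEWLINE_DELIMITED_JSON in memory.
        records = requests.get("https://example.com/api/records", timeout=300).json()
        ndjson = "\n".join(json.dumps(record) for record in records)

        # Upload straight from memory -- no file ever lands on the worker.
        GCSHook(gcp_conn_id="google_cloud_default").upload(
            bucket_name="my-bucket",
            object_name=f"api_dump/{data_interval_start}.json",
            data=ndjson,
            mime_type="application/json",
        )

    fetch_and_upload()


api_to_gcs()
```

If the response turns out to be too large for worker memory, the same task can instead write to a NamedTemporaryFile and pass the filename to upload, cleaning up automatically when the task finishes.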

Related

Simplest way to retrieve a Job's results?

I have a Python program launching a batch job. The job outputs a JSON file, and I'd like to know the easiest way to get this result back to the Python program that launched it.
So far I thought of these solutions:
Upload the JSON file to S3 (pretty heavy)
Display it in the pod logs, then read the logs from the Python program (pretty hacky/dirty)
Mount a PVC, launch a second pod with the same PVC, and create a shared disk between this pod and the job (pretty overkill)
The JSON file is pretty lightweight. Isn't there a solution to do something like adding some metadata to the pod when the job completes? The Python program could then just poll that metadata.
An easy way that doesn't involve any other databases/pods is to run the first pod as an init container, mount a volume that is shared by both containers, and read the JSON file in the next Python program. (This approach does not need a persistent volume, just a shared one.) See this example:
https://kubernetes.io/docs/tasks/access-application-cluster/communicate-containers-same-pod-shared-volume/
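A rough sketch of that pattern using the official kubernetes Python client (you could equally write the manifest as YAML); the image names, commands and the /results path are placeholders:

```python
# Sketch only: images, commands and paths are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

shared = client.V1Volume(name="results", empty_dir=client.V1EmptyDirVolumeSource())

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="batch-with-results"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        volumes=[shared],
        # The batch step runs first as an init container and writes its JSON
        # output to the shared emptyDir volume.
        init_containers=[
            client.V1Container(
                name="batch-job",
                image="my-batch-image:latest",
                command=["python", "run_job.py", "--out", "/results/output.json"],
                volume_mounts=[client.V1VolumeMount(name="results", mount_path="/results")],
            )
        ],
        # The main container starts only after the init container succeeds,
        # so it can read /results/output.json directly.
        containers=[
            client.V1Container(
                name="consumer",
                image="my-consumer-image:latest",
                command=["python", "consume.py", "--in", "/results/output.json"],
                volume_mounts=[client.V1VolumeMount(name="results", mount_path="/results")],
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Because the init container must complete successfully before the main container starts, the consumer can safely assume the JSON file already exists when it runs.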
Also, depending on the complexity of these jobs, I would recommend taking a look at Argo Workflows or other DAG-oriented job schedulers.

Is Apache Airflow or Luigi a good tool for this use case?

I'm working at an org that has an embarrassingly large number of manual tasks that touch multiple databases (internal and external), multiple file types/formats, and a huge number of datasets. The "general" workflow is to take data from one place, extract/create metadata for it, change the file format into something more standardised, and publish the dataset.
I'm trying to improve automation here, and I've been looking at Luigi and Apache Airflow to try to standardise some of the common blocks that get used, but I'm not sure these are the appropriate tools. Before I sink too much time into figuring out these tools, I thought I'd ask here.
A dummy example:
Check a REST API end point to see if a dataset has changed (www.some_server.org/api/datasets/My_dataset/last_update_time)
If it has changed, download the zip file (My_dataset.zip)
Unzip the file (My_dataset.zip >> my_file1.csv, my_file2.csv ... my_fileN.csv)
Do something with each CSV: filter, delete, pivot, whatever
Combine the CSVs and transform them into "My_filtered_dataset.json"
For each step, create/append to a "my_dataset_metadata.csv" file to record things like the processing date, inputs, authors, pipeline version etc.
Upload the JSON and metadata files somewhere else
My end goal would be to quickly swap out blocks, like swapping the "csv_to_json" function for a "csv_to_xlsx" function, for different processing tasks. I'd also like things like alerting on failure, job visualisation, worker management etc.
Some problems I'm seeing are that Luigi isn't so good at handling dynamic filenames and would struggle to create N branches when I don't know how many files will come out of the zip file. It's also very basic and doesn't seem to have much community support.
Also, from the Airflow docs: "This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator" (although there does seem to be some support for this with XComs). In my dummy case I would probably need to share, at least, the filenames and the metadata between each step. Combining all steps into a single operator would kind of defeat the point of Airflow...
Am I misunderstanding things here? Are these tools good for this kind of application? Is this task too simple/complex and should just be stuck into a single script?
With Airflow you can achieve all your goals:
there are sensor operators to wait for a condition: check an API, check if a file exists, run a query on a database and check the result, ...
create a DAG to define the dependencies between your tasks, and decide which tasks can run in parallel and which should run sequentially
a lot of existing operators developed by the community: SSH operators, operators to interact with cloud providers' services, ...
a built-in mechanism to send emails on run failure and on retry
it's based on Python scripts, so you can write a method that creates DAGs dynamically (a DAG factory); if several DAGs share part of the same logic, you can generate them from a single parameterised method
a built-in messaging system (XCom) to send small data between tasks
a secure way to store your credentials and secrets (Airflow variables and connections)
a modern UI to manage your DAG runs and read your task logs, with access control
you can develop your own plugins and add them to Airflow (e.g. a UI plugin using Flask AppBuilder)
you can process each file in a separate task in parallel, on a cluster of nodes (Celery or K8s), using the new Dynamic Task Mapping feature (Airflow >= 2.3)
To pass files between tasks, which is a basic need for everyone, you can use an external storage service (Google GCS, AWS S3, ...) to store the output of each task, use XCom to pass the file path, then read the file in the next task. You can also use a custom XCom backend (S3, for example) instead of the Airflow metastore DB; in that case all the variables and files passed by XCom are stored on S3 automatically and there is no longer a limit on message size.
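As an illustration of the last two points, here is a rough sketch of the dummy pipeline above using TaskFlow and Dynamic Task Mapping (Airflow >= 2.3). The bucket, paths and processing bodies are placeholders, and only GCS paths travel through XCom, never the files themselves:

```python
# Sketch only: bucket names, paths and the processing logic are placeholders.
from airflow.decorators import dag, task
from pendulum import datetime


@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def dataset_pipeline():

    @task
    def download_and_unzip():
        # Download My_dataset.zip, extract it to object storage, and return one
        # GCS path per extracted CSV -- paths are tiny, so XCom handles them fine.
        return ["gs://my-bucket/raw/my_file1.csv", "gs://my-bucket/raw/my_file2.csv"]

    @task
    def process_csv(gcs_path):
        # Filter / pivot a single CSV and write the result back to storage,
        # returning only the new path.
        return gcs_path.replace("/raw/", "/processed/")

    @task
    def combine_to_json(processed_paths):
        # Merge the processed CSVs into the final JSON output.
        return "gs://my-bucket/output/My_filtered_dataset.json"

    # expand() creates one mapped task instance per CSV at runtime, so the number
    # of files inside the zip does not need to be known when the DAG is written.
    processed = process_csv.expand(gcs_path=download_and_unzip())
    combine_to_json(processed)


dataset_pipeline()
```

Swapping "csv_to_json" for "csv_to_xlsx" is then just a matter of replacing the body (or operator) of the combine step.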

System Design For Asynchronously Processing Large XML Files

We are running a web application on GCP using Django. One of the features involves processing an XML file provided by the user and saving its data into the PostgreSQL database. We are facing two issues with the current setup:
The task is NOT asynchronous, which means the user has to wait until the app has finished processing and saving the data.
The application times out when processing and saving large XML files.
What is a good system architecture we can implement here to solve these issues?
NOTE: We already have Redis + Celery set up, but we are NOT using them for this particular task.
Theoretically, the answer might be as simple as: use a queue (a rough sketch of that idea follows the questions below). But the key questions I have are:
How do I pass the file from the user’s local computer 🖥 to the queue? Should I first upload it to Google Cloud Storage? (If I were to upload it to the storage, I guess I would end up facing the same two issues while uploading the file)
Should I further break down the file into smaller chunks and then put it in the queue for faster processing? If so, then how?
Is Redis + Celery good enough for the task? Are there better tools available?
How do I ensure that the entire file has been processed and saved to the database? What if an error occurs while the workers are processing the file?
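For reference, a minimal sketch of the "use a queue" idea with the Redis + Celery setup already in place: the upload goes to Cloud Storage first (ideally straight from the browser via a signed URL, so the Django process never holds the file), and a Celery task then streams and parses the XML. The bucket, element tag and save helper here are hypothetical:

```python
# Sketch only: bucket, element tag and the save helper are placeholders.
import xml.etree.ElementTree as ET

from celery import shared_task
from google.cloud import storage


@shared_task(autoretry_for=(Exception,), max_retries=3)
def process_xml(bucket_name, blob_name):
    blob = storage.Client().bucket(bucket_name).blob(blob_name)

    # Stream and parse incrementally so large files don't exhaust worker memory.
    with blob.open("rb") as fh:
        for _, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == "record":   # placeholder element name
                save_record(elem)      # hypothetical helper mapping XML -> Django model
            elem.clear()               # release parsed elements


def save_record(elem):
    """Hypothetical: build and save a Django model instance from one XML element."""
    ...
```

The view then only needs to enqueue process_xml.delay(bucket, blob_name) and return immediately; Celery's retry and acknowledgement settings cover the "what if a worker fails mid-file" concern.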

Download 5M of 1MB-sized archive files from an external FTP server to AWS S3

The problem
I have to download a lot of .tar.gz files (5 million) to AWS S3, each with an approximate size of 1 MB, stored on an external FTP server (I don't control it).
My try
I have already implemented a solution based on Python's concurrent.futures.ThreadPoolExecutor and s3fs modules. I tested it on a subset of 10K files, and it took around 20 minutes for the full process (download using this approach, then store on AWS S3 using s3fs). This means that 10,000 / 20 = 500 archives are processed each minute. For 5 million, it would take 5M / 500 = 10,000 minutes of processing, roughly 7 days. I can't afford to wait that long (for time and cost reasons, and because I fear the FTP server will break the connection with my IP).
For that task, I used an r5.metal instance, one of the most powerful in terms of vCPUs (96) and network performance I could find in the EC2 catalogue.
My questions
So I ask:
What would be the best solution for this problem?
Is there a solution that takes less than one week?
Are there instances that are better than r5.metal for this job?
Is there a cost-effective and scalable dedicated service on AWS?
In this particular case, what's the best fit among threading, multiprocessing and asyncio (or other approaches)? Same question for downloading 1,000 files of approximately 50 MB each.
Any help is much appreciated.
There are two approaches you might take...
Using Amazon EC2
Pass a sub-list of files (100?) to your Python script. Have it loop through the files, downloading each in turn to the local disk. Then, copy it up to Amazon S3 using boto3.
Do not worry about how to write it as threads or do fancy async stuff. Instead, just run lots of those Python scripts in parallel, each with its own list of files to copy. Once you get enough of them running in parallel (just run the script in the background using &), monitor the instance to determine where the bottleneck lies -- you'll probably find that CPU and RAM aren't the problem; it's more likely to be the remote FTP server, which can only handle a certain volume of queries and/or bandwidth of data.
You should then be able to determine the 'sweet spot' to get the fastest throughput with the minimal cost (if that is even a consideration). You could even run multiple EC2 instances in parallel, each running the script in parallel.
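A bare-bones sketch of such a per-batch script; the FTP host, credentials, bucket and file-list format are all placeholders:

```python
# download_batch.py -- sketch only; host, credentials, paths and bucket are placeholders.
# Usage: python download_batch.py filelist_001.txt &
import ftplib
import pathlib
import sys

import boto3

s3 = boto3.client("s3")
BUCKET = "my-archive-bucket"  # placeholder

with ftplib.FTP("ftp.example.com") as ftp:
    ftp.login("user", "password")
    for name in pathlib.Path(sys.argv[1]).read_text().splitlines():
        local = pathlib.Path("/tmp") / name
        # Download one archive to local disk...
        with open(local, "wb") as fh:
            ftp.retrbinary(f"RETR {name}", fh.write)
        # ...copy it up to S3, then free the disk space.
        s3.upload_file(str(local), BUCKET, name)
        local.unlink()
```

You can then launch as many copies in the background as the FTP server tolerates, each with its own file list, and watch where the throughput tops out.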
Using AWS Lambda
Push a small list of filenames into an Amazon SQS queue.
Then, create an AWS Lambda function that is triggered from the SQS queue. The function should retrieve the files from the FTP server, save to local disk, then use boto3 to copy them to S3. (Make sure to delete the files after uploading to S3, since there is only limited space in a Lambda function container.)
This will use the parallel capabilities of AWS Lambda to perform the operations in parallel. By default, you can run 1000 Lambda functions in parallel, but you can request an increase in this limit.
Start by testing it with a few files pushed into the SQS queue. If that works, send a few thousand messages and see how well it handles the load. You can also play with memory allocations in Lambda, but the minimum level will probably suffice.
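A sketch of such a Lambda handler, assuming each SQS message body is a small JSON list of filenames; since the archives are only around 1 MB, this version buffers them in memory instead of /tmp, which sidesteps the cleanup step:

```python
# lambda_handler.py -- sketch only; the FTP host, credentials and bucket are placeholders.
import ftplib
import io
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-archive-bucket"  # placeholder


def handler(event, context):
    # Each SQS record carries a small JSON list of filenames to fetch.
    with ftplib.FTP("ftp.example.com") as ftp:
        ftp.login("user", "password")
        for record in event["Records"]:
            for name in json.loads(record["body"]):
                buf = io.BytesIO()
                ftp.retrbinary(f"RETR {name}", buf.write)  # ~1 MB, fits in memory
                buf.seek(0)
                s3.upload_fileobj(buf, BUCKET, name)       # nothing left to clean up
```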
Reconciliation
Assume that files will fail to download. Rather than retrying them, let them fail.
Then, after all the scripts have run (in either EC2 or Lambda), do a reconciliation of the files uploaded to S3 with your master list of files. Note that listing files in S3 can be a little slow (it retrieves 1000 per API call) so you might want to use Amazon S3 Inventory, which can provide a daily CSV file listing all objects.
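The reconciliation itself is just a set difference between the master list and the keys that made it to S3, something along these lines (the bucket and list file are placeholders; with 5M objects the same comparison is better run against an S3 Inventory CSV):

```python
# Sketch: compare the master list against what actually landed in S3.
import boto3

BUCKET = "my-archive-bucket"  # placeholder

uploaded = set()
paginator = boto3.client("s3").get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    uploaded.update(obj["Key"] for obj in page.get("Contents", []))

master = set(open("master_list.txt").read().splitlines())
missing = master - uploaded   # feed these back into the queue / next EC2 run
print(f"{len(missing)} files still to download")
```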
General
Regardless of which approach you take, things will go wrong. For example, the remote FTP server might only allow a limited number of connections. It might have bandwidth limitations. Downloads will randomly fail. Since this is a one-off activity, it's more important to just get the files downloaded than to make the world's best process. If you don't want to wait 34 days for the download, it's imperative that you get something going quickly, so it is at least downloading while you tweak and improve the process.
Good luck! Let us know how you go!

AWS EC2: How to process queue with 100 parallel ec2 instances?

I'm on Ubuntu 12.04. I envision some sort of queue to put thousands of tasks on, and have the EC2 instances plow through it in parallel (100 EC2 instances), with each instance handling one task from the queue at a time.
Also, each EC2 instance should use the image I provide, which will have the binaries and software installed on it for use.
Essentially what I am trying to do is run 100 processing jobs (a Python function using packages that depend on binaries installed on that image) in parallel on Amazon's EC2 for an hour or less, shut them all off, and repeat the process whenever needed.
Is this doable? I am using Python Boto to do this.
This is doable. You should look into using SQS. Jobs are placed on a queue and the worker instances pop jobs off the queue and perform the appropriate work. As a job is completed, the worker deletes the job from the queue so no job is run more than once.
You can configure your instances using user-data at boot time or you can bake AMIs with all of your software pre-installed. I recommend Packer for baking AMIs as it works really well and is very scriptable so your AMIs can be rebuilt consistently as things need to be changed.
For turning on and off lots of instances, look into using AutoScaling. Simply set the group's desired capacity to the number of worker instances you want running and it will take care of the rest.
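A minimal sketch of the worker loop each instance would run (the queue URL and the processing function are placeholders):

```python
# worker.py -- sketch only: the queue URL and the task logic are placeholders.
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/task-queue"  # placeholder
sqs = boto3.client("sqs")


def run_task(body):
    """Hypothetical: run the image-processing function on one task description."""
    ...


while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        run_task(msg["Body"])
        # Delete only after the work succeeds; if the worker dies mid-task the
        # message becomes visible again and another instance picks it up.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

User-data at boot can start this script, and the AutoScaling group's desired capacity then controls how many of these loops are running at once.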
This sounds like it might be easier to do with EMR.
You mentioned in comments that you are doing computer vision. You can make your job Hadoop-friendly by preparing a file where each line is a base64 encoding of an image file.
You can prepare a simple bootstrap script to make sure each node of the cluster has your software installed. Hadoop Streaming will allow you to use your image-processing code as-is for the job (instead of rewriting it in Java).
When your job is over, the cluster instances will be shut down. You can also specify that your output be streamed directly to an S3 bucket; it's all baked in. EMR is also cheap: 100 m1.medium EC2 instances running for an hour will only cost you around 2 dollars according to the most recent pricing: http://aws.amazon.com/elasticmapreduce/pricing/
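For context, a Hadoop Streaming mapper for that layout is just a script reading stdin; here is a sketch that assumes one base64-encoded image per line, with the real computer-vision call left as a placeholder:

```python
#!/usr/bin/env python
# mapper.py -- Streaming mapper sketch; assumes one base64-encoded image per input line.
import base64
import sys

for line_number, line in enumerate(sys.stdin):
    image_bytes = base64.b64decode(line.strip())
    result = len(image_bytes)          # placeholder for the real image-processing call
    print(f"{line_number}\t{result}")  # emitted to the job's output (e.g. the S3 bucket)
```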
