This is more of an architectural question. If I should be asking this question elsewhere, please let me know and I shall.
I have a use-case where I need to run the same python script (could be long-running) multiple times based on demand and pass on different parameters to each of them. The trigger for running the script is external and it should be able to pass on a STRING parameter to the script along with the resource configurations each script requires to run. We plan to start using AWS by next month and I was going through the various options I have. AWS Batch and Fargate both seem like feasible options given my requirements of not having to manage the infrastructure, dynamic spawning of jobs and management of jobs via python SDK.
The only problem is that neither of these services is available in India, and I need my processing servers to be physically located in India. What options do I have? Auto-scaling and Python SDK management (creation and deletion of tasks) are my main requirements (preferably containerized).
Why are you restricted to India? Often such restrictions are due to data-residency requirements, in which case you can just store your data on Indian servers (S3, DynamoDB, etc.) and then run your 'program' in another AWS region.
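A minimal sketch of that split, assuming boto3 and placeholder queue/bucket names: the data services are pinned to ap-south-1 (Mumbai) while the compute client points at a region where Batch is available (ap-southeast-1 here is only an example).

```python
# Sketch: keep data in India (ap-south-1) while compute runs elsewhere.
# Assumes boto3 is installed and AWS credentials are configured; the queue
# and job-definition names below are placeholders.

DATA_REGION = "ap-south-1"         # Mumbai: where the data must physically live
COMPUTE_REGION = "ap-southeast-1"  # any region where AWS Batch is available

def client_regions():
    """Pure helper: which service talks to which region."""
    return {"s3": DATA_REGION, "batch": COMPUTE_REGION}

def submit_job(job_name, input_key):
    import boto3  # deferred so the sketch can be read without AWS access
    regions = client_regions()
    s3 = boto3.client("s3", region_name=regions["s3"])        # data stays here
    batch = boto3.client("batch", region_name=regions["batch"])
    # The job receives the S3 key of its input in the Indian bucket and
    # reads/writes only through that bucket.
    return batch.submit_job(
        jobName=job_name,
        jobQueue="my-queue",         # placeholder
        jobDefinition="my-job-def",  # placeholder
        parameters={"input_key": input_key},
    )
```

Whether this satisfies your constraint depends on whether only storage, or also processing, must stay in India — worth confirming before committing to it.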
I'm working at an org that has an embarrassingly large number of manual tasks touching multiple databases (internal and external), multiple file types/formats, and a huge number of datasets. The "general" workflow is to take data from one place, extract/create metadata for it, change the file format into something more standardised, and publish the dataset.
I'm trying to improve automation here and I've been looking at Luigi and Apache Airflow to try to standardise some of the common blocks that get used, but I'm not sure if these are the appropriate tools. Before I sink too much time into figuring them out, I thought I'd ask here.
A dummy example:
Check a REST API end point to see if a dataset has changed (www.some_server.org/api/datasets/My_dataset/last_update_time)
If it's changed download the zip file (My_dataset.zip)
Unzip the file (My_dataset.zip >> my_file1.csv, my_file2.csv ... my_fileN.csv)
Do something with each CSV: filter, delete, pivot, whatever
Combine the CSVs and transform them into "My_filtered_dataset.json"
For each step create/append a "my_dataset_metadata.csv" file to show things like the processing date, inputs, authors, pipeline version etc.
Upload json and metadata files somewhere else
My end goal would be to quickly swap out blocks, like the "csv_to_json" function with a "csv_to_xlsx" function, for different processing tasks. Also have things like alerting on failure, job visualisation, worker management etc.
Some problems I'm seeing are that Luigi isn't so good at handling dynamic filenames and would struggle to create N branches when I don't know the number of files coming out of the zip file. It's also very basic and doesn't seem to have much community support.
Also, from the Airflow docs: "This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator." (although there does seem to be some support for this ability with XComs). In my dummy case I would probably need to share, at least, the filenames and the metadata between each step. Combining all steps into a single operator would kind of defeat the point of Airflow...
Am I misunderstanding things here? Are these tools good for this kind of application? Is this task too simple/complex and should just be stuck into a single script?
With Airflow you can achieve all your goals:
there are sensor operators to wait for a condition: check an API, check if a file exists, run a query on a database and check the result, ...
you create a DAG to define the dependencies between your tasks and decide which tasks can run in parallel and which should run sequentially
a lot of existing operators developed by the community: SSH operators, operators to interact with cloud provider services, ...
a built-in mechanism to send emails on run failure and to retry
it's based on Python scripts, so you can create a method that builds DAGs dynamically (a DAG factory); if several DAGs share part of the same logic, you can generate them from a single parameterised method
a built-in messaging system (XCom) to send small amounts of data between tasks
a secure way to store your credentials and secrets (Airflow Variables and Connections)
a modern UI to manage your DAG runs and read the logs of your tasks, with access control
you can develop your own plugins and add them to Airflow (e.g. a UI plugin using FlaskAppBuilder)
you can process each file in a separate task in parallel, on a cluster of nodes (Celery or K8s), using the new Dynamic Task Mapping feature (Airflow >= 2.3)
To pass files between tasks, which is a basic need for everyone, you can use an external storage service (Google GCS, AWS S3, ...) to store the output of each task, use XCom to pass the file path, then read the file in the next task. You can also use a custom XCom backend (S3, for example, instead of the Airflow metastore DB); in that case all the variables and files passed by XCom are stored on S3 automatically, and there is no longer a limit on message size.
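The dummy pipeline above could be sketched as a single DAG with dynamic task mapping — a rough sketch, assuming Airflow >= 2.3; the endpoint, paths and task bodies are placeholders, and Airflow is imported lazily inside the factory so the pure helper stays testable without it installed:

```python
# Sketch: the dummy dataset pipeline as an Airflow DAG with dynamic task
# mapping (Airflow >= 2.3). All concrete names/paths are placeholders.

def metadata_row(step, inputs, version="0.1"):
    """Pure helper: one record for my_dataset_metadata.csv."""
    import datetime
    return {"step": step, "inputs": sorted(inputs),
            "date": datetime.date.today().isoformat(), "version": version}

def build_dag():
    # Deferred import: only needed where Airflow actually runs this file.
    from airflow.decorators import dag, task
    import pendulum

    @dag(schedule="@daily",
         start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
         catchup=False)
    def my_dataset_pipeline():

        @task
        def download_and_unzip():
            # GET .../api/datasets/My_dataset/last_update_time, download the
            # zip if it changed, unzip, return the CSV paths (unknown count N).
            return ["my_file1.csv"]  # placeholder

        @task
        def process_csv(path):
            # filter/delete/pivot one CSV; also append a metadata_row(...)
            return path

        @task
        def combine(paths):
            # merge processed CSVs into My_filtered_dataset.json and upload
            # it together with my_dataset_metadata.csv
            ...

        # expand() fans out one task instance per CSV at runtime
        combine(process_csv.expand(path=download_and_unzip()))

    return my_dataset_pipeline()
```

`expand` is exactly the unknown-N fan-out that Luigi struggles with: the number of mapped `process_csv` instances is decided at runtime from the upstream return value, and only the small path strings travel through XCom.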
Hi all, I am planning to build a system for my team where we can spin up AWS Batch infrastructure, run a task, and destroy the infrastructure once the job is done.
I am thinking of a Makefile with these steps: 1. terraform apply the AWS Batch infra, 2. run the task, 3. check at a regular interval whether the task is complete, 4. if the task is complete, destroy the infra.
What is the most efficient way of doing this? Given that our team needs to run a wide variety of tasks on AWS Batch, we want to automate this so that a single command does it all.
Should we explore Airflow for this? Or is there a better way to do this? Your thoughts would be highly appreciated. Thank you
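The four steps above fit in one short Python wrapper — a sketch, assuming `terraform` is on the PATH, boto3 is installed and configured, and the queue/job-definition names (placeholders here) come out of the Terraform config:

```python
# Sketch: terraform apply -> submit Batch job -> poll -> terraform destroy.
# Queue and job-definition names are placeholders.
import subprocess
import time

TERMINAL_STATES = ("SUCCEEDED", "FAILED")

def job_done(status):
    """Pure helper: is a Batch job status terminal?"""
    return status in TERMINAL_STATES

def run_batch_task(job_name, poll_seconds=30):
    import boto3  # deferred so the helpers are testable without AWS access
    subprocess.run(["terraform", "apply", "-auto-approve"], check=True)   # step 1
    try:
        batch = boto3.client("batch")
        job = batch.submit_job(jobName=job_name,                          # step 2
                               jobQueue="team-queue",          # placeholder
                               jobDefinition="team-job-def")   # placeholder
        while True:                                                       # step 3
            desc = batch.describe_jobs(jobs=[job["jobId"]])
            status = desc["jobs"][0]["status"]
            if job_done(status):
                return status
            time.sleep(poll_seconds)
    finally:
        subprocess.run(["terraform", "destroy", "-auto-approve"],         # step 4
                       check=True)
```

The `finally` block matters: the infra is destroyed even if the job fails or the poll raises, so a broken run doesn't leave billable resources behind.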
Many companies have chosen to build their own Terraform Automation & Collaboration Software (TACOS), but it's a lot less work to use an existing service such as the open-source Atlantis or an enterprise SaaS platform such as Spacelift or Terraform Cloud.
However, if you were to create your own, you would need to confirm plans safely. The tools above can use Rego from Open Policy Agent to do so.
From your workflow, it sounds like you simply need a tool to auto-apply your changes. I have seen a cron job running in Jenkins do the trick. You can also run a scheduled ECS or ECS Fargate task. Airflow seems like overkill.
If I were you, I'd strongly look at all the options and list the pros and cons of each before considering rolling your own. I'm interested to know if the above services have shortcomings that warrant your team to build a new service.
I created a scraper in Python that navigates a website. It pulls many links, and then it has to visit every link, pull the data, parse it, and store the result.
Is there an easy way to run that script distributed in the cloud (like AWS)?
Ideally, I would like something like this (it's probably more complicated than this, but just to give an idea):
run_in_the_cloud --number-of-instances 5 scraper.py
after the process is done, the instances are killed, so it does not cost more money.
I remember doing something similar with Hadoop and Java MapReduce a long time ago.
If you can put your scraper in a Docker image, it's relatively trivial to run and scale dockerized applications using AWS ECS Fargate. Just create a task definition and point it at your container registry, then submit runTask requests for however many instances you want. AWS Batch is another tool you could use to trivially parallelize container instances.
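A sketch of that fan-out with boto3 (the cluster, task definition and subnet IDs are placeholders, and boto3 is imported lazily). One detail worth knowing: `run_task` accepts at most 10 copies per call, so larger counts have to be chunked:

```python
# Sketch: launch N copies of a dockerized scraper on ECS Fargate.
# Cluster/task-definition/subnet names are placeholders.

def batched_counts(total, per_call=10):
    """Pure helper: split a task count into runTask-sized chunks (max 10)."""
    chunks = [per_call] * (total // per_call)
    if total % per_call:
        chunks.append(total % per_call)
    return chunks

def run_scrapers(count):
    import boto3  # deferred so the helper is testable without AWS access
    ecs = boto3.client("ecs")
    for n in batched_counts(count):
        ecs.run_task(
            cluster="scraper-cluster",        # placeholder
            taskDefinition="scraper-task",    # placeholder
            launchType="FARGATE",
            count=n,
            networkConfiguration={"awsvpcConfiguration": {
                "subnets": ["subnet-0123456789"],  # placeholder
                "assignPublicIp": "ENABLED",
            }},
        )
```

Because a Fargate task stops (and stops billing) when its container exits, this also gives you the "instances are killed when done" behaviour for free — each scraper just needs to exit when it finishes.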
I have a simple Python script and I would like to run thousands of instances of it on GCP (at the same time). The script is triggered by the $Universe scheduler, with something like "python main.py --date '2022_01'".
What architecture and technologies should I use to achieve this?
PS: I cannot drop $Universe, but I'm not against suggestions to use other technologies.
My solution:
I already have a $Universe server running all the time.
Create a Pub/Sub topic
Create a permanent Compute Engine instance that listens to Pub/Sub all the time
$Universe sends thousands of events to Pub/Sub
The Compute Engine instance triggers the creation of a Python Docker image on another Compute Engine instance
Scale the creation of the Docker images (I don't know how to do this)
Is it a good architecture?
How to scale this kind of process?
Thank you :)
It might be very difficult to discuss architecture and design questions, as they are usually heavily dependent on the context, scope, functional and non-functional requirements, cost, available skills and knowledge, and so on...
Personally I would prefer to stay with an entirely serverless approach if possible.
For example, use Cloud Scheduler (serverless cron jobs), which sends messages to a Pub/Sub topic, on the other side of which there is a Cloud Function (or something else) that is triggered by the message.
Whether it should be a Cloud Function or something else, and what it should do and how, depends on your case.
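The receiving end of that chain can be very small — a sketch of a Pub/Sub-triggered Cloud Function (background-function signature; the payload shape, with the date inside the message body, is an assumption):

```python
# Sketch: Cloud Function triggered by a Pub/Sub message.
# Pub/Sub delivers the message body base64-encoded under event["data"].
import base64
import json

def decode_pubsub(event):
    """Pure helper: unwrap the base64-encoded JSON body of a Pub/Sub event."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))

def main(event, context):
    """Entry point registered as the Cloud Function's target."""
    payload = decode_pubsub(event)   # e.g. {"date": "2022_01"}
    run_report(payload["date"])

def run_report(date):
    ...  # the equivalent of: python main.py --date '2022_01'
```

Each published message spawns an independent function invocation, so the "thousands of instances at the same time" requirement becomes a concurrency-limit setting rather than something you manage yourself.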
As I understand it, you will have a lot of simultaneous calls to custom Python code triggered by an orchestrator ($Universe), and you want it on the GCP platform.
Like #al-dann, I would go with a serverless approach in order to reduce cost.
As I also understand it, Pub/Sub does not seem necessary: you could easily trigger the function with a plain HTTP call and avoid Pub/Sub.
Pub/Sub is necessary only for certain guarantees (at-least-once processing), but you can get the same behaviour if $Universe validates the HTTP request for every call (look at the HTTP response code & body and retry if they don't match the expected result).
If you want exactly-once processing, you will need more tooling; you are getting close to event streaming (which, as I also understand it, could be a good use case here). In that case, staying fully on GCP, I would go with Pub/Sub & Dataflow, which can guarantee exactly-once processing, or Kafka with Kafka Streams or Flink.
If at-least-once processing is fine for you, I would go with the HTTP version, which I think will be simpler to maintain. You have 3 serverless options in that case:
App Engine standard: scales to 0, you pay for CPU usage; it can be more affordable than Cloud Functions (below) if the requests are concentrated in short periods (a few hours per day, since the same hardware will process many requests)
Cloud Functions: you pay per request (+ CPU, memory, network, ...) and don't have to think about anything other than the code, but the code runs on a proprietary platform.
Cloud Run: my preferred one, since the pricing is the same as Cloud Functions but you gain portability: the application is a simple Docker image that you can move easily (to Kubernetes, Compute Engine, ...) and you can change the execution engine depending on cost (if the load changes between the study and the real world).
I am developing a reporting service (i.e. Database reports via email) for a project on Google App Engine, naturally using the Google Cloud Platform.
I am using Python and Django, but I feel that may be unimportant to my question specifically. I want to allow users of my application to schedule specific cron reports to be sent at specified times of day.
I know this is completely possible by running a cron on GAE on a minute-by-minute basis (using cron.yaml, since I'm using Python) and putting the logic that determines which reports to run in whatever view I decide to have the cron hit, but this seems terribly inefficient to me, and seeing as the best answer I have found suggests doing exactly that (Adding dynamic cron jobs to GAE), I wanted an "updated" suggestion.
Is there at this point in time a better option than running a cron every minute and checking a DB full of client entries to determine which report to fire off?
You may want to have a look at the new Google Cloud Scheduler service (in beta at the moment), which is a fully managed cron job service. It allows you to create cron jobs programmatically via its REST API, so you could create a specific cron job per customer with the appropriate schedule to fit your needs.
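A sketch of the per-customer job creation, assuming the google-cloud-scheduler client library; the project ID, region, and report endpoint are placeholders:

```python
# Sketch: one Cloud Scheduler job per customer, created via the API.
# Project, location, and the report URL are placeholders.

PARENT = "projects/my-project/locations/us-central1"  # placeholder

def job_body(customer_id, schedule):
    """Pure helper: the job resource for one customer's scheduled report."""
    return {
        "name": f"{PARENT}/jobs/report-{customer_id}",
        "schedule": schedule,  # standard cron syntax, e.g. "0 7 * * *"
        "time_zone": "UTC",
        "http_target": {
            # Placeholder endpoint that renders and emails the report.
            "uri": f"https://my-app.appspot.com/reports/run?customer={customer_id}",
            "http_method": "GET",
        },
    }

def create_customer_job(customer_id, schedule):
    # Deferred import so the helper is testable without GCP credentials.
    from google.cloud import scheduler_v1
    client = scheduler_v1.CloudSchedulerClient()
    return client.create_job(parent=PARENT, job=job_body(customer_id, schedule))
```

When a user edits their schedule, you update or delete the matching job instead of re-scanning a table of entries every minute.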
Given this limit, my guess would be no:
Free applications can have up to 20 scheduled tasks. Paid applications can have up to 250 scheduled tasks.
https://cloud.google.com/appengine/docs/standard/python/config/cronref#limits
Another version of your minute-by-minute workaround would be a daily cron task that finds everyone that wants to be launched that day, and then use the _eta argument to pinpoint the precise moment in each day for each task to launch.
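A sketch of that daily variant on the first-generation GAE Python runtime, using `deferred.defer`'s `_eta` argument; the query and report helpers (`users_scheduled_for`, `send_report`) are hypothetical stand-ins for your own code:

```python
# Sketch: one daily cron hit enqueues a precisely timed task per user
# via deferred.defer(..., _eta=...). Helper/model names are made up.
import datetime

def eta_for(run_date, hour, minute):
    """Pure helper: the exact moment a user's report should fire that day."""
    return datetime.datetime.combine(run_date, datetime.time(hour, minute))

def enqueue_todays_reports():
    # Deferred import: only available inside the App Engine runtime.
    from google.appengine.ext import deferred
    today = datetime.date.today()
    for user in users_scheduled_for(today):  # hypothetical query helper
        deferred.defer(send_report, user.id,
                       _eta=eta_for(today, user.hour, user.minute))

def users_scheduled_for(day):
    ...  # hypothetical: query the users whose reports run on this day

def send_report(user_id):
    ...  # hypothetical: render and email one user's report
```

This keeps the cron schedule static (one entry, well under the quota) while the per-user timing lives in the task queue instead of a minute-by-minute DB scan.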