I have a Python program that launches a batch job. The job outputs a JSON file, and I'd like to know the easiest way to get this result back to the Python program that launched it.
So far I have thought of these solutions:
Upload the JSON file to S3 (pretty heavy)
Print it in the pod logs, then read the logs from the Python program (pretty hacky/dirty)
Mount a PVC, launch a second pod with the same PVC, and create a shared disk between this pod and the job (pretty overkill)
The JSON file is pretty lightweight. Isn't there a way to do something like adding some metadata to the pod when the job completes? The Python program could then just poll that metadata.
An easy way that doesn't involve any other databases or pods is to run the job's workload as an init container, mount a volume shared by both containers, and read the JSON file from the follow-up Python container. (This approach also does not need a persistent volume, just a shared one.) See this example:
https://kubernetes.io/docs/tasks/access-application-cluster/communicate-containers-same-pod-shared-volume/
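A minimal sketch of that pattern using the kubernetes Python client; the image names, commands, and the /results/output.json path are placeholders, not values from the question:

```python
# Sketch: batch job as init container, result shared via an emptyDir volume.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

shared = client.V1Volume(
    name="results",
    empty_dir=client.V1EmptyDirVolumeSource(),  # plain shared volume, not a PVC
)
mount = client.V1VolumeMount(name="results", mount_path="/results")

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="batch-with-result"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # The batch workload runs first and writes its JSON output to the shared volume.
        init_containers=[
            client.V1Container(
                name="batch-job",
                image="my-batch-image",  # placeholder image
                command=["python", "job.py", "--out", "/results/output.json"],
                volume_mounts=[mount],
            )
        ],
        # The main container starts only after the init container succeeds and reads the file.
        containers=[
            client.V1Container(
                name="consumer",
                image="python:3.11",
                command=["python", "-c",
                         "import json; print(json.load(open('/results/output.json')))"],
                volume_mounts=[mount],
            )
        ],
        volumes=[shared],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```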
Also, depending on the complexity of these jobs, I would recommend taking a look at Argo Workflows or any DAG-based job scheduler.
Related
I want to make a pipeline where I need to call an API and write the responses to a JSON file as NEWLINE_DELIMITED_JSON.
Then I want to move that file to GCS.
So far I have been working in my shell, which uses my local disk, but I want to schedule this process using Airflow. As the file size is around 1 GB for a week's period, I don't think it's a good idea to keep files on Airflow. Can anyone suggest a good approach?
I am expecting the JSON response to be written to a file and that file to be moved to GCS.
I don't think it's a good idea to keep files on Airflow
Airflow is a workflow manager, not a storage layer, so files should live on workers only for compute purposes while tasks are executing. Tasks should remove those files when finished (ideally by using NamedTemporaryFile, so this happens automatically).
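For illustration, a tiny sketch of the NamedTemporaryFile pattern inside a task body (the write and upload steps are placeholders):

```python
# The file is created on the worker's disk and deleted automatically when the
# context manager exits, so nothing lingers after the task finishes.
from tempfile import NamedTemporaryFile

def task_body():
    with NamedTemporaryFile(mode="w", suffix=".json") as tmp:
        tmp.write('{"example": 1}\n')  # placeholder: write the API response here
        tmp.flush()
        # placeholder: upload tmp.name to GCS here
```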
You have several options.
(1) In general it is not recommended to run disk/compute-intensive jobs on Airflow workers, but that recommendation is based on the assumption that the workers are not provisioned for that kind of task. If you want, you can configure workers with more disk/CPU and route your heavy tasks to them. I can't say I recommend this as a rule, but it is possible (and sometimes preferred for its simplicity).
(2) As an improvement to (1), you can use an EC2/GCE machine (or the equivalent service of another cloud provider) to do the heavy lifting. In that case your pipeline would be:
start GCE machine -> download data to GCE -> load from GCE to GCS -> terminate GCE machine
(3) As an improvement to (2), assuming the machine has enough memory, you can load the data to GCS directly from memory (without writing it to disk).
That can be done with GCSHook.upload, which accepts in-memory data via its data parameter;
then your pipeline can be:
start GCE machine -> get data on GCE and load it to GCS -> terminate GCE machine
You can either write a custom YourApiToGcs operator to reuse across several pipelines, or write a Python callable to be used with PythonOperator / TaskFlow.
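A minimal sketch of the in-memory variant as a TaskFlow task; the schedule, API URL, bucket and object names are placeholders, and the GCE start/stop steps and error handling are omitted:

```python
# Sketch: call an API, build NDJSON in memory, and upload it straight to GCS.
import json
import pendulum
import requests
from airflow.decorators import dag, task
from airflow.providers.google.cloud.hooks.gcs import GCSHook


@dag(schedule="@weekly", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def api_to_gcs():
    @task
    def fetch_and_upload():
        # Call the API and build newline-delimited JSON in memory.
        records = requests.get("https://example.com/api/records").json()  # placeholder URL
        ndjson = "\n".join(json.dumps(r) for r in records)

        # Upload straight from memory; no local file is written.
        GCSHook(gcp_conn_id="google_cloud_default").upload(
            bucket_name="my-bucket",             # placeholder bucket
            object_name="exports/records.json",  # placeholder object name
            data=ndjson,
        )

    fetch_and_upload()


api_to_gcs()
```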
I created a scraper in Python that navigates a website. It pulls many links, and then it has to visit every link, pull the data, parse it, and store the result.
Is there an easy way to run that script distributed in the cloud (like AWS)?
Ideally, I would like something like this (it's probably more difficult, but just to give an idea):
run_in_the_cloud --number-of-instances 5 scraper.py
After the process is done, the instances are killed, so it does not cost more money.
I remember doing something similar with Hadoop and Java with MapReduce a long time ago.
If you can put your scraper in a Docker image, it's relatively trivial to run and scale dockerized applications using AWS ECS Fargate. Just create a task definition and point it at your container registry, then submit runTask requests for however many instances you want. AWS Batch is another tool you could use to trivially parallelize container instances.
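A minimal sketch of the runTask call with boto3; the cluster, task definition, and subnet are placeholders that would already have to exist:

```python
# Sketch: launch N copies of the dockerized scraper on ECS Fargate.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

response = ecs.run_task(
    cluster="scraper-cluster",        # placeholder cluster name
    taskDefinition="scraper-task:1",  # placeholder task definition
    launchType="FARGATE",
    count=5,                          # the "number of instances" to start
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # placeholder subnet
            "assignPublicIp": "ENABLED",
        }
    },
)

# Fargate tasks stop (and stop billing) as soon as the container exits.
for t in response["tasks"]:
    print(t["taskArn"])
```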
I'm trying to figure out a way in my Flask application to store the multiple CSVs that each thread continuously processes inside a buffer before uploading them to a Mongo database. The reason I would like to use the buffer is to guarantee some level of persistence and proper error handling (in case of a network failure, I want to retry uploading the CSV to Mongo).
I thought about using a Task Queue such as Celery with a message broker (rabbitmq), but wasn't sure if that was the right way to go. Sorry if this isn't a question suitable for SO -- I just wanted clarification on how to go about doing this. Thank you in advance.
Sounds like you want something like the Linux tail command. Tail prints each line of a file as soon as it is updated. I'm assuming this CSV file is generated by a separate program that is running at the same time. See How can I tail a log file in Python? for how to implement tail in Python.
Note: you might be better off dumping the CSVs in batches; it won't be real-time, but if that's not important it'll be more efficient.
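A minimal tail-style sketch, assuming another process keeps appending rows to the file; the path, batch size, and Mongo insert are placeholders:

```python
# Sketch: follow a growing CSV file and upload rows to Mongo in batches.
import time

def tail(path):
    """Yield new lines appended to `path`, similar to `tail -f`."""
    with open(path, "r") as f:
        f.seek(0, 2)  # start at the current end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)  # nothing new yet; wait and retry
                continue
            yield line.rstrip("\n")

batch = []
for row in tail("incoming.csv"):  # placeholder path
    batch.append(row)
    if len(batch) >= 100:         # upload in batches rather than per row
        # upload_to_mongo(batch)  # placeholder for the actual Mongo insert (with retries)
        batch.clear()
```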
I use a Python script to insert data into my database (using pandas and SQLAlchemy). The script reads from various sources, cleans the data, and inserts it into the database. I plan on running this script once in a while to completely overwrite the existing data with more recent data.
At first I wanted to have a single service and simply add an endpoint that would require higher privileges to run the script. But in the end that looks a bit odd and, more importantly, that Python script uses quite a lot of memory (~700 MB), which makes me wonder how I should configure my deployment.
Increasing the memory limit of my pod for this once-in-a-while operation looks like a bad idea to me, but I'm quite new to Kubernetes, so maybe I'm wrong. Hence this question.
So what would be a good (better) solution? Run another service just for that, or simply connect to my machine and run the update manually using the Python script directly?
To run on demand
https://kubernetes.io/docs/concepts/workloads/controllers/job/.
This creates a Pod that runs to completion (exits) only once: a Job.
To run on schedule
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/.
Each time the schedule fires, this creates a new, separate Job.
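A minimal sketch of creating such a Job from Python with the kubernetes client; the image, command, and memory values are placeholders, and the same spec can be wrapped in a CronJob for the scheduled case:

```python
# Sketch: run the data-refresh script as a one-off Job with its own memory limit,
# so the regular service's Deployment keeps its lower limits.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="db-refresh"),
    spec=client.V1JobSpec(
        backoff_limit=1,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="db-refresh",
                        image="my-registry/db-refresh:latest",  # placeholder image
                        command=["python", "update_db.py"],     # placeholder script
                        resources=client.V1ResourceRequirements(
                            requests={"memory": "1Gi"},  # placeholder sizing
                            limits={"memory": "1Gi"},
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```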
I have an Apache-based PHP website that:
Accepts a document uploaded by a user (typically a text file)
Runs a Python script over the uploaded file using shell_exec (this produces a text file output locally on disk); the PHP code that runs the shell_exec looks something like $result = shell_exec("python3 memory_hog_script.py user_uploaded_text_file.txt");
Shows the text output to the user
The Python script executed by PHP's shell_exec typically takes 5-10 seconds to run, but uses open-source libraries that take up a lot of memory/CPU (e.g. 50-70%) while it runs.
More commonly now, I have multiple users uploading a file at the exact same time. When a handful of users upload a file at the same time, my CPU gets overloaded, since a separate instance of the memory-hogging Python script gets executed via shell_exec for each user. When this happens, the server crashes.
I'm trying to figure out what the best way is to handle this, and I had a few ideas but wasn't sure which one is the best approach / is feasible:
Before shell_exec gets run by PHP, check CPU usage and see if it is greater than a certain threshold (e.g. 70%); if so, wait 10 seconds and try again
Alternatively, I can check the database for current active uploads; if there are more than a certain threshold (e.g. 5 uploads), then do something
Create some sort of queueing system; I'm not sure what my options are here, but I imagine there is a way to separate out the shell_exec into some kind of managed queue - I'm very happy to look into new technologies here, so if there are any approaches you can introduce me to, I'd love to dig deeper!
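For example, a minimal sketch of the queueing idea, assuming Celery with a Redis broker and a worker whose concurrency caps how many scripts run at once (hypothetical choices, not part of the current setup):

```python
# Sketch only: PHP would enqueue a task per upload instead of calling
# shell_exec directly; the worker then runs the heavy script a few at a time.
import subprocess
from celery import Celery

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",   # placeholder broker URL
    backend="redis://localhost:6379/1",  # placeholder result backend
)

@app.task
def process_upload(path):
    # Runs the heavy script for one uploaded file; only as many of these run
    # concurrently as the worker allows, so the CPU can't be overloaded.
    result = subprocess.run(
        ["python3", "memory_hog_script.py", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Start the worker with a fixed concurrency, e.g. two scripts at a time:
#   celery -A tasks worker --concurrency=2
# PHP (or a thin HTTP endpoint in front of this) enqueues work with
# process_upload.delay(path) and polls for the result to show to the user.
```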
Any input would be much appreciated! Thank you :)