I created a scraper in Python that navigates a website. It pulls a lot of links, and then it has to visit every link, pull the data, parse it, and store the result.
Is there an easy way to run that script distributed in the cloud (like AWS)?
Ideally, I would like something like this (it's probably more difficult than that, but just to give the idea):
run_in_the_cloud --number-of-instances 5 scraper.py
After the process is done, the instances are terminated, so it doesn't keep costing money.
I remember doing something similar with Hadoop and Java MapReduce a long time ago.
If you can package your scraper in a Docker image, it's relatively trivial to run and scale containerized applications using AWS ECS on Fargate. Just create a task definition that points at your container registry, then submit RunTask requests for however many instances you want. AWS Batch is another service you could use to trivially parallelize container runs.
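For example, a rough boto3 sketch of submitting those RunTask requests could look like the following; the cluster, task definition, subnet, and count are placeholders, not values from the question:
import boto3

ecs = boto3.client("ecs")

# Ask Fargate for 5 copies of the scraper task; every name/ID below is a placeholder.
response = ecs.run_task(
    cluster="scraper-cluster",
    taskDefinition="scraper-task:1",
    launchType="FARGATE",
    count=5,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
)
print([t["taskArn"] for t in response["tasks"]])
Because Fargate tasks stop (and stop billing) as soon as the container exits, this also covers the "kill the instances when done" requirement.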
I have a Python program that launches a batch job. The job outputs a JSON file, and I'd like to know the easiest way to get this result back to the Python program that launched it.
So far I thought of these solutions:
Upload the json file to S3 (pretty heavy)
Display it in the pod logs then read the logs from the python program (pretty hacky/dirty)
Mount a PVC, launch a second pod with the same PVC, and create a shared disk between this pod and the job (pretty overkill)
The JSON file is pretty lightweight. Isn't there a way to do something like attach metadata to the pod when the job completes? The Python program could then just poll that metadata.
An easy way that doesn't involve any other databases or pods is to run the first workload as an init container, mount a volume that is shared by both containers, and read the JSON file from the next Python program. (This approach also doesn't need a persistent volume, just a shared one.) See this example:
https://kubernetes.io/docs/tasks/access-application-cluster/communicate-containers-same-pod-shared-volume/
Also, depending on the complexity of these jobs, I'd recommend taking a look at Argo Workflows or other DAG-based job schedulers.
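A minimal sketch of the shared-volume/init-container pattern described above, using the kubernetes Python client; the image names, file path, and pod name are made up for illustration:
from kubernetes import client, config

config.load_kube_config()

shared = client.V1Volume(name="shared-data", empty_dir=client.V1EmptyDirVolumeSource())
mount = client.V1VolumeMount(name="shared-data", mount_path="/shared")

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="batch-with-result"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        volumes=[shared],
        # The batch job runs first (as an init container) and writes its JSON
        # result to the shared emptyDir volume.
        init_containers=[client.V1Container(
            name="job",
            image="my-batch-job:latest",  # placeholder image
            command=["python", "job.py", "--out", "/shared/result.json"],
            volume_mounts=[mount],
        )],
        # The main container starts only after the init container succeeds,
        # then simply reads the file from the same volume.
        containers=[client.V1Container(
            name="consumer",
            image="python:3.11",
            command=["python", "-c",
                     "import json; print(json.load(open('/shared/result.json')))"],
            volume_mounts=[mount],
        )],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)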
I want to run one of my Python scripts on GCP. I am fairly new to GCP, so I don't have much of an idea where to start.
My Python script grabs data from BigQuery and performs these tasks:
Several data processing operations
Building an ML model using KDTree and a few clustering algorithms
Dumping the final result to a BigQuery table.
This script needs to run every night.
So far I know I can use VMs, Cloud Run, or Cloud Functions (not a good option for me, as it would take about an hour to finish everything). What would be the best choice for running this?
I also came across Dataflow, but I'm curious to know whether it's possible to run a custom Python script that does all of this in Google Cloud Dataflow (I assume I would have to convert everything into a map-reduce style, which doesn't seem easy with my code, especially the ML part).
Do you just need a Python script to run on a single instance for a couple of hours and then terminate?
You could set up a 'basic scaling' App Engine microservice within your GCP project. The maximum run time for task queue tasks is 24 hours when using basic scaling.
Requests can run for up to 24 hours. A basic-scaled instance can choose to handle /_ah/start and execute a program or script for many hours without returning an HTTP response code. Task queue tasks can run up to 24 hours.
https://cloud.google.com/appengine/docs/standard/python/how-instances-are-managed
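As a rough illustration of that pattern (assuming the Python 3 standard runtime with Flask, which is an assumption, not something from the answer), a basic-scaling instance can pick up the work in its /_ah/start handler; run_nightly_pipeline stands in for the BigQuery + ML code:
from flask import Flask

app = Flask(__name__)

def run_nightly_pipeline():
    # Placeholder for the real work: read from BigQuery, build the KDTree/clustering
    # model, and write the results back to a BigQuery table.
    pass

@app.route("/_ah/start")
def start():
    # Basic-scaling instances receive /_ah/start when they spin up; the handler
    # may run for hours before returning.
    run_nightly_pipeline()
    return "", 200
App Engine's cron service (cron.yaml) can then hit this service every night to trigger the run.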
This is more of an architectural question. If I should be asking this question elsewhere, please let me know and I shall.
I have a use-case where I need to run the same python script (could be long-running) multiple times based on demand and pass on different parameters to each of them. The trigger for running the script is external and it should be able to pass on a STRING parameter to the script along with the resource configurations each script requires to run. We plan to start using AWS by next month and I was going through the various options I have. AWS Batch and Fargate both seem like feasible options given my requirements of not having to manage the infrastructure, dynamic spawning of jobs and management of jobs via python SDK.
The only problem is that neither of these services is available in India, and I need my processing servers to be physically located in India. What options do I have? Auto-scaling and management via the Python SDK (creation and deletion of tasks) are my main requirements (preferably containerized).
Why are you restricted to India? Often such restrictions are about data retention, in which case you can just store your data on Indian servers (S3, DynamoDB, etc.) and run your 'program' in another AWS region.
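If that data-retention split works for you, the Python-SDK job management you described is a single boto3 call once you point it at a region where Batch is available; all of the names below are placeholders:
import boto3

# Keep the data in ap-south-1 (Mumbai) on S3/DynamoDB; run the compute elsewhere,
# e.g. ap-southeast-1 (Singapore).
batch = boto3.client("batch", region_name="ap-southeast-1")

response = batch.submit_job(
    jobName="process-request-42",          # placeholder
    jobQueue="my-job-queue",               # placeholder
    jobDefinition="my-job-definition:1",   # placeholder
    containerOverrides={
        "environment": [
            {"name": "INPUT_PARAM", "value": "the-string-parameter"},
        ],
    },
)
print(response["jobId"])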
(ubuntu 12.04). I envision some sort of queue holding thousands of tasks, with the EC2 instances plowing through it in parallel (100 EC2 instances), where each instance handles one task from the queue at a time.
Also, each EC2 instance should use the image I provide, which will have the binaries and software installed on it.
Essentially, what I am trying to do is run 100 processing jobs (a Python function using packages that depend on binaries installed on that image) in parallel on Amazon EC2 for an hour or less, shut them all off, and repeat the process whenever needed.
Is this doable? I am using Python Boto to do this.
This is doable. You should look into using SQS. Jobs are placed on a queue and the worker instances pop jobs off the queue and perform the appropriate work. As a job is completed, the worker deletes the job from the queue so no job is run more than once.
You can configure your instances using user-data at boot time or you can bake AMIs with all of your software pre-installed. I recommend Packer for baking AMIs as it works really well and is very scriptable so your AMIs can be rebuilt consistently as things need to be changed.
For turning on and off lots of instances, look into using AutoScaling. Simply set the group's desired capacity to the number of worker instances you want running and it will take care of the rest.
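A rough sketch of the worker loop with boto3 (the queue URL and do_work are placeholders; the original question used the older boto library, but the idea is the same):
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # placeholder

def do_work(body):
    # Placeholder for the Python function that uses the binaries baked into the AMI.
    pass

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        do_work(msg["Body"])
        # Delete only after the work succeeds, so a failed job becomes visible again.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
And from the box that controls the fleet, turning the 100 workers on (and later off) is a single call against the Auto Scaling group (group name is a placeholder):
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.set_desired_capacity(AutoScalingGroupName="worker-group", DesiredCapacity=100)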
This sounds like it might be easier to do with EMR.
You mentioned in the comments that you are doing computer vision. You can make your job Hadoop-friendly by preparing a file where each line is a base64 encoding of an image file.
You can prepare a simple bootstrap script to make sure each node of the cluster has your software installed. Hadoop streaming will allow you to use your image-processing code as-is for the job (instead of rewriting it in Java).
When your job is over, the cluster instances will be shut down. You can also specify that your output be streamed directly to an S3 bucket; it's all baked in. EMR is also cheap: 100 m1.medium EC2 instances running for an hour will only cost you around 2 dollars according to the most recent pricing: http://aws.amazon.com/elasticmapreduce/pricing/
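As a rough sketch of what the Hadoop streaming mapper could look like (process_image is a stand-in for your computer-vision code; the input format follows the base64-per-line idea above):
import base64
import sys

def process_image(image_bytes):
    # Placeholder for the actual computer-vision processing.
    return len(image_bytes)

# Hadoop streaming feeds input lines on stdin and collects results from stdout.
for line in sys.stdin:
    image_bytes = base64.b64decode(line.strip())
    sys.stdout.write("%s\n" % process_image(image_bytes))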
I have 200,000 URLs that I need to scrape from a website. This website has a very strict scraping policy, and you will get blocked if the scraping frequency exceeds 10 per minute. So I need to control my pace. I am thinking about starting a few AWS instances (say 3) to run in parallel.
In this way, the estimated time to collect all the data will be:
200,000 URLs / (10 URLs/min) = 20,000 min ≈ 13.9 days (one instance only)
20,000 min / 3 instances ≈ 6,667 min ≈ 4.6 days (three instances)
which is a reasonable amount of time to get my work done.
However, I am thinking about building a framework using boto, where I have a piece of code and a queue of input (a list of URLs, in this case). Meanwhile, I also don't want to do any damage to their website, so I only want to scrape during the night and on weekends. So I am thinking that all of this should be controlled from one box.
And the code should look similar to this:
import threading
from queue import Queue

class Worker(threading.Thread):
    def __init__(self, url_queue):
        super().__init__()
        self.url_queue = url_queue

    def run(self):
        while not self.url_queue.empty():
            url = self.url_queue.get()
            aws = AWSInstance()         # pseudo: hands the URL off to an EC2 instance
            result = aws.scrape(url)    # pseudo: scrape remotely, get the result back
            self.url_queue.task_done()  # pseudo: store/collect `result` somewhere here

url_queue = Queue()                     # pre-filled with the URLs to scrape
workers = [Worker(url_queue) for _ in range(3)]
for w in workers:
    w.start()
The code above is pure pseudocode; my idea is to hand the work off to AWS.
Question:
(1) How do I use boto to pass a variable/argument to another AWS instance and start a script to work on those variables, and then use boto to retrieve the result back to the master box?
(2) What is the best way to schedule a job for a specific time period only, from inside Python code?
Say, only work from 6:00 pm to 6:00 am every day... I don't think the Linux crontab will fit my needs in this situation.
Sorry if my question is more descriptive and philosophical. Even if you can only offer a hint or throw out the name of a package/library that meets my needs, I will greatly appreciate it!
Question: (1) How do I use boto to pass a variable/argument to another AWS instance and start a script to work on those variables
Use a shared data source, such as DynamoDB, or a messaging framework such as SQS.
and .. use boto to retrieve the result back to the master box.
Again, a shared data source or messaging.
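For instance, a shared DynamoDB table can carry both the input parameter and the result back to the master box; a sketch with boto3 (the table, key, and attribute names are placeholders):
import boto3

# Shared table that both the master box and the scraper instances can reach.
table = boto3.resource("dynamodb").Table("scrape-jobs")  # placeholder table name

# Master box: record a job with its input parameter.
table.put_item(Item={"url": "http://example.com/page/1", "job_status": "pending"})

# Worker instance: write the result back onto the same item when done.
table.update_item(
    Key={"url": "http://example.com/page/1"},
    UpdateExpression="SET job_status = :s, scrape_result = :r",
    ExpressionAttributeValues={":s": "done", ":r": "scraped data here"},
)

# Master box: poll until the item shows up as done, then read the result.
item = table.get_item(Key={"url": "http://example.com/page/1"})["Item"]
print(item["job_status"], item.get("scrape_result"))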
(2) What is the best way to schedule a job for a specific time period only, from inside Python code? Say, only work from 6:00 pm to 6:00 am every day... I don't think the Linux crontab will fit my needs in this situation.
I think crontab fits well here.
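cron can simply start the controller at 6:00 pm; if you also want the window enforced from inside the Python code, a minimal guard might look like this (scrape_one is a placeholder for the actual scraping call, and the URL list is made up):
import datetime
import time

def in_work_window(now=None):
    # Allowed window: 6:00 pm to 6:00 am on weekdays, plus all day on weekends.
    now = now or datetime.datetime.now()
    if now.weekday() >= 5:            # Saturday or Sunday
        return True
    return now.hour >= 18 or now.hour < 6

def scrape_one(url):
    pass                              # placeholder for the actual scraping call

urls = ["http://example.com/page/%d" % i for i in range(1, 4)]  # placeholder list
for url in urls:
    while not in_work_window():
        time.sleep(300)               # sleep until the window opens again
    scrape_one(url)
    time.sleep(7)                     # stay safely under 10 requests per minute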