What are the different runtimes in MLRun? - python

I'm trying to get a feel for how MLRun executes my Python code. What different runtimes are supported and why would I use one vs the other?

MLRun has several different ways to run a piece of code. At this time, the following runtimes are supported:
Batch runtimes
local - execute a Python or shell program in your local environment (i.e. Jupyter, IDE, etc.)
job - run the code in a Kubernetes Pod
dask - run the code as a Dask Distributed job (over Kubernetes)
mpijob - run distributed jobs and Horovod over the MPI job operator, used mainly for deep learning jobs
spark - run the job as a Spark job (using Spark Kubernetes Operator)
remote-spark - run the job on a remote Spark service/cluster (e.g. Iguazio Spark service)
Real-time runtimes
nuclio - real-time serverless functions over Nuclio
serving - higher level real-time Graph (DAG) over one or more Nuclio functions
If you are interested in learning more about each runtime, see the documentation.
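As a rough sketch of how the choice shows up in code (the file name, handler, and image below are placeholders, and exact arguments may differ between MLRun versions), the runtime is selected with the kind argument when wrapping code as an MLRun function:

    import mlrun

    # Wrap the same (hypothetical) handler file as different runtimes via `kind`.

    # Local runtime: runs in the current environment (Jupyter, IDE, ...).
    fn_local = mlrun.code_to_function(
        name="my-func", filename="my_handlers.py", handler="handler", kind="local"
    )
    fn_local.run(params={"p1": 1})

    # Job runtime: the same code, executed inside a Kubernetes pod.
    fn_job = mlrun.code_to_function(
        name="my-func", filename="my_handlers.py", handler="handler",
        kind="job", image="mlrun/mlrun",
    )
    fn_job.run(params={"p1": 1})

    # Nuclio runtime: deployed as a real-time serverless endpoint.
    fn_rt = mlrun.code_to_function(
        name="my-endpoint", filename="my_handlers.py", handler="handler", kind="nuclio"
    )
    fn_rt.deploy()

Switching between batch runtimes is mostly a matter of changing kind, which is why the same code can move from local development to a cluster without rewrites.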

Related

Running Jupyter Notebook in GCP on a Schedule

What is the best way to migrate a Jupyter notebook into Google Cloud Platform?
Requirements
I don't want to make a lot of changes to the notebook to get it to run
I want it to be schedulable, preferably through the UI
I want it to be able to run an .ipynb file, not a .py file
In AWS, SageMaker seems like the no-brainer solution for this. I want the GCP tool that comes closest to this specific task without a lot of extras
I've tried the following:
Cloud Functions: seems best for running Python scripts rather than a notebook, and requires you to run a main.py file by default
Dataproc: you can add a notebook to a running instance, but it cannot be scheduled
Dataflow: felt like overkill; it didn't seem like the best tool and is better suited to Apache Beam-based pipelines
I feel like this should be easier. I found this article on the subject:
How to Deploy and Schedule Jupyter Notebook on Google Cloud Platform
The author doesn't actually do what the title says: he moves a lot of GCP code into a main.py to create an instance and then has the instance execute the notebook.
Feel free to correct my perspective on any of this
I use Vertex AI Workbench to run notebooks on GCP. It provides two variants:
Managed Notebooks
User-managed Notebooks
User-managed Notebooks creates Compute Engine instances in the background, comes with pre-built packages such as JupyterLab and Python, and allows customisation. I mainly use it for developing Dataflow pipelines.
As for the other requirement, scheduling: Managed Notebooks supports this feature; refer to this documentation (I have yet to try Managed Notebooks):
Use the executor to run a notebook file as a one-time execution or on a schedule. Choose the specific environment and hardware that you want your execution to run on. Your notebook's code will run on Vertex AI custom training, which can make it easier to do distributed training, optimize hyperparameters, or schedule continuous training jobs. See Run notebook files with the executor.
You can use parameters in your execution to make specific changes to each run. For example, you might specify a different dataset to use, change the learning rate on your model, or change the version of the model.
You can also set a notebook to run on a recurring schedule. Even while your instance is shut down, Vertex AI Workbench will run your notebook file and save the results for you to look at and share with others.

Azure Databricks Python Job

I have a requirement to parse a lot of small unstructured files in near real-time inside Azure and load the parsed data into a SQL database. I chose Python (because I don't think a Spark cluster or big-data tooling would suit, considering the volume of source files and their size), and the parsing logic has already been written. I am looking to schedule this Python script in one of the following ways using Azure PaaS:
Azure Data Factory
Azure Databricks
Both 1+2
May I ask what the implications are of running a Python notebook activity from Azure Data Factory pointing to Azure Databricks? Would I be able to fully leverage the potential of the cluster (driver and workers)?
Also, please advise whether the script has to be converted to PySpark to meet my use-case requirement of running in Azure Databricks. My only hesitation is that the files are only kilobytes in size and they are unstructured.
If the script is pure Python then it would only run on the driver node of the Databricks cluster, making it very expensive (and slow due to cluster startup times).
You could rewrite it as PySpark, but if the data volumes are as low as you say then this is still expensive and slow. The smallest cluster will consume two VMs, each with 4 cores.
I would look at using Azure Functions instead. Python is now an option: https://learn.microsoft.com/en-us/azure/python/tutorial-vs-code-serverless-python-01
Azure Functions also have great integration with Azure Data Factory so your workflow would still work.
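To make the Azure Functions suggestion concrete, here is a minimal sketch assuming the Python v2 programming model, a hypothetical blob container named "incoming", and a line-per-record placeholder parser; your real parsing and SQL-loading logic would replace the marked parts:

    import logging
    import azure.functions as func

    app = func.FunctionApp()

    # Fires whenever a file lands in the (hypothetical) "incoming" container.
    @app.blob_trigger(arg_name="blob", path="incoming/{name}",
                      connection="AzureWebJobsStorage")
    def parse_file(blob: func.InputStream):
        text = blob.read().decode("utf-8")
        # Placeholder parser: one record per non-empty line; swap in your logic.
        rows = [{"line": ln} for ln in text.splitlines() if ln.strip()]
        logging.info("parsed %d rows from %s", len(rows), blob.name)
        # ...insert `rows` into the SQL database here (e.g. with pyodbc).

The trigger could just as well be the Data Factory pipeline calling the function, keeping the orchestration in ADF while the per-file work stays serverless.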

running python code on distributed cluster

I need to run some numpy computation on 5000 files in parallel using Python. I already have the sequential single-machine version implemented. What would be the easiest way to run the code in parallel (say, using an EC2 cluster)? Should I write my own task scheduler and job-distribution code?
You could have a look at the pscheduler Python module. It lets you queue up your jobs and run them, with the number of concurrent processes depending on the available CPU cores. It can also scale up and submit your jobs to remote machines, but that requires all the remote machines to share an NFS mount.
I'll be happy to help you further.
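If a separate scheduler feels heavyweight, one simple pattern (not specific to pscheduler, and assuming a hypothetical data/ directory of input files) is to fan the per-file work out with the standard library on each machine:

    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path
    import numpy as np

    def process(path: Path) -> float:
        # Placeholder for the existing per-file numpy computation.
        return float(np.loadtxt(path).sum())

    if __name__ == "__main__":
        files = sorted(Path("data").glob("*.txt"))   # hypothetical input layout
        # One worker process per CPU core; each file is an independent task.
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(process, files, chunksize=16))
        print(len(results), "files processed")

With 5000 independent files, splitting the file list across a few EC2 instances and running this on each is often enough before reaching for a real cluster scheduler.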

Run multiple Python scripts in Azure (using Docker?)

I have a Python script that consumes an Azure queue, and I would like to scale this easily inside Azure infrastructure. I'm looking for the easiest solution possible to
run the Python script in an environment that is as managed as possible
have a centralized way to see the scripts running and their output, and easily scale the amount of scripts running through a GUI or something very easy to use
I'm looking at Docker at the moment, but this seems very complicated for the extremely simple task I'm trying to achieve. What possible approaches are known to do this? An added bonus would be if I could scale with respect to the number of items on the queue, but it is fine if we can just control the amount of parallelism manually.
You should have a look at Azure Web Apps, which also support Python.
This would be a managed and scalable environment that also supports background tasks (WebJobs) with central logging.
Azure Web Apps also offer a free plan for development and testing.
In my experience, CoreOS on Azure can satisfy your needs. You can refer to the doc https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-coreos-how-to/ to learn how to get started.
CoreOS is a Linux distribution built for running Docker containers, and you can access it remotely via an SSH client such as PuTTY. To pick up Docker, searching for a Docker tutorial will quickly cover the basic usage needed to run Python scripts.
Sounds to me like you are describing something like a microservices architecture. From that perspective, Docker is a great choice. I recommend you consider an orchestration framework such as Apache Mesos or Docker Swarm, which will allow you to run your containers on a cluster of VMs with the ability to easily scale, deploy new versions, roll back, and implement load balancing. The schedulers Mesos supports (Marathon and Chronos) also have a web UI. I believe you can also implement the kind of triggered scaling you describe, but it will probably not be available off the shelf.
This does involve a bit of a learning curve, but I think it is worth it, especially once you start considering the complexities of deploying new versions (with possible rollbacks), monitoring failures, and integrating things like Jenkins and continuous delivery.
For Azure, an easy way to deploy and configure a Mesos or Swarm cluster is by using Azure Container Service (ACS) which does all the hard work of configuring the cluster for you. Find additional info here: https://azure.microsoft.com/en-us/documentation/articles/container-service-intro/
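Whichever hosting option you pick, the script itself usually ends up as a small polling loop. Here is a hedged sketch using the azure-storage-queue SDK, where the queue name, environment variable, and handle_message helper are placeholders:

    import os
    import time
    from azure.storage.queue import QueueClient  # azure-storage-queue package

    queue = QueueClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"], "work-items"
    )

    def handle_message(body: str) -> None:
        print("processing:", body)  # replace with the real work

    # Each container/VM runs one copy of this loop, so scaling out just means
    # running more instances of the same image.
    while True:
        for msg in queue.receive_messages(messages_per_page=10, visibility_timeout=60):
            handle_message(msg.content)
            queue.delete_message(msg)  # remove only after successful processing
        time.sleep(5)                  # back off when the queue is empty

Because each worker is stateless, the "scale with the queue length" bonus becomes a matter of how many copies of this container the chosen platform runs.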

Parallel Computing with Python on a queued cluster

There are lots of different modules for threading/parallelizing Python. Dispy and pp/ParallelPython seem especially popular. It looks like these are all designed for a single machine (e.g. a desktop) with many cores/processors. Is there a module that works on massively parallel architectures run by queue systems (specifically SLURM)?
The most widely used parallel framework on large compute clusters for scientific/technical applications is MPI; the Python bindings are provided by the mpi4py package.
MPI offers a high-level API for creating parallel software that communicates over the network via messages: remote process creation, data scatter/gather, reductions, and so on. Implementations can take advantage of fast, low-latency networks where present, and MPI integrates with all common cluster managers, including Slurm.
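As a small illustration of the scatter/gather style (a sketch only; under Slurm this would typically be launched with srun or mpirun rather than plain python):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Rank 0 splits the work; every rank receives one chunk.
    chunks = np.array_split(np.arange(1_000_000), size) if rank == 0 else None
    chunk = comm.scatter(chunks, root=0)

    partial = chunk.sum()                       # local computation on each rank
    total = comm.reduce(partial, op=MPI.SUM, root=0)

    if rank == 0:
        print("total:", total)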
Via the ParallelPython main page:
"PP is a python module which provides mechanism for parallel execution of python code on SMP (systems with multiple processors or cores) and clusters (computers connected via network)."
