Can you pre-install libraries on Databricks Pool nodes?

Can you pre-install libraries on Databricks Pool nodes? - python

We have a number of Python Databricks jobs that all use the same underlying Wheel package to install their dependencies. Installing this Wheel package even with a node that has been idling in a Pool still takes 90 seconds.
Some of these jobs are very long-running so we would like to use Jobs computer clusters for the lower cost in DBUs.
Some of these jobs are much shorter-running (<10 seconds) where the 90 second install time seems more significant. We have been considering using a hot cluster (All-Purpose Compute) for these shorter jobs. We would like to avoid the extra cost of the All-Purpose Compute if possible.
Reading the Databricks documentation suggests that the Idle instances in the Pool are reserved for us but not costing us DBUs. Is there a way for us to pre-install the required libraries on our Idle instances so that when a job comes through we are able to immediately start processing it?
Is there an alternate approach that can fulfill a similar use case?

You can't install libraries directly into nodes from pool, because the actual code is executed in the Docker container corresponding to Databricks Runtime. There are several ways to speedup installation of the libraries:
Create your own Docker image with all necessary libraries pre-installed, and pre-load Databricks Runtime version and your Docker image - this part couldn't be done via UI, so you need to use REST API (see description of preloaded_docker_images attribute), databrick-cli, or Databricks Terraform provider. The main disadvantage of custom Docker images is that some functionality isn't available out of box, for example, arbitrary files in Repos, web terminal, etc. (don't remember full list)
Put all necessary libraries and their dependencies onto DBFS and install them via cluster init script. It's very important that you collect binary dependencies, not packages only with the source code, so you won't need to compile them when installing. This could be done once:
for Python this could be done with pip download --prefer-binary lib1 lib2 ...
for Java/Scala you can use mvn dependency:get -Dartifact=<maven_coordinates>, that will download dependencies into ~/.m2/repository folder, from which you can copy jars to DBFS and in init script use cp /dbfs/.../jars/* /databricks/jars/ command
for R, it's slightly more complicated, but is also doable

Related

How to periodically update the dependencies in a production Python environment?

I am responsible for maintaining a Python environment that is frequently deployed into production. To ensure that the library's dependencies work well together, I use pip to install all the required Python packages, following best practices I pip freeze all of my dependencies into a requirements.txt file.
The benefit of this approach is that I have a very stable environment that is unlikely to break due to a package issue. However, the drawback is that my environment is static while these packages are constantly releasing new versions that could improve performance and most importantly fix vulnerabilities.
I am wondering what the industry standard is for keeping packages up to date in an easy way, perhaps even in an automated way that detects new updates and tests for any potential issues. For instance, apt has a simple apt-get update and apt-get upgrade command to keep your packages constantly updated and be aware of it.
The obvious answer is to just update all packages to the latest official version. My concern is that this can potentially break some dependencies between packages and cause the environment to break.
Can anyone suggest a similar solution for keeping Python packages up to date while ensuring stability?

Well, I can't say what the industry standard is but one of the ways I could think of to update the packages periodically would be as such:
Create a requirements.txt file that contains the packages to update.
Create a python script that contains bash commands to update the packages in the requirements.txt.
Can use pip freeze to update the requirements.txt
Create a cron job or use the python schedule library to trigger a periodic update
You can refer, as an example for
cron job on mac, to: https://www.jcchouinard.com/python-automation-with-cron-on-mac/
schedule in python, to: https://www.geeksforgeeks.org/python-schedule-library/
The above can solve the update issue but may not solve the compatibility issue.

The importance is not so much current but to not have vulnerabilities that may harm customers and your company in any way.
Industristandard is to use Snyk, Safety, Deptrack, Sonatype Lifecycle etc to monitor package/module vulnerabilities. And a Continuous-Integration(CI) system that support this like docker containers in kubernetes.
Deploy the exact version you are running in prod and run vulnerability scans.
(All CI systems make this easy)
(it is important to do on running code because the requirements.txt most likely never show all installed that will be installed)
The vulnerability-scanners (all except owast deptrack) costs money if you like to be current, but they in its simplest form just give you a more refined answer to:
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=python
You can make your own system based on the safety source
https://github.com/pyupio/safety and the CVE list above and make it fail if any of the versions are found in your code.
When it comes to requirements.txt listings we always use >= in favor of == and deploy every night in test-systems and run test-on them, if they fail we manually have to go though errors and check and pin the releases.

Generally you build Docker images or similar. (There are other VM solutions, but Docker seems to be the most common/popular.) Have apt-get update, apt-get install and pip install -r requirements.txt as part of the Dockerfile, and integrate it into your CI/CD pipeline using GitHub or similar
That builds the image from scratch using a well-defined process. You then run the VM, including all automated unit and integration tests and flag any problems for the Devs. It's then their job to figure out whether they broke something, whether they need a particular version of some dependency, etc.
Once the automated tests pass, the image binary (including its "frozen" dependencies) gets pushed to a place where humans can interact with it and look for anything weird. Typically there are several such environments variously called things like "test", "integration", and "production". The names, numbers, and details vary from place to place - but that triplet is fairly common. They're typically defined as a sequence and gating between them is done manually. This might be a typical spec:
Development - This is the first environment where code gets pushed after the automated testing passes. Failures and weird idiosyncrasies are common. If local developers need to replicate a problem or experiment with a patch that can't be handled on their local machines, they put the code here. Images graduate from Development when the basic manual and automated "smoke tests" all pass and the Dev team agrees the code is stable.
Integration - This environment belongs to QA and dedicated software testers. They may have their own set of scripts to run, which may be automated, manual or ad hoc. This is also a good environment for load testing, or for internal red team attacks on the security, or for testing / exploration by internal trusted users who are willing to risk the occasional crash in order to exercise the newest up-and-coming features. QA flags any problems or oddities and sends issues back to the devs. Images graduate from Integration when QA agrees the code is production-quality.
Production - This environment is running somewhat older code since everything has to go through Dev & QA before reaching it, but it's likely to be quite robust. The binary images here are the only ones that are accessible to the outside world, and the only ones carefully monitored by the SOC.
Since VM images are only compiled once, just prior to sending the code to Dev, all three environments should have the same issues. The problem then becomes getting code through the environments in sequence and doing all the necessary checks in each before the security patches become too outdated... or before the customers get tired of waiting for the new features / bug-fixes.
There are a lot of details in the process, and many questions & answers on StackOverflow's sister site for networking administration

How to install Python Libraries in a DBFS Location and access them through Job Clusters (without installing it for every job)?

While using R scripts with Databricks, we can install R Libraries at a location in dbfs and directly make it accessible to any notebook (running on any cluster - whether the library is installed on that cluster or not), using:
.libpaths('/dbfs/folder_x')
library("libraryName")
Is something similar possible in Databricks for Python Libraries.
Clarifying the requirement more:
I want to install Python Libraries only once (preferably in DBFS), and during Notebook execution, simply import the library from that location (preferably without the need for installation - definitely without reaching PyPI to download it).
Solutions that target similar solutions:
Workspace Libraries - Create a library and refer to it directly. Ideal Solution (But needs installation).
Create .whl files and add them to DBFS. Then when executing the Notebook Job (say using Azure Data Factory), simply append it as a Library. Link here
I do not want to install the library again and again (it takes up time during ADF Pipeline Execution using Job Clusters). What is the ideal solution to install it once, and use it multiple times?
Related:
How to use R libraries in Azure Databricks without depending on CRAN server connectivity?
How to install job dependent libraries and .whl package on Databricks Job cluster

Pandas on OpenShift v3

Now that OpenShift Online V2 has announced its end of service, I am looking to migrate my Python application to OpenShift Online V3, aka OpenShift NextGen. Pandas is a requirement (and listed in requirements.txt)
It already has been non-trivial to get pandas installed in v2 but V3 does not allow manual interaction in the build process (or does it?).
When I try to build my app the build process stops after an hour. pip has downloaded and installed the contents of the requirements.txt and is running setup.py for selected packages. The and of the log file is
Running setup.py install for numpy
Running setup.py install for Bottleneck
Running setup.py install for numexpr
Running setup.py install for pandas
Then the process stops without any error message.
Does anyone have a clue how to build Python applications that require pandas on OpenShift V3?

It is going to be one of two things.
Either compiling Pandas is a huge memory hog, possibly caused by the compiler hitting some pathological case. Or, the size of the generated image at that point exceeds an internal limit and so runs out of disk space allocated.
If it was memory, you would need to increase the memory allocated to the build pod. By default in Online this is 512Mi.
To increase the limit you will need to edit the YAML/JSON for the build configuration from the web console, or from the command line using oc edit.
For YAML, you need to add the following:
resources:
limits:
memory: 1Gi
This is setting the field:
$ oc explain bc.spec.resources.limits FIELD: limits <object>
DESCRIPTION:
Limits describes the maximum amount of compute resources allowed. More
info: http://kubernetes.io/docs/user-guide/compute-resources/
The maximum is 1Gi. It appears an increase to this value does allow the build to complete, where as increasing it to 768Mi wasn't enough.
Do be aware that this takes memory away from the quota for compute-resources-timebound when running and since it is using it all during the build, other things you try and do at the same time could be held up.
FWIW, the image size on a local build, not in Online, only produced:
172.30.1.1:5000/mysite/osv3test latest f323d9b036f6 About an hour ago 910MB
Thus unless intermediary space used before things were cleaned up was an issue, it isn't an issue.
So increasing memory used for the build appears to be the answer.

Using setuptools, how can I download external data upon installation?

I'd like to create some ridiculously-easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some stuff already exists, but I want it to be even simpler.)
What I'd like to achieve is this:
User runs pip install dataset
pip downloads the dataset, say via wget http://mydata.com/data.tar.gz. Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn't ideal, but the datasets are pretty small, so let's assume storing the data here isn't a big deal.)
Later, when the user imports my module, the module automatically loads the data from the specific location.
This question is about bullets 2 and 3. Is there a way to do this with setuptools?

As alluded to by Kevin, Python package installs should be completely reproducible, and any potential external-download issues should be pushed to runtime. This therefore shouldn't be handled with setuptools.
Instead, to avoid burdening the user, consider downloading the data in a lazy way, upon load. Example:
def download_data(url='http://...'):
# Download; extract data to disk.
# Raise an exception if the link is bad, or we can't connect, etc.
def load_data():
if not os.path.exists(DATA_DIR):
download_data()
data = read_data_from_disk(DATA_DIR)
return data
We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to the behavior in the imageio module with respect to downloading necessary decoders at runtime, rather than making the user manage the external downloads themselves.

Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
Please do not do this.
The whole point of Python packaging is to provide a completely deterministic, repeatable, and reusable means of installing exactly the same thing every time. Your proposal has the following problems at a minimum:
The end user might download your package on computer A, stick it on a thumb drive, and then install it on computer B which does not have internet.
The data on the web might change, meaning that two people who install the same exact package get different results.
The website that provides the data might cease to exist or unwisely change the URL, meaning people who still have the package won't be able to use it.
The user could be behind an internet filter, and you might get a useless "this page is blocked" HTML file instead of the dataset you were expecting.
Instead, you should either include your data with the package (using the package_data or data_files arguments to setup()), or provide a separate top-level function in your Python code to download the data manually when the user is ready to do so.

Python package installation states that it should never execute Python code in order to install Python packages. This means that you may not be able to download stuff during the installation process.
If you want to download some additional data, do it after you install the package , for example when you import your package you could download this data and cache it somewhere in order not to download it at every new import.

This question is rather old, but I want to add that downloading external data at installation time is of course much better than forcing to download external content at runtime.
The original problem is, that one cannot package arbitrary content into a Python package, if it exceeds the max. size limit of the package registry. This size limit effectively breaks up the relationship of the packaged Python code and the data it operates on. Suddenly things that belong together have to be separated and the package creator needs to take care about versioning and availability of external data. If the size limits are met, everything is installed at installation time and the discussion would be over here. I want to stress, that data & algorithms belong together and are normally installed at the same time, not at some later date. That's the whole point of package integrity. If you cannot install a package, because the external content cannot be downloaded, you want to know at installation time.
In the light of Docker & friends, downloading data at runtime makes a container non-reproducible and forces the download of the external content at each start of the container unless you additionally add the path where the data is downloaded to a Docker volume. But then you need to know where exactly this content is downloaded and the user/Dockerfile creator has to know more unnecessary details. There are more issues in using volumes in that regard.
Moreover, content fetched at runtime cannot be cached automatically by Docker, i.e. you need to fetch every time after a docker build.
Then again one could argue, that one should provide a function/executable script that downloads this external content and the user should execute this script directly after installation. Again the user of the package needs to know more than necessary, because someone or some commitee proclaims, executing Python code or downloading external content at installation time is not "recommended".
But forcing the user to run an extra script directly after installation of a package is factually the same as downloading the content directly inside a post-installation step, just more user-unfriendly. Thinking about how popular machine learning is today, the growing size of models and popularity of ML in the future, there will be a lot of scripts to be executed for just a handful of Python package dependencies for model downloads in the near future according to this argumentation.
The only time I see a benefit for an extra script, is when you can choose to download between several different versions of the external content, but then one intentionally involves the user into that decision.
But coming back to the runtime on-demand lazy model download, where the user doesn't need to be involved into executing an extra script: let's assume, the user packages the container, all tests pass successfully on the CI and he/she distributes it to Dockerhub or any other container registry and starts production. Nobody then wants the situation of random fails, because a successfully installed package intermittently downloads content from time to time e.g. after some maintainence task happens like cleaning up docker volumes or if distributing containers on new k8s nodes and the first request to a web app times out because external content is always fetched at startup. Or not fetched at all, because the external URL is in maintenance mode. That's a nightmare!
If it would be allowed to have reasonably sized Python packages, the whole problem would be much less of an issue. E.g. in contrast, the biggest Ruby gems (i.e. packages in the Ruby ecosystem) are over 700MB big and of course it's allowed to download external content at installation time.

set up new development environment with virtualenv and pip in one command

We want to automate the setup of new development environments.
There is a project. We use the name scheme coreapp_c1 (short for customer1)
The project is small. It contains configuration (settings.py (django)) and a requirements.txt
We use these steps:
create virtualenv
pip install -e git... #egg=coreapp_c1
pip install -r src/coreapp-c1/requirements.txt
Unfortunately there are some other steps which need to be done:
Create a postgres-DB, insert an init-script, create a rabbitMQ queue, ... which can't be done as normal user.
What is the most phytonic way to do the stuff which needs to be done as root?

Automating the setting-up of development environments in Python as well as other languages is getting a lot of attention recently. As a result, there is a variety of solutions available depending on your requirements.
Generally, there is a trade-off involved with these solutions: The more isolation a given solution provides, the more overhead it has. Following is a non-exclusive list of stuff that I have tried and found useful. This list is in ascending order of isolation and therefore overhead:
1) Fabric
"It provides a basic suite of operations for executing local or remote shell commands
(normally or via sudo) and uploading/downloading files, as well as auxiliary
functionality such as prompting the running user for input, or aborting execution."
2) cookiecutter
"A command-line utility that creates projects from cookiecutters (project templates).
E.g. Python package projects, jQuery plugin projects."
3) docker
"An open source project to pack, ship and run any application as a lightweight container."
4) Vagrant
"Create and configure lightweight, reproducible, and portable development environments."
I have included Vagrant for completeness. It is essentially a tool for programmatic creation of virtual machines and therefore lies at the bottom of the list. As you have mentioned, if you are not too excited about the overhead of the VM with its own OS and API stack etc. you should go for one of the first three options - all of which are Pythonic to some extent in being DRY etc.
Personally, based on the limited perspective that I can glean from your problem statement, I would look at cookiecutter.

Vagrant is what you need.
Once created, developlemt environment is easily distribute among developers.
Another suggestion that I used for a long time - create binary packages rpm/deb for your applications.
There are pros and cons for both approaches and moreover they can be applied in conjunction.
In my opinion there is no Pythonic way to do the stuff. Fabric could be handy but it have another application area.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.