Pandas on OpenShift v3


Now that OpenShift Online V2 has announced its end of service, I am looking to migrate my Python application to OpenShift Online V3, aka OpenShift NextGen. Pandas is a requirement (and is listed in requirements.txt).
Getting pandas installed in V2 was already non-trivial, but V3 does not allow manual interaction in the build process (or does it?).
When I try to build my app, the build process stops after an hour. pip has downloaded and installed the contents of requirements.txt and is running setup.py for selected packages. The end of the log file is:
Running setup.py install for numpy
Running setup.py install for Bottleneck
Running setup.py install for numexpr
Running setup.py install for pandas
Then the process stops without any error message.
Does anyone have a clue how to build Python applications that require pandas on OpenShift V3?

It is going to be one of two things.
Either compiling pandas is a huge memory hog, possibly because the compiler hits some pathological case, or the size of the generated image at that point exceeds an internal limit and the build runs out of its allocated disk space.
If it was memory, you would need to increase the memory allocated to the build pod. By default in Online this is 512Mi.
To increase the limit you will need to edit the YAML/JSON for the build configuration from the web console, or from the command line using oc edit.
For YAML, you need to add the following:
resources:
  limits:
    memory: 1Gi
This is setting the field:
$ oc explain bc.spec.resources.limits
FIELD: limits <object>
DESCRIPTION:
Limits describes the maximum amount of compute resources allowed. More
info: http://kubernetes.io/docs/user-guide/compute-resources/
The maximum is 1Gi. Raising the limit to this value does appear to let the build complete, whereas raising it to 768Mi was not enough.
Be aware that this takes memory away from the quota for compute-resources-timebound while the build is running. Since the build uses all of it, other things you try to do at the same time could be held up.
FWIW, the image size on a local build, not in Online, only produced:
172.30.1.1:5000/mysite/osv3test latest f323d9b036f6 About an hour ago 910MB
So unless the intermediate space used before things were cleaned up was a problem, image size isn't the issue.
So increasing memory used for the build appears to be the answer.
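The same change can also be sent as a patch instead of editing the YAML by hand. A minimal sketch, assuming a build configuration named myapp (a placeholder, not a name from the question) and a logged-in oc client. It only builds and prints the oc patch command rather than running it:

```python
import json
import shlex


def build_memory_patch(memory: str) -> dict:
    """Return a patch body that sets spec.resources.limits.memory."""
    return {"spec": {"resources": {"limits": {"memory": memory}}}}


def oc_patch_command(build_config: str, memory: str) -> list[str]:
    """Build the argument list for `oc patch` on a build configuration."""
    patch = json.dumps(build_memory_patch(memory))
    return ["oc", "patch", f"bc/{build_config}", "--patch", patch]


# "myapp" is a placeholder build configuration name.
command = oc_patch_command("myapp", "1Gi")
print(shlex.join(command))
```

Start a new build after applying the patch so the larger limit takes effect.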

Related

How to periodically update the dependencies in a production Python environment?

I am responsible for maintaining a Python environment that is frequently deployed into production. To ensure that the library's dependencies work well together, I use pip to install all the required Python packages. Following best practices, I pip freeze all of my dependencies into a requirements.txt file.
The benefit of this approach is that I have a very stable environment that is unlikely to break due to a package issue. However, the drawback is that my environment is static while these packages are constantly releasing new versions that could improve performance and most importantly fix vulnerabilities.
I am wondering what the industry standard is for keeping packages up to date in an easy way, perhaps even in an automated way that detects new updates and tests for any potential issues. For instance, apt has a simple apt-get update and apt-get upgrade command to keep your packages constantly updated and be aware of it.
The obvious answer is to just update all packages to the latest official version. My concern is that this can potentially break some dependencies between packages and cause the environment to break.
Can anyone suggest a similar solution for keeping Python packages up to date while ensuring stability?

Well, I can't say what the industry standard is, but one way I can think of to update the packages periodically would be:
Create a requirements.txt file that contains the packages to update.
Create a Python script that runs the commands to update the packages in requirements.txt.
Use pip freeze to update requirements.txt.
Create a cron job, or use the Python schedule library, to trigger a periodic update.
For examples, see:
cron job on Mac: https://www.jcchouinard.com/python-automation-with-cron-on-mac/
schedule in Python: https://www.geeksforgeeks.org/python-schedule-library/
The above can solve the update issue but may not solve the compatibility issue.
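The steps above could be sketched as one Python script that a cron job calls. This is a minimal sketch, assuming it runs inside the target virtual environment. The function names and the separate lock file are illustrative choices, not part of the answer:

```python
import subprocess
import sys
from pathlib import Path


def upgrade_command(requirements: Path) -> list[str]:
    """pip invocation that upgrades every package listed in the file."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", "-r", str(requirements)]


def freeze_command() -> list[str]:
    """pip invocation that prints the exact installed versions."""
    return [sys.executable, "-m", "pip", "freeze"]


def update_and_freeze(requirements: Path, lockfile: Path) -> None:
    """Upgrade the listed packages, then record what was actually installed."""
    subprocess.run(upgrade_command(requirements), check=True)
    frozen = subprocess.run(freeze_command(), check=True, capture_output=True, text=True)
    lockfile.write_text(frozen.stdout)
```

The frozen output goes to a separate file here because writing == pins back into requirements.txt would leave nothing for the next --upgrade run to change. Run it against a test environment first, since nothing in it checks compatibility.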

What matters is not so much being current as not having vulnerabilities that may harm your customers or your company.
The industry standard is to use Snyk, Safety, OWASP Dependency-Track, Sonatype Lifecycle, etc. to monitor package/module vulnerabilities, together with a continuous integration (CI) system that supports this, such as Docker containers in Kubernetes.
Deploy the exact version you are running in prod and run vulnerability scans.
(All CI systems make this easy.)
(It is important to scan the running code, because requirements.txt most likely never lists everything that will be installed.)
The vulnerability scanners (all except OWASP Dependency-Track) cost money if you want to be current, but in their simplest form they just give you a more refined answer to:
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=python
You can build your own system based on the Safety source
https://github.com/pyupio/safety and the CVE list above, and make it fail if any of the listed versions are found in your code.
When it comes to requirements.txt listings, we always use >= rather than ==, deploy every night to test systems, and run tests on them. If they fail, we have to go through the errors manually, then check and pin the releases.
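A home-grown check of the "fail if a flagged version is installed" kind might look like the sketch below. It inspects what is actually installed rather than what requirements.txt lists. The advisory data is made up for illustration, and the function names are assumptions. A real check would load advisories from a vulnerability database:

```python
from importlib import metadata


def installed_versions() -> dict[str, str]:
    """Map each installed distribution name (lowercased) to its version."""
    return {
        dist.metadata["Name"].lower(): dist.version
        for dist in metadata.distributions()
        if dist.metadata["Name"]
    }


def find_vulnerable(installed: dict[str, str], advisories: dict[str, set[str]]) -> list[str]:
    """Return 'name==version' for every installed package with a flagged version."""
    return sorted(
        f"{name}=={version}"
        for name, version in installed.items()
        if version in advisories.get(name, set())
    )


# Made-up data for illustration only.
EXAMPLE_ADVISORIES = {"examplepkg": {"1.0.0", "1.0.1"}}
EXAMPLE_INSTALLED = {"examplepkg": "1.0.1", "otherpkg": "2.3.0"}

print(find_vulnerable(EXAMPLE_INSTALLED, EXAMPLE_ADVISORIES))  # ['examplepkg==1.0.1']
```

In CI, the job would exit with a non-zero status whenever find_vulnerable(installed_versions(), advisories) returns a non-empty list.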

Generally you build Docker images or similar. (There are other VM solutions, but Docker seems to be the most common/popular.) Have apt-get update, apt-get install and pip install -r requirements.txt as part of the Dockerfile, and integrate it into your CI/CD pipeline using GitHub or similar.
That builds the image from scratch using a well-defined process. You then run the VM, including all automated unit and integration tests and flag any problems for the Devs. It's then their job to figure out whether they broke something, whether they need a particular version of some dependency, etc.
Once the automated tests pass, the image binary (including its "frozen" dependencies) gets pushed to a place where humans can interact with it and look for anything weird. Typically there are several such environments variously called things like "test", "integration", and "production". The names, numbers, and details vary from place to place - but that triplet is fairly common. They're typically defined as a sequence and gating between them is done manually. This might be a typical spec:
Development - This is the first environment where code gets pushed after the automated testing passes. Failures and weird idiosyncrasies are common. If local developers need to replicate a problem or experiment with a patch that can't be handled on their local machines, they put the code here. Images graduate from Development when the basic manual and automated "smoke tests" all pass and the Dev team agrees the code is stable.
Integration - This environment belongs to QA and dedicated software testers. They may have their own set of scripts to run, which may be automated, manual or ad hoc. This is also a good environment for load testing, or for internal red team attacks on the security, or for testing / exploration by internal trusted users who are willing to risk the occasional crash in order to exercise the newest up-and-coming features. QA flags any problems or oddities and sends issues back to the devs. Images graduate from Integration when QA agrees the code is production-quality.
Production - This environment is running somewhat older code since everything has to go through Dev & QA before reaching it, but it's likely to be quite robust. The binary images here are the only ones that are accessible to the outside world, and the only ones carefully monitored by the SOC.
Since VM images are only compiled once, just prior to sending the code to Dev, all three environments should have the same issues. The problem then becomes getting code through the environments in sequence and doing all the necessary checks in each before the security patches become too outdated... or before the customers get tired of waiting for the new features / bug-fixes.
There are a lot of details in the process, and many questions and answers about it on StackOverflow's sister site for networking administration.

Can you pre-install libraries on Databricks Pool nodes?

We have a number of Python Databricks jobs that all use the same underlying Wheel package to install their dependencies. Installing this Wheel package even with a node that has been idling in a Pool still takes 90 seconds.
Some of these jobs are very long-running, so we would like to use Jobs Compute clusters for the lower cost in DBUs.
Some of these jobs are much shorter-running (<10 seconds) where the 90 second install time seems more significant. We have been considering using a hot cluster (All-Purpose Compute) for these shorter jobs. We would like to avoid the extra cost of the All-Purpose Compute if possible.
Reading the Databricks documentation suggests that the Idle instances in the Pool are reserved for us but not costing us DBUs. Is there a way for us to pre-install the required libraries on our Idle instances so that when a job comes through we are able to immediately start processing it?
Is there an alternate approach that can fulfill a similar use case?

You can't install libraries directly onto nodes from the pool, because the actual code is executed in the Docker container corresponding to the Databricks Runtime. There are several ways to speed up installation of the libraries:
Create your own Docker image with all necessary libraries pre-installed, and pre-load the Databricks Runtime version and your Docker image. This part can't be done via the UI, so you need to use the REST API (see the description of the preloaded_docker_images attribute), databricks-cli, or the Databricks Terraform provider. The main disadvantage of custom Docker images is that some functionality isn't available out of the box, for example, arbitrary files in Repos, web terminal, etc. (I don't remember the full list).
Put all necessary libraries and their dependencies onto DBFS and install them via cluster init script. It's very important that you collect binary dependencies, not packages only with the source code, so you won't need to compile them when installing. This could be done once:
for Python this could be done with pip download --prefer-binary lib1 lib2 ...
for Java/Scala you can use mvn dependency:get -Dartifact=<maven_coordinates>, that will download dependencies into ~/.m2/repository folder, from which you can copy jars to DBFS and in init script use cp /dbfs/.../jars/* /databricks/jars/ command
for R, it's slightly more complicated, but is also doable
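For the Python case, the download step and the matching offline install in the init script could be paired as in the sketch below. The directory /dbfs/wheels and the names lib1 and lib2 are placeholders. The downloaded wheels are platform-specific, so the download should run on a machine whose OS and Python version match the cluster. The sketch only prints the two commands:

```python
import shlex


def download_command(dest: str, packages: list[str]) -> list[str]:
    """Collect the packages and their dependencies, preferring binary wheels."""
    return ["pip", "download", "--prefer-binary", "-d", dest, *packages]


def offline_install_command(source: str, packages: list[str]) -> list[str]:
    """Install only from the pre-downloaded files, without contacting an index."""
    return ["pip", "install", "--no-index", "--find-links", source, *packages]


# Placeholder directory and package names; substitute your own.
WHEEL_DIR = "/dbfs/wheels"
PACKAGES = ["lib1", "lib2"]

print(shlex.join(download_command(WHEEL_DIR, PACKAGES)))
print(shlex.join(offline_install_command(WHEEL_DIR, PACKAGES)))
```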

Automated deployment and update of Python-based Software with pip

We have a relatively large number of machines in a cluster that run a certain Python-based software for computation in an academic research environment. In order to keep the code base up to date, we are using a build server which makes the current code base available in a directory each time we update a dedicated deployment tag on our Mercurial server. Each machine in the cluster runs a daily rsync script that just synchronises with the deployment directory on the build server (if there's anything to sync) and restarts the process if the code base was updated.
Now this approach I find a bit dated and a slight overkill, and would like to optimise in the following way:
Get rid of the build server as all it actually does is clone the latest code base that has a certain tag attached - it doesn't actually compile or do any additional checks (such as testing) on the code base at all. This would also reduce some pain for us as it'd be one less server to maintain and worry about.
Instead of having the build server, I would like to pull straight from our Mercurial server which hosts the code already. This would reduce the need to duplicate the code base each time we update the deployment tag.
Now I had a bit of a read before on how to install / deploy Python-based software with pip (e.g., How to point pip at a Mercurial branch?). It seems to be the right choice as it supports installing packages straight from a code repository. However, I ran into a few problems that I would need help with. The requirements I have are as follows:
Use Mercurial as a source.
Automated background process to update and install into a custom directory on the file system.
Only pull and update from the repository if there is a new version available.
The following command seems to almost do what I need:
pip install -e hg+https://authkey:anypw@mymercurialserver.hostname.com/Code/package@deployment#egg=package --upgrade --src ~/proj
It pulls the package from the Mercurial server, picks the code base with the tag "deployment" and installs it into proj inside the user's home directory.
The problem, however, is that regardless of whether there is an update available, pip always uninstalls the package and reinstalls it. This makes it difficult to decide whether the process needs to be restarted if nothing actually changed. In addition, pip always gets stuck with the message that hg clone in ./proj/yarely exists with URL... and asks me: What to do? (s)witch, (i)gnore, (w)ipe, (b)ackup. This is not ideal, as (1) it would be an automated process without a user prompt, and (2) it should only pull the repository if there was an update in the first place, to reduce network traffic and avoid overloading our Mercurial server. I believe that in this case a pull instead of a clone, when a local copy of the repository already exists, would be more appropriate and would potentially solve the problem.
I wasn't able to find an elegant and nice solution to this problem. Does anyone have a pointer or suggestion how this could be achieved?
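One possible direction is to skip pip for the "has anything changed?" check and ask Mercurial directly which changeset the tag points to, locally and on the server. The sketch below uses plain hg commands in place of pip and is untested against the setup described. The function names are illustrative, and it assumes a local clone already exists and that hg identify can resolve the tag on the remote:

```python
import subprocess


def hg_identify(source: str, rev: str) -> str:
    """Return the changeset id that `rev` points to in a local or remote repo."""
    result = subprocess.run(
        ["hg", "identify", "--id", "--rev", rev, source],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def needs_update(local_id: str, remote_id: str) -> bool:
    """True when the ids differ; a trailing '+' (uncommitted changes) is ignored."""
    return local_id.rstrip("+") != remote_id.rstrip("+")


def update_if_changed(local_repo: str, remote_url: str, tag: str = "deployment") -> bool:
    """Pull and update only when the remote tag differs; return True if updated."""
    if not needs_update(hg_identify(local_repo, tag), hg_identify(remote_url, tag)):
        return False
    subprocess.run(["hg", "--repository", local_repo, "pull", remote_url], check=True)
    subprocess.run(["hg", "--repository", local_repo, "update", "--rev", tag], check=True)
    return True
```

The boolean result of update_if_changed could then decide whether the process needs a restart, and no network transfer beyond the identify call happens when the tag has not moved.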

Using setuptools, how can I download external data upon installation?

I'd like to create some ridiculously-easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some stuff already exists, but I want it to be even simpler.)
What I'd like to achieve is this:
User runs pip install dataset
pip downloads the dataset, say via wget http://mydata.com/data.tar.gz. Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn't ideal, but the datasets are pretty small, so let's assume storing the data here isn't a big deal.)
Later, when the user imports my module, the module automatically loads the data from the specific location.
This question is about bullets 2 and 3. Is there a way to do this with setuptools?

As alluded to by Kevin, Python package installs should be completely reproducible, and any potential external-download issues should be pushed to runtime. This therefore shouldn't be handled with setuptools.
Instead, to avoid burdening the user, consider downloading the data in a lazy way, upon load. Example:
import os

def download_data(url='http://...'):
    # Download; extract data to disk.
    # Raise an exception if the link is bad, or we can't connect, etc.
    ...

def load_data():
    # DATA_DIR and read_data_from_disk are defined elsewhere in the package.
    if not os.path.exists(DATA_DIR):
        download_data()
    data = read_data_from_disk(DATA_DIR)
    return data
We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to the behavior in the imageio module with respect to downloading necessary decoders at runtime, rather than making the user manage the external downloads themselves.

Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
Please do not do this.
The whole point of Python packaging is to provide a completely deterministic, repeatable, and reusable means of installing exactly the same thing every time. Your proposal has the following problems at a minimum:
The end user might download your package on computer A, stick it on a thumb drive, and then install it on computer B which does not have internet.
The data on the web might change, meaning that two people who install the same exact package get different results.
The website that provides the data might cease to exist or unwisely change the URL, meaning people who still have the package won't be able to use it.
The user could be behind an internet filter, and you might get a useless "this page is blocked" HTML file instead of the dataset you were expecting.
Instead, you should either include your data with the package (using the package_data or data_files arguments to setup()), or provide a separate top-level function in your Python code to download the data manually when the user is ready to do so.

Python's packaging guidance states that installing a package should never execute Python code. This means that you may not be able to download anything during the installation process.
If you want to download additional data, do it after you install the package. For example, when your package is imported, you could download the data and cache it somewhere so that it is not downloaded on every new import.
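A download-once-and-cache helper along those lines could look like this. It is a minimal sketch using only the standard library. The function name, the marker file and the cache location are assumptions, and it should only be pointed at archives you trust, since it extracts whatever the archive contains:

```python
import tarfile
import urllib.request
from pathlib import Path


def ensure_data(url: str, cache_dir: Path) -> Path:
    """Download and extract the archive once; later calls reuse the cache."""
    marker = cache_dir / ".complete"
    if marker.exists():
        return cache_dir
    cache_dir.mkdir(parents=True, exist_ok=True)
    archive = cache_dir / "data.tar.gz"
    urllib.request.urlretrieve(url, archive)  # raises on a bad link or no connection
    with tarfile.open(archive) as tar:
        tar.extractall(cache_dir)
    archive.unlink()
    marker.touch()  # written last, so an interrupted download is retried
    return cache_dir
```

The package's loading function would call something like ensure_data("http://mydata.com/data.tar.gz", Path.home() / ".cache" / "dataset") before reading from disk, where the cache path is a hypothetical choice.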

This question is rather old, but I want to add that downloading external data at installation time is of course much better than forcing to download external content at runtime.
The original problem is, that one cannot package arbitrary content into a Python package, if it exceeds the max. size limit of the package registry. This size limit effectively breaks up the relationship of the packaged Python code and the data it operates on. Suddenly things that belong together have to be separated and the package creator needs to take care about versioning and availability of external data. If the size limits are met, everything is installed at installation time and the discussion would be over here. I want to stress, that data & algorithms belong together and are normally installed at the same time, not at some later date. That's the whole point of package integrity. If you cannot install a package, because the external content cannot be downloaded, you want to know at installation time.
In the light of Docker & friends, downloading data at runtime makes a container non-reproducible and forces the download of the external content at each start of the container unless you additionally add the path where the data is downloaded to a Docker volume. But then you need to know where exactly this content is downloaded and the user/Dockerfile creator has to know more unnecessary details. There are more issues in using volumes in that regard.
Moreover, content fetched at runtime cannot be cached automatically by Docker, i.e. you need to fetch every time after a docker build.
Then again, one could argue that one should provide a function or executable script that downloads this external content, and that the user should run this script directly after installation. Again, the user of the package needs to know more than necessary, because someone or some committee proclaims that executing Python code or downloading external content at installation time is not "recommended".
But forcing the user to run an extra script directly after installation of a package is factually the same as downloading the content directly inside a post-installation step, just more user-unfriendly. Thinking about how popular machine learning is today, the growing size of models and popularity of ML in the future, there will be a lot of scripts to be executed for just a handful of Python package dependencies for model downloads in the near future according to this argumentation.
The only time I see a benefit in an extra script is when you can choose between several different versions of the external content, but then one intentionally involves the user in that decision.
But coming back to the lazy, on-demand model download at runtime, where the user doesn't need to run an extra script: let's assume the user packages the container, all tests pass on the CI, and he/she pushes it to Docker Hub or any other container registry and starts production. Nobody wants random failures because a successfully installed package intermittently downloads content, e.g. after a maintenance task such as cleaning up Docker volumes, or when containers are distributed to new k8s nodes and the first request to a web app times out because external content is always fetched at startup. Or the content is not fetched at all, because the external URL is in maintenance mode. That's a nightmare!
If reasonably sized Python packages were allowed, the whole problem would be much less of an issue. In contrast, the biggest Ruby gems (i.e. packages in the Ruby ecosystem) are over 700MB, and of course downloading external content at installation time is allowed there.

Methods for speeding up build time in a project using bitbake?

I'm working in a project which has many bitbake recipes and takes a lot of time - up to 13 hours in some cases. I am new to bitbake and I'm asking for some way to:
check what packages take more to build
check very long dependencies (I have used bitbake -g already)
check if there are any circular dependencies and how to solve them
check if there are recipes which aren't used and how to safely remove them
or any suggestions for using any tools for better managing and understanding recipes.
Or any methods/ways for speeding up the build process in general.
Both suggestions and exact techniques are welcomed.
EDIT date 07/08/2013:
Found this useful tool for tracking dependencies
https://github.com/scottellis/oe-deptools
Description:
./oey.py -h
Usage: ./oey.py [options] [package]
Displays OE build dependencies for a given package or recipe.
Uses the pn-depends.dot file for its raw data.
Generate a pn-depends.dot file by running bitbake -g <recipe>.
Options:
-h Show this help message and exit
-v Show error messages such as recursive dependencies
-r Show reverse dependencies, i.e. packages dependent on package
-f Flat output instead of default tree output
-d <depth> Maximum depth to follow dependencies, default and max is 10
-s Show child package dependencies that are already listed
as direct parent dependencies.
Provide a package name from the generated pn-depends.dot file.
Run the program without a package name to get a list of
available package names.

This is a very broad question!
First, here is a summary on how to inspect your build performance
and dependencies when using the openembedded/yocto project. This answers the first part of the question.
What packages take more time to build?
Use buildstats with the pybootchartgui tool to produce a build chart.
Details:
Set USER_CLASSES += "buildstats" in your $BUILDDIR/conf/local.conf
file. This will dump detailed performance data in
$BUILDDIR/tmp/buildstats/<DATE>. Next use the pybootchartgui.py script (in
poky/scripts/pybootchartgui) to generate the chart. This will help you
localize possible bottlenecks in the build. Of course, if you have a
lot of recipes to bake, your chart will be huge. To remove some noise
use the -m MINTIME command line option.
For example:
poky/scripts/pybootchartgui/pybootchartgui.py -m 600 $BUILDDIR/tmp/buildstats/201312310904
will only display tasks (do_compile, do_fetch, etc.) that take longer
than 10 minutes (600 seconds) to run.
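If a chart is more than you need, the raw buildstats files can also be ranked with a few lines of Python. This sketch rests on two assumptions that may not hold for every bitbake version: that the directory contains one subdirectory per recipe with one file per task (do_compile, do_fetch, ...), and that each task file has a line of the form "Elapsed time: N seconds":

```python
import re
from pathlib import Path

# Assumed line format inside each buildstats task file.
ELAPSED = re.compile(r"^Elapsed time: ([\d.]+) seconds", re.MULTILINE)


def elapsed_seconds(text: str):
    """Return the elapsed time recorded in one task file, or None if absent."""
    match = ELAPSED.search(text)
    return float(match.group(1)) if match else None


def slowest_tasks(buildstats_dir: Path, limit: int = 10) -> list[tuple[float, str]]:
    """Scan <recipe>/<task> files and return the longest-running tasks first."""
    timings = []
    for task_file in buildstats_dir.glob("*/do_*"):
        seconds = elapsed_seconds(task_file.read_text(errors="replace"))
        if seconds is not None:
            timings.append((seconds, f"{task_file.parent.name}:{task_file.name}"))
    return sorted(timings, reverse=True)[:limit]
```

Calling slowest_tasks on the dated buildstats directory returns (seconds, "recipe:task") pairs, slowest first.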
How to check for package dependencies?
To explore the dependencies of a particular package use the depexp
utility. For example, to explore the dependencies of eglibc use:
bitbake -g -u depexp eglibc
This will give a better understanding of what each recipe depends on
at both run and compile time.
How to check if there are any circular dependencies and how to solve them?
bitbake automatically detects circular dependencies and prints an error message when such a thing happens.
The error message contains the name of the packages causing this circular dependency.
How to check if there are recipes which aren't used and how to safely remove them?
bitbake calculates dependencies automatically and won't build
packages which aren't needed by your target. If you find some unwanted packages in your image and you wish to remove them:
use bitbake -g -u depexp <TARGET> to inspect how the package gets pulled in
modify the needed recipes in your layer (by creating a bbappend for example) to eliminate the dependency manually
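Without a GUI, the same "who pulls this in?" question can be answered from the pn-depends.dot file that bitbake -g writes. The sketch below assumes edges appear one per line as "recipe" -> "dependency". The sample graph is made up for illustration:

```python
import re
from collections import defaultdict

# Assumed edge format in pn-depends.dot: "recipe" -> "dependency"
EDGE = re.compile(r'^"([^"]+)"\s*->\s*"([^"]+)"')


def reverse_dependencies(dot_text: str) -> dict[str, set[str]]:
    """Map each recipe to the set of recipes that depend on it directly."""
    dependents = defaultdict(set)
    for line in dot_text.splitlines():
        match = EDGE.match(line.strip())
        if match:
            recipe, dependency = match.groups()
            dependents[dependency].add(recipe)
    return dict(dependents)


def pulled_in_by(dot_text: str, package: str) -> list[str]:
    """List the recipes that depend directly on `package`, sorted by name."""
    return sorted(reverse_dependencies(dot_text).get(package, set()))


# Made-up sample graph.
SAMPLE = '''digraph depends {
"core-image" -> "busybox"
"core-image" -> "dropbear"
"dropbear" -> "zlib"
"busybox" [label="busybox"]
}'''

print(pulled_in_by(SAMPLE, "zlib"))  # ['dropbear']
```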
Improving overall build performance
Finally, some tips on how to improve the overall build performance. This answers the second part of the question.
Clean up your dependencies (bitbake -g -u depexp <TARGET> is your friend). Building less stuff takes less time.
bitbake can automatically cache the build outputs and use it for
future builds, this cache is called the "shared-state cache" and is
controlled with the SSTATE_DIR variable in your local.conf.
Set the BB_NUMBER_THREADS and PARALLEL_MAKE variables in your local.conf to match your
machine's resources. These variables control how many tasks are run
in parallel and how many processes 'make' should run in parallel (-j
option) respectively.
Put the 'build' directory on its own disk.
Use the ext4 filesystem without journaling and with these mount
options: noatime,barrier=0,commit=6000. WARNING: This makes your
hdd unreliable in case of power losses. Do not store anything of
value on this hdd.
Building images with -dev and/or -dbg packages increases the
do_rootfs task time considerably. Make sure you enable them (see
EXTRA_IMAGE_FEATURES in your local.conf) only when needed.
openembedded and yocto both support icecream (distributed compile). See the icecc class and this post.
Buy a faster machine ;)
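For the BB_NUMBER_THREADS and PARALLEL_MAKE tip, a small script can print local.conf lines sized to the build machine. Using the CPU count for both values is only an assumed starting point, not a rule from this answer, so tune it against your own build times:

```python
import os


def parallel_settings(cpu_count: int) -> str:
    """Render local.conf lines that set both values to the CPU count."""
    return (
        f'BB_NUMBER_THREADS = "{cpu_count}"\n'
        f'PARALLEL_MAKE = "-j {cpu_count}"\n'
    )


# os.cpu_count() can return None, so fall back to a single core.
print(parallel_settings(os.cpu_count() or 1), end="")
```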
References:
Yocto Build Performance Wiki
Bitbake GUI tools

I tried distributed compilation years ago, but the server environment configuration was not flexible enough for CI agent work, and the build was not well suited to multiple people.
I analysed the build time with buildstats and found that nearly all of it was spent compiling third-party open-source components, which I hadn't modified at all.
So the simplest and most effective way is to use "sstate-cache" to avoid rebuilding unmodified components.
The approach I use at work is to trade space for time:
Compile the whole bitbake project; you will find many .tgz files under "build/sstate-cache".
Add and commit all these files to git to track file modifications.
Compile the whole project again without cleaning it.
cd to your "build/sstate-cache", run git status to find the files that were modified, delete them, and commit to git.
Then clean the project, keeping the .tgz files under "sstate-cache". The trade of space for time is done.
The list of "unmodified" files can be adjusted for your own project.
For me, the build time went from 1 hour to under 10 minutes.
Hope this helps.
