I'm working on a project which has many bitbake recipes and takes a lot of time to build - up to 13 hours in some cases. I am new to bitbake and I'm looking for ways to:
check which packages take the longest to build
check very long dependency chains (I have used bitbake -g already)
check if there are any circular dependencies and how to solve them
check if there are recipes which aren't used and how to safely remove them
or any suggestions for tools that help with managing and understanding recipes.
Or any methods/ways for speeding up the build process in general.
Both suggestions and exact techniques are welcome.
EDIT date 07/08/2013:
Found this useful tool for tracking dependencies
https://github.com/scottellis/oe-deptools
Description:
./oey.py -h
Usage: ./oey.py [options] [package]
Displays OE build dependencies for a given package or recipe.
Uses the pn-depends.dot file for its raw data.
Generate a pn-depends.dot file by running bitbake -g <recipe>.
Options:
-h Show this help message and exit
-v Show error messages such as recursive dependencies
-r Show reverse dependencies, i.e. packages dependent on package
-f Flat output instead of default tree output
-d <depth> Maximum depth to follow dependencies, default and max is 10
-s Show child package dependencies that are already listed
as direct parent dependencies.
Provide a package name from the generated pn-depends.dot file.
Run the program without a package name to get a list of
available package names.
This is a very broad question!
First, here is a summary of how to inspect your build performance
and dependencies when using the OpenEmbedded/Yocto project. This answers the first part of the question.
What packages take more time to build?
Use the buildstats class together with the pybootchartgui tool to produce a build chart.
Details:
Set USER_CLASSES += "buildstats" in your $BUILDDIR/conf/local.conf
file. This will dump detailed performance data in
$BUILDDIR/tmp/buildstats/<DATE>. Next use the pybootchartgui.py script (in
poky/scripts/pybootchartgui) to generate the chart. This will help you
localize possible bottlenecks in the build. Of course, if you have a
lot of recipes to bake, your chart will be huge. To remove some noise
use the -m MINTIME command line option.
For example:
poky/scripts/pybootchartgui/pybootchartgui.py -m 600 $BUILDDIR/tmp/buildstats/201312310904
will only display tasks (do_compile, do_fetch, etc.) that take longer
than 10 minutes (600 seconds) to run.
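If you just want a quick text-only ranking of the slowest tasks, the per-task buildstats files can also be inspected directly. This is only a rough sketch and assumes each task file contains an "Elapsed time: <seconds>" line (check the files in your own buildstats directory first):
grep -r "Elapsed time" $BUILDDIR/tmp/buildstats/<DATE>/ | sort -t: -k3 -n | tail -20   # 20 slowest tasks last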
How to check for package dependencies?
To explore the dependencies of a particular package use the depexp
utility. For example, to explore the dependencies of eglibc use:
bitbake -g -u depexp eglibc
This will give a better understanding of what each recipe depends on
at both run and compile time.
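If you have no graphical session available, the .dot files written by bitbake -g can be searched directly instead. A minimal sketch (the target and recipe names below are only examples, and the generated file names such as pn-depends.dot can differ between releases):
bitbake -g core-image-minimal
grep ' -> ' pn-depends.dot | grep '"eglibc"'   # show dependency edges that involve eglibc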
How to check if there are any circular dependencies and how to solve them?
bitbake automatically detects circular dependencies and prints an error message when one is found.
The error message contains the names of the packages involved in the circular dependency.
How to check if there are recipes which aren't used and how to safely remove them?
bitbake calculates dependencies automatically and won't build
packages which aren't needed by your target. If you find some unwanted packages in your image and you wish to remove them:
use bitbake -g -u depexp <TARGET> to inspect how the package gets pulled in
modify the relevant recipes in your layer (by creating a bbappend, for example) to eliminate the dependency manually (a hypothetical sketch follows)
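For illustration only, such a bbappend might look roughly like this (the layer path, recipe name foo and dependency unwanted-pkg are hypothetical; older releases use the underscore override shown here, newer ones use RDEPENDS:${PN}:remove):
# meta-mylayer/recipes-example/foo/foo_%.bbappend  (hypothetical path)
RDEPENDS_${PN}_remove = "unwanted-pkg"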
Improving overall build performance
Finally, some tips on how to improve the overall build performance. This answers the second part of the question.
Clean up your dependencies (bitbake -g -u depexp <TARGET> is your friend). Building less stuff takes less time.
bitbake can automatically cache build outputs and reuse them for
future builds; this cache is called the "shared-state cache" and is
controlled with the SSTATE_DIR variable in your local.conf.
Set the BB_NUMBER_THREADS and PARALLEL_MAKE variables in your local.conf to match your
machine's resources. These variables control how many bitbake tasks are run
in parallel and how many processes 'make' should run in parallel (the -j
option), respectively (see the local.conf sketch after this list).
Put the 'build' directory on its own disk.
Use the ext4 filesystem without journaling and with these mount
options: noatime,barrier=0,commit=6000. WARNING: This makes your
hdd unreliable in case of power losses. Do not store anything of
value on this hdd.
Building images with -dev and/or -dbg packages increases the
do_rootfs task time considerably. Make sure you enable them (see
EXTRA_IMAGE_FEATURES in your local.conf) only when needed.
openembedded and yocto both support icecream (distributed compile). See the icecc class and this post.
Buy a faster machine ;)
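To tie several of these tips together, a hedged local.conf sketch for an 8-core build machine might look like this (the numbers and the sstate path are illustrative assumptions, not recommendations):
# conf/local.conf (illustrative values only)
BB_NUMBER_THREADS = "8"             # how many bitbake tasks run in parallel
PARALLEL_MAKE = "-j 8"              # passed to make as the -j option
SSTATE_DIR = "/ssd/sstate-cache"    # shared-state cache, ideally on its own fast disk
USER_CLASSES += "buildstats"        # collect the per-task timing data mentioned above
# enable dbg-pkgs / dev-pkgs via EXTRA_IMAGE_FEATURES only when you actually need them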
References:
Yocto Build Performance Wiki
Bitbake GUI tools
I tried the distributed compile approach years ago, but the server environment configuration was not flexible enough for CI agent work, and the build did not suit a multi-person setup well.
I then analysed the build time with buildstats
and found that nearly all of it was spent compiling third-party open-source components, which I had not modified at all.
So the simplest and most effective approach is to use the shared-state cache ("sstate-cache") to avoid rebuilding unmodified components.
The way I use it in my work is to trade space for time (a rough command sketch follows the steps below):
1. Compile the whole bitbake project; you will find many .tgz files under "build/sstate-cache".
2. Add and commit all of these files to git to track file modifications.
3. Compile the whole project again without cleaning it.
4. cd into "build/sstate-cache", run git status to find the files that were modified, delete them, and commit to git.
5. Clean the project while keeping the remaining .tgz files under "sstate-cache"; the trade of space for time is done.
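A rough command sketch of those steps (the paths are assumptions and the git filtering is only one way to find the changed cache files):
cd build/sstate-cache
git init && git add . && git commit -m "baseline sstate"             # steps 1-2
# ... now rebuild the whole project without cleaning it ...           # step 3
git status --porcelain | awk '$1 == "M" {print $2}' | xargs -r rm     # step 4: delete cache files that changed
git add -A && git commit -m "keep only stable sstate objects"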
The list of "unmodified" files can be adjusted according to your own project.
For me this reduced the build time from 1 hour to under 10 minutes.
Hope this is helpful.
Related
I am responsible for maintaining a Python environment that is frequently deployed into production. To ensure that the library's dependencies work well together, I use pip to install all the required Python packages and, following best practices, I pip freeze all of my dependencies into a requirements.txt file.
The benefit of this approach is that I have a very stable environment that is unlikely to break due to a package issue. However, the drawback is that my environment is static while these packages are constantly releasing new versions that could improve performance and most importantly fix vulnerabilities.
I am wondering what the industry standard is for keeping packages up to date in an easy way, perhaps even in an automated way that detects new updates and tests for any potential issues. For instance, apt has simple apt-get update and apt-get upgrade commands to keep your packages constantly updated and make you aware of it.
The obvious answer is to just update all packages to the latest official version. My concern is that this can potentially break some dependencies between packages and cause the environment to break.
Can anyone suggest a similar solution for keeping Python packages up to date while ensuring stability?
Well, I can't say what the industry standard is, but one way I can think of to update the packages periodically would be as follows:
Create a requirements.txt file that contains the packages to update.
Create a Python script that contains bash commands to update the packages in the requirements.txt (a command sketch follows this list).
You can use pip freeze to update the requirements.txt.
Create a cron job or use the Python schedule library to trigger a periodic update.
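A hedged sketch of such a script (it assumes the requirements file lists package names without hard == pins, otherwise --upgrade will not move past them; the crontab path is an assumption):
#!/bin/bash
pip install --upgrade -r requirements.txt   # pull in the newest versions the file allows
pip freeze > requirements.txt               # re-record exactly what is installed now
# example crontab entry to run the script nightly at 02:00:
# 0 2 * * * /bin/bash /home/me/update_requirements.sh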
As examples, you can refer to:
cron jobs on a Mac: https://www.jcchouinard.com/python-automation-with-cron-on-mac/
the schedule library in Python: https://www.geeksforgeeks.org/python-schedule-library/
The above can solve the update issue but may not solve the compatibility issue.
What matters is not so much being current as not having vulnerabilities that may harm customers and your company in any way.
The industry standard is to use Snyk, Safety, OWASP Dependency-Track, Sonatype Lifecycle, etc. to monitor package/module vulnerabilities, together with a Continuous Integration (CI) system that supports this, such as Docker containers in Kubernetes.
Deploy the exact version you are running in prod and run vulnerability scans.
(All CI systems make this easy.)
(It is important to do this on running code because requirements.txt most likely never shows everything that will actually be installed.)
The vulnerability scanners (all except OWASP Dependency-Track) cost money if you want to stay current, but in their simplest form they just give you a more refined answer to:
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=python
You can make your own system based on the Safety source
https://github.com/pyupio/safety and the CVE list above, and make it fail if any of the affected versions are found in your code.
When it comes to requirements.txt listings, we always use >= in favor of == and deploy every night to test systems and run tests on them; if they fail, we manually have to go through the errors, check them, and pin the releases.
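For illustration, that >= style might look like this in requirements.txt (package names and versions are just examples):
requests>=2.28
django>=3.2
# pin (==) only the packages that the nightly test runs showed to break:
# somepackage==1.4.2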
Generally you build Docker images or similar. (There are other VM solutions, but Docker seems to be the most common/popular.) Have apt-get update, apt-get install and pip install -r requirements.txt as part of the Dockerfile, and integrate it into your CI/CD pipeline using GitHub or similar.
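A minimal Dockerfile sketch of that approach (the base image, system packages and entry point are assumptions):
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]   # main.py is a placeholder entry point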
That builds the image from scratch using a well-defined process. You then run the VM, including all automated unit and integration tests and flag any problems for the Devs. It's then their job to figure out whether they broke something, whether they need a particular version of some dependency, etc.
Once the automated tests pass, the image binary (including its "frozen" dependencies) gets pushed to a place where humans can interact with it and look for anything weird. Typically there are several such environments variously called things like "test", "integration", and "production". The names, numbers, and details vary from place to place - but that triplet is fairly common. They're typically defined as a sequence and gating between them is done manually. This might be a typical spec:
Development - This is the first environment where code gets pushed after the automated testing passes. Failures and weird idiosyncrasies are common. If local developers need to replicate a problem or experiment with a patch that can't be handled on their local machines, they put the code here. Images graduate from Development when the basic manual and automated "smoke tests" all pass and the Dev team agrees the code is stable.
Integration - This environment belongs to QA and dedicated software testers. They may have their own set of scripts to run, which may be automated, manual or ad hoc. This is also a good environment for load testing, or for internal red team attacks on the security, or for testing / exploration by internal trusted users who are willing to risk the occasional crash in order to exercise the newest up-and-coming features. QA flags any problems or oddities and sends issues back to the devs. Images graduate from Integration when QA agrees the code is production-quality.
Production - This environment is running somewhat older code since everything has to go through Dev & QA before reaching it, but it's likely to be quite robust. The binary images here are the only ones that are accessible to the outside world, and the only ones carefully monitored by the SOC.
Since VM images are only compiled once, just prior to sending the code to Dev, all three environments should have the same issues. The problem then becomes getting code through the environments in sequence and doing all the necessary checks in each before the security patches become too outdated... or before the customers get tired of waiting for the new features / bug-fixes.
There are a lot of details in the process, and many questions & answers on StackOverflow's sister site for networking administration
There is a similar question from last year but I don't think the responses are widely applicable and it's not accepted.
Edit: this is in the context of developing small jobs that will only be run in docker in-house; I'm not talking about sharing work with anyone outside a small team, or about projects getting heavy re-use.
What advantage do you see in using requirements.txt to install instead of pip install commands in Dockerfile? I see one: your Dockerfile for various projects is more cookie-cutter.
I'm not even thinking of the use of setup envisioned in the question I linked.
What downside is there to naming the packages in Dockerfile:
RUN pip install --target=/build django==3.0.1 Jinja2==2.11.1 . . .
EDIT 2: #superstormer asked "what are the upsides to putting it in Dockerfile" -- fair question. I read co-workers' Dockerfiles in GitLab and have to navigate to the requirements; I don't have them locally in an editor. EDIT 3: Note to self: so clone it and look at it in an editor.
First consider going with the flow of the tools:
To manually install those packages, inside or outside a Docker Container, or to test that it works without building a new Docker Image, do pip install -r requirements.txt. You won't have to copy/paste the list of packages.
To "freeze" on specific versions of the packages to make builds more repeatable, pip freeze will create (or augment) that requirements.txt file for you.
PyCharm will look for a requirements.txt file, let you know if your currently installed packages don't match that specification, help you fix that, show you if updated packages are available, and help you update.
Presumably other modern IDEs do the same, but if you're developing in plain text editors, you can still run a script like this to check the installed packages (this is also handy in a git post-checkout hook):
echo -e "\nRequirements diff (requirements.txt vs current pips):"
diff --ignore-case <(sed 's/ *#.*//;s/^ *--.*//;/^$/d' requirements.txt | sort --ignore-case) \
     <(pip freeze 2>/dev/null | sort --ignore-case) -yB --suppress-common-lines
Hopefully this makes it clearer that requirements.txt declares required packages and usually the package versions. It's more modular and reusable to keep it separate than embed it inside a Dockerfile.
It's a question of single responsibility.
A Dockerfile's job is to package an application up so it can be built as an image. That is: it should describe every step needed to turn an application into a container image.
requirements.txt's job is to list every dependency of a Python application, regardless of its deployment strategy. Many Python workflows expect a requirements.txt and know how to add new dependencies while updating that requirements.txt file. Many other workflows can at least interoperate with requirements.txt. None of them know how to auto-populate a Dockerfile.
In short, the application is not complete if it does not include a requirements.txt. Including that information in the Dockerfile is like writing documentation that teaches your operations folks how to pull and install every individual dependency while deploying the application, rather than including it in a dependency manager that packages into the binary you deliver to ops.
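In Dockerfile terms, the separation argued for here typically boils down to two generic lines that never need to change as the dependency list evolves (a sketch; the --no-cache-dir flag is optional):
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt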
This is the problem: You try to run a python script that you didn't write yourself, and it is missing a module. Then you solve that problem and try again - now another module is missing. And so on.
Is there anything, a command or something, that can go through the python sources and check that all the necessary modules are available - perhaps even going as far as looking up the dependencies of missing modules online (although that may be rather ambitious)? I think of it as something like 'ldd', but of course this is much more like yum or apt-get in its scope.
Please note, BTW, I'm not talking about the package dependencies like in pip (I think it is called, never used it), but about the logical dependencies in the source code.
There are several packages that analyze code dependencies:
https://docs.python.org/2/library/modulefinder.html
Modulefinder seems like what you want, and reports what modules can't be loaded. It looks like it works transitively from the example, but I am not sure.
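For a quick check without writing any code, modulefinder can also be run from the command line; roughly like this (I believe the report it prints includes a "Missing modules:" section, but verify on your Python version):
python -m modulefinder yourscript.py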
https://pypi.org/project/findimports/
This also analyzes transitive imports, I am not sure however what the output is if a module is missing.
... And some more you can find with your favorite search engine
To answer the original question more directly, I think...
lddcollect is available via pip and looks very good.
Emphases mine:
Typical use case: you have a locally compiled application or library with large number of dependencies, and you want to share this binary. This tool will list all shared libraries needed to run it. You can then create a minimal rootfs with just the needed libraries. Alternatively you might want to know what packages need to be installed to run this application (Debian based systems only for now).
There are two modes of operation.
List all shared library files needed to execute supplied inputs
List all packages you need to apt-get install to execute supplied inputs as well as any shared libraries that are needed but are not under package management.
In the first mode it is similar to ldd, except referenced symbolic links to libraries are also listed. In the second mode shared library dependencies that are under package management are not listed, instead the name of the package providing the dependency is listed.
lddcollect --help
Usage: lddcollect [OPTIONS] [LIBS_OR_DIR]...
Find all other libraries and optionally Debian dependencies listed
applications/libraries require to run.
Two ways to run:
1. Supply single directory on input
- Will locate all dynamic libs under that path
- Will print external libs only (will not print any input libs that were found)
2. Supply paths to individual ELF files on a command line
- Will print input libs and any external libs referenced
Prints libraries (including symlinks) that are referenced by input files,
one file per line.
When --dpkg option is supplied, print:
1. Non-dpkg managed files, one per line
2. Separator line: ...
3. Package names, one per line
Options:
--dpkg / --no-dpkg Lookup dpkg libs or not, default: no
--json Output in json format
--verbose Print some info to stderr
--ignore-pkg TEXT Packages to ignore (list package files instead)
--help Show this message and exit.
I can't test it against my current use case for ldd right now, but I quickly ran it against a binary I've built, and it seems to report the same kind of info, in fact almost twice as many lines!
We have a relatively large number of machines in a cluster that run certain Python-based software for computation in an academic research environment. In order to keep the code base up to date, we are using a build server which makes the current code base available in a directory each time we update a dedicated deployment tag on our Mercurial server. Each machine in the cluster runs a daily rsync script that just synchronises with the deployment directory on the build server (if there's anything to sync) and restarts the process if the code base was updated.
Now this approach I find a bit dated and a slight overkill, and would like to optimise in the following way:
Get rid of the build server as all it actually does is clone the latest code base that has a certain tag attached - it doesn't actually compile or do any additional checks (such as testing) on the code base at all. This would also reduce some pain for us as it'd be one less server to maintain and worry about.
Instead of having the build server, I would like to pull straight from our Mercurial server which hosts the code already. This would reduce the need to duplicate the code base each time we update the deployment tag.
Now I had a bit of a read before on how to install / deploy Python-based software with pip (e.g., How to point pip at a Mercurial branch?). It seems to be the right choice as it supports installing packages straight from a code repository. However, I ran into a few problems that I would need help with. The requirements I have are as follows:
Use Mercurial as a source.
Automated background process to update and install into a custom directory on the file system.
Only pull and update from the repository if there is a new version available.
The following command seems to almost do what I need:
pip install -e hg+https://authkey:anypw@mymercurialserver.hostname.com/Code/package@deployment#egg=package --upgrade --src ~/proj
It pulls the package from the Mercurial server, picks the code base with the tag "deployment" and installs it into proj inside the user's home directory.
The problem, however, is that regardless of whether there is an update available or not, pip always uninstalls package and reinstalls it. This makes it difficult to decide whether the process needs to be restarted or not if nothing actually changed. In addition, pip always gets stuck with the message that hg clone in ./proj/yarely exists with URL... and asks me: What to do? (s)witch, (i)gnore, (w)ipe, (b)ackup. Now this is not ideal, as (1) it would be an automated process without a user prompt, and (2) it should only pull the repository if there was an update in the first place, to reduce traffic in the network and not overload our Mercurial server. I believe that in this case a pull instead of a clone, if there was a local copy of the repository already, would be more appropriate and potentially solve the problem.
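For illustration, the manual version of that pull-instead-of-clone idea would look roughly like this (a sketch only; it assumes the checkout pip created lives in ~/proj/package, and that hg incoming exits with status 0 only when the server has new changesets):
cd ~/proj/package
if hg incoming -q > /dev/null; then
    hg pull
    hg update -r deployment
    pip install -e . --upgrade
    # restart the service here
fi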
I wasn't able to find an elegant and nice solution to this problem. Does anyone have a pointer or suggestion how this could be achieved?
When working with JVM languages a pattern commonly followed is to use a build system (ant+ivy / maven / gradle), where using a build file, the dependencies of your code can be defined. The build system is able to fetch these dependencies when you build your code. Moreover IDEs like Eclipse/IntelliJ are also able to read these build files and continuously build/verify your code as you write it.
How is something similar done while developing in Python? While there may not necessarily be a build step, I want a developer to be able to check out my code and then run a single bootstrap command that will set up a virtualenv and pull in any third-party dependencies necessary to run the code. I could include some sort of a script to do this, but I am wondering if there is a tool to do this? Most of my search so far has led me to packaging tools, which are more for distribution to end users than for this purpose (or so I understand).
This is managed by virtualenv and the pip install -r requirements.txt command. More info here: Virtual Environments
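A minimal bootstrap sketch of that approach (the environment directory name .venv is an assumption; older setups may use the virtualenv command instead of python -m venv):
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt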
I guess requirements.txt is what you are looking for. For example, PyCharm IDE will definitely see it as a dependency list.