I'm cleaning up packaging for a python project I didn't create. Currently, it does some explicitly unsupported magic to get its dependencies from a requirements.txt file. The file looks like it may have been generated by pip freeze; there are fixed versions for everything, and many apparently-extraneous packages listed. I am pretty sure some of these aren't real dependencies, but I don't know which ones.
Given just the source tree, how would I figure out, from scratch, what dependencies ought to be included in install_requires?
As a first stab, I'm grepping for non-stdlib import statements. I hope there's a better way.
There's no way to do this perfectly, because Python is too flexible.
But it's usually possible to do it well enough.
You can start with the stdlib's modulefinder.
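A minimal sketch of using it (the entry-point path here is hypothetical):

from modulefinder import ModuleFinder

finder = ModuleFinder()
finder.run_script('path/to/entry_point.py')  # hypothetical entry point

# finder.modules maps module names to Module objects. Stdlib modules are
# listed too, so you still have to filter those out yourself before turning
# the rest into install_requires entries.
for name, mod in sorted(finder.modules.items()):
    print(name, getattr(mod, '__file__', None))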
Beyond that, a number of projects—mostly projects designed for building binary executables, installers, etc. for Python apps—have come up with heuristics that go even farther.
These usually work. And, when they fail, you usually immediately spot it on your first test. Even if they aren't sufficient, they're at the very least good sample code. Here are a few off the top of my head:
cx_Freeze
py2exe
py2app
PyInstaller
In case you're wondering why it's impossible:
Even forgetting about the problem of dependencies in C extension modules, Python is just too flexible to catch all the ways you could import a module via static analysis.
Sure, you'd have to be dealing with code written by someone crazy enough to use explicitly unsupported magic for no good reason… but if you were, there's nothing to stop someone from writing this instead of import lxml:¹
import sys, codecs
with open('picture.jpg', encoding='cp500') as f:
    getattr(list(sys.modules.values())[11],
            codecs.encode('vzcbeg_zbqhyr', 'rot13'))(f.read().strip())
In reality, things aren't going to be that bad. But they could easily be too bad for rg import to be sufficient.
You could try to detect all the imports dynamically with a simple import hook, but that's only guaranteed to work if you can exercise 100% of the code paths.
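A rough sketch of that idea, wrapping builtins.__import__ (a real tool would probably use a sys.meta_path finder instead):

import builtins

seen = set()
real_import = builtins.__import__

def spying_import(name, *args, **kwargs):
    # Record the top-level name of every import that actually executes.
    if name:  # name is empty for some purely relative imports
        seen.add(name.partition('.')[0])
    return real_import(name, *args, **kwargs)

builtins.__import__ = spying_import

# ...now exercise as much of the code under test as you can...
import json  # stdlib example; 'json' ends up in seen
print(sorted(seen))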
¹ Of course this only works if importlib was the 12th module loaded, and if picture.jpg is not a JPEG image but a text file whose contents are, in EBCDIC, lxml\n
I've had great results with pipreqs, which will automatically generate a requirements.txt file from your source code.
pipreqs /home/project/location
Successfully saved requirements file in /home/project/location/requirements.txt
I wrote a tool, realreq, specifically for this issue.
You can install it using pip: python3 -m pip install realreq. Using it is as easy as:
realreq -s /path/to/your/source
It will then gather the dependencies actually used in your source code.
I mean, the most effective way would honestly be to go through the code line by line and determine what packages may not be needed, what packages need updates, etc. I know Python 2 and 3 both have ModuleFinder, which finds all the modules a script needs to successfully compile and run, but I've never used it before, so I'm not sure how effective it is, especially for what you're doing. However, if you're interested, I'll attach the link below.
https://docs.python.org/3/library/modulefinder.html
Related
I want to know if I can ship a Python script with a folder, in the same directory, containing all the assets of a Python module, so that when someone wants to use the script, they would not have to pip install the module, because it would be imported from the directory.
Yes, you can, but it doesn't mean that you should.
First, ask yourself who is supposed to use that code.
If you plan to give it to consumers, it would be a good idea to use a tool like py2exe and create an executable file, which would include all the modules and not allow the code to be changed.
If you plan to share it with another developer, you might want to look into virtual environments and a requirements.txt file.
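A typical workflow looks something like this (run from the project directory; the .venv name is just the usual convention):

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt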
There are multiple reasons why sharing modules this way is a bad idea:
It is harder to update the modules later, at least without upgrading the whole project.
It uses more space in version control, which can create issues on huge projects with hundreds of modules and branches.
It might be illegal, as some licenses specifically forbid including their code in your source code.
The pip install of some module might do different things depending on operating system version or installed packages. The modules on your machine might be suboptimal on someone else's machine, and in some instances might not even work.
And probably more that I can't think of right now.
The only situation where I saw this being unavoidable was when the module didn't support the Python implementation the application was running on. The module was changed, and its source was put under a lib folder with the rest of the libraries.
I think you can add the directory with the Python modules to PYTHONPATH. Then people who want to use those modules just need to have this environment variable set.
https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPATH
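For example (the directory here is hypothetical):

export PYTHONPATH=/home/me/shared_modules:$PYTHONPATH
python script.py

Directories listed in PYTHONPATH end up on sys.path ahead of site-packages, so import will search them first.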
I was playing with Python and Machine Learning. During the experimental phase, I added more and more stuff to the requirements.txt file.
Now that I know what code I want to keep, I have deleted those experiments which are not helpful. Thus, some requirements have become obsolete. I'm now looking for a way to clean up my requirements.txt file.
Initially I thought I could just clear the file entirely and then go through the import statements and let PyCharm add them. However, that's not a good idea, because it would just add the latest version of each library, which isn't always possible; I need some libraries pinned to specific versions.
Is there a good way to clean up the requirements? I'm thinking of an action similar to "Optimize imports" (PyCharm) or "remove unused variable" or "Remove unused libraries" (Resharper).
I think I found a PyPI package that could be useful: pip_check_reqs.
It looks like there is a tool in it, pip-extra-reqs, that "would find anything that is listed in requirements.txt but that is not imported by sample".
I guess in this example "sample" is your Python module:
pip-extra-reqs --ignore-file=sample/tests/* sample
I would give that package a try.
Users should install our Python package via pip, or clone it from a GitHub repo and install it from source. Users should not be running import Foo from within the source tree directory, for a number of reasons, e.g. C extensions are missing (numpy has the same issue: read here). So, we want to check if the user is running import Foo from within the source tree, but how can we do this cleanly, efficiently, and robustly, with support for Python 3 and 2?
Edit: Note the source tree here is defined as where the code is downloaded to (e.g. via git or from the source archive), in contrast with the installation directory, where the code is installed to.
We considered the following:
Check for setup.py, or another file like PKG-INFO, which should only be present in the source. It's not that elegant, and checking for the presence of a file is not very cheap, given this check will happen every time someone runs import Foo. Also, there is nothing to stop someone from putting a setup.py outside the source tree, in their lib/python3.X/site-packages/ directory or similar.
Parse the contents of setup.py for the package name, but that also adds overhead, and setup.py is not that clean to parse.
Create a dummy flag file that is only present in the source tree.
Some clever, but likely overcomplicated and error-prone, ideas like modifying Foo/__init__.py during installation to note that we are now outside of the source tree.
Since you mention numpy in your comments, and that you want to do it like they do but don't fully understand how, I figured I would break that down and see if you could implement a similar process.
__init__.py
The error you are seeking starts here, which is what you linked to in your comments and answers, so you already know that. It's just attempting an import of __config__.py and failing if it isn't there or can't be imported.
try:
    from numpy.__config__ import show as show_config
except ImportError:
    msg = """Error importing numpy: you should not try to import numpy from
    its source directory; please exit the numpy source tree, and relaunch
    your python interpreter from there."""
    raise ImportError(msg)
So where does the __config__.py file come from then and how does that help? Let's follow below...
setup.py
When the package is installed, setup is run, and it in turn performs some configuration actions. This is essentially what ensures that the package is properly installed rather than being run from the download directory (which I think is what you want to ensure).
The key here is this line:
config.make_config_py() # installs __config__.py
misc_util.py
That is imported from distutils/misc_util.py which we can follow all the way down to here.
def make_config_py(self, name='__config__'):
    """Generate package __config__.py file containing system_info
    information used during building the package.

    This file is installed to the package installation directory.
    """
    self.py_modules.append((self.name, name, generate_config_py))
That is then run here, writing out the __config__.py file with some system information and your show() function.
Summary
The import of __config__.py is attempted and fails, which generates the error you want to raise if setup.py wasn't run; running setup.py is what triggers that file to be created in the first place. That ensures not only that a file check is being done, but that the file only exists in the installation directory. There is still some overhead from importing an additional file on every import, but no matter what you do, you are adding some amount of overhead by making this check at all.
Suggestions
I think that you could implement a much lighter weight version of what numpy is doing while accomplishing the same thing.
Remove the distutils subfunction and create the checked file within your setup.py file as part of the standard install. It would only exist in the installed directory after installation, and never elsewhere unless a user faked it (in which case they could probably get around just about anything you try anyway).
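A minimal sketch of that idea, assuming a package named foo and a plain setup.py install (all names here are hypothetical, not numpy's actual mechanism, and wheel-based installs would need a different hook):

# setup.py
import os
from setuptools import setup
from setuptools.command.install import install

class InstallWithMarker(install):
    def run(self):
        install.run(self)
        # Drop a marker module into the *installed* copy of the package;
        # the source tree itself never contains this file.
        marker = os.path.join(self.install_lib, 'foo', '_installed.py')
        with open(marker, 'w') as f:
            f.write('# written at install time\n')

setup(name='foo', packages=['foo'], cmdclass={'install': InstallWithMarker})

Then foo/__init__.py mirrors numpy's check:

try:
    from foo import _installed  # only exists after installation
except ImportError:
    raise ImportError("foo should not be imported from its source tree; "
                      "install it first and run Python from another directory.")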
As an alternative (without knowing your application and what your setup file is doing): possibly you have a function that is normally imported anyway, one that isn't key to running the application but is good to have available (in numpy's case, these are functions with information about the installation, like version()). Instead of keeping those functions where you have them now, you make them part of this generated file. Then you are at least loading something that you would otherwise load anyway from somewhere else.
Using this method, you are either importing something no matter what, which has some overhead, or raising the error. As far as ways to raise an error because users aren't working out of the installed directory go, it's a pretty clean and straightforward one. Whatever method you use, there will be some overhead, so I would focus on keeping that overhead low, simple, and unlikely to cause errors.
I wouldn't do something complicated like parsing the setup file or modifying necessary files like __init__.py somewhere. I think you are right that those methods would be more error-prone.
Checking if setup.py exists could work, but I would consider it less clean than attempting an import, which is already optimized as a standard Python operation. They accomplish similar things, but I think implementing it numpy-style is going to be more straightforward.
New to Python, so excuse my lack of specific technical jargon. Pretty simple question really, but I can't seem to grasp or understand the concept.
It seems that a lot of modules require using pip or easy_install and running setup.py to "install" into your Python installation or your virtualenv. What is the difference between installing a module and simply taking it and importing it into another script? It seems that you access the modules the same way.
Thanks!
It's like the difference between:
Uploading a photo to the internet
Linking the photo URL inside an HTML page
Installing puts the code somewhere python expects those kinds of things to be, and the import statement says "go look there for something named X now, and make the data available to me for use".
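For instance, you can ask Python where "there" is on your own machine:

import sys, site
print(site.getsitepackages())  # where installed packages live for this interpreter
print(sys.path)                # every place the import statement will look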
For a single module, it usually doesn't make any difference. For complicated webs of modules, though, an installation program may do many things that wouldn't be immediately obvious. For example, it may also copy data files into locations where the new modules can find them, put executables (binary libraries, or DLLs on Windows, for example) where the new modules can find them, do different things depending on which version of Python you have, and so on.
If deploying a web of modules were always easy, nobody would have written setup programs to begin with ;-)
I made a Python module (https://github.com/Yannbane/Tick.py) and a Python program (https://github.com/Yannbane/Flatland.py). The program imports the module, and without it, it cannot work. I intended for people to download both of these files before they can run the program, but I am a bit concerned about this.
In the program, I've added these lines:
import sys
sys.path.append("/home/bane/Tick.py")
import tick
"/home/bane/Tick.py" is the path to my local repo of the module that needs to be included, but this will obviously be different to other people! How can I solve this situation better?
What Lattyware suggested is a viable option. However, it's not uncommon to have core dependencies bundled with the main program (Django and PyDev do this, for example). This works fine, especially if the main code is tweaked against a specific version of the library.
In order to avoid the maintenance troubles mentioned by Lattyware, you should look into git submodules, which allow precisely this kind of layout while keeping code versioning sane.
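For example (the vendor/tick path is just one possible layout):

git submodule add https://github.com/Yannbane/Tick.py vendor/tick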
From the structure of your directory, it seems that both files live in the same directory. This might be a tell-tale sign that they are two modules of the same package. In that case, you should simply add an empty file called __init__.py to the directory, and then your import could work as:
import bane.tick
or
from bane import tick
Oh, and yes... you should use lower case for module names (it's worth taking an in-depth look at PEP 8 if you are going to code in Python! :)
HTH!
You might want to try submitting your module to the Python Package Index; that way people can easily install it (pip install tick) into their path, and you can just import it without having to add it to the Python path.
Otherwise, I would suggest simply telling people to download the module as well, and place it in a subdirectory of the program. If you really feel that is too much effort, you could place a copy of the module into the repository for the program (of course, that means ensuring you keep both versions up-to-date, which is a bit of a pain, although I imagine it may be possible just to use a symlink).
It's also worth noting your repo name is a bit misleading; capitalisation is often important, so you might want to call the repo tick.py to match the module and Python naming conventions.