Can I zip PySpark dependencies containing some setuptools.Extension?

I am attempting to include the dateparser package for a PySpark (v2.4.3) shell session by a short little zip build process pip install -r requirements.txt -t some_target && cd some_target && zip -r ../deps.zip . && cd .., after which I would, for example, pyspark --py-files deps.zip. When importing dateparser, though, I get an indirect ModuleNotFoundError from the regex library, whining that "No module named 'regex._regex'" (stack trace says this is referenced in /mnt/tmp/spark-some/long/path/deps.zip/regex/_regex_core.py line 21, which is of course referenced much farther up the stack by dateparser).
I attempted adding a flag to the dateparser line in requirements.txt like dateparser --no-binary=regex, but the error persisted. A normal python shell is able to import without issue, and other packages in this zip seem to be importable in PySpark shell without issue. This has led me down a number of rabbit holes, but I think/hope I have finally found the culprit: namely, that regex._regex is not a normal .py file, but rather a .so. My knowledge of python build process is limited, but it seems that regex library's setup.py uses the setuptools.Extension class to compile some C files into this shared object. I have seen suggestions to modify LD_LIBRARY_PATH environment variable in order to make those shared objects discoverable to python, but a number of comments also suggested this was dangerous and not a viable long-term solution. The fact that a normal python interactive session has no issue with the import also has me skeptical, since the LD_LIBRARY_PATH variable doesn't even exist in os.environ within that interactive shell. I'm thence left wondering if --py-files is insufficient for including packages that compile these Extension objects (seems unlikely, since there are a lot of people doing crazier things than my simple use case), or if this actually stems from neglecting some other setting.
Many thanks for any and all help :)

The error appears to stem from the import statements not being able to recognize binary (.so) files within a zip archive, i.e., the dependencies.zip that I pass with the --py-files parameter. I first tried pulling out regex dependency and building a .whl to include in --py-files, to discover that my version of PySpark (v2.4.3) predates wheel support. I was, however, able to build an .egg based on the source code, then set PYTHON_EGG_CACHE and PYTHON_EGG_DIR env variables for spark.executorEnv and spark.driverEnv... Not sure if the last step would be necessary for others; it seems to have stemmed from weird permissions issues that may just apply to my user/group/use case.
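For anyone hitting the same thing: the zipimport documentation states that dynamic modules (.pyd, .so) cannot be imported from ZIP archives, which matches the behaviour described above. Below is a minimal sketch, not part of the original answer, for checking a dependency zip before shipping it with --py-files. The function name find_native_modules and the suffix list are my own assumptions, and the list is not exhaustive.

```python
import zipfile

# Suffixes that usually indicate a compiled extension module.
# This list is an assumption for illustration, not an exhaustive one.
NATIVE_SUFFIXES = (".so", ".pyd", ".dylib")


def find_native_modules(zip_path):
    """Return the archive members that look like compiled extension modules.

    `zip_path` may be a filename or a file-like object, as accepted by
    zipfile.ZipFile.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.endswith(NATIVE_SUFFIXES)
        ]


# Example call (deps.zip is the archive name used in the question):
#     find_native_modules("deps.zip")
```

Any member this reports (regex/_regex in the question's case) would need to be shipped some other way, such as the .egg approach described above.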

Related

How to update source files for pytest?

pytest appears to be using old source code and failing tests because of it. I'm not sure how to update it.
Test code:
from nba_stats import league

class TestLeaders:
    def test_default(self):
        leaders = league.Leaders()
        print(leaders)
Source code (league.py):
from nba_stats.nba_api import NbaAPI
from nba_stats import constants
class Leaders:
    ...
When I run pytest on my parent directory, I get an error that refers to an old import statement.
_____________________________ ERROR collecting test/test_league.py ______________________________
ImportError while importing test module '/home/mfb/src/nba_stats/test/test_league.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
test_league.py:1: in <module>
    from nba_stats import league
../../../.virtualenvs/nba_stats_dev/lib/python3.6/site-packages/nba_stats/league.py:1: in <module>
    from nba_stats import _api_scrape, _get_json
E   ImportError: cannot import name '_api_scrape'
I tried resetting my virtualenvironment and also reinstalling my package via pip. What do I need to do to tell it to see the new import statement and why is this happening?
Edit: Deleting my virtual environment completely and then creating a new one seemed to fix it, but it seems to be a recurring issue with any further source code changes. Surely there must be a way to not have to reset my virtualenvironment each time?

Looks like you installed that package (possibly as a dependency through something else if not directly) and also have it cloned locally for development. You can look into local editable installs (https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs), but personally, I prefer to make the test refer directly to the package under which it is being run, since then it can be used "as-is" after cloning it. Do this by modifying sys.path in your test_league.py. I.e., assuming the Python code lives under python/nba_stats, with python in the parent directory of test, add
import os, sys
sys.path = [os.path.join(os.pardir, 'python')] + sys.path
at the top of test_league.py. This puts your local package up front and import will consider it first. Note that os.pardir is resolved relative to the current working directory, so this form assumes pytest is run from inside the test directory.
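Since a relative sys.path entry depends on where pytest is started from, here is a rough sketch of a variant that anchors the path to the test file itself. The helper names local_package_dir and prepend_to_sys_path are made up for illustration, and the python subdirectory is the same assumed layout as above.

```python
import os
import sys


def local_package_dir(test_file, subdir="python"):
    """Return the absolute path of <parent of the test directory>/<subdir>."""
    test_dir = os.path.dirname(os.path.abspath(test_file))
    return os.path.join(os.path.dirname(test_dir), subdir)


def prepend_to_sys_path(path):
    """Put `path` at the front of sys.path unless it is already first."""
    if not sys.path or sys.path[0] != path:
        sys.path.insert(0, path)


# At the top of test_league.py one would call:
#     prepend_to_sys_path(local_package_dir(__file__))
```

Because the path is derived from __file__, it should point at the same directory regardless of the directory pytest is invoked from.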
EDIT:
Since you tried and it still did not work (please do make sure that the snippet above does point to the local python package as in the actual structure; the above is just a common one but you may have a different structure), here is how you can see which directories are considered in order, and which are eventually selected:
python -vv -m pytest -svx
You will be able to see if there are spurious directories in sys.path, whether the one tried (as in the snippet above) matches as expected or not, any left-over .pyc files that get picked up, etc.
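If the -vv output is too noisy, a quicker check is to ask the import system where it would load the package from. This is only a sketch using the standard importlib.util.find_spec; the helper name module_origin is my own.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Example call for the package in the question:
#     print(module_origin("nba_stats"))
```

If the printed path is under site-packages rather than your source checkout, the installed copy is shadowing the local one.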
EDITv2: Since you stated that python -m pytest works but pytest does not, have a look at where that pytest executable is coming from with which pytest. Likely it's a system one that refers to a different python than the one in your virtualenv. To see which python it picks up, do:
cat `which pytest`
and look at the top line.
If that is not the same as what which python gives you (with your desired virtualenv activated), you may have to install pytest for that current virtualenv (python -m pip install pytest).

Why doesn't ./configure work in python setup.py?

When using subprocess.call(["./configure"]) and then subprocess.call(["make"]) in a python setup.py file, why might autotools look for the wrong version of automake? We are calling:
$ python setup.py install
....
WARNING: 'automake-1.13' is missing on your system.
You should only need it if you modified 'Makefile.am' or
'configure.ac' or m4 files included by 'configure.ac'.
The 'automake' program is part of the GNU Automake package:
<http://www.gnu.org/software/automake>
It also requires GNU Autoconf, GNU m4 and Perl in order to run:
<http://www.gnu.org/software/autoconf>
<http://www.gnu.org/software/m4/>
<http://www.perl.org/>

Short answer: turn AM_MAINTAINER_MODE off with --disable-maintainer-mode.
Long answer: Despite the version difference, it should not error out since it works fine on the command line. Something with the Python packaging process is interfering.
When you do
$ python setup.py sdist
the setuptools module creates hard links, makes a tar archive from that, then deletes the hard links. During this linking process, the timestamps on the files have been modified and don't match the original modification times, creating the illusion that some of the source files have been modified.
When the Makefile is run, it notices the timestamp difference. If AM_MAINTAINER_MODE is enabled, it runs the missing script. This script then detects the difference in versions of aclocal, causing make to error out.
Passing the --disable-maintainer-mode option to the configure script should suppress the invocation of the missing script and allow the build to succeed:
import subprocess
# check_call raises CalledProcessError if a step fails, unlike call
subprocess.check_call(["./configure", "--disable-maintainer-mode"])
subprocess.check_call(["make"])
(See here for more information about automake's maintainer mode. Apparently the timestamp business is also a problem with users of CVS.)
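To see whether timestamps are what triggers the rebuild, you can compare modification times the same way make does. A minimal sketch follows; the helper name looks_modified is invented here, and which file pairs to compare (for example configure.ac against configure) is an assumption based on the files named in the warning.

```python
import os


def looks_modified(source, generated):
    """Return True if `source` has a newer mtime than `generated`.

    This mirrors the basic rule make applies: a target is considered
    out of date when one of its prerequisites is newer than it.
    """
    return os.path.getmtime(source) > os.path.getmtime(generated)


# Example call inside an unpacked sdist:
#     looks_modified("configure.ac", "configure")
```

If this returns True for files you never edited, the packaging step has most likely disturbed the timestamps as described above.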

Python Pylearn2 package "ImportError: No module named pylearn2.utils"

I have recently tried to use pylearn2, a deep machine learning package for Python developed at the University of Montreal.
I've just installed it and tried to run a simple example, but it did not work.
I have been using a PC with an Ubuntu 13.10 system, on which I found ipython installed.
I have installed Theano and later pylearn2 by following the instructions on this webpage:
http://deeplearning.net/software/pylearn2/
I have also modified the .bashrc file, as suggested.
I thought that everything went well, and then I tried this Quick start example:
http://deeplearning.net/software/pylearn2/tutorial/index.html
I stopped at the first command:
python make_dataset.py
My terminal states:
Traceback (most recent call last):
File "make_dataset.py", line 14, in <module>
Do you have any ideas on why it is not working?
Do you know why these errors occur?
Thanks a lot
EDIT: line 14 is the first non-commented line of the file. It states
from pylearn2.utils import serial

Without more information, I can only guess, but my first guess is…
You haven't actually installed pylearn2, because if you follow the linked docs to grab the git repo and add a PYLEARN2_DATA_PATH variable, nothing gets installed into site-packages (or dist-packages or anywhere else on sys.path).
This means that pylearn2 will only work when you start Python from within the top-level directory of the pylearn2 repo.
So, if you run a script like this:
$ cd /path/to/pylearn2
$ cd scripts/tutorials/grbm_smd/
$ python make_dataset.py
… it won't actually work.
It looks like there is a setup.py file in the repository. Does it work? I have no idea. Even though the docs don't mention using it, you might want to try. Either this:
$ pip install .
… or, if you don't have pip or it doesn't work on this package:
$ python setup.py install
Either way, of course, you may need sudo or a flag to install to your user site-packages instead of system, etc., as with any other Python package.
If that doesn't work, you might be able to just add /path/to/pylearn2 to your sys.path in some way. The most obvious way is by doing an export PYTHONPATH=/path/to/pylearn2:$PYTHONPATH in your ~/.bashrc.
Also, you will need to either source ~/.bashrc or create a new shell to get any effects of modifying the file.
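If you want to confirm that the PYTHONPATH change actually reaches the interpreter, something like the following sketch can help. It starts a fresh Python with PYTHONPATH set and returns that child's sys.path. The helper name child_sys_path is made up, and /path/to/pylearn2 is the same placeholder as above.

```python
import os
import subprocess
import sys


def child_sys_path(extra_dir):
    """Start a fresh interpreter with PYTHONPATH=extra_dir; return its sys.path."""
    env = dict(os.environ, PYTHONPATH=extra_dir)
    output = subprocess.check_output(
        [sys.executable, "-c", "import sys; print('\\n'.join(sys.path))"],
        env=env,
        universal_newlines=True,
    )
    return output.splitlines()


# Example call:
#     "/path/to/pylearn2" in child_sys_path("/path/to/pylearn2")
```

If the directory shows up in the returned list, a new shell that has sourced the updated ~/.bashrc should see it too.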
If you're wondering why the instructions and the tutorial together don't give you enough information to make this work without a lot of hassle, I think that's covered in the very top of the documentation:
Pylearn2 is still undergoing rapid development. Don’t expect a clean road without bumps!
And the very fact that there is no PyPI download yet implies that this really is not ready for novices to use. If you don't know enough about using Python packages (and bash basics) to muddle through on your own, there's a good chance you won't be able to use this package.

Python installation, failing to find bz2 module

OK, I have an old Debian VM. Package managers are useless. No, I'm not going to update the OS.
I have the bzip2 lib and development headers installed correctly on my system (those actually came from a package).
I start with absolutely NO Python on the system. I removed everything manually. I downloaded the Python 2.7.5 source, and configured with ./configure --prefix=/usr. It configures fine. I run make, and it compiles fine. I try ./python -c "import bz2; print bz2.__doc__", it works, and says:
The python bz2 module provides a comprehensive interface for
the bz2 compression library. It implements a complete file
interface, one shot (de)compression functions, and types for
sequential (de)compression.
I then run make test and the whole test suite progresses fine, and notably the "test_bz2" test passes.
I then run make install, which installs my new Python binary into /usr/bin/ like I wanted.
I try /usr/bin/python -c "import bz2; print bz2.__doc__", and it fails with:
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: No module named bz2
I've tried a bunch of different things, including building Python as --enable-shared and not, no luck. I've tried at least 10 times (each time totally cleaning out everything, running make distclean, etc.). No luck.
I tried: PYTHONPATH="/usr/lib/python2.7"; export PYTHONPATH. Still no luck.
HOWEVER, if I delete the symlink that make install creates for /usr/bin/python, and instead do: ln -s /path/to/my/python/compile/python python, NOW it magically works.
So, what the heck? Why is this Python binary I'm getting created only able to find stuff when the binary exists in the compile directory, and not when it's put into normal production install location? What am I missing?
I am root during the entire process, from configure to make to make install to trying to test the Python import call.
I have started from scratch again (this time compiling with --enable-shared btw), and verified that not only in the compile directory is there build/lib.linux-x86_64-2.7/bz2.so, but once I run make install, that file is put into /usr/lib/python2.7/lib-dynload/bz2.so.
I've tried to do some reading on lib-dynload, but haven't been able to determine if there's something else a Python program (like default configuration for the CLI or whatever) would need to be able to tell it to pull module imports from lib-dynload, or if there's some other place or option to tell the make install where it should be putting it instead of dynload.
Still I have no explanation why the /path/to/compilation/python binary can find and load bz2.so fine, but the /usr/bin/python binary can't find (or load) /usr/lib/python2.7/lib-dynload/bz2.so.
I thought maybe it was something to do with the fact that the installation doesn't create like a /usr/lib/python symlink to point at /usr/lib/python2.7 directory. But I created the symlink and still no go.
I am still lost here.

It would appear that a sort of non-answer answer was arrived at accidentally via a long string of Twitter conversation(s).
I've filed another Stack Overflow question here to ask WHY what we found was the solution to this problem: https://stackoverflow.com/questions/17662091/python-installation-prefix-not-being-persisted-in-config
For posterity's sake, right now the solution is that I have to set the PYTHONHOME environment variable to /usr, and everything starts working. The puzzling part is that the documentation says PYTHONHOME should default to {prefix}, which I was clearly setting as default during configure to /usr. So why should I have to manually set it?
Running python-config --prefix reveals that the {prefix} default is in fact /usr/bin, NOT /usr like I specified, which leads to me needing to override the default back to the default, bizarrely.
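For anyone debugging something similar, here is a small sketch that prints what the running interpreter believes its prefix is and whether a lib-dynload directory exists under it. The function name install_layout is my own, and the lib/pythonX.Y/lib-dynload layout is the conventional Unix one, which may differ on other platforms or distributions.

```python
import os
import sys


def install_layout():
    """Collect the paths this interpreter uses to locate its standard library."""
    version = "python{}.{}".format(*sys.version_info[:2])
    # Conventional Unix location of compiled stdlib modules such as bz2;
    # other platforms and distributions may lay this out differently.
    dynload = os.path.join(sys.exec_prefix, "lib", version, "lib-dynload")
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "exec_prefix": sys.exec_prefix,
        "PYTHONHOME": os.environ.get("PYTHONHOME"),
        "lib-dynload": dynload,
        "lib-dynload exists": os.path.isdir(dynload),
    }


# Example use:
#     for key, value in install_layout().items():
#         print(key, "=", value)
```

Running it with both the build-directory binary and the installed /usr/bin/python should make a mismatched prefix visible right away.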

Why can't Python find shared objects that are in directories in sys.path?

I'm trying to import pycurl:
$ python -c "import pycurl"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: libcurl.so.4: cannot open shared object file: No such file or directory
Now, libcurl.so.4 is in /usr/local/lib. As you can see, this is in sys.path:
$ python -c "import sys; print(sys.path)"
['', '/usr/local/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg',
'/usr/local/lib/python25.zip', '/usr/local/lib/python2.5',
'/usr/local/lib/python2.5/plat-linux2', '/usr/local/lib/python2.5/lib-tk',
'/usr/local/lib/python2.5/lib-dynload',
'/usr/local/lib/python2.5/sitepackages', '/usr/local/lib',
'/usr/local/lib/python2.5/site-packages']
Any help will be greatly appreciated.

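sys.path is only searched for Python modules. Dynamically linked libraries are located by the system's dynamic loader, which does not consult sys.path; the directory has to be one the loader searches, for example one listed in LD_LIBRARY_PATH. Check if your LD_LIBRARY_PATH includes /usr/local/lib, and if it doesn't, add it and try again.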
Some more information (source):
In Linux, the environment variable LD_LIBRARY_PATH is a colon-separated set of directories where libraries should be searched for first, before the standard set of directories; this is useful when debugging a new library or using a nonstandard library for special purposes. The environment variable LD_PRELOAD lists shared libraries with functions that override the standard set, just as /etc/ld.so.preload does. These are implemented by the loader /lib/ld-linux.so. I should note that, while LD_LIBRARY_PATH works on many Unix-like systems, it doesn't work on all; for example, this functionality is available on HP-UX but as the environment variable SHLIB_PATH, and on AIX this functionality is through the variable LIBPATH (with the same syntax, a colon-separated list).
Update: to set LD_LIBRARY_PATH, use one of the following, ideally in your ~/.bashrc or equivalent file:
export LD_LIBRARY_PATH=/usr/local/lib
or
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
Use the first form if it's empty (equivalent to the empty string, or not present at all), and the second form if it isn't. Note the use of export.
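To check from Python whether the loader can actually find a library, the standard ctypes module can be used. This is just a sketch; the helper name loader_can_find is invented, and ctypes.util.find_library uses its own platform-specific lookup, so treat a negative result as a hint rather than proof.

```python
import ctypes
import ctypes.util


def loader_can_find(name):
    """Return True if a shared library called `name` can be located and loaded.

    `name` is given without the "lib" prefix or any suffix, e.g. "curl".
    """
    path = ctypes.util.find_library(name)
    if path is None:
        return False
    try:
        ctypes.CDLL(path)
    except OSError:
        return False
    return True


# Example call for the library in the question:
#     loader_can_find("curl")
```

Running it before and after changing LD_LIBRARY_PATH (in a new process each time) shows whether the change had any effect.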

Ensure your libcurl.so module is in the system library path, which is distinct and separate from the python library path.
A "quick fix" is to add this path to an LD_LIBRARY_PATH variable. However, setting that system wide (or even account wide) is a BAD IDEA, as it is possible to set it in such a way that some programs will find a library they shouldn't, or even worse, open up security holes.
If your "locally installed libraries" are installed in, for example, /usr/local/lib, add this directory to /etc/ld.so.conf (it's a text file) and run ldconfig.
The command will run a caching utility, but will also create all the necessary "symbolic links" required for the loader system to function. It is surprising that the make install for libcurl did not do this already, but it's possible it could not if /usr/local/lib is not in /etc/ld.so.conf already.
PS: it's possible that your /etc/ld.so.conf contains nothing but include ld.so.conf.d/*.conf. You can still add a directory path after it, or just create a new file inside the directory it's being included from. Don't forget to run ldconfig after it.
Be careful. Getting this wrong can screw up your system.
Additionally: make sure your python module is compiled against THAT version of libcurl. If you just copied some files over from another system, this won't always work. If in doubt, compile your modules on the system you intend to run them on.

You can also set LD_RUN_PATH to /usr/local/lib in your user environment when you compile pycurl in the first place. This will embed /usr/local/lib in the RPATH attribute of the C extension module .so so that it automatically knows where to find the library at run time without having to have LD_LIBRARY_PATH set at run time.

Had the exact same issue. I installed curl 7.19 to /opt/curl/ to make sure that I would not affect current curl on our production servers.
Once I linked libcurl.so.4 to /usr/lib:
sudo ln -s /opt/curl/lib/libcurl.so /usr/lib/libcurl.so.4
I still got the same error! Durf.
But running ldconfig made the linkage for me and that worked. No need to set LD_RUN_PATH or LD_LIBRARY_PATH at all. Just needed to run ldconfig.

As a supplement to the above answers - I'm just bumping into a similar problem, and working completely off the default installed python.
When I call the example of the shared object library I'm looking for with LD_LIBRARY_PATH, I get something like this:
$ LD_LIBRARY_PATH=/path/to/mysodir:$LD_LIBRARY_PATH python example-so-user.py
python: can't open file 'example-so-user.py': [Errno 2] No such file or directory
Notably, it doesn't even complain about the import - it complains about the source file!
But if I force loading of the object using LD_PRELOAD:
$ LD_PRELOAD=/path/to/mysodir/mypyobj.so python example-so-user.py
python: error while loading shared libraries: libtiff.so.5: cannot open shared object file: No such file or directory
... I immediately get a more meaningful error message - about a missing dependency!
Just thought I'd jot this down here - cheers!

I use python setup.py build_ext -R/usr/local/lib -I/usr/local/include/libcalg-1.0 and the compiled .so file is under the build folder.
You can type python setup.py --help build_ext to see the explanations of -R and -I.

For me, what works here is using a version manager such as pyenv, which I strongly recommend to keep your project environments and package versions well managed and separate from those of the operating system.
I had this same error after an OS update, but was easily fixed with pyenv install 3.7-dev (the version I use).
