I am using Spark with Python, both interactively by launching pyspark from the terminal and by running entire scripts with spark-submit pythonFile.py.
I am using it to analyze a local CSV file, so no distributed computation is performed.
I would like to use the matplotlib library to plot columns of a dataframe. When importing matplotlib I get the error ImportError: No module named matplotlib. I then came across this question and tried the command sc.addPyFile(), but I could not find any file related to matplotlib that I could pass to it on my OS (OSX).
For this reason I created a virtual environment and installed matplotlib in it. Navigating through the virtual environment I saw there was no file such as matplotlib.py, so I tried passing the entire folder with sc.addPyFile("venv/lib/python3.7/site-packages/matplotlib"), but again with no success.
I do not know which file I should include, or how, and I have run out of ideas.
Is there a simple way to import the matplotlib library inside Spark (installing it with virtualenv or referencing the OS installation)? And if so, which *.py files should I pass to sc.addPyFile()?
Again, I am not interested in distributed computation: the Python code will run only locally on my machine.
I will post what I have done. First of all, I am working with virtualenv. I created a new one with virtualenv path.
Then I activated it with source path/bin/activate.
I installed the packages I needed with pip3 install packageName.
After that I wrote a Python script that creates a zip archive of the libraries installed by virtualenv in ./path/lib/python3.7/site-packages/.
The code of this script is the following (it zips only numpy):
import zipfile
import os

# function to archive a single package
def ziplib(general_path, libName):
    libpath = os.path.join(general_path, libName)  # path to the package directory itself
    zippath = libName + '.zip'  # some filename in a writable directory
    zf = zipfile.PyZipFile(zippath, mode='w')
    try:
        zf.debug = 3  # making it verbose, good for debugging
        zf.writepy(libpath)
        return zippath  # return path to generated zip archive
    finally:
        zf.close()

general_path = './path/lib/python3.7/site-packages/'
matplotlib_name = 'matplotlib'
seaborn_name = 'seaborn'
numpy_name = 'numpy'

zip_path = ziplib(general_path, numpy_name)  # generate zip archive containing your lib
print(zip_path)
After that, the archives must be referenced in the pyspark file myPyspark.py. You do this by calling the addPyFile() method of the SparkContext class; after that you can import the libraries as usual. In my case I did the following:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
sc.addPyFile("matplot.zip")  # generated with testZip.py
sc.addPyFile("numpy.zip")    # generated with testZip.py
import matplotlib
import numpy
When you launch the script you have to reference the zip archives with the --py-files option, which takes a single comma-separated list of files. For example:
sudo spark-submit --py-files matplot.zip,numpy.zip myPyspark.py
I used two archives because it was clear to me how to import one of them, but not two.
You can zip the matplotlib directory and pass the archive to addPyFile(). Alternatively, you can define an environment variable that includes your user packages: export PYTHONPATH="venv/lib/python3.7/site-packages/:$PYTHONPATH"
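If setting PYTHONPATH in the shell is inconvenient, the same effect can be achieved from inside the script by putting the site-packages directory on sys.path before the import runs. A minimal sketch, assuming the virtualenv layout from the question:

```python
import sys

# hypothetical path: adjust to your own virtualenv's site-packages directory
site_packages = "venv/lib/python3.7/site-packages"

# must run before "import matplotlib" is executed
sys.path.insert(0, site_packages)
```

This only affects the current process, which is enough here since no distributed computation is involved.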
Create a .py file with your code, for example:

import matplotlib.pyplot as plt
plt.<your operations>

Save the file as file.py and add it to the SparkContext:

spark.sparkContext.addPyFile("file.py")
Related
I have a Python file that I tested locally using VS Code and the shell. The file contains relative imports and worked as intended on my local machine.
After uploading the associated files to Colab I did the following:
py_file_location = '/content/gdrive/content/etc'
os.chdir(py_file_location)
# to verify whether I got the correct path
!python3
>>> import os
>>> os.getcwd()
output: '/content/gdrive/MyDrive/content/etc'
However, when I run the file I get the following error:
ImportError: attempted relative import with no known parent package
Why is that? Using the same file system and similar shell commands the file worked locally.
A solution that worked for me was to turn my relative paths into absolute ones, so instead of using:
directory = pathlib.Path(__file__).parent
sys.path.append(str(directory.parent))
sys.path.append(str(directory.parent.parent.parent))
__package__ = directory.name
I needed to make the path absolute by using resolve():
directory = pathlib.Path(__file__).resolve().parent
sys.path.append(str(directory.parent))
sys.path.append(str(directory.parent.parent.parent))
__package__ = directory.name
It is however still not fully clear to me why this is required when running on Colab but not when running locally.
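The difference is visible without Colab: when a script is launched by a bare file name, __file__ is a relative path, and .parent alone yields ".", which stops pointing at the script's directory as soon as the working directory changes. A small illustration, using a literal relative path in place of __file__:

```python
import pathlib

# stands in for a relative __file__, e.g. when a script is run as "python demo.py"
relative = pathlib.Path("demo.py")

print(relative.parent)            # "." -- relative, breaks once the cwd changes
print(relative.resolve().parent)  # absolute path of the current working directory
```

Colab's runtime changes the working directory (as the os.chdir() call above does), which is presumably why the unresolved form failed there but not locally.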
I am writing code organized in several files. For the sake of the organization of folders and the CMakeLists.txt, a pythonlibs folder is created during the build process, and links to Python files are created pointing to libraries in the /build/src/XXXX/ folder.
In the Python file, I add this folder to the Python path:
sys.path.insert(1,'/opt/hpc/softwares/erfe/erfe/build/pythonlibs')
import libmsym as msym
When I run the main Python file, the one library libmsym fails with:
import libmsym as msym
File "/opt/hpc/softwares/erfe/erfe/build/pythonlibs/libmsym.py", line 15, in <module>
from . import _libmsym_install_location, export
ImportError: attempted relative import with no known parent package
I created the link using CMake, which I believe uses the ln command (I tried both hard and symbolic links). Is there a way to prevent this behavior without changing the library itself, i.e. another way to create this link?
Thanks.
I'm trying to debug a project that has a lot of additional libraries added to PYTHONPATH at runtime before launching the python file.
I was not able to add those commands with a tasks.json file prior to debugging the Python file in Visual Studio Code (see the post Visual Studio Code unable to set env variable paths prior to debugging python file), so I'm just adding them via an os.system("..") command.
I'm only showing 1 of the libraries added below:
# Standard library imports
import os
import sys
os.system("SET PYTHONPATH=D:\\project\\calibration\\pylibrary\\camera")
# Pylibrary imports
from camera import capture
When I debug, it fails on line from camera import capture with:
Exception has occurred: ModuleNotFoundError
No module named 'camera'
File "D:\project\main.py", line 12, in <module>
from camera.capture import capture
I also tried
os.environ['PYTHONPATH'] = "D:\\project\\pylibrary\\camera", and I still get the same error.
Why is it not remembering the pythonpath while running the script?
How else can I define the pythonpath while running Visual Studio Code and debugging the project file?
I know I can add the pythonpath to env variables in windows, but it loads too many libraries and I want it to only remember the path while the python script is executed.
Thanks
Using os.system() won't work because it starts a new cmd.exe shell and sets the env var only in that shell; that won't affect the env vars of the Python process that is already running. Assigning to os.environ['PYTHONPATH'] won't work either, because at that point your Python process has already copied the value, if any, of that env var into the sys.path variable. The solution is to append the directory to sys.path directly:
import sys
sys.path.append(r"D:\project\calibration\pylibrary\camera")
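The two failure modes are easy to demonstrate in a running interpreter: mutating os.environ['PYTHONPATH'] never feeds back into sys.path, while appending to sys.path takes effect immediately. A minimal sketch with a made-up path:

```python
import os
import sys

fake = r"D:\made\up\path"        # hypothetical directory, for illustration only

os.environ["PYTHONPATH"] = fake  # too late: sys.path was already built at startup
print(fake in sys.path)          # False

sys.path.append(fake)            # this is what actually extends the module search path
print(fake in sys.path)          # True
```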
How can I write a Python program that runs all Python scripts in the current folder? The program should run in Linux, Windows and any other OS in which python is installed.
Here is what I tried:
import glob, importlib
for file in glob.iglob("*.py"):
    importlib.import_module(file)
This returns an error: ModuleNotFoundError: No module named 'agents.py'; 'agents' is not a package
(here agents.py is one of the files in the folder; it is indeed not a package and not intended to be a package - it is just a script).
If I change the last line to:
importlib.import_module(file.replace(".py",""))
then I get no error, but also the scripts do not run.
Another attempt:
import glob, os
for file in glob.iglob("*.py"):
    os.system(file)
This does not work on Windows - it tries to open each file in Notepad.
You need to run each script through the Python interpreter explicitly. To do this, prepend python3 plus a space to the name of the file you are running. The following code should work:

import os
import glob

for file in glob.iglob("*.py"):
    os.system("python3 " + file)

If you are using a version other than Python 3, just change the argument from python3 to python.
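A small refinement that avoids guessing between python and python3: sys.executable holds the full path of the interpreter currently running, and subprocess sidesteps shell quoting issues. A sketch, with the function name being my own choice:

```python
import glob
import os
import subprocess
import sys

def run_all_scripts(folder):
    """Run every *.py file in `folder` with the interpreter running this code."""
    exit_codes = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.py"))):
        # sys.executable is the path of the current Python interpreter,
        # so the same command works on Linux, Windows, and macOS
        proc = subprocess.run([sys.executable, path])
        exit_codes[path] = proc.returncode
    return exit_codes
```

Calling run_all_scripts(".") then plays the same role as the os.system loop above.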
Maybe you can make use of the subprocess module; this question shows a few options.
Your code could look like this:
import os
import subprocess

base_path = os.getcwd()
print('base_path', base_path)

# TODO: this might need to be 'python3' in some cases
python_executable = 'python'
print('python_executable', python_executable)

py_file_list = []
for dir_path, _, file_name_list in os.walk(base_path):
    for file_name in file_name_list:
        if file_name.endswith('.py'):
            # add full path, not just file_name
            py_file_list.append(
                os.path.join(dir_path, file_name))

print('PY files that were found:')
for i, file_path in enumerate(py_file_list):
    print(' {:3d} {}'.format(i, file_path))
    # call script
    subprocess.run([python_executable, file_path])
Does that work for you?
Note that the docs for os.system() even suggest using subprocess instead:
The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function.
If you have control over the content of the scripts, perhaps you might consider using a plugin technique, this would bring the problem more into the Python domain and thus makes it less platform dependent. Take a look at pyPlugin as an example.
This way you could run each "plugin" from within the original process, or using the Python::multiprocessing library you could still seamlessly use sub-processes.
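For the in-process variant, the standard library's runpy module already does most of the plugin work: it executes a file with __name__ set to "__main__", just as if it had been launched from the command line. A minimal sketch (the function name is my own):

```python
import glob
import os
import runpy

def run_all_in_process(folder):
    """Execute every *.py file in `folder` inside the current process."""
    results = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.py"))):
        # run_name="__main__" makes `if __name__ == "__main__":` guards fire;
        # run_path returns the script's resulting global namespace
        results[path] = runpy.run_path(path, run_name="__main__")
    return results
```

Note that, unlike the subprocess approach, a crashing script will take the whole runner down with it, so this suits trusted "plugins" rather than arbitrary scripts.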
This concerns the importing of my own python modules in a HTCondor job.
Suppose 'mymodule.py' is the module I want to import, and it is saved in a directory called XDIR.
In another directory called YDIR, I have written a file called xImport.py:
#!/usr/bin/env python
import os
import sys
print sys.path
import numpy
import mymodule
and a condor submit file:
executable = xImport.py
getenv = True
universe = Vanilla
output = xImport.out
error = xImport.error
log = xImport.log
queue 1
The result of submitting this is that, in xImport.out, sys.path is printed out, showing XDIR. But in xImport.error there is an ImportError saying 'No module named mymodule'. So it seems that the path to mymodule is in sys.path, but Python does not find it. I'd also like to mention that the error message says the ImportError originates from the file
/mnt/novowhatsit/YDIR/xImport.py
and not YDIR/xImport.py.
How can I edit the above files to import mymodule.py?
When Condor runs your process, it creates a directory on that machine (usually on a local hard drive) and sets that as the working directory. That's probably the issue you are seeing: if XDIR is local to the machine where you run condor_submit, then its contents don't exist on the remote machine where xImport.py runs.
Try using the submit file's transfer_input_files mechanism (see http://research.cs.wisc.edu/htcondor/manual/v7.6/2_5Submitting_Job.html) to copy mymodule.py to the remote machines.
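A sketch of the submit file from the question with file transfer enabled (option names are from the HTCondor manual; the path to mymodule.py is a placeholder you would adjust to wherever XDIR actually lives):

```
executable              = xImport.py
getenv                  = True
universe                = Vanilla
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = /path/to/XDIR/mymodule.py
output                  = xImport.out
error                   = xImport.error
log                     = xImport.log
queue 1
```

With this, mymodule.py is copied into the job's scratch working directory on the execute machine, so the plain import mymodule finds it regardless of sys.path.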