submit job with pandas in a zip file - python

I have two libraries that I want to import in my code: pandas and utils (my own library). While testing, I found that pandas does not work the same way.
Both boto3 and requests (which are not preinstalled on the cluster) work when shipped in two zip files:
libs.zip: boto3 and requests
dependencies.zip: utils
So I imported pandas the same way, using a requirements file to create a zip with all of pandas' dependencies. I've tried importing the zip file within the code, like:
sc.addPyFile("libs.zip")
and the spark-submit command looks like:
spark-submit --deploy-mode client --py-files s3://${BUCKET_NAME}/libs.zip s3://${BUCKET_NAME}/main.py
I have tried many ways to submit the Spark job to the EMR cluster, but I am stuck on this error:
Traceback (most recent call last):
File "/mnt/tmp/spark-xxxx/main.py", line 20, in <module>
import pandas as pd
File "/mnt/tmp/spark-xxxx/userFiles-xxxx/libs.zip/pandas/__init__.py", line 17, in <module>
ImportError: Unable to import required dependencies:
numpy:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python3.7 from "/usr/bin/python3"
* The NumPy version is: "1.19.4"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: No module named 'numpy.core._multiarray_umath'
How can I import pandas and another library (created by me) when submitting with spark-submit?
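The traceback indicates that numpy's compiled C extensions cannot be imported from inside libs.zip; Python's zip imports do not support extension modules, so packages like numpy and pandas generally have to be installed on the workers (for example via an EMR bootstrap action) rather than shipped with --py-files. A minimal diagnostic sketch, assuming a live SparkContext named sc, that reports which interpreter and which numpy each executor actually resolves:
# Diagnostic sketch: ask the executors which Python and numpy they see.
# Assumes a running SparkContext `sc`; nothing here is EMR-specific.
def report(_):
    import sys
    try:
        import numpy
        return (sys.executable, numpy.__version__, numpy.__file__)
    except ImportError as exc:
        return (sys.executable, "numpy import failed: %s" % exc)

print(sc.parallelize(range(2), 2).map(report).collect())
If the executors report loading numpy from a path inside libs.zip, that confirms the zip-import problem rather than a version mismatch.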

Related

CRONTAB on MAC --> ImportError: Unable to import required dependencies: numpy:

After years of reading questions and benefiting from all of your help here on Stack Overflow, I am now officially a member :)
I am building a script that uses pandas (for data handling) and want to export the data to Excel with df.to_excel(writer), using pd.ExcelWriter.
Scheduling raw Python jobs with crontab is all going fine; I tested with * * * * *, and the output from with open('x.csv', 'w') as file: ... all worked OK.
My Python version is 3.9.7 and my NumPy version is 1.21.2.
The problem arises when I use the crontab scheduler on a Python script that uses pandas; the mail says:
Traceback (most recent call last):
File "main.py", line 1, in <module>
import pandas as pd
File "/Users/THIS IS ME/Library/Python/3.8/lib/python/site-packages/pandas/__init__.py", line 16, in <module>
raise ImportError(
ImportError: Unable to import required dependencies:
numpy:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python3.8 from "/Library/Developer/CommandLineTools/usr/bin/python3"
* The NumPy version is: "1.21.2"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
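Worth noting: the traceback says cron ran Python 3.8 from /Library/Developer/CommandLineTools/usr/bin/python3, not the Python 3.9.7 where your NumPy 1.21.2 is installed. A minimal diagnostic sketch (diagnose_cron.py is a hypothetical name), scheduled with the same crontab entry, shows which interpreter cron really uses:
# diagnose_cron.py (hypothetical): schedule this exactly like the failing
# script and read the cron mail to see which interpreter actually ran it.
import sys
print("executable:", sys.executable)
print("version:", sys.version)
Pinning the absolute path of the intended interpreter in the crontab entry, rather than relying on cron's minimal PATH, is the usual fix.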

How does one update Python through the terminal on WinSCP?

I am trying to run a script which involves numpy through the terminal on WinSCP, but whenever I do, I get the following error:
import gensim
File "/data/work/worker/gensim/__init__.py", line 5, in <module>
from gensim import parsing, corpora, matutils, interfaces, models, similarities, summarization, utils  # noqa:F401
File "/data/work/worker/gensim/parsing/__init__.py", line 4, in <module>
from .preprocessing import (remove_stopwords, strip_punctuation, strip_punctuation2,  # noqa:F401
File "/data/work/worker/gensim/parsing/preprocessing.py", line 42, in <module>
from gensim import utils
File "/data/work/worker/gensim/utils.py", line 38, in <module>
import numpy as np
File "/data/work/worker/numpy/__init__.py", line 142, in <module>
from . import core
File "/data/work/worker/numpy/core/__init__.py", line 50, in <module>
raise ImportError(msg)
ImportError:
ImportError:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python2.7 from "/usr/bin/python"
* The NumPy version is: "1.18.5"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: No module named _multiarray_umath
I suspect that the issue is that the version of Python quoted here is outdated; however, after some searching, I cannot find any literature on how to update Python through WinSCP. I have Python 3.8 installed on my machine, and I have tried moving the installer and the .exe file into the WinSCP directory, to no avail. Is there any way to update Python directly in the terminal? Alternatively, does this issue actually have nothing to do with a stale version of Python at all?
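For what it's worth, the traceback itself shows the script running under Python 2.7 from /usr/bin/python on the remote host (WinSCP's terminal runs commands on the server, not on your local Windows machine), with numpy picked up from /data/work/worker/. A small diagnostic sketch, run from that same shell with the same command as the failing script, makes the mismatch visible (env_check.py is a hypothetical name):
# env_check.py (hypothetical): run this the same way as the failing script
# to see which interpreter and which numpy copy actually get picked up.
import sys
print(sys.version)
print(sys.path)   # is /data/work/worker ahead of the system site-packages?
import numpy      # raises here if the C extensions still fail to load
print(numpy.__file__)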

How to import numpy through xlwings package? "ImportError: DLL load failed: The specified module could not be found."

I'm trying to use the 'Run Python' function of xlwings to run Python code through VBA.
I have been using Spyder to execute my code and it runs with no errors.
When trying to run this from VBA with the xlwings package I receive:
"ImportError: DLL load failed: The specified module could not be found."
and this error relates to the numpy package.
I tried uninstalling and reinstalling Anaconda and running pip install numpy.
I checked that I have the most up-to-date version of xlwings, 0.15.8.
I found this thread https://github.com/xlwings/xlwings/issues/954 stating this issue was fixed with version 0.15.7 of xlwings.
VBA code:
RunPython ("import Demand; Demand.calibrate_Demand()")
Spyder code:
import numpy as np
import xlwings as xw
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.arima_model import ARMA
from statsmodels.tsa.arima_model import ARMAResults
from matplotlib import pyplot as plt
import datetime

def calibrate_Demand():
My Python file is called Demand.py
When executing my VBA code I receive the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "e:\julia\calibration automation\Demand.py", line 17, in <module>
import numpy as np
File "C:\Users\julia\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\__init__.py", line 140, in <module>
from . import _distributor_init
File "C:\Users\julia\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\_distributor_init.py", line 34, in <module>
from . import _mklinit
ImportError: DLL load failed: The specified module could not be found.
If I place import pandas as pd first (before importing numpy), I receive this error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "e:\julia\calibration automation\Demand.py", line 19, in <module>
import pandas as pd
File "C:\Users\julia\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\__init__.py", line 19, in <module>
"Missing required dependencies {0}".format(missing_dependencies))
ImportError: Missing required dependencies ['numpy']
I've been trying for a couple of weeks to fight this problem too.
I have a company-owned laptop which is severely restricted in which environment variables I have access to, and PATH is not one of them.
I have no trouble linking Excel with Python (a Spyder/Anaconda3 installation), but as soon as I add import numpy as np to my Python code, I get the same DLL load failed error. (It didn't always do this; it just started doing it overnight for no reason I can fathom, but I digress.) The code itself works fine in Spyder, though.
I did eventually find something of a workaround, but it's a bit of a last resort, being more than a bit awkward and daft. Maybe it's useful for a developer to track down a more sensible solution, though?
Anyway, I found that if I launch Excel from Spyder, the numpy module imports correctly and remains stable. To do this, I just access the contents of one of the workbook's cells without having Excel already open. This launches Excel, opens the workbook, and numpy imports correctly.
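A minimal sketch of that workaround, assuming xlwings is importable in the Anaconda environment (the workbook path below is hypothetical):
# Run from Spyder while Excel is closed. Touching a cell forces xlwings to
# launch Excel from inside this Python environment, after which numpy's
# DLLs resolve when RunPython is used later.
import xlwings as xw

wb = xw.Book(r"e:\julia\calibration automation\demand.xlsm")  # hypothetical path
_ = wb.sheets[0].range("A1").value  # reading a cell launches Excel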
What do you think?

Python, Pandas, and Pico: Unable to import Pandas, but NumPy is no problem

I've managed to construct a simple app using the Pico framework (https://github.com/fergalwalsh/pico). My frontend connects to my backend without any difficulty. Below is my Python file, which at the moment simply returns/renders a string, using a client-side input value, "name".
from __future__ import absolute_import
import sys
import pico
import numpy as np
# import sklearn
# import pandas as pd
from api2 import aloha
from pico import PicoApp

@pico.expose()
def hello(name):
    a = np.arange(15).reshape(3, 5)
    # a = np.arrange('data', 'field').reshape(3,5)
    return "hello %s, %s" % (name, a)

app = PicoApp()
app.register_module(__name__)
(It also returns a NumPy array, simply because I'm testing what I can import into the file.)
All my packages are installed just fine, via Anaconda in /site-packages, which is in the python3.6 directory.
Oddly, the app runs fine; it can import NumPy. It breaks, however, when I try to import Pandas or SKLearn. I've tried manually copying and pasting NumPy into /Library/Python/2.7/site-packages, which actually breaks the app. But NumPy works in the app when it is only located in Anaconda's /site-packages.
I've tried altering app.register_module(__name__) to app.register_module('api'), which is the name of the Python file (api.py), based on another question/answer here. I've also tried reinstalling pandas with sudo -H pip install pandas, but all the requirements are already satisfied.
This is the error that is thrown when I try to include Pandas in api.py:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/Library/Python/2.7/site-packages/pico/server.py", line 31, in <module>
app = import_string(module_name)
File "/Library/Python/2.7/site-packages/werkzeug/utils.py", line 443, in import_string
sys.exc_info()[2])
File "/Library/Python/2.7/site-packages/werkzeug/utils.py", line 431, in import_string
module = import_string(module_name)
File "/Library/Python/2.7/site-packages/werkzeug/utils.py", line 443, in import_string
sys.exc_info()[2])
File "/Library/Python/2.7/site-packages/werkzeug/utils.py", line 418, in import_string
__import__(import_name)
File "./api.py", line 6, in <module>
import pandas as pd
File "/Library/Python/2.7/site-packages/pandas/__init__.py", line 23, in <module>
from pandas.compat.numpy import *
File "/Library/Python/2.7/site-packages/pandas/compat/numpy/__init__.py", line 24, in <module>
'this pandas version'.format(_np_version))
werkzeug.utils.ImportStringError: import_string() failed for 'api.app'. Possible reasons are:
- missing __init__.py in a package;
- package or module path not included in sys.path;
- duplicated package or module name taking precedence in sys.path;
- missing module, class, function or variable;
Debugged import:
- 'api' not found.
Original exception:
ImportStringError: import_string() failed for 'api'. Possible reasons are:
- missing __init__.py in a package;
- package or module path not included in sys.path;
- duplicated package or module name taking precedence in sys.path;
- missing module, class, function or variable;
Debugged import:
- 'api' not found.
Original exception:
ImportError: this version of pandas is incompatible with numpy < 1.9.0
your numpy version is 1.8.0rc1.
Please upgrade numpy to >= 1.9.0 to use this pandas version
When I run which python, it points to /Users/richardscheiwe/anaconda3/bin/python. Also, I have NumPy v1.15 installed, and I can't find any other NumPy folder(s). When I try moving a version of NumPy to /Library/Python/2.7/site-packages, I get this error:
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
Original error was: cannot import name multiarray
I guess I need to somehow point the app's Python to Anaconda's Python 3.6 version, but I don't know how to do that. Pico is also available in Anaconda's /site-packages directory, but the app isn't pointing there.
Any help is greatly appreciated. I've scoured StackOverflow and GitHub.
You don't mention how you are starting the pico app, but I assume you are doing something like this:
python -m pico.server api
In this case it will simply use whatever python is on your PATH. If that is the Python 3 at /Users/richardscheiwe/anaconda3/bin/python but you are getting errors referring to /Library/Python/2.7/, then there is some problem with your Anaconda installation/paths in your environment.
There is nothing different about running pico compared to a plain Python script, but I suggest you create a simplified script without pico (literally just import pandas) to work out your environment issues with simpler error messages.
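For example, a stripped-down check script along those lines (a sketch; run it with the same python command you use to launch pico.server):
# check_env.py (hypothetical): no pico involved -- just report which
# interpreter is running and where numpy and pandas are imported from.
import sys
print(sys.executable)

import numpy
print("numpy %s from %s" % (numpy.__version__, numpy.__file__))

import pandas
print("pandas %s from %s" % (pandas.__version__, pandas.__file__))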
If I'm reading this correctly, the error seems to come from trying to use the version of NumPy built for the system Python 2.7 while your app is running under Python 3.
Try removing that NumPy with sudo pip uninstall numpy, then reinstalling it with sudo -H pip install numpy, and see if it then correctly finds the Python 3 version of NumPy.

Numpy and static linking

I am running Spark programs on a large cluster (for which I do not have administrative privileges). numpy is not installed on the worker nodes. Hence, I bundled numpy with my program, but I get the following error:
Traceback (most recent call last):
File "/home/user/spark-script.py", line 12, in <module>
import numpy
File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line 170, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line 13, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py", line 8, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py", line 11, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py", line 6, in <module>
ImportError: cannot import name multiarray
The script is actually quite simple:
from pyspark import SparkConf, SparkContext
sc = SparkContext()
sc.addPyFile('numpy.zip')
import numpy
a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
print a.collect()
I understand that the error occurs because numpy dynamically loads its multiarray.so dependency, and even though my numpy.zip includes the multiarray.so file, the dynamic loading somehow doesn't work with Apache Spark. Why is that? And how do you otherwise create a standalone numpy module with static linking?
Thanks.
There are at least two problems with your approach, and both can be reduced to a simple fact: NumPy is a heavyweight dependency.
First of all, the Debian packages come with multiple dependencies, including libgfortran, libblas, liblapack and libquadmath. So you cannot simply copy a NumPy installation and expect things to work (to be honest, you shouldn't do anything like this even if that weren't the case). Theoretically you could try to build it with static linking and ship it with all its dependencies that way, but that hits the second issue.
NumPy is pretty large by itself. While 20MB doesn't look particularly impressive, and with all the dependencies it shouldn't be more than 40MB, it has to be shipped to the workers each time you start your job. The more workers you have, the worse it gets. If you decide you need SciPy or scikit-learn, it can get much worse.
Arguably this makes NumPy a really bad candidate for shipping with the addPyFile method.
If you didn't have direct access to the workers, but all the dependencies, including header files and a static library, were present, you could simply try to install NumPy into the user site-packages from the task itself (this assumes that pip is installed as well) with something like this:
try:
    import numpy as np
except ImportError:
    import pip
    pip.main(["install", "--user", "numpy"])
    import numpy as np
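One caveat worth adding: pip.main() was removed from pip's public API in pip 10, so on newer pip versions a variant of the same idea (a sketch, not part of the original answer) shells out to pip instead:
try:
    import numpy as np
except ImportError:
    # pip.main() is gone in pip >= 10; invoke pip through the interpreter.
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", "numpy"])
    import numpy as np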
You'll find other variants of this method in How to install and import Python modules at runtime?
Since you have access to the workers, a much better solution is to create a separate Python environment. Probably the simplest approach is to use Anaconda, which can package non-Python dependencies as well and doesn't depend on the system-wide libraries. You can easily automate this task using tools like Ansible or Fabric; it doesn't require administrative privileges, and all you really need is bash and some way to fetch the basic installers (wget, curl, rsync, scp).
See also: shipping python modules in pyspark to other nodes?
