Not able to use numpy inside a udf function - python

I am trying to run some code on a Spark Kubernetes cluster:
"spark.kubernetes.container.image", "kublr/spark-py:2.4.0-hadoop-2.6"
The code I am trying to run is the following:
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def getMax(row, subtract):
    '''
    getMax takes two parameters -
    row: array with parameters
    subtract: normal value of the parameter
    It outputs the value most distant from the normal
    '''
    try:
        row = np.array(row)
        out = row[np.argmax(row - subtract)]
    except ValueError:
        return None
    return out.item()

udf_getMax = F.udf(getMax, FloatType())
The dataframe I am passing is as below.
However, I am getting the following error:
ModuleNotFoundError: No module named 'numpy'
When I did a Stack Overflow search, I could find a similar issue of a numpy import error in Spark on YARN:
ImportError: No module named numpy on spark workers
And the crazy part is that I am able to import numpy outside the UDF: the
import numpy as np
statement outside the function does not raise any errors.
Why is this happening? How do I fix this, or how should I go forward? Any help is appreciated.
Thank you
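
Most likely the Python interpreter on the executors (the one baked into the kublr/spark-py image) has no numpy installed, even though the driver's interpreter does; UDFs run on the executors, not the driver. A quick way to confirm is to probe the executors directly. A minimal diagnostic sketch, assuming an active spark session (the probe function and its names are just illustrative):

def probe(_):
    # Runs on an executor: report its interpreter and whether numpy imports there.
    import sys
    try:
        import numpy
        return (sys.executable, numpy.__version__)
    except ImportError as exc:
        return (sys.executable, str(exc))

print(spark.sparkContext.parallelize(range(2), 2).map(probe).collect())

If numpy is missing there, the usual fix is to install it into the executor image (or point spark.kubernetes.container.image at an image that includes it), since the workers never see the driver's site-packages.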

Related

Trying to create a numpy array and failing

So I am trying to run some code that was working fine before, but now I get an error.
The first lines of my code are:
import numpy as np
array = np.array([0,1,2])
Then, I get the following error:
'list' object is not callable
This code runs fine in a Jupyter notebook, so you might want to restart your session and try again. Another alternative would be to assign the array object to a different name instead of 'array' itself.
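
A minimal sketch of the kind of stale session state that produces this exact message (hypothetical cells, not the asker's actual code):

# an earlier cell in the session rebinds the name `array` to a plain list
array = [0, 1, 2]

# a later cell then calls `array(...)` as if it were still a function,
# so Python tries to call the list itself
array([3, 4, 5])   # TypeError: 'list' object is not callable

Restarting the session clears the stale binding, which is why the same code then runs fine.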

Python 3 Attribute Error: Statsmodels has no attribute 'tools'

I'm trying to use the following code (example):
import pandas as pd
import statsmodels.api as sm   # tried adding this, no luck
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

x = pd.DataFrame(...)   # imports a vector
plot_acf(x)
There is some code in between, but the problem arises when Python tries to plot the autocorrelation using statsmodels, and returns the following error:
File "/Users/user/anaconda/lib/python3.6/site-packages/statsmodels/iolib/foreign.py",
line 20, in <module>
import statsmodels.tools.data as data_util
AttributeError: module 'statsmodels' has no attribute 'tools'
I tried reinstalling multiple libraries, but nothing seems to get me past this error. Could this be a statsmodels-side bug?
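
One common cause of this kind of AttributeError (a guess here, since the question doesn't show the project layout) is a local file or folder named statsmodels shadowing the installed package, or a half-broken install. A quick check:

import statsmodels

# If this prints a path inside your own project rather than site-packages,
# rename the local file/folder; otherwise try a clean reinstall:
#   pip uninstall statsmodels && pip install statsmodels
print(statsmodels.__file__)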

module 'numpy' has no attribute 'testing'

I want to write a program which performs a periodogram on a series of measurement values listed in the file 'flux.txt', but I get the error:
module 'numpy' has no attribute 'testing'
The error also appears if I comment out the whole code. I tried to update numpy, but it is already up to date. Can someone help me, please?
from scipy import signal
import numpy as np
import matplotlib.pyplot as plt

with open('flux.txt','r') as f:
    item = f.readlines
print(item)
signal.periodogram(item)
plt.show()
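
Separately from the attribute error (which often points at a shadowed or corrupted numpy install), note that f.readlines is missing its call parentheses, so item is a bound method rather than a list of lines, and periodogram needs numeric data rather than raw text anyway. A minimal sketch of the intended flow, assuming flux.txt holds one float per line:

from scipy import signal
import numpy as np
import matplotlib.pyplot as plt

# Parse one measurement per line into a float array.
with open('flux.txt', 'r') as f:
    values = np.array([float(line) for line in f])

# periodogram returns sample frequencies and power spectral density.
freqs, power = signal.periodogram(values)
plt.semilogy(freqs, power)
plt.xlabel('frequency')
plt.ylabel('power spectral density')
plt.show()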

Name error when calculating the average error between two data sets

I want to calculate the average error between two data sets and write the result to a file, but my code gives this error:
NameError: name 'first1' is not defined
Could you please tell me how to fix this error?
Here is the command I use to run the code:
python script.py input1.txt input2.txt > output.txt
My code:
import numpy
from numpy import *
import scipy
import scipy.stats
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from pylab import *
import scipy.integrate
from operator import itemgetter, attrgetter
import sys

def main(argv):
    t = open(sys.argv[1])
    first1 = t.readline()
    tt = open(argv[2])
    second2 = tt.readline()
    return [first1], [second2]

def analysis(first1, second2):
    first = np.array(first1, dtype=np.float64)
    second = np.array(second2, dtype=np.float64)
    # Average error
    avgerr = (first - second).mean()
    return [avgerr]

analysis(first1, second2)

if __name__ == '__main__':
    sys.exit(main(sys.argv))
input1.txt:
2.5
2.8
3.9
4.2
5.8
input2.txt:
0.8
2.5
3.2
5.8
6.3
Where are you stuck on this? The first active statement executed in your main program is
analysis(first1, second2)
Neither of those variables is defined anywhere in the main program.
That's why you get the error. The sequence you have is something like this:
1. import stuff
2. define (but don't execute) the main function
3. define (but don't execute) the analysis function
4. call analysis, giving it the variables first1 & second2
Again, those variables are not defined yet at that point.
Your line:
analysis(first1, second2)
is causing the error, because you are calling the function without providing values. The way you have structured your code, it is expected to be run via the command line. If you want to test your script without using the command line, you could change the line above to:
analysis('input1.txt', 'input2.txt')
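
Building on both answers, here is a minimal sketch of one way to restructure the script so the data actually flows from main into analysis (the read_values helper is hypothetical, assuming one float per line as in the sample files):

import sys
import numpy as np

def read_values(path):
    # Read one float per line into an array.
    with open(path) as f:
        return np.array([float(line) for line in f], dtype=np.float64)

def analysis(first, second):
    # Average error between the two series.
    return (first - second).mean()

def main(argv):
    first = read_values(argv[1])
    second = read_values(argv[2])
    print(analysis(first, second))

if __name__ == '__main__':
    main(sys.argv)

Run as before with python script.py input1.txt input2.txt > output.txt, and output.txt will contain the average error.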

can't define a udf inside pyspark project

I have a Python project that uses PySpark, and I am trying to define a UDF function inside the Spark project (not in my Python project), specifically in spark\python\pyspark\ml\tuning.py, but I get pickling problems: it can't load the UDF.
The code:
from pyspark.sql.functions import udf, log
test_udf = udf(lambda x : -x[1], returnType=FloatType())
d = data.withColumn("new_col", test_udf(data["x"]))
d.show()
When I try d.show() I get an exception about an unknown attribute test_udf.
In my Python project I defined many UDFs and they worked fine.
Add the following to your code; it isn't recognizing the datatype:
from pyspark.sql.types import *
Let me know if this helps. Thanks.
Found it. There were two problems:
1) For some reason it didn't like returnType=FloatType(); I needed to convert it to just FloatType(), even though returnType=FloatType() matches the signature.
2) The data in column x was a vector, and for some reason I had to cast it to float.
The working code:
from pyspark.sql.functions import udf, log
from pyspark.sql.types import FloatType

test_udf = udf(lambda x: -float(x[1]), FloatType())
d = data.withColumn("new_col", test_udf(data["x"]))
d.show()
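
For context on the float() cast (my reading, not spelled out in the thread): indexing a Spark ML vector yields a numpy.float64, and a FloatType UDF expects a plain Python float, so the explicit conversion is what makes the return value serializable. A small illustration:

import numpy as np

v = np.array([1.0, 2.5])    # stand-in for the values inside the ML vector
print(type(v[1]))           # <class 'numpy.float64'> -- not a plain Python float
print(type(float(v[1])))    # <class 'float'> -- what the FloatType UDF returns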
