ImportError when running NuPIC model on PySpark - python

I am trying to run NuPIC on PySpark but I am getting an ImportError. Does anyone have any ideas for how I can fix it?
The code runs fine when I don't use PySpark, but I am trying to run it from a Spark Dataset now.
I am trying to run it using the source code I have in my directory, since running it with the installed NuPIC package causes some other errors.
Thank you for your help!!
I am trying to run this code:
input_data.rdd.foreach(lambda row: iterateRDD(row, model))

def iterateRDD(record, model):
    modelInput = record.asDict(False)
    modelInput["value"] = float(modelInput["value"])
    modelInput["timestamp"] = datetime.datetime.strptime(modelInput["timestamp"], "%Y-%m-%d %H:%M:%S")
    print "modelInput", modelInput
    result = model.run(modelInput)
    anomalyScore = result.inferences['anomalyScore']
    print "Anomaly score is", anomalyScore
However, I get the following error and don't understand it:
  File "C:/Users/rakshit.trn/Documents/Nupic/nupic-master/examples/anomaly.py", line 100, in runAnomaly
    input_data.rdd.foreach(lambda row: iterateRDD(row, model))
  File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 789, in foreach
    self.mapPartitions(processPartition).count()  # Force evaluation
  File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 1055, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 1046, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 917, in fold
    vals = self.mapPartitions(func).collect()
  File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 816, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\Python\Python27\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Python\Python27\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Python\Python27\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 364, in main
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 69, in read_command
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 583, in loads
    return pickle.loads(obj)
ImportError: No module named frameworks.opf.htm_prediction_model
I guess that NuPIC is not able to access the frameworks/opf/htm_prediction_model.py file.

You might be running an old version of NuPIC. See https://discourse.numenta.org/t/warning-0-7-0-breaking-changes/2200 and check what version you are using (https://discourse.numenta.org/t/how-to-check-what-version-of-nupic-is-installed/1045)
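For example, a quick way to check which version your interpreter actually picks up (a minimal sketch, assuming NuPIC was installed with pip rather than being run from a source checkout):

import pkg_resources

# Print the version of the nupic distribution visible to this Python environment
print(pkg_resources.get_distribution("nupic").version)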

Related

Getting an mssparkutils.fs.unmount(f'{AMLLOGMount}')/dcid error in Synapse Spark after model training completes, while returning results back to AML

We run data prep in an AML workspace; for training we use Synapse Spark. When training runs for 3 hours over all our input data, it writes the output CSVs to the xyz folder on the compute. At the end of training it fails with the error shown below:
We don't explicitly handle mounting/unmounting; the MS SDK function below takes care of it:
training_output = OutputFileDatasetConfig(name=training_dataset_name).register_on_complete(name=training_dataset_name)
Error:
File "synapse_control_script.py", line 81, in <module>
main()
File "synapse_control_script.py", line 78, in main
sys.exit(1)
File
"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1665707557685_0001/container_1665707557685_0001_01_000001/./azureml-setup/synapse_context_manager.py", line 225, in __exit__
mssparkutils.fs.unmount(f'{_AML_LOG_MOUNT_ROOT}/dcid.{self._run_id}')
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/notebookutils/mssparkutils/fs.py", line 42, in unmount
return fs.unmount(mountPoint, extraOptions)
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/notebookutils/mssparkutils/handlers/fsHandler.py", line 125, in unmount
return self.fsutils.unmount(mountPoint, extraOptions)
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:mssparkutils.fs.unmount.

Why am I not able to write a Spark dataframe to a Parquet file?

I'm trying to write a Spark dataframe to a Parquet file, but the write fails; I tried CSV as well, with the same result.
df is my dataframe:
CUST_ID
---------------
00000082MM778Q49X
00000372QM8890MX7
00000424M09X729MQ
0000062Q028M05MX
My dataframe looks as above.
df_parquet = (tempDir+"/"+"df.parquet") #filepath
customerQuery = f"SELECT DISTINCT(m.customer_ID) FROM ada_customer m INNER JOIN customer_nol mr ON m.customer_ID = mr.customer_ID \
WHERE mr.MODEL <> 'X' and m.STATUS = 'Process' AND m.YEAR = {year} AND mr.YEAR = {year}"
customer_df = sqlContext.read.format("jdbc").options(url="jdbc:mysql://localhost:3306/dbkl",
driver="com.mysql.jdbc.Driver",
query=customerQuery, user="root", password="root").load()
# the lines above work; only the write to a file fails
customer_df.write.mode("overwrite").parquet(df_parquet)
I'm getting this error and don't know exactly what's wrong. Can someone help with this?
Traceback (most recent call last):
File "F:/SparkBook/HG.py", line 135, in <module>
customer_xdf.write.mode("overwrite").parquet(customer_parquet)
File "C:\spark3\python\lib\pyspark.zip\pyspark\sql\readwriter.py", line 1372, in csv
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1305, in __call__
File "C:\spark3\python\lib\pyspark.zip\pyspark\sql\utils.py", line 111, in deco
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.csv.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:220)
... 33 more
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "F:/SparkBook/HG.py", line 148, in <module>
logger.error(e)
File "F:\SparkBook\lib\logger.py", line 16, in error
self.logger.error(message)
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1296, in __call__
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1266, in _build_args
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1266, in <listcomp>
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 298, in get_command_part
AttributeError: 'Py4JJavaError' object has no attribute '_get_object_id'
Process finished with exit code 1

PermissionError while using pandas .to_csv function

I have this code to read a .txt file from S3 and convert it to .csv using pandas:
file = pd.read_csv(f's3://{bucket_name}/{bucket_key}', sep=':', error_bad_lines=False)
file.to_csv(f's3://{bucket_name}/file_name.csv')
I have provided read/write permission to the IAM role, but this error still comes from the .to_csv function:
Anonymous access is forbidden for this operation: PermissionError
Update: the full error in the EC2 logs is:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 446, in _mkdir
await self.s3.create_bucket(**params)
File "/usr/local/lib/python3.6/dist-packages/aiobotocore/client.py", line 134, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Anonymous access is forbidden for this operation
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "convert_file_instance.py", line 92, in <module>
main()
File "convert_file_instance.py", line 36, in main
raise e
File "convert_file_instance.py", line 30, in main
file.to_csv(f's3://{bucket_name}/file_name.csv')
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3165, in to_csv
decimal=decimal,
File "/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py", line 67, in __init__
path_or_buf, encoding=encoding, compression=compression, mode=mode
File "/usr/local/lib/python3.6/dist-packages/pandas/io/common.py", line 233, in get_filepath_or_buffer
filepath_or_buffer, mode=mode or "rb", **(storage_options or {})
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 399, in open
**kwargs
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 254, in open_files
[fs.makedirs(parent, exist_ok=True) for parent in parents]
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 254, in <listcomp>
[fs.makedirs(parent, exist_ok=True) for parent in parents]
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 460, in makedirs
self.mkdir(path, create_parents=True)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 100, in wrapper
return maybe_sync(func, self, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 80, in maybe_sync
return sync(loop, func, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 51, in sync
raise exc.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 35, in f
result[0] = await future
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 450, in _mkdir
raise translate_boto_error(e) from e
PermissionError: Anonymous access is forbidden for this operation
I don't know why it is trying to create a bucket.
I have provided the Lambda role full access to S3.
Can someone please tell me what I'm missing here?
Thank you.
I think this is due to an incompatibility between the pandas, boto3 and s3fs libraries.
Try this setup:
pandas==1.0.3
boto3==1.13.11
s3fs==0.4.2
I also tried with pandas version 1.1.1 and it works too (I work with Python 3.7).
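To confirm what the runtime environment is actually using before and after pinning, a quick check like this can help (a minimal sketch):

import pandas
import boto3
import s3fs

# Print the versions actually installed in the environment that runs the job
print(pandas.__version__, boto3.__version__, s3fs.__version__)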

pyspark - Error while loading .csv file from url to Spark

I am trying to load data from a URL into PySpark:
url = "https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
spark.read.csv(SparkFiles.get("cities.csv"), header=True)
However, the following error occurred:
spark.read.csv(SparkFiles.get("cities.csv"), header=True)
[Stage 0:>          (0 + 1) / 1]20/06/30 19:10:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: File /tmp/spark-1ee8b00f-8657-4cdc-8d7b-e3bc473bbce7/userFiles-f9e0a88d-8678-48c4-a21b-c06ce76d528b/cities.csv exists and does not match contents of https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv
20/06/30 19:10:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jsh2936/spark-3.0.0-preview2-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 499, in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/jsh2936/spark-3.0.0-preview2-bin-hadoop2.7/python/pyspark/sql/utils.py", line 98, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o31.csv.
How should I solve this problem?
The problem is with your URL.
In order to read data from GitHub you have to pass the raw URL instead.
On the data page, click on "Raw" and then copy that URL to get the data:
url = 'https://raw.githubusercontent.com/jokecamp/FootballData/master/openFootballData/cities.csv'
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("cities.csv"), header=True)

Pyspark celery task : toPandas() throwing Pickling error

I have a web application that runs long-running tasks in PySpark. I am using Django and Celery to run the tasks asynchronously.
I have a piece of code that works great when I execute it in the console, but I get quite a few errors when I run it through the Celery task.
Firstly, my UDFs don't work for some reason. I put the code in a try/except block and it always goes into the except block.
try:
    func = udf(lambda x: parse(x), DateType())
    spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
except:
    raise ValueError("No valid date format found.")
The error:
[2018-04-05 07:47:37,223: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[afbda586-0929-4d51-87f1-d612cbdb4c5e] raised unexpected: Py4JError('An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:\npy4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)\n\tat py4j.Gateway.invoke(Gateway.java:235)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:748)\n\n',)
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 68, in outlier_algorithm
spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 179, in wrapper
return self(*args)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 157, in __call__
judf = self._judf
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 141, in _judf
self._judf_placeholder = self._create_judf()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 153, in _create_judf
self._name, wrapped_func, jdt, self.evalType, self.deterministic)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1428, in __call__
answer, self._gateway_client, None, self._fqn)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 324, in get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
at py4j.Gateway.invoke(Gateway.java:235)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Further, I am using toPandas() to convert the dataframe and run some pandas functions on it, but it throws the following error:
[2018-04-05 07:46:29,701: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[ec267a9b-b482-492d-8404-70b489fbbfe7] raised unexpected: Py4JJavaError('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 146, in outlier_algorithm
data_frame_new = data_frame_1.toPandas()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/dataframe.py", line 1937, in toPandas
if self.sql_ctx.getConf("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/context.py", line 142, in getConf
return self.sparkSession.conf.get(key, defaultValue)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/conf.py", line 46, in get
return self._jconf.get(key)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: ('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
[2018-04-05 07:46:29,706: ERROR/MainProcess] Task handler raised error: <MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.>
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/pool.py", line 362, in workloop
put((READY, (job, i, result, inqW_fd)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/queues.py", line 366, in put
self.send_payload(ForkingPickler.dumps(obj))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/reduction.py", line 56, in dumps
cls(buf, protocol).dump(obj)
billiard.pool.MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.
I ran into this problem and was having a hard time pinning it down. As it turns out this error can occur if the version of Spark you are running does not match the version of PySpark you are executing it with. In my case I am running Spark 2.2.3.4 and was trying to use PySpark 2.4.4. After I downgraded PySpark to 2.2.3 the problem went away. I ran into another problem caused by the code using functionality in PySpark that was added after 2.2.3, but that's another issue.
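For example, a quick way to compare the two versions (a minimal sketch; it assumes `spark` is an existing SparkSession):

import pyspark

print(pyspark.__version__)  # version of the PySpark package you are executing with
print(spark.version)        # version of the Spark installation the session is bound to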
This is just not going to work. Spark uses complex state, including JVM state, which cannot simply be serialized and sent to the worker. If you want to run your code asynchronously, use a thread pool to submit jobs.
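A minimal sketch of that approach, keeping the work inside the Django/driver process instead of shipping it to a Celery worker (the function, path, and `spark` session here are hypothetical):

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def run_outlier_job(spark, path):
    # Runs in the same driver process, so the existing SparkSession and its JVM state are reused
    df = spark.read.parquet(path)
    return df.toPandas()

# Submit asynchronously; block for the result only when it is needed
future = executor.submit(run_outlier_job, spark, "/data/input.parquet")
result = future.result()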
I'm answering my own question: it was probably a PySpark 2.3 bug.
I was using PySpark 2.3.0 and for some reason it did not work well with Python 3.5.
I downgraded to PySpark 2.1.2 and everything worked fine.
