I have a web application that runs long-running tasks in PySpark. I am using Django and Celery to run the tasks asynchronously.
I have a piece of code that works great when I execute it in the console, but I get quite a few errors when I run it through the Celery task.
Firstly, my UDFs don't work for some reason. I put the code in a try-except block and it always goes into the except block.
from dateutil.parser import parse  # assuming parse() is dateutil's date parser
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

try:
    func = udf(lambda x: parse(x), DateType())
    spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
except:
    raise ValueError("No valid date format found.")
The error:
[2018-04-05 07:47:37,223: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[afbda586-0929-4d51-87f1-d612cbdb4c5e] raised unexpected: Py4JError('An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:\npy4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)\n\tat py4j.Gateway.invoke(Gateway.java:235)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:748)\n\n',)
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 68, in outlier_algorithm
spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 179, in wrapper
return self(*args)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 157, in __call__
judf = self._judf
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 141, in _judf
self._judf_placeholder = self._create_judf()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 153, in _create_judf
self._name, wrapped_func, jdt, self.evalType, self.deterministic)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1428, in __call__
answer, self._gateway_client, None, self._fqn)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 324, in get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
at py4j.Gateway.invoke(Gateway.java:235)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Further, I am using toPandas() to convert the DataFrame so I can run some pandas functions on it, but it throws the following error:
[2018-04-05 07:46:29,701: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[ec267a9b-b482-492d-8404-70b489fbbfe7] raised unexpected: Py4JJavaError('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 146, in outlier_algorithm
data_frame_new = data_frame_1.toPandas()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/dataframe.py", line 1937, in toPandas
if self.sql_ctx.getConf("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/context.py", line 142, in getConf
return self.sparkSession.conf.get(key, defaultValue)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/conf.py", line 46, in get
return self._jconf.get(key)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: ('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
[2018-04-05 07:46:29,706: ERROR/MainProcess] Task handler raised error: <MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.>
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/pool.py", line 362, in workloop
put((READY, (job, i, result, inqW_fd)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/queues.py", line 366, in put
self.send_payload(ForkingPickler.dumps(obj))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/reduction.py", line 56, in dumps
cls(buf, protocol).dump(obj)
billiard.pool.MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.
I ran into this problem and had a hard time pinning it down. As it turns out, this error can occur if the version of Spark you are running does not match the version of PySpark you are executing it with. In my case I was running Spark 2.2.3.4 and trying to use PySpark 2.4.4. After I downgraded PySpark to 2.2.3 the problem went away. I then ran into another problem caused by the code using functionality that was added to PySpark after 2.2.3, but that's another issue.
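One quick way to check whether the two match is to compare the version the JVM reports with the installed Python package, for example:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# These two versions should agree, at least at the major/minor level.
print("Spark (JVM):  ", spark.version)
print("PySpark (pip):", pyspark.__version__)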
This is just not going to work. Spark uses complex state, including JVM state, which cannot simply be serialized and sent to a worker. If you want to run your code asynchronously, use a thread pool to submit jobs.
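For example, a rough sketch of that approach: keep a single SparkSession in the web process and submit the long-running work to a thread pool instead of a forked Celery worker. The job function and path here are just placeholders.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

# One SparkSession owned by this process; the threads share its JVM gateway.
spark = SparkSession.builder.appName("async-jobs").getOrCreate()
pool = ThreadPoolExecutor(max_workers=4)

def run_long_job(path):
    # Hypothetical long-running Spark job; replace with your own logic.
    df = spark.read.csv(path, header=True)
    return df.count()

# Submit without blocking; keep the Future to poll status or fetch the result later.
future = pool.submit(run_long_job, "/data/input.csv")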
I'm answering my own question: it was probably a PySpark 2.3 bug.
I was using PySpark 2.3.0 and for some reason it did not work well with Python 3.5.
I downgraded to PySpark 2.1.2 and everything worked fine.
Related
I have this code to read a .txt file from S3 and convert it to .csv using pandas:
import pandas as pd

file = pd.read_csv(f's3://{bucket_name}/{bucket_key}', sep=':', error_bad_lines=False)
file.to_csv(f's3://{bucket_name}/file_name.csv')
I have given the IAM role read/write permissions, but this error still comes up for the .to_csv call:
Anonymous access is forbidden for this operation: PermissionError
Update: the full error in the EC2 logs is:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 446, in _mkdir
await self.s3.create_bucket(**params)
File "/usr/local/lib/python3.6/dist-packages/aiobotocore/client.py", line 134, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Anonymous access is forbidden for this operation
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "convert_file_instance.py", line 92, in <module>
main()
File "convert_file_instance.py", line 36, in main
raise e
File "convert_file_instance.py", line 30, in main
file.to_csv(f's3://{bucket_name}/file_name.csv')
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3165, in to_csv
decimal=decimal,
File "/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py", line 67, in __init__
path_or_buf, encoding=encoding, compression=compression, mode=mode
File "/usr/local/lib/python3.6/dist-packages/pandas/io/common.py", line 233, in get_filepath_or_buffer
filepath_or_buffer, mode=mode or "rb", **(storage_options or {})
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 399, in open
**kwargs
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 254, in open_files
[fs.makedirs(parent, exist_ok=True) for parent in parents]
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 254, in <listcomp>
[fs.makedirs(parent, exist_ok=True) for parent in parents]
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 460, in makedirs
self.mkdir(path, create_parents=True)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 100, in wrapper
return maybe_sync(func, self, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 80, in maybe_sync
return sync(loop, func, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 51, in sync
raise exc.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 35, in f
result[0] = await future
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 450, in _mkdir
raise translate_boto_error(e) from e
PermissionError: Anonymous access is forbidden for this operation
I don't know why it is trying to create a bucket.
I have also given the Lambda role full access to S3.
Can someone please tell me what I'm missing here?
Thank you.
I think this is due to an incompatibility between the pandas, boto3 and s3fs libraries.
Try this setup:
pandas==1.0.3
boto3==1.13.11
s3fs==0.4.2
I also tried with pandas 1.1.1 and it works too (I am on Python 3.7).
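To see which versions your environment currently has, you can print them, e.g.:
import boto3
import pandas
import s3fs

print(pandas.__version__, boto3.__version__, s3fs.__version__)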
The Airflow scheduler crashes when I trigger it manually from the dashboard.
executor = DaskExecutor
Airflow version = 1.10.7
sql_alchemy_conn = postgresql://airflow:airflow@localhost:5432/airflow
Python version = 3.6
The logs on crash are:
[2020-08-20 07:01:49,288] {scheduler_job.py:1148} INFO - Sending ('hello_world', 'dummy_task', datetime.datetime(2020, 8, 20, 1, 31, 47, 20630, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1) to executor with priority 2 and queue default
[2020-08-20 07:01:49,288] {base_executor.py:58} INFO - Adding to queue: ['airflow', 'run', 'hello_world', 'dummy_task', '2020-08-20T01:31:47.020630+00:00', '--local', '--pool', 'default_pool', '-sd', '/workflows/dags/helloWorld.py']
/mypython/lib/python3.6/site-packages/airflow/executors/dask_executor.py:63: UserWarning: DaskExecutor does not support queues. All tasks will be run in the same cluster
'DaskExecutor does not support queues. '
distributed.protocol.pickle - INFO - Failed to serialize <function DaskExecutor.execute_async.<locals>.airflow_run at 0x12057a9d8>. Exception: Cell is empty
[2020-08-20 07:01:49,292] {scheduler_job.py:1361} ERROR - Exception when executing execute_helper
Traceback (most recent call last):
File "/mypython/lib/python3.6/site-packages/distributed/worker.py", line 843, in dumps_function
result = cache[func]
KeyError: <function DaskExecutor.execute_async.<locals>.airflow_run at 0x12057a9d8>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mypython/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 38, in dumps
result = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
AttributeError: Can't pickle local object 'DaskExecutor.execute_async.<locals>.airflow_run'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mypython/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1359, in _execute
self._execute_helper()
File "/mypython/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1420, in _execute_helper
if not self._validate_and_run_task_instances(simple_dag_bag=simple_dag_bag):
File "/mypython/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1482, in _validate_and_run_task_instances
self.executor.heartbeat()
File "/mypython/lib/python3.6/site-packages/airflow/executors/base_executor.py", line 130, in heartbeat
self.trigger_tasks(open_slots)
File "/mypython/lib/python3.6/site-packages/airflow/executors/base_executor.py", line 154, in trigger_tasks
executor_config=simple_ti.executor_config)
File "/mypython/lib/python3.6/site-packages/airflow/executors/dask_executor.py", line 70, in execute_async
future = self.client.submit(airflow_run, pure=False)
File "/mypython/lib/python3.6/site-packages/distributed/client.py", line 1279, in submit
actors=actor)
File "/mypython/lib/python3.6/site-packages/distributed/client.py", line 2249, in _graph_to_futures
'tasks': valmap(dumps_task, dsk3),
File "/mypython/lib/python3.6/site-packages/toolz/dicttoolz.py", line 83, in valmap
rv.update(zip(iterkeys(d), map(func, itervalues(d))))
File "/mypython/lib/python3.6/site-packages/distributed/worker.py", line 881, in dumps_task
return {'function': dumps_function(task[0]),
File "/mypython/lib/python3.6/site-packages/distributed/worker.py", line 845, in dumps_function
result = pickle.dumps(func)
File "/mypython/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 51, in dumps
return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
File "/mypython/lib/python3.6/site-packages/cloudpickle/cloudpickle_fast.py", line 101, in dumps
cp.dump(obj)
File "/mypython/lib/python3.6/site-packages/cloudpickle/cloudpickle_fast.py", line 540, in dump
return Pickler.dump(self, obj)
File "/usr/local/opt/python#3.6/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 409, in dump
self.save(obj)
File "/usr/local/opt/python#3.6/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/mypython/lib/python3.6/site-packages/cloudpickle/cloudpickle_fast.py", line 722, in save_function
*self._dynamic_function_reduce(obj), obj=obj
File "/mypython/lib/python3.6/site-packages/cloudpickle/cloudpickle_fast.py", line 659, in _save_reduce_pickle5
dictitems=dictitems, obj=obj
File "/usr/local/opt/python#3.6/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 610, in save_reduce
save(args)
File "/usr/local/opt/python#3.6/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/local/opt/python#3.6/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 751, in save_tuple
save(element)
File "/usr/local/opt/python#3.6/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/local/opt/python#3.6/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 736, in save_tuple
save(element)
File "/usr/local/opt/python#3.6/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/mypython/lib/python3.6/site-packages/dill/_dill.py", line 1146, in save_cell
f = obj.cell_contents
ValueError: Cell is empty
[2020-08-20 07:01:49,302] {helpers.py:322} INFO - Sending Signals.SIGTERM to GPID 11451
[2020-08-20 07:01:49,303] {dag_processing.py:804} INFO - Exiting gracefully upon receiving signal 15
[2020-08-20 07:01:49,310] {dag_processing.py:1379} INFO - Waiting up to 5 seconds for processes to exit...
[2020-08-20 07:01:49,318] {helpers.py:288} INFO - Process psutil.Process(pid=11451, status='terminated') (11451) terminated with exit code 0
[2020-08-20 07:01:49,319] {helpers.py:288} INFO - Process psutil.Process(pid=11600, status='terminated') (11600) terminated with exit code None
[2020-08-20 07:01:49,319] {scheduler_job.py:1364} INFO - Exited execute loop
I am running this on macOS Catalina, in case that helps isolate the error.
I believe this issue is possibly what you are experiencing.
Looking at that ticket, it appears to still be open: a fix has been made, but it has not yet made it into an official release.
This pull request contains the fix for the issue linked above - you could try building your Airflow stack locally from there, and see if it resolves the issue for you.
This started happening with newer versions of the downstream Dask dependencies. Pinning the versions fixes the issue.
pip uninstall cloudpickle distributed
pip install cloudpickle==1.4.1 distributed==2.17.0
These were the problematic versions:
cloudpickle==1.6.0
distributed==2.26.0
I run Airflow 1.10.10 in docker and use the same image for Dask 2.13.0.
I am trying to run NuPIC on PySpark but I am getting an ImportError. Does anyone have any ideas for how I can fix it?
The code runs fine when I don't use PySpark, but I am trying to run it from a Spark Dataset now.
I am trying to run it using the source code I have in my directory, since running it by installing the NuPIC package causes some other errors.
Thank you for your help!!
I am trying to run this function
import datetime

input_data.rdd.foreach(lambda row: iterateRDD(row, model))

def iterateRDD(record, model):
    modelInput = record.asDict(False)
    modelInput["value"] = float(modelInput["value"])
    modelInput["timestamp"] = datetime.datetime.strptime(modelInput["timestamp"], "%Y-%m-%d %H:%M:%S")
    print "modelInput", modelInput
    result = model.run(modelInput)
    anomalyScore = result.inferences['anomalyScore']
    print "Anomaly score is", anomalyScore
However, I get this error and don't understand it.
File "C:/Users/rakshit.trn/Documents/Nupic/nupic-master/examples/anomaly.py", line 100, in runAnomaly
input_data.rdd.foreach(lambda row: iterateRDD(row, model))
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 789, in foreach
self.mapPartitions(processPartition).count()  # Force evaluation
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 1055, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 1046, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 917, in fold
vals = self.mapPartitions(func).collect()
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 816, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "C:\Python\Python27\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Python\Python27\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\Python\Python27\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 364, in main
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 69, in read_command
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 172, in _read_with_length
return self.loads(obj)
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 583, in loads
return pickle.loads(obj)
ImportError: No module named frameworks.opf.htm_prediction_model
I guess that NuPIC is not able to access the frameworks/opf/htm_prediction_model.py file.
You might be running an old version of NuPIC. See https://discourse.numenta.org/t/warning-0-7-0-breaking-changes/2200 and check what version you are using (https://discourse.numenta.org/t/how-to-check-what-version-of-nupic-is-installed/1045)
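If NuPIC was installed with pip (rather than run from a source checkout), one generic way to check the installed version is:
import pkg_resources
print(pkg_resources.get_distribution("nupic").version)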
I'm building a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from Pub/Sub and write to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows.
On a simple WriteToBigQuery example:
output = json_output | 'Write to BigQuery' >> beam.io.WriteToBigQuery('some-project:dataset.table_name')
I tried to put this inside a try/except block, but it doesn't work: when it fails, the exception seems to be thrown in a Java layer outside my Python execution:
INFO:root:2019-01-29T15:49:46.516Z: JOB_MESSAGE_ERROR: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error received from SDK harness for instruction -87: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 135, in _execute
response = task()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 170, in <lambda>
self._execute(lambda: worker.do_instruction(work), work)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 221, in do_instruction
request.instruction_id)
...
...
...
self.signature.finish_bundle_method.method_value())
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 1368, in finish_bundle
self._flush_batch()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 1380, in _flush_batch
self.table_id, errors))
RuntimeError: Could not successfully insert rows to BigQuery table [<myproject:datasetname.tablename>]. Errors: [<InsertErrorsValueListEntry
errors: [<ErrorProto
debugInfo: u''
location: u''
message: u'Missing required field: object.teste.'
reason: u'invalid'>]
index: 0>] [while running 'generatedPtransform-63']
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:57)
org.apache.beam.runners.dataflow.worker.fn.control.RegisterAndProcessBundleOperation.finish(RegisterAndProcessBundleOperation.java:276)
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:84)
org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:119)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1228)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:143)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:967)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Error received from SDK harness for instruction -87: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 135, in _execute
response = task()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 170, in <lambda>
self._execute(lambda: worker.do_instruction(work), work)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 221, in do_instruction
request.instruction_id)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 237, in process_bundle
bundle_processor.process_bundle(instruction_id)
...
...
...
self.signature.finish_bundle_method.method_value())
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 1368, in finish_bundle
self._flush_batch()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 1380, in _flush_batch
self.table_id, errors))
Even trying to handle this:
RuntimeError: Could not successfully insert rows to BigQuery table [<myproject:datasetname.tablename>]. Errors: [<InsertErrorsValueListEntry
errors: [<ErrorProto
debugInfo: u''
location: u''
message: u'Missing required field: object.teste.'
reason: u'invalid'>]
index: 0>] [while running 'generatedPtransform-63']
Using:
try:
    ...
except RuntimeException as e:
    ...
Or using generic Exception didn't work.
I could find a lot of examples of error handling in Apache Beam using Java, but none in Python.
Does anyone know how to handle this?
I've only been able to catch exceptions at the DoFn level, so something like this:
import apache_beam as beam
from apache_beam import pvalue

class MyPipelineStep(beam.DoFn):
    def process(self, element, *args, **kwargs):
        try:
            # do stuff...
            yield pvalue.TaggedOutput('main_output', output_element)
        except Exception as e:
            yield pvalue.TaggedOutput('exception', str(e))
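To route the two tagged outputs downstream, you can apply the DoFn with with_outputs; a sketch using the tags above (records here stands for your input PCollection):
results = records | 'MyStep' >> beam.ParDo(MyPipelineStep()).with_outputs('main_output', 'exception')
good_records = results.main_output
failures = results.exception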
However, WriteToBigQuery is a PTransform that wraps the DoFn BigQueryWriteFn.
So you may need to do something like this:
class MyBigQueryWriteFn(BigQueryWriteFn):
    def process(self, *args, **kwargs):
        try:
            return super(MyBigQueryWriteFn, self).process(*args, **kwargs)
        except Exception as e:
            # Do something here
            pass

class MyWriteToBigQuery(WriteToBigQuery):
    # Copy the source code of `WriteToBigQuery` here,
    # but replace `BigQueryWriteFn` with `MyBigQueryWriteFn`
    pass
https://beam.apache.org/releases/pydoc/2.9.0/_modules/apache_beam/io/gcp/bigquery.html#WriteToBigQuery
You can also use the generator flavor of FlatMap:
This is similar to the other answer, in that you can use a DoFn in place of something else, e.g. a CombineFn, and produce no outputs when there is an exception or some other kind of failed precondition.
from typing import Generator, List
import logging

def sum_values(values: List[int]) -> Generator[int, None, None]:
    if not values or len(values) < 10:
        logging.error(f'received invalid inputs: {...}')
        return
    yield sum(values)
# Now, instead of using CombinePerKey:
(inputs
 | 'WithKey' >> beam.Map(lambda x: (x.key, x))
 | 'GroupByKey' >> beam.GroupByKey()
 | 'Values' >> beam.Values()
 | 'MaybeSum' >> beam.FlatMap(sum_values))
I used to use the following script to retrieve the different location ids in order to create a VSI order:
https://softlayer.github.io/python/list_packages/
Specifically:
def getAllLocations(self):
    mask = "mask[id,locations[id,name]]"
    result = self.client['SoftLayer_Location_Group_Pricing'].getAllObjects(mask=mask)
    pp(result)
Unfortunately, in the meantime it has started throwing the following exception:
Traceback (most recent call last):
File "new.py", line 59, in <module>
main.getAllLocations()
File "new.py", line 52, in getAllLocations
result = self.client['SoftLayer_Location_Group_Pricing'].getAllObjects(mask=mask);
File "/usr/local/lib/python2.7/site-packages/SoftLayer/API.py", line 392, in call_handler
return self(name, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/SoftLayer/API.py", line 360, in call
return self.client.call(self.name, name, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/SoftLayer/API.py", line 263, in call
return self.transport(request)
File "/usr/local/lib/python2.7/site-packages/SoftLayer/transports.py", line 195, in __call__
raise _ex(ex.faultCode, ex.faultString)
SoftLayer.exceptions.SoftLayerAPIError: SoftLayerAPIError(SOAP-ENV:Server): Internal Error
Is there something that needs to be changed within the function?
No, your code is fine. The problem is an issue with the method itself, which is not working. I am going to report the issue, or if you want, you can open a SoftLayer ticket and report it yourself.
An issue with the getAllObjects method was fixed yesterday for this service. Please try the request again.