Exception Handling in Apache Beam pipelines using Python

I'm building a simple pipeline with Apache Beam in Python (on GCP Dataflow) that reads from Pub/Sub and writes to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows.
Take a simple WriteToBigQuery example:
output = json_output | 'Write to BigQuery' >> beam.io.WriteToBigQuery('some-project:dataset.table_name')
I tried to put this inside a try/except block, but it doesn't work: when it fails, the exception seems to be thrown in a Java layer outside my Python execution:
INFO:root:2019-01-29T15:49:46.516Z: JOB_MESSAGE_ERROR: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error received from SDK harness for instruction -87: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 135, in _execute
response = task()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 170, in <lambda>
self._execute(lambda: worker.do_instruction(work), work)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 221, in do_instruction
request.instruction_id)
...
...
...
self.signature.finish_bundle_method.method_value())
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 1368, in finish_bundle
self._flush_batch()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 1380, in _flush_batch
self.table_id, errors))
RuntimeError: Could not successfully insert rows to BigQuery table [<myproject:datasetname.tablename>]. Errors: [<InsertErrorsValueListEntry
errors: [<ErrorProto
debugInfo: u''
location: u''
message: u'Missing required field: object.teste.'
reason: u'invalid'>]
index: 0>] [while running 'generatedPtransform-63']
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:57)
org.apache.beam.runners.dataflow.worker.fn.control.RegisterAndProcessBundleOperation.finish(RegisterAndProcessBundleOperation.java:276)
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:84)
org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:119)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1228)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:143)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:967)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Error received from SDK harness for instruction -87: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 135, in _execute
response = task()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 170, in <lambda>
self._execute(lambda: worker.do_instruction(work), work)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 221, in do_instruction
request.instruction_id)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 237, in process_bundle
bundle_processor.process_bundle(instruction_id)
...
...
...
self.signature.finish_bundle_method.method_value())
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 1368, in finish_bundle
self._flush_batch()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery.py", line 1380, in _flush_batch
self.table_id, errors))
I even tried to handle this error:
RuntimeError: Could not successfully insert rows to BigQuery table [<myproject:datasetname.tablename>]. Errors: [<InsertErrorsValueListEntry
errors: [<ErrorProto
debugInfo: u''
location: u''
message: u'Missing required field: object.teste.'
reason: u'invalid'>]
index: 0>] [while running 'generatedPtransform-63']
Using:
try:
    ...
except RuntimeException as e:
    ...
Using a generic Exception didn't work either.
I could find plenty of examples of error handling in Apache Beam using Java, but none handling errors in Python.
Does anyone know how to handle this?

I've only been able to catch exceptions at the DoFn level, so something like this:
import apache_beam as beam
from apache_beam import pvalue

class MyPipelineStep(beam.DoFn):

    def process(self, element, *args, **kwargs):
        try:
            # do stuff...
            yield pvalue.TaggedOutput('main_output', output_element)
        except Exception as e:
            yield pvalue.TaggedOutput('exception', str(e))
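To actually use those tags downstream you would apply the DoFn with with_outputs. A minimal sketch, assuming an input PCollection called inputs (the step name and what you do with the error stream are made up for illustration):

# Split the DoFn's output into a main stream and an error stream.
results = (
    inputs
    | 'MyStep' >> beam.ParDo(MyPipelineStep()).with_outputs('main_output', 'exception'))

good, bad = results.main_output, results.exception
# e.g. write `good` to BigQuery and log or archive `bad` somewhere else (a dead-letter pattern).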
However, WriteToBigQuery is a PTransform that wraps the DoFn BigQueryWriteFn, so you may need to do something like this:
from apache_beam.io.gcp.bigquery import BigQueryWriteFn, WriteToBigQuery

class MyBigQueryWriteFn(BigQueryWriteFn):

    def process(self, *args, **kwargs):
        try:
            # Note: super() should name this subclass, not BigQueryWriteFn itself.
            return super(MyBigQueryWriteFn, self).process(*args, **kwargs)
        except Exception as e:
            # Do something here
            pass

class MyWriteToBigQuery(WriteToBigQuery):
    # Copy the source code of `WriteToBigQuery` here,
    # but replace `BigQueryWriteFn` with `MyBigQueryWriteFn`
    pass
https://beam.apache.org/releases/pydoc/2.9.0/_modules/apache_beam/io/gcp/bigquery.html#WriteToBigQuery
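Side note (an assumption on my part, not something the 2.9 docs linked above cover): newer Beam releases let you skip the subclassing entirely, because WriteToBigQuery with streaming inserts returns a result from which the rejected rows can be read as a PCollection. A hedged sketch for those versions:

# Assumes a newer Beam SDK where failed streaming inserts are exposed on the write result.
from apache_beam.io.gcp.bigquery import BigQueryWriteFn

result = (
    json_output
    | 'Write to BigQuery' >> beam.io.WriteToBigQuery('some-project:dataset.table_name'))

# Rows BigQuery rejected (e.g. "Missing required field") come back tagged as failed rows.
failed_rows = result[BigQueryWriteFn.FAILED_ROWS]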

You can also use the generator flavor of FlatMap:
This is similar to the other answer, in that you can use a DoFn in place of something else, e.g. a CombineFn, to produce no outputs when there is an exception or some other kind of failed precondition.
import logging
from typing import Generator, List

def sum_values(values: List[int]) -> Generator[int, None, None]:
    if not values or len(values) < 10:
        logging.error(f'received invalid inputs: {...}')
        return
    yield sum(values)
# Now, instead of using CombinePerKey:
(inputs
 | 'WithKey' >> beam.Map(lambda x: (x.key, x))
 | 'GroupByKey' >> beam.GroupByKey()
 | 'Values' >> beam.Values()
 | 'MaybeSum' >> beam.FlatMap(sum_values))
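The same generator trick also works for real exceptions, not just failed preconditions. A minimal sketch, where parse_row and messages are invented names for illustration:

import json
import logging
from typing import Generator

def parse_row(raw: bytes) -> Generator[dict, None, None]:
    # Yield the parsed element, or yield nothing (and log) if parsing fails.
    try:
        yield json.loads(raw)
    except Exception:
        logging.exception('dropping unparseable element: %r', raw)

parsed = messages | 'MaybeParse' >> beam.FlatMap(parse_row)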

Related

Using VS Code to debug python files. Exception thrown on breakpoint and breakpoint is ignored

Tried with multiple different python files. Every time I try to use the debugger in vs code and set breakpoints the breakpoint gets ignored and exception gets raised and the script continues on. I've been googling and tinkering for over 2 hours and can't seem to figure out what's going on here. Tried rebooting PC, running vs code as admin, uninstall/reinstall the python extension for vs code. Tried to dig into the files mentioned in the traceback and pinpointed the function that seems to be raising the exception but I can't figure out where it's being called from or why it's raising the exception. I'm still new-ish to Python. Debugging works properly on my laptop but for whatever reason my desktop is having this issue.
Traceback (most recent call last):
File "c:\Users\Joel\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\_vendored\pydevd\pydevd_file_utils.py", line 529, in _original_file_to_client
return cache[filename]
KeyError: 'c:\\users\\joel\\local settings\\application data\\programs\\python\\python37-32\\lib\\runpy.py'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\Users\Joel\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_comm.py", line 330, in _on_run
self.process_net_command_json(self.py_db, json_contents)
File "c:\Users\Joel\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_process_net_command_json.py", line 190, in process_net_command_json
cmd = on_request(py_db, request)
File "c:\Users\Joel\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_process_net_command_json.py", line 771, in on_stacktrace_request
self.api.request_stack(py_db, request.seq, thread_id, fmt=fmt, start_frame=start_frame, levels=levels)
File "c:\Users\Joel\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_api.py", line 214, in request_stack
if internal_get_thread_stack.can_be_executed_by(get_current_thread_id(threading.current_thread())):
File "c:\Users\Joel\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_comm.py", line 661, in can_be_executed_by
py_db, self.seq, self.thread_id, frame, self._fmt, must_be_suspended=not timed_out, start_frame=self._start_frame, levels=self._levels)
File "c:\Users\Joel\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_net_command_factory_json.py", line 213, in make_get_thread_stack_message
py_db, frames_list
File "c:\Users\Joel\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_net_command_factory_xml.py", line 175, in _iter_visible_frames_info
new_filename_in_utf8, applied_mapping = pydevd_file_utils.norm_file_to_client(filename_in_utf8)
File "c:\Users\Joel\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\_vendored\pydevd\pydevd_file_utils.py", line 531, in _original_file_to_client
translated = _path_to_expected_str(get_path_with_real_case(_AbsFile(filename)))
File "c:\Users\Joel\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\_vendored\pydevd\pydevd_file_utils.py", line 221, in _get_path_with_real_case
return _resolve_listing(drive, iter(parts))
File "c:\Users\Joel\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\_vendored\pydevd\pydevd_file_utils.py", line 184, in _resolve_listing
dir_contents = cache[resolved_lower] = os.listdir(resolved)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\Joel\\Local Settings'
So I get this traceback every time a breakpoint is hit. Taking a peek at the "_original_file_to_client" function in "pydevd_file_utils.py" we get this:
def _original_file_to_client(filename, cache={}):
    try:
        return cache[filename]
    except KeyError:
        translated = _path_to_expected_str(get_path_with_real_case(_AbsFile(filename)))
        cache[filename] = (translated, False)
        return cache[filename]
I wasn't able to figure out where this function was being called from or what the expected output was supposed to be. Any help with this would be greatly appreciated!
Edit: Forgot to mention I'm using Windows 10 if it wasn't obvious from the trace
This is a similar question. The spaces in the path ("Local Settings", "application data") cause this problem.

Pyspark celery task : toPandas() throwing Pickling error

I have a web application to run long running tasks in pyspark. I am using Django, and Celery to run the tasks asynchronously.
I have a piece of code that works great when I execute it in the console, but I am getting quite a few errors when I run it through the Celery task.
Firstly, my UDFs don't work for some reason. I put one in a try/except block and it always goes into the except block.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

try:
    # `parse` is presumably dateutil.parser.parse
    func = udf(lambda x: parse(x), DateType())
    spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
except:
    raise ValueError("No valid date format found.")
The error:
[2018-04-05 07:47:37,223: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[afbda586-0929-4d51-87f1-d612cbdb4c5e] raised unexpected: Py4JError('An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:\npy4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)\n\tat py4j.Gateway.invoke(Gateway.java:235)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:748)\n\n',)
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 68, in outlier_algorithm
spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 179, in wrapper
return self(*args)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 157, in __call__
judf = self._judf
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 141, in _judf
self._judf_placeholder = self._create_judf()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 153, in _create_judf
self._name, wrapped_func, jdt, self.evalType, self.deterministic)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1428, in __call__
answer, self._gateway_client, None, self._fqn)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 324, in get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
at py4j.Gateway.invoke(Gateway.java:235)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Further, I am using toPandas() to convert the dataframe and run some pandas functions on it, but it throws the following error:
[2018-04-05 07:46:29,701: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[ec267a9b-b482-492d-8404-70b489fbbfe7] raised unexpected: Py4JJavaError('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 146, in outlier_algorithm
data_frame_new = data_frame_1.toPandas()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/dataframe.py", line 1937, in toPandas
if self.sql_ctx.getConf("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/context.py", line 142, in getConf
return self.sparkSession.conf.get(key, defaultValue)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/conf.py", line 46, in get
return self._jconf.get(key)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: ('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
[2018-04-05 07:46:29,706: ERROR/MainProcess] Task handler raised error: <MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.>
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/pool.py", line 362, in workloop
put((READY, (job, i, result, inqW_fd)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/queues.py", line 366, in put
self.send_payload(ForkingPickler.dumps(obj))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/reduction.py", line 56, in dumps
cls(buf, protocol).dump(obj)
billiard.pool.MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.
I ran into this problem and was having a hard time pinning it down. As it turns out this error can occur if the version of Spark you are running does not match the version of PySpark you are executing it with. In my case I am running Spark 2.2.3.4 and was trying to use PySpark 2.4.4. After I downgraded PySpark to 2.2.3 the problem went away. I ran into another problem caused by the code using functionality in PySpark that was added after 2.2.3, but that's another issue.
This is just not going to work. Spark uses complex state, including JVM state, which cannot simply be serialized and sent to the worker. If you want to run your code asynchronously, use a thread pool to submit jobs.
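For the thread-pool suggestion, a minimal sketch with the standard library; run_outlier_job is an invented name, and the point is that everything stays in the one driver process that owns the SparkSession rather than in a forked Celery worker:

from concurrent.futures import ThreadPoolExecutor

# One small pool, created once, inside the process that owns the SparkSession.
executor = ThreadPoolExecutor(max_workers=2)

def run_outlier_job(spark_data_frame):
    # Same driver process, so the JVM / py4j state stays valid.
    return spark_data_frame.toPandas()

future = executor.submit(run_outlier_job, spark_data_frame)
# ...later, when the result is needed:
pandas_df = future.result()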
I'm answering my own question: it was probably a PySpark 2.3 bug.
I was using PySpark 2.3.0 and for some reason it did not work well with Python 3.5.
I downgraded to PySpark 2.1.2 and everything worked fine.

Capturing APScheduler job exceptions to Sentry

I'm using APScheduler to process a few things in the background.
I'd like to capture and report possible exceptions to Sentry. My code looks like this:
from apscheduler.events import EVENT_JOB_ERROR, EVENT_JOB_EXECUTED
from apscheduler.schedulers.blocking import BlockingScheduler
from raven import Client

sentry = Client(dsn=SENTRY_DSN)

def sample_method():
    # some processing..
    raise ConnectionError

def listen_to_exceptions(event):
    if event.exception:
        # I was hoping raven would capture the exception using sys.exc_info(), but it doesn't
        sentry.captureException()

scheduler = BlockingScheduler()
scheduler.add_listener(listen_to_exceptions, EVENT_JOB_EXECUTED | EVENT_JOB_ERROR)
scheduler.add_job(sample_method, 'interval', minutes=5, max_instances=1)

# run forever!!!
scheduler.start()
But instead of capturing the exception, it generates more exceptions while trying to report it to Sentry.
ConnectionError
Error notifying listener
Traceback (most recent call last):
File "/.../venv/lib/python3.6/site-packages/apscheduler/schedulers/base.py", line 825, in _dispatch_event
cb(event)
File "app.py", line 114, in listen_to_exceptions
sentry.captureException(event.exception)
File "/.../venv/lib/python3.6/site-packages/raven/base.py", line 814, in captureException
'raven.events.Exception', exc_info=exc_info, **kwargs)
File "/.../venv/lib/python3.6/site-packages/raven/base.py", line 623, in capture
if self.skip_error_for_logging(exc_info):
File "/.../venv/lib/python3.6/site-packages/raven/base.py", line 358, in skip_error_for_logging
key = self._get_exception_key(exc_info)
File "/.../venv/lib/python3.6/site-packages/raven/base.py", line 345, in _get_exception_key
code_id = id(exc_info[2] and exc_info[2].tb_frame.f_code)
TypeError: 'ConnectionError' object is not subscriptable
I'm trying to use an event listener according to the docs. Is there another way to capture exceptions in executed jobs?
Of course I could add try/except blocks to each job function. I'm just trying to understand if there's a way to do it with APScheduler, because I have 20+ jobs and adding sentry.captureException() everywhere seems like repetition.
You only need to capture EVENT_JOB_ERROR. Also, sentry.captureException() requires an exc_info tuple as its argument, not the exception object. The following will work on Python 3:
def listen_to_exceptions(event):
    exc_info = type(event.exception), event.exception, event.exception.__traceback__
    sentry.captureException(exc_info)

scheduler.add_listener(listen_to_exceptions, EVENT_JOB_ERROR)
The documentation has been updated, so now you have to do it the following way:
from sentry_sdk import capture_exception
....

def sentry_listener(event):
    if event.exception:
        capture_exception(event.exception)

scheduler.add_listener(sentry_listener, EVENT_JOB_ERROR)
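For completeness, the .... in that snippet presumably includes initializing the SDK; a minimal sketch of the full wiring (the DSN is a placeholder, and the job itself is omitted):

import sentry_sdk
from apscheduler.events import EVENT_JOB_ERROR
from apscheduler.schedulers.blocking import BlockingScheduler
from sentry_sdk import capture_exception

sentry_sdk.init(dsn="https://<key>@sentry.example.com/<project>")

def sentry_listener(event):
    if event.exception:
        capture_exception(event.exception)

scheduler = BlockingScheduler()
scheduler.add_listener(sentry_listener, EVENT_JOB_ERROR)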

Can't catch exception in Python

I use Salt, and it raises the following exception when I run exec_state('update_salt') (the code is below):
File "/usr/lib/python2.6/site-packages/salt/client/__init__.py", line 1582, in __init__
caller = salt.client.Caller()
File "/usr/lib/python2.6/site-packages/salt/minion.py", line 283, in __init__
for key, val in data.items():
File "/usr/lib/python2.6/site-packages/salt/minion.py", line 300, in gen_modules
File "/usr/lib/python2.6/site-packages/salt/loader.py", line 286, in render
opts,
salt.exceptions.LoaderError: The renderer yaml_jinja is unavailable, this error is often because the needed software is unavailable
I try to handle it with a try/except block:
try:
    result = exec_state('update_salt')
    if not result:
        return False
except:
    print "got it.."
    result = exec_state('update_salt_light')
    if not result:
        return False
But it still fails on the first attempt and never reaches the except block ("got it.." is not printed). Why?

Python multiprocessing pool.map raises IndexError

I've developed a utility using Python/Cython that sorts CSV files and generates stats for a client, but invoking pool.map seems to raise an exception before my mapped function has a chance to execute. Sorting a small number of files works as expected, but as the number of files grows to, say, 10, I get the IndexError below after calling pool.map. Does anyone happen to recognize this error? Any help is greatly appreciated.
While the code is under NDA, the use-case is fairly simple:
Code Sample:
import multiprocessing

def sort_files(csv_files):
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size)
    sorted_dicts = pool.map(sort_file, csv_files, 1)
    return sorted_dicts

def sort_file(csv_file):
    print 'sorting %s...' % csv_file
    # sort code
Output:
File "generic.pyx", line 17, in generic.sort_files (/users/cyounker/.pyxbld/temp.linux-x86_64-2.7/pyrex/generic.c:1723)
sorted_dicts = pool.map(sort_file, csv_files, 1)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 227, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
IndexError: list index out of range
The IndexError is an error you get somewhere in sort_file(), i.e. in a subprocess. It is re-raised by the parent process. Apparently multiprocessing doesn't make any attempt to inform us about where the error really comes from (e.g. on which line it occurred) or even which argument to sort_file() caused it. I hate multiprocessing even more :-(
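One common workaround (not part of the answer above, just a sketch): wrap the worker function so it reports the offending file and the full traceback before re-raising, then map the wrapper instead:

import traceback

def sort_file_verbose(csv_file):
    # Same as sort_file, but says which file and where it blew up before re-raising.
    try:
        return sort_file(csv_file)
    except Exception:
        print 'sort_file(%r) failed:\n%s' % (csv_file, traceback.format_exc())
        raise

# In sort_files(), map the wrapper instead:
# sorted_dicts = pool.map(sort_file_verbose, csv_files, 1)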
Check further up in the command output.
In Python 3.4 at least, multiprocessing.pool will helpfully print a RemoteTraceback above the parent process traceback. You'll see something like:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.4/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/path/to/your/code/here.py", line 80, in sort_file
something = row[index]
IndexError: list index out of range
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "generic.pyx", line 17, in generic.sort_files (/users/cyounker/.pyxbld/temp.linux-x86_64-2.7/pyrex/generic.c:1723)
sorted_dicts = pool.map(sort_file, csv_files, 1)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 227, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
IndexError: list index out of range
In the case above, the code raising the error is at "/path/to/your/code/here.py", line 80.
See also: debugging errors in python multiprocessing.
