How to use pyarrow parquet with multiprocessing - python

I want to read multiple hdfs files simultaneously using pyarrow and multiprocessing.
The simple python script works (see below), but if I try to do the same thing with multiprocessing, then it hangs indefinitely.
My only guess is that env is different somehow, but all the environment variable should be the same in the child process and parent process.
I've tried to debug this using print(); setting to 1 thread only. To my surprise, this even fails when 1 thread only.
So, what can be the possible causes? How would I debug this?
Code:
import pyarrow.parquet as pq
def read_pq(file):
table = pq.read_table(file)
return table
##### this works #####
table = read_pq('hdfs://myns/mydata/000000_0')
###### this doesnt work #####
import multiprocessing
result_async=[]
with Pool(1) as pool:
result_async.append( pool.apply_async(pq.read_table, args = ('hdfs://myns/mydata/000000_0',)) )
results = [r.get() for r in result_async] ###### hangs here indefinitely, no exceptions raised
print(results) ###### expecting to get List[pq.Table]
#########################

Have you tried importing pq inside a user defined function so that any initilization required per process (needed by the library) can happen in each process in the pool?
def read_pq(file):
import pyarrow.parquet as pq
table = pq.read_table(file)
return table
###### this doesnt work #####
import multiprocessing
result_async=[]
with Pool(1) as pool:
result_async.append( pool.apply_async(read_pq, args = ('hdfs://myns/mydata/000000_0',)) )
results = [r.get() for r in result_async] ###### hangs here indefinitely, no exceptions raised
print(results) ###### expecting to get List[pq.Table]
#########################

Problem is due to my lack of experience with multiprocessing.
Solution is to add:
from multiprocessing import set_start_method
set_start_method("spawn")
The solution and the reason is exactly what
https://pythonspeed.com/articles/python-multiprocessing/
describes: The logging got forked and caused deadlock.
Furthermore, although I had only "Pool(1)", in fact, I had the parent process plus the child process, so I still had two process, so the deadlock problem existed.

Related

Python multiprocessing pool map - Memory issues in Databricks

I am running a python component in Databricks environment which creates a set of JSON messages and each JSON message is encoded with Avro schema. The encoding was taking longer time (8 minutes for encoding 10K messages which have complex JSON structure) and hence I tried to use multiprocessing with pool map function. The process seems to work fine for the first execution, however for subsequent runs, the performance is degrading and eventually failing with oom error. I am making sure that at the end of execution pool.close() and pool.join() are issued but not sure if it's really freeing up the memory. When I look at Databricks Ganglia UI, it shows that Swap memory and CPU utilization is increasing for each run. I also tried to reduce the no of pools (driver node has 8 cores, so tried with 6 and 4 pools) and also maxtasksperchild=1 but still doesn't help. I am wondering if I'm doing anything wrong. Following is the code which I'm using now. Wondering what is cuasing the issue here. Any pointers / suggestions are appreciated.
from multiprocessing import Pool
import multiprocessing
import json
from avro.io import *
import avro.schema
from avro_json_serializer import AvroJsonSerializer, AvroJsonDeserializer
import pyspark.sql.functions as F
def create_json_avro_encoding(row):
row_dict = row.asDict(True)
json_data = json.loads(avro_serializer.to_json(row_dict))
#print(f"JSON created { multiprocessing.current_process().name }")
return json_data
avro_schema = avro.schema.SchemaFromJSONData(avro_schema_dict, avro.schema.Names())
avro_serializer = AvroJsonSerializer(avro_schema)
records = df.collect()
pool_cnt = int(multiprocessing.cpu_count()*0.5)
print(f"No of records: {len(records)}")
print(f"starting timestamp {datetime.now().isoformat(sep=' ')}")
with Pool(pool_cnt, maxtasksperchild=1) as pool:
json_data_ret = pool.map(create_json_avro_encoding, records)
pool.close()
pool.join()
You shouldn't close the pool before joining. In fact, you shouldn't close the pool at all when using it in a with block, it will close automatically when exiting the with block.

Multiprocessing spawns idle processes and doesn't compute anything

There seems to be a litany of questions and answers on overflow about the multiprocessing library. I have looked through all the relevant ones I can find all and have not found one that directly speaks to my problem.
I am trying to apply the same function to multiple files in parallel. Whenever I start the processing though, the computer just spins up several instances of python and then does nothing. No computations happen at all and the processes just sit idle
I have looked at all of the similar questions on overflow, and none seem to have my problem of idle processes.
what am i doing wrong?
define the function (abbreviated for example. checked to make sure it works)
import pandas as pd
import numpy as np
import glob
import os
#from timeit import default_timer as timer
import talib
from multiprocessing import Process
def example_function(file):
df=pd.read_csv(file, header = 1)
stock_name = os.path.basename(file)[:-4]
macd, macdsignal, macdhist = talib.MACD(df.Close, fastperiod=12, slowperiod=26, signalperiod=9)
df['macd'] = macdhist*1000
print(f'stock{stock_name} processed')
final_macd_report.append(df)
getting a list of all the files in the directory i want to run the function on
import glob
path = r'C:\Users\josiahh\Desktop\big_test3/*'
files = [f for f in glob.glob(path, recursive=True)]
attempting multiprocessing
import multiprocessing as mp
if __name__ == '__main__':
p = mp.Pool(processes = 5)
async_result = p.map_async(example_function, files)
p.close()
p.join()
print("Complete")
any help would be greatly appreciated.
There's nothing wrong with the structure of the code, so something is going wrong that can't be guessed from what you posted. Start with something very much simpler, then move it in stages to what you're actually trying to do. You're importing mountains of extension (3rd party) code, and the problem could be anywhere. Here's a start:
def example_function(arg):
from time import sleep
msg = "crunching " + str(arg)
print(msg)
sleep(arg)
print("done " + msg)
if __name__ == '__main__':
import multiprocessing as mp
p = mp.Pool(processes = 5)
async_result = p.map_async(example_function, reversed(range(15)))
print("result", async_result.get())
p.close()
p.join()
print("Complete")
That works fine on Win10 under 64-bit Python 3.7.4 for me. Does it for you?
Note especially the async_result.get() at the end. That displays a list with 15 None values. You never do anything with your async_result. Because of that, if any exception was raised in a worker process, it will most likely silently vanish. In such cases .get()'ing the result will (re)raise the exception in your main program.
Also please verify that your files list isn't in fact empty. We can't guess at that from here either ;-)
EDIT
I moved the async_result.get() into its own line, right after the map_async(), to maximize the chance of revealing otherwise silent exception in the worker processes. At least add that much to your code too.
While I don't see anything wrong per se, I would like to suggest some changes.
In general, worker functions in a Pool are expected to return something. This return value is transferred back to the parent process. I like to use that as a status report. It is also a good idea to catch exceptions in the worker process, just in case.
For example:
def example_function(file):
status = 'OK'
try:
df=pd.read_csv(file, header = 1)
stock_name = os.path.basename(file)[:-4]
macd, macdsignal, macdhist = talib.MACD(df.Close, fastperiod=12, slowperiod=26, signalperiod=9)
df['macd'] = macdhist*1000
final_macd_report.append(df)
except:
status = 'exception caught!'
return {'filename': file, 'result': status}
(This is just a quick example. You might want to e.g. report the full exception traceback to help with debugging.)
If workers run for a long time, I like to get feedback ASAP.
So I prefer to use imap_unordered, especially if some tasks can take much longer than others. This returns an iterator that yields results in the order that jobs finish.
if __name__ == '__main__':
with mp.Pool() as p:
for res in p.imap_unordered(example_function, files):
print(res)
This way you get unambiguous proof that a worker finished, and what the result was and if any problems occurred.
This is preferable over just calling print from the workers. With stdout buffering and multiple workers inheriting the same output stream there is no saying when you actually see something.
Edit: As you can see here, multiprocessing.Pool does not work well with interactive interpreters, especially on ms-windows. Basically, ms-windows lacks the fork system call that lets UNIX-like systems duplicate a process. So on ms-windows, multiprocessing has to do a try and mimic fork which means importing the original program file in the child processes. That doesn't work well with interactive interpreters like IPython. One would probably have to dig deep into the internals of Jupyter and multiprocessing to find out the exact cause of the problem.
It seems that a workaround for this problem is to define the worker function in a separate module and import that in your code in IPython.
It is actually mentioned in the documentation that multiprocessing.Pool doesn't work well with interactive interpreters. See the note at the end of this section.

How to fix BrokenProcessPool: error for concurrent.futures ProcessPoolExecutor

Using concurrent.futures.ProcessPoolExecutor I am trying to run the first piece of code to execute the function "Calculate_Forex_Data_Derivatives(data,gride_spacing)" in parallel. When calling the results, executor_list[i].result(), I get "BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending." I have tried running the code sending multiple calls of the function to the processing pool as well as running the code only sending one call to the processing pool, both resulting in the error.
I have also tested the structure of the code with a simpler piece of code (2nd code provided) with the same types of input for the call function and it works fine. The only thing different that I can see between the two pieces of code is the first code calls the function "FinDiff(axis,grid_spacing,derivative_order)" from the 'findiff' module. This function along with the "Calculate_Forex_Data_Derivatives(data,gride_spacing)" work perfectly on there own when running normally in series.
I am using Anaconda environment, Spyder editor, and Windows.
Any help would be appreciated.
#code that returns "BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending."
import pandas as pd
import numpy as np
from findiff import FinDiff
import multiprocessing
import concurrent.futures
def Calculate_Forex_Data_Derivatives(forex_data,dt): #function to run in parallel
try:
dClose_dt = FinDiff(0,dt,1)(forex_data)[-1]
except IndexError:
dClose_dt = np.nan
try:
d2Close_dt2 = FinDiff(0,dt,2)(forex_data)[-1]
except IndexError:
d2Close_dt2 = np.nan
try:
d3Close_dt3 = FinDiff(0,dt,3)(forex_data)[-1]
except IndexError:
d3Close_dt3 = np.nan
return dClose_dt, d2Close_dt2, d3Close_dt3
#input for function
#forex_data is pandas dataframe, forex_data['Close'].values is numpy array
#dt is numpy array
#input_1 and input_2 are each a list of numpy arrays
input_1 = []
input_2 = []
for forex_data_index,data_point in enumerate(forex_data['Close'].values[:1]):
input_1.append(forex_data['Close'].values[:forex_data_index+1])
input_2.append(dt[:forex_data_index+1])
def multi_processing():
executors_list = []
with concurrent.futures.ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
for index in range(len(input_1)):
executors_list.append(executor.submit(Calculate_Forex_Data_Derivatives,input_1[index],input_2[index]))
return executors_list
if __name__ == '__main__':
print('calculating derivatives')
executors_list = multi_processing()
for output in executors_list
print(output.result()) #returns "BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending."
##############################################################
#simple example that runs fine
def function(x,y): #function to run in parallel
try:
asdf
except NameError:
a = (x*y)[0]
b = (x+y)[0]
return a,b
x=[np.array([0,1,2]),np.array([3,4,5])] #function inputs, list of numpy arrays
y=[np.array([6,7,8]),np.array([9,10,11])]
def multi_processing():
executors_list = []
with concurrent.futures.ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
for index,_ in enumerate(x):
executors_list.append(executor.submit(function,x[index],y[index]))
return executors_list
if __name__ == '__main__':
executors_list = multi_processing()
for output in executors_list: #prints as expected
print(output.result()) #(0, 6)
#(27, 12)
I know three typical ways to break the Pipe of a ProcessPoolExecutor:
OS kill/termination
Your system runs into limits, most likely memory, and starts killing processes. As a fork on windows clones your memory content, this is not unlikely when working with large DataFrames.
How to identify
Check memory consumption in your task manager.
Unless your DataFrames occupy half of your memory, it should disappear with max_workers=1, this is not unambiguous however.
Self-Termination of the Worker
The Python instance of the subprocess terminates due to some error that does not raise a proper Exception. One example would be a segfault in an imported C-module.
How to identify
As your code runs properly without the PPE, the only scenario I can think of is if some module is not multiprocessing-safe. It then also has a chance to disappear with max_workers=1. It might also be possible to induce the Error in the main process by calling the function manually right after the workers are created (the line after the for-loop that calls executor.submit.
Otherwise it could be really hard to identify, but in my opinion it is the most unlikely case.
Exception in PPE Code
The subprocess side of the pipe (i.e. code handling the communication) may crash, which results in a proper Exception, that unfortunately can not be communicated to the master process.
How to identify
As the code is (hopefully) well tested, the prime suspect lies in the return data. It must be pickled and sent back via socket - both steps can crash. So you have to check:
is the return data picklable?
is the pickled object small enough to be sent (about 2GB)?
So you can either try to return some simple dummy-data instead, or check the two conditions explicitely:
if len(pickle.dumps((dClose_dt, d2Close_dt2, d3Close_dt3))) > 2 * 10 ** 9:
raise RuntimeError('return data can not be sent!')
In Python 3.7, this problem is fixed, and it sends back the Exception.
I found this in the official documents:
"The main module must be importable by worker subprocesses. This means that ProcessPoolExecutor will not work in the interactive interpreter. Calling Executor or Future methods from a callable submitted to a ProcessPoolExecutor will result in deadlock."
Have you ever tried this? The following works for me:
if __name__ == '__main__':
executors_list = multi_processing()
for output in executors_list:
print(output.result())

Python - Multiprocessing pool.map(): Processes don't do anything

I am trying to use the python multiprocessing library in order to parallize a task I am working on:
import multiprocessing as MP
def myFunction((x,y,z)):
...create a sqlite3 database specific to x,y,z
...write to the database (one DB per process)
y = 'somestring'
z = <large read-only global dictionary to be shared>
jobs = []
for x in X:
jobs.append((x,y,z,))
pool = MP.Pool(processes=16)
pool.map(myFunction,jobs)
pool.close()
pool.join()
Sixteen processes are started as seen in htop, however no errors are returned, no files written, no CPU is used.
Could it happen that there is an error in myFunction that is not reported to STDOUT and blocks execution?
Perhaps it is relevant that the python script is called from a bash script running in background.
The lesson learned here was to follow the strategy suggested in one of the comments and use multiprocessing.dummy until everything works.
At least in my case, errors were not visible otherwise and the processes were still running as if nothing had happened.

Python Multiprocessing using Pool goes recursively haywire

I'm trying to make an expensive part of my pandas calculations parallel to speed up things.
I've already managed to make Multiprocessing.Pool work with a simple example:
import multiprocessing as mpr
import numpy as np
def Test(l):
for i in range(len(l)):
l[i] = i**2
return l
t = list(np.arange(100))
L = [t,t,t,t]
if __name__ == "__main__":
pool = mpr.Pool(processes=4)
E = pool.map(Test,L)
pool.close()
pool.join()
No problems here. Now my own algorithm is a bit more complicated, I can't post it here in its full glory and terribleness, so I'll use some pseudo-code to outline the things I'm doing there:
import pandas as pd
import time
import datetime as dt
import multiprocessing as mpr
import MPFunctions as mpf --> self-written worker functions that get called for the multiprocessing
import ClassGetDataFrames as gd --> self-written class that reads in all the data and puts it into dataframes
=== Settings
=== Use ClassGetDataFrames to get data
=== Lots of single-thread calculations and manipulations on the dataframe
=== Cut dataframe into 4 evenly big chunks, make list of them called DDC
if __name__ == "__main__":
pool = mpr.Pool(processes=4)
LLT = pool.map(mpf.processChunks,DDC)
pool.close()
pool.join()
=== Join processed Chunks LLT back into one dataframe
=== More calculations and manipulations
=== Data Output
When I'm running this script the following happens:
It reads in the data.
It does all calculations and manipulations until the Pool statement.
Suddenly it reads in the data again, fourfold.
Then it goes into the main script fourfold at the same time.
The whole thing cascades recursively and goes haywire.
I have read before that this can happen if you're not careful, but I do not know why it does happen here. My multiprocessing code is protected by the needed name-main-statement (I'm on Win7 64), it is only 4 lines long, it has close and join statements, it calls one defined worker function which then calls a second worker function in a loop, that's it. By all I know it should just create the pool with four processes, call the four processes from the imported script, close the pool and wait until everything is done, then just continue with the script. On a sidenote, I first had the worker functions in the same script, the behaviour was the same. Instead of just doing what's in the pool it seems to restart the whole script fourfold.
Can anyone enlighten me what might cause this behaviour? I seem to be missing some crucial understanding about Python's multiprocessing behaviour.
Also I don't know if it's important, I'm on a virtual machine that sits on my company's mainframe.
Do I have to use individual processes instead of a pool?
I managed to make it work by enceasing the entire script into the if __name__ == "__main__":-statement, not just the multiprocessing part.

Categories

Resources