Python Multiprocessing for DataFrame Operations/Functions - python

I am processing hundreds of thousands of rows of text data using pandas DataFrames. Every so often (fewer than 5 per 100,000 rows) a row raises an error, and I have chosen to drop those rows. The error-handling function is as follows:
def unicodeHandle(datai):
    for i, row in enumerate(datai['LDTEXT']):
        print(i)
        #print(text)
        try:
            text = row.read()
            text.strip().split('[\W_]+')
            print(text)
        except UnicodeDecodeError as e:
            datai.drop(i, inplace=True)
            print('Error at index {}: {!r}'.format(i, row))
            print(e)
    return datai
The function works fine, and I have been using it for a few weeks.
The problem is that I never know when the error will occur, as the data comes from a DB that is constantly being added to (and I may pull different data). The point is that I must iterate through every row and run my error-test function unicodeHandle in order to initialize my data. This process takes about 5 minutes, which gets a little annoying. I am trying to implement multiprocessing to speed up the loop. Based on the web and various tutorials, I have come up with:
def unicodeMP(datai):
    chunks = [datai[i::8] for i in range(8)]
    pool = mp.Pool(processes=8)
    results = pool.apply_async(unicodeHandle, chunks)
    while not results.ready():
        print("One Sec")
    return results.get()

if __name__ == "__main__":
    fast = unicodeMP(datai)
When I run the multiprocessing version, it takes the same amount of time as the regular one, even though my CPU shows it running at a much higher utilization. In addition, the code raises the error as a normal error instead of returning my cleaned DataFrame. What am I missing here?
How can I use multiprocessing for functions on DataFrames?

You can try dask for multiprocessing a dataframe
import dask.dataframe as dd

partitions = 7  # cpu_cores - 1
ddf = dd.from_pandas(df, npartitions=partitions)
# unicodeHandle already takes a whole DataFrame, so it can be mapped over each partition
ddf.map_partitions(unicodeHandle).compute(scheduler='processes')
You can read more about dask in its documentation.
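If you want to stay with the standard multiprocessing module instead, a minimal sketch of the usual pattern (split the DataFrame, pool.map the cleaning function over the chunks, concatenate the results) could look like the following. The unicodeParallel name and the chunk count are illustrative, and it assumes unicodeHandle is defined at module level, the rows are picklable, and unicodeHandle's drop-by-index logic is adjusted to work on a chunk:

import multiprocessing as mp
import pandas as pd

def unicodeParallel(datai, n_workers=8):
    # reuse the question's chunking: every n-th row goes to the same chunk
    chunks = [datai[i::n_workers] for i in range(n_workers)]
    with mp.Pool(processes=n_workers) as pool:
        # map (not apply_async) distributes one chunk per task and
        # blocks until every chunk has been cleaned
        cleaned = pool.map(unicodeHandle, chunks)
    # stitch the cleaned chunks back into a single DataFrame
    return pd.concat(cleaned).sort_index()

if __name__ == "__main__":
    fast = unicodeParallel(datai)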

Related

Multiprocessing is not executing parallel in Python

I have edited the code, and currently it is working fine, but I think it is not executing in parallel or dynamically. Can anyone please check it?
Code:
from multiprocessing import Pool, freeze_support
from functools import partial
import time

def folderStatistic(t):
    j, dir_name = t
    row = []
    for content in dir_name.split(","):
        row.append(content)
    print(row)

def get_directories():
    import csv
    with open('CONFIG.csv', 'r') as file:
        reader = csv.reader(file, delimiter='\t')
        return [col for row in reader for col in row]

def folderstatsMain():
    freeze_support()
    start = time.time()
    pool = Pool()
    worker = partial(folderStatistic)
    pool.map(worker, enumerate(get_directories()))

def datatobechecked():
    try:
        folderstatsMain()
    except Exception as e:
        # pass
        print(e)

if __name__ == '__main__':
    datatobechecked()
Config.CSV
C:\USERS, .CSV
C:\WINDOWS , .PDF
etc.
There may be around 200 folder paths in config.csv
Welcome to StackOverflow and the Python programming world!
Moving on to the question.
Inside the get_directories() function you open the file in a with context, get the reader object, and close the file the moment you leave the context, so when the time comes to use the reader object the file is already closed.
I don't want to discourage you, but if you are very new to programming, do not dive into parallel programming yet. The difficulty of handling multiple threads simultaneously grows exponentially with every thread you add (though pools greatly simplify this process). Processes are even worse, as they don't share memory and can't communicate with each other easily.
My advice is: try to write it as a single-threaded program first. If you have it working and still need to parallelize it, isolate a single function that does all the work and takes an input file path as a parameter, and then use a thread/process pool on that function.
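For example, a minimal sketch of that structure; the names process_one_path and read_paths are illustrative and not taken from the question:

import csv
from multiprocessing import Pool

def process_one_path(path):
    # all of the per-directory work lives here, single-threaded and testable
    return path.strip()

def read_paths(config_file='CONFIG.csv'):
    with open(config_file, 'r', newline='') as f:
        # materialize the rows while the file is still open
        return [col.strip() for row in csv.reader(f, delimiter='\t') for col in row]

if __name__ == '__main__':
    paths = read_paths()
    with Pool() as pool:
        results = pool.map(process_one_path, paths)
    print(results)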
EDIT:
From what I can understand from your code, you get directory names from the CSV file and then, for each "cell" in the file, you run folderStatistic in parallel. This part seems correct. The problem may lie in dir_name.split(","); notice that you pass individual "cells" to folderStatistic, not rows. What makes you think it's not running in parallel?
There is a certain amount of overhead in creating a multiprocessing pool because creating processes is, unlike creating threads, a fairly costly operation. Then those submitted tasks, represented by each element of the iterable being passed to the map method, are gathered up in "chunks" and written to a multiprocessing queue of tasks that are read by the pool processes. This data has to move from one address space to another and that has a cost associated with it. Finally when your worker function, folderStatistic, returns its result (which is None in this case), that data has to be moved from one process's address space back to the main process's address space and that too has a cost associated with it.
All of those added costs become worthwhile when your worker function is sufficiently CPU-intensive that these additional costs are small compared to the savings gained by having the tasks run in parallel. But your worker function's CPU requirements are too small to reap any benefit from multiprocessing.
Here is a demo comparing the single-processing time vs. the multiprocessing time for invoking a worker function, fn, twice: the first time it only performs its internal loop 10 times (low CPU requirements), while the second time it performs its internal loop 1,000,000 times (higher CPU requirements). You can see that in the first case the multiprocessing version runs considerably slower (you can't even measure the time for the single-processing run). But when we make fn more CPU-intensive, multiprocessing achieves gains over the single-processing case.
from multiprocessing import Pool
from functools import partial
import time

def fn(iterations, x):
    the_sum = x
    for _ in range(iterations):
        the_sum += x
    return the_sum

# required for Windows:
if __name__ == '__main__':
    for n_iterations in (10, 1_000_000):
        # single processing time:
        t1 = time.time()
        for x in range(1, 20):
            fn(n_iterations, x)
        t2 = time.time()
        # multiprocessing time:
        worker = partial(fn, n_iterations)
        t3 = time.time()
        with Pool() as p:
            results = p.map(worker, range(1, 20))
        t4 = time.time()
        print(f'#iterations = {n_iterations}, single processing time = {t2 - t1}, multiprocessing time = {t4 - t3}')
Prints:
#iterations = 10, single processing time = 0.0, multiprocessing time = 0.35399389266967773
#iterations = 1000000, single processing time = 1.182999849319458, multiprocessing time = 0.5530076026916504
But even with a pool size of 8, the running time is not reduced by a factor of 8 (it's more like a factor of 2) due to the fixed multiprocessing overhead. When I change the number of iterations for the second case to be 100,000,000 (even more CPU-intensive), we get ...
#iterations = 100000000, single processing time = 109.3077495098114, multiprocessing time = 27.202054023742676
... which is a reduction in running time by a factor of 4 (I have many other processes running on my computer, so there is competition for the CPU).
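As an aside (not part of the original answer), one way to amortize the per-task queue traffic described above is the chunksize argument of Pool.map, which batches several tasks into a single queue message. A minimal, self-contained sketch with a hypothetical cheap worker:

from multiprocessing import Pool

def fn(x):
    # stand-in for a cheap per-task computation
    return x * x

if __name__ == '__main__':
    with Pool() as p:
        # chunksize batches several tasks into one queue message,
        # reducing the per-task IPC overhead described above
        results = p.map(fn, range(10_000), chunksize=256)
        print(sum(results))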

How to fix BrokenProcessPool: error for concurrent.futures ProcessPoolExecutor

Using concurrent.futures.ProcessPoolExecutor, I am trying to run the first piece of code to execute the function Calculate_Forex_Data_Derivatives(data, grid_spacing) in parallel. When calling the results, executor_list[i].result(), I get "BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending." I have tried running the code sending multiple calls of the function to the processing pool as well as sending only a single call, both resulting in the error.
I have also tested the structure of the code with a simpler piece of code (the second code provided) with the same types of input for the call function, and it works fine. The only difference I can see between the two pieces of code is that the first one calls the function FinDiff(axis, grid_spacing, derivative_order) from the findiff module. This function, along with Calculate_Forex_Data_Derivatives(data, grid_spacing), works perfectly on its own when run normally in series.
I am using Anaconda environment, Spyder editor, and Windows.
Any help would be appreciated.
#code that returns "BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending."
import pandas as pd
import numpy as np
from findiff import FinDiff
import multiprocessing
import concurrent.futures

def Calculate_Forex_Data_Derivatives(forex_data, dt):  #function to run in parallel
    try:
        dClose_dt = FinDiff(0, dt, 1)(forex_data)[-1]
    except IndexError:
        dClose_dt = np.nan
    try:
        d2Close_dt2 = FinDiff(0, dt, 2)(forex_data)[-1]
    except IndexError:
        d2Close_dt2 = np.nan
    try:
        d3Close_dt3 = FinDiff(0, dt, 3)(forex_data)[-1]
    except IndexError:
        d3Close_dt3 = np.nan
    return dClose_dt, d2Close_dt2, d3Close_dt3

#input for function
#forex_data is a pandas dataframe, forex_data['Close'].values is a numpy array
#dt is a numpy array
#input_1 and input_2 are each a list of numpy arrays
input_1 = []
input_2 = []
for forex_data_index, data_point in enumerate(forex_data['Close'].values[:1]):
    input_1.append(forex_data['Close'].values[:forex_data_index + 1])
    input_2.append(dt[:forex_data_index + 1])

def multi_processing():
    executors_list = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
        for index in range(len(input_1)):
            executors_list.append(executor.submit(Calculate_Forex_Data_Derivatives, input_1[index], input_2[index]))
    return executors_list

if __name__ == '__main__':
    print('calculating derivatives')
    executors_list = multi_processing()
    for output in executors_list:
        print(output.result())  #returns "BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending."
##############################################################
#simple example that runs fine
def function(x, y):  #function to run in parallel
    try:
        asdf
    except NameError:
        a = (x*y)[0]
        b = (x+y)[0]
    return a, b

x = [np.array([0,1,2]), np.array([3,4,5])]  #function inputs, list of numpy arrays
y = [np.array([6,7,8]), np.array([9,10,11])]

def multi_processing():
    executors_list = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
        for index, _ in enumerate(x):
            executors_list.append(executor.submit(function, x[index], y[index]))
    return executors_list

if __name__ == '__main__':
    executors_list = multi_processing()
    for output in executors_list:  #prints as expected
        print(output.result())     #(0, 6)
                                   #(27, 12)
I know three typical ways to break the Pipe of a ProcessPoolExecutor:
OS kill/termination
Your system runs into limits, most likely memory, and starts killing processes. As a fork on Windows clones your memory content, this is not unlikely when working with large DataFrames.
How to identify
Check memory consumption in your task manager.
Unless your DataFrames occupy half of your memory, the problem should disappear with max_workers=1; this is not unambiguous, however.
Self-Termination of the Worker
The Python instance of the subprocess terminates due to some error that does not raise a proper Exception. One example would be a segfault in an imported C-module.
How to identify
As your code runs properly without the PPE, the only scenario I can think of is that some module is not multiprocessing-safe. It then also has a chance of disappearing with max_workers=1. It might also be possible to induce the error in the main process by calling the function manually right after the workers are created (i.e. the line after the for-loop that calls executor.submit).
Otherwise it could be really hard to identify, but in my opinion it is the most unlikely case.
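For example, a minimal sketch of that debugging idea, reusing the names from the question (multi_processing_debug is an illustrative name):

# debug sketch: reuses Calculate_Forex_Data_Derivatives, input_1, input_2 from the question
def multi_processing_debug():
    executors_list = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
        for index in range(len(input_1)):
            executors_list.append(executor.submit(Calculate_Forex_Data_Derivatives, input_1[index], input_2[index]))
        # calling the worker once in the parent process surfaces any crash
        # as a normal traceback instead of a BrokenProcessPool error
        Calculate_Forex_Data_Derivatives(input_1[0], input_2[0])
    return executors_list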
Exception in PPE Code
The subprocess side of the pipe (i.e. the code handling the communication) may crash, which results in a proper Exception that unfortunately cannot be communicated to the master process.
How to identify
As the code is (hopefully) well tested, the prime suspect lies in the return data. It must be pickled and sent back via socket - both steps can crash. So you have to check:
is the return data picklable?
is the pickled object small enough to be sent (about 2GB)?
So you can either try to return some simple dummy data instead, or check the two conditions explicitly:
import pickle

if len(pickle.dumps((dClose_dt, d2Close_dt2, d3Close_dt3))) > 2 * 10 ** 9:
    raise RuntimeError('return data can not be sent!')
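The first check (picklability) can be sketched in the same place inside the worker function; the try/except wrapper here is an illustration, not part of the original answer:

import pickle

try:
    # if this raises, the return value cannot cross the process boundary
    payload = pickle.dumps((dClose_dt, d2Close_dt2, d3Close_dt3))
except (pickle.PicklingError, TypeError, AttributeError) as exc:
    raise RuntimeError('return data is not picklable!') from exc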
In Python 3.7, this problem is fixed, and it sends back the Exception.
I found this in the official documents:
"The main module must be importable by worker subprocesses. This means that ProcessPoolExecutor will not work in the interactive interpreter. Calling Executor or Future methods from a callable submitted to a ProcessPoolExecutor will result in deadlock."
Have you ever tried this? The following works for me:
if __name__ == '__main__':
    executors_list = multi_processing()
    for output in executors_list:
        print(output.result())

python multiprocessing subprocess - high VIRT usage leads to memory error

I am using pool.map in multiprocessing to run my custom function:
def my_func(data):  #This is just a dummy function.
    data = data.assign(new_col = data.apply(lambda x: f(x), axis = 1))
    return data

def main():
    mypool = pool.Pool(processes=16, maxtasksperchild=100)
    ret_list = mypool.map(my_func, (group for name, group in gpd))
    mypool.close()
    mypool.join()
    result = pd.concat(ret_list, axis=0)
Here gpd is a grouped DataFrame, so I am passing one DataFrame at a time to pool.map. I keep getting a memory error here.
As I can see, VIRT increases many-fold and leads to this error.
Two questions,
How do I solve this growing memory issue with VIRT? Maybe there is a way to play with the chunk size here?
Second, though it launches as many Python subprocesses as I specified in Pool(processes=...), I can see that not all CPUs hit 100%; it seems it doesn't use all the processes, and only one or two run at a time. Maybe this is because it applies the same chunk size to the different DataFrame sizes I pass each time (some DataFrames will be small)? How do I utilise every CPU here?
Just for anyone looking for an answer in the future: I solved this by using imap instead of map, because map converts the iterable into a list and holds all results in memory at once, which is memory-intensive.
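A minimal sketch of that change, assuming the same my_func and grouped DataFrame gpd from the question (run_imap and the chunksize value are illustrative):

from multiprocessing import Pool
import pandas as pd

def run_imap(gpd, my_func):
    with Pool(processes=16, maxtasksperchild=100) as mypool:
        # imap yields results lazily, so groups are fed to the workers
        # and results are collected a few at a time instead of all at once
        ret_iter = mypool.imap(my_func, (group for name, group in gpd), chunksize=10)
        return pd.concat(list(ret_iter), axis=0)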

Python: Reading from Queue slows the ability to write to Queue?

I encountered a very puzzling issue while working with Python's multiprocessing module.
The setup is pretty typical. My machine has 32 cores and 244 GB of RAM (thank you AWS). One process to write to an ingestion queue. N processes to do the work I need done, process_data(). M processes to do some preaggregation, preaggregate_results(). One process to do the final aggregation and write the output.
If N is 'large' and M is only 1 or 2, then process_data() is very fast. It basically keeps up with the ingestion process. But since M is very small, the preaggregation is relatively slow and the intermediate_results queue bloats.
Here is the heart of the issue. Every increase in M results in a MARKED decrease in process_data()'s ability to write to the intermediate_results queue. In fact, if N==M==12, the process is so slow that it's not even reasonable to wait for the job to finish. process_data() goes from pacing with the ingestion queue to getting left in the dust.
I included some skeleton code below that just outlines the work flow I'm talking about. It's not to be taken literally. I'm curious if anyone else has encountered this issue before and knows how to solve it. I've talked to many of my coworkers (including code review) and they are just as stumped as I am.
I use multiprocessing all the time with success. This is the first time I've encountered this issue. Any thoughts would be greatly appreciated.
from multiprocessing import Process, Queue
import pandas as pd
import csv

KILL_TOKEN = 'STOP'
NUM_PROCESS_DATA = 14
NUM_PROCESS_PREAGGREGATE = 1

def ingest_data(ingestion_queue):
    ... pandas data munging
    for blah in univariate_data.itertuples():
        ... write to ingestion_queue

def process_data(ingestion_queue, intermediate_results):
    while True:
        data = ingestion_queue.get()
        if data == KILL_TOKEN:
            break
        ... process data
        ... write to intermediate_results

def preaggregate_results(intermediate_results, output_queue):
    while True:
        data = intermediate_results.get()
        if data == KILL_TOKEN:
            break
        ... preaggregation
        ... write to output_queue after kill token is received

def process_output(output_queue):
    while True:
        data = output_queue.get()
        if data == KILL_TOKEN:
            break
        ... final aggregation
        ... write results

if __name__ == '__main__':
    ... the usual

ThreadPoolExecutor - how to return arguments

I need to parse around 1000 URLs. So far I have a function that returns a pandas DataFrame after parsing the URL. How should I best structure the program so I can add all the DataFrames together? I'm also unsure how to return arguments into 'futures'. In the example below, how can I eventually merge all temp DataFrames into a single DataFrame (i.e. finalDF = finalDF.append(temp))?
import concurrent.futures
import pandas as pd

def Parser(ptf):
    temp = pd.DataFrame()
    URL = "http://" + str(URL)
    #..some complex operations, including a requests.get(URL), which eventually returns temp: a pandas dataframe
    return temp  #returns a pandas dataframe

def conc_caller(ptf):
    temp = Parser(ptf)
    #this won't work because finalDF is not defined, unclear how to handle this
    finalDF = finalDF.append(temp)
    return df

booklist = ['a', 'b', 'c']
finalDF = pd.DataFrame()
executor = concurrent.futures.ProcessPoolExecutor(3)
futures = [executor.submit(conc_caller, item) for item in booklist]
concurrent.futures.wait(futures)
Another problem is that I get the error message:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
Any suggestions on how to fix the code are appreciated.
You have to protect your launch code (the lines that create the executor and submit the work, just before concurrent.futures.wait(futures)) with if __name__ == '__main__': to prevent creating processes forever.
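A minimal sketch of how the guarded launch code could also collect the parsed DataFrames; the result-gathering part and the placeholder Parser body are assumptions, not from the original answer:

import concurrent.futures
import pandas as pd

def Parser(ptf):
    # placeholder: the real function fetches the URL and builds a DataFrame
    return pd.DataFrame({'book': [ptf]})

if __name__ == '__main__':
    booklist = ['a', 'b', 'c']
    with concurrent.futures.ProcessPoolExecutor(3) as executor:
        futures = [executor.submit(Parser, item) for item in booklist]
        concurrent.futures.wait(futures)
    # merge the per-URL DataFrames returned by the workers
    finalDF = pd.concat(f.result() for f in futures)
    print(finalDF)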
